Search and classify topics in a corpus of text using the latent dirichlet allocation model

Article Description

This work aims at discovering topics in a text corpus and classifying the most relevant terms for each of the discovered topics. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and...
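The topic-discovery step named in the title can be illustrated with a minimal sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling. This is not the authors' implementation: the record does not say which library, inference method, or hyperparameters they used. The function names (`fit_lda`, `top_terms`, `topic_shares`), the values of `alpha`, `beta`, and `iterations`, and the four toy documents are all hypothetical, chosen only so the example runs with the Python standard library.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens of word v in topic k
    topic_total = [0] * num_topics                               # tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                old = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1

    return vocab, doc_topic, topic_word


def top_terms(vocab, topic_word, n=3):
    """Rank the n most frequent terms assigned to each topic."""
    ranked = []
    for counts in topic_word:
        order = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        ranked.append([vocab[v] for v in order[:n]])
    return ranked


def topic_shares(doc_topic, alpha=0.1):
    """Smoothed per-document topic proportions; each row sums to 1."""
    shares = []
    for counts in doc_topic:
        total = sum(counts) + alpha * len(counts)
        shares.append([(c + alpha) / total for c in counts])
    return shares


# Hypothetical toy corpus standing in for the tokenised documents.
docs = [
    ["python", "sql", "pandas", "sql"],
    ["sql", "python", "database", "pandas"],
    ["statistics", "regression", "r", "statistics"],
    ["r", "regression", "statistics", "probability"],
]

vocab, doc_topic, topic_word = fit_lda(docs, num_topics=2)
terms = top_terms(vocab, topic_word)
shares = topic_shares(doc_topic)

for k, words in enumerate(terms):
    print(f"topic {k}: {', '.join(words)}")
```

The sketch uses 2 topics only because the toy corpus is tiny; `num_topics` is the parameter that would be raised for a larger corpus. In practice an established library would likely be preferable to a hand-written sampler.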

Bibliographic Details
Authors: Pucuhuayla Revatta, Félix Rogelio; Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael
Format: article
Publication Date: 2023
Institution: Universidad Tecnológica del Perú
Repository: UTP-Institucional
Language: English
OAI Identifier: oai:repositorio.utp.edu.pe:20.500.12867/6686
Resource link: https://hdl.handle.net/20.500.12867/6686
https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
Access level: open access
Subject: Latent dirichlet allocation
Topic modeling
Mathematical statistics
https://purl.org/pe-repo/ocde/ford#1.01.03
description This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the subjects. After processing, 12 themes were generated, which allowed ranking the most relevant terms to identify the skills of each candidate. This work concludes that candidates interested in data science must have skills in the following areas: first, they must be technically proficient, with mastery of structured query language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.
publishDate 2023
dc.date.accessioned.none.fl_str_mv 2023-03-03T15:59:03Z
dc.date.available.none.fl_str_mv 2023-03-03T15:59:03Z
dc.date.issued.fl_str_mv 2023
dc.type.es_PE.fl_str_mv info:eu-repo/semantics/article
dc.type.version.es_PE.fl_str_mv info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.issn.none.fl_str_mv 2502-4760
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/20.500.12867/6686
dc.identifier.journal.es_PE.fl_str_mv Indonesian Journal of Electrical Engineering and Computer Science
dc.identifier.doi.none.fl_str_mv https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
dc.language.iso.es_PE.fl_str_mv eng
language eng
dc.relation.ispartofseries.none.fl_str_mv Indonesian Journal of Electrical Engineering and Computer Science;vol.30, n°1, pp. 246~256
dc.rights.es_PE.fl_str_mv info:eu-repo/semantics/openAccess
dc.rights.uri.es_PE.fl_str_mv http://creativecommons.org/licenses/by-sa/4.0/
dc.format.es_PE.fl_str_mv application/pdf
dc.publisher.es_PE.fl_str_mv Institute of Advanced Engineering and Science
dc.publisher.country.es_PE.fl_str_mv ID
bitstream.url.fl_str_mv http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/3/F.Pucuhuayla_Articulo.pdf.txt
http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/4/F.Pucuhuayla_Articulo.pdf.jpg
http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/1/F.Pucuhuayla_Articulo.pdf
http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/2/license.txt
repository.name.fl_str_mv Repositorio Institucional de la Universidad Tecnológica del Perú
repository.mail.fl_str_mv repositorio@utp.edu.pe
_version_ 1817984916861747200
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the National Digital Repository of Science, Technology and Innovation of Open Access (ALICIA).