Search and classify topics in a corpus of text using the latent dirichlet allocation model

Bibliographic Details
Authors: Pucuhuayla Revatta, Félix Rogelio; Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael
Format: article
Publication date: 2023
Institution: Universidad Tecnológica del Perú
Repository: UTP-Institucional
Language: English
OAI Identifier: oai:repositorio.utp.edu.pe:20.500.12867/6686
Resource link: https://hdl.handle.net/20.500.12867/6686
https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
Access level: open access
Subject: Latent Dirichlet allocation
Topic modeling
Mathematical statistics
https://purl.org/pe-repo/ocde/ford#1.01.03
Description
Summary: This work aims to discover topics in a text corpus and to rank the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the subjects. After processing, 12 topics were generated, which allowed ranking the most relevant terms to identify the skills of each of the candidates. This work concludes that candidates interested in data science must, first of all, have technical skills: mastery of Structured Query Language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.
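The core step in the summary, fitting an LDA model and ranking the most relevant terms per topic, can be illustrated with a minimal sketch. It assumes a collapsed Gibbs sampler, since this record does not say which inference method or library the authors used. The function names `fit_lda` and `top_terms`, the hyperparameters `alpha`, `beta`, and `n_iter`, and the toy corpus are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    alpha and beta are symmetric Dirichlet priors (illustrative values).
    Returns the vocabulary plus document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_terms(vocab, topic_word, n=3):
    """Return the n highest-count terms for each topic."""
    ranked = []
    for row in topic_word:
        order = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        ranked.append([vocab[i] for i in order[:n]])
    return ranked


# Made-up toy corpus standing in for tokenized documents.
docs = [
    ["python", "sql", "pandas", "python", "sql"],
    ["sql", "database", "python", "etl"],
    ["statistics", "regression", "r", "statistics"],
    ["r", "regression", "statistics", "modeling"],
]
vocab, doc_topic, topic_word = fit_lda(docs, n_topics=2)
for k, terms in enumerate(top_terms(vocab, topic_word)):
    print(f"topic {k}: {terms}")
```

On a real corpus a maintained library implementation would usually be preferable; the sampler is written out here only to make the counting behind the topic-term ranking visible.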
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is held. CONCYTEC is not responsible for the contents (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).