Search and classify topics in a corpus of text using the latent dirichlet allocation model

Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Pucuhuayla-Revatta, Félix; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael

Article Description

This work aims at discovering topics in a text corpus and classifying the most relevant terms for each of the discovered topics. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and...
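The idea of discovering topics and ranking the most relevant terms per topic can be illustrated with a minimal sketch. The record does not say which LDA implementation or inference method the authors used, so the code below swaps in a plain collapsed Gibbs sampler written with the Python standard library only. The function names (`fit_lda`, `top_terms`, `topic_mixture`, `label_unseen`), the hyperparameters `alpha` and `beta`, the iteration counts, and the toy corpus are all assumptions made for illustration. Model evaluation is not covered.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens assigned per topic
    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return {"doc_topic": doc_topic, "topic_word": topic_word,
            "topic_total": topic_total, "vocab_size": vocab_size,
            "alpha": alpha, "beta": beta}


def top_terms(model, n=3):
    """Return up to n of the most frequent terms assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in model["topic_word"]]


def topic_mixture(counts, alpha):
    """Turn per-topic counts into a smoothed probability distribution."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def label_unseen(model, doc, iters=50, seed=0):
    """Estimate a new document's topic mixture with topic-word counts held fixed."""
    rng = random.Random(seed)
    num_topics = len(model["topic_word"])
    alpha, beta = model["alpha"], model["beta"]
    assign = [rng.randrange(num_topics) for _ in doc]
    counts = [assign.count(t) for t in range(num_topics)]
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[assign[i]] -= 1
            weights = [
                (counts[t] + alpha)
                * (model["topic_word"][t][w] + beta)
                / (model["topic_total"][t] + model["vocab_size"] * beta)
                for t in range(num_topics)
            ]
            assign[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[assign[i]] += 1
    return topic_mixture(counts, alpha)


# Made-up toy corpus for illustration only (not the paper's data).
docs = [
    ["python", "sql", "pandas", "python"],
    ["sql", "database", "etl", "sql"],
    ["python", "pandas", "statistics"],
    ["database", "etl", "warehouse", "sql"],
]
model = fit_lda(docs, num_topics=2)
for k, terms in enumerate(top_terms(model)):
    print(k, terms)
print(label_unseen(model, ["python", "statistics", "sql"]))
```

On a real corpus the documents would first be tokenized and cleaned, and the number of topics would be chosen by the analyst. An established library implementation would normally be preferable to a hand-written sampler like this one.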

Bibliographic Details
Authors:	Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Pucuhuayla-Revatta, Félix; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael
Format:	article
Publication date:	2023
Institution:	Universidad Autónoma del Perú
Repository:	AUTONOMA-Institucional
Language:	English
OAI Identifier:	oai:repositorio.autonoma.edu.pe:20.500.13067/2829
Resource link:	https://hdl.handle.net/20.500.13067/2829 https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
Access level:	open access
Subject:	Classify; Discovering; Latent dirichlet allocation; Text corpus; Topics; https://purl.org/pe-repo/ocde/ford#2.02.04

id	AUTO_74017a75fb384f1d6e4d6cdb83924b9f
oai_identifier_str	oai:repositorio.autonoma.edu.pe:20.500.13067/2829
network_acronym_str	AUTO
network_name_str	AUTONOMA-Institucional
repository_id_str	4774
dc.title.es_PE.fl_str_mv	Search and classify topics in a corpus of text using the latent dirichlet allocation model
author	Iparraguirre-Villanueva, Orlando
author_facet	Iparraguirre-Villanueva, Orlando Sierra-Liñan, Fernando Herrera Salazar, Jose Luis Beltozar-Clemente, Saul Pucuhuayla-Revatta, Félix Zapata-Paulini, Joselyn Cabanillas-Carbonell, Michael
author_role	author
author2	Sierra-Liñan, Fernando Herrera Salazar, Jose Luis Beltozar-Clemente, Saul Pucuhuayla-Revatta, Félix Zapata-Paulini, Joselyn Cabanillas-Carbonell, Michael
author2_role	author author author author author author
dc.contributor.author.fl_str_mv	Iparraguirre-Villanueva, Orlando Sierra-Liñan, Fernando Herrera Salazar, Jose Luis Beltozar-Clemente, Saul Pucuhuayla-Revatta, Félix Zapata-Paulini, Joselyn Cabanillas-Carbonell, Michael
dc.subject.es_PE.fl_str_mv	Classify Discovering Latent dirichlet allocation Text corpus Topics
topic	Classify Discovering Latent dirichlet allocation Text corpus Topics https://purl.org/pe-repo/ocde/ford#2.02.04
dc.subject.ocde.es_PE.fl_str_mv	https://purl.org/pe-repo/ocde/ford#2.02.04
description	This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the subjects. After processing, 12 themes were generated, which allowed ranking the most relevant terms to identify the skills of each candidate. This work concludes that candidates interested in data science must have technical skills: mastery of structured query language, of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology.
publishDate	2023
dc.date.accessioned.none.fl_str_mv	2023-11-30T16:01:47Z
dc.date.available.none.fl_str_mv	2023-11-30T16:01:47Z
dc.date.issued.fl_str_mv	2023
dc.type.es_PE.fl_str_mv	info:eu-repo/semantics/article
format	article
dc.identifier.uri.none.fl_str_mv	https://hdl.handle.net/20.500.13067/2829
dc.identifier.doi.none.fl_str_mv	https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
url	https://hdl.handle.net/20.500.13067/2829 https://doi.org/10.11591/ijeecs.v30.i1.pp246-256
dc.language.iso.es_PE.fl_str_mv	eng
language	eng
dc.rights.es_PE.fl_str_mv	info:eu-repo/semantics/openAccess
dc.rights.uri.es_PE.fl_str_mv	https://creativecommons.org/licenses/by/4.0/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by/4.0/
dc.format.es_PE.fl_str_mv	application/pdf
dc.publisher.es_PE.fl_str_mv	Indonesian Journal of Electrical Engineering and Computer Science
dc.source.none.fl_str_mv	reponame:AUTONOMA-Institucional instname:Universidad Autónoma del Perú instacron:AUTONOMA
instname_str	Universidad Autónoma del Perú
instacron_str	AUTONOMA
institution	AUTONOMA
reponame_str	AUTONOMA-Institucional
collection	AUTONOMA-Institucional
dc.source.volume.es_PE.fl_str_mv	30
dc.source.issue.es_PE.fl_str_mv	1
dc.source.beginpage.es_PE.fl_str_mv	246
dc.source.endpage.es_PE.fl_str_mv	256
bitstream.url.fl_str_mv	http://repositorio.autonoma.edu.pe/bitstream/20.500.13067/2829/3/6_2023.pdf.txt http://repositorio.autonoma.edu.pe/bitstream/20.500.13067/2829/4/6_2023.pdf.jpg http://repositorio.autonoma.edu.pe/bitstream/20.500.13067/2829/1/6_2023.pdf http://repositorio.autonoma.edu.pe/bitstream/20.500.13067/2829/2/license.txt
bitstream.checksum.fl_str_mv	5ecebb7582100c3bbc167d7bc3d68902 2c669bddaa25d0d930f11722bdaed6ba 9612a19922a6b02e74c30e5467962abb 9243398ff393db1861c890baeaeee5f9
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositorio de la Universidad Autonoma del Perú
repository.mail.fl_str_mv	repositorio@autonoma.pe
_version_	1835915295486640128
score	13.932913

Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the National Digital Repository of Open Access Science, Technology and Innovation (ALICIA).
