Search and classify topics in a corpus of text using the latent dirichlet allocation model
Article Description
This work aims at discovering topics in a text corpus and classifying the most relevant terms for each of the discovered topics. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and...
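As an illustration of the topic-discovery and term-ranking idea described above, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one standard inference method for the model. This record does not say which implementation or hyperparameters the authors used, so everything here is an assumption: the function names `fit_lda` and `top_terms`, the `alpha` and `beta` values, the toy four-document corpus, and the choice of 2 topics are all hypothetical and chosen only to keep the example small.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling on tokenized documents.

    Returns (topic_word, doc_topic, vocab), where the first two are count tables.
    alpha and beta are illustrative Dirichlet priors, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    return topic_word, doc_topic, vocab


def top_terms(topic_word, vocab, n=3):
    """Rank the n most frequently assigned terms for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Hypothetical, already-tokenized resume snippets (not data from the paper).
docs = [
    "python sql pandas sql python".split(),
    "sql database query python".split(),
    "communication teamwork leadership communication".split(),
    "leadership teamwork management".split(),
]
topic_word, doc_topic, vocab = fit_lda(docs, num_topics=2)
for k, terms in enumerate(top_terms(topic_word, vocab)):
    print(f"topic {k}: {terms}")
```

On a real corpus one would normally rely on an established library rather than this sampler, and would choose the number of topics by evaluating model quality; the sketch only shows how per-topic term rankings fall out of the fitted count tables.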
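Authors: | Pucuhuayla Revatta, Félix Rogelio; Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael |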
---|---|
Format: | article |
Publication date: | 2023 |
Institution: | Universidad Tecnológica del Perú |
Repository: | UTP-Institucional |
Language: | English |
OAI Identifier: | oai:repositorio.utp.edu.pe:20.500.12867/6686 |
Resource link: | https://hdl.handle.net/20.500.12867/6686 https://doi.org/10.11591/ijeecs.v30.i1.pp246-256 |
Access level: | open access |
Subject: | Latent Dirichlet allocation; Topic modeling; Mathematical statistics; https://purl.org/pe-repo/ocde/ford#1.01.03 |
id |
UTPD_db075e071c18e72fca0c2148900ea814 |
---|---|
oai_identifier_str |
oai:repositorio.utp.edu.pe:20.500.12867/6686 |
network_acronym_str |
UTPD |
network_name_str |
UTP-Institucional |
repository_id_str |
4782 |
dc.title.es_PE.fl_str_mv |
Search and classify topics in a corpus of text using the latent dirichlet allocation model |
title |
Search and classify topics in a corpus of text using the latent dirichlet allocation model |
author |
Pucuhuayla Revatta, Félix Rogelio |
author_facet |
Pucuhuayla Revatta, Félix Rogelio; Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael |
author_role |
author |
author2 |
Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael |
author2_role |
author author author author author author |
dc.contributor.author.fl_str_mv |
Pucuhuayla Revatta, Félix Rogelio; Iparraguirre-Villanueva, Orlando; Sierra-Liñan, Fernando; Herrera Salazar, Jose Luis; Beltozar-Clemente, Saul; Zapata-Paulini, Joselyn; Cabanillas-Carbonell, Michael |
dc.subject.es_PE.fl_str_mv |
Latent Dirichlet allocation; Topic modeling; Mathematical statistics |
topic |
Latent Dirichlet allocation; Topic modeling; Mathematical statistics; https://purl.org/pe-repo/ocde/ford#1.01.03 |
dc.subject.ocde.es_PE.fl_str_mv |
https://purl.org/pe-repo/ocde/ford#1.01.03 |
description |
This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the subjects. After processing, 12 themes were generated, which allowed ranking the most relevant terms to identify the skills of each candidate. This work concludes that candidates interested in data science must have skills in the following topics: first, they must be technical, with mastery of structured query language (SQL), of programming languages such as R, Python, and Java, and of data management, among other tools associated with the technology. |
publishDate |
2023 |
dc.date.accessioned.none.fl_str_mv |
2023-03-03T15:59:03Z |
dc.date.available.none.fl_str_mv |
2023-03-03T15:59:03Z |
dc.date.issued.fl_str_mv |
2023 |
dc.type.es_PE.fl_str_mv |
info:eu-repo/semantics/article |
dc.type.version.es_PE.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.issn.none.fl_str_mv |
2502-4760 |
dc.identifier.uri.none.fl_str_mv |
https://hdl.handle.net/20.500.12867/6686 |
dc.identifier.journal.es_PE.fl_str_mv |
Indonesian Journal of Electrical Engineering and Computer Science |
dc.identifier.doi.none.fl_str_mv |
https://doi.org/10.11591/ijeecs.v30.i1.pp246-256 |
identifier_str_mv |
2502-4760 Indonesian Journal of Electrical Engineering and Computer Science |
url |
https://hdl.handle.net/20.500.12867/6686 https://doi.org/10.11591/ijeecs.v30.i1.pp246-256 |
dc.language.iso.es_PE.fl_str_mv |
eng |
language |
eng |
dc.relation.ispartofseries.none.fl_str_mv |
Indonesian Journal of Electrical Engineering and Computer Science;vol.30, n°1, pp. 246~256 |
dc.rights.es_PE.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.rights.uri.es_PE.fl_str_mv |
http://creativecommons.org/licenses/by-sa/4.0/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-sa/4.0/ |
dc.format.es_PE.fl_str_mv |
application/pdf |
dc.publisher.es_PE.fl_str_mv |
Institute of Advanced Engineering and Science |
dc.publisher.country.es_PE.fl_str_mv |
ID |
dc.source.es_PE.fl_str_mv |
Repositorio Institucional - UTP Universidad Tecnológica del Perú |
dc.source.none.fl_str_mv |
reponame:UTP-Institucional instname:Universidad Tecnológica del Perú instacron:UTP |
instname_str |
Universidad Tecnológica del Perú |
instacron_str |
UTP |
institution |
UTP |
reponame_str |
UTP-Institucional |
collection |
UTP-Institucional |
bitstream.url.fl_str_mv |
http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/3/F.Pucuhuayla_Articulo.pdf.txt http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/4/F.Pucuhuayla_Articulo.pdf.jpg http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/1/F.Pucuhuayla_Articulo.pdf http://repositorio.utp.edu.pe/bitstream/20.500.12867/6686/2/license.txt |
bitstream.checksum.fl_str_mv |
5ecebb7582100c3bbc167d7bc3d68902 419dd69407b5ad2b879172c44a78310b 9612a19922a6b02e74c30e5467962abb 8a4605be74aa9ea9d79846c1fba20a33 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositorio Institucional de la Universidad Tecnológica del Perú |
repository.mail.fl_str_mv |
repositorio@utp.edu.pe |
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository hosting this document or dataset. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).