Extracción automática de terminología multilingüe empleada en la implementación de tecnologías de la información y las comunicaciones, aplicada a castellano e inglés

Peralta Melgar, Daniel Miguel

Article Description

Organizations face growing pressure to implement Artificial Intelligence tools and other Information and Communication Technologies (ICT). However, the rapid evolution of ICT and the lack of up-to-date implementation methodologies in several languages hinder progress...

Bibliographic Details
Author:	Peralta Melgar, Daniel Miguel
Format:	master's thesis
Publication date:	2025
Institution:	Pontificia Universidad Católica del Perú
Repository:	PUCP-Tesis
Language:	Spanish
OAI Identifier:	oai:tesis.pucp.edu.pe:20.500.12404/30393
Resource link:	http://hdl.handle.net/20.500.12404/30393
Access level:	open access
Subject:	Natural language processing (Computer science); Machine learning (Artificial intelligence); Information technology; Text mining; https://purl.org/pe-repo/ocde/ford#1.02.02

id	PUCP_77bbb92185e416c49c57f3da5fbd796c
oai_identifier_str	oai:tesis.pucp.edu.pe:20.500.12404/30393
network_acronym_str	PUCP
network_name_str	PUCP-Tesis
repository_id_str	.
dc.title.none.fl_str_mv	Extracción automática de terminología multilingüe empleada en la implementación de tecnologías de la información y las comunicaciones, aplicada a castellano e inglés
author	Peralta Melgar, Daniel Miguel
author_facet	Peralta Melgar, Daniel Miguel
author_role	author
dc.contributor.advisor.fl_str_mv	Oncevay Marcos, Félix Arturo
dc.contributor.author.fl_str_mv	Peralta Melgar, Daniel Miguel
dc.subject.none.fl_str_mv	Natural language processing (Computer science); Machine learning (Artificial intelligence); Information technology; Text mining
topic	Natural language processing (Computer science); Machine learning (Artificial intelligence); Information technology; Text mining; https://purl.org/pe-repo/ocde/ford#1.02.02
dc.subject.ocde.none.fl_str_mv	https://purl.org/pe-repo/ocde/ford#1.02.02
description	Organizations face growing pressure to implement Artificial Intelligence tools and other Information and Communication Technologies (ICT). However, the rapid evolution of ICT and the lack of up-to-date implementation methodologies in several languages hinder progress. This work aims to facilitate the updating of implementation methodologies. To this end, lists of terms in Spanish and English are created for the implementation of two types of ICT, using several models trained for Automatic Term Extraction (ATE). These term lists can later be used to fine-tune text classification, summarization, and translation models, which in turn can help update implementation methodologies. The term lists were built with an incremental methodology that combines model output with manual review. Five pre-trained BERT models and one XLNet model were tested, with results superior to those of previous research, which supports the feasibility of ATE for topics and languages with little training data. A method to measure the similarity between term lists is proposed. Experimental results indicate that corpora in different languages on the same topic may take different approaches, suggesting that knowledge would be enriched if publications in several languages were used together as sources. A proposed metric for evaluating a model's ability to identify previously unseen terms suggests that this ability does not depend solely on recognizing previously seen words.
publishDate	2025
dc.date.accessioned.none.fl_str_mv	2025-04-01T17:50:08Z
dc.date.created.none.fl_str_mv	2025
dc.date.issued.fl_str_mv	2025-04-01
dc.type.none.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
dc.identifier.uri.none.fl_str_mv	http://hdl.handle.net/20.500.12404/30393
url	http://hdl.handle.net/20.500.12404/30393
dc.language.iso.none.fl_str_mv	spa
language	spa
dc.relation.ispartof.fl_str_mv	SUNEDU
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess
dc.rights.uri.none.fl_str_mv	http://creativecommons.org/licenses/by/2.5/pe/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by/2.5/pe/
dc.publisher.es_ES.fl_str_mv	Pontificia Universidad Católica del Perú
dc.publisher.country.none.fl_str_mv	PE
dc.source.none.fl_str_mv	reponame:PUCP-Tesis instname:Pontificia Universidad Católica del Perú instacron:PUCP
instname_str	Pontificia Universidad Católica del Perú
instacron_str	PUCP
institution	PUCP
reponame_str	PUCP-Tesis
collection	PUCP-Tesis
bitstream.url.fl_str_mv	https://tesis.pucp.edu.pe/bitstreams/2f0ba325-5781-4635-9f6e-9ecd9b26d659/download https://tesis.pucp.edu.pe/bitstreams/36af01a9-3cfb-4698-86ee-8949315486fc/download https://tesis.pucp.edu.pe/bitstreams/24dc48c7-cfe7-4853-90b8-97c0df07c40e/download https://tesis.pucp.edu.pe/bitstreams/e5bde566-5127-4f27-9a1f-e6e24d06260f/download https://tesis.pucp.edu.pe/bitstreams/4b507e79-7f04-4e35-8d0f-3602b561affa/download https://tesis.pucp.edu.pe/bitstreams/0c90e343-5fd8-4420-8f0f-d42d7b515399/download https://tesis.pucp.edu.pe/bitstreams/7c5d3cc0-7923-4011-a399-2242de98076f/download https://tesis.pucp.edu.pe/bitstreams/1b2f0ceb-4942-472e-830a-a7792a51d883/download
bitstream.checksum.fl_str_mv	75a337fbd30333f5f1e7b076917bd78f 752dd6ac64f5da7d0e1ccb89b4e2cbee 48725b7f9a634bc551f52084693052d1 bb9bdc0b3349e4284e09149f943790b4 17b579e64852a3287f9558845709259a 7189451cf489665f7f2fad136a3c8280 e362e7c95f4da970b250a8262902ef83 1af6c7f2c7bd99eaeb4421e767fc67d2
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositorio de Tesis PUCP
repository.mail.fl_str_mv	raul.sifuentes@pucp.pe
_version_	1834736801236910080
spelling	Maestro en Informática con mención en Ciencias de la Computación; Maestría; Pontificia Universidad Católica del Perú. Escuela de Posgrado; https://orcid.org/0000-0001-7675-6208; Gómez Montoya, Héctor Erasmo; Oncevay Marcos, Félix Arturo; Sobrevilla Cabezudo, Marco Antonio; https://purl.org/pe-repo/renati/level#maestro; https://purl.org/pe-repo/renati/type#tesis; ORIGINAL: PERALTA_MELGAR_DANIEL_MIGUEL_EXTRACCION_AUTOMATICA.pdf (full text, application/pdf); PERALTA_MELGAR_DANIEL_MIGUEL_T.pdf (originality report, application/pdf)
score	13.931421
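
The description in this record mentions a method for measuring the similarity between term lists and a metric for a model's ability to identify previously unseen terms, but the record does not say how either is computed. The following is only a minimal sketch under stated assumptions: Jaccard overlap stands in for the similarity measure, and recall restricted to terms absent from the training data stands in for the unseen-term metric. All function names and the toy term lists are hypothetical, and the sketch assumes both lists are in the same language or have already been aligned.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def jaccard_similarity(terms_a, terms_b) -> float:
    """Jaccard overlap between two term lists (an assumed measure, not the thesis's)."""
    a = {normalize(t) for t in terms_a}
    b = {normalize(t) for t in terms_b}
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def unseen_term_recall(predicted, gold, seen_in_training) -> float:
    """Recall computed only over gold terms that were absent from training (assumed metric)."""
    seen = {normalize(t) for t in seen_in_training}
    unseen_gold = {normalize(t) for t in gold} - seen
    if not unseen_gold:
        return 0.0  # nothing unseen to find
    hits = unseen_gold & {normalize(t) for t in predicted}
    return len(hits) / len(unseen_gold)


# Hypothetical toy lists, invented for illustration only.
list_a = ["machine learning", "Data  Governance", "change management"]
list_b = ["Machine Learning", "data governance", "pilot project", "stakeholder"]
print(jaccard_similarity(list_a, list_b))
```

Working on sets rather than raw lists means that repeated terms and differences in case or spacing do not affect either score.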

Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is hosted. CONCYTEC is not responsible for the contents (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).