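Extracción automática de terminología multilingüe empleada en la implementación de tecnologías de la información y las comunicaciones, aplicada a castellano e inglés [Automatic extraction of multilingual terminology used in the implementation of information and communication technologies, applied to Spanish and English]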

Bibliographic Details
Author: Peralta Melgar, Daniel Miguel
Format: master's thesis
Publication Date: 2025
Institution: Pontificia Universidad Católica del Perú
Repository: PUCP-Tesis
Language: Spanish
OAI Identifier: oai:tesis.pucp.edu.pe:20.500.12404/30393
Resource Link: http://hdl.handle.net/20.500.12404/30393
Access Level: open access
Subject: Natural language processing (Computer science)
Machine learning (Artificial intelligence)
Information technology
Text mining
https://purl.org/pe-repo/ocde/ford#1.02.02
Description
Summary: Organizations are under growing pressure to implement Artificial Intelligence tools and other types of Information and Communication Technologies (ICT). However, the rapid evolution of ICTs and the lack of up-to-date implementation methodologies in several languages hinder progress. The goal of this work is to help keep implementation methodologies up to date. To this end, lists of terms in Spanish and English are created for the implementation of two types of ICT, using several models trained for Automatic Term Extraction (ATE). These term lists can later be used to fine-tune text classification, abstracting, and translation models, which in turn can help update implementation methodologies. The term lists were created using an incremental methodology that combines model output with manual review. Five pre-trained BERT models and one XLNet model were tested, with results superior to those of previous research, which supports the feasibility of ATE for topics and languages with little training data. A method to measure the similarity between lists of terms is proposed. Experimental results indicate that corpora in different languages on the same topic may take different approaches, suggesting that knowledge would be enriched if publications in several languages were used together as sources. A proposed metric for evaluating a model's ability to identify previously unseen terms suggests that this ability does not depend solely on recognizing previously seen words.
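This record does not describe how the thesis measures similarity between term lists or scores the detection of unseen terms. As a rough illustration only, the sketch below uses two common stand-ins: Jaccard overlap between two same-language term lists, and recall restricted to gold terms absent from the training data. The function names, the normalization step, and the sample terms are all assumptions made for this example, not the author's method.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def jaccard_similarity(terms_a, terms_b) -> float:
    """Jaccard overlap of two term lists (assumed stand-in, not the thesis's measure).

    Returns 0.0 for disjoint lists and 1.0 for identical ones.
    """
    a = {normalize(t) for t in terms_a}
    b = {normalize(t) for t in terms_b}
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def unseen_term_recall(gold_terms, predicted_terms, training_terms) -> float:
    """Recall computed only over gold terms that never appeared in training.

    Hypothetical metric for illustration; the thesis's own definition may differ.
    """
    seen = {normalize(t) for t in training_terms}
    unseen_gold = {normalize(t) for t in gold_terms} - seen
    if not unseen_gold:
        return 0.0  # nothing unseen to recover
    predicted = {normalize(t) for t in predicted_terms}
    return len(unseen_gold & predicted) / len(unseen_gold)


# Made-up sample data, for demonstration only.
list_a = ["Machine Learning", "chatbot"]
list_b = ["machine  learning", "neural network"]
similarity = jaccard_similarity(list_a, list_b)

recall = unseen_term_recall(
    gold_terms=["chatbot", "neural network", "machine learning"],
    predicted_terms=["chatbot", "machine learning"],
    training_terms=["machine learning"],
)
print(f"similarity={similarity:.3f} unseen_recall={recall:.3f}")
```

Comparing a Spanish list with an English one would additionally require aligning terms across languages (for example through translation), which this sketch does not attempt.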
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is hosted. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).