Towards automatic detection of lexical borrowings in wordlists - with application to Latin American languages

Bibliographic Details
Author: Miller, John Edward
Format: doctoral thesis
Publication Date: 2024
Institution: Pontificia Universidad Católica del Perú
Repository: PUCP-Tesis
Language: English
OAI Identifier: oai:tesis.pucp.edu.pe:20.500.12404/29444
Resource link: http://hdl.handle.net/20.500.12404/29444
Access level: open access
Subject: Machine learning (Artificial intelligence)
Computational linguistics
Neural networks (Computing)
Historical linguistics
https://purl.org/pe-repo/ocde/ford#2.00.00
Description
Summary: Knowing which words of a language are inherited from the ancestor language, which are borrowed from contact languages, and which are recently created, together with the timing of critical events in the culture, enables modeling of language history, including language phylogeny, language contact, and other novel influences on the culture. However, determining which words or forms are borrowed, and from which language, is a difficult, time-consuming, and often fascinating task. It is usually performed by historical linguists and is limited by the time and expertise available. While semi-automated methods exist to identify borrowed words and their donors, there is still substantial opportunity for improvement. We construct a new language-model-based monolingual method, competing cross-entropies, based on word source groupings within monolingual wordlists; improve existing multilingual sequence comparison methods, namely closest match on language pairs and cognate-based detection across multiple languages; and construct a classifier-based meta-method combining closest match and cross-entropy functions. We also define an alternative goal of borrowing detection for dominant donor languages, which allows determination of both borrowing and source. We apply the monolingual methods to a global dataset of 41 languages, and the multilingual and meta-methods to a newly constituted dataset of seven Latin American languages. We also initiate work on a dataset of 21 Pano-Tacanan and regional languages, with Spanish, Portuguese, and Quechua added as donor languages, for subsequent application of borrowing detection methods. The competing cross-entropies method establishes a benchmark for automatic borrowing detection for the world online loan database, the dominant donor multiple sequence comparison method improves over the competing cross-entropies method, and the classifier meta-method with sequence comparison and cross-entropy functions performs substantially better overall.
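The competing cross-entropies idea named in the summary can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes one language model is trained on known inherited words and another on known borrowed words, and that a new word is labeled by whichever model gives it the lower cross-entropy. A character bigram model with add-one smoothing is swapped in here for whatever language model the thesis actually uses, and the wordlists, class name, and function names are invented for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # word-boundary markers (illustrative choice)


class BigramModel:
    """Character bigram model with add-one smoothing.

    A simple stand-in for the language models in the thesis (assumption).
    """

    def __init__(self, words, alphabet):
        # Possible next symbols: every character plus the end marker.
        self.vocab_size = len(alphabet) + 1
        self.bigrams = Counter()
        self.contexts = Counter()
        for word in words:
            symbols = [BOS, *word, EOS]
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def cross_entropy(self, word):
        """Average negative log2 probability per predicted symbol."""
        symbols = [BOS, *word, EOS]
        total = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + self.vocab_size)
            total -= math.log2(p)
        return total / (len(symbols) - 1)


def train_competing_models(inherited, borrowed):
    """Train one model per word source over a shared alphabet."""
    alphabet = set("".join(inherited + borrowed))
    return BigramModel(inherited, alphabet), BigramModel(borrowed, alphabet)


def is_borrowed(word, inherited_model, borrowed_model):
    """Label a word as borrowed when the borrowed-word model fits it better."""
    return borrowed_model.cross_entropy(word) < inherited_model.cross_entropy(word)


# Invented toy wordlists, not data from the thesis.
TOY_INHERITED = ["kapa", "taka", "pata", "kata", "paka"]
TOY_BORROWED = ["melo", "lomi", "nelo", "milo", "lemo"]

inherited_model, borrowed_model = train_competing_models(TOY_INHERITED, TOY_BORROWED)
```

With these toy lists, `is_borrowed("lome", inherited_model, borrowed_model)` compares the two cross-entropies for an unseen word. The decision rule is a plain comparison; a real system would likely tune a threshold or margin on held-out data.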
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).