Towards automatic detection of lexical borrowings in wordlists - with application to Latin American languages

Miller, John Edward

Article Description

Knowing what words of a language are inherited from the ancestor language, which are borrowed from contact languages, which are recently created, and the timing of critical events in the culture, enables modeling of language history including language phylogeny, language contact, and other novel inf...

Bibliographic Details
Author:	Miller, John Edward
Format:	doctoral thesis
Publication Date:	2024
Institution:	Pontificia Universidad Católica del Perú
Repository:	PUCP-Tesis
Language:	English
OAI Identifier:	oai:tesis.pucp.edu.pe:20.500.12404/29444
Resource Link:	http://hdl.handle.net/20.500.12404/29444
Access Level:	open access
Subject:	Machine learning (Artificial intelligence); Computational linguistics; Neural networks (Computer science); Historical linguistics; https://purl.org/pe-repo/ocde/ford#2.00.00

id	PUCP_01e88a777e7c95b9e1eb5455bf6e43d0
oai_identifier_str	oai:tesis.pucp.edu.pe:20.500.12404/29444
network_acronym_str	PUCP
network_name_str	PUCP-Tesis
repository_id_str	.
dc.title.es_ES.fl_str_mv	Towards automatic detection of lexical borrowings in wordlists - with application to Latin American languages
author	Miller, John Edward
author_facet	Miller, John Edward
author_role	author
dc.contributor.advisor.fl_str_mv	Beltrán Castañón, César Armando; Zariquiey Biondi, Roberto Daniel; List, Johann-Mattis
dc.contributor.author.fl_str_mv	Miller, John Edward
dc.subject.es_ES.fl_str_mv	Machine learning (Artificial intelligence); Computational linguistics; Neural networks (Computer science); Historical linguistics
topic	Machine learning (Artificial intelligence); Computational linguistics; Neural networks (Computer science); Historical linguistics; https://purl.org/pe-repo/ocde/ford#2.00.00
dc.subject.ocde.none.fl_str_mv	https://purl.org/pe-repo/ocde/ford#2.00.00
description	Knowing which words of a language are inherited from the ancestor language, which are borrowed from contact languages, and which are recently created, together with the timing of critical events in the culture, enables modeling of language history, including language phylogeny, language contact, and other novel influences on the culture. However, determining which words or forms are borrowed, and from which language, is a difficult, time-consuming, and often fascinating task. It is usually performed by historical linguists and is limited by the time and expertise available. While semi-automated methods exist to identify borrowed words and their donors, there is still substantial opportunity for improvement. We construct a new monolingual method based on language models, competing cross-entropies, which relies on word source groupings within monolingual wordlists; improve existing multilingual sequence comparison methods, namely closest match on language pairs and cognate-based detection on multiple languages; and construct a classifier-based meta-method that combines closest match and cross-entropy functions. We also define an alternative goal of borrowing detection for dominant donor languages, which allows determination of both borrowing and source. We apply the monolingual methods to a global dataset of 41 languages (WOLD), and the multilingual and meta methods to a newly constituted dataset of seven Latin American languages. We also initiate work on a dataset of 21 Pano-Tacanan and regional languages, with Spanish, Portuguese, and Quechua added as donor languages, for subsequent application of borrowing detection methods. The competing cross-entropies method establishes a benchmark for automatic borrowing detection on the World Loanword Database (WOLD); the dominant donor multiple sequence comparison method improves over the competing cross-entropies method; and the classifier meta-method with sequence comparison and cross-entropy functions performs substantially better overall.
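The competing cross-entropies idea named in the abstract can be illustrated with a minimal sketch: one language model is trained on words marked as inherited, another on words marked as borrowed, and a new word is assigned to whichever model gives it the lower cross-entropy. The add-one-smoothed character bigram model, the class and function names, and the toy wordlists below are all assumptions made for illustration; they are not taken from the thesis, whose actual models and data may differ.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical word-boundary markers


class BigramModel:
    """Add-one-smoothed character bigram model (illustrative stand-in only)."""

    def __init__(self, words, alphabet):
        # Possible next symbols: every character plus the end marker.
        self.vocab_size = len(alphabet) + 1
        self.bigrams = Counter()
        self.contexts = Counter()
        for word in words:
            symbols = [BOS, *word, EOS]
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def cross_entropy(self, word):
        """Average negative log2 probability per transition, in bits."""
        symbols = [BOS, *word, EOS]
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + self.vocab_size)
            log_prob += math.log2(p)
        return -log_prob / (len(symbols) - 1)


def is_borrowed(word, inherited_model, borrowed_model):
    """Label a word as borrowed when the borrowed-word model fits it better."""
    return borrowed_model.cross_entropy(word) < inherited_model.cross_entropy(word)


# Toy, invented wordlists: "inherited" words use simple CV syllables,
# "borrowed" words use consonant clusters.
inherited_words = ["kata", "taka", "naka", "tana", "kana"]
borrowed_words = ["strob", "blost", "brols", "stolb"]
alphabet = set("".join(inherited_words + borrowed_words))

inherited_model = BigramModel(inherited_words, alphabet)
borrowed_model = BigramModel(borrowed_words, alphabet)

for candidate in ["nata", "brost"]:
    print(candidate, is_borrowed(candidate, inherited_model, borrowed_model))
```

Sharing one alphabet across both models keeps the smoothing denominators comparable, so the decision depends on the observed sound patterns rather than on differing vocabulary sizes.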
publishDate	2024
dc.date.accessioned.none.fl_str_mv	2024-11-18T20:01:16Z
dc.date.available.none.fl_str_mv	2024-11-18T20:01:16Z
dc.date.created.none.fl_str_mv	2024
dc.date.issued.fl_str_mv	2024-11-18
dc.type.es_ES.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
dc.identifier.uri.none.fl_str_mv	http://hdl.handle.net/20.500.12404/29444
url	http://hdl.handle.net/20.500.12404/29444
dc.language.iso.none.fl_str_mv	eng
language	eng
dc.relation.ispartof.fl_str_mv	SUNEDU
dc.rights.es_ES.fl_str_mv	info:eu-repo/semantics/openAccess
dc.rights.uri.*.fl_str_mv	http://creativecommons.org/licenses/by/2.5/pe/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by/2.5/pe/
dc.publisher.es_ES.fl_str_mv	Pontificia Universidad Católica del Perú
dc.publisher.country.none.fl_str_mv	PE
dc.source.none.fl_str_mv	reponame:PUCP-Tesis instname:Pontificia Universidad Católica del Perú instacron:PUCP
instname_str	Pontificia Universidad Católica del Perú
instacron_str	PUCP
institution	PUCP
reponame_str	PUCP-Tesis
collection	PUCP-Tesis
bitstream.url.fl_str_mv	https://tesis.pucp.edu.pe/bitstreams/8376dcc4-214b-4ae9-9f64-64430cc15b18/download https://tesis.pucp.edu.pe/bitstreams/a7109a6e-319b-42c0-bf21-0ff67092eb4b/download https://tesis.pucp.edu.pe/bitstreams/47883cda-c76c-46b4-9305-3a9bf495e54e/download https://tesis.pucp.edu.pe/bitstreams/52556b92-38b1-48b5-84cb-2623a28038f9/download https://tesis.pucp.edu.pe/bitstreams/20bb1fbf-9f1e-486d-b92a-6a3132d3b170/download https://tesis.pucp.edu.pe/bitstreams/a61db2c6-1ada-45d3-a9eb-30bf6d76c944/download https://tesis.pucp.edu.pe/bitstreams/61e1241e-9a3c-4d0e-899b-f55995e9b17c/download
bitstream.checksum.fl_str_mv	b7f9ba2c65759697874cb9eac1da8958 414190c2f1ca4ae79a98f749f80ce4c3 5a4ffbc01f1b5eb70a835dac0d501661 8a4605be74aa9ea9d79846c1fba20a33 79bfdacd8307c68e71d0fec2e415d3d3 97025f5ca85c64594c620597fbcdbad1 711cfc79adc891f5e4ff0a9a112c8e6f
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositorio de Tesis PUCP
repository.mail.fl_str_mv	raul.sifuentes@pucp.pe
_version_	1856933045785853952
degree	Doctor en Ingeniería (Doctorado)
degree grantor	Pontificia Universidad Católica del Perú. Escuela de Posgrado
discipline	Ingeniería
orcid	https://orcid.org/0000-0002-0173-4140 https://orcid.org/0000-0002-1421-1314
other listed names	Mccoy, Kathleen Filliben; Beltrán Castañón, César Armando; Pardo, Thiago; Oncevay Marcos, Félix Arturo; Vera Zúñiga, Javier Maximiliano
renati	https://purl.org/pe-repo/renati/level#doctor https://purl.org/pe-repo/renati/type#tesis
record timestamp	2026-02-03 10:05:08.323
files (bundle, name, description, MIME type, size in bytes; download URLs and MD5 checksums appear in the bitstream fields above, in the same order)
ORIGINAL	MILLER_JOHN_EDWARD.pdf	Full text	application/pdf	3144075
ORIGINAL	MILLER_JOHN_EDWARD_T.pdf	Originality report	application/pdf	24391190
CC-LICENSE	license_rdf		application/rdf+xml; charset=utf-8	914
LICENSE	license.txt		text/plain; charset=utf-8	1748
THUMBNAIL	MILLER_JOHN_EDWARD.pdf.jpg	IM Thumbnail	image/jpeg	18899
THUMBNAIL	MILLER_JOHN_EDWARD_T.pdf.jpg	IM Thumbnail	image/jpeg	4586
TEXT	MILLER_JOHN_EDWARD_T.pdf.txt	Extracted text	text/plain	4464