Language identification with scarce data: A case study from Peru

Bibliographic Details
Authors: Espichán-Linares A., Oncevay-Marcos A.
Format: conference object
Publication date: 2018
Institution: Consejo Nacional de Ciencia Tecnología e Innovación
Repository: CONCYTEC-Institucional
Language: English
OAI Identifier: oai:repositorio.concytec.gob.pe:20.500.12390/672
Resource link: https://hdl.handle.net/20.500.12390/672
https://doi.org/10.1007/978-3-319-90596-9_7
Access level: open access
Subject: The standard model
Deep learning
Information management
Linguistics
Natural language processing systems
Best fit
Complex task
Corpus-based methods
Language identification
Learning approach
Multiple class
State of the art
Big data
https://purl.org/pe-repo/ocde/ford#2.00.00
Description
Summary: Language identification is a fundamental task in natural language processing, where corpus-based methods dominate state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and many classes, and to analyze which of the best-known methods is the best fit. In this respect, Peru poses a considerable challenge as a multicultural and multilingual country. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
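As a rough illustration of what a "standard" (non-neural) language identifier can look like, the sketch below trains a multinomial naive Bayes classifier on character n-grams. This is an assumption made for illustration only: the summary does not state which features or classifier the authors used. The class name `NaiveBayesLangID`, the helper `char_ngrams`, and the English and Spanish toy sentences are all hypothetical placeholders and do not come from the corpus described above.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams (lengths n_min..n_max) of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


class NaiveBayesLangID:
    """Hypothetical multinomial naive Bayes over character n-grams (not the authors' model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # additive (Laplace) smoothing constant
        self.ngram_counts = {}        # label -> Counter of n-gram counts
        self.totals = {}              # label -> total number of n-grams seen
        self.doc_counts = Counter()   # label -> number of training samples
        self.vocab = set()            # all n-grams seen during training

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text)
            self.ngram_counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.ngram_counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # log prior plus smoothed log likelihood of each observed n-gram
            score = math.log(self.doc_counts[label] / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram, freq in grams.items():
                score += freq * math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for this sketch; a real setup would use the annotated corpus).
train = [
    ("the house is near the river", "eng"),
    ("where are you going today", "eng"),
    ("this is a short sentence", "eng"),
    ("la casa está cerca del río", "spa"),
    ("a dónde vas hoy", "spa"),
    ("esta es una oración corta", "spa"),
]

model = NaiveBayesLangID().fit(train)
print(model.predict("the river is short"))
print(model.predict("la oración está cerca"))
```

Character n-grams are a common choice when data is scarce because they need no tokenizer or dictionary and still capture orthographic patterns from a handful of sentences. With many closely related languages and dialects, the n-gram range and the smoothing constant would likely need tuning on held-out data.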
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).