Language identification with scarce data: A case study from Peru

Article Description

Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state of the art in multilingual setups. However, the task still needs to be extended to scenarios with scarce data and many classes, analyzing which of...

Bibliographic Details
Authors: Espichán-Linares A., Oncevay-Marcos A.
Format: conference object
Publication Date: 2018
Institution: Consejo Nacional de Ciencia Tecnología e Innovación
Repository: CONCYTEC-Institucional
Language: English
OAI Identifier: oai:repositorio.concytec.gob.pe:20.500.12390/672
Resource Link: https://hdl.handle.net/20.500.12390/672
https://doi.org/10.1007/978-3-319-90596-9_7
Access Level: open access
Subject: The standard model
Deep learning
Information management
Linguistics
Natural language processing systems
Best fit
Complex task
Corpus-based methods
Language identification
Learning approach
Multiple class
State of the art
Big data
https://purl.org/pe-repo/ocde/ford#2.00.00
id CONC_ca82d10d70d9afe2da44f8344a93ac1d
oai_identifier_str oai:repositorio.concytec.gob.pe:20.500.12390/672
network_acronym_str CONC
network_name_str CONCYTEC-Institucional
repository_id_str 4689
dc.title.none.fl_str_mv Language identification with scarce data: A case study from Peru
title Language identification with scarce data: A case study from Peru
author Espichán-Linares A.
author_facet Espichán-Linares A.
Oncevay-Marcos A.
author_role author
author2 Oncevay-Marcos A.
author2_role author
dc.contributor.author.fl_str_mv Espichán-Linares A.
Oncevay-Marcos A.
dc.subject.none.fl_str_mv The standard model
topic The standard model
Deep learning
Information management
Linguistics
Natural language processing systems
Best fit
Complex task
Corpus-based methods
Language identification
Learning approach
Multiple class
State of the art
Big data
https://purl.org/pe-repo/ocde/ford#2.00.00
dc.subject.es_PE.fl_str_mv Deep learning
Information management
Linguistics
Natural language processing systems
Best fit
Complex task
Corpus-based methods
Language identification
Learning approach
Multiple class
State of the art
Big data
dc.subject.ocde.none.fl_str_mv https://purl.org/pe-repo/ocde/ford#2.00.00
description Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state of the art in multilingual setups. However, the task still needs to be extended to scenarios with scarce data and many classes, and it remains to be determined which of the best-known methods fits them best. Peru, as a multicultural and multilingual country, offers a demanding test case. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, reaching 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
publishDate 2018
dc.date.accessioned.none.fl_str_mv 2024-05-30T23:13:38Z
dc.date.available.none.fl_str_mv 2024-05-30T23:13:38Z
dc.date.issued.fl_str_mv 2018
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
format conferenceObject
dc.identifier.isbn.none.fl_str_mv urn:isbn:9783319905952
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/20.500.12390/672
dc.identifier.doi.none.fl_str_mv https://doi.org/10.1007/978-3-319-90596-9_7
dc.identifier.scopus.none.fl_str_mv 2-s2.0-85045991573
identifier_str_mv urn:isbn:9783319905952
2-s2.0-85045991573
url https://hdl.handle.net/20.500.12390/672
https://doi.org/10.1007/978-3-319-90596-9_7
dc.language.iso.none.fl_str_mv eng
language eng
dc.relation.ispartof.none.fl_str_mv Communications in Computer and Information Science
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Springer Verlag
publisher.none.fl_str_mv Springer Verlag
dc.source.none.fl_str_mv reponame:CONCYTEC-Institucional
instname:Consejo Nacional de Ciencia Tecnología e Innovación
instacron:CONCYTEC
instname_str Consejo Nacional de Ciencia Tecnología e Innovación
instacron_str CONCYTEC
institution CONCYTEC
reponame_str CONCYTEC-Institucional
collection CONCYTEC-Institucional
repository.name.fl_str_mv Repositorio Institucional CONCYTEC
repository.mail.fl_str_mv repositorio@concytec.gob.pe
_version_ 1844883129740820480
score 13.413352
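
The description field above mentions fitting a "standard" approach for language identification, but this record does not say which model that is. As an illustration only, the sketch below assumes a common corpus-based baseline: a multinomial Naive Bayes classifier over character n-gram counts with add-one smoothing. The names `char_ngrams` and `NgramLanguageIdentifier`, the n-gram range, and the toy English/Spanish sentences are all hypothetical and are not taken from the paper or its corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max from lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramLanguageIdentifier:
    """Assumed baseline: multinomial Naive Bayes over character n-grams."""

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)   # label -> number of training sentences
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(char_ngrams(sentence))
        # Shared vocabulary size is used for add-one smoothing.
        self.vocab = {gram for c in self.counts.values() for gram in c}
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, sentence):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.counts):
            # Log prior plus smoothed log likelihood of every n-gram.
            score = math.log(self.doc_counts[label] / n_docs)
            for gram in char_ngrams(sentence):
                count = self.counts[label][gram]
                score += math.log((count + 1) / (self.totals[label] + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data for illustration; a real setup would train on the annotated corpus.
model = NgramLanguageIdentifier().fit(
    ["the cat sits on the mat", "where is the house",
     "el gato está en la casa", "dónde está el perro"],
    ["eng", "eng", "spa", "spa"],
)
print(model.predict("the house"))
```

Character n-grams are a frequent choice when data is scarce because they need no tokenizer or word list, but whether the study used this exact model cannot be confirmed from the record.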
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is held. CONCYTEC is not responsible for the contents (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).