Language identification with scarce data: A case study from Peru
Article description
Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and multiple classes, analyzing which of...
| Authors: | Espichán-Linares A.; Oncevay-Marcos A. |
|---|---|
| Format: | conference object |
| Publication date: | 2018 |
| Institution: | Consejo Nacional de Ciencia Tecnología e Innovación |
| Repository: | CONCYTEC-Institucional |
| Language: | English |
| OAI Identifier: | oai:repositorio.concytec.gob.pe:20.500.12390/672 |
| Resource link: | https://hdl.handle.net/20.500.12390/672 https://doi.org/10.1007/978-3-319-90596-9_7 |
| Access level: | open access |
| Subject: | The standard model; Deep learning; Information management; Linguistics; Natural language processing systems; Best fit; Complex task; Corpus-based methods; Language identification; Learning approach; Multiple class; State of the art; Big data; https://purl.org/pe-repo/ocde/ford#2.00.00 |
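
The abstract for this record contrasts a "standard" model with a deep learning one but does not say which standard model was used. As a rough illustration only, the sketch below assumes a character-trigram multinomial Naive Bayes classifier, a common baseline for language identification with little data. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the toy English/Spanish sentences are all hypothetical placeholders and are not taken from the paper or its 49-language corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Assumption: this stands in for the unspecified "standard" model.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: placeholder sentences, NOT the corpus described in the record.
train_texts = [
    "the house is near the river",
    "we are going to the market today",
    "la casa está cerca del río",
    "hoy vamos al mercado",
]
train_labels = ["eng", "eng", "spa", "spa"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("the river is near"))
print(model.predict("el mercado está cerca"))
```

Character n-grams are a frequent choice when data is scarce because they need no tokenizer or dictionary for each language, which matters when many of the classes have few digital resources.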
| id |
CONC_ca82d10d70d9afe2da44f8344a93ac1d |
|---|---|
| oai_identifier_str |
oai:repositorio.concytec.gob.pe:20.500.12390/672 |
| network_acronym_str |
CONC |
| network_name_str |
CONCYTEC-Institucional |
| repository_id_str |
4689 |
| dc.title.none.fl_str_mv |
Language identification with scarce data: A case study from Peru |
| title |
Language identification with scarce data: A case study from Peru |
| author |
Espichán-Linares A. |
| author_facet |
Espichán-Linares A.; Oncevay-Marcos A. |
| author_role |
author |
| author2 |
Oncevay-Marcos A. |
| author2_role |
author |
| dc.contributor.author.fl_str_mv |
Espichán-Linares A.; Oncevay-Marcos A. |
| dc.subject.none.fl_str_mv |
The standard model |
| topic |
The standard model; Deep learning; Information management; Linguistics; Natural language processing systems; Best fit; Complex task; Corpus-based methods; Language identification; Learning approach; Multiple class; State of the art; Big data; https://purl.org/pe-repo/ocde/ford#2.00.00 |
| dc.subject.es_PE.fl_str_mv |
Deep learning; Information management; Linguistics; Natural language processing systems; Best fit; Complex task; Corpus-based methods; Language identification; Learning approach; Multiple class; State of the art; Big data |
| dc.subject.ocde.none.fl_str_mv |
https://purl.org/pe-repo/ocde/ford#2.00.00 |
| description |
Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and multiple classes, analyzing which of the most well-known methods is the best fit. In this respect, Peru offers a great challenge as a multicultural and multilingual country. Therefore, this study focuses on three steps: (1) to build from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) to fit both standard and deep learning approaches for language identification, and (3) to statistically compare the results obtained. The standard model outperforms the deep learning one, as expected, with 95.9% average precision, and both corpus and model will be advantageous inputs for more complex tasks in the future. |
| publishDate |
2018 |
| dc.date.accessioned.none.fl_str_mv |
2024-05-30T23:13:38Z |
| dc.date.available.none.fl_str_mv |
2024-05-30T23:13:38Z |
| dc.date.issued.fl_str_mv |
2018 |
| dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject |
| format |
conferenceObject |
| dc.identifier.isbn.none.fl_str_mv |
urn:isbn:9783319905952 |
| dc.identifier.uri.none.fl_str_mv |
https://hdl.handle.net/20.500.12390/672 |
| dc.identifier.doi.none.fl_str_mv |
https://doi.org/10.1007/978-3-319-90596-9_7 |
| dc.identifier.scopus.none.fl_str_mv |
2-s2.0-85045991573 |
| identifier_str_mv |
urn:isbn:9783319905952 2-s2.0-85045991573 |
| url |
https://hdl.handle.net/20.500.12390/672 https://doi.org/10.1007/978-3-319-90596-9_7 |
| dc.language.iso.none.fl_str_mv |
eng |
| language |
eng |
| dc.relation.ispartof.none.fl_str_mv |
Communications in Computer and Information Science |
| dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.publisher.none.fl_str_mv |
Springer Verlag |
| publisher.none.fl_str_mv |
Springer Verlag |
| dc.source.none.fl_str_mv |
reponame:CONCYTEC-Institucional instname:Consejo Nacional de Ciencia Tecnología e Innovación instacron:CONCYTEC |
| instname_str |
Consejo Nacional de Ciencia Tecnología e Innovación |
| instacron_str |
CONCYTEC |
| institution |
CONCYTEC |
| reponame_str |
CONCYTEC-Institucional |
| collection |
CONCYTEC-Institucional |
| repository.name.fl_str_mv |
Repositorio Institucional CONCYTEC |
| repository.mail.fl_str_mv |
repositorio@concytec.gob.pe |
| _version_ |
1844883129740820480 |
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).