No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
Article Description
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario...
Authors: | Bustamante G., Oncevay A., Zariquiey R. |
---|---|
Format: | article |
Publication date: | 2020 |
Institution: | Consejo Nacional de Ciencia Tecnología e Innovación |
Repository: | CONCYTEC-Institucional |
Language: | English |
OAI Identifier: | oai:repositorio.concytec.gob.pe:20.500.12390/2648 |
Resource link: | https://hdl.handle.net/20.500.12390/2648 |
Access level: | open access |
Subject: | Yine; Ashaninka; Corpus creation; Endangered languages; Indigenous languages; Low-resource languages; Monolingual corpus; Pdf processing; Shipibo-Konibo; Yanesha; https://purl.org/pe-repo/ocde/ford#6.02.02 |
id |
CONC_645795c16f5082ecddb84743e00fe086 |
---|---|
oai_identifier_str |
oai:repositorio.concytec.gob.pe:20.500.12390/2648 |
network_acronym_str |
CONC |
network_name_str |
CONCYTEC-Institucional |
repository_id_str |
4689 |
dc.title.none.fl_str_mv |
No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru |
author |
Bustamante G. |
author_facet |
Bustamante G.; Oncevay A.; Zariquiey R. |
author_role |
author |
author2 |
Oncevay A.; Zariquiey R. |
author2_role |
author author |
dc.contributor.author.fl_str_mv |
Bustamante G.; Oncevay A.; Zariquiey R. |
dc.subject.none.fl_str_mv |
Yine |
topic |
Yine; Ashaninka; Corpus creation; Endangered languages; Indigenous languages; Low-resource languages; Monolingual corpus; Pdf processing; Shipibo-Konibo; Yanesha; https://purl.org/pe-repo/ocde/ford#6.02.02 |
dc.subject.es_PE.fl_str_mv |
Ashaninka; Corpus creation; Endangered languages; Indigenous languages; Low-resource languages; Monolingual corpus; Pdf processing; Shipibo-Konibo; Yanesha |
dc.subject.ocde.none.fl_str_mv |
https://purl.org/pe-repo/ocde/ford#6.02.02 |
description |
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays. |
publishDate |
2020 |
dc.date.accessioned.none.fl_str_mv |
2024-05-30T23:13:38Z |
dc.date.available.none.fl_str_mv |
2024-05-30T23:13:38Z |
dc.date.issued.fl_str_mv |
2020 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
dc.identifier.uri.none.fl_str_mv |
https://hdl.handle.net/20.500.12390/2648 |
dc.identifier.scopus.none.fl_str_mv |
2-s2.0-85096526337 |
url |
https://hdl.handle.net/20.500.12390/2648 |
identifier_str_mv |
2-s2.0-85096526337 |
dc.language.iso.none.fl_str_mv |
eng |
language |
eng |
dc.relation.ispartof.none.fl_str_mv |
LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
European Language Resources Association (ELRA) |
publisher.none.fl_str_mv |
European Language Resources Association (ELRA) |
dc.source.none.fl_str_mv |
reponame:CONCYTEC-Institucional instname:Consejo Nacional de Ciencia Tecnología e Innovación instacron:CONCYTEC |
instname_str |
Consejo Nacional de Ciencia Tecnología e Innovación |
instacron_str |
CONCYTEC |
institution |
CONCYTEC |
reponame_str |
CONCYTEC-Institucional |
collection |
CONCYTEC-Institucional |
bitstream.url.fl_str_mv |
https://repositorio.concytec.gob.pe/bitstreams/e1a4f53f-6400-43aa-b055-031e5aeec114/download https://repositorio.concytec.gob.pe/bitstreams/73a3ccb8-fb77-4bd5-90e0-4f8336d25b11/download https://repositorio.concytec.gob.pe/bitstreams/cf1f3938-3589-4aa2-883f-9e87d9003943/download |
bitstream.checksum.fl_str_mv |
8adb66ddaa97a52917dab478ec1ab9e4 ba84e7545d6a560b23f7e759ec01a984 a6e53cfda3d7b91a660f70ba863ed180 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositorio Institucional CONCYTEC |
repository.mail.fl_str_mv |
repositorio@concytec.gob.pe |
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository holding this document or dataset. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).