No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Article Description

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario...

Bibliographic Details
Authors: Bustamante G., Oncevay A., Zariquiey R.
Format: article
Publication Date: 2020
Institution: Consejo Nacional de Ciencia Tecnología e Innovación
Repository: CONCYTEC-Institucional
Language: English
OAI Identifier: oai:repositorio.concytec.gob.pe:20.500.12390/2648
Resource link: https://hdl.handle.net/20.500.12390/2648
Access level: open access
Subject: Yine
Ashaninka
Corpus creation
Endangered languages
Indigenous languages
Low-resource languages
Monolingual corpus
Pdf processing
Shipibo-Konibo
Yanesha
https://purl.org/pe-repo/ocde/ford#6.02.02
id CONC_645795c16f5082ecddb84743e00fe086
oai_identifier_str oai:repositorio.concytec.gob.pe:20.500.12390/2648
network_acronym_str CONC
network_name_str CONCYTEC-Institucional
repository_id_str 4689
dc.title.none.fl_str_mv No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.contributor.author.fl_str_mv Bustamante G.
Oncevay A.
Zariquiey R.
description We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
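The description above mentions an evaluation based on character-level perplexity over manually extracted sentences. This record does not say which language model the authors used, so the following is only a minimal sketch of the general idea, assuming a character bigram model with add-one smoothing. The function names `train_char_bigram` and `char_perplexity` and the toy strings are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def train_char_bigram(sentences):
    """Count character bigrams and their left contexts (assumed model, not the paper's)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def char_perplexity(model, sentences):
    """Per-character perplexity of held-out sentences under the bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra class stands in for all unseen characters
    log_prob, n = 0.0, 0
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            # Add-one smoothing keeps unseen bigrams from getting zero probability.
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("no sentences to evaluate")
    return math.exp(-log_prob / n)

# Toy strings only; these are not real sentences from any of the four languages.
model = train_char_bigram(["ana ana", "nana"])
clean = char_perplexity(model, ["ana"])
noisy = char_perplexity(model, ["zzqx"])
print(clean < noisy)
```

Under this assumed setup, text that resembles the training corpus scores a lower perplexity than unrelated character noise, which is the sense in which perplexity can act as a proxy for how clean an extracted corpus is.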
publishDate 2020
dc.date.accessioned.none.fl_str_mv 2024-05-30T23:13:38Z
dc.date.available.none.fl_str_mv 2024-05-30T23:13:38Z
dc.date.issued.fl_str_mv 2020
dc.type.none.fl_str_mv info:eu-repo/semantics/article
format article
dc.identifier.uri.none.fl_str_mv https://hdl.handle.net/20.500.12390/2648
dc.identifier.scopus.none.fl_str_mv 2-s2.0-85096526337
dc.language.iso.none.fl_str_mv eng
language eng
dc.relation.ispartof.none.fl_str_mv LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv European Language Resources Association (ELRA)
dc.source.none.fl_str_mv reponame:CONCYTEC-Institucional
instname:Consejo Nacional de Ciencia Tecnología e Innovación
instacron:CONCYTEC
bitstream.url.fl_str_mv https://repositorio.concytec.gob.pe/bitstreams/e1a4f53f-6400-43aa-b055-031e5aeec114/download
https://repositorio.concytec.gob.pe/bitstreams/73a3ccb8-fb77-4bd5-90e0-4f8336d25b11/download
https://repositorio.concytec.gob.pe/bitstreams/cf1f3938-3589-4aa2-883f-9e87d9003943/download
bitstream.checksum.fl_str_mv 8adb66ddaa97a52917dab478ec1ab9e4
ba84e7545d6a560b23f7e759ec01a984
a6e53cfda3d7b91a660f70ba863ed180
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositorio Institucional CONCYTEC
repository.mail.fl_str_mv repositorio@concytec.gob.pe
_version_ 1839175704705499136
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or data set is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).