No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Bustamante G.; Oncevay A.; Zariquiey R.

Article Description

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario...

Bibliographic Details
Authors:	Bustamante G., Oncevay A., Zariquiey R.
Format:	article
Publication Date:	2020
Institution:	Consejo Nacional de Ciencia Tecnología e Innovación
Repository:	CONCYTEC-Institucional
Language:	English
OAI Identifier:	oai:repositorio.concytec.gob.pe:20.500.12390/2648
Resource link:	https://hdl.handle.net/20.500.12390/2648
Access level:	open access
Subject:	Yine; Ashaninka; Corpus creation; Endangered languages; Indigenous languages; Low-resource languages; Monolingual corpus; Pdf processing; Shipibo-Konibo; Yanesha; https://purl.org/pe-repo/ocde/ford#6.02.02

id	CONC_645795c16f5082ecddb84743e00fe086
oai_identifier_str	oai:repositorio.concytec.gob.pe:20.500.12390/2648
network_acronym_str	CONC
network_name_str	CONCYTEC-Institucional
repository_id_str	4689
dc.title.none.fl_str_mv	No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
author	Bustamante G.
author_facet	Bustamante G. Oncevay A. Zariquiey R.
author_role	author
author2	Oncevay A. Zariquiey R.
author2_role	author author
dc.contributor.author.fl_str_mv	Bustamante G. Oncevay A. Zariquiey R.
dc.subject.none.fl_str_mv	Yine
topic	Yine Ashaninka Corpus creation Endangered languages Indigenous languages Low-resource languages Monolingual corpus Pdf processing Shipibo-Konibo Yanesha https://purl.org/pe-repo/ocde/ford#6.02.02
dc.subject.es_PE.fl_str_mv	Ashaninka Corpus creation Endangered languages Indigenous languages Low-resource languages Monolingual corpus Pdf processing Shipibo-Konibo Yanesha
dc.subject.ocde.none.fl_str_mv	https://purl.org/pe-repo/ocde/ford#6.02.02
description	We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
publishDate	2020
dc.date.accessioned.none.fl_str_mv	2024-05-30T23:13:38Z
dc.date.available.none.fl_str_mv	2024-05-30T23:13:38Z
dc.date.issued.fl_str_mv	2020
dc.type.none.fl_str_mv	info:eu-repo/semantics/article
format	article
dc.identifier.uri.none.fl_str_mv	https://hdl.handle.net/20.500.12390/2648
dc.identifier.scopus.none.fl_str_mv	2-s2.0-85096526337
url	https://hdl.handle.net/20.500.12390/2648
identifier_str_mv	2-s2.0-85096526337
dc.language.iso.none.fl_str_mv	eng
language	eng
dc.relation.ispartof.none.fl_str_mv	LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	European Language Resources Association (ELRA)
publisher.none.fl_str_mv	European Language Resources Association (ELRA)
dc.source.none.fl_str_mv	reponame:CONCYTEC-Institucional instname:Consejo Nacional de Ciencia Tecnología e Innovación instacron:CONCYTEC
instname_str	Consejo Nacional de Ciencia Tecnología e Innovación
instacron_str	CONCYTEC
institution	CONCYTEC
reponame_str	CONCYTEC-Institucional
collection	CONCYTEC-Institucional
bitstream.url.fl_str_mv	https://repositorio.concytec.gob.pe/bitstreams/e1a4f53f-6400-43aa-b055-031e5aeec114/download https://repositorio.concytec.gob.pe/bitstreams/73a3ccb8-fb77-4bd5-90e0-4f8336d25b11/download https://repositorio.concytec.gob.pe/bitstreams/cf1f3938-3589-4aa2-883f-9e87d9003943/download
bitstream.checksum.fl_str_mv	8adb66ddaa97a52917dab478ec1ab9e4 ba84e7545d6a560b23f7e759ec01a984 a6e53cfda3d7b91a660f70ba863ed180
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositorio Institucional CONCYTEC
repository.mail.fl_str_mv	repositorio@concytec.gob.pe
_version_	1853772967327039488
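
The `description` field above says the extracted corpora were evaluated with language modelling and character-level perplexity on manually extracted sentences. This record does not say which language model was used, so the following is only a minimal sketch of the general idea, assuming a character n-gram model with add-k smoothing. The function names `train_char_ngram` and `char_perplexity`, the order `n=3`, and the smoothing constant `k` are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

START, END = "\x02", "\x03"  # hypothetical padding and end-of-sentence markers


def train_char_ngram(sentences, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        padded = START * (n - 1) + s + END
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            ctx = padded[i - n + 1:i]
            ngrams[ctx + padded[i]] += 1
            contexts[ctx] += 1
    return ngrams, contexts, vocab


def char_perplexity(model, sentences, n=3, k=1.0):
    """Character-level perplexity of `sentences` under an add-k smoothed model."""
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen characters
    log_prob, count = 0.0, 0
    for s in sentences:
        padded = START * (n - 1) + s + END
        for i in range(n - 1, len(padded)):
            ctx = padded[i - n + 1:i]
            p = (ngrams[ctx + padded[i]] + k) / (contexts[ctx] + k * v)
            log_prob += math.log(p)
            count += 1
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / count)
```

Under this sketch, a model trained on an automatically extracted corpus would be scored on the held-out, manually extracted sentences with `char_perplexity(model, held_out, n=3)`; lower perplexity suggests the extracted text resembles the clean reference text more closely.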
