No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
Article description
        We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario...
              
            
    
| Authors: | Bustamante G., Oncevay A., Zariquiey R. |
|---|---|
| Format: | article |
| Publication date: | 2020 |
| Institution: | Consejo Nacional de Ciencia Tecnología e Innovación |
| Repository: | CONCYTEC-Institucional |
| Language: | English |
| OAI Identifier: | oai:repositorio.concytec.gob.pe:20.500.12390/2648 |
| Resource link: | https://hdl.handle.net/20.500.12390/2648 |
| Access level: | open access |
| Subject: | Yine; Ashaninka; Corpus creation; Endangered languages; Indigenous languages; Low-resource languages; Monolingual corpus; Pdf processing; Shipibo-Konibo; Yanesha; https://purl.org/pe-repo/ocde/ford#6.02.02 |

| id | CONC_645795c16f5082ecddb84743e00fe086 | 
|---|---|
| oai_identifier_str | oai:repositorio.concytec.gob.pe:20.500.12390/2648 | 
| network_acronym_str | CONC | 
| network_name_str | CONCYTEC-Institucional | 
| repository_id_str | 4689 | 
| dc.title.none.fl_str_mv | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| title | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| spellingShingle | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru Bustamante G. Yine Ashaninka Corpus creation Endangered languages Indigenous languages Low-resource languages Monolingual corpus Pdf processing Shipibo-Konibo Yanesha https://purl.org/pe-repo/ocde/ford#6.02.02 | 
| title_short | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| title_full | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| title_fullStr | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| title_full_unstemmed | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| title_sort | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | 
| author | Bustamante G. | 
| author_facet | Bustamante G. Oncevay A. Zariquiey R. | 
| author_role | author | 
| author2 | Oncevay A. Zariquiey R. | 
| author2_role | author author | 
| dc.contributor.author.fl_str_mv | Bustamante G. Oncevay A. Zariquiey R. | 
| dc.subject.none.fl_str_mv | Yine | 
| topic | Yine Ashaninka Corpus creation Endangered languages Indigenous languages Low-resource languages Monolingual corpus Pdf processing Shipibo-Konibo Yanesha https://purl.org/pe-repo/ocde/ford#6.02.02 | 
| dc.subject.es_PE.fl_str_mv | Ashaninka Corpus creation Endangered languages Indigenous languages Low-resource languages Monolingual corpus Pdf processing Shipibo-Konibo Yanesha | 
| dc.subject.ocde.none.fl_str_mv | https://purl.org/pe-repo/ocde/ford#6.02.02 | 
| description | We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays. | 
| publishDate | 2020 | 
| dc.date.accessioned.none.fl_str_mv | 2024-05-30T23:13:38Z | 
| dc.date.available.none.fl_str_mv | 2024-05-30T23:13:38Z | 
| dc.date.issued.fl_str_mv | 2020 | 
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article | 
| format | article | 
| dc.identifier.uri.none.fl_str_mv | https://hdl.handle.net/20.500.12390/2648 | 
| dc.identifier.scopus.none.fl_str_mv | 2-s2.0-85096526337 | 
| url | https://hdl.handle.net/20.500.12390/2648 | 
| identifier_str_mv | 2-s2.0-85096526337 | 
| dc.language.iso.none.fl_str_mv | eng | 
| language | eng | 
| dc.relation.ispartof.none.fl_str_mv | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings | 
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess | 
| eu_rights_str_mv | openAccess | 
| dc.publisher.none.fl_str_mv | European Language Resources Association (ELRA) | 
| publisher.none.fl_str_mv | European Language Resources Association (ELRA) | 
| dc.source.none.fl_str_mv | reponame:CONCYTEC-Institucional instname:Consejo Nacional de Ciencia Tecnología e Innovación instacron:CONCYTEC | 
| instname_str | Consejo Nacional de Ciencia Tecnología e Innovación | 
| instacron_str | CONCYTEC | 
| institution | CONCYTEC | 
| reponame_str | CONCYTEC-Institucional | 
| collection | CONCYTEC-Institucional | 
| bitstream.url.fl_str_mv | https://repositorio.concytec.gob.pe/bitstreams/e1a4f53f-6400-43aa-b055-031e5aeec114/download https://repositorio.concytec.gob.pe/bitstreams/73a3ccb8-fb77-4bd5-90e0-4f8336d25b11/download https://repositorio.concytec.gob.pe/bitstreams/cf1f3938-3589-4aa2-883f-9e87d9003943/download | 
| bitstream.checksum.fl_str_mv | 8adb66ddaa97a52917dab478ec1ab9e4 ba84e7545d6a560b23f7e759ec01a984 a6e53cfda3d7b91a660f70ba863ed180 | 
| bitstream.checksumAlgorithm.fl_str_mv | MD5 MD5 MD5 | 
| repository.name.fl_str_mv | Repositorio Institucional CONCYTEC | 
| repository.mail.fl_str_mv | repositorio@concytec.gob.pe | 
| _version_ | 1844883017348153344 | 
| score | 13.421253 | 
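
The description field above mentions an evaluation based on language modelling and character-level perplexity. This record does not say which language model the authors used, so the following is only a minimal sketch of the general idea, assuming a character bigram model with add-k smoothing. The function names `train_char_bigram` and `char_perplexity` and the toy sentences are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

# Boundary markers (assumption: these control characters never occur in the text).
START, END = "\x02", "\x03"


def train_char_bigram(sentences):
    """Count character bigrams and their left contexts over the training sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        chars = [START] + list(sentence) + [END]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def char_perplexity(sentences, model, k=1.0):
    """Character-level perplexity of `sentences` under an add-k smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot reserved for unseen characters
    log_prob = 0.0
    n_predictions = 0
    for sentence in sentences:
        chars = [START] + list(sentence) + [END]
        for prev, cur in zip(chars, chars[1:]):
            numerator = bigram_counts[(prev, cur)] + k
            denominator = context_counts[prev] + k * vocab_size
            log_prob += math.log(numerator / denominator)
            n_predictions += 1
    if n_predictions == 0:
        raise ValueError("need at least one sentence to compute perplexity")
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_predictions)


if __name__ == "__main__":
    # Toy data, purely illustrative (not from the corpora described in this record).
    model = train_char_bigram(["aaa", "aab"])
    print(char_perplexity(["aaa"], model))
    print(char_perplexity(["zzz"], model))
```

Under this setup, a model trained on automatically extracted text would be scored on manually extracted sentences of the same language. A lower perplexity would suggest that the extracted corpus resembles clean text, although the thresholds and the model family are choices this sketch does not settle.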
 Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the content (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).
 
   
   
             
            