Search Results by Author

article

No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Published by
Bustamante G., Oncevay A., Zariquiey R.

Published 2020

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation includes language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for current natural language processing tasks.

conference object

ChAnot: An intelligent annotation tool for indigenous and highly agglutinative languages in Peru

Published by
Mercado-Gonzales R., Pereira-Noriega J., Sobrevilla M., Oncevay A.

Published 2019

Linguistic corpus annotation is one of the most important phases in solving Natural Language Processing (NLP) tasks, as these tasks rely heavily on corpus-based techniques. However, metadata annotation is a highly laborious manual task. A supportive alternative is the use of computational tools, which are likely to simplify some of these operations while remaining adjustable to the needs of particular language features. This paper therefore presents ChAnot, a web-based annotation tool developed for Peruvian indigenous and highly agglutinative languages, with Shipibo-Konibo as the case study. The tool supports a diverse set of linguistic annotation tasks, such as word segmentation and POS-tag markup, among others. It also includes a suggestion engine based on historical and machine learning models, and a set of statistics about pr...