Search Results by Author

1

Conference object

A Low-Resourced Peruvian Language Identification Model

Published by
Linares A.E., Oncevay-Marcos A.

Published 2017

Due to the linguistic revitalization in Peru over the last few years, there is growing interest in reinforcing bilingual education in the country and in increasing research focused on its native languages. From the computer science perspective, one of the first steps in supporting the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) building a digital, annotated corpus for 16 Peruvian native languages extracted from documents in web repositories, and (2) fitting a supervised learning model for the language identification task using features identified from related studies in the state of the art, such as n-grams. The results obtained were promising (97% average precision), and the corpus and the model are expected to be useful for more complex task...
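
The n-gram features mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' model: it assumes a simple nearest-profile classifier over character trigrams, and the function names (`char_ngrams`, `train`, `identify`) and the Spanish/English toy sentences are placeholders invented for illustration, not data from the corpus.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples, n=3):
    """Aggregate one n-gram profile per language label."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def identify(text, profiles, n=3):
    """Return the label whose profile is most similar to the input text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy placeholder data (assumption): two well-known languages, not the 16 from the paper.
TOY_SAMPLES = [
    ("es", "el perro corre por la calle y la casa es grande"),
    ("es", "los niños juegan en el parque de la ciudad"),
    ("en", "the dog runs down the street and the house is big"),
    ("en", "the children are playing in the park of the city"),
]

profiles = train(TOY_SAMPLES)
print(identify("la casa de los niños", profiles))
print(identify("the house of the children", profiles))
```

A supervised learner trained on the same n-gram counts would be a closer match to what the abstract describes; the profile comparison here only shows how character n-grams separate languages.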

2

Conference object

Language identification with scarce data: A case study from Peru

Published by
Espichán-Linares A., Oncevay-Marcos A.

Published 2018

Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and many classes, analyzing which of the best-known methods is the best fit. Peru, as a multicultural and multilingual country, offers a great challenge in this respect. Therefore, this study focuses on three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both corpus and model will be advantageous inputs for more complex ta...

3

Conference object

Corpus creation and initial SMT experiments between Spanish and Shipibo-Konibo

Published by
Galarreta A.-P., Melgar A., Oncevay-Marcos A.

Published 2017

In this paper, we present the first attempts to develop a machine translation (MT) system between Spanish and Shipibo-Konibo (es-shp).

4

Conference object

WordNet-SHP: Towards the building of a lexical database for a Peruvian minority language

Published by
Maguiño-Valencia D., Oncevay-Marcos A., Sobrevilla Cabezudo M.A.

Published 2019

WordNet-like resources are lexical databases with highly relevant information and data that can be exploited in more complex computational linguistics research and applications. The building process requires manual and automatic tasks, which can be more arduous if the language is a minority one with fewer digital resources. This study focuses on the construction of an initial WordNet database for a low-resourced indigenous language in Peru: Shipibo-Konibo (shp). First, the stages of development from a scarce scenario (a bilingual shp-es dictionary) are described. Then, a synset alignment method is proposed that compares the definition glosses in the dictionary (written in Spanish) with the content of a Spanish WordNet. Here, word2vec similarity was the metric chosen to measure proximity. Finally, an evaluation process is performed for the synsets, using a manually...
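
The gloss comparison described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: it assumes each gloss is represented by the average of its word vectors and compared by cosine similarity, and the three-dimensional vectors, synset identifiers, and function names are all hypothetical stand-ins for trained word2vec embeddings and real Spanish WordNet entries.

```python
import math

# Hypothetical toy vectors (assumption): stand-ins for trained word2vec embeddings.
TOY_VECTORS = {
    "animal": [0.9, 0.1, 0.0],
    "doméstico": [0.7, 0.3, 0.1],
    "ladra": [0.8, 0.0, 0.2],
    "planta": [0.0, 0.9, 0.1],
    "verde": [0.1, 0.8, 0.2],
    "crece": [0.1, 0.7, 0.3],
}


def mean_vector(gloss):
    """Average the vectors of the known words in a gloss; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in gloss.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def align(dictionary_gloss, candidate_synsets):
    """Return (synset_id, score) for the candidate gloss closest to the dictionary gloss."""
    query = mean_vector(dictionary_gloss)
    if query is None:
        return None
    scored = []
    for synset_id, gloss in candidate_synsets.items():
        vector = mean_vector(gloss)
        if vector is not None:
            scored.append((synset_id, cosine(query, vector)))
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Hypothetical candidate synsets with Spanish glosses (identifiers are made up).
CANDIDATES = {
    "synset_dog": "animal doméstico que ladra",
    "synset_plant": "planta verde que crece",
}

best = align("animal que ladra", CANDIDATES)
print(best)
```

Words missing from the vector table (such as "que" here) are simply skipped, which is one common way to handle out-of-vocabulary tokens; the abstract does not say how the original work treated them.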

5

Conference object

Ship-lemmatagger: Building an NLP toolkit for a Peruvian native language

Published by
Pereira-Noriega J., Mercado-Gonzales R., Melgar A., Sobrevilla-Cabezudo M., Oncevay-Marcos A.

Published 2017

Natural Language Processing deals with the understanding and generation of texts through computer programs. Many different functionalities are used in this area, but some of them serve as the foundation for the rest. These methods are related to the core processing of the morphology of the language (such as lemmatization) and the automatic identification of part-of-speech tags. This paper therefore describes the implementation of a basic NLP toolkit for a new language, focusing on the features mentioned above and testing them on a corpus built for this purpose. The results obtained exceeded expectations and could be used for more complex tasks such as machine translation.