Sparkmach: A Distributed Data Processing System Based on Automated Machine Learning for Big Data

Article Description

Bibliographic Details
Authors: Bravo-Rocca, Gusseppe; Torres-Robatty, Piero; Fiestas-Iquira, Jose
Format: book chapter
Publication Date: 2019
Institution: Consejo Nacional de Ciencia Tecnología e Innovación
Repository: CONCYTEC-Institucional
Language: English
OAI Identifier: oai:repositorio.concytec.gob.pe:20.500.12390/1325
Resource link: https://hdl.handle.net/20.500.12390/1325
https://doi.org/10.1007/978-3-030-11680-4_13
Access level: open access
Subject: Statistics
Semi-automated machine learning
Data Science
Data mining
Data engineering
Big data
https://purl.org/pe-repo/ocde/ford#5.08.02
Description
Summary: This work proposes a semi-automated analysis and modeling package for Machine Learning problems. The library's goal is to reduce the steps involved in a traditional data science roadmap. To do so, Sparkmach uses Machine Learning techniques to build base models for both classification and regression problems. Building these models involves exploratory data analysis, data preprocessing, feature engineering, and modeling. The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to big datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Although it is designed for distributed computing, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.
Important note:
The information contained in this record is the sole responsibility of the institution that manages the institutional repository where this document or dataset is held. CONCYTEC is not responsible for the contents (publications and/or data) accessible through the Repositorio Nacional Digital de Ciencia, Tecnología e Innovación de Acceso Abierto (ALICIA).