Introduction to Spark MLlib for Big Data and Machine Learning


This post was released as part of the Data Science Blogathon.


Introduction

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets and can also distribute data processing tasks across multiple computers, either on its own or together with other distributed computing tools. It is an ultra-fast unified analytics engine for big data and machine learning.

To support Python with Spark, the Apache Spark community released a tool called PySpark. With PySpark, you can work with RDDs in the Python programming language.

Spark's components are:

  1. Spark Core
  2. Spark SQL
  3. Spark Streaming
  4. Spark MLlib
  5. GraphX
  6. Spark R

Spark Core

All the functionality provided by Apache Spark is built on top of Spark Core. It manages the essential I/O functionality and is used for task dispatching and fault recovery. Spark Core is built around a special collection called the RDD (Resilient Distributed Dataset). The RDD is one of Spark's core abstractions: Spark RDDs handle partitioning the data across all the nodes of a cluster and hold it in the cluster's memory pool as a single unit. Two kinds of operations are performed on RDDs:

Transformation: a function that produces a new RDD from an existing RDD.

Action: transformations only build new RDDs from one another; when we want to compute a result over the actual dataset, we use an action.
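As a minimal sketch of the difference (assuming an active SparkSession named spark): map only describes a new RDD, while collect and count actually trigger the computation.

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    squared = rdd.map(lambda x: x * x)   # transformation: lazily defines a new RDD
    print(squared.collect())             # action: runs the job and returns [1, 4, 9, 16, 25]
    print(squared.count())               # action: runs the job and returns 5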

Spark SQL

The Spark SQL component is a distributed framework for processing structured data. Spark SQL lets you work with structured and semi-structured data and enables powerful, interactive analytical applications on both streaming and historical data. DataFrames and SQL provide a common way to access a range of data sources. Its main features are a cost-based optimizer and mid-query fault tolerance.
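As an illustrative sketch, the same query can be expressed as SQL text or through the DataFrame API (the data and app name below are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('sqlsample').getOrCreate()
    people = spark.createDataFrame([("Ana", 34), ("Luis", 28)], ["name", "age"])

    people.createOrReplaceTempView("people")               # expose the DataFrame to SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    people.filter(people.age > 30).select("name").show()   # same query via the DataFrame API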

Spark Streaming

It is an add-on to the Spark core API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming bundles the live data into small batches, which are then delivered to the batch engine for processing. It also provides fault-tolerance guarantees.
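A minimal word-count sketch using the DStream API (it assumes an existing SparkContext named sc and a hypothetical text source on localhost:9999):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 5)                    # 5-second micro-batches
    lines = ssc.socketTextStream("localhost", 9999)

    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                  # print each micro-batch's word counts

    ssc.start()
    ssc.awaitTermination()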

Spark GraphX:

GraphX on Spark is an API for graphs and graph-parallel computation. It is a graph analysis engine and graph data store. On graphs it is also possible to cluster, categorize, traverse, search and find paths.

SparkR:

SparkR provides a distributed data frame implementation. It supports operations such as selection, filtering and aggregation, but on large data sets.

Spark MLlib:

Spark MLlib is used to perform machine learning on Apache Spark. MLlib consists of popular algorithms and utilities. MLlib in Spark is a scalable machine learning library that delivers both high-quality algorithms and high speed. It includes machine learning algorithms such as regression, classification, clustering, pattern mining and collaborative filtering. Lower-level machine learning primitives, such as the generic gradient descent optimization algorithm, are also present in MLlib.

spark.ml is the primary machine learning API for Spark. The spark.ml library offers a high-level API built on top of DataFrames for constructing ML pipelines.

Spark MLlib tools are provided below:

  1. ML Algorithms
  2. Featurization
  3. Pipelines
  4. Persistence
  5. Utilities
  1. ML Algorithms

    ML algorithms form the core of MLlib. These include common learning algorithms such as classification, regression, clustering and collaborative filtering.

    MLlib standardizes APIs to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concept is the Pipelines API, where the pipeline idea is inspired by the scikit-learn project.

    Transformer:

    A Transformer is an algorithm that can transform one DataFrame into another DataFrame. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns. For example:

    A feature transformer can take a DataFrame, read a column (for example, text), map it to a new column (for example, feature vectors) and output a new DataFrame with the mapped column appended.

    A learning model can take a DataFrame, read the column containing the feature vectors, predict the label for each feature vector and output a new DataFrame with the predicted labels appended as a column.
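    As a minimal sketch of a feature Transformer (assuming an active SparkSession named spark; the data below is made up for illustration), Tokenizer's transform() appends a new words column:

    from pyspark.ml.feature import Tokenizer

    sentences = spark.createDataFrame(
        [(0, "spark mllib makes pipelines easy"), (1, "transformers append new columns")],
        ["id", "text"])

    # transform() returns a new DataFrame with the extra "words" column
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tokenizer.transform(sentences).show(truncate=False)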

    Estimator:

    An Estimator is an algorithm that can be fitted on a DataFrame to produce a Transformer. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and therefore a Transformer.

    Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

    Each instance of a Transformer or Estimator has a unique ID, which is useful for specifying parameters (described below).
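    A minimal sketch of the Estimator/Transformer relationship, using a toy DataFrame made up for illustration (it assumes an active SparkSession named spark):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    training = spark.createDataFrame(
        [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=10)      # Estimator
    model = lr.fit(training)                 # fit() returns a LogisticRegressionModel, a Transformer
    model.transform(training).select("label", "prediction").show()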

  2. Featurization

    Featurization includes feature extraction, feature transformation, dimensionality reduction and feature selection (a short sketch follows the list below).

    1. Feature extraction is about extracting features from raw data.
    2. Feature transformation includes scaling, normalizing or modifying features.
    3. Feature selection involves choosing a subset of essential features from a larger feature set.
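    An illustrative sketch of extraction and transformation (assuming an active SparkSession named spark; the toy data is made up):

    from pyspark.ml.feature import Tokenizer, HashingTF

    docs = spark.createDataFrame(
        [(0, "big data with spark"), (1, "machine learning with mllib")],
        ["id", "text"])

    # transformation: split the raw text into a "words" column
    words = Tokenizer(inputCol="text", outputCol="words").transform(docs)

    # extraction: hash the words into a fixed-size feature vector
    features = HashingTF(inputCol="words", outputCol="features", numFeatures=32).transform(words)
    features.select("id", "features").show(truncate=False)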

  3. Pipelines:

    A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. MLlib also provides tools to build, inspect and tune ML pipelines.

    In machine learning, it is common to run a sequence of algorithms to process and learn from the data. MLlib represents such a workflow as a Pipeline, which is a sequence of PipelineStages (Transformers and Estimators) to be executed in a specific order. We will use this simple workflow as a running example in this section.

    Example: the sample pipeline shown below preprocesses the data in the following order:

    1. Apply a StringIndexer to find the index of each categorical column

    2. Apply one-hot encoding to the categorical columns

    3. Apply a StringIndexer to the output variable to create the "label" column

    4. Apply a VectorAssembler to both the categorical and the numeric columns. VectorAssembler is a transformer that combines a given list of columns into a single vector column.

    The pipeline will run these data-preparation steps in the specific order given above.

    # Assumes `df` has already been loaded, as shown in the DataFrame section below
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
    # Note: in Spark 3.x, OneHotEncoderEstimator was renamed to OneHotEncoder
    categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan']
    stages = []
    for categoricalCol in categoricalColumns:
        # index each categorical column, then one-hot encode the resulting index
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Indexer')
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "Vec"])
        stages += [stringIndexer, encoder]
    # index the output variable 'deposit' into the 'label' column
    label_stringIdx = StringIndexer(inputCol="deposit", outputCol="label")
    stages += [label_stringIdx]
    numericColumns = ['age', 'balance', 'duration']
    # assemble the encoded categorical columns and the numeric columns into one feature vector
    assemblerInputs = [c + "Vec" for c in categoricalColumns] + numericColumns
    Vassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [Vassembler]
    from pyspark.ml import Pipeline
    pipeline = Pipeline(stages=stages)
    cols = df.columns  # remember the original column names so they can be re-selected below
    pipelineModel = pipeline.fit(df)
    df = pipelineModel.transform(df)
    selectedCols = ['label', 'features'] + cols
    df = df.select(selectedCols)

    DataFrames

    DataFrames provide an easier-to-use API than RDDs. The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames facilitate practical machine learning pipelines, in particular feature transformations.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('mlearnsample').getOrCreate()
    # load the bank marketing dataset and let Spark infer the column types
    df = spark.read.csv('loan_bank.csv', header = True, inferSchema = True)
    df.printSchema()
  4. Persistence:

    Persistence helps with saving and loading algorithms, models and pipelines. This saves time and effort, since a persisted model can be loaded and reused at any time when needed.

    from pyspark.ml.classification import LogisticRegression

    # hypothetical split of the prepared DataFrame into training and test sets
    train, test = df.randomSplit([0.7, 0.3], seed=42)

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    lrModel = lr.fit(train)

    predictions = lrModel.transform(test)
    predictions.select('age', 'label', 'rawPrediction', 'prediction').show()

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator()
    print('Test Area Under ROC', evaluator.evaluate(predictions))
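    To actually persist and reload the fitted objects from above, a minimal sketch (the paths are illustrative):

    from pyspark.ml import PipelineModel
    from pyspark.ml.classification import LogisticRegressionModel

    # save the fitted preprocessing pipeline and the trained model to disk
    pipelineModel.write().overwrite().save("models/prep_pipeline")
    lrModel.write().overwrite().save("models/lr_model")

    # load them back later and reuse them without refitting
    samePipeline = PipelineModel.load("models/prep_pipeline")
    sameModel = LogisticRegressionModel.load("models/lr_model")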
  5. Utilities:

    Utilities for linear algebra, statistics and data handling. Example: mllib.linalg contains MLlib's linear algebra utilities.
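    A small sketch of the DataFrame-based counterparts, pyspark.ml.linalg and pyspark.ml.stat (assuming an active SparkSession named spark; the vectors are made up):

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import Correlation

    vec_df = spark.createDataFrame(
        [(Vectors.dense([1.0, 0.5]),),
         (Vectors.dense([2.0, 1.5]),),
         (Vectors.sparse(2, [0], [3.0]),)],   # sparse vector: only index 0 is non-zero
        ["features"])

    Correlation.corr(vec_df, "features").show(truncate=False)   # Pearson correlation matrix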

Reference material:

https://spark.apache.org/docs/latest/ml-guide.html

Final notes

Spark MLlib is essential when it comes to big data and machine learning. In this post, we learned about the details of Spark MLlib, DataFrames and Pipelines. In a future post, we will work through hands-on code to implement pipelines and build data models using MLlib.
