Spark lets you partition the processing of big data across a cluster of machines. It also processes data in memory, which can make some jobs up to 100 times faster than Apache Hadoop MapReduce. The RDD (Resilient Distributed Dataset) data structure makes parallelization easy. Spark also integrates naturally with streaming data. And now we can integrate machine learning with the help of Spark MLlib.
Spark MLlib scales well with the number of machines, so it is fast, and it includes everything we need. The goal of Spark is to be a general-purpose platform where users can do all of their data processing without stepping outside, because the cost of switching platforms is high. Spark ships as a full package: Spark SQL for structured data, Spark Streaming for streaming data, MLlib for machine learning, GraphX for graph processing, and more.
The goal of Spark MLlib is to make machine learning easy and scalable. MLlib provides common machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It can do feature extraction, transformation, dimensionality reduction, and selection. It also has tools for constructing, evaluating, and tuning ML pipelines. A pipeline chains multiple transformers and estimators together to perform the ML workflow. Spark ML uses the SchemaRDD from Spark SQL (called a DataFrame since Spark 1.3) as its dataset, so it can hold a variety of data types. For example, a dataset could have different columns storing text, feature vectors, true labels, and predictions. So Spark ML is fast, easy to use, and has lots of tools and features for ML.
Author: Pushpak Hurpade