This presentation provides a comprehensive introduction to Apache Spark, from an explanation of its rapid ascent to its performance and developer advantages over MapReduce. It also explores Spark's built-in functionality for streaming, machine learning, and Extract, Transform, Load (ETL) applications.
2. www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or other relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on data as it arrives, and will design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho), create initial reports, and train the employees who will analyze the data.
Mammoth Data is based in downtown Durham (right above Toast).
What Is Apache Spark?! No, But Really…
● A framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● A Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● An Apache project (spark.apache.org)
History
● Began at UC Berkeley in 2009
● An Apache project since 2013
● A top-level Apache project since 2014
● Its creators formed databricks.com
RDDs – The Building Block
● RDD = Resilient Distributed Dataset
● Immutable and fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources
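As a minimal sketch (Scala, assuming a local Spark 1.x installation; the app name and file path are illustrative), an RDD can be created manually from a local collection with parallelize, or from an external source with textFile:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; app name and master URL are illustrative.
val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Created manually from a local collection...
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// ...or from an external source (path is hypothetical).
// val lines = sc.textFile("hdfs://namenode/data.txt")

// Transformations (map) are lazy; actions (reduce) run in parallel
// across the cluster's partitions.
val doubledSum = numbers.map(_ * 2).reduce(_ + _) // 30
```

Note that nothing is computed until the reduce action runs; the map transformation only records a step in the DAG.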
RDDs – cache()
● cache() / persist()
● When an action is performed for the first time, keep the result in memory
● Different levels of persistence available
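A sketch of caching in practice (Scala, assuming an existing SparkContext named sc; the input file is hypothetical):

```scala
// Assumes an existing SparkContext `sc`; "data.txt" is a hypothetical file.
val words = sc.textFile("data.txt").flatMap(_.split(" "))

// Mark the RDD for caching; this is lazy and computes nothing yet.
words.cache()

// The first action materializes and caches the RDD; later actions reuse it.
val total = words.count()
val distinctWords = words.distinct().count()

// persist() accepts an explicit storage level instead of the
// memory-only default used by cache().
// import org.apache.spark.storage.StorageLevel
// words.persist(StorageLevel.MEMORY_AND_DISK)
```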
Streaming
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLLib, GraphX, Dataframes)
● Fault-tolerant
● Connectors for Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
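A minimal streaming sketch (Scala, assuming an existing SparkContext named sc; the host and port are hypothetical) that counts words in 5-second micro-batches from a socket:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc`; host/port are hypothetical.
val ssc = new StreamingContext(sc, Seconds(5)) // 5-second micro-batches

// Each batch of lines arrives as an RDD inside a DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

The same flatMap/map/reduceByKey operations used on plain RDDs apply per batch, which is what the "DStreams of RDDs" bullet above means in practice.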
Dataframes
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes: Example
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load a Dataframe from a JSON file
val df = sqlContext.read.json("people.json")
df.show()
// Filter and aggregate using column expressions
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
Dataframes: Catalyst
● An optimizing query planner for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
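To inspect what Catalyst produces for a query, explain() on a Dataframe prints the plans it generates (a sketch, assuming the Dataframe df read from people.json in the earlier example):

```scala
// Assumes `df` is the Dataframe read from people.json earlier.
// explain(true) prints the parsed, analyzed, optimized, and physical
// plans that Catalyst compiles the query down to.
df.filter(df("age") >= 35).groupBy("age").count().explain(true)
```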
MLLib
● Machine learning for Spark
● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc.
● Matrix operations (dense / sparse), dimensionality reduction, etc.
● And basic stats too!
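As one example of the algorithms above, a k-means sketch (Scala, assuming an existing SparkContext named sc; the data points are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext `sc`; the points are illustrative.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Cluster the points into two groups.
val model = KMeans.train(points, k = 2, maxIterations = 20)
model.clusterCenters.foreach(println)
```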
MLLib - Pipelines
● A common interface between different ML solutions
● (still in progress, but production-ready as of 1.5)
● Pipelines are to MLLib what Dataframes are to RDDs
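A sketch of chaining stages into one pipeline (Scala; the column names and training Dataframe are illustrative assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split raw text into words (column names are illustrative).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Stage 2: hash the words into feature vectors.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// Stage 3: fit a classifier on the features.
val lr = new LogisticRegression().setMaxIter(10)

// One Pipeline runs feature extraction and model fitting end to end.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// val model = pipeline.fit(trainingDF) // trainingDF is a hypothetical Dataframe
```

The value of the common interface is that each stage can be swapped or re-tuned without rewriting the glue code between steps.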
GraphX
● Graph processing algorithms
● Operations on vertices and edges
● Includes the PageRank algorithm
● Can be combined with Streaming/Dataframes/MLLib
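A PageRank sketch (Scala, assuming an existing SparkContext named sc; the edge-list file is hypothetical):

```scala
import org.apache.spark.graphx.GraphLoader

// Assumes an existing SparkContext `sc`; "followers.txt" is a hypothetical
// edge-list file where each line holds a "sourceId destId" pair.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)
```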