This presentation provides a comprehensive introduction to Apache Spark, from an explanation of its rapid ascent to its performance and developer advantages over MapReduce. It also explores Spark's built-in functionality for streaming, machine learning, and Extract, Transform, Load (ETL) applications.
2. www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or other relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on data as it arrives, and will design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho), create initial reports, and train the employees who will analyze the data.
Mammoth Data is based in downtown Durham (right above Toast).
What Is Apache Spark?! No, But Really…
● A framework for massively parallel (cluster) computing
● Harnesses the power of cheap memory
● A Directed Acyclic Graph (DAG) computing engine
● It goes very fast!
● An Apache project (spark.apache.org)
History
● Began at UC Berkeley in 2009
● An Apache project since 2013
● A top-level Apache project since 2014
● Its creators formed databricks.com
RDDs – The Building Block
● RDD = Resilient Distributed Dataset
● Immutable and fault-tolerant
● Operated on in parallel
● Can be created manually or from external sources
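As a minimal sketch (Scala, assuming a local Spark 1.x installation; the app name and file path are illustrative), an RDD can be created manually from a local collection with parallelize, or from an external source with textFile:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; app name and master URL are illustrative.
val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Created manually from a local collection...
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// ...or from an external source (path is hypothetical).
// val lines = sc.textFile("hdfs://namenode/data.txt")

// Transformations (map) are lazy; actions (reduce) run in parallel
// across the cluster's partitions.
val doubledSum = numbers.map(_ * 2).reduce(_ + _) // 30
```

Note that nothing is computed until the reduce action runs; the map transformation only records a step in the DAG.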
RDDs – cache()
● cache() / persist()
● When an action is performed for the first time, keep the result in memory
● Different levels of persistence available
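A sketch of caching in practice (Scala, assuming an existing SparkContext named sc; the input file is hypothetical):

```scala
// Assumes an existing SparkContext `sc`; "data.txt" is a hypothetical file.
val words = sc.textFile("data.txt").flatMap(_.split(" "))

// Mark the RDD for caching; this is lazy and computes nothing yet.
words.cache()

// The first action materializes and caches the RDD; later actions reuse it.
val total = words.count()
val distinctWords = words.distinct().count()

// persist() accepts an explicit storage level instead of the
// memory-only default used by cache().
// import org.apache.spark.storage.StorageLevel
// words.persist(StorageLevel.MEMORY_AND_DISK)
```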
Streaming
● Micro-batches (DStreams of RDDs)
● Access to other parts of Spark (MLLib, GraphX, Dataframes)
● Fault-tolerant
● Connectors for Kafka, Flume, Kinesis, ZeroMQ
● (we’ll come back to this)
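A minimal streaming sketch (Scala, assuming an existing SparkContext named sc; the host and port are hypothetical) that counts words in 5-second micro-batches from a socket:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc`; host/port are hypothetical.
val ssc = new StreamingContext(sc, Seconds(5)) // 5-second micro-batches

// Each batch of lines arrives as an RDD inside a DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```

The same flatMap/map/reduceByKey operations used on plain RDDs apply per batch, which is what the "DStreams of RDDs" bullet above means in practice.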
Dataframes
● Spark SQL
● Support for JSON, Cassandra, SQL databases, etc.
● Easier syntax than RDDs
● Dataframes ‘borrowed’ from Python/R
● Catalyst query planner
Dataframes: Example
val sc = new SparkContext()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load a Dataframe from a JSON file
val df = sqlContext.read.json("people.json")
df.show()
// Filter and aggregate using column expressions
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
Dataframes: Catalyst
● An optimizing query planner for Spark
● Takes Dataframe operations and ‘compiles’ them down to RDD operations
● Often faster than writing RDD code manually
● Use Dataframes whenever possible (v1.4+)
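To inspect what Catalyst produces for a query, explain() on a Dataframe prints the plans it generates (a sketch, assuming the Dataframe df read from people.json in the earlier example):

```scala
// Assumes `df` is the Dataframe read from people.json earlier.
// explain(true) prints the parsed, analyzed, optimized, and physical
// plans that Catalyst compiles the query down to.
df.filter(df("age") >= 35).groupBy("age").count().explain(true)
```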
MLLib
● Machine learning for Spark
● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc.
● Matrix operations (dense / sparse), dimensionality reduction, etc.
● And basic stats too!
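As one example of the algorithms above, a k-means sketch (Scala, assuming an existing SparkContext named sc; the data points are made up):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext `sc`; the points are illustrative.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Cluster the points into two groups.
val model = KMeans.train(points, k = 2, maxIterations = 20)
model.clusterCenters.foreach(println)
```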
MLLib - Pipelines
● A common interface between different ML solutions
● (still in progress, but production-ready as of 1.5)
● Pipelines are to MLLib what Dataframes are to RDDs
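A sketch of chaining stages into one pipeline (Scala; the column names and training Dataframe are illustrative assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stage 1: split raw text into words (column names are illustrative).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
// Stage 2: hash the words into feature vectors.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// Stage 3: fit a classifier on the features.
val lr = new LogisticRegression().setMaxIter(10)

// One Pipeline runs feature extraction and model fitting end to end.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// val model = pipeline.fit(trainingDF) // trainingDF is a hypothetical Dataframe
```

The value of the common interface is that each stage can be swapped or re-tuned without rewriting the glue code between steps.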
GraphX
● Graph processing algorithms
● Operations on vertices and edges
● Includes the PageRank algorithm
● Can be combined with Streaming/Dataframes/MLLib
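A PageRank sketch (Scala, assuming an existing SparkContext named sc; the edge-list file is hypothetical):

```scala
import org.apache.spark.graphx.GraphLoader

// Assumes an existing SparkContext `sc`; "followers.txt" is a hypothetical
// edge-list file where each line holds a "sourceId destId" pair.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)
```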