Apache Spark Introduction @ University College London
- 1. 1© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
Training – University College London
September 8th, 2014
Suhas Gogate: Architect, Pivotal Hadoop Engineering
Shivram Mani: Lead Engineer, Pivotal Hadoop Engineering
- 2. 2© Copyright 2013 Pivotal. All rights reserved.
About Me: Suhas (https://www.linkedin.com/in/vgogate)
Since 2008, active in Hadoop infrastructure and ecosystem components
– Worked with leading Hadoop technology companies
– Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal
Founder and PMC member/committer of the Apache Ambari project
Contributed Apache “Hadoop Vaidya” – Performance diagnostics for M/R
Prior to Hadoop,
– IBM Almaden Research (2000-2008), CS Software & Storage systems.
– In the early days (1993) of my career, worked with the team that built the first Indian
supercomputer, PARAM (a Transputer-based MPP system), at the Centre for
Development of Advanced Computing (C-DAC, Pune)
- 3. 3© Copyright 2013 Pivotal. All rights reserved.
About Me: Shivram (https://www.linkedin.com/in/shivrammani)
Since 2009, active user of Hadoop
– Yahoo, EMC-Greenplum/Pivotal
Built the Pivotal Command Center (Cluster Configuration/Management)
Lead developer for
– Pivotal Extension Framework
– Unified Storage System
Prior to Hadoop,
– Yahoo Web Search Federation
– Yahoo Vertical Search Relevance
- 4. 4© Copyright 2013 Pivotal. All rights reserved.
Abstract
Apache Spark is one of the most exciting and talked about ASF projects
today, but how should enterprise architects view it, and what type of impact
might it have on our platforms? This talk will introduce Spark and its core
concepts, the ecosystem of services on top of it, types of problems it can
solve, similarities and differences from Hadoop, deployment topologies,
and possible uses in enterprise. Concepts will be illustrated with a variety
of demos covering: the programming model, the development experience,
“realistic” infrastructure simulation with local virtual deployments, and
Spark cluster monitoring tools.
- 5. 5© Copyright 2013 Pivotal. All rights reserved.
Day 1 (Sept 8th, 2014) – Agenda
What is Spark
– What does it have to do with Big Data/Hadoop?
Spark Programming Model
Spark Internals:
– Execution, Shuffles, Tasks, Stages
Spark Deployment models
Demo & Hands-on exercise
Q/A
- 6. 6© Copyright 2013 Pivotal. All rights reserved.
What Is Spark?
- 7. 7© Copyright 2013 Pivotal. All rights reserved.
What is Spark?
Distributed Compute Engine for analysis of large data sets, like Hadoop M/R
– Inspired by deficiencies in Hadoop M/R batch processing
▪ Data Replication, Serialization, Disk I/O etc.
– Effectively uses distributed cluster memory for faster computations
A common framework primarily designed for the following types of workloads:
– Iterative graph processing algorithms (Google Pregel)
– Iterative machine learning algorithms like PageRank, K-means clustering, logistic regression, etc. (HaLoop)
– Interactive data mining – run multiple ad-hoc queries on the same data set
– Along with “batch” workloads like Hadoop M/R on data in memory
Implementation of Resilient Distributed Dataset (RDD) in Scala
Similar scalability and fault tolerance as Hadoop Map/Reduce
– Although uses different fault tolerance model of lineage to reconstitute data instead of replication
Programmatic interface via API or Interactive
– Scala, Java7/8, Python
- 8. 8© Copyright 2013 Pivotal. All rights reserved.
Spark is also …
Came out of the AMPLab project at UC Berkeley
An ASF Top Level project
– https://spark.apache.org/
– https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
An active community of ~100-200 contributors across 25-35 companies
– More active than Hadoop MapReduce
– 1,000 people (the venue maximum) attended Spark Summit
– http://spark-summit.org
Hadoop Compatible
- 9. 9© Copyright 2013 Pivotal. All rights reserved.
Spark is not …
An OLTP data store
A “permanent” data store
Or an application cache
It’s also not as mature as Hadoop
– This is a good thing. Lots of room to grow.
- 10. 10© Copyright 2013 Pivotal. All rights reserved.
Spark is not Hadoop, but is compatible
Often better than Hadoop
– M/R is fine for “Data Parallel”, but awkward for some workloads
– Low latency dispatch, Iterative, Streaming
Natively accesses Hadoop data
– Data Locality
Spark is just another YARN job
– Utilizes current investments in Hadoop
– Brings Spark to the Data
It’s not OR … it’s AND!
- 11. 11© Copyright 2013 Pivotal. All rights reserved.
Improvements over Map/Reduce
Efficiency
– General Execution Graphs (not just map->reduce->store)
– In memory
Usability
– Rich APIs in Scala, Java, Python
– Interactive
Can Spark be the R for Big Data?
- 12. 12© Copyright 2013 Pivotal. All rights reserved.
Short History
2009 Started as research project at UCB
2010 Open Sourced
January 2011 AMPLab Created
October 2012 0.6
– Java, standalone cluster, Maven
June 21 2013 Spark accepted into ASF Incubator
Feb 27 2014 Spark becomes top level ASF project
May 30 2014 Spark 1.0
August 5th, 2014 Spark 1.0.2
- 13. 13© Copyright 2013 Pivotal. All rights reserved.
Spark Programming
Model
RDDs in Detail
- 14. 14© Copyright 2013 Pivotal. All rights reserved.
Spark Program Model (Scala)
A driver program runs the user’s main function and executes a set of parallel
operations (transformations and actions) on a collection of
elements called a Resilient Distributed Dataset (RDD)
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://.../logfile")
val err_lines = file.filter(_.contains("ERROR"))
val err_msgs = err_lines.map(MyFunc.extractMessage)
err_msgs.cache()
err_msgs.saveAsTextFile("hdfs://.../errlogfile")
err_msgs.count()
RDDs, Transformations, Actions
- 15. 15© Copyright 2013 Pivotal. All rights reserved.
Resilient Distributed Dataset (RDD)
A new data type provided by the Spark framework as an extension to
– Scala, Java, Python
A read-only collection of records partitioned across cluster nodes
– Does not support insert/update/delete of records from RDD
Created by
– Reading file(s) from HDFS
– Parallelizing existing collections (lists, arrays, maps, etc.)
– By executing transformations on existing RDDs
RDDs can be persisted w/ the following options
– In memory – Serialized or Non-serialized object (optionally replicated)
– On Disk – Serialized (optionally replicated)
– In memory file system like Tachyon
RDDs store lineage information
– Support coarse grain recovery of whole partition upon node failure
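The lineage-based recovery above can be sketched in plain Python (this is an illustration of the idea, not the Spark API): a partition keeps a reference to its source plus the chain of transformations that produced it, so a lost partition is rebuilt by recomputation rather than restored from a replica.

```python
# Illustrative sketch of coarse-grained recovery via lineage.
def build_partition(source, lineage):
    """Recompute a partition by replaying its lineage over the source."""
    data = list(source)
    for transform in lineage:
        data = list(transform(data))
    return data

source = [1, 2, 3, 4, 5, 6]
lineage = [
    lambda d: map(lambda x: x * 10, d),    # a map transformation
    lambda d: filter(lambda x: x > 20, d), # a filter transformation
]

partition = build_partition(source, lineage)   # [30, 40, 50, 60]

# Simulate losing the partition: the data is gone, but source + lineage remain.
partition = None
recovered = build_partition(source, lineage)
print(recovered)  # [30, 40, 50, 60]
```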
- 16. 16© Copyright 2013 Pivotal. All rights reserved.
Two Categories of Operations on RDD
Transforms
– Create from stable storage (hdfs, tachyon, etc.)
– Generate RDD from other RDD (map, filter, groupBy)
– Lazy Operations that build a DAG of Tasks
– Once Spark knows your transformations it can build a plan
Actions
– Return a result or write to storage (count, collect, save, etc.)
– Actions cause the DAG to execute (like Apache Pig)
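The lazy-transformations-vs-eager-actions split can be sketched in plain Python with generators (an analogy only, not the Spark API): chaining operations builds a plan but computes nothing until an "action" pulls the results.

```python
# Records which elements were actually processed.
executed = []

def mark(x):
    executed.append(x)
    return x

data = range(1, 6)
plan = (mark(x) * 2 for x in data)   # "transformation": nothing runs yet
plan = (x for x in plan if x > 4)    # another lazy transformation

assert executed == []                # no work has been done so far

result = list(plan)                  # "action": triggers the whole chain
print(result)        # [6, 8, 10]
print(executed)      # [1, 2, 3, 4, 5]
```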
- 17. 17© Copyright 2013 Pivotal. All rights reserved.
Transformation and Actions
Transformations
– map
– filter
– flatMap
– sample
– groupByKey
– reduceByKey
– union
– join
– sort
Actions
– count
– collect
– reduce
– lookup
– save
- 18. 18© Copyright 2013 Pivotal. All rights reserved.
Spark Shared Variables (between Tasks & Driver)
Broadcast variables
– Read only variable cached once on each node (not shipped w/ each task)
– Multiple tasks running on the node can refer to it
– Broadcast variable should be used in the program after it is created
– Original variable should not be modified after broadcast
val broadcastVar = sc.broadcast(Array(1, 2, 3))
Accumulators
– Accumulator variables are like counters in M/R (i.e., tasks can add to them)
– Tasks cannot read the value of an accumulator variable; only the driver program can read it
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
Local variables of the transformation function are shipped to each node along with the function
– They cannot be accessed by the driver program
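The isolation described above, where tasks cannot usefully mutate driver-side variables (the problem accumulators solve), can be sketched in plain Python; the task/merge structure here is illustrative, not Spark's actual mechanism.

```python
counter = 0

def task(partition, state):
    # On a real cluster the closure is serialized and shipped, so each
    # task works on its own copy; here, rebinding `local` similarly
    # never updates the driver's `counter`.
    local = state
    for _ in partition:
        local += 1
    return local

partitions = [[1, 2], [3, 4, 5]]
for p in partitions:
    task(p, counter)
assert counter == 0  # the driver's counter was never updated

# Accumulator style: tasks produce partial counts, the driver merges them.
total = sum(task(p, 0) for p in partitions)
print(total)  # 5
```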
- 19. 19© Copyright 2013 Pivotal. All rights reserved.
RDDs from External & Internal data sets
Parallelize existing collection to create RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, num_partitions)
Create RDD from external data source
– Supports local FS, HDFS, Cassandra, HBase, Amazon S3
– Supports text files, sequence files, Hadoop InputFormats
– A local FS file path should be available and the same on all nodes
– Reading directories is supported
– textFile() by default makes one partition for each HDFS file block
scala> val distFile = sc.textFile("data.txt", num_partitions)
- 20. 20© Copyright 2013 Pivotal. All rights reserved.
RDD Persistence
Persisting the result RDD after a series of transformations is recommended
– This makes further actions on the result RDD much faster (no recomputation)
– RDD.cache() – in memory persistence
RDD Persistence options
– MEMORY_ONLY
▪ Store as deserialized Java objects. If the RDD does not fit in memory, some partitions will be recomputed when needed
– MEMORY_AND_DISK
▪ Store as deserialized Java objects. If the RDD does not fit in memory, store the extra partitions on disk
– MEMORY_ONLY_SER
▪ Store as serialized Java objects (one byte array per partition). More space-efficient
– MEMORY_AND_DISK_SER
▪ Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
– DISK_ONLY
▪ Store the RDD partitions only on disk.
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.:
▪ Same as the levels above, but replicate each partition on two cluster nodes.
– OFF_HEAP (experimental) :
▪ Store RDD in serialized format in Tachyon, Reduce GC overhead, Share RDDs across Apps
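The deserialized-vs-serialized trade-off (MEMORY_ONLY vs. MEMORY_ONLY_SER) can be sketched in plain Python using pickle as a stand-in serializer: one byte array per partition is usually more compact than live objects, at the cost of a deserialization step on every access. The sizes below are rough illustrations, not Spark's actual accounting.

```python
import pickle
import sys

partition = [(i, "record-%d" % i) for i in range(10000)]

# Rough lower bound on size as live objects: the container plus the
# per-tuple overhead (ignores the ints and strings inside each tuple).
live_size = sys.getsizeof(partition) + sum(
    sys.getsizeof(rec) for rec in partition)

# Size as a single serialized byte array (MEMORY_ONLY_SER style).
blob = pickle.dumps(partition)
ser_size = len(blob)

print(live_size > ser_size)        # serialized form is more compact
restored = pickle.loads(blob)      # ...but costs CPU to read back
assert restored == partition
```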
- 21. 21© Copyright 2013 Pivotal. All rights reserved.
How to choose RDD Persistence level
Persistence levels trade off memory usage against CPU efficiency.
If your RDDs fit comfortably in the default storage level (MEMORY_ONLY), leave them that way
If not, try MEMORY_ONLY_SER and select a fast serialization library
– Spilling to disk is costly (Spark by default uses Java serialization)
– Python uses the Pickle library by default
– Scala/Java can use the Kryo library, which is faster than default Java serialization
Use the replicated storage levels if you want fast fault recovery
In environments with high amounts of memory or multiple applications, the
experimental OFF_HEAP mode has several advantages:
– It allows multiple executors to share the same pool of memory in Tachyon.
– It significantly reduces garbage collection costs.
– Cached data is not lost if individual executors crash.
- 22. 22© Copyright 2013 Pivotal. All rights reserved.
Spark Configuration
Spark properties
– Control most application parameters and can be set by using a SparkConf
object, or through Java system properties
– Precedence
▪ SparkConf in driver program,
▪ spark submit options,
▪ Default conf file in spark conf directory
– http://spark.apache.org/docs/latest/configuration.html
Environment variables
– Can be used to set per-machine settings, such as the IP address, through the
conf/spark-env.sh script on each node.
Logging
– can be configured through log4j.properties
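The precedence order above can be sketched as a simple layered merge in plain Python (property names and values here are illustrative): settings from the driver's SparkConf win over spark-submit options, which win over the defaults file.

```python
# Lowest to highest precedence; later layers override earlier ones.
defaults_file = {"spark.executor.memory": "512m", "spark.master": "local"}
submit_opts = {"spark.executor.memory": "1g"}
driver_conf = {"spark.app.name": "demo"}

effective = {**defaults_file, **submit_opts, **driver_conf}
print(effective)
# {'spark.executor.memory': '1g', 'spark.master': 'local', 'spark.app.name': 'demo'}
```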
- 23. 23© Copyright 2013 Pivotal. All rights reserved.
Spark Application Monitoring
Web Interface
– Includes
▪ A list of scheduler stages and tasks
▪ A summary of RDD sizes and memory usage
▪ Environmental information.
▪ Information about the running executors
– Provides a web UI for the running application at
▪ http://<driver-node>:4040
– Provides the ability to view application information after it has finished
▪ Set spark.eventLog.enabled = true
▪ Set spark.eventLog.dir = file:///tmp/spark-events
▪ Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
- 24. 24© Copyright 2013 Pivotal. All rights reserved.
How Spark Runs
DAGs, shuffle’s, tasks, stages, etc.
- 26. 26© Copyright 2013 Pivotal. All rights reserved.
What happens
Create RDDs
Pipeline operations as much as possible
– When a result doesn’t depend on other results, we can pipeline
– But when data needs to be reorganized, we can no longer pipeline
A stage is a group of pipelined (merged) operations
Each stage gets a set of tasks
A task is data plus computation
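The pipelining idea can be sketched in plain Python (an analogy, not Spark's scheduler): with lazy evaluation, a map and a filter alternate per element in one fused pass, while a group-by needs every upstream result before it can start, which is where a new stage would be cut.

```python
order = []

def double(x):
    order.append(("map", x))
    return x * 2

def keep(y):
    order.append(("filter", y))
    return y > 2

data = [3, 1, 2]

# "Stage 1": map and filter are pipelined element by element.
stage1 = [y for y in (double(x) for x in data) if keep(y)]
print(order)
# [('map', 3), ('filter', 6), ('map', 1), ('filter', 2), ('map', 2), ('filter', 4)]

# "Stage 2": grouping must see all of stage1 at once (a stage boundary).
groups = {}
for y in stage1:
    groups.setdefault(y % 4, []).append(y)
print(stage1, groups)   # [6, 4] {2: [6], 0: [4]}
```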
- 29. 29© Copyright 2013 Pivotal. All rights reserved.
Stages running
The number of partitions matters for concurrency
Rule of thumb: at least 2x the number of cores
- 30. 30© Copyright 2013 Pivotal. All rights reserved.
The Shuffle
Redistributes data among partitions
– Hash keys into buckets
– Pull, not push
– Writes intermediate files to disk
– Becoming pluggable
Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
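The "hash keys into buckets" step can be sketched in plain Python (a toy hash partitioner, not Spark's implementation): each (key, value) record is routed to an output partition by hashing its key, so every value for a given key lands in the same bucket regardless of where the record started.

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a bucket by hash of the key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
buckets = hash_partition(records, 3)

# Every occurrence of a key ends up in exactly one bucket.
for key in ("a", "b", "c"):
    holders = [i for i, b in enumerate(buckets)
               if any(k == key for k, _ in b)]
    assert len(holders) == 1
print(buckets)
```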
- 31. 31© Copyright 2013 Pivotal. All rights reserved.
Other thoughts on Memory
By default Spark owns 90% of the memory
Partitions don’t have to fit in memory, but some things do
– E.g., values for large sets in groupBys must fit in memory
Shuffle memory is 30%
– If it goes over that, it’ll spill the data to disk
– Shuffle always writes to disk
Turn on compression to keep objects serialized
– Saves space, but takes compute to serialize/de-serialize
- 32. 32© Copyright 2013 Pivotal. All rights reserved.
Spark Deployment
modes
- 33. 33© Copyright 2013 Pivotal. All rights reserved.
Spark Topology/Deployment modes
Local
– Great for Dev
Spark Cluster (master/slaves)
– Improving rapidly
Cluster Resource Managers
– YARN
– MESOS
- 38. 38© Copyright 2013 Pivotal. All rights reserved.
Spark Hands-on
- 39. 39© Copyright 2013 Pivotal. All rights reserved.
Spark Hands-on
How to Build Spark?
Spark on a dev environment (Mac)
Spark on AWB
PySpark & Lambda Functions
Spark Examples using PySpark
Web UI/Debugging
- 40. 40© Copyright 2013 Pivotal. All rights reserved.
Spark Code/Build
Spark Git Repository
– https://github.com/apache/spark
Download Spark
– git clone https://github.com/apache/spark
Build Spark
– Maven
– Shell script
- 41. 41© Copyright 2013 Pivotal. All rights reserved.
Build Spark with Maven
Setting up Maven’s Memory Usage
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Specifying the Hadoop Version
# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
- 42. 42© Copyright 2013 Pivotal. All rights reserved.
Build Spark using Script
Use make distribution script
Abstraction using maven
– ./make-distribution.sh --skip-java-test -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0
– Creates a dist/ folder containing Spark artifacts
– --tgz creates a Spark distribution tarball
- 43. 43© Copyright 2013 Pivotal. All rights reserved.
Spark on AWB
Access
– ssh manis2@acs04.analyticsworkbench.com -p 45326
Topology
– https://portal.analyticsworkbench.com/projects/awbhome/wiki/Cluster_Topology
– Spark Admin: http://access3.ic.analyticsworkbench.com:4040
Directory
– /usr/share/spark
Spark using Yarn Cluster
- 44. 44© Copyright 2013 Pivotal. All rights reserved.
Using Spark Submit
Suitable for yarn-cluster mode
export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
- 45. 45© Copyright 2013 Pivotal. All rights reserved.
Using Spark Shell
Suitable for yarn-client mode
Interactive shell suitable for debugging
export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
bin/pyspark --master yarn-client --num-executors 2
- 46. 46© Copyright 2013 Pivotal. All rights reserved.
Spark Logging
Spark logs contain the following information
– # of partitions
– Size of tasks
– Which nodes tasks are running on
– Progress of tasks
Yarn Container Logs
– Available locally
– Aggregated on HDFS
- 47. 47© Copyright 2013 Pivotal. All rights reserved.
Spark Web Interface
Web UI: http://<driver-node>:4040
One port for each application (aka SparkContext)
Shows the following information
– List of scheduler stages & tasks
– Summary of RDD sizes & memory usage
– Environmental information
– Information about running executors
Provides the ability to view application information after it has finished
– Set spark.eventLog.enabled = true
– Set spark.eventLog.dir = file:///tmp/spark-events
– Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
- 48. 48© Copyright 2013 Pivotal. All rights reserved.
Spark Examples
Line with most words
Lines with a particular word
Word count
Sorting
PageRank
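The word-count example from this list can be sketched in plain Python, mirroring the flatMap → map → reduceByKey pipeline a PySpark version would use (the input lines here are made up for illustration):

```python
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: each line -> many words
words = [w for line in lines for w in line.split()]

# map: each word -> (word, 1)
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```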
- 49. 49© Copyright 2013 Pivotal. All rights reserved.
Berkeley Data Stack -
Related Projects
Things that use Spark Core
- 50. 50© Copyright 2013 Pivotal. All rights reserved.
Berkeley Data Analytics Stack (BDAS)
Supports
– Batch
– Streaming
– Interactive
Makes it easy to compose them
https://amplab.cs.berkeley.edu/software/
- 51. 51© Copyright 2013 Pivotal. All rights reserved.
Spark SQL
Lib in Spark Core that models RDDs as relations
– SchemaRDD
Replaces Shark
– Lighter weight version with no code from Hive
Import/Export in different Storage formats
– Parquet, learn schema from existing Hive warehouse
Takes columnar storage from Shark
- 52. 52© Copyright 2013 Pivotal. All rights reserved.
Spark Streaming
Extend Spark to do large scale stream processing
– 100s of nodes with second-scale end-to-end latency
Simple, batch like API with RDDs
A single set of semantics for both real-time and high-latency (batch) processing
Other features
– Window-based Transformations
– Arbitrary join of streams
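The window-based transformations mentioned above can be sketched in plain Python (an analogy, not the Spark Streaming API): the stream is chopped into small batches, and each windowed result covers the last N batches.

```python
batches = [[1, 2], [3], [4, 5], [6]]  # illustrative micro-batches

def windowed_counts(batches, window):
    """For each batch, count the elements in the last `window` batches."""
    results = []
    for i in range(len(batches)):
        start = max(0, i - window + 1)
        merged = [x for b in batches[start:i + 1] for x in b]
        results.append(len(merged))
    return results

print(windowed_counts(batches, 3))  # [2, 3, 5, 4]
```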
- 53. 53© Copyright 2013 Pivotal. All rights reserved.
Streaming (cont)
Input is broken up into batches that become RDDs
RDDs are composed into DAGs to generate output
Raw data is replicated in memory for fault tolerance
- 54. 54© Copyright 2013 Pivotal. All rights reserved.
GraphX (Alpha)
Graph processing library
– Replaces Spark Bagel
Graph Parallel not Data Parallel
– Reason in the context of neighbors
– GraphLab API
Graph Creation => Algorithm => Post Processing
– Existing systems mainly deal with the Algorithm and not interactive
– Unify collection and graph models
- 55. 55© Copyright 2013 Pivotal. All rights reserved.
MLbase
Machine Learning toolset
– Library and higher level abstractions
General tool in space is MatLab
– Difficult for end users to learn, debug, scale solutions
Starting with MLlib
– Low level Distributed Machine Learning Library
Many different Algorithms
– Classification, Regression, Collaborative Filtering, etc.
- 56. 56© Copyright 2013 Pivotal. All rights reserved.
Thanks!
- 57. 57© Copyright 2013 Pivotal. All rights reserved.
Data Science Platform
[Architecture diagram: a Data Lake (Hadoop HDFS / Isilon / virtual storage) at the base; cluster managers (YARN, Mesos); Spark with RDD/M-R, SparkSQL, Streaming, and MLbase alongside MPP SQL, an IMDG (GemFireXD), and a stream server; an App Data Platform (SQL, Objects, JSON) and application platform above; legacy systems and data sources feeding in; data scientists/analysts, app dev/ops, and end users consuming.]
- 58. 58© Copyright 2013 Pivotal. All rights reserved.
Backup Slides
- 59. 59© Copyright 2013 Pivotal. All rights reserved.
PHD
General Solution Pipeline
[Pipeline diagram: streaming ingest of machine data via a stream message source over RabbitMQ transport, through a message transformer to an HDFS sink and a GemFire (IMDB) tap; analytics taps compute counters and gauges, exposed via SQL and a REST API to a dashboard.]
- 60. 60© Copyright 2013 Pivotal. All rights reserved.
PHD
Where’s Spark?
[The same pipeline diagram as the previous slide, posing where Spark fits: machine data → stream message source → transport → message transformer → analytics taps (counters and gauges) → HDFS sink / GemFire tap → SQL / REST API → dashboard.]
- 61. 61© Copyright 2013 Pivotal. All rights reserved.
Slides by Shivram
How to download/build and run spark examples
– Build w/ a specific version of Hadoop (PHD?)
▪ http://spark.apache.org/docs/latest/building-with-maven.html
Spark cluster deployment modes
– YARN & single node (EC2, Mesos, Standalone)
– http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
– http://spark.apache.org/docs/latest/running-on-yarn.html
– Submit apps vs interactive shell
Explain simple spark example and run it
▪ run-example, spark-shell/pyspark, spark-submit
▪ http://spark.apache.org/docs/latest/quick-start.html
- 62. 62© Copyright 2013 Pivotal. All rights reserved.
Deployment modes (+YARN slide)
Local
– Great for Dev
Spark Cluster (master/slaves)
– Improving rapidly
Cluster Resource Managers
– YARN
– MESOS
- 63. 63© Copyright 2013 Pivotal. All rights reserved.
Intro to Spark based Projects
– Spark SQL
– Spark Streaming
– MLBase
– GraphX