Apache Spark Introduction @ University College London
- 1. 1© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
Training – University College London
September 8th, 2014
Suhas Gogate: Architect, Pivotal Hadoop Engineering
Shivram Mani: Lead Engineer, Pivotal Hadoop Engineering
- 2. 2© Copyright 2013 Pivotal. All rights reserved.
About Me: Suhas (https://www.linkedin.com/in/vgogate)
Since 2008, active in Hadoop infrastructure and ecosystem components
– Worked with leading Hadoop technology companies
– Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal
Founder and PMC member/committer of the Apache Ambari project
Contributed Apache “Hadoop Vaidya” – Performance diagnostics for M/R
Prior to Hadoop,
– IBM Almaden Research (2000-2008), CS Software & Storage systems.
– In the early days (1993) of my career, worked with the team that built the first Indian
supercomputer, PARAM (a Transputer-based MPP system), at the Centre for
Development of Advanced Computing (C-DAC, Pune)
- 3. 3© Copyright 2013 Pivotal. All rights reserved.
About Me: Shivram (https://www.linkedin.com/in/shivrammani)
Since 2009, active user of Hadoop
– Yahoo, EMC-Greenplum/Pivotal
Built the Pivotal Command Center (Cluster Configuration/Management)
Lead developer for
– Pivotal Extension Framework
– Unified Storage System
Prior to Hadoop,
– Yahoo Web Search Federation
– Yahoo Vertical Search Relevance
- 4. 4© Copyright 2013 Pivotal. All rights reserved.
Abstract
Apache Spark is one of the most exciting and talked about ASF projects
today, but how should enterprise architects view it, and what type of impact
might it have on our platforms? This talk will introduce Spark and its core
concepts, the ecosystem of services on top of it, types of problems it can
solve, similarities and differences from Hadoop, deployment topologies,
and possible uses in enterprise. Concepts will be illustrated with a variety
of demos covering: the programming model, the development experience,
“realistic” infrastructure simulation with local virtual deployments, and
Spark cluster monitoring tools.
- 5. 5© Copyright 2013 Pivotal. All rights reserved.
Day 1 (Sept 8th, 2014) – Agenda
What is Spark
– What does it have to do with Big Data/Hadoop?
Spark Programming Model
Spark Internals:
– Execution, Shuffles, Tasks, Stages
Spark Deployment models
Demo & Hands-on exercise
Q/A
- 6. 6© Copyright 2013 Pivotal. All rights reserved.
What Is Spark?
- 7. 7© Copyright 2013 Pivotal. All rights reserved.
What is Spark?
Distributed Compute Engine for analysis of large data sets, like Hadoop M/R
– Inspired by deficiencies in Hadoop M/R batch processing
▪ Data Replication, Serialization, Disk I/O etc.
– Effectively uses distributed cluster memory for faster computations
A common framework primarily designed for the following types of workloads:
– Iterative graph processing algorithms (Google Pregel)
– Iterative machine learning algorithms like PageRank, K-means clustering, logistic regression, etc. (HaLoop)
– Interactive data mining – run multiple ad-hoc queries on the same data set
– Along with “batch” workloads like Hadoop M/R on data in memory
Implementation of Resilient Distributed Dataset (RDD) in Scala
Similar scalability and fault tolerance as Hadoop Map/Reduce
– Although uses different fault tolerance model of lineage to reconstitute data instead of replication
Programmatic interface via API or Interactive
– Scala, Java7/8, Python
- 8. 8© Copyright 2013 Pivotal. All rights reserved.
Spark is also …
Came out of the AMPLab project at UC Berkeley
An ASF Top Level project
– https://spark.apache.org/
– https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
An active community of ~100-200 contributors across 25-35 companies
– More active than Hadoop MapReduce
– 1,000 people (the venue maximum) attended Spark Summit
– http://spark-summit.org
Hadoop Compatible
- 9. 9© Copyright 2013 Pivotal. All rights reserved.
Spark is not …
An OLTP data store
A “permanent” data store
Or an application cache
It’s also not as mature as Hadoop
– This is a good thing. Lots of room to grow.
- 10. 10© Copyright 2013 Pivotal. All rights reserved.
Spark is not Hadoop, but is compatible
Often better than Hadoop
– M/R is fine for “Data Parallel”, but awkward for some workloads
– Low latency dispatch, Iterative, Streaming
Natively accesses Hadoop data
– Data Locality
Spark is just another YARN job
– Utilizes current investments in Hadoop
– Brings Spark to the Data
It’s not OR … it’s AND!
- 11. 11© Copyright 2013 Pivotal. All rights reserved.
Improvements over Map/Reduce
Efficiency
– General Execution Graphs (not just map->reduce->store)
– In memory
Usability
– Rich APIs in Scala, Java, Python
– Interactive
Can Spark be the R for Big Data?
- 12. 12© Copyright 2013 Pivotal. All rights reserved.
Short History
2009 Started as research project at UCB
2010 Open Sourced
January 2011 AMPLab Created
October 2012 0.6
– Java, standalone cluster, Maven
June 21 2013 Spark accepted into ASF Incubator
Feb 27 2014 Spark becomes top level ASF project
May 30 2014 Spark 1.0
August 5th, 2014 Spark 1.0.2
- 13. 13© Copyright 2013 Pivotal. All rights reserved.
Spark Programming
Model
RDDs in Detail
- 14. 14© Copyright 2013 Pivotal. All rights reserved.
Spark Program Model (Scala)
A driver program runs the user’s main function and executes a set of parallel
operations (transformations and actions) on a collection of
elements called a Resilient Distributed Dataset (RDD)
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://.../logfile")
val err_lines = file.filter(_.contains("ERROR"))
val err_msgs = err_lines.map(MyFunc.extractMessage)
err_msgs.cache()
err_msgs.saveAsTextFile("hdfs://.../errlogfile")
err_msgs.count()
RDDs, Transformations, Actions
- 15. 15© Copyright 2013 Pivotal. All rights reserved.
Resilient Distributed Dataset (RDD)
A new data type provided by the Spark framework as an extension to
– Scala, Java, Python
A read-only collection of records partitioned across cluster nodes
– Does not support insert/update/delete of records from RDD
Created by
– Reading file(s) from HDFS
– Parallelizing existing collections (lists, arrays, maps, etc.)
– By executing transformations on existing RDDs
RDDs can be persisted w/ the following options
– In memory – Serialized or Non-serialized object (optionally replicated)
– On Disk – Serialized (optionally replicated)
– In memory file system like Tachyon
RDDs store lineage information
– Support coarse grain recovery of whole partition upon node failure
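The lineage-based recovery above can be sketched in plain Python (this is an illustration of the idea, not the Spark API): a partition keeps a reference to its source plus the chain of transformations that produced it, so a lost partition is rebuilt by recomputation rather than restored from a replica.

```python
# Illustrative sketch of coarse-grained recovery via lineage.
def build_partition(source, lineage):
    """Recompute a partition by replaying its lineage over the source."""
    data = list(source)
    for transform in lineage:
        data = list(transform(data))
    return data

source = [1, 2, 3, 4, 5, 6]
lineage = [
    lambda d: map(lambda x: x * 10, d),    # a map transformation
    lambda d: filter(lambda x: x > 20, d), # a filter transformation
]

partition = build_partition(source, lineage)   # [30, 40, 50, 60]

# Simulate losing the partition: the data is gone, but source + lineage remain.
partition = None
recovered = build_partition(source, lineage)
print(recovered)  # [30, 40, 50, 60]
```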
- 16. 16© Copyright 2013 Pivotal. All rights reserved.
Two Categories of Operations on RDD
Transforms
– Create from stable storage (hdfs, tachyon, etc.)
– Generate RDD from other RDD (map, filter, groupBy)
– Lazy Operations that build a DAG of Tasks
– Once Spark knows your transformations it can build a plan
Actions
– Return a result or write to storage (count, collect, save, etc.)
– Actions cause the DAG to execute (like Apache Pig)
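The lazy-transformations-vs-eager-actions split can be sketched in plain Python with generators (an analogy only, not the Spark API): chaining operations builds a plan but computes nothing until an "action" pulls the results.

```python
# Records which elements were actually processed.
executed = []

def mark(x):
    executed.append(x)
    return x

data = range(1, 6)
plan = (mark(x) * 2 for x in data)   # "transformation": nothing runs yet
plan = (x for x in plan if x > 4)    # another lazy transformation

assert executed == []                # no work has been done so far

result = list(plan)                  # "action": triggers the whole chain
print(result)        # [6, 8, 10]
print(executed)      # [1, 2, 3, 4, 5]
```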
- 17. 17© Copyright 2013 Pivotal. All rights reserved.
Transformation and Actions
Transformations
– map
– filter
– flatMap
– sample
– groupByKey
– reduceByKey
– union
– join
– sort
Actions
– count
– collect
– reduce
– lookup
– save
- 18. 18© Copyright 2013 Pivotal. All rights reserved.
Spark Shared Variables (between Tasks & Driver)
Broadcast variables
– Read only variable cached once on each node (not shipped w/ each task)
– Multiple tasks running on the node can refer to it
– Broadcast variable should be used in the program after it is created
– Original variable should not be modified after broadcast
val broadcastVar = sc.broadcast(Array(1, 2, 3))
Accumulators
– Accumulator variables are like counters in M/R (i.e., tasks can add to them)
– Tasks cannot read the value of an accumulator variable; only the driver program can read it
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
Local variables of the transformation function are shipped to each node along with the function
– They cannot be accessed by the driver program
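The isolation described above, where tasks cannot usefully mutate driver-side variables (the problem accumulators solve), can be sketched in plain Python; the task/merge structure here is illustrative, not Spark's actual mechanism.

```python
counter = 0

def task(partition, state):
    # On a real cluster the closure is serialized and shipped, so each
    # task works on its own copy; here, rebinding `local` similarly
    # never updates the driver's `counter`.
    local = state
    for _ in partition:
        local += 1
    return local

partitions = [[1, 2], [3, 4, 5]]
for p in partitions:
    task(p, counter)
assert counter == 0  # the driver's counter was never updated

# Accumulator style: tasks produce partial counts, the driver merges them.
total = sum(task(p, 0) for p in partitions)
print(total)  # 5
```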
- 19. 19© Copyright 2013 Pivotal. All rights reserved.
RDDs from External & Internal data sets
Parallelize existing collection to create RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, num_partitions)
Create RDD from external data source
– Supports local FS, HDFS, Cassandra, HBase, Amazon S3
– Supports text files, sequence files, Hadoop InputFormats
– A local FS file path should be available and the same on all nodes
– Reading directories is supported
– textFile() by default makes one partition for each HDFS file block
scala> val distFile = sc.textFile("data.txt", num_partitions)
- 20. 20© Copyright 2013 Pivotal. All rights reserved.
RDD Persistence
Persisting the result RDD after a series of transformations is recommended
– This makes further actions on the result RDD much faster (no recomputation)
– RDD.cache() – in memory persistence
RDD Persistence options
– MEMORY_ONLY
▪ Store as deserialized Java objects. If the RDD does not fit in memory, some partitions will be recomputed when needed
– MEMORY_AND_DISK
▪ Store as deserialized Java objects. If the RDD does not fit in memory, store the extra partitions on disk
– MEMORY_ONLY_SER
▪ Store as serialized Java objects (one byte array per partition). More space-efficient
– MEMORY_AND_DISK_SER
▪ Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk
– DISK_ONLY
▪ Store the RDD partitions only on disk.
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.:
▪ Same as the levels above, but replicate each partition on two cluster nodes.
– OFF_HEAP (experimental) :
▪ Store RDD in serialized format in Tachyon, Reduce GC overhead, Share RDDs across Apps
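The deserialized-vs-serialized trade-off (MEMORY_ONLY vs. MEMORY_ONLY_SER) can be sketched in plain Python using pickle as a stand-in serializer: one byte array per partition is usually more compact than live objects, at the cost of a deserialization step on every access. The sizes below are rough illustrations, not Spark's actual accounting.

```python
import pickle
import sys

partition = [(i, "record-%d" % i) for i in range(10000)]

# Rough lower bound on size as live objects: the container plus the
# per-tuple overhead (ignores the ints and strings inside each tuple).
live_size = sys.getsizeof(partition) + sum(
    sys.getsizeof(rec) for rec in partition)

# Size as a single serialized byte array (MEMORY_ONLY_SER style).
blob = pickle.dumps(partition)
ser_size = len(blob)

print(live_size > ser_size)        # serialized form is more compact
restored = pickle.loads(blob)      # ...but costs CPU to read back
assert restored == partition
```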
- 21. 21© Copyright 2013 Pivotal. All rights reserved.
How to choose RDD Persistence level
Persistence levels trade off memory usage against CPU efficiency.
If your RDDs fit comfortably in the default storage level (MEMORY_ONLY), leave them that way
If not, try MEMORY_ONLY_SER and select a fast serialization library
– Spilling to disk is costly (Spark by default uses Java serialization)
– Python uses the Pickle library by default
– Scala/Java can use the Kryo library, which is faster than default Java serialization
Use the replicated storage levels if you want fast fault recovery
In environments with high amounts of memory or multiple applications, the
experimental OFF_HEAP mode has several advantages:
– It allows multiple executors to share the same pool of memory in Tachyon.
– It significantly reduces garbage collection costs.
– Cached data is not lost if individual executors crash.
- 22. 22© Copyright 2013 Pivotal. All rights reserved.
Spark Configuration
Spark properties
– Control most application parameters and can be set by using a SparkConf
object, or through Java system properties
– Precedence
▪ SparkConf in driver program,
▪ spark submit options,
▪ Default conf file in spark conf directory
– http://spark.apache.org/docs/latest/configuration.html
Environment variables
– Can be used to set per-machine settings, such as the IP address, through the
conf/spark-env.sh script on each node.
Logging
– can be configured through log4j.properties
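The precedence order above can be sketched as a simple layered merge in plain Python (property names and values here are illustrative): settings from the driver's SparkConf win over spark-submit options, which win over the defaults file.

```python
# Lowest to highest precedence; later layers override earlier ones.
defaults_file = {"spark.executor.memory": "512m", "spark.master": "local"}
submit_opts = {"spark.executor.memory": "1g"}
driver_conf = {"spark.app.name": "demo"}

effective = {**defaults_file, **submit_opts, **driver_conf}
print(effective)
# {'spark.executor.memory': '1g', 'spark.master': 'local', 'spark.app.name': 'demo'}
```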
- 23. 23© Copyright 2013 Pivotal. All rights reserved.
Spark Application Monitoring
Web Interface
– Includes
▪ A list of scheduler stages and tasks
▪ A summary of RDD sizes and memory usage
▪ Environmental information.
▪ Information about the running executors
– Provides a web UI for the running application at
▪ http://<driver-node>:4040
– Provides the ability to view application information after it has finished
▪ Set spark.eventLog.enabled = true
▪ Set spark.eventLog.dir = file:///tmp/spark-events
▪ Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
- 24. 24© Copyright 2013 Pivotal. All rights reserved.
How Spark Runs
DAGs, shuffle’s, tasks, stages, etc.
- 26. 26© Copyright 2013 Pivotal. All rights reserved.
What happens
Create RDDs
Pipeline operations as much as possible
– When a result doesn’t depend on other results, we can pipeline
– But when data needs to be reorganized, we can no longer pipeline
A stage is a group of pipelined (merged) operations
Each stage gets a set of tasks
A task is data plus computation
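The pipelining idea can be sketched in plain Python (an analogy, not Spark's scheduler): with lazy evaluation, a map and a filter alternate per element in one fused pass, while a group-by needs every upstream result before it can start, which is where a new stage would be cut.

```python
order = []

def double(x):
    order.append(("map", x))
    return x * 2

def keep(y):
    order.append(("filter", y))
    return y > 2

data = [3, 1, 2]

# "Stage 1": map and filter are pipelined element by element.
stage1 = [y for y in (double(x) for x in data) if keep(y)]
print(order)
# [('map', 3), ('filter', 6), ('map', 1), ('filter', 2), ('map', 2), ('filter', 4)]

# "Stage 2": grouping must see all of stage1 at once (a stage boundary).
groups = {}
for y in stage1:
    groups.setdefault(y % 4, []).append(y)
print(stage1, groups)   # [6, 4] {2: [6], 0: [4]}
```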
- 29. 29© Copyright 2013 Pivotal. All rights reserved.
Stages running
The number of partitions matters for concurrency
Rule of thumb: at least 2x the number of cores
- 30. 30© Copyright 2013 Pivotal. All rights reserved.
The Shuffle
Redistributes data among partitions
– Hash keys into buckets
– Pull, not push
– Writes intermediate files to disk
– Becoming pluggable
Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
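The "hash keys into buckets" step can be sketched in plain Python (a toy hash partitioner, not Spark's implementation): each (key, value) record is routed to an output partition by hashing its key, so every value for a given key lands in the same bucket regardless of where the record started.

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a bucket by hash of the key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
buckets = hash_partition(records, 3)

# Every occurrence of a key ends up in exactly one bucket.
for key in ("a", "b", "c"):
    holders = [i for i, b in enumerate(buckets)
               if any(k == key for k, _ in b)]
    assert len(holders) == 1
print(buckets)
```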
- 31. 31© Copyright 2013 Pivotal. All rights reserved.
Other thoughts on Memory
By default Spark owns 90% of the memory
Partitions don’t have to fit in memory, but some things do
– E.g., values for large sets in groupBys must fit in memory
Shuffle memory is 30%
– If it goes over that, it’ll spill the data to disk
– Shuffle always writes to disk
Turn on compression to keep objects serialized
– Saves space, but takes compute to serialize/de-serialize
- 32. 32© Copyright 2013 Pivotal. All rights reserved.
Spark Deployment
modes
- 33. 33© Copyright 2013 Pivotal. All rights reserved.
Spark Topology/Deployment modes
Local
– Great for Dev
Spark Cluster (master/slaves)
– Improving rapidly
Cluster Resource Managers
– YARN
– MESOS
- 38. 38© Copyright 2013 Pivotal. All rights reserved.
Spark Hands-on
- 39. 39© Copyright 2013 Pivotal. All rights reserved.
Spark Hands-on
How to Build Spark?
Spark on a dev environment (Mac)
Spark on AWB
PySpark & Lambda Functions
Spark Examples using PySpark
Web UI/Debugging
- 40. 40© Copyright 2013 Pivotal. All rights reserved.
Spark Code/Build
Spark Git Repository
– https://github.com/apache/spark
Download Spark
– git clone https://github.com/apache/spark
Build Spark
– Maven
– Shell script
- 41. 41© Copyright 2013 Pivotal. All rights reserved.
Build Spark with Maven
Setting up Maven’s Memory Usage
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Specifying the Hadoop Version
# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
- 42. 42© Copyright 2013 Pivotal. All rights reserved.
Build Spark using Script
Use make distribution script
Abstraction using maven
– ./make-distribution.sh --skip-java-test -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0
– Creates a dist/ folder containing Spark artifacts
– --tgz creates a Spark distribution tarball
- 43. 43© Copyright 2013 Pivotal. All rights reserved.
Spark on AWB
Access
– ssh manis2@acs04.analyticsworkbench.com -p 45326
Topology
– https://portal.analyticsworkbench.com/projects/awbhome/wiki/Cluster_Topology
– Spark Admin: http://access3.ic.analyticsworkbench.com:4040
Directory
– /usr/share/spark
Spark using Yarn Cluster
- 44. 44© Copyright 2013 Pivotal. All rights reserved.
Using Spark Submit
Suitable for yarn-cluster mode
export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/spark-examples*.jar 10
- 45. 45© Copyright 2013 Pivotal. All rights reserved.
Using Spark Shell
Suitable for yarn-client mode
Interactive shell suitable for debugging
export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
bin/pyspark --master yarn-client --num-executors 2
- 46. 46© Copyright 2013 Pivotal. All rights reserved.
Spark Logging
Spark logs contain the following information
– # of partitions
– Size of tasks
– Which nodes tasks are running on
– Progress of tasks
Yarn Container Logs
– Available locally
– Aggregated on HDFS
- 47. 47© Copyright 2013 Pivotal. All rights reserved.
Spark Web Interface
Web UI: http://<driver-node>:4040
One port for each application (aka SparkContext)
Shows the following information
– List of scheduler stages & tasks
– Summary of RDD sizes & memory usage
– Environmental information
– Information about running executors
Provides the ability to view application information after it has finished
– Set spark.eventLog.enabled = true
– Set spark.eventLog.dir = file:///tmp/spark-events
– Run History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
- 48. 48© Copyright 2013 Pivotal. All rights reserved.
Spark Examples
Line with most words
Lines with a particular word
Word count
Sorting
PageRank
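The word-count example from this list can be sketched in plain Python, mirroring the flatMap → map → reduceByKey pipeline a PySpark version would use (the input lines here are made up for illustration):

```python
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: each line -> many words
words = [w for line in lines for w in line.split()]

# map: each word -> (word, 1)
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```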
- 49. 49© Copyright 2013 Pivotal. All rights reserved.
Berkeley Data Stack -
Related Projects
Things that use Spark Core
- 50. 50© Copyright 2013 Pivotal. All rights reserved.
Berkeley Data Analytics Stack (BDAS)
Supports
– Batch
– Streaming
– Interactive
Makes it easy to compose them
https://amplab.cs.berkeley.edu/software/
- 51. 51© Copyright 2013 Pivotal. All rights reserved.
Spark SQL
Lib in Spark Core that models RDDs as relations
– SchemaRDD
Replaces Shark
– Lighter weight version with no code from Hive
Import/Export in different Storage formats
– Parquet, learn schema from existing Hive warehouse
Takes columnar storage from Shark
- 52. 52© Copyright 2013 Pivotal. All rights reserved.
Spark Streaming
Extend Spark to do large scale stream processing
– 100s of nodes with second-scale end-to-end latency
Simple, batch like API with RDDs
A single set of semantics for both real-time and high-latency (batch) processing
Other features
– Window-based Transformations
– Arbitrary join of streams
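The window-based transformations mentioned above can be sketched in plain Python (an analogy, not the Spark Streaming API): the stream is chopped into small batches, and each windowed result covers the last N batches.

```python
batches = [[1, 2], [3], [4, 5], [6]]  # illustrative micro-batches

def windowed_counts(batches, window):
    """For each batch, count the elements in the last `window` batches."""
    results = []
    for i in range(len(batches)):
        start = max(0, i - window + 1)
        merged = [x for b in batches[start:i + 1] for x in b]
        results.append(len(merged))
    return results

print(windowed_counts(batches, 3))  # [2, 3, 5, 4]
```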
- 53. 53© Copyright 2013 Pivotal. All rights reserved.
Streaming (cont)
Input is broken up into batches that become RDDs
RDDs are composed into DAGs to generate output
Raw data is replicated in memory for fault tolerance
- 54. 54© Copyright 2013 Pivotal. All rights reserved.
GraphX (Alpha)
Graph processing library
– Replaces Spark Bagel
Graph Parallel not Data Parallel
– Reason in the context of neighbors
– GraphLab API
Graph Creation => Algorithm => Post Processing
– Existing systems mainly deal with the Algorithm and not interactive
– Unify collection and graph models
- 55. 55© Copyright 2013 Pivotal. All rights reserved.
MLbase
Machine Learning toolset
– Library and higher level abstractions
General tool in space is MatLab
– Difficult for end users to learn, debug, scale solutions
Starting with MLlib
– Low level Distributed Machine Learning Library
Many different Algorithms
– Classification, Regression, Collaborative Filtering, etc.
- 56. 56© Copyright 2013 Pivotal. All rights reserved.
Thanks!
- 57. 57© Copyright 2013 Pivotal. All rights reserved.
Data Science Platform
[Architecture diagram: a Data Lake (Hadoop HDFS / Isilon / virtual storage) at the base; cluster managers (YARN, Mesos); Spark with RDD/M-R, SparkSQL, Streaming, and MLbase alongside MPP SQL, an IMDG (GemFireXD), and a stream server; an App Data Platform (SQL, Objects, JSON) and application platform above; legacy systems and data sources feeding in; data scientists/analysts, app dev/ops, and end users consuming.]
- 58. 58© Copyright 2013 Pivotal. All rights reserved.
Backup Slides
- 59. 59© Copyright 2013 Pivotal. All rights reserved.
PHD
General Solution Pipeline
[Pipeline diagram: streaming ingest of machine data via a stream message source over RabbitMQ transport, through a message transformer to an HDFS sink and a GemFire (IMDB) tap; analytics taps compute counters and gauges, exposed via SQL and a REST API to a dashboard.]
- 60. 60© Copyright 2013 Pivotal. All rights reserved.
PHD
Where’s Spark?
[The same pipeline diagram as the previous slide, posing where Spark fits: machine data → stream message source → transport → message transformer → analytics taps (counters and gauges) → HDFS sink / GemFire tap → SQL / REST API → dashboard.]
- 61. 61© Copyright 2013 Pivotal. All rights reserved.
Slides by Shivram
How to download/build and run spark examples
– Build w/ a specific version of Hadoop (PHD?)
▪ http://spark.apache.org/docs/latest/building-with-maven.html
Spark cluster deployment modes
– YARN & single node (EC2, Mesos, Standalone)
– http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
– http://spark.apache.org/docs/latest/running-on-yarn.html
– Submit apps vs interactive shell
Explain simple spark example and run it
▪ run-example, spark-shell/pyspark, spark-submit
▪ http://spark.apache.org/docs/latest/quick-start.html
- 62. 62© Copyright 2013 Pivotal. All rights reserved.
Deployment modes (+YARN slide)
Local
– Great for Dev
Spark Cluster (master/slaves)
– Improving rapidly
Cluster Resource Managers
– YARN
– MESOS
- 63. 63© Copyright 2013 Pivotal. All rights reserved.
Intro to Spark based Projects
– Spark SQL
– Spark Streaming
– MLBase
– GraphX