Andrei Avramescu
Radu Chilom
xPatterns on Spark, Tachyon and Mesos
2
• Introduction
• Spark
• SparkSQL
• Tachyon
• Mesos
• Lessons learned
• Demo: xPatterns
• Q & A
Agenda
3
o Big data analytics / machine learning
o Offices in Seattle and Timisoara
o 5+ years with the Hadoop ecosystem
o 1 year with Spark
4
Apache Spark
• “Fast and general engine for large-scale data processing”
• Open source
• APIs for Java, Scala, and Python (80 operators)
• Not bound to the map-reduce paradigm
• Powers a stack of high-level tools, including Spark SQL, MLlib, and Spark Streaming
5
SparkContext
• Main entry point to Spark
• SparkConf: spark.app.name, spark.master, spark.serializer, spark.cores.max, spark.task.cpus
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
o "url": cluster URL, or local / local[N]
o "name": application name
o "sparkHome": Spark install path on the cluster
o Seq("app.jar"): list of JARs with application code (to ship)
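The properties listed on this slide can also be set through a SparkConf that is then handed to the SparkContext. The following is a minimal sketch, not code from the talk: the application name, the local[2] master, and the core limits are placeholder values chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextSketch {
  // Builds a configuration; every value below is an illustrative placeholder.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("xpatterns-sketch") // spark.app.name
      .setMaster("local[2]")          // spark.master: two local threads
      .set("spark.cores.max", "2")    // core cap (standalone / coarse-grained Mesos)
      .set("spark.task.cpus", "1")    // cores reserved per task

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try println(sc.parallelize(1 to 10).count())
    finally sc.stop() // always release the context
  }
}
```

Keeping the configuration in a separate function makes it easy to swap the master URL for a real cluster address without touching the job code.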
6
Key Concept: RDDs
Resilient Distributed Dataset
• Immutable collection of elements partitioned across the cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure (lineage)
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
7
Creating RDDs
Parallelize a collection into an RDD
> sc.parallelize(List(1, 2, 3))
Load a text file from the local FS, HDFS, or S3
> sc.textFile("test.txt")
> sc.textFile("textDir/*.txt")
> sc.textFile("hdfs://...")
Use an existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(path, inputFormatClass, keyClass, valueClass)
8
Basic Transformations
> val nums = sc.parallelize(List(1, 2, 3))
Pass each element through a function
> val squares = nums.map(x => x * x) // {1, 4, 9}
Keep elements passing a predicate
> val even = squares.filter(x => x % 2 == 0) // {4}
Basic Actions
Retrieve RDD contents as a local collection
> nums.collect() // => Array(1, 2, 3)
Return the first K elements
> nums.take(2) // => Array(1, 2)
Count the number of elements
> nums.count() // => 3
9
RDD Persistence
• persist() or cache()
• MEMORY_ONLY, MEMORY_AND_DISK
• MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
• DISK_ONLY
• OFF_HEAP
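A storage level from the list above is passed to persist(); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). The sketch below is illustrative only: the data set and the chosen level are assumptions, and it runs on a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (number of elements, sum of squares) for 1..1000.
  def countAndSum(): (Long, Long) = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)
      // Keep serialized partitions in memory and spill to disk if they do not fit.
      squares.persist(StorageLevel.MEMORY_AND_DISK_SER)
      val n = squares.count()           // first action computes and stores the partitions
      val total = squares.reduce(_ + _) // second action reuses the stored partitions
      squares.unpersist()
      (n, total)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(countAndSum())
}
```

Persisting only pays off when an RDD is used by more than one action, as in the two calls here.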
10
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• Example:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
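The lineage of an RDD can be inspected with toDebugString. The sketch below rebuilds the chain from the example on a small in-memory collection instead of an HDFS file (an assumption made so it runs locally); the RDD class names printed depend on the Spark release and may differ from the ones on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the textual lineage of a filter -> map -> cache chain.
  def lineage(): String = {
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Two made-up, tab-separated log lines stand in for the HDFS input.
      val cachedMsgs = sc.parallelize(Seq("error\tdisk\tdisk full", "info\tapp\tstarted"))
        .filter(_.contains("error"))
        .map(_.split('\t')(2))
        .cache()
      cachedMsgs.toDebugString
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(lineage())
}
```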
11
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns
val lines = sc.textFile("hdfs://...")             // base RDD
val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()                 // cached RDD
cachedMsgs.filter(_.contains("foo")).count        // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .
[Figure: the Driver exchanges tasks and results with three Workers, each with its own block (Block 1 to 3) and cache (Cache 1 to 3).]
12
Shared Variables
• Broadcast Variables
• Accumulators
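A broadcast variable ships a read-only value to every worker once, and an accumulator collects counts from tasks back to the driver. The sketch below is an illustration with made-up data. It assumes Spark 2.0 or later, where accumulators are created with sc.longAccumulator; the 1.x releases discussed in this talk used sc.accumulator instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Resolves country codes through a broadcast lookup table and counts unknown codes.
  def run(): (List[String], Long) = {
    val conf = new SparkConf().setAppName("shared-vars-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical lookup table, sent to each worker once.
      val names = sc.broadcast(Map("RO" -> "Romania", "US" -> "United States"))
      val unknown = sc.longAccumulator("unknown codes")
      val codes = sc.parallelize(Seq("RO", "US", "XX", "RO"))

      // Update the accumulator inside an action, so each element is counted once.
      codes.foreach(code => if (!names.value.contains(code)) unknown.add(1))
      val resolved = codes.map(code => names.value.getOrElse(code, "?")).collect().toList
      (resolved, unknown.sum)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```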
13
Spark tuning
• Data Serialization
o Java serialization (default)
o Kryo serialization
14
Spark tuning
Kryo serialization: spark.kryoserializer.buffer.mb
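Switching to Kryo is a configuration change. The sketch below shows one way to do it; the Record class and the 8 MB buffer are assumptions for illustration. It uses the spark.kryoserializer.buffer key from Spark 1.4 and later; earlier releases, including the ones in this talk, take spark.kryoserializer.buffer.mb with a plain number of megabytes. registerKryoClasses requires Spark 1.2 or later.

```scala
import org.apache.spark.SparkConf

object KryoSketch {
  // Hypothetical record type, present only so there is a class to register.
  case class Record(id: Long, code: String)

  def kryoConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-sketch")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Initial per-core serialization buffer; raise it if large objects fail to serialize.
      .set("spark.kryoserializer.buffer", "8m")
      // Registered classes are written with a compact id instead of the full class name.
      .registerKryoClasses(Array(classOf[Record]))
}
```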
15
Spark tuning
• Level of Parallelism
o number of partitions
o “reduce” operations use the largest parent RDD’s number of partitions by default
o spark.default.parallelism
• Memory Usage of Reduce Tasks
• Broadcasting Large Variables
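The effect of the parallelism settings can be checked by counting partitions. This is a small sketch with made-up data: it assumes spark.default.parallelism is not set, so the first reduceByKey inherits the parent's four partitions, while the second asks for eight explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations before Spark 1.3; harmless later

object ParallelismSketch {
  // Returns (partitions with the default, partitions with an explicit argument).
  def partitionCounts(): (Int, Int) = {
    val conf = new SparkConf().setAppName("parallelism-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(1 to 100, 4).map(x => (x % 10, 1))
      val byDefault = pairs.reduceByKey(_ + _)   // largest parent RDD's partition count
      val explicit = pairs.reduceByKey(_ + _, 8) // level of parallelism as second argument
      (byDefault.partitions.length, explicit.partitions.length)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(partitionCounts())
}
```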
16
• Improved running time
Spark vs Hadoop
17
Spark vs Hadoop (word count)
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
18
Spark vs Hadoop (word count)
val sc = new SparkContext("spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
LOC: 6 (11) vs 35
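For readers who want to run the Spark word count without a cluster, here is a self-contained variant. It is a sketch under stated assumptions: it uses a local master and an in-memory list of lines instead of the HDFS paths, and the object and function names are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations before Spark 1.3; harmless later

object WordCountLocal {
  // Counts words in the given lines using the same flatMap / map / reduceByKey pipeline.
  def countWords(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)
        .flatMap(line => line.split(" "))
        .filter(_.nonEmpty) // skip empty tokens produced by repeated spaces
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collect()
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    countWords(Seq("a b a", "b c")).foreach(println)
}
```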
19
• Many more operators than map-reduce
• Hadoop has a bigger and older community
• The two coexist happily
Spark vs Hadoop
20
• Shark modified the Hive backend to run over Spark, but had drawbacks:
o Limited integration with Spark programs
o Hive optimizer not designed for Spark
Spark SQL (alpha)
21
• Spark SQL reuses the best parts of Shark
o Hive data loading
o In-memory column store
Spark SQL (alpha)
22
• Adds
o Support for multiple input formats
o Rich language interfaces
o RDD-aware optimizer
Spark SQL (alpha)
23
• SchemaRDD
o Row objects
o Schema
• Row objects can be:
o Case Classes (Scala)
o Beans (Java)
Spark SQL (alpha)
24
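Spark SQL (alpha)
Sample input (people.txt)
Radu,24,1.70
Andrei,23,1.88
Create SQL Context
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sparkContext)
import sqlContext.createSchemaRDD // implicit RDD-to-SchemaRDD conversion (Spark 1.x)
Create people RDD and register table
case class Person(name: String, age: Int, height: Double) // "height" is an assumed name for the third column
val people = sparkContext
  .textFile("examples/src/main/resources/people.txt").map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt, p(2).trim.toDouble))
people.registerTempTable("people")
Query table
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 23")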
25
• Running time improvement of 4x to 30x
• Bucketing
• Bucket Joins
• Skew Joins
• Partial DAG Execution
SparkSQL vs Hive
26
Spark 0.8.0 to 1.1
• 0.8.0: first POC … lots of OOM (out-of-memory errors)
• 0.8.1: first production deployment, still lots of OOM
o 20 billion healthcare records, 200 TB of compressed HDFS data
o Hadoop MR: 100 m1.xlarge (4c x 15 GB)
o BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM on the map and reducer sides
o Perf gains of 4x to 40x, required individual dataset and query fine-tuning
o Mixed Hive & Shark workloads where it made sense
o Daily processing reduced from 14 hours to 1.5 hours!
• 0.9.0: fixed many of the problems, but still requires patches! Spilling on the reducer side fixed (less OOM)
• 1.0.2: in production today
• 1.1: upgrade in progress
27
Mesos (0.20)
• Cluster resource manager
• Multi-resource scheduling (memory, CPU, disk, and ports)
• Scalability to 10,000s of nodes
• Fault-tolerant replicated master and slaves using ZooKeeper
28
Tachyon (v0.5)
• Memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks
• Pluggable underlying file system: HDFS, S3, local file system, …
29
• Java-like File API / FileSystem API
• Configurable block size
• Memory management
Tachyon (v0.5)
30
Community contribution
• Jaws, the xPatterns HTTP Spark SQL server
o http://github.com/Atigeo/http-spark-sql-server
o Backward compatible with Shark
o Backend in spray.io (REST on Akka)
• Spark Job Server
o Multiple Spark contexts in the same JVM, job submission in Java + Scala
o https://github.com/Atigeo/spark-job-rest
• Mesos framework starvation bug
• *SchedulerBackend update due to race conditions, Spark 0.9.0 patches
31
• Read the papers
• Fine-tuning can really improve your running time
• When using Spark, don’t think map-reduce
Lessons learned
32
Demo
33
Q & A
34
Bibliography
• Apache Spark: https://spark.apache.org/
• Parallel programming with Spark: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/Parallel-Programming-With-Spark-Matei-Zaharia-Strata-2013.pdf
• Introduction to Spark internals: spark.apache.org/talks/dev-meetup-dec-2012.pptx
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this
presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided
after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

xPatterns on Spark, Tachyon and Mesos - Bucharest meetup

  • 1. Andrei Avramescu Radu Chilom xPatterns on Spark, Tachyon and Mesos
  • 2. Agenda: Introduction • Spark • SparkSQL • Tachyon • Mesos • Lessons learned • Demo: xPatterns • Q & A
  • 3. Big data analytics / machine learning • Offices in Seattle and Timisoara • 5+ years with the Hadoop ecosystem • 1 year with Spark
  • 4. Apache Spark • “fast and general engine for large-scale data processing” • open source • API for Java/Scala/Python (80 operators) • not bound to the map-reduce paradigm • powers a stack of high-level tools including Spark SQL, MLlib and Spark Streaming
  • 5. SparkContext • Main entry point to Spark • SparkConf: spark.app.name, spark.master, spark.serializer, spark.cores.max, spark.task.cpus • val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar")) • Arguments: "url" is the cluster URL, or local / local[N]; "name" is the app name; "sparkHome" is the Spark install path on the cluster; Seq("app.jar") is the list of JARs with app code (to ship)
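
The SparkConf keys listed on slide 5 can also be set programmatically before the context is created. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the application name and all values are illustrative placeholders, not the presenters' settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Build a configuration from the keys named on the slide.
  // Every value below is an assumption chosen for illustration.
  def buildConf(): SparkConf =
    new SparkConf()
      .setMaster("local[2]")            // spark.master: local mode with 2 threads
      .setAppName("xpatterns-sketch")   // spark.app.name (hypothetical name)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.cores.max", "4")      // upper bound on cores requested from the cluster
      .set("spark.task.cpus", "1")      // cores reserved per task

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try {
      // Tiny job to confirm the context works.
      println(sc.parallelize(1 to 3).count())
    } finally {
      sc.stop()
    }
  }
}
```

Passing a SparkConf keeps the settings in one place, whereas the four-argument constructor on the slide covers only the master URL, app name, Spark home and JAR list.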
  • 6. Key Concept: RDDs • Resilient Distributed Dataset: immutable collection of elements partitioned across the cluster, stored in RAM or on disk • Built through parallel transformations • Automatically rebuilt on failure (lineage) • Operations: transformations (e.g. map, filter, groupBy) and actions (e.g. count, collect, save)
  • 7. Creating RDDs • Parallelize a collection into an RDD: > sc.parallelize(List(1, 2, 3)) • Load a text file from local FS, HDFS, or S3: > sc.textFile("test.txt") > sc.textFile("textDir/*.txt") > sc.textFile("hdfs://...") • Use an existing Hadoop InputFormat (Java/Scala only): > sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  • 8. Basic Transformations • > val nums = sc.parallelize(List(1, 2, 3)) • Pass each element through a function: > val squares = nums.map(x => x * x) // {1, 4, 9} • Keep elements passing a predicate: > val even = squares.filter(x => x % 2 == 0) // {4} • Basic Actions • Retrieve RDD contents as a local collection: > nums.collect() // => Array(1, 2, 3) • Return first K elements: > nums.take(2) // => Array(1, 2) • Count number of elements: > nums.count() // => 3
  • 9. RDD Persistence • persist() or cache() • Storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, OFF_HEAP
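
To illustrate the persistence slide, here is a minimal sketch of choosing a storage level explicitly. It assumes Spark on the classpath and a SparkContext supplied by the caller; the object and method names are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Persists a small RDD with a serialized, disk-backed level and
  // returns the element count together with the level that was applied.
  def run(sc: SparkContext): (Long, StorageLevel) = {
    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    // MEMORY_AND_DISK_SER stores serialized bytes and spills to disk
    // when the partitions do not fit in memory.
    squares.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val n = squares.count()               // first action materializes the partitions
    val level = squares.getStorageLevel   // read back the level before releasing it
    squares.unpersist()                   // free the stored blocks
    (n, level)
  }
}
```

The serialized levels trade CPU time for a smaller memory footprint, which is one of the usual levers when jobs run out of memory.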
  • 10. RDD Fault Tolerance • RDDs maintain lineage information that can be used to reconstruct lost partitions • Ex: cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)).cache() • Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
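
The lineage chain from slide 10 can be inspected at runtime. The sketch below is an assumption-laden illustration: it replaces the HDFS input with an in-memory collection so it runs locally, and the object and method names are hypothetical. `toDebugString` is the RDD method that prints the recorded lineage; its exact text varies between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Same filter/map/cache pipeline as on the slide, fed from a local Seq
  // instead of HDFS. Each line is expected to have at least three
  // tab-separated fields; shorter lines would fail at split(...)(2).
  def errorFields(sc: SparkContext, lines: Seq[String]): RDD[String] =
    sc.parallelize(lines)
      .filter(_.contains("error"))
      .map(_.split('\t')(2))
      .cache()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))
    try {
      val rdd = errorFields(sc, Seq("h1\tapp\terror: disk full", "h2\tapp\tok"))
      // Prints the chain of parent RDDs Spark would replay to rebuild a lost partition.
      println(rdd.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```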
  • 11. Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns • lines = spark.textFile("hdfs://...") // base RDD • errors = lines.filter(_.startsWith("ERROR")) // transformed RDD • messages = errors.map(_.split('\t')(2)) • cachedMsgs = messages.cache() // cached RDD • cachedMsgs.filter(_.contains("foo")).count // parallel operation • cachedMsgs.filter(_.contains("bar")).count • [Diagram: the driver sends tasks to three workers and collects results; each worker reads one block (Block 1 to 3) and keeps its partition in a cache (Cache 1 to 3)]
  • 13. Spark tuning • Data serialization: Java serialization (default) or Kryo serialization
  • 14. Spark tuning • Kryo serialization: spark.kryoserializer.buffer.mb
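
As a sketch of the Kryo tuning mentioned on slides 13 and 14, the configuration below switches the serializer and registers one class. The `Claim` record type, the registrator class and the buffer value are hypothetical; the buffer key is the Spark 1.x name shown on the slide, which later Spark releases deprecate in favour of `spark.kryoserializer.buffer`.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical record type, used only for illustration.
case class Claim(id: Long, code: String)

// Registering classes up front lets Kryo write a compact numeric id
// instead of the full class name with every object.
class ClaimRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Claim])
  }
}

object KryoSketch {
  def kryoConf(): SparkConf =
    new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[ClaimRegistrator].getName)
      // Buffer size in MB (assumed value); raise it if large objects fail to serialize.
      .set("spark.kryoserializer.buffer.mb", "64")
}
```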
  • 15. Spark tuning • Level of parallelism: number of partitions; “reduce” operations default to the largest parent RDD’s number of partitions; spark.default.parallelism • Memory usage of reduce tasks • Broadcasting large variables
  • 16. Spark vs Hadoop • Improved running time
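
Two of the tuning points on slide 15, an explicit partition count for a reduce and a broadcast variable, can be sketched as follows. The lookup table, the method name and the input data are assumptions made for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits, required on Spark releases before 1.3

object TuningSketch {
  // Sums event counts per country name, using a broadcast lookup table
  // and an explicit number of reduce partitions.
  def countByCountry(sc: SparkContext,
                     events: Seq[(String, Int)],
                     numPartitions: Int): Map[String, Int] = {
    // Broadcast ships the table to each executor once instead of with every task.
    val names = sc.broadcast(Map("RO" -> "Romania", "US" -> "United States"))

    sc.parallelize(events)
      .map { case (code, n) => (names.value.getOrElse(code, "unknown"), n) }
      // Explicit partition count for the shuffle. Without it Spark falls back to
      // spark.default.parallelism if set, else the largest parent's partition count.
      .reduceByKey(_ + _, numPartitions)
      .collect()
      .toMap
  }
}
```

More reduce partitions mean smaller per-task working sets, which is the usual remedy when reduce tasks run out of memory.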
  • 17. Spark vs Hadoop (word count) • Hadoop MapReduce version: public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } } public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);
  • 18. Spark vs Hadoop (word count) • Spark version: val sc = new SparkContext("spark://...", "MyJob", home, jars) val file = sc.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") • LOC: 6 (11) vs 35
  • 19. Spark vs Hadoop • Many more operators than map-reduce • Hadoop has a bigger and older community • The two happily coexist
  • 20. Spark SQL (alpha) • Shark modified the Hive backend to run over Spark, but had drawbacks: limited integration with Spark programs; Hive optimizer not designed for Spark
  • 21. Spark SQL (alpha) • Spark SQL reuses the best parts of Shark: Hive data loading; in-memory column store
  • 22. Spark SQL (alpha) • Adds: support for multiple input formats; rich language interfaces; RDD-aware optimizer
  • 23. Spark SQL (alpha) • SchemaRDD: Row objects plus a schema • Row objects can be: case classes (Scala); beans (Java)
  • 24. Spark SQL (alpha) • Create SQL context: val sqlContext = new SQLContext(sparkContext) • Create people RDD and register table: val people = sparkContext.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt, p(2))) people.registerTempTable("people") • Query table: val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 23") • Sample input rows: Radu,24,1.70 Andrei,23,1.88
  • 25. SparkSQL vs Hive • Running time improvement of 4x – 30x • Bucketing • Bucket joins • Skew joins • Partial DAG execution
  • 26. Spark 0.8.0 to 1.1 • 0.8.0: first POC, lots of OOM • 0.8.1: first production deployment, still lots of OOM: 20 billion healthcare records, 200 TB of compressed HDFS data; Hadoop MR: 100 m1.xlarge (4c x 15 GB); BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM on the map and reducer side; perf gains of 4x to 40x, required individual dataset and query fine-tuning; mixed Hive & Shark workloads where it made sense; daily processing reduced from 14 hours to 1.5 hours! • 0.9.0: fixed many of the problems, but still requires patches; spilling on the reducer side fixed (less OOM) • 1.0.2: in production today • 1.1: upgrade in progress
  • 27. Mesos (0.20) • Cluster resource manager • Multi-resource scheduling (memory, CPU, disk, and ports) • Scalability to 10,000s of nodes • Fault-tolerant replicated master and slaves using ZooKeeper
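
To connect the Mesos slide to Spark, the sketch below shows how a Spark application might be pointed at a ZooKeeper-backed Mesos master. The host names, executor archive location, app name and core limit are all hypothetical; the configuration keys are the ones Spark documents for running on Mesos.

```scala
import org.apache.spark.SparkConf

object MesosConfSketch {
  // zkQuorum: comma-separated ZooKeeper hosts, e.g. "zk1:2181,zk2:2181" (assumed).
  // executorUri: location of a Spark distribution that Mesos slaves can fetch (assumed).
  def mesosConf(zkQuorum: String, executorUri: String): SparkConf =
    new SparkConf()
      .setAppName("xpatterns-on-mesos") // hypothetical app name
      // The mesos://zk:// form lets the driver find the current leading master.
      .setMaster(s"mesos://zk://$zkQuorum/mesos")
      .set("spark.executor.uri", executorUri)
      // Coarse-grained mode holds long-running executors instead of one Mesos task per Spark task.
      .set("spark.mesos.coarse", "true")
      // Cap the cores one application may hold so other frameworks are not starved.
      .set("spark.cores.max", "64")
}
```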
  • 28. Tachyon (v0.5) • Memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks • Pluggable underlying file system: HDFS, S3, local file system, …
  • 29. Tachyon (v0.5) • Java-like File API / FileSystem API • Configurable block size • Memory management
  • 30. Community contribution • Jaws, the xPatterns HTTP Spark SQL server: http://github.com/Atigeo/http-spark-sql-server (backward compatible with Shark; backend in spray.io, REST on Akka) • Spark Job Server: multiple Spark contexts in the same JVM, job submission in Java + Scala: https://github.com/Atigeo/spark-job-rest • Mesos framework starvation bug • *SchedulerBackend update due to race conditions, Spark 0.9.0 patches
  • 31. Lessons learned • Read the papers • Fine-tuning can really boost your running time • When using Spark, don’t think map-reduce
  • 34. Bibliography • Apache Spark: https://spark.apache.org/ • Parallel programming with Spark: http://ampcamp.berkeley.edu/wp-content/uploads/2013/02/Parallel-Programming-With-Spark-Matei-Zaharia-Strata-2013.pdf • Introduction to Spark internals: spark.apache.org/talks/dev-meetup-dec-2012.pptx
  • 35. © 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Editor's notes

  1. Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Lineage: logging the transformations used to build a dataset. HDFS: one block per partition. Storage: in memory (serialized in Tachyon or deserialized in the JVM) or on disk (HDFS).
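
The two creation routes described in this note can be sketched as below. This assumes Spark on the classpath and a caller-supplied SparkContext; the object and method names are hypothetical, and a local file path stands in for HDFS.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object RddCreationSketch {
  // Route 1: parallelize a collection that already lives in the driver,
  // asking for an explicit number of partitions.
  def fromCollection(sc: SparkContext, data: Seq[Int], partitions: Int): RDD[Int] =
    sc.parallelize(data, partitions)

  // Route 2: reference a dataset in external storage. The path may be a
  // local file, or a URI for any Hadoop-supported file system such as HDFS.
  def fromTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)
}
```

Calling `partitions.length` on either result shows how the data was split, which is where the "one block per partition" behaviour for HDFS input becomes visible.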