Lightning Fast Big Data Analytics with Apache Spark
Andy Petrella (@noootsab), Big Data Hacker
Gerard Maas (@maasg), Data Processing Team Lead
#devoxx #sparkvoxx @noootsab @maasg
Agenda 
What is Spark? 
Spark Foundation: The RDD 
Demo 
Ecosystem 
Examples 
Resources 
Memory, Network, CPUs
(and don’t forget to throw some disks in the mix) 
What is Spark? 
Spark is a fast and general engine for large-scale distributed data processing. 
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Fast. Functional. Growing ecosystem.
Spark: A Strong Open Source Project 
27/02: Apache top-level project
30/05: Spark 1.0.0 released
11/09: Spark 1.1.0 released
Contributors over time: 42, 118, 176
Chart: number of commits (source: github.com/apache/spark)
Compared to Map-Reduce 
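import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of every input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires mapper, reducer and input/output formats into one job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}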
Spark:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The Big Idea... 
Express computations in terms of operations on a data set. 
Spark Core Concept: RDD => Resilient Distributed Dataset 
Think of an RDD as an immutable, distributed collection of objects 
• Resilient => Can be reconstructed in case of failure 
• Distributed => Transformations are parallelizable operations 
• Dataset => Data loaded and partitioned across cluster nodes (executors) 
RDDs are memory-intensive. Caching behavior is controllable. 
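As a rough illustration of the three words in the acronym, here is a toy model in plain Scala (no Spark dependency). `ToyDataset`, `fromSeq` and `compute` are names invented for this sketch; Spark's real RDD is far richer. The sketch shows immutability (a transformation returns a new dataset), partitioning, and why a lost partition can be rebuilt: a partition is defined by the function that computes it.

```scala
// Toy model only: NOT Spark's RDD implementation.
// A dataset is a fixed number of partitions, each defined by a function,
// so a lost partition can be rebuilt by calling that function again.
final case class ToyDataset[A](numPartitions: Int, compute: Int => Vector[A]) {

  // Transformations return a new dataset; the original is never mutated.
  def map[B](f: A => B): ToyDataset[B] =
    ToyDataset[B](numPartitions, p => compute(p).map(f))

  def filter(pred: A => Boolean): ToyDataset[A] =
    ToyDataset[A](numPartitions, p => compute(p).filter(pred))

  // Materialise every partition in order and concatenate the results.
  def collect(): Vector[A] =
    (0 until numPartitions).toVector.flatMap(compute)
}

object ToyDataset {
  // Split a local sequence into at most `numPartitions` contiguous chunks.
  def fromSeq[A](data: Seq[A], numPartitions: Int): ToyDataset[A] = {
    require(numPartitions > 0, "need at least one partition")
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val chunks = data.grouped(chunkSize).map(_.toVector).toVector
    ToyDataset[A](numPartitions, p => chunks.lift(p).getOrElse(Vector.empty))
  }
}
```

For instance, `ToyDataset.fromSeq(1 to 10, 3).map(_ * 2).collect()` walks the three partitions and returns the doubled values in their original order; calling `compute(1)` twice yields the same partition both times, which is the toy version of recomputing after a failure.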
RDDs 
Diagram labels: Spark Cluster, Executors, HDFS.
The pipeline is built up one step at a time; each step yields a new RDD made of partitions:
.textFile("...")
.flatMap(l => l.split(" "))
.map(w => (w, 1))
.reduceByKey(_ + _)
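To see what each step of this pipeline produces, here is a local sketch on plain Scala collections rather than on an RDD (an assumption made so that it runs without a cluster). Collections have no `reduceByKey`, so `groupBy` followed by a per-key sum stands in for it; `WordCountSketch` and the sample lines are made up for illustration.

```scala
object WordCountSketch {
  // Local stand-in for textFile -> flatMap -> map -> reduceByKey.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toList)   // one element per word
      .map(word => (word, 1))         // pair every word with a count of 1
      .groupBy(_._1)                  // bring equal words together (the shuffle, on a cluster)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum counts per word
}
```

`WordCountSketch.wordCount(Seq("to be or", "not to be"))` gives a map in which "to" and "be" count 2 and "or" and "not" count 1.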
The Spark Lingo 
.textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _)
Terms labelled on the diagram: Cluster, Executor, RDD, Partition, Job, Stage, Task
Spark: RDD Operations 
Input data (HDFS, text/sequence files) → SparkContext → RDD → ... → RDD → output data (HDFS, text/sequence files, Cassandra)
Transformations 
Inner Manipulations 
> map, flatMap, filter, distinct 
Cross RDD 
> union, subtract, intersection, join, cartesian 
Structural reorganization (Expensive) 
> groupBy, aggregate, sort 
Tuning 
> coalesce, repartition 
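The categories above can be tried out locally. The sketch below uses plain Scala collections as a stand-in for RDDs (an assumption, so that it runs without Spark); `TransformationSketch`, its sample data and the helper names `join` and `groupByKey` are written for this example and only imitate the RDD operations of the same name.

```scala
object TransformationSketch {
  val left: Seq[(String, Int)]     = Seq("spark" -> 1, "scala" -> 2, "hdfs" -> 3)
  val right: Seq[(String, String)] = Seq("spark" -> "engine", "hdfs" -> "storage")

  // Inner manipulation: element-wise, no data has to move between partitions.
  val doubled: Seq[(String, Int)] = left.map { case (k, v) => (k, v * 2) }

  // Cross-dataset: inner join on the key, emulated with a lookup map.
  // Assumes the keys of `r` are unique.
  def join[K, A, B](l: Seq[(K, A)], r: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val index = r.toMap
    l.flatMap { case (k, a) => index.get(k).map(b => (k, (a, b))) }
  }

  // Structural reorganisation: equal keys must be brought together,
  // which on a cluster means shuffling data across the network.
  def groupByKey[K, V](kvs: Seq[(K, V)]): Map[K, Seq[V]] =
    kvs.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
}
```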
Actions 
Fetch Data 
> collect, take, first, takeSample 
Aggregate Results 
> reduce, count, countByKey 
Output 
> foreach, foreachPartition, save* 
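Transformations only describe a computation; it is an action that makes Spark run it. The sketch below imitates that split with a lazy Scala collection view instead of an RDD (an assumption, so that it runs locally). `LazyPipelineSketch` and its counter are hypothetical names used only here.

```scala
object LazyPipelineSketch {
  // Returns (elements evaluated before the "action", evaluated after it, result).
  def run(lines: Seq[String]): (Int, Int, Vector[String]) = {
    var evaluated = 0

    // "Transformations": a view only records what to do, nothing runs yet.
    val words = lines.view
      .flatMap(_.split(" ").toList)
      .map { w => evaluated += 1; w.toLowerCase }
      .filter(_.nonEmpty)

    val before = evaluated // nothing has been computed so far

    // "Action": materialising the view runs the whole pipeline once.
    val result = words.toVector

    (before, evaluated, result)
  }
}
```

The counter stays at zero until `toVector` is called, which mirrors how `map` or `filter` on an RDD return immediately while `collect` or `count` trigger the work.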
RDD Lineage 
Each RDD keeps track of its parent.
This is the basis for DAG scheduling and fault recovery.
val file = spark.textFile("hdfs://...")
val wordsRDD = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }
Lineage: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD → MapPartitionsRDD → ShuffledRDD → MapPartitionsRDD (wordsRDD) → MappedRDD (scoreRDD)
rdd.toDebugString is your friend
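A minimal sketch of the idea in plain Scala: each node remembers its parent, so walking the parents yields the lineage, which is what would be replayed after a failure. `Node`, `Source`, `Derived` and `LineageSketch` are names invented here; this is a toy and not how Spark implements RDDs or `toDebugString`.

```scala
// Toy lineage model; the names are illustrative, not Spark classes.
sealed trait Node {
  def name: String
  def parent: Option[Node]
}
final case class Source(name: String) extends Node {
  val parent: Option[Node] = None
}
final case class Derived(name: String, from: Node) extends Node {
  val parent: Option[Node] = Some(from)
}

object LineageSketch {
  // Walk from a node back to its source, most recent step first.
  def lineage(n: Node): List[String] =
    n.name :: n.parent.map(lineage).getOrElse(Nil)

  // Indented rendering, one step per line, deeper ancestors further right.
  def debugString(n: Node): String =
    lineage(n).zipWithIndex
      .map { case (name, depth) => ("  " * depth) + name }
      .mkString("\n")
}
```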
Spark has Support for... 
Languages: Java, Scala, Python, R
Interfaces: API, Shell, Notebook
The Spark Shell is the best way to start exploring Spark 
Demo 
Exploring and transforming data with the Spark Shell
Acknowledgments:
Book data provided by Project Gutenberg (http://www.gutenberg.org/) 
through https://www.opensciencedatacloud.org/ 
Cluster computing resources provided by http://www.virdata.com 
Ecosystem 
Now we know what Spark is!
At least we know its Core; let's call it the SDK.
Thanks to its great and enthusiastic community, Spark Core has been used in an ever-growing number of fields.
Hence the ecosystem is evolving fast.
Higher level primitives ... 
… or APIs 
… or the rise of the popolo 
If Spark Core is the fold of distributed computing 
Then we’re going to look at the map, filter, countBy, groupBy, ... 
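The analogy can be made concrete on ordinary Scala lists: `map`, `filter`, `countBy` and `groupBy` can all be written on top of a fold, just as the higher-level libraries are written on top of Spark Core. The object `FoldSketch` is a hypothetical name for this sketch, and the code works on local lists only.

```scala
object FoldSketch {
  // map expressed as a right fold that rebuilds the list.
  def map[A, B](xs: List[A])(f: A => B): List[B] =
    xs.foldRight(List.empty[B])((a, acc) => f(a) :: acc)

  // filter: keep an element only when the predicate holds.
  def filter[A](xs: List[A])(p: A => Boolean): List[A] =
    xs.foldRight(List.empty[A])((a, acc) => if (p(a)) a :: acc else acc)

  // countBy: the accumulator is a map from key to running count.
  def countBy[A, K](xs: List[A])(key: A => K): Map[K, Int] =
    xs.foldLeft(Map.empty[K, Int]) { (acc, a) =>
      val k = key(a)
      acc.updated(k, acc.getOrElse(k, 0) + 1)
    }

  // groupBy: the accumulator is a map from key to the elements seen for it.
  def groupBy[A, K](xs: List[A])(key: A => K): Map[K, List[A]] =
    xs.foldRight(Map.empty[K, List[A]]) { (a, acc) =>
      val k = key(a)
      acc.updated(k, a :: acc.getOrElse(k, Nil))
    }
}
```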
Spark Streaming 
When you have big fat streams behaving as one single collection 
DStream[T]: over time t, a sequence of RDD[T], RDD[T], RDD[T], RDD[T], RDD[T], ...
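One way to picture this: cut the incoming events into small time intervals and treat each interval as an ordinary collection, then apply the same function to every one of them. The sketch below is a toy model in plain Scala, not the Spark Streaming API; `MicroBatchSketch`, `Event`, `batches` and `mapBatches` are hypothetical names, and this toy simply skips intervals that contain no events.

```scala
object MicroBatchSketch {
  final case class Event[T](timestampMs: Long, value: T)

  // Cut a sequence of events into consecutive batches of `batchMs` milliseconds.
  def batches[T](events: Seq[Event[T]], batchMs: Long): Seq[Seq[T]] = {
    require(batchMs > 0, "batch interval must be positive")
    events
      .groupBy(_.timestampMs / batchMs)   // interval index of each event
      .toSeq
      .sortBy(_._1)                       // intervals in time order
      .map { case (_, evs) => evs.sortBy(_.timestampMs).map(_.value) }
  }

  // The same per-batch function is applied to every batch in turn.
  def mapBatches[T, R](bs: Seq[Seq[T]])(f: Seq[T] => R): Seq[R] = bs.map(f)
}
```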
Spark SQL 
From SQL to noSQL to SQL … to noSQL 
Structured Query Language 
We’re not really querying but we’re processing 
SQL provides the mathematical (abstraction) structures to manipulate data 
We can optimize, Spark has Catalyst 
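The point that SQL is a vocabulary for processing, not only for querying, can be illustrated without any SQL engine. In the sketch below the query in the comment and the collection pipeline compute the same thing; `SqlShapeSketch`, `Sale`, `totalByCity` and the sample data are hypothetical, and nothing here uses Spark SQL or Catalyst.

```scala
object SqlShapeSketch {
  final case class Sale(city: String, amount: Double)

  // SELECT city, SUM(amount) FROM sales WHERE amount > :minAmount GROUP BY city
  def totalByCity(sales: Seq[Sale], minAmount: Double): Map[String, Double] =
    sales
      .filter(_.amount > minAmount)                                  // WHERE
      .groupBy(_.city)                                               // GROUP BY
      .map { case (city, rows) => city -> rows.map(_.amount).sum }   // SUM
}
```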
MLlib
“The library to teach them all”
SciPy, scikit-learn, R, MATLAB and co. → learn on one machine (sadly often, on one core)
Algorithms: SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD
GraphX 
Connecting the dots 
Graph processing at scale. 
> Take edges
> Link nodes
> Combine = send messages (Pregel)
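The send-and-combine loop can be sketched on a small in-memory graph. The example below finds connected components by repeatedly passing the smallest known vertex id to neighbours; it is a toy in plain Scala written in the Pregel style, not GraphX code, and `PregelSketch`, `connectedComponents` and `superstep` are names chosen for this illustration.

```scala
object PregelSketch {
  // Propagate the smallest vertex id along edges until nothing changes:
  // vertices that end with the same label belong to the same component.
  def connectedComponents(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Treat every edge as undirected.
    val neighbours: Map[Int, Seq[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    @annotation.tailrec
    def superstep(labels: Map[Int, Int]): Map[Int, Int] = {
      // Send: every vertex sends its current label to its neighbours.
      val messages: Seq[(Int, Int)] =
        labels.toSeq.flatMap { case (v, label) =>
          neighbours.getOrElse(v, Nil).map(n => n -> label)
        }
      // Combine: keep only the smallest label received per vertex.
      val inbox: Map[Int, Int] =
        messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      // Update: a vertex adopts a smaller label if it received one.
      val next = labels.map { case (v, label) =>
        v -> math.min(label, inbox.getOrElse(v, label))
      }
      if (next == labels) labels else superstep(next)
    }

    // Initially every vertex is labelled with its own id.
    superstep(vertices.map(v => v -> v).toMap)
  }
}
```

Labels can only decrease, so the loop stops once a superstep changes nothing.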
ADAM 
The new kid on the block in the Spark community (with the uncovered Thunder) 
Game-changing library for processing DNA, genotypes, variants and co.
Comes with the right stack for processing a huge legacy of vital data
Tooling (NoIDE) 
Beyond the classic Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family!
An IDE is not enough, because we craft more than software and services.
Spark is for data analysis, and data scientists need
> interactivity (exploration) 
> reproducibility (environment, data and logic) 
> shareability (results) 
ISpark 
Spark-Shell backend for IPython (Worksheet for data analysts) 
Zeppelin 
Well-shaped notebook based on Kibana, offering Spark-dedicated features
> Multi-language (Scala, SQL, Markdown, shell)
> Dynamic forms (generating inputs) 
> Data visualization (and export) 
Check the website! 
Spark Notebook 
Scala-Notebook fork, enhanced for Spark peculiarities. 
Full Scala, Akka and RxScala. 
Features including: 
> Multi-language (Scala, SQL, Markdown, JavaScript)
> Data visualization 
> Spark work tracking 
Try it: 
curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev 
Databricks Cloud 
The amazing product crafted by the company behind Spark! 
Cannot say more than this product will be amazing. 
Fully collaborative, dashboard creation and publication. 
Register for a beta account (still eagerly waiting for mine)
Examples 
Mining DNA 
Mining Geodata 
Dallas / Seattle: divergence of 18.4
Mining Texts 
A small project, just for fun
Process a Wikipedia XML dump stored in HDFS
Convert the (multi-line) XML to CSV
Push to S3
Sampling
Compute some stats: TF-IDF 
Train a NaiveBayes classifier 
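For readers who have not met TF-IDF, here is a small local sketch of the statistic in plain Scala. It uses the textbook unsmoothed formula tf × ln(N / df); libraries often use smoothed variants, so the numbers will not necessarily match theirs. `TfIdfSketch` and the sample documents are hypothetical and unrelated to the Wikipedia data of the talk.

```scala
object TfIdfSketch {
  // docs: each document is a sequence of tokens.
  // Returns, per document, the TF-IDF weight of each of its terms.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble

    // Document frequency: in how many documents does each term occur?
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

    docs.map { doc =>
      // Raw term counts within this document.
      val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size }
      counts.map { case (t, c) =>
        val tf  = c.toDouble / doc.size   // relative frequency in the document
        val idf = math.log(n / df(t))     // rarer across documents => larger
        t -> tf * idf
      }
    }
  }
}
```

A term that appears in every document gets weight 0, which is why common words carry little signal for a classifier trained on these features.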
See what the machine can say 
But… quite some data 
A Word of Advice 
Spark's beautiful simplicity is often overshadowed by the complexity of building and maintaining a working distributed system.
Sharpen up your Ops skills… 
… or ooops 
Resources 
Project website: http://spark.apache.org/ 
Spark presentations: http://spark-summit.org/2014 
Starting Questions: http://stackoverflow.com/questions/tagged/apache-spark 
More Advanced Questions: user@spark.apache.org 
Source Code: https://github.com/apache/spark 
Getting involved: http://spark.apache.org/community.html 
Acknowledgments 
Devoxx ! 
Virdata → Shell Demo cluster 
NextLab → Wikipedia ML Cluster 
Rand Hindi (Snips) → Geodata example 
Xavier Tordoir (SilicoCloud) → DNA example 
Answers! 
#devoxx #sparkvoxx @noootsab @maasg

Contenu connexe

Tendances

Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sjHolden Karau
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic searchHenry Saputra
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisDuyhai Doan
 
Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with ElasticsearchHolden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarDatabricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureSpark Summit
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 

Tendances (20)

Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 
2014 spark with elastic search
2014   spark with elastic search2014   spark with elastic search
2014 spark with elastic search
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 ParisReal time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris
 
Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with Elasticsearch
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 

Similaire à Spark devoxx2014

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
An Overview of Apache Spark
An Overview of Apache SparkAn Overview of Apache Spark
An Overview of Apache SparkYasoda Jayaweera
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 

Similaire à Spark devoxx2014 (20)

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Spark core
Spark coreSpark core
Spark core
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
An Overview of Apache Spark
An Overview of Apache SparkAn Overview of Apache Spark
An Overview of Apache Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 

Plus de Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPRAndy Petrella
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and howAndy Petrella
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scalaAndy Petrella
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...


Spark devoxx2014

  • 1. Lightning Fast Big Data Analytics with Apache Spark. Andy Petrella (@noootsab), Big Data Hacker; Gerard Maas (@maasg), Data Processing Team Lead. #devoxx #sparkvoxx @noootsab @maasg
  • 2. Agenda: What is Spark? / Spark Foundation: The RDD / Demo / Ecosystem / Examples / Resources
  • 3. Memory, Network, CPUs (and don’t forget to throw some disks in the mix)
  • 4. What is Spark? Spark is a fast and general engine for large-scale distributed data processing.
        // `spark` is the SparkContext
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))  // lines to words
                         .map(word => (word, 1))            // pair each word with a count of 1
                         .reduceByKey(_ + _)                // sum the counts per word
        counts.saveAsTextFile("hdfs://...")
    Fast. Functional. Growing ecosystem.
  • 5. Spark: A Strong Open Source Project. Timeline: 27/02 Apache top-level project; 30/05 Spark 1.0.0 release; 11/09 Spark 1.1.0 release. Contributor counts shown: 42, 118 and 176. [Chart: number of commits over time; src: github.com/apache/spark]
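  • 6. Compared to Map-Reduce
    Map-Reduce:
        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            // Emits (word, 1) for every token of every input line.
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            // Sums the counts emitted for each word.
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            // Wires mapper, reducer and input/output formats into a job; args: input path, output path.
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "wordcount");
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                job.waitForCompletion(true);
            }
        }
    Spark:
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")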
  • 7. The Big Idea... Express computations in terms of operations on a data set. Spark Core Concept: RDD => Resilient Distributed Dataset. Think of an RDD as an immutable, distributed collection of objects:
        • Resilient => Can be reconstructed in case of failure
        • Distributed => Transformations are parallelizable operations
        • Dataset => Data loaded and partitioned across cluster nodes (executors)
    RDDs are memory-intensive. Caching behavior is controllable.
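The remark that caching behavior is controllable can be sketched as follows. This is not code from the talk: it is a minimal illustration that assumes Spark core 2.x or later on the classpath (for `longAccumulator`) and a local master, and the names `CachingSketch` and `mapCalls` are made up for the example. It counts how many times a map function runs when two actions are executed on the same RDD, with and without `persist`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper for illustration; not part of Spark or of the talk.
object CachingSketch {

  /** Runs two actions on one mapped RDD and reports how often the map function ran. */
  def mapCalls(sc: SparkContext, cached: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val doubled = sc.parallelize(1 to 100, numSlices = 4).map { n => calls.add(1L); n * 2 }
    // persist only marks the RDD; partitions are stored when the first action computes them.
    val rdd = if (cached) doubled.persist(StorageLevel.MEMORY_ONLY) else doubled
    rdd.count()        // first action: computes every partition
    rdd.reduce(_ + _)  // second action: recomputes unless the partitions were cached
    rdd.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("caching-sketch"))
    try {
      println(s"map calls without caching: ${mapCalls(sc, cached = false)}")
      println(s"map calls with caching:    ${mapCalls(sc, cached = true)}")
    } finally sc.stop()
  }
}
```

In a local run without task retries, the uncached variant should evaluate the map function once per element for each action. Accumulator updates made inside transformations are not guaranteed to be exact when tasks are re-executed, so treat the counter as a teaching device rather than a metric.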
  • 8. RDDs [Diagram labels: Executors, Spark Cluster, HDFS]
  • 9. RDDs: .textFile("...") [Diagram labels: RDD, Partitions]
  • 10. RDDs: .textFile("...").flatMap(l => l.split(" "))
  • 11. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)) [Diagram: each partition holds (word, 1) pairs]
  • 12. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) [Diagram: per-partition counts after reduceByKey]
  • 13. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) [Diagram: counts being merged across partitions]
  • 14. RDDs: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) [Diagram: final merged counts]
  • 15. The Spark Lingo: .textFile("...").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_ + _) [Diagram labels: Job, Cluster, Executor, RDD, Partition, Stage, Task]
  • 16. Spark: RDD Operations [Diagram: input data (HDFS, text/sequence file) → SparkContext → RDD → RDD → output data (HDFS, text/sequence file, Cassandra)]
  • 17. Transformations
        Inner Manipulations > map, flatMap, filter, distinct
        Cross RDD > union, subtract, intersection, join, cartesian
        Structural reorganization (Expensive) > groupBy, aggregate, sort
        Tuning > coalesce, repartition
  • 18. Actions
        Fetch Data > collect, take, first, takeSample
        Aggregate Results > reduce, count, countByKey
        Output > foreach, foreachPartition, save*
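A few of the operations listed on slides 17 and 18 can be tried on small in-memory data. The sketch below is not from the talk; it assumes Spark core on the classpath and a local master, and `OperationsSketch` and its `run` method are illustrative names. It applies two transformations (`filter`, `intersection`) and then three actions (`collect`, `collect` again, `countByKey`).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper for illustration; not part of Spark or of the talk.
object OperationsSketch {

  /** Returns (even numbers of a, elements common to a and b, occurrences per word). */
  def run(sc: SparkContext): (List[Int], List[Int], Map[String, Long]) = {
    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5))
    val words = sc.parallelize(Seq("spark", "rdd", "spark")).map(w => (w, 1))

    // Transformations: each call only describes a new RDD; no job runs yet.
    val evens = a.filter(_ % 2 == 0)   // inner manipulation
    val common = a.intersection(b)     // cross-RDD operation

    // Actions: each call launches a job and returns a plain value to the driver.
    (evens.collect().toList, common.collect().sorted.toList, words.countByKey().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("operations-sketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The result of `intersection` is sorted on the driver because the operation shuffles data and gives no ordering guarantee.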
  • 19. RDD Lineage. Each RDD keeps track of its parent. This is the basis for DAG scheduling and fault recovery.
        val file = spark.textFile("hdfs://...")
        val wordsRDD = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
        val scoreRDD = wordsRDD.map { case (k, v) => (v, k) }  // swap to (count, word)
    [Diagram labels: HadoopRDD, MappedRDD, FlatMappedRDD, MappedRDD, MapPartitionsRDD, ShuffledRDD, wordsRDD, MapPartitionsRDD, scoreRDD, MappedRDD]
    rdd.toDebugString is your friend.
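To see a lineage for yourself, the pipeline from slide 19 can be rebuilt on in-memory input and printed with `toDebugString`. This sketch is an assumption-laden stand-in for the slide's code: it swaps the HDFS file for `sc.parallelize`, assumes Spark core on the classpath and a local master, and uses the illustrative names `LineageSketch` and `scoreRdd`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration; not part of Spark or of the talk.
object LineageSketch {

  /** Word count followed by a swap to (count, word), as on the slide, but on in-memory lines. */
  def scoreRdd(sc: SparkContext, lines: Seq[String]): RDD[(Int, String)] = {
    val wordsRDD = sc.parallelize(lines)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordsRDD.map { case (k, v) => (v, k) }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))
    try {
      // Prints the chain of parent RDDs, with indentation marking the shuffle boundary.
      println(scoreRdd(sc, Seq("to be or not to be")).toDebugString)
    } finally sc.stop()
  }
}
```

No expected output is shown because the RDD class names in the printout depend on the Spark version and on the input source.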
  • 20. Spark has Support for... [Table: support for Java, Scala, Python and R across API, Shell and Notebook] The Spark Shell is the best way to start exploring Spark.
  • 21. Demo: Exploring and transforming data with the Spark Shell. Acknowledgments: book data provided by Project Gutenberg (http://www.gutenberg.org/) through the Open Science Data Cloud (https://www.opensciencedatacloud.org/); cluster computing resources provided by Virdata (http://www.virdata.com)
  • 23. Agenda: What is Spark? / Spark Foundation: The RDD / Demo / Ecosystem / Examples / Resources
  • 24. Ecosystem. Now we know what Spark is! At least, we know its Core, let’s say its SDK. Thanks to its great and enthusiastic community, Spark Core has been used in an ever-growing number of fields. Hence the ecosystem is evolving fast.
  • 25. Higher-level primitives … or APIs … or the rise of the popolo. If Spark Core is the fold of distributed computing, then we’re going to look at the map, filter, countBy, groupBy, ...
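The analogy on slide 25 can be made concrete with plain Scala collections, no Spark required: a fold is general enough to express map, filter and countBy, yet the derived operations are what people actually want to write. The sketch below only illustrates the analogy. It says nothing about how Spark implements its libraries, and `FoldSketch` and its method names are made up for the example.

```scala
// Hypothetical helper for illustration; plain Scala, no Spark dependency.
object FoldSketch {

  /** map expressed as a right fold that rebuilds the list with f applied. */
  def mapViaFold[A, B](xs: List[A])(f: A => B): List[B] =
    xs.foldRight(List.empty[B])((x, acc) => f(x) :: acc)

  /** filter expressed as a right fold that keeps only the elements satisfying p. */
  def filterViaFold[A](xs: List[A])(p: A => Boolean): List[A] =
    xs.foldRight(List.empty[A])((x, acc) => if (p(x)) x :: acc else acc)

  /** countBy expressed as a left fold that bumps one counter per key. */
  def countByViaFold[A, K](xs: List[A])(key: A => K): Map[K, Int] =
    xs.foldLeft(Map.empty[K, Int]) { (acc, x) =>
      val k = key(x)
      acc.updated(k, acc.getOrElse(k, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    println(mapViaFold(List(1, 2, 3))(_ * 2))
    println(filterViaFold(List(1, 2, 3, 4))(_ % 2 == 0))
    println(countByViaFold(List("a", "bb", "cc"))(_.length))
  }
}
```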
  • 26. Spark Streaming: when you have big fat streams behaving as one single collection. [Diagram: along the time axis t, a DStream[T] is drawn as a sequence of RDD[T]]
  • 27. Spark Streaming
  • 28. Spark SQL: From SQL to noSQL to SQL … to noSQL. Structured Query Language: we’re not really querying but we’re processing. SQL provides the mathematical (abstraction) structures to manipulate data. We can optimize: Spark has Catalyst.
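A small query shows the idea of using SQL as a processing abstraction. Note the assumptions: this sketch uses the `SparkSession` API, which arrived in Spark 2.0 and therefore postdates this 2014 talk, and it needs the spark-sql module on the classpath. The data, the view name `counts`, and the names `SqlSketch` and `frequentWords` are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration; not part of Spark or of the talk.
object SqlSketch {

  /** Returns the words whose count is at least minCount, in alphabetical order. */
  def frequentWords(spark: SparkSession, minCount: Int): List[String] = {
    import spark.implicits._
    val counts = Seq(("spark", 3), ("hadoop", 1), ("rdd", 2)).toDF("word", "n")
    counts.createOrReplaceTempView("counts")
    spark.sql(s"SELECT word FROM counts WHERE n >= $minCount ORDER BY word")
      .collect()
      .map(_.getString(0))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sql-sketch").getOrCreate()
    try {
      println(frequentWords(spark, minCount = 2))
      // explain(true) prints the logical and physical plans produced by the Catalyst optimizer.
      spark.sql("SELECT word FROM counts WHERE n >= 2").explain(true)
    } finally spark.stop()
  }
}
```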
  • 29. Spark SQL
  • 30. MLlib: “The library to teach them all”. SciPy, scikit-learn, R, MATLAB and co. → learn on one machine (sadly, often on one core). Algorithms shown: SVM, lm, NaiveBayes, PCA, K-Means, ALS, SVD
  • 31. GraphX: Connecting the dots. Graph processing at scale. > Take edges > Add some nodes > Combine = Send messages (Pregel)
  • 32. GraphX: Connecting the dots. Graph processing at scale. > Take edges > Link nodes > Combine/Send messages
  • 33. ADAM: The new kid on the block in the Spark community (with the uncovered Thunder). Game-changing library for processing DNA, genotypes, variants and co. Comes with the right stack for processing … a huge bunch of vital legacy data
  • 34. Tooling (NoIDE): Besides the classical Eclipse, IntelliJ IDEA, NetBeans, Sublime Text and family! An IDE is not enough, because software and services are not the only things being crafted. Spark is for data analysis, and data scientists need > interactivity (exploration) > reproducibility (environment, data and logic) > shareability (results)
  • 35. ISpark: Spark-Shell backend for IPython (worksheet for data analysts)
  • 36. Zeppelin: Well-shaped notebook based on Kibana, offering Spark-dedicated features > Multiple languages (Scala, SQL, Markdown, shell) > Dynamic forms (generating inputs) > Data visualization (and export). Check the website!
  • 37. Spark Notebook: Scala-Notebook fork, enhanced for Spark peculiarities. Full Scala, Akka and RxScala. Features include > Multiple languages (Scala, SQL, Markdown, JavaScript) > Data visualization > Spark work tracking
    Try it: curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev
  • 38. Databricks Cloud: The amazing product crafted by the company behind Spark! Cannot say more than that this product will be amazing. Fully collaborative, with dashboard creation and publication. Register for a beta account (still eagerly waiting for mine).
  • 39. Examples
  • 40. Mining DNA
  • 42. Mining Geodata
  • 43. Dallas Seattle divergence of 18.4
  • 44. Mining Texts
  • 45. A small project just for the fun: Process a Wikipedia XML dump stored in HDFS > Convert XML (multi-line) to CSV > Push to S3 > Sampling
  • 46. A small project just for the fun: Compute some stats (TF-IDF) > Train a NaiveBayes classifier
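Slide 46 names TF-IDF without showing how it is computed. The sketch below works through the textbook formula on a toy corpus in plain Scala: the weight of a term in a document is its raw count in that document times ln(N / df), where N is the number of documents and df the number of documents containing the term. This is an assumption for illustration only. It is neither the speakers' pipeline nor MLlib's implementation (which, among other differences, smooths the IDF term), and `TfIdfSketch` is a made-up name.

```scala
// Hypothetical helper for illustration; plain Scala, no Spark dependency.
object TfIdfSketch {

  /** For each tokenized document, maps every term to count(term, doc) * ln(N / df(term)). */
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size.toDouble
    // Document frequency: in how many documents does each term appear at least once?
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }
    docs.map { doc =>
      doc.groupBy(identity).map { case (term, occ) =>
        term -> occ.size * math.log(n / df(term))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("spark", "is", "fast"), Seq("hadoop", "is", "slow"))
    tfIdf(docs).foreach(println)
  }
}
```

A term that appears in every document gets weight 0, which is the property that makes TF-IDF useful as a feature for a classifier such as Naive Bayes: ubiquitous words carry no signal.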
  • 47. A small project just for the fun: See what the machine can say
  • 48. A small project just for the fun: But… quite some data
  • 49. A Word of Advice: Spark’s beautiful simplicity is often overshadowed by the complexity of building and maintaining a working distributed system. Sharpen up your Ops skills… or ooops
  • 50. Resources
        Project website: http://spark.apache.org/
        Spark presentations: http://spark-summit.org/2014
        Starting questions: http://stackoverflow.com/questions/tagged/apache-spark
        More advanced questions: user@spark.apache.org
        Source code: https://github.com/apache/spark
        Getting involved: http://spark.apache.org/community.html
  • 51. Acknowledgments: Devoxx! Virdata → Shell Demo cluster; NextLab → Wikipedia ML cluster; Rand Hindi (Snips) → Geodata example; Xavier Tordoir (SilicoCloud) → DNA example
  • 52. Answers!