SlideShare une entreprise Scribd logo
1  sur  57
David Taieb
STSM - IBM Cloud Data Services
Developer advocate
david_taieb@us.ibm.com
My boss wants me to work on Apache
Spark: should I use Scala, Java or Python or
a combination of the three?
Java One 2016, San Francisco
©2016 IBM Corporation
Objectives
By the end of this session, you should be able to:
– Have Basic knowledge of Apache Spark (at least enough to get
you started)
– Understand key differences between Java, Scala and Python
– Better informed about deciding which language to use
– Understand the role of the Notebook in quickly building data
solution.
©2016 IBM Corporation
Let’s start with a story…
Disclaimer
All characters and events depicted in this Story are entirely fictitious. Any similarity to
actual use cases, events or persons is actually intentional
©2016 IBM Corporation
Meet Ben “the developer”
• Hold a master degree in computer science
• 10 year experience, 6 years with the company
• Full stack Web developer
• Languages of choice: Java, Node.js, HTML5/CSS3
• Data: No SQL (Cloudant, Mongo), relational
• Protocols: REST, JSON, MQTT
• No major experience with Big data
Favorite Quote:
“The Best Line of Code is the
One I Didn't Have to Write!”
©2016 IBM Corporation
Meet Natasha “the data scientist”
• Hold a PHD in data science
• 5 year experience, 2 years with the company
• Experienced in Python and R
• Expert in Machine Learning and Data visualization
• Software engineering is not her thing
Favorite Quote:
“In God we trust.
All others bring data”
W. Edwards Deming
©2016 IBM Corporation
Surprise meeting with VP of development
We have an urgent need for our marketing department
to build an application that can provide real-time sentiment analysis on Twitter data
courtesy of http://linkq.com.vn/
©2016 IBM Corporation
Key Constraints
• You only have 6 weeks to build the application
• Target consumer is the LOB user: must be easy to use even for
non technical people
• Web interface: should be accessible from a standard browser
(desktop or mobile)
• It must scale out of the box: I want you to use Apache Spark
©2016 IBM Corporation
Some learning to do
What exactly is
Apache Spark?
©2016 IBM Corporation
What is Apache spark ?
Spark is an open source
in-memory
computing framework for
distributed data processing
and
iterative analysis
on massive data volumes
©2016 IBM Corporation
Spark Core Libraries
general compute engine, handles distributed task
dispatching, scheduling and basic I/O functions
Spark SQL
executes SQL
statements
performs streaming
analytics using
micro-batches
common machine
learning and statistical
algorithms
distributed graph
processing framework
Spark
Streaming
Mllib
Machine
Learning
GraphX
Graph
Spark Core
©2016 IBM Corporation
Spark is evolved quickly and has strong traction
Spark is one of the most active
open source projects
Interest over time (Google Trends)
Source: https://www.google.com/trends/explore#q=apache%20spark&cmpt=q&tz=http://www.indeed.com/jobanalytics/jobtrends?q=apache+spark&l=
Job Trends (Indeed.com)
©2016 IBM Corporation
Resilient Distributed Datasets (RDDs)
• A collection of elements that Spark works on in parallel
• May be kept in memory or on disk
• Applications can also explicitly tell Spark to
cache an RDD, which is great for iterative algorithms
• An RDD contains the “raw data”, plus the function
to compute it
• Fault-tolerance: if any partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it
RDD built from a Java collection
RDD built from an external dataset
(local FS, HDFS, Hbase,…)
©2016 IBM Corporation
RDD’s fault tolerance scenario
RD
D1
Map
Transformation
Data Node A
File Input Split
1
File Input Split
2
File Input Split
3
RD
D1
RD
D1
RD
D2
RD
D2
Reduce
Transformation
RD
D3
RD
D3
filter
Transformation
Data Node B
©2016 IBM Corporation
Spark SQL and Data Frames
• Spark’s interface for working with structured and semi-structured data
• Ability to load from a verity of data sources (parquet, JSON, Hive)
• Supports SQL syntax for both JDBC/ODBC connectors
• Special RDD tooling (Join between RDD and an SQL table, Expose UDF’s)
• DataFrames
• A Data Frame is a distributed collection of data organized into named columns.
• It is conceptually equivalent to a table in a relational database
• DataFrames can be constructed from a wide array of sources such as: structured data
files, tables in Hive, external databases, or existing RDDs
• Provide better performance than RDD thanks to Spark SQL’s catalyst query optimizer
©2016 IBM Corporation
Consuming Spark
Spark Application
(driver)
Master
(cluster Manager)
Worker
Node
Worker
Node
Worker
Node
Worker
Node
…
Spark Cluster
Kernel
Master
(cluster Manager)
Worker
Node
Worker
Node
…
Spark Cluster
Notebook Server
Browser
Http/WebSockets
Kernel Protocol
Batch Job
(Spark-Submit)
Interactive
Notebook
• RDD Partitioning
• Task packaging and
dispatching
• Worker node
scheduling
©2016 IBM Corporation
IBM’s commitment to Spark
• Contribute to the Core
• Spark Technology Cluster (STC)
• Open source SystemML
• Foster Community
• Educate 1M+ data scientists and engineers
• Sponsor AMPLab
IBM Analytics for Apache
Spark
• IBM Analytics for Apache Spark Managed Service
• www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service
IBM Data Science
Experience
• DataScience Experience:
• datascience.ibm.com
IBM Packages for
Apache Spark
•IBM Packages for Apache Spark:
• ibm.biz/spark-kit
©2016 IBM Corporation
Ben and Natasha start brainstorming
• I’ll work on data acquisition from
Twitter and enrichment with
sentiment analysis scores using Spark
Streaming
• I know Java very well, but I don’t have
time to learn Python.
• However, I am willing to learn Scala if
that helps improve my productivity
• I’ll perform the data exploration and
analysis
• I know Python and R, but am not familiar
enough with Java or Scala
• I like pandas and numpy. I’m ok to learn
Spark but expect the same level of apis
• I need to work iteratively with the data
©2016 IBM Corporation
More learning
What exactly is a
Notebook?
©2016 IBM Corporation
https://www.pinterest.com/pin/309763280589997658/&psig=AFQjCNGrBtz45g-g-yU_j9Wj3eOuqhIz8w&ust=1472442566813212
©2016 IBM Corporation
http://ia.media-imdb.com/images/M/MV5BMTk3OTM5Njg5M15BMl5BanBnXkFtZTYwMzA0ODI3._V1_SX640_SY720_.jpg
©2016 IBM Corporation
What is a Notebook
Text, Annotations
Code, Data
Visualizations,
Widgets, Output
• Web based UI for running
apache spark console
commands
• Easy, no install spark
accelerator
• Best way to start working
with spark
©2016 IBM Corporation
What is Jupyter?
with a “y”, clever ah?
Browser
Kernel
Code Output
https://www.bluetrack.com/uploads/items_images/kernel-of-corn-stress-balls1_thumb.jpg?r=1
Look! Apache Toree for Spark
"Open source, interactive data science and
scientific computing"
– Formerly IPython
– Large, open, growing community and ecosystem
Very popular
– “~2 million users for IPython” [1]
– $6m in funding in 2015 [3]
– 200 contributors to notebook subproject alone [4]
– 275,000 public notebooks on GitHub [2]
©2016 IBM Corporation
Language war?
Likes Scala
Likes Python
Easy to learn Scala has many powerful features that may be hard
to learn, but I only want to use the features that are
similar to Java
Fairly Easy once you get familiar with Python
Idiosyncrasies
Performance Scala is super fast, because it runs in the JVM
Yeah, Python is interpreted and slower, but
there are ways to optimize when needed
Ecosystem Scala ecosystem is small but can reuse Java
ecosystem as it is fully interoperable with Java
Statistics: Pandas, Numpy, matplotlib
Machine Learning: scikit-learn, PyML
Tooling IntelliJ, Scala IDE for Eclipse PyCharm, PyDev (Eclipse)
Interactive
Scala repl (Console),
Scala Notebook for Spark (Limited viz libraries)
Python interpreter, Python Notebook
(Jupyter, Zeppelin) provide great viz libs
©2016 IBM Corporation
Scala vs Java
• Less Verbose
public class Circle{
private double radius;
private double xCenter;
private double yCenter;
public Circle(double radius,double xCenter,
double yCenter){
this.radius = radius;
this.xCenter = xCenter;
this.yCenter = yCenter;
}
public void setRadius(double radius){
this.radius=radius;}
public double getRadius(){return radius;}
public void setXCenter(double xCenter){
this.xCenter=xCenter;}
public double getXCenter(){return xCenter;}
public void setYCenter(double yCenter){
this.yCenter=yCenter;}
public double getYCenter(){return yCenter;}
}
Java
class Circle(
var radius:Double,
var xCenter:Double,
var yCenter:Double
)
Scala
Define a Class with a few fields
©2016 IBM Corporation
Scala vs Java
• Less Verbose
• Mixes OOP and FP
Sort a list of circles by radius size and by center abscissa
Collections.sort(circles, new Comparator<Circle>() {
public int compare(Circle a, Circle b) {
return a.getRadius()-b.getRadius()
}
});
Collections.sort(circles, new Comparator<Circle>() {
public int compare(Circle a, Circle b) {
return a.getXCenter()-b.getXCenter();
}
});
circles.sortBy(
c=>(c.radius,c.xCenter)
)
Scala
Java 7 and below
circles
.sort(Comparator.comparing(Circle::getRadius())
.thenComparing(Circle::getXCenter())
Java 8
©2016 IBM Corporation
Scala vs Java
• Less Verbose
• Mixes OOP and FP
• Concurrency: Actors and Futures
class JavaOneActor extends Actor {
def receive = {
case ”version" => println(”JavaOne 2016")
case “location” => println(“San Francisco”)
case _ => println(”Sorry, I didn’t get that")
}
}
object JavaOneActorMain extends App {
val system = ActorSystem(”JavaOne”)
val actor= system.actorOf(Props[JavaOneActor],name=”JavaOne")
actor ! ”version"
actor ! ”location”
actor ! “Is it raining?”
}
©2016 IBM Corporation
Scala vs Java
• Less Verbose
• Mixes OOP and FP
• Actors
• Better type system
• Parameterized Types
• Abstract Types
• Compound Types
• Singleton Types
• Tuples Types
• Function Types
• Lambdas Types
• Self-Recursive Types
• Functors
• Monads
©2016 IBM Corporation
Scala vs Java
• Less Verbose
• Mixes OOP and FP
• Actors
• Better type system
• Runs as fast
• Scala compiles into Java Bytecode
• Full interoperability with Java
• Runs in the JVM
• Of course, algorithm and optimizations matters
©2016 IBM Corporation
Scala vs Java
• Less Verbose
• Mixes OOP and FP
• Actors
• Better type system
• Runs as fast
• Some features can be confusing
• SBT looks like black magic
• Everything is a function: easy to write cryptic code
• Implicits
• Currying
• Partial functions
• Partially applied functions
courtesy of http://www.codeodor.com/
©2016 IBM Corporation
Let’s look at an example Spark Application
The world famous Word Count!
©2016 IBM Corporation
Java
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Create RDD from a file on the hadoop cluster
New RDD made of all words in the files
For each word create a tuple (word,1)
Aggregate all the results, using the
word as the key
Save the results back to the hadoop cluster
©2016 IBM Corporation
Scala
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Create RDD from a file on the hadoop cluster
New RDD made of all words in the files
For each word create a tuple (word,1)
Aggregate all the results, using the
word as the key
Save the results back to the hadoop cluster
©2016 IBM Corporation
Python
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) 
.map(lambda word: (word, 1)) 
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Create RDD from a file on the hadoop cluster
New RDD made of all words in the files
For each word create a tuple (word,1)
Aggregate all the results, using the
word as the key
Save the results back to the hadoop cluster
©2016 IBM Corporation
OK, let’s agree on the architecture
Watson Tone
Analyzer
Input Stream
Enrich data with
Emotion Tone Scores
Processed data
Notebook
Agree to disagree on the Language
©2016 IBM Corporation
Dividing the tasks
• Implement a Spark Streaming
connector to Twitter
• Call Watson Tone Analyzer for each
tweets
• Return a Spark DataFrame with the
tweets enriched with Tone scores
• Code written in Scala, delivered as a Jar
• Will test in Scala Notebook
• Works in a Python Notebook
• Load the twitter data with Tone score from
a persisted store
• Perform the data exploration and analysis:
trending hashtags and sentiments
• Produce visualizations to LOB Users
©2016 IBM Corporation
A word on Watson Tone Analyzer
• Uses linguistic analysis to detect 3 types of tones: Emotion, Social Tendencies and Language styles
• Available as a cloud service on IBM Bluemix
Input
http://www.ibm.com/watson/developercloud/tone-analyzer.html
Results
©2016 IBM Corporation
How is Ben doing?
Spark Streaming
to Twitter
Get Watson Tone
sentiment scores
Enrich Tweets
with new Scores
Make spark Dataframe
available
ssc = new StreamingContext( sc, Seconds(5) )
ssc.addStreamingListener( new StreamingListener )
val keys = config.getConfig("tweets.key").split(",");
val stream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None );
val tweets = stream.filter { status =>
Option(status.getUser).flatMap[String] {
u => Option(u.getLang)
}.getOrElse("").startsWith("en") &&
CharMatcher.ASCII.matchesAllOf(status.getText) &&
( keys.isEmpty || keys.exists{status.getText.contains(_)
})
...
Create a Spark StremingContext
with a 5 sec batch window
Create a Twitter Stream
Filter any tweet that are not in English
or that do not match the word filters
©2016 IBM Corporation
How is Ben doing?
Spark Streaming
to Twitter
Get Watson Tone
sentiment scores
Enrich Tweets
with new Scores
Make spark Dataframe
available
case class DocumentTone( document_tone: Sentiment )
case class Sentiment(tone_categories: Seq[ToneCategory]);
…
val sentimentResults: String =
EntityEncoder[String].toEntity("{"text": " + JSONObject.quote( status.text ) + "}" ).flatMap {
entity =>
val s = broadcastVar.value.get("watson.tone.url").get + "/v3/tone?version=" + broadcastVar.value.get("watson.api.version").get
val toneuri: Uri = Uri.fromString( s ).getOrElse( null )
client( Request( method = Method.POST, uri = toneuri, headers = …, body = entity.body))
.flatMap { response =>
if (response.status.code == 200 ) {
response.as[String]
} else {
println("Error received from Watson Tone Analyzer.”)
null
}
}
}.run
//Return Sentiment object pickled from response
upickle.read[DocumentTone](sentimentResults).document_tone
Call the Watson Tone Analyzer
Service using a POST Request
Pickle the JSON Response and
return a DocumentTone Object
Define the Sentiment Data Model
©2016 IBM Corporation
How is Ben doing?
Spark Streaming
to Twitter
Get Watson Tone
sentiment scores
Enrich Tweets
with new Scores
Make spark Dataframe
available
lazy val client = PooledHttp1Client()
val rowTweets = tweets.map(status=> {
val sentiment = ToneAnalyzer.computeSentiment( client, status, broadcastVar )
var colValues = Array[Any](
status.getUser.getName, //author
status.getCreatedAt.toString, //date
status.getUser.getLang, //Lang
status.getText, //text
Option(status.getGeoLocation).map{ _.getLatitude}.getOrElse(0.0), //lat
Option(status.getGeoLocation).map{_.getLongitude}.getOrElse(0.0) //long
)
var scoreMap = getScoreMap(sentiment)
colValues = colValues ++ ToneAnalyzer.sentimentFactors.map { f => round(f) * 100.0}
//Return [Row, (sentiment, status)]
(Row(colValues.toArray:_*),(sentiment, status))
})
Enrich the Tweets with Sentiment
Scores, returns a Row Object
©2016 IBM Corporation
How is Ben doing?
Spark Streaming
to Twitter
Get Watson Tone
sentiment scores
Enrich Tweets
with new Scores
Make spark Dataframe
available
def createTwitterDataFrames(sc: SparkContext) : (SQLContext, DataFrame) = {
if ( workingRDD.count <= 0 ){
println("No data receive. Please start the Twitter stream again to collect data")
return null
}
try{
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the
SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)
}catch{
case e: Exception => {logError(e.getMessage, e ); return null}
}
}
Transform the RDD of Enriched
Tweets to a DataFrame
Display the DataFrame Schema
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
Load the Tweets data from a
parquet file in Object Storage
Register a Temp SQL Table so it
can be queried later on
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
Build an array that contains the number of
tweets where score is greater than 60%
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
Creates a flat Map of all the
words in the tweets and filter
to keep only the hashtags
Map then Reduce to count the
occurrences of each hashtag
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
Create RDD from tweets dataframe
Keep only the entries in the top 10
Index by Tag-Tone
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
Count the occurrences for each score
Count the tone average for each tag and count
Final Reduce to get the data the way
we want it for display
©2016 IBM Corporation
How’s Natasha doing?
Load the Tweets
from cluster
Compute the distribution
of Sentiments
Compute top 5
hashtags
Compute aggregate Tone
scores for each 5 top hashtags
©2016 IBM Corporation
How good is Python at handling data?
Given the following quarterly sales series
Year Quarter Revenue
Return a list that contains the revenue for a specific quarter, 0 if not defined. e.g.: 1st Quarter: [8977551.03
4th Quarter: [9179464.4, 6717172.01, 2694937.3, 0]
Let’s look at a simple task
©2016 IBM Corporation
Let’s take a look at the application
©2016 IBM Corporation
Meeting back with the VP
This is great, but why do I need to run 2
notebooks – one in Scala and one in
Python? Please Fix it!
©2016 IBM Corporation
Can’t we both get along?
• It would be great if we could run Scala code from a Python Notebook
• And be able to transfer variables between the 2 languages
Open Source Pixiedust Python library let’s us do just that
https://github.com/ibm-cds-labs/pixiedust
©2016 IBM Corporation
New Version with Scala mix-in
©2016 IBM Corporation
What about the LOB User?
courtesy: http://www.flickr.com
C-Suite executive need to be able to run
the application from Notebook, select
filters and see real-time charts without
writing code!
©2016 IBM Corporation
Embedded application with Pixiedust plugin
©2016 IBM Corporation
Embedded application with Pixiedust plugin
©2016 IBM Corporation
Conclusion
• Programming languages are just tools, you need to
choose the right one for you (the programmer) and the
task:
• Scala is better suited for engineering work that
involves large, reusable components
• Python is the language of choice for data scientists
• If you are starting with Spark and just want to play:
• Start local: no need for a big cluster yet, get installed
and started in minutes
• Use the language you are most familiar with, or is
easiest to learn
• Use Notebooks to learn the APIs
©2016 IBM Corporation
Resources
• http://programming-scala.org
• http://python.org
• http://spark.apache.org
• www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service
• http://datascience.ibm.com
• http://ibm.biz/spark-kit
• www.ibm.com/watson/developercloud/tone-analyzer.html
• developer.ibm.com/clouddataservices/2016/01/15/real-time-sentiment-analysis-of-
twitter-hashtags-with-spark/
• developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
• www.ibm.com/analytics/us/en/technology/spark/
• github.com/ibm-cds-labs/spark.samples
• github.com/ibm-cds-labs/pixiedust
©2016 IBM Corporation
Questions!
David Taieb
david_taieb@us.ibm.com
@DTAIEB55
courtesy: http://omogemura.com/

Contenu connexe

Tendances

How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceLuciano Resende
 
DbyDx Software Corporate Presentation
DbyDx Software Corporate PresentationDbyDx Software Corporate Presentation
DbyDx Software Corporate PresentationDbyDx Software
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...Paul Calvano
 
Whizlabs webinar - Deploying Portfolio Site with AWS Serverless
Whizlabs webinar - Deploying Portfolio Site with AWS ServerlessWhizlabs webinar - Deploying Portfolio Site with AWS Serverless
Whizlabs webinar - Deploying Portfolio Site with AWS ServerlessDhaval Nagar
 
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcs
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcsE fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcs
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcsMohamedcpcbma
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Databricks
 
Fluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchiveFluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchivePaul Calvano
 
Building your own calendly using amazon app sync
Building your own calendly using amazon app syncBuilding your own calendly using amazon app sync
Building your own calendly using amazon app syncDhaval Nagar
 
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Alexander Dean
 
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...Patrick Guimonet
 
An introduction to Serverless
An introduction to ServerlessAn introduction to Serverless
An introduction to ServerlessAdrien Blind
 
Build SPFx Solutions for SharePoint 2019
Build SPFx Solutions for SharePoint 2019Build SPFx Solutions for SharePoint 2019
Build SPFx Solutions for SharePoint 2019Suhail Jamaldeen
 
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...European Collaboration Summit
 
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...Flink Forward
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
Box to Google Drive Business Migration Guide for IT Admins
Box to Google Drive Business Migration Guide for IT AdminsBox to Google Drive Business Migration Guide for IT Admins
Box to Google Drive Business Migration Guide for IT AdminsVazhayil Aiswarya
 
Low Cost AWS Services For Application Development in the Cloud
Low Cost AWS Services For Application Development in the CloudLow Cost AWS Services For Application Development in the Cloud
Low Cost AWS Services For Application Development in the CloudDhaval Nagar
 

Tendances (19)

How mentoring can help you start contributing to open source
How mentoring can help you start contributing to open sourceHow mentoring can help you start contributing to open source
How mentoring can help you start contributing to open source
 
DbyDx Software Corporate Presentation
DbyDx Software Corporate PresentationDbyDx Software Corporate Presentation
DbyDx Software Corporate Presentation
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...
NYC WebPerf Meetup Feb 2020 - Measuring the Adoption of Web Performance Techn...
 
Whizlabs webinar - Deploying Portfolio Site with AWS Serverless
Whizlabs webinar - Deploying Portfolio Site with AWS ServerlessWhizlabs webinar - Deploying Portfolio Site with AWS Serverless
Whizlabs webinar - Deploying Portfolio Site with AWS Serverless
 
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcs
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcsE fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcs
E fw b4rbr62uiizvvipyb_cannell_lowcodelowdown_apex_vbcs
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
Fluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchiveFluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP Archive
 
Building your own calendly using amazon app sync
Building your own calendly using amazon app syncBuilding your own calendly using amazon app sync
Building your own calendly using amazon app sync
 
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
Snowplow and Kinesis - Presentation to the inaugural Amazon Kinesis London Us...
 
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...
SharePoint Saturday Netherlands 2016 - SharePoint and Office 365 performances...
 
An introduction to Serverless
An introduction to ServerlessAn introduction to Serverless
An introduction to Serverless
 
Build SPFx Solutions for SharePoint 2019
Build SPFx Solutions for SharePoint 2019Build SPFx Solutions for SharePoint 2019
Build SPFx Solutions for SharePoint 2019
 
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...
[Collinge] Office 365 Enterprise Network Connectivity Using Published Office ...
 
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin...
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
Box to Google Drive Business Migration Guide for IT Admins
Box to Google Drive Business Migration Guide for IT AdminsBox to Google Drive Business Migration Guide for IT Admins
Box to Google Drive Business Migration Guide for IT Admins
 
Low Cost AWS Services For Application Development in the Cloud
Low Cost AWS Services For Application Development in the CloudLow Cost AWS Services For Application Development in the Cloud
Low Cost AWS Services For Application Development in the Cloud
 

Similaire à JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or All of Them?

Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooksAndrey Vykhodtsev
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato ReviewHang Li
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltreMarco Parenzan
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architectureAdrian Tanase
 
Himansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu Behera
 

Similaire à JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or All of Them? (20)

Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review
 
.NET per la Data Science e oltre
.NET per la Data Science e oltre.NET per la Data Science e oltre
.NET per la Data Science e oltre
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architecture
 
Himansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloperHimansu-Java&BigdataDeveloper
Himansu-Java&BigdataDeveloper
 

Dernier

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 

Dernier (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 

JavaOne 2016: Getting Started with Apache Spark: Use Scala, Java, Python, or All of Them?

  • 1. David Taieb STSM - IBM Cloud Data Services Developer advocate david_taieb@us.ibm.com My boss wants me to work on Apache Spark: should I use Scala, Java or Python or a combination of the three? Java One 2016, San Francisco
  • 2. ©2016 IBM Corporation Objectives By the end of this session, you should be able to: – Have Basic knowledge of Apache Spark (at least enough to get you started) – Understand key differences between Java, Scala and Python – Better informed about deciding which language to use – Understand the role of the Notebook in quickly building data solution.
  • 3. ©2016 IBM Corporation Let’s start with a story… Disclaimer All characters and events depicted in this Story are entirely fictitious. Any similarity to actual use cases, events or persons is actually intentional
  • 4. ©2016 IBM Corporation Meet Ben “the developer” • Hold a master degree in computer science • 10 year experience, 6 years with the company • Full stack Web developer • Languages of choice: Java, Node.js, HTML5/CSS3 • Data: No SQL (Cloudant, Mongo), relational • Protocols: REST, JSON, MQTT • No major experience with Big data Favorite Quote: “The Best Line of Code is the One I Didn't Have to Write!”
  • 5. ©2016 IBM Corporation Meet Natasha “the data scientist” • Hold a PHD in data science • 5 year experience, 2 years with the company • Experienced in Python and R • Expert in Machine Learning and Data visualization • Software engineering is not her thing Favorite Quote: “In God we trust. All others bring data” W. Edwards Deming
  • 6. ©2016 IBM Corporation Surprise meeting with VP of development We have an urgent need for our marketing department to build an application that can provide real-time sentiment analysis on Twitter data courtesy of http://linkq.com.vn/
  • 7. ©2016 IBM Corporation Key Constraints • You only have 6 weeks to build the application • Target consumer is the LOB user: must be easy to use even for non technical people • Web interface: should be accessible from a standard browser (desktop or mobile) • It must scale out of the box: I want you to use Apache Spark
  • 8. ©2016 IBM Corporation Some learning to do What exactly is Apache Spark?
  • 9. ©2016 IBM Corporation What is Apache spark ? Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
  • 10. ©2016 IBM Corporation Spark Core Libraries general compute engine, handles distributed task dispatching, scheduling and basic I/O functions Spark SQL executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework Spark Streaming Mllib Machine Learning GraphX Graph Spark Core
  • 11. ©2016 IBM Corporation Spark is evolved quickly and has strong traction Spark is one of the most active open source projects Interest over time (Google Trends) Source: https://www.google.com/trends/explore#q=apache%20spark&cmpt=q&tz=http://www.indeed.com/jobanalytics/jobtrends?q=apache+spark&l= Job Trends (Indeed.com)
  • 12. ©2016 IBM Corporation Resilient Distributed Datasets (RDDs) • A collection of elements that Spark works on in parallel • May be kept in memory or on disk • Applications can also explicitly tell Spark to cache an RDD, which is great for iterative algorithms • An RDD contains the “raw data”, plus the function to compute it • Fault-tolerance: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it RDD built from a Java collection RDD built from an external dataset (local FS, HDFS, Hbase,…)
  • 13. ©2016 IBM Corporation RDD’s fault tolerance scenario RD D1 Map Transformation Data Node A File Input Split 1 File Input Split 2 File Input Split 3 RD D1 RD D1 RD D2 RD D2 Reduce Transformation RD D3 RD D3 filter Transformation Data Node B
  • 14. ©2016 IBM Corporation Spark SQL and Data Frames • Spark’s interface for working with structured and semi-structured data • Ability to load from a verity of data sources (parquet, JSON, Hive) • Supports SQL syntax for both JDBC/ODBC connectors • Special RDD tooling (Join between RDD and an SQL table, Expose UDF’s) • DataFrames • A Data Frame is a distributed collection of data organized into named columns. • It is conceptually equivalent to a table in a relational database • DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs • Provide better performance than RDD thanks to Spark SQL’s catalyst query optimizer
  • 15. ©2016 IBM Corporation Consuming Spark Spark Application (driver) Master (cluster Manager) Worker Node Worker Node Worker Node Worker Node … Spark Cluster Kernel Master (cluster Manager) Worker Node Worker Node … Spark Cluster Notebook Server Browser Http/WebSockets Kernel Protocol Batch Job (Spark-Submit) Interactive Notebook • RDD Partitioning • Task packaging and dispatching • Worker node scheduling
  • 16. ©2016 IBM Corporation IBM’s commitment to Spark • Contribute to the Core • Spark Technology Cluster (STC) • Open source SystemML • Foster Community • Educate 1M+ data scientists and engineers • Sponsor AMPLab IBM Analytics for Apache Spark • IBM Analytics for Apache Spark Managed Service • www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service IBM Data Science Experience • DataScience Experience: • datascience.ibm.com IBM Packages for Apache Spark •IBM Packages for Apache Spark: • ibm.biz/spark-kit
  • 17. ©2016 IBM Corporation Ben and Natasha start brainstorming • I’ll work on data acquisition from Twitter and enrichment with sentiment analysis scores using Spark Streaming • I know Java very well, but I don’t have time to learn Python. • However, I am willing to learn Scala if that helps improve my productivity • I’ll perform the data exploration and analysis • I know Python and R, but am not familiar enough with Java or Scala • I like pandas and numpy. I’m ok to learn Spark but expect the same level of apis • I need to work iteratively with the data
  • 18. ©2016 IBM Corporation More learning What exactly is a Notebook?
  • 21. ©2016 IBM Corporation What is a Notebook Text, Annotations Code, Data Visualizations, Widgets, Output • Web based UI for running apache spark console commands • Easy, no install spark accelerator • Best way to start working with spark
  • 22. ©2016 IBM Corporation What is Jupyter? with a “y”, clever ah? Browser Kernel Code Output https://www.bluetrack.com/uploads/items_images/kernel-of-corn-stress-balls1_thumb.jpg?r=1 Look! Apache Toree for Spark "Open source, interactive data science and scientific computing" – Formerly IPython – Large, open, growing community and ecosystem Very popular – “~2 million users for IPython” [1] – $6m in funding in 2015 [3] – 200 contributors to notebook subproject alone [4] – 275,000 public notebooks on GitHub [2]
  • 23. ©2016 IBM Corporation Language war? Likes Scala Likes Python Easy to learn Scala has many powerful features that may be hard to learn, but I only want to use the features that are similar to Java Fairly Easy once you get familiar with Python Idiosyncrasies Performance Scala is super fast, because it runs in the JVM Yeah, Python is interpreted and slower, but there are ways to optimize when needed Ecosystem Scala ecosystem is small but can reuse Java ecosystem as it is fully interoperable with Java Statistics: Pandas, Numpy, matplotlib Machine Learning: scikit-learn, PyML Tooling IntelliJ, Scala IDE for Eclipse PyCharm, PyDev (Eclipse) Interactive Scala repl (Console), Scala Notebook for Spark (Limited viz libraries) Python interpreter, Python Notebook (Jupyter, Zeppelin) provide great viz libs
  • 24. ©2016 IBM Corporation Scala vs Java • Less Verbose public class Circle{ private double radius; private double xCenter; private double yCenter; public Circle(double radius,double xCenter, double yCenter){ this.radius = radius; this.xCenter = xCenter; this.yCenter = yCenter; } public void setRadius(double radius){ this.radius=radius;} public double getRadius(){return radius;} public void setXCenter(double xCenter){ this.xCenter=xCenter;} public double getXCenter(){return xCenter;} public void setYCenter(double yCenter){ this.yCenter=yCenter;} public double getYCenter(){return yCenter;} } Java class Circle( var radius:Double, var xCenter:Double, var yCenter:Double ) Scala Define a Class with a few fields
  • 25. ©2016 IBM Corporation Scala vs Java • Less Verbose • Mixes OOP and FP Sort a list of circles by radius size and by center abscissa Collections.sort(circles, new Comparator<Circle>() { public int compare(Circle a, Circle b) { return a.getRadius()-b.getRadius() } }); Collections.sort(circles, new Comparator<Circle>() { public int compare(Circle a, Circle b) { return a.getXCenter()-b.getXCenter(); } }); circles.sortBy( c=>(c.radius,c.xCenter) ) Scala Java 7 and below circles .sort(Comparator.comparing(Circle::getRadius()) .thenComparing(Circle::getXCenter()) Java 8
  • 26. ©2016 IBM Corporation Scala vs Java • Less Verbose • Mixes OOP and FP • Concurrency: Actors and Futures class JavaOneActor extends Actor { def receive = { case ”version" => println(”JavaOne 2016") case “location” => println(“San Francisco”) case _ => println(”Sorry, I didn’t get that") } } object JavaOneActorMain extends App { val system = ActorSystem(”JavaOne”) val actor= system.actorOf(Props[JavaOneActor],name=”JavaOne") actor ! ”version" actor ! ”location” actor ! “Is it raining?” }
  • 27. ©2016 IBM Corporation Scala vs Java • Less Verbose • Mixes OOP and FP • Actors • Better type system • Parameterized Types • Abstract Types • Compound Types • Singleton Types • Tuples Types • Function Types • Lambdas Types • Self-Recursive Types • Functors • Monads
  • 28. ©2016 IBM Corporation Scala vs Java • Less Verbose • Mixes OOP and FP • Actors • Better type system • Runs as fast • Scala compiles into Java Bytecode • Full interoperability with Java • Runs in the JVM • Of course, algorithm and optimizations matters
  • 29. ©2016 IBM Corporation Scala vs Java • Less Verbose • Mixes OOP and FP • Actors • Better type system • Runs as fast • Some features can be confusing • SBT looks like black magic • Everything is a function: easy to write cryptic code • Implicits • Currying • Partial functions • Partially applied functions courtesy of http://www.codeodor.com/
  • 30. ©2016 IBM Corporation Let’s look at an example Spark Application The world famous Word Count!
  • 31. ©2016 IBM Corporation Java JavaRDD<String> textFile = sc.textFile("hdfs://..."); JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } }); JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } }); JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { public Integer call(Integer a, Integer b) { return a + b; } }); counts.saveAsTextFile("hdfs://..."); Create RDD from a file on the hadoop cluster New RDD made of all words in the files For each word create a tuple (word,1) Aggregate all the results, using the word as the key Save the results back to the hadoop cluster
  • 32. ©2016 IBM Corporation Scala val textFile = sc.textFile("hdfs://...") val counts = textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") Create RDD from a file on the hadoop cluster New RDD made of all words in the files For each word create a tuple (word,1) Aggregate all the results, using the word as the key Save the results back to the hadoop cluster
  • 33. ©2016 IBM Corporation Python text_file = sc.textFile("hdfs://...") counts = text_file.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) counts.saveAsTextFile("hdfs://...") Create RDD from a file on the hadoop cluster New RDD made of all words in the files For each word create a tuple (word,1) Aggregate all the results, using the word as the key Save the results back to the hadoop cluster
  • 34. ©2016 IBM Corporation OK, let’s agree on the architecture Watson Tone Analyzer Input Stream Enrich data with Emotion Tone Scores Processed data Notebook Agree to disagree on the Language
  • 35. ©2016 IBM Corporation Dividing the tasks • Implement a Spark Streaming connector to Twitter • Call Watson Tone Analyzer for each tweets • Return a Spark DataFrame with the tweets enriched with Tone scores • Code written in Scala, delivered as a Jar • Will test in Scala Notebook • Works in a Python Notebook • Load the twitter data with Tone score from a persisted store • Perform the data exploration and analysis: trending hashtags and sentiments • Produce visualizations to LOB Users
  • 36. ©2016 IBM Corporation A word on Watson Tone Analyzer • Uses linguistic analysis to detect 3 types of tones: Emotion, Social Tendencies and Language styles • Available as a cloud service on IBM Bluemix Input http://www.ibm.com/watson/developercloud/tone-analyzer.html Results
  • 37. ©2016 IBM Corporation How is Ben doing? Spark Streaming to Twitter Get Watson Tone sentiment scores Enrich Tweets with new Scores Make spark Dataframe available ssc = new StreamingContext( sc, Seconds(5) ) ssc.addStreamingListener( new StreamingListener ) val keys = config.getConfig("tweets.key").split(","); val stream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None ); val tweets = stream.filter { status => Option(status.getUser).flatMap[String] { u => Option(u.getLang) }.getOrElse("").startsWith("en") && CharMatcher.ASCII.matchesAllOf(status.getText) && ( keys.isEmpty || keys.exists{status.getText.contains(_) }) ... Create a Spark StremingContext with a 5 sec batch window Create a Twitter Stream Filter any tweet that are not in English or that do not match the word filters
  • 38. ©2016 IBM Corporation How is Ben doing? Spark Streaming to Twitter Get Watson Tone sentiment scores Enrich Tweets with new Scores Make spark Dataframe available case class DocumentTone( document_tone: Sentiment ) case class Sentiment(tone_categories: Seq[ToneCategory]); … val sentimentResults: String = EntityEncoder[String].toEntity("{"text": " + JSONObject.quote( status.text ) + "}" ).flatMap { entity => val s = broadcastVar.value.get("watson.tone.url").get + "/v3/tone?version=" + broadcastVar.value.get("watson.api.version").get val toneuri: Uri = Uri.fromString( s ).getOrElse( null ) client( Request( method = Method.POST, uri = toneuri, headers = …, body = entity.body)) .flatMap { response => if (response.status.code == 200 ) { response.as[String] } else { println("Error received from Watson Tone Analyzer.”) null } } }.run //Return Sentiment object pickled from response upickle.read[DocumentTone](sentimentResults).document_tone Call the Watson Tone Analyzer Service using a POST Request Pickle the JSON Response and return a DocumentTone Object Define the Sentiment Data Model
  • 39. ©2016 IBM Corporation How is Ben doing? Spark Streaming to Twitter Get Watson Tone sentiment scores Enrich Tweets with new Scores Make spark Dataframe available lazy val client = PooledHttp1Client() val rowTweets = tweets.map(status=> { val sentiment = ToneAnalyzer.computeSentiment( client, status, broadcastVar ) var colValues = Array[Any]( status.getUser.getName, //author status.getCreatedAt.toString, //date status.getUser.getLang, //Lang status.getText, //text Option(status.getGeoLocation).map{ _.getLatitude}.getOrElse(0.0), //lat Option(status.getGeoLocation).map{_.getLongitude}.getOrElse(0.0) //long ) var scoreMap = getScoreMap(sentiment) colValues = colValues ++ ToneAnalyzer.sentimentFactors.map { f => round(f) * 100.0} //Return [Row, (sentiment, status)] (Row(colValues.toArray:_*),(sentiment, status)) }) Enrich the Tweets with Sentiment Scores, returns a Row Object
  • 40. ©2016 IBM Corporation How is Ben doing? Spark Streaming to Twitter Get Watson Tone sentiment scores Enrich Tweets with new Scores Make spark Dataframe available def createTwitterDataFrames(sc: SparkContext) : (SQLContext, DataFrame) = { if ( workingRDD.count <= 0 ){ println("No data receive. Please start the Twitter stream again to collect data") return null } try{ val df = sqlContext.createDataFrame( workingRDD, schemaTweets ) df.registerTempTable("tweets") println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable") println("Here's the schema for tweets") df.printSchema() (sqlContext, df) }catch{ case e: Exception => {logError(e.getMessage, e ); return null} } } Transform the RDD of Enriched Tweets to a DataFrame Display the DataFrame Schema
  • 41. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags Load the Tweets data from a parquet file in Object Storage Register a Temp SQL Table so it can be queried later on
  • 42. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags Build an array that contains the number of tweets where score is greater than 60%
  • 43. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags Creates a flat Map of all the words in the tweets and filter to keep only the hashtags Map then Reduce to count the occurrences of each hashtag
  • 44. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags Create RDD from tweets dataframe Keep only the entries in the top 10 Index by Tag-Tone
  • 45. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags Count the occurrences for each score Count the tone average for each tag and count Final Reduce to get the data the way we want it for display
  • 46. ©2016 IBM Corporation How’s Natasha doing? Load the Tweets from cluster Compute the distribution of Sentiments Compute top 5 hashtags Compute aggregate Tone scores for each 5 top hashtags
  • 47. ©2016 IBM Corporation How good is Python at handling data? Given the following quarterly sales series Year Quarter Revenue Return a list that contains the revenue for a specific quarter, 0 if not defined. e.g.: 1st Quarter: [8977551.03 4th Quarter: [9179464.4, 6717172.01, 2694937.3, 0] Let’s look at a simple task
  • 48. ©2016 IBM Corporation Let’s take a look at the application
  • 49. ©2016 IBM Corporation Meeting back with the VP This is great, but why do I need to run 2 notebooks – one in Scala and one in Python? Please Fix it!
  • 50. ©2016 IBM Corporation Can’t we both get along? • It would be great if we could run Scala code from a Python Notebook • And be able to transfer variables between the 2 languages Open Source Pixiedust Python library let’s us do just that https://github.com/ibm-cds-labs/pixiedust
  • 51. ©2016 IBM Corporation New Version with Scala mix-in
  • 52. ©2016 IBM Corporation What about the LOB User? courtesy: http://www.flickr.com C-Suite executive need to be able to run the application from Notebook, select filters and see real-time charts without writing code!
  • 53. ©2016 IBM Corporation Embedded application with Pixiedust plugin
  • 54. ©2016 IBM Corporation Embedded application with Pixiedust plugin
  • 55. ©2016 IBM Corporation Conclusion • Programming languages are just tools, you need to choose the right one for you (the programmer) and the task: • Scala is better suited for engineering work that involves large, reusable components • Python is the language of choice for data scientists • If you are starting with Spark and just want to play: • Start local: no need for a big cluster yet, get installed and started in minutes • Use the language you are most familiar with, or is easiest to learn • Use Notebooks to learn the APIs
  • 56. ©2016 IBM Corporation Resources • http://programming-scala.org • http://python.org • http://spark.apache.org • www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service • http://datascience.ibm.com • http://ibm.biz/spark-kit • www.ibm.com/watson/developercloud/tone-analyzer.html • developer.ibm.com/clouddataservices/2016/01/15/real-time-sentiment-analysis-of- twitter-hashtags-with-spark/ • developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/ • www.ibm.com/analytics/us/en/technology/spark/ • github.com/ibm-cds-labs/spark.samples • github.com/ibm-cds-labs/pixiedust
  • 57. ©2016 IBM Corporation Questions! David Taieb david_taieb@us.ibm.com @DTAIEB55 courtesy: http://omogemura.com/

Notes de l'éditeur

  1. One thing to note here is how recent all of this attention is – keep in mind that Spark was only founded in 2009 and open-sourced in 2010
  2. RDD’s track lineage info to rebuild lost data through DAG’s (directed acyclic graph) If “Data node A” fails, the spark engine has all the information which contains what input splits or RDD’s were transformed to compute the RDD part of node A And if taking into account that catastrophic failures like that are a rare occasion , we can see why this approach can enhance performance The only shortcoming of this approach is that there will be a lot of re-computation done when such a failure eventually occurs