APACHE SPARK & ELASTICSEARCH
Holden Karau
Reducing duplicated code and saving on network overhead
Who am I?
Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github https://github.com/holdenk
● e-mail holden@databricks.com
● @holdenkarau
*Which is why I might be sleepy today.
What is Spark & Elasticsearch
Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/
Elasticsearch
● Elasticsearch is a real-time distributed search and
analytics engine.
● http://www.elasticsearch.org/
Talk overview
Goal: understand how to work with ES & Hadoop
● Spark & Spark streaming let us re-use indexing code
● It's a bit ugly right now…
● Demo* with tweets & top hash tags per region
● We can customize the ES connector to write to the shard based on
partition
● This is an early version of the talk (feedback welcome!)
Assumptions:
● Familiar(ish) with Elasticsearch or at least Solr
● Can read Scala
*Demo gods willing.
Why should you care?
Small differences between off-line and on-line
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
Leads to
fire works photo by Hailey Toft
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let's start with the on-line pipeline
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
val tweets = TwitterUtils.createStream(ssc, None)
Let's get ready to write the data into Elasticsearch
Photo by Cloned Milkmen
def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}
Add a schema
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
"tweet" : {
"properties" : {
"message" : {"type" : "string"},
"hashTags" : {"type" : "string"},
"location" : {"type" : "geo_point"}
}
}
}
'
Let's format our tweets
def prepareTweets(tweet: twitter4j.Status) = {
  // Assumes a geotagged tweet: getGeoLocation() returns null otherwise
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to HadoopWritable types
  mapToOutput(fields)
}
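The formatter above reads the location unconditionally, so a tweet without geo data would fail on a null. A minimal sketch of a null-safe variant follows. It uses hypothetical stand-in case classes (`Tweet`, `Geo`) instead of twitter4j.Status so that it runs without the Twitter library, and it returns a plain Map rather than Hadoop writables.

```scala
object PrepareTweetsSketch {
  // Hypothetical stand-ins for the twitter4j types, so the sketch runs on its own.
  final case class Geo(lat: Double, lon: Double)
  // geo may be null, mimicking a tweet that carries no location.
  final case class Tweet(id: Long, text: String, hashTags: Seq[String], geo: Geo)

  def prepare(tweet: Tweet): Map[String, String] = {
    val base = Map(
      "docid" -> tweet.id.toString,
      "message" -> tweet.text,
      "hashTags" -> tweet.hashTags.mkString(" ")
    )
    // Option(null) is None, so tweets without a location simply omit the field.
    Option(tweet.geo) match {
      case Some(g) => base + ("location" -> s"${g.lat},${g.lon}")
      case None    => base
    }
  }
}
```

Omitting the field is one possible choice; dropping such tweets before indexing would work too, depending on whether un-located tweets should be searchable at all.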
And save them...
tweets.foreachRDD{(tweetRDD, time) =>
val sc = tweetRDD.context
// The jobConf isn’t serializable so we create it here
val jobConf = SharedESConfig.setupEsOnSparkContext(sc,
esResource, Some(esNodes))
// Convert our tweets to something that can be indexed
val tweetsAsMap = tweetRDD.map(
SharedIndex.prepareTweets)
tweetsAsMap.saveAsHadoopDataset(jobConf)
}
Now let’s query them!
{"filtered" : {
  "query" : {
    "match_all" : {}
  },
  "filter" : {
    "geo_distance" : {
      "distance" : "${dist}km",
      "location" : {
        "lat" : "${lat}",
        "lon" : "${lon}"
      }
    }
  }
}}
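The ${dist}, ${lat} and ${lon} placeholders suggest the query is built with Scala string interpolation. A minimal sketch of that step, with a hypothetical helper named `geoQuery`, could look like this. Note that the `filtered` query belongs to older Elasticsearch releases (it was removed in 5.0), so newer clusters would need a `bool` query with a `filter` clause instead.

```scala
object GeoQuerySketch {
  // Hypothetical helper: fills in the geo_distance filter shown on the slide.
  def geoQuery(lat: Double, lon: Double, dist: Int): String =
    s"""{"filtered" : {
       |  "query" : { "match_all" : {} },
       |  "filter" : {
       |    "geo_distance" : {
       |      "distance" : "${dist}km",
       |      "location" : { "lat" : "${lat}", "lon" : "${lon}" }
       |    }
       |  }
       |}}""".stripMargin
}
```

The resulting string is what would be passed to the connector as the "es.query" setting.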
Now let’s find the hash tags :)
jobConf.set("es.query", query)
val currentTweets = sc.hadoopRDD(jobConf,
classOf[EsInputFormat[Object, MapWritable]],
classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
SharedIndex.mapWritableToInput(value) }
val hashTags = tweets.flatMap{t =>
t.getOrElse("hashTags", "").split(" ")
}
println(hashTags.countByValue())
oh wait :(
Sad panda by Jose Antonio Tovar
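One detail of the hash tag count is easy to miss: splitting an empty string on a space yields a single empty string, so tweets without hash tags each add an empty "tag" to the tally. A minimal sketch of the same pipeline on local collections (no Spark needed; `countHashTags` is a hypothetical name) that filters those out:

```scala
object HashTagCountSketch {
  // Local stand-in for the RDD pipeline: each tweet is a Map of field name to value.
  def countHashTags(tweets: Seq[Map[String, String]]): Map[String, Int] =
    tweets
      .flatMap(t => t.getOrElse("hashTags", "").split(" "))
      .filter(_.nonEmpty) // "".split(" ") yields one empty string; drop it
      .groupBy(identity)
      .map { case (tag, hits) => tag -> hits.size }
}
```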
Now let’s find some common words…
// Extract the top words
val words = tweets.flatMap{tweet =>
tweet.flatMap{elem =>
elem._2 match {
case null => Nil
case _ => elem._2.split(" ")
}}}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case(key, value) => println(key +
":" + value) }
println("------")
object WordCountOrdering extends Ordering[(String, Int)]
{
def compare(a: (String, Int), b: (String, Int)) = {
b._2 compare a._2
}
}
val wc = words.map(x => (x, 1)).reduceByKey((x,y) =>
x+y)
val topWords = wc.takeOrdered(40)(WordCountOrdering)
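WordCountOrdering compares b against a, which reverses the natural order. Since takeOrdered returns the smallest elements under the ordering it is given, the reversed ordering yields the highest counts first. A small sketch of the same idea on a local collection, with sample pairs borrowed from the NYC results in this deck:

```scala
object TopWordsSketch {
  // Same ordering as on the slide: compare b to a so larger counts sort first.
  object WordCountOrdering extends Ordering[(String, Int)] {
    def compare(a: (String, Int), b: (String, Int)): Int = b._2 compare a._2
  }

  // Local equivalent of wc.takeOrdered(n)(WordCountOrdering).
  def top(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    counts.sorted(WordCountOrdering).take(n)
}
```

On an RDD, takeOrdered avoids sorting the whole data set on the driver, which is why the slide uses it rather than a full sort followed by take.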
ok, fine, the “top” words
NYC words
I,144
a,83
the,76
to,75
you,66
my,56
and,52
me,47
that,39
of,38
in,38
on,37
so,36
like,35
is,32
this,30
was,29
it,28
with,27
for,25
I'm,24
i,23
but,22
just,22
at,21
be,21
are,20
don't,19
have,18
lol,17
out,17
your,16
love,16
up,16
all,16
her,15
when,14
not,13
it's,13
SF words
I,18
the,16
to,15
you,9
a,9
my,8
me,8
for,8
at,7
of,7
want,6
Just,6
in,6
is,6
this,5
got,5
she,5
when,5
was,4
so,4
what,4
&,4
he,4
your,4
as,4
they,4
it,4
@,4
get,4
and,4
are,4
say,3
w/,3
do,3
dont,3
going,3
fuck,3
i,3
know,3
or,3
Indexing Part 2
(electric boogaloo)
Writing directly to a node with the correct shard saves us network overhead
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]
private int detectCurrentInstance(Configuration conf) {
if (sparkInstance != null) {
if (log.isDebugEnabled()) {
log.debug(String.format("Using Spark partition info [%d]", sparkInstance));
}
return sparkInstance;
}
….
}
Slight hack time
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
JobConf job, String name, Progressable progress) {
EsRecordWriter writer = new EsRecordWriter(job, progress);
// This is a special hack for Spark which sets the name as "part-[partitionnumber]" so if our
// jobconf asks for it we use this partition number as the shard number.
if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-") ) {
writer.setSparkInstance(Integer.valueOf(name.substring(5)));
}
return writer;
}
So what does that give us?
Spark sets the file name to part-[partition number]
If Spark uses the same partitioner as Elasticsearch's shard routing, each partition writes directly to the node holding its shard
Likely the best place to use this is in re-indexing data
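The hack depends on recovering the partition number from the "part-" output name. A minimal Scala sketch of that parsing step, mirroring the Java above, is shown here. `partitionFromName` is a hypothetical name, and the zero-padded form "part-00003" is an assumption about how Spark formats the name.

```scala
object PartNameSketch {
  // Returns the partition number encoded in a "part-N" output name, if any.
  def partitionFromName(name: String): Option[Int] =
    if (name.startsWith("part-")) scala.util.Try(name.substring(5).toInt).toOption
    else None
}
```

Unlike the Integer.valueOf call in the connector patch, this version returns None for a malformed suffix instead of throwing, which makes it easier to fall back to the connector's default shard detection.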
Re-index all the things*
// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
  SharedIndex.mapWritableToInput(value) }
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit…. oops
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
“Useful” links
● Feedback: holden@databricks.com
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/

2014 spark with elastic search

  • 1. APACHE SPARK & ELASTICSEARCH Holden Karau Reducing duplicated code and saving on network overhead
  • 2. Who am I? Holden Karau ● Software Engineer @ Databricks ● I’ve worked with Elasticsearch before ● I prefer she/her for pronouns ● Author of a book on Spark and co-writing another* ● github https://github.com/holdenk ● e-mail holden@databricks.com ● @holdenkarau *Which is why I might be sleepy today.
  • 3. What is Spark & Elasticsearch Spark ● Apache Spark™ is a fast and general engine for large-scale data processing. ● http://spark.apache.org/ Elasticsearch ● Elasticsearch is a real-time distributed search and analytics engine. ● http://www.elasticsearch.org/
  • 4. Talk overview Goal: understand how to work with ES & Hadoop ● Spark & Spark streaming let us re-use indexing code ● It’s a bit ugly right now… ● Demo* with tweets & top hash tags per region ● We can customize the ES connector to write to the shard based on partition ● This is an early version of the talk (feedback welcome!) Assumptions: ● Familiar(ish) with Elasticsearch or at least Solr ● Can read Scala *Demo gods willing.
  • 5. Why should you care? Small differences between off-line and on-line Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
  • 6. Leads to… Fireworks photo by Hailey Toft
  • 7. Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  • 8. Let’s start with the on-line pipeline
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.twitter.TwitterUtils
      val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
      // Set up the system properties for twitter
      System.setProperty("twitter4j.oauth.consumerKey", cK)
      System.setProperty("twitter4j.oauth.consumerSecret", cS)
      System.setProperty("twitter4j.oauth.accessToken", aT)
      System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
      val tweets = TwitterUtils.createStream(ssc, None)
  • 9. Let’s get ready to write the data into Elasticsearch Photo by Cloned Milkmen
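  • 10. Let’s get ready to write the data into Elasticsearch
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
      import org.apache.spark.SparkContext
      import org.elasticsearch.hadoop.cfg.ConfigurationOptions
      def setupEsOnSparkContext(sc: SparkContext) = {
        val jobConf = new JobConf(sc.hadoopConfiguration)
        jobConf.set("mapred.output.format.class", "org.elasticsearch.hadoop.mr.EsOutputFormat")
        jobConf.setOutputCommitter(classOf[FileOutputCommitter])
        jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
        // Placeholder output path; the documents go to Elasticsearch, not to files
        FileOutputFormat.setOutputPath(jobConf, new Path("-"))
        jobConf
      }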
  • 11. Add a schema
      curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
      {
        "tweet" : {
          "properties" : {
            "message" : {"type" : "string"},
            "hashTags" : {"type" : "string"},
            "location" : {"type" : "geo_point"}
          }
        }
      }'
  • 12. Let’s format our tweets
      def prepareTweets(tweet: twitter4j.Status) = {
        val loc = tweet.getGeoLocation()
        val lat = loc.getLatitude()
        val lon = loc.getLongitude()
        val hashTags = tweet.getHashtagEntities().map(_.getText())
        val fields = HashMap(
          "docid" -> tweet.getId().toString,
          "message" -> tweet.getText(),
          "hashTags" -> hashTags.mkString(" "),
          "location" -> s"$lat,$lon"
        )
        // Convert to HadoopWritable types (mapToOutput is a helper not shown on this slide)
        mapToOutput(fields)
      }
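The slide's `prepareTweets` calls `getLatitude()` on whatever `getGeoLocation()` returns, and twitter4j returns `null` there for tweets without a geotag. A minimal sketch of one way to guard against that is below. `Geo` and `Tweet` are hypothetical stand-ins for the twitter4j types so the snippet runs on its own, and the conversion to Hadoop writables is left out.

```scala
// Hypothetical stand-ins for twitter4j.GeoLocation and twitter4j.Status,
// so the sketch runs without twitter4j on the classpath.
final case class Geo(lat: Double, lon: Double)
final case class Tweet(id: Long, text: String, geo: Geo, hashTags: Seq[String])

object PrepareTweetsSketch {
  // Build the document fields, adding "location" only when the tweet has a geotag.
  def fields(tweet: Tweet): Map[String, String] = {
    val base = Map(
      "docid"    -> tweet.id.toString,
      "message"  -> tweet.text,
      "hashTags" -> tweet.hashTags.mkString(" ")
    )
    // Option(null) is None, so untagged tweets simply skip the location field.
    Option(tweet.geo).fold(base)(g => base + ("location" -> s"${g.lat},${g.lon}"))
  }
}
```

Whether to index untagged tweets without a location or to drop them entirely is a design choice; the geo-distance query later in the talk would only ever match the tagged ones.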
  • 13. And save them...
      tweets.foreachRDD{(tweetRDD, time) =>
        val sc = tweetRDD.context
        // The jobConf isn't serializable so we create it here
        val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes))
        // Convert our tweets to something that can be indexed
        val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
        tweetsAsMap.saveAsHadoopDataset(jobConf)
      }
  • 14. Now let’s query them!
      {"filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {"geo_distance" : {
          "distance" : "${dist}km",
          "location" : { "lat" : "${lat}", "lon" : "${lon}" }
        }}
      }}
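The `${dist}`, `${lat}` and `${lon}` placeholders suggest the query is assembled with Scala string interpolation. A minimal sketch of that step follows; the object name, function name and parameter types are assumptions for illustration, not taken from the demo code.

```scala
object GeoQuerySketch {
  // Fill the geo_distance template from the slide with concrete values.
  // Only `$` is special inside an s-interpolated string, so the JSON braces are literal.
  def geoQuery(lat: Double, lon: Double, distKm: Int): String =
    s"""{"filtered" : {
       |  "query" : { "match_all" : {} },
       |  "filter" : { "geo_distance" : {
       |    "distance" : "${distKm}km",
       |    "location" : { "lat" : "$lat", "lon" : "$lon" }
       |  }}
       |}}""".stripMargin
}
```

The resulting string is the kind of value that the next slide passes to `jobConf.set("es.query", query)`.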
  • 15. Now let’s find the hash tags :)
      jobConf.set("es.query", query)
      val currentTweets = sc.hadoopRDD(jobConf,
        classOf[EsInputFormat[Object, MapWritable]],
        classOf[Object], classOf[MapWritable])
      val tweets = currentTweets.map{ case (key, value) =>
        SharedIndex.mapWritableToInput(value) }
      val hashTags = tweets.flatMap{t =>
        t.getOrElse("hashTags", "").split(" ")
      }
      println(hashTags.countByValue())
  • 16. oh wait :( Sad panda by Jose Antonio Tovar
  • 17. Now let’s find some common words…
      // Extract the top words
      val words = tweets.flatMap{tweet =>
        tweet.flatMap{elem =>
          elem._2 match {
            case null => Nil
            case _ => elem._2.split(" ")
          }}}
      val wordCounts = words.countByValue()
      println("------")
      wordCounts.foreach{ case (key, value) => println(key + ":" + value) }
      println("------")
  • 18. OK, fine, the “top” words
      object WordCountOrdering extends Ordering[(String, Int)] {
        def compare(a: (String, Int), b: (String, Int)) = {
          b._2 compare a._2
        }
      }
      val wc = words.map(x => (x, 1)).reduceByKey((x,y) => x+y)
      val topWords = wc.takeOrdered(40)(WordCountOrdering)
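`WordCountOrdering` compares `b` against `a`, so larger counts sort first and `takeOrdered(40)` returns the 40 most frequent words. The same idea can be tried on a plain Scala collection without a cluster. This is a local sketch only: `groupBy` and `sorted` stand in for Spark's `reduceByKey` and `takeOrdered`, and the names are illustrative.

```scala
object TopWordsSketch {
  // Same effect as WordCountOrdering: order pairs by count, largest first.
  val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  // Local-collection analogue of reduceByKey followed by takeOrdered(n).
  def topWords(words: Seq[String], n: Int): Seq[(String, Int)] = {
    val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    counts.toSeq.sorted(byCountDesc).take(n)
  }
}
```

For example, `TopWordsSketch.topWords(Seq("a", "b", "a", "c", "a", "b"), 2)` yields `("a", 3)` followed by `("b", 2)`.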
  • 21. Indexing Part 2 (electric boogaloo) Writing directly to a node with the correct shard saves us network overhead Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
  • 22. Slight hack time. We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop]
      private int detectCurrentInstance(Configuration conf) {
        if (sparkInstance != null) {
          if (log.isDebugEnabled()) {
            log.debug(String.format("Using Spark partition info [%d]", sparkInstance, uri));
          }
          return sparkInstance;
        }
        ….
      }
  • 23. Slight hack time. We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop]
      public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored, JobConf job, String name, Progressable progress) {
        EsRecordWriter writer = new EsRecordWriter(job, progress);
        // This is a special hack for Spark which sets the name as "part-[partitionnumber]" so if our
        // jobconf asks for it we use this partition number as the shard number.
        if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
          writer.setSparkInstance(Integer.valueOf(name.substring(5)));
        }
        return writer;
      }
  • 24. So what does that give us? Spark sets the file name to part-[partition number]. If Spark uses the same partitioner as Elasticsearch, each partition is written directly to the node holding its shard. The best place to use this is likely when re-indexing data.
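The hack on the previous slides hinges on recovering the partition number from the `part-[partition number]` name. A small Scala sketch of that parsing step, written defensively, is shown below. The object and function names are hypothetical, and the patched connector itself does this in Java with `Integer.valueOf(name.substring(5))`.

```scala
import scala.util.Try

object PartitionNameSketch {
  // Return the partition number encoded in a "part-NNNNN" name,
  // or None when the name has some other shape.
  def partitionOf(name: String): Option[Int] =
    if (name.startsWith("part-")) Try(name.substring(5).toInt).toOption
    else None
}
```

Returning an `Option` rather than throwing lets the caller fall back to the connector's normal shard detection when the name is not in the expected form.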
  • 25. Re-index all the things*
      // Read in our data set
      val currentTweets = sc.hadoopRDD(jobConf,
        classOf[EsInputFormat[Object, MapWritable]],
        classOf[Object], classOf[MapWritable])
      // Convert the stored documents back to maps, as on slide 15
      val tweets = currentTweets.map{ case (key, value) =>
        SharedIndex.mapWritableToInput(value) }
      // Fetch them from twitter
      val t4jt = tweets.flatMap{ tweet =>
        val twitter = TwitterFactory.getSingleton()
        val tweetID = tweet.getOrElse("docid", "")
        Option(twitter.showStatus(tweetID.toLong))
      }
      t4jt.map(SharedIndex.prepareTweets)
        .saveAsHadoopDataset(jobConf)
      *Until you hit your Twitter rate limit… oops
  • 26. Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
  • 27. “Useful” links ● Feedback: holden@databricks.com ● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop ● Demo code: https://github.com/holdenk/elasticsearchspark ● Elasticsearch: http://www.elasticsearch.org/ ● Spark: http://spark.apache.org/ ● Spark streaming: http://spark.apache.org/streaming/