APACHE SPARK & ELASTICSEARCH
Holden Karau
Reducing duplicated code and saving on network overhead
Who am I?
Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github https://github.com/holdenk
● e-mail holden@databricks.com
● @holdenkarau
*Which is why I might be sleepy today.
What are Spark & Elasticsearch?
Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/
Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine.
● http://www.elasticsearch.org/
Talk overview
Goal: understand how to work with ES & Hadoop
● Spark & Spark streaming let us re-use indexing code
● It's a bit ugly right now…
● Demo* with tweets & top hash tags per region
● We can customize the ES connector to write directly to the right shard based on the Spark partition
● This is an early version of the talk (feedback welcome!)
Assumptions:
● Familiar(ish) with Elasticsearch or at least Solr
● Can read Scala
*Demo gods willing.
Why should you care?
Small differences between off-line and on-line pipelines
Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png
Leads to
Fireworks photo by Hailey Toft
Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Let's start with the on-line pipeline
val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))
// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", ats)
val tweets = TwitterUtils.createStream(ssc, None)
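Nothing actually runs until the streaming context is started; for completeness, here is the start-up step the slide omits (standard Spark Streaming calls, to be issued after the output actions below are wired up):

// Start the streaming computation and block until it is stopped.
ssc.start()
ssc.awaitTermination()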
Let's get ready to write the data into Elasticsearch
Photo by Cloned Milkmen
Let's get ready to write the data into Elasticsearch
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
import org.elasticsearch.hadoop.cfg.ConfigurationOptions

def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}
Add a schema
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string"},
      "hashTags" : {"type" : "string"},
      "location" : {"type" : "geo_point"}
    }
  }
}
'
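One assumption baked into that command: you can only PUT a mapping onto an index that already exists. If the twitter index hasn't been created yet, create it first:

curl -XPUT 'http://localhost:9200/twitter'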
Let's format our tweets
import scala.collection.immutable.HashMap

def prepareTweets(tweet: twitter4j.Status) = {
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to HadoopWritable types
  mapToOutput(fields)
}
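mapToOutput itself isn't shown in the deck; a hypothetical sketch of what such a helper could look like, assuming EsOutputFormat is being fed (key, MapWritable) pairs as set up on the earlier configuration slide:

import org.apache.hadoop.io.{MapWritable, NullWritable, Text}

// Hypothetical helper (name taken from the slide, body assumed):
// convert the field map into the Writable types EsOutputFormat consumes.
def mapToOutput(fields: Map[String, String]): (Object, MapWritable) = {
  val writable = new MapWritable()
  fields.foreach { case (k, v) => writable.put(new Text(k), new Text(v)) }
  (NullWritable.get(), writable)
}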
And save them...
tweets.foreachRDD{(tweetRDD, time) =>
val sc = tweetRDD.context
// The jobConf isn't serializable so we create it here
val jobConf = SharedESConfig.setupEsOnSparkContext(sc,
esResource, Some(esNodes))
// Convert our tweets to something that can be indexed
val tweetsAsMap = tweetRDD.map(
SharedIndex.prepareTweets)
tweetsAsMap.saveAsHadoopDataset(jobConf)
}
Now let’s query them!
{"filtered" : {
"query" : {
"match_all" : {}
}
,"filter" :
{"geo_distance" :
{
"distance" : "${dist}km",
"location" :
{
"lat" : "${lat}",
"lon" : "${lon}"
}}}}}}
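The ${dist}, ${lat}, and ${lon} placeholders have to be filled in before the query is handed to the connector; a minimal sketch using Scala string interpolation, assuming dist, lat, and lon are already in scope:

// Build the geo_distance query; the shape matches the JSON above.
val query = s"""{"filtered" : {
  "query" : { "match_all" : {} },
  "filter" : { "geo_distance" : {
    "distance" : "${dist}km",
    "location" : { "lat" : "$lat", "lon" : "$lon" }
  }}
}}"""

The next slide then hands this string to the connector via es.query.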
Now let’s find the hash tags :)
jobConf.set("es.query", query)
val currentTweets = sc.hadoopRDD(jobConf,
classOf[EsInputFormat[Object, MapWritable]],
classOf[Object], classOf[MapWritable])
val tweets = currentTweets.map{ case (key, value) =>
SharedIndex.mapWritableToInput(value) }
val hashTags = tweets.flatMap{t =>
t.getOrElse("hashTags", "").split(" ")
}
println(hashTags.countByValue())
oh wait :(
Sad panda by Jose Antonio Tovar
Now let's find some common words…
// Extract the top words
val words = tweets.flatMap{ tweet =>
  tweet.flatMap{ elem =>
    elem._2 match {
      case null => Nil
      case _ => elem._2.split(" ")
    }}}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case (key, value) => println(key + ":" + value) }
println("------")
object WordCountOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}

val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val topWords = wc.takeOrdered(40)(WordCountOrdering)
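To actually display them, a short follow-up sketch (topWords is the Array[(String, Int)] returned by takeOrdered above):

// Print the 40 most frequent words with their counts.
topWords.foreach { case (word, count) => println(s"$word: $count") }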
ok, fine, the “top” words
NYC words
I,144
a,83
the,76
to,75
you,66
my,56
and,52
me,47
that,39
of,38
in,38
on,37
so,36
like,35
is,32
this,30
was,29
it,28
with,27
for,25
I'm,24
i,23
but,22
just,22
at,21
be,21
are,20
don't,19
have,18
lol,17
out,17
your,16
love,16
up,16
all,16
her,15
when,14
not,13
it's,13
SF words
I,18
the,16
to,15
you,9
a,9
my,8
me,8
for,8
at,7
of,7
want,6
Just,6
in,6
is,6
this,5
got,5
she,5
when,5
was,4
so,4
what,4
&,4
he,4
your,4
as,4
they,4
it,4
@,4
get,4
and,4
are,4
say,3
w/,3
do,3
dont,3
going,3
fuck,3
i,3
know,3
or,3
Indexing Part 2
(electric boogaloo)
Writing directly to a node with the correct shard saves us network overhead
Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop]
private int detectCurrentInstance(Configuration conf) {
  if (sparkInstance != null) {
    if (log.isDebugEnabled()) {
      log.debug(String.format("Using Spark partition info [%d] for [%s]", sparkInstance, uri));
    }
    return sparkInstance;
  }
  …
}
Slight hack time
We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop]
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) {
  EsRecordWriter writer = new EsRecordWriter(job, progress);
  // This is a special hack for Spark, which sets the name to "part-[partition number]",
  // so if our jobConf asks for it we use that partition number as the shard number.
  if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
    // "part-" is 5 characters, so substring(5) is the partition number.
    writer.setSparkInstance(Integer.valueOf(name.substring(5)));
  }
  return writer;
}
So what does that give us?
Spark sets the file name to part-[partition number]
If we use the same partitioner, we write directly to the correct shard (see the sketch after this list)
Likely the best place to use this is in re-indexing data
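The deck doesn't show what "same partitioner" means concretely; here is a hypothetical sketch of a Spark Partitioner with one partition per ES shard. The routing below is a placeholder: for writes to really land on the right shard it would have to replicate Elasticsearch's own routing hash of the document key, not String.hashCode.

import org.apache.spark.Partitioner

// Hypothetical sketch (not from the deck): one Spark partition per ES shard.
class EsShardPartitioner(numShards: Int) extends Partitioner {
  override def numPartitions: Int = numShards
  override def getPartition(key: Any): Int = {
    // Placeholder routing; keep the result in [0, numShards)
    // even when hashCode is negative.
    val h = key.toString.hashCode % numShards
    if (h < 0) h + numShards else h
  }
}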
Re-index all the things*
// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])
// Convert the writables back to maps (as on the earlier query slide)
val tweets = currentTweets.map{ case (key, value) =>
  SharedIndex.mapWritableToInput(value) }
// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)
*Until you hit your twitter rate limit…. oops
Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/
“Useful” links
● Feedback: holden@databricks.com
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/