Analyzing Twitter Data with Hadoop
Open Analytics Summit, March 2013
Joey Echeverria | Principal Solutions Architect
joey@cloudera.com | @fwiffo
©2013 Cloudera, Inc.
About Joey
• Principal Solutions Architect
• 2 years @ Cloudera
• 5 years of Hadoop
• Local
BUILDING A BIG DATA SOLUTION
Big Data
• Big
• Larger volume than you’ve handled before
• No litmus test
• High value, underutilized
• Data
• Structured
• Unstructured
• Semi-structured
• Hadoop
• Distributed file system
• Distributed, batch computation
Data Management Systems
Data Source -> Data Ingestion -> Data Storage -> Data Processing
Relational Data Management Systems
Data Source -> ETL -> RDBMS -> Reporting
A Canonical Hadoop Architecture
Data Source -> Flume -> HDFS -> Hive (Impala)
AN EXAMPLE USE CASE
Analyzing Twitter
• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
• Tweets
• Followers
• Retweets
• Similar to e-mail forwarding
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
HOW DO WE ANSWER THESE QUESTIONS?
Techniques
• SQL
• Filtering
• Aggregation
• Sorting
• Complex data
• Deeply nested
• Variable schema
Architecture
Twitter -> Flume -> HDFS -> Hive
• Twitter -> Flume: custom Flume source
• Flume -> HDFS: sink to HDFS
• HDFS -> Hive: JSON SerDe parses data
• Oozie: add partitions hourly
TWITTER SOURCE
Flume
• Streaming data flow
• Sources
• Push or pull
• Sinks
• Event based
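As a rough mental model of the flow above, the sketch below connects a source to a sink through an in-memory queue that stands in for a Flume channel. `MiniFlow`, `offer`, and `drain` are hypothetical names for illustration only; none of this is Flume's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a Flume agent: source -> channel -> sink.
final class MiniFlow {
    // The "channel" buffers events between the source and the sink.
    private final BlockingQueue<String> channel = new ArrayBlockingQueue<>(100);

    // Source side: each incoming record becomes one discrete event.
    // Returns false if the channel is full and the event was not accepted.
    boolean offer(String event) {
        return channel.offer(event);
    }

    // Sink side: take everything currently buffered, in arrival order.
    List<String> drain() {
        List<String> written = new ArrayList<>();
        channel.drainTo(written);
        return written;
    }

    public static void main(String[] args) {
        MiniFlow flow = new MiniFlow();
        flow.offer("{\"text\":\"first tweet\"}");
        flow.offer("{\"text\":\"second tweet\"}");
        System.out.println(flow.drain().size() + " events delivered");
    }
}
```

The bounded queue is the interesting design choice: it decouples the rate at which the source produces events from the rate at which the sink writes them.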
Pulling Data From Twitter
• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS
• HDFS Sink comes stock with Flume
• Easily separate files by creation time
• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
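To make the `%Y/%m/%d/%H` escapes concrete, here is a minimal sketch of how an event timestamp could map to an hourly directory. `BucketPaths` and `bucketPath` are hypothetical helpers, not part of Flume, and the sketch assumes UTC; the time zone the real sink uses depends on its configuration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that mimics the effect of %Y/%m/%d/%H in the sink path.
final class BucketPaths {
    // Assumption: timestamps are interpreted in UTC.
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Builds baseDir/<year>/<month>/<day>/<hour>/ from an epoch-millisecond timestamp.
    static String bucketPath(String baseDir, long epochMillis) {
        return baseDir + "/" + HOURLY.format(Instant.ofEpochMilli(epochMillis)) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets";
        // 1362139200000 ms is 2013-03-01T12:00:00Z.
        System.out.println(bucketPath(base, 1362139200000L));
    }
}
```

Every event from the same hour lands in the same directory, which is what later lets each hour be registered as one Hive partition.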
Flume Source
public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}
Twitter API
• Callback mechanism for catching new tweets
/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);
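The callback mechanism itself can be sketched without twitter4j. In the example below, `CallbackStream`, `addListener`, and `emit` are hypothetical stand-ins for the streaming client; the point is only that the stream, not the caller, decides when the listener runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a streaming client that pushes messages to listeners.
final class CallbackStream {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    // Register a callback; it is invoked once per incoming message.
    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    // Simulates the stream receiving one message and notifying every listener.
    void emit(String rawJson) {
        for (Consumer<String> listener : listeners) {
            listener.accept(rawJson);
        }
    }

    public static void main(String[] args) {
        CallbackStream stream = new CallbackStream();
        stream.addListener(json -> System.out.println("got tweet: " + json));
        stream.emit("{\"text\":\"hello\"}");
    }
}
```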
JSON Data
• JSON data is processed as an event and written to HDFS
public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);
  channel.processEvent(event);
}
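A stdlib-only sketch of the same idea follows: a timestamp header plus the raw JSON as a byte body. `SimpleEvent` and `fromTweet` are hypothetical and are not Flume classes. One deliberate difference from the slide: `getBytes()` with no argument uses the platform default charset, so this sketch passes UTF-8 explicitly, on the assumption that tweets should be stored as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Flume event: string headers plus an opaque byte body.
final class SimpleEvent {
    final Map<String, String> headers;
    final byte[] body;

    SimpleEvent(Map<String, String> headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }

    // Builds an event whose "timestamp" header carries the tweet's creation time
    // in epoch milliseconds and whose body is the raw JSON encoded as UTF-8.
    static SimpleEvent fromTweet(long createdAtMillis, String rawJson) {
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", String.valueOf(createdAtMillis));
        return new SimpleEvent(headers, rawJson.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        SimpleEvent event = fromTweet(1362139200000L, "{\"text\":\"hello\"}");
        System.out.println(event.headers.get("timestamp") + ", "
            + event.body.length + " bytes");
    }
}
```

The timestamp header matters downstream: it is what a time-bucketed sink path is resolved against.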
FLUME DEMO
HIVE
What is Hive?
• Created at Facebook
• HiveQL
• SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
Hive Details
• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
• Stored in a relational database
• Similar to catalog tables in other DBs
Complex Data
SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
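The two-level aggregation in this query can be easier to follow on in-memory data. The sketch below mirrors it with plain Java collections: the inner step keeps the highest retweet count seen per (screen name, text) pair, and the outer step sums and counts per screen name. `TopRetweeted`, `Retweet`, and `Total` are hypothetical names, and the sketch assumes every input row is a retweet (no null `retweeted_status`).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the query above (requires Java 16+ for records).
final class TopRetweeted {
    record Retweet(String screenName, String text, int retweetCount) {}
    record Total(String screenName, long totalRetweets, long tweetCount) {}

    static List<Total> top(List<Retweet> rows, int limit) {
        // Inner query: one entry per (screen_name, text), keeping max(retweet_count).
        Map<List<String>, Integer> maxPerTweet = rows.stream().collect(
            Collectors.toMap(r -> List.of(r.screenName(), r.text()),
                             Retweet::retweetCount,
                             Integer::max));

        // Outer query: sum(retweets) and count(*) per screen_name.
        Map<String, Total> perUser = new HashMap<>();
        maxPerTweet.forEach((key, retweets) ->
            perUser.merge(key.get(0), new Total(key.get(0), retweets, 1),
                (a, b) -> new Total(a.screenName(),
                                    a.totalRetweets() + b.totalRetweets(),
                                    a.tweetCount() + b.tweetCount())));

        // ORDER BY total_retweets DESC LIMIT n.
        return perUser.values().stream()
            .sorted(Comparator.comparingLong(Total::totalRetweets).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Retweet> rows = List.of(
            new Retweet("alice", "t1", 5),
            new Retweet("alice", "t1", 9),
            new Retweet("bob", "t3", 20));
        top(rows, 10).forEach(System.out::println);
    }
}
```

Taking the maximum first matters because the same original tweet shows up once per observed retweet, each time with a running `retweet_count`; summing those rows directly would overcount.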
JSON INTERLUDE
What is JSON?
• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
• number
• string
• array
• object
• true, false
• null
What is JSON?
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {
          "text": "Crowdsourcing",
          "indices": [0, 14]
        },
        {
          "text": "bigdata",
          "indices": [129, 137]
        }
      ],
      "user_mentions": []
    }
  }
}
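One detail of the nested structure is worth a quick check: each hashtag's `indices` pair points back into the tweet's `text`. The sketch below assumes the pair is a [start, end) character offset, which holds for this example. `HashtagSpans`, `Hashtag`, and `slice` are hypothetical names.

```java
// Hypothetical model of one "hashtags" entry and how it maps back onto "text".
final class HashtagSpans {
    record Hashtag(String text, int start, int end) {}

    // Returns the part of the tweet text covered by the hashtag's indices.
    static String slice(String tweetText, Hashtag tag) {
        return tweetText.substring(tag.start(), tag.end());
    }

    public static void main(String[] args) {
        // The tweet text from the slide; \u2013 is the en dash after "#Crowdsourcing".
        String text = "#Crowdsourcing \u2013 drivers already generate traffic data "
            + "for your smartphone to suggest alternative routes when a road "
            + "is clogged. #bigdata";
        System.out.println(slice(text, new Hashtag("Crowdsourcing", 0, 14)));
        System.out.println(slice(text, new Hashtag("bigdata", 129, 137)));
    }
}
```

Note that the slice includes the leading `#`, while the entity's own `text` field does not.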
Hive Serializers and Deserializers
• Instructs Hive on how to interpret data
• JSONSerDe
HIVE DEMO
IT’S A TRAP!
Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
Not a Database
                     RDBMS                    Hive
Language             Generally >= SQL-92      Subset of SQL-92 plus Hive-specific extensions
Update capabilities  INSERT, UPDATE, DELETE   INSERT OVERWRITE; no UPDATE, DELETE
Transactions         Yes                      No
Latency              Sub-second               Minutes
Indexes              Yes                      Yes
Data size            Terabytes                Petabytes
IMPALA ASIDE
Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• Supports Hive SQL
• 4-30X faster than Hive over MapReduce
• Uses existing drivers, integrates with existing metastore, works with leading BI tools
• Flexible, cost-effective, no lock-in
• Deploy & operate with Cloudera Enterprise RTQ
• Supports multiple storage engines & file formats
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas
• All data available for interactive queries
• No loss of fidelity from fixed data schemas
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
Cloudera Impala Details
[Diagram: a SQL app connects over ODBC to a cluster of three identical nodes. Each node runs a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DataNode (DN) and HBase. Shared services: State Store, HDFS NameNode (NN), Hive Metastore, YARN.]
• Fully MPP, distributed
• Local direct reads
• Common Hive SQL and interface
• Unified metadata and scheduler
• Low-latency scheduler and cache (low-impact failures)
OOZIE AUTOMATION
Oozie: Everything in its Right Place
Oozie for Partition Management
• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
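What the hourly job has to produce is one Hive statement per hour. The sketch below only builds that statement as a string; in practice an Oozie workflow would run the equivalent through Hive. `HourlyPartition` and `addPartitionSql` are hypothetical, the `datehour` partition column and its `yyyyMMddHH` format are assumptions rather than something the slides state, and UTC is assumed for the hour boundaries.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that renders the statement an hourly job would run.
final class HourlyPartition {
    // Assumption: partitions are keyed by a numeric yyyyMMddHH value in UTC.
    private static final DateTimeFormatter KEY =
        DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter DIR =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Maps one hour of data under baseDir to a partition of the given table.
    static String addPartitionSql(String table, String baseDir, long epochMillis) {
        Instant hour = Instant.ofEpochMilli(epochMillis);
        return "ALTER TABLE " + table
            + " ADD IF NOT EXISTS PARTITION (datehour = " + KEY.format(hour) + ")"
            + " LOCATION '" + baseDir + "/" + DIR.format(hour) + "'";
    }

    public static void main(String[] args) {
        // 1362139200000 ms is 2013-03-01T12:00:00Z.
        System.out.println(addPartitionSql("tweets", "/user/flume/tweets", 1362139200000L));
    }
}
```

`IF NOT EXISTS` keeps the statement safe to re-run if a scheduled run is retried.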
OOZIE DEMO
PUTTING IT ALL TOGETHER
Complete Architecture
Twitter -> Flume -> HDFS -> Hive
• Twitter -> Flume: custom Flume source
• Flume -> HDFS: sink to HDFS
• HDFS -> Hive: JSON SerDe parses data
• Oozie: add partitions hourly
MORE DEMOS
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
• https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
• https://github.com/cloudera/cdh-twitter-example
My personal preference
• Cloudera Manager
• https://ccp.cloudera.com/display/SUPPORT/Downloads
• Free for unlimited nodes (formerly up to 50)!
Shout Out
• Jon Natkins
• @nattyice
• Blog posts
• http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/
• http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
• http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
Questions?
• Contact me!
• Joey Echeverria
• joey@cloudera.com
• @fwiffo
• We’re hiring!
