Contenu connexe Similaire à Analyzing twitter data with hadoop Similaire à Analyzing twitter data with hadoop (20) Plus de Joey Echeverria (11) Analyzing twitter data with hadoop1. 1
Analyzing
Twi,er
Data
with
Hadoop
Data
Science
Maryland,
May
2013
Joey
Echeverria
|
Director
Federal
FTS
joey@cloudera.com
|
@fwiffo
©2013 Cloudera, Inc.
2. About
Joey
• Director
Federal
FTS
• Technical
guy
playing
manager
• 2
years
@
Cloudera
• 5
years
of
Hadoop
• Local
2
4. Big
Data
• Big
• Larger
volume
than
you’ve
handled
before
• No
litmus
test
• High
value,
under
uRlized
• Data
• Structured
• Unstructured
• Semi-‐structured
• Hadoop
• Distributed
file
system
• Distributed,
batch
computaRon
4 ©2013 Cloudera, Inc.
7. A
Canonical
Hadoop
Architecture
7 ©2013 Cloudera, Inc.
Data
Source
HDFS
Flume
Hive
(Impala)
9. Analyzing
Twi,er
• Social
media
popular
with
markeRng
teams
• Twi,er
is
an
effecRve
tool
for
promoRon
• Who
is
influenRal?
• Tweets
• Followers
• Retweets
• Similar
to
e-‐mail
forwarding
• Which
twi,er
user
gets
the
most
retweets?
• Who
is
influenRal
in
our
industry?
9 ©2013 Cloudera, Inc.
11. Techniques
• SQL
• Filtering
• AggregaRon
• SorRng
• Complex
data
• Deeply
nested
• Variable
schema
11
12. Architecture
12 ©2013 Cloudera, Inc.
Twi,er
HDFS
Flume
Hive
Custom
Flume
Source
Sink
to
HDFS
JSON
SerDe
Parses
Data
Oozie
Add
ParRRons
Hourly
Impala
Queries
Queries
and
ETL
14. Flume
• Streaming
data
flow
• Sources
• Push
or
pull
• Sinks
• Event
based
14 ©2013 Cloudera, Inc.
15. Pulling
Data
From
Twi,er
• Custom
source,
using
twi,er4j
• Sources
process
data
as
discrete
events
16. Loading
Data
Into
HDFS
• HDFS
Sink
comes
stock
with
Flume
• Easily
separate
files
by
creaRon
Rme
• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
17. Flume
Source
17 ©2013 Cloudera, Inc.
public class TwitterSource extends AbstractSource!
implements EventDrivenSource, Configurable {!
...!
// The initialization method for the Source. The context contains all !
// the Flume configuration info!
@Override!
public void configure(Context context) {!
...!
}!
...!
// Start processing events. Uses the Twitter Streaming API to sample!
// Twitter, and process tweets.!
@Override!
public void start() {!
...!
}!
...!
// Stops Source's event processing and shuts down the Twitter stream.!
@Override!
public void stop() {!
...!
}!
}!
!
18. Twi,er
API
• Callback
mechanism
for
catching
new
tweets
18 ©2013 Cloudera, Inc.
/** The actual Twitter stream. It's set up to collect raw JSON data */!
private final TwitterStream twitterStream = new TwitterStreamFactory(!
new ConfigurationBuilder().setJSONStoreEnabled(true).build())!
.getInstance();!
...!
// The StatusListener is a twitter4j API that can be added to a stream,!
// and will call a method every time a message is sent to the stream.!
StatusListener listener = new StatusListener() {!
// The onStatus method is executed every time a new tweet comes in.!
public void onStatus(Status status) {!
... !
}!
}!
...!
// Set up the stream's listener (defined above), and set any necessary!
// security information.!
twitterStream.addListener(listener);!
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);!
AccessToken token = new AccessToken(accessToken, accessTokenSecret);!
twitterStream.setOAuthAccessToken(token);!
!
19. JSON
Data
• JSON
data
is
processed
as
an
event
and
wri,en
to
HDFS
19 ©2013 Cloudera, Inc.
public void onStatus(Status status) {!
// The EventBuilder is used to build an event using the headers and!
// the raw JSON of a tweet!
!
headers.put("timestamp", String.valueOf(!
status.getCreatedAt().getTime()));!
Event event = EventBuilder.withBody(!
DataObjectFactory.getRawJSON(status).getBytes(), headers);!
!
channel.processEvent(event);!
}!
22. What
is
Hive?
• Created
at
Facebook
• HiveQL
• SQL
like
interface
• Hive
interpreter
converts
HiveQL
to
MapReduce
code
• Returns
results
to
the
client
22
©2013 Cloudera, Inc.
23. Hive
Details
• Schema
on
read
• Scalar
types
(int,
float,
double,
boolean,
string)
• Complex
types
(struct,
map,
array)
• Metastore
contains
table
definiRons
• Stored
in
a
relaRonal
database
• Similar
to
catalog
tables
in
other
DBs
23
24. Complex
Data
24 ©2013 Cloudera, Inc.
SELECT!
t.retweet_screen_name,!
sum(retweets) AS total_retweets,!
count(*) AS tweet_count!
FROM (SELECT!
retweeted_status.user.screen_name AS retweet_screen_name,!
retweeted_status.text,!
max(retweeted_status.retweet_count) AS retweets!
FROM tweets!
GROUP BY!
retweeted_status.user.screen_name,!
retweeted_status.text) t!
GROUP BY t.retweet_screen_name!
ORDER BY total_retweets DESC!
LIMIT 10;!
26. What
is
JSON?
• Complex,
semi-‐structured
data
• Based
on
JavaScript’s
data
syntax
• Rich,
nested
data
types:
• number
• string
• Array
• object
• true,
false
• null
26 ©2013 Cloudera, Inc.
27. What
is
JSON?
27 ©2013 Cloudera, Inc.
{!
"retweeted_status": {!
"contributors": null,!
"text": "#Crowdsourcing – drivers already generate traffic data for
your smartphone to suggest alternative routes when a road is clogged.
#bigdata",!
"retweeted": false,!
"entities": {!
"hashtags": [!
{!
"text": "Crowdsourcing",!
"indices": [0, 14]!
},!
{!
"text": "bigdata",!
"indices": [129,137]!
}!
],!
"user_mentions": []!
}!
}!
}!
28. Hive
Serializers
and
Deserializers
• Instructs
Hive
on
how
to
interpret
data
• JSONSerDe
28 ©2013 Cloudera, Inc.
30. Analyzing
Twi,er
Data
with
Hadoop
IT’S
A
TRAP!
30
©2013 Cloudera, Inc.
Photo
from
h,p://www.flickr.com/photos/vanf/6798065626/
Some
rights
reserved
31. Not
a
Database
31 ©2013 Cloudera, Inc.
RDBMS
Hive
Language
Generally >= SQL-92
Subset of SQL-92 plus
Hive specific extensions
Update Capabilities
INSERT, UPDATE,
DELETE
INSERT OVERWRITE
no UPDATE, DELETE
Transactions
Yes
No
Latency
Sub-second
Minutes
Indexes
Yes
Yes
Data size
Terabytes
Petabytes
32. ETL
• Hive
works
great
for
SQL-‐based
ETL
• JSON
-‐>
SequenceFiles
32 ©2013 Cloudera, Inc.
34. Cloudera
Impala
34
Real-‐Time
Query
for
Data
Stored
in
Hadoop.
Supports
Hive
SQL
4-‐30X
faster
than
Hive
over
MapReduce
Uses
exisRng
drivers,
integrates
with
exisRng
metastore,
works
with
leading
BI
tools
Flexible,
cost-‐effecRve,
no
lock-‐in
Deploy
&
operate
with
Cloudera
Enterprise
RTQ
Supports
mulRple
storage
engines
&
file
formats
©2013 Cloudera, Inc.
35. Benefits
of
Cloudera
Impala
35
Real-‐Time
Query
for
Data
Stored
in
Hadoop
• Real-‐Rme
queries
run
directly
on
source
data
• No
ETL
delays
• No
jumping
between
data
silos
•
No
double
storage
with
EDW/RDBMS
•
Unlock
analysis
on
more
data
•
No
need
to
create
and
maintain
complex
ETL
between
systems
•
No
need
to
preplan
schemas
•
All
data
available
for
interacRve
queries
• No
loss
of
fidelity
from
fixed
data
schemas
• Single
metadata
store
from
originaRon
through
analysis
• No
need
to
hunt
through
mulRple
data
silos
©2013 Cloudera, Inc.
36. Cloudera
Impala
Details
36
©2013 Cloudera, Inc.
HDFS
DN
Query
Exec
Engine
Query
Coordinator
Query
Planner
HBase
ODBC
SQL
App
HDFS
DN
Query
Exec
Engine
Query
Coordinator
Query
Planner
HBase
HDFS
DN
Query
Exec
Engine
Query
Coordinator
Query
Planner
HBase
Fully
MPP
Distributed
Local
Direct
Reads
State
Store
HDFS
NN
Hive
Metastore
YARN
Common
Hive
SQL
and
interface
Unified
metadata
and
scheduler
Low-‐latency
scheduler
and
cache
(low-‐impact
failures)
40. Oozie
for
ParRRon
Management
• Once
an
hour,
add
a
parRRon
• Takes
advantage
of
advanced
Hive
funcRonality
43. Complete
Architecture
43 ©2013 Cloudera, Inc.
Twi,er
HDFS
Flume
Hive
Custom
Flume
Source
Sink
to
HDFS
JSON
SerDe
Parses
Data
Oozie
Add
ParRRons
Hourly
Impala
Queries
Queries
and
ETL
45. What
next?
• Cloudera
University
• Download
Hadoop!
• CDH
available
at
www.cloudera.com
• Cloudera
provides
pre-‐loaded
VMs
• h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera
+Manager+Free+EdiRon+Demo+VM
• Clone
the
source
repo
• h,ps://github.com/cloudera/cdh-‐twi,er-‐example
46. My
personal
preference
• Cloudera
Manager
• h,ps://ccp.cloudera.com/display/SUPPORT/Downloads
• Free
up
to
50
unlimited
nodes!
47. Shout
Out
• Jon
Natkins
• @na,yice
• Blog
posts
• h,p://blog.cloudera.com/blog/2013/09/analyzing-‐twi,er-‐
data-‐with-‐hadoop/
• h,p://blog.cloudera.com/blog/2013/10/analyzing-‐twi,er-‐
data-‐with-‐hadoop-‐part-‐2-‐gathering-‐data-‐with-‐flume/
• h,p://blog.cloudera.com/blog/2013/11/analyzing-‐twi,er-‐
data-‐with-‐hadoop-‐part-‐3-‐querying-‐semi-‐structured-‐data-‐
with-‐hive/