Analyzing Twitter Data with Hadoop

Data Science Maryland, May 2013

Joey Echeverria | Director, Federal FTS
joey@cloudera.com | @fwiffo

©2013 Cloudera, Inc.

About Joey

•  Director, Federal FTS
•  Technical guy playing manager
•  2 years @ Cloudera
•  5 years of Hadoop
•  Local

BUILDING A BIG DATA SOLUTION

Big Data

•  Big
    •  Larger volume than you’ve handled before
    •  No litmus test
    •  High value, underutilized
•  Data
    •  Structured
    •  Unstructured
    •  Semi-structured
•  Hadoop
    •  Distributed file system
    •  Distributed, batch computation

Data Management Systems

[Diagram: Data Source → Data Ingestion → Data Storage → Data Processing]

Relational Data Management Systems

[Diagram: Data Source → ETL → RDBMS → Reporting]

A Canonical Hadoop Architecture

[Diagram: Data Source → Flume → HDFS → Hive (Impala)]

AN EXAMPLE USE CASE

Analyzing Twitter

•  Social media is popular with marketing teams
•  Twitter is an effective tool for promotion
•  Who is influential?
    •  Tweets
    •  Followers
    •  Retweets
        •  Similar to e-mail forwarding
•  Which Twitter user gets the most retweets?
•  Who is influential in our industry?

HOW DO WE ANSWER THESE QUESTIONS?

Techniques

•  SQL
    •  Filtering
    •  Aggregation
    •  Sorting
•  Complex data
    •  Deeply nested
    •  Variable schema

Architecture

[Diagram: Twitter → Flume (custom Flume source; sink to HDFS) → HDFS → Hive (JSON SerDe parses data; queries and ETL) and Impala (queries); Oozie adds partitions hourly]

TWITTER SOURCE

Flume

•  Streaming data flow
•  Sources
    •  Push or pull
•  Sinks
•  Event based
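
To make the source, sink, and event vocabulary concrete, here is a toy in-memory model of an event-based flow. It is only a sketch: `FlowSketch`, `offer`, and `drain` are hypothetical names and none of them are Flume classes. The buffer between the two sides plays the role that a channel plays in Flume.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of an event-based flow (not the Flume API).
class FlowSketch {
    // Buffer that decouples the producing side from the consuming side.
    private final Queue<byte[]> channel = new ArrayDeque<>();

    // Source side: wrap one incoming message as a discrete event body.
    void offer(String message) {
        channel.add(message.getBytes(StandardCharsets.UTF_8));
    }

    // Sink side: drain everything currently buffered, in arrival order.
    List<String> drain() {
        List<String> delivered = new ArrayList<>();
        byte[] body;
        while ((body = channel.poll()) != null) {
            delivered.add(new String(body, StandardCharsets.UTF_8));
        }
        return delivered;
    }
}
```

Each call to `offer` corresponds to one event; the sink never sees a partial message, which is the property that "event based" refers to here.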


Pulling Data From Twitter

•  Custom source, using twitter4j
•  Sources process data as discrete events

Loading Data Into HDFS

•  HDFS Sink comes stock with Flume
•  Easily separate files by creation time
    •  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
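
The `%Y/%m/%d/%H` escapes in the sink path are filled in from each event's timestamp, so tweets land in one directory per hour. The sketch below illustrates that expansion for the four escapes used above. `HdfsPathSketch` and `expand` are hypothetical names, and the use of UTC is an assumption made for reproducibility; it is not how Flume itself is implemented.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Illustration only: expands the date escapes used in the slide's sink path.
class HdfsPathSketch {
    // epochMillis is the event time in milliseconds since the epoch.
    // Assumption: the time is interpreted in UTC.
    static String expand(String pattern, long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return pattern
            .replace("%Y", String.format("%04d", t.getYear()))
            .replace("%m", String.format("%02d", t.getMonthValue()))
            .replace("%d", String.format("%02d", t.getDayOfMonth()))
            .replace("%H", String.format("%02d", t.getHour()));
    }
}
```

For an event stamped 14:30 UTC on 1 May 2013, the path above would expand to `hdfs://hadoop1:8020/user/flume/tweets/2013/05/01/14/`.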

Flume Source

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.source.AbstractSource;

public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}

Twitter API

•  Callback mechanism for catching new tweets

/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);

JSON Data

•  JSON data is processed as an event and written to HDFS

public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp", String.valueOf(
      status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);

  channel.processEvent(event);
}

FLUME DEMO

HIVE

What is Hive?

•  Created at Facebook
•  HiveQL
    •  SQL-like interface
•  Hive interpreter converts HiveQL to MapReduce code
•  Returns results to the client

Hive Details

•  Schema on read
•  Scalar types (int, float, double, boolean, string)
•  Complex types (struct, map, array)
•  Metastore contains table definitions
    •  Stored in a relational database
    •  Similar to catalog tables in other DBs

Complex Data

SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
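
The query runs in two stages: the inner SELECT keeps the highest retweet count seen for each (user, text) pair, and the outer SELECT sums those maxima per user and keeps the top ten. The sketch below mirrors that logic on an in-memory list so the grouping is easier to follow. It is not how Hive executes the query; the `Retweet` class and `topRetweeted` method are hypothetical, and the `tweet_count` column is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of the two-stage aggregation in the query above.
class RetweetRankSketch {
    // One row of retweeted_status: author, text, and retweet_count as observed.
    static final class Retweet {
        final String screenName;
        final String text;
        final long retweetCount;

        Retweet(String screenName, String text, long retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    static List<Map.Entry<String, Long>> topRetweeted(List<Retweet> rows, int limit) {
        // Inner query: max(retweet_count) GROUP BY screen_name, text.
        Map<List<String>, Long> maxPerTweet = rows.stream().collect(
            Collectors.toMap(
                r -> Arrays.asList(r.screenName, r.text),
                r -> r.retweetCount,
                Long::max));

        // Outer query: sum(retweets) GROUP BY screen_name.
        Map<String, Long> totalPerUser = maxPerTweet.entrySet().stream().collect(
            Collectors.groupingBy(
                e -> e.getKey().get(0),
                Collectors.summingLong(Map.Entry::getValue)));

        // ORDER BY total_retweets DESC LIMIT n.
        return totalPerUser.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Taking the maximum first matters because the same original tweet can appear many times in the stream, each time with a different `retweet_count`; summing the raw rows would count it repeatedly.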

JSON INTERLUDE

What is JSON?

•  Complex, semi-structured data
•  Based on JavaScript’s data syntax
•  Rich, nested data types:
    •  number
    •  string
    •  array
    •  object
    •  true, false
    •  null

What is JSON?

{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {
          "text": "Crowdsourcing",
          "indices": [0, 14]
        },
        {
          "text": "bigdata",
          "indices": [129, 137]
        }
      ],
      "user_mentions": []
    }
  }
}

Hive Serializers and Deserializers

•  Instructs Hive on how to interpret data
•  JSONSerDe
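
Once a SerDe has turned a raw record into nested fields, a query can address them with dotted paths such as `retweeted_status.user.screen_name`. The sketch below shows only that path-resolution idea over a record already parsed into nested maps. It does not use the Hive SerDe interface, the names `NestedFieldSketch` and `resolve` are hypothetical, and returning null for a missing field is an assumption chosen to suit variable-schema data.

```java
import java.util.Map;

// Illustration of addressing nested fields by dotted path (not Hive's SerDe API).
class NestedFieldSketch {
    // Walks one map level per path segment.
    // Returns null if a segment is missing or a non-map value is hit early.
    static Object resolve(Map<String, Object> record, String path) {
        Object current = record;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```

Because the structure is interpreted at query time, a tweet that lacks `retweeted_status` altogether does not need special handling when it is written; the lookup simply comes back empty.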


HIVE DEMO

IT’S A TRAP!

Photo from http://www.flickr.com/photos/vanf/6798065626/. Some rights reserved.

Not a Database

                      RDBMS                    Hive
Language              Generally >= SQL-92      Subset of SQL-92 plus Hive-specific extensions
Update capabilities   INSERT, UPDATE, DELETE   INSERT OVERWRITE; no UPDATE, DELETE
Transactions          Yes                      No
Latency               Sub-second               Minutes
Indexes               Yes                      Yes
Data size             Terabytes                Petabytes

ETL

•  Hive works great for SQL-based ETL
•  JSON -> SequenceFiles


IMPALA

Cloudera Impala

Real-Time Query for Data Stored in Hadoop.

Supports Hive SQL

4-30X faster than Hive over MapReduce

Uses existing drivers, integrates with existing metastore, works with leading BI tools

Flexible, cost-effective, no lock-in

Deploy & operate with Cloudera Enterprise RTQ

Supports multiple storage engines & file formats


Benefits of Cloudera Impala

Real-Time Query for Data Stored in Hadoop

•  Real-time queries run directly on source data
•  No ETL delays
•  No jumping between data silos
•  No double storage with EDW/RDBMS
•  Unlock analysis on more data
•  No need to create and maintain complex ETL between systems
•  No need to preplan schemas
•  All data available for interactive queries
•  No loss of fidelity from fixed data schemas
•  Single metadata store from origination through analysis
•  No need to hunt through multiple data silos

Cloudera Impala Details

[Diagram: a SQL App and ODBC sit above three identical nodes. Each node contains a Query Planner, Query Coordinator, and Query Exec Engine alongside an HDFS DataNode and HBase. Shared services: State Store, HDFS NameNode, Hive Metastore, YARN. Callouts: Common Hive SQL and interface; Unified metadata and scheduler; Low-latency scheduler and cache (low-impact failures); Fully MPP, distributed; Local direct reads.]

IMPALA DEMO

OOZIE AUTOMATION

Oozie: Everything in its Right Place

Oozie for Partition Management

•  Once an hour, add a partition
•  Takes advantage of advanced Hive functionality
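
An hourly job of this kind boils down to issuing one ALTER TABLE ... ADD PARTITION statement that points Hive at the directory written during that hour. The sketch below builds such a statement as a string. The partition column name `datehour`, the class `AddPartitionSketch`, and the method `addPartition` are assumptions for illustration; the table name and root directory are taken from the query and sink path shown earlier in the deck.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Builds the HiveQL that registers one hour of data as a partition.
class AddPartitionSketch {
    // Assumption: the table is partitioned by a numeric column named
    // datehour (yyyyMMddHH) and data lives under root/yyyy/MM/dd/HH.
    static String addPartition(String table, String root, LocalDateTime hour) {
        String key = hour.format(DateTimeFormatter.ofPattern("yyyyMMddHH"));
        String dir = hour.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
        return "ALTER TABLE " + table
            + " ADD IF NOT EXISTS PARTITION (datehour = " + key + ")"
            + " LOCATION '" + root + "/" + dir + "'";
    }
}
```

The IF NOT EXISTS clause keeps the statement safe to re-run if a scheduled job is retried for the same hour.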

OOZIE DEMO

PUTTING IT ALL TOGETHER

Complete Architecture

[Diagram: Twitter → Flume (custom Flume source; sink to HDFS) → HDFS → Hive (JSON SerDe parses data; queries and ETL) and Impala (queries); Oozie adds partitions hourly]

MORE DEMOS

What next?

•  Cloudera University
•  Download Hadoop!
    •  CDH available at www.cloudera.com
•  Cloudera provides pre-loaded VMs
    •  https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
•  Clone the source repo
    •  https://github.com/cloudera/cdh-twitter-example

My personal preference

•  Cloudera Manager
    •  https://ccp.cloudera.com/display/SUPPORT/Downloads
•  Free for unlimited nodes (formerly limited to 50)!

Shout Out

•  Jon Natkins
    •  @nattyice
•  Blog posts
    •  http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/
    •  …data-with-hadoop-part-2-gathering-data-with-flume/
    •  …data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

Questions?

•  Contact me!
    •  Joey Echeverria
    •  joey@cloudera.com
    •  @fwiffo
•  We’re hiring!
