SlideShare une entreprise Scribd logo
1  sur  49
Télécharger pour lire hors ligne
1
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
Data	
  Science	
  Maryland,	
  May	
  2013	
  
Joey	
  Echeverria	
  |	
  Director	
  Federal	
  FTS	
  
joey@cloudera.com	
  |	
  @fwiffo	
  
©2013 Cloudera, Inc.
About	
  Joey	
  
•  Director	
  Federal	
  FTS	
  
•  Technical	
  guy	
  playing	
  manager	
  
•  2	
  years	
  @	
  Cloudera	
  
•  5	
  years	
  of	
  Hadoop	
  
•  Local	
  
2
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
BUILDING	
  A	
  BIG	
  DATA	
  SOLUTION	
  
3	
   ©2013 Cloudera, Inc.
Big	
  Data	
  
•  Big	
  
•  Larger	
  volume	
  than	
  you’ve	
  handled	
  before	
  
•  No	
  litmus	
  test	
  
•  High	
  value,	
  under	
  uRlized	
  
•  Data	
  
•  Structured	
  
•  Unstructured	
  
•  Semi-­‐structured	
  
•  Hadoop	
  
•  Distributed	
  file	
  system	
  
•  Distributed,	
  batch	
  computaRon	
  
4 ©2013 Cloudera, Inc.
Data	
  Management	
  Systems	
  
5 ©2013 Cloudera, Inc.
Data	
  Source	
   Data	
  Storage	
  
Data	
  
IngesRon	
  
Data	
  
Processing	
  
RelaRonal	
  Data	
  Management	
  Systems	
  
6 ©2013 Cloudera, Inc.
Data	
  Source	
   RDBMS	
  ETL	
  
ReporRng	
  
A	
  Canonical	
  Hadoop	
  Architecture	
  
7 ©2013 Cloudera, Inc.
Data	
  Source	
   HDFS	
  Flume	
  
Hive	
  
(Impala)	
  
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
AN	
  EXAMPLE	
  USE	
  CASE	
  
8	
   ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  
•  Social	
  media	
  popular	
  with	
  markeRng	
  teams	
  
•  Twi,er	
  is	
  an	
  effecRve	
  tool	
  for	
  promoRon	
  
•  Who	
  is	
  influenRal?	
  
•  Tweets	
  
•  Followers	
  
•  Retweets	
  
•  Similar	
  to	
  e-­‐mail	
  forwarding	
  
•  Which	
  twi,er	
  user	
  gets	
  the	
  most	
  retweets?	
  
•  Who	
  is	
  influenRal	
  in	
  our	
  industry?	
  
9 ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
HOW	
  DO	
  WE	
  ANSWER	
  THESE	
  
QUESTIONS?	
  
10	
   ©2013 Cloudera, Inc.
Techniques	
  
•  SQL	
  
•  Filtering	
  
•  AggregaRon	
  
•  SorRng	
  
•  Complex	
  data	
  
•  Deeply	
  nested	
  
•  Variable	
  schema	
  
11
Architecture	
  
12 ©2013 Cloudera, Inc.
Twi,er	
  
HDFS	
  Flume	
   Hive	
  
Custom	
  
Flume	
  
Source	
  
Sink	
  to	
  
HDFS	
  
JSON	
  SerDe	
  
Parses	
  Data	
  
Oozie	
  
Add	
  
ParRRons	
  
Hourly	
  
Impala	
  
Queries	
  
Queries	
  
and	
  ETL	
  
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
TWITTER	
  SOURCE	
  
13	
   ©2013 Cloudera, Inc.
Flume	
  
•  Streaming	
  data	
  flow	
  
•  Sources	
  
•  Push	
  or	
  pull	
  
•  Sinks	
  
•  Event	
  based	
  
14 ©2013 Cloudera, Inc.
Pulling	
  Data	
  From	
  Twi,er	
  
•  Custom	
  source,	
  using	
  twi,er4j	
  
•  Sources	
  process	
  data	
  as	
  discrete	
  events	
  
Loading	
  Data	
  Into	
  HDFS	
  
•  HDFS	
  Sink	
  comes	
  stock	
  with	
  Flume	
  
•  Easily	
  separate	
  files	
  by	
  creaRon	
  Rme	
  
•  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
Flume	
  Source	
  
17 ©2013 Cloudera, Inc.
public class TwitterSource extends AbstractSource!
implements EventDrivenSource, Configurable {!
...!
// The initialization method for the Source. The context contains all !
// the Flume configuration info!
@Override!
public void configure(Context context) {!
...!
}!
...!
// Start processing events. Uses the Twitter Streaming API to sample!
// Twitter, and process tweets.!
@Override!
public void start() {!
...!
}!
...!
// Stops Source's event processing and shuts down the Twitter stream.!
@Override!
public void stop() {!
...!
}!
}!
!
Twi,er	
  API	
  
•  Callback	
  mechanism	
  for	
  catching	
  new	
  tweets	
  
18 ©2013 Cloudera, Inc.
/** The actual Twitter stream. It's set up to collect raw JSON data */!
private final TwitterStream twitterStream = new TwitterStreamFactory(!
new ConfigurationBuilder().setJSONStoreEnabled(true).build())!
.getInstance();!
...!
// The StatusListener is a twitter4j API that can be added to a stream,!
// and will call a method every time a message is sent to the stream.!
StatusListener listener = new StatusListener() {!
// The onStatus method is executed every time a new tweet comes in.!
public void onStatus(Status status) {!
... !
}!
}!
...!
// Set up the stream's listener (defined above), and set any necessary!
// security information.!
twitterStream.addListener(listener);!
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);!
AccessToken token = new AccessToken(accessToken, accessTokenSecret);!
twitterStream.setOAuthAccessToken(token);!
!
JSON	
  Data	
  
•  JSON	
  data	
  is	
  processed	
  as	
  an	
  event	
  and	
  wri,en	
  to	
  
HDFS	
  
19 ©2013 Cloudera, Inc.
public void onStatus(Status status) {!
// The EventBuilder is used to build an event using the headers and!
// the raw JSON of a tweet!
!
headers.put("timestamp", String.valueOf(!
status.getCreatedAt().getTime()));!
Event event = EventBuilder.withBody(!
DataObjectFactory.getRawJSON(status).getBytes(), headers);!
!
channel.processEvent(event);!
}!
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
FLUME	
  DEMO	
  
20	
   ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
HIVE	
  
21	
   ©2013 Cloudera, Inc.
What	
  is	
  Hive?	
  
•  Created	
  at	
  Facebook	
  
•  HiveQL	
  
•  SQL	
  like	
  interface	
  
•  Hive	
  interpreter	
  
converts	
  HiveQL	
  to	
  
MapReduce	
  code	
  
•  Returns	
  results	
  to	
  the	
  
client	
  
22	
   ©2013 Cloudera, Inc.
Hive	
  Details	
  
•  Schema	
  on	
  read	
  
•  Scalar	
  types	
  (int,	
  float,	
  double,	
  boolean,	
  string)	
  
•  Complex	
  types	
  (struct,	
  map,	
  array)	
  
•  Metastore	
  contains	
  table	
  definiRons	
  
•  Stored	
  in	
  a	
  relaRonal	
  database	
  
•  Similar	
  to	
  catalog	
  tables	
  in	
  other	
  DBs	
  
23
Complex	
  Data	
  
24 ©2013 Cloudera, Inc.
SELECT!
 t.retweet_screen_name,!
 sum(retweets) AS total_retweets,!
 count(*) AS tweet_count!
FROM (SELECT!
  retweeted_status.user.screen_name AS retweet_screen_name,!
    retweeted_status.text,!
    max(retweeted_status.retweet_count) AS retweets!
FROM tweets!
  GROUP BY!
retweeted_status.user.screen_name,!
      retweeted_status.text) t!
GROUP BY t.retweet_screen_name!
ORDER BY total_retweets DESC!
LIMIT 10;!
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
JSON	
  INTERLUDE	
  
25	
   ©2013 Cloudera, Inc.
What	
  is	
  JSON?	
  
•  Complex,	
  semi-­‐structured	
  data	
  
•  Based	
  on	
  JavaScript’s	
  data	
  syntax	
  
•  Rich,	
  nested	
  data	
  types:	
  
•  number	
  
•  string	
  
•  Array	
  
•  object	
  
•  true,	
  false	
  
•  null	
  
26 ©2013 Cloudera, Inc.
What	
  is	
  JSON?	
  
27 ©2013 Cloudera, Inc.
{!
"retweeted_status": {!
"contributors": null,!
"text": "#Crowdsourcing – drivers already generate traffic data for
your smartphone to suggest alternative routes when a road is clogged.
#bigdata",!
"retweeted": false,!
"entities": {!
"hashtags": [!
{!
"text": "Crowdsourcing",!
"indices": [0, 14]!
},!
{!
"text": "bigdata",!
"indices": [129,137]!
}!
],!
"user_mentions": []!
}!
}!
}!
Hive	
  Serializers	
  and	
  Deserializers	
  	
  
•  Instructs	
  Hive	
  on	
  how	
  to	
  interpret	
  data	
  
•  JSONSerDe	
  
28 ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
HIVE	
  DEMO	
  
29	
   ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
IT’S	
  A	
  TRAP!	
  
30	
   ©2013 Cloudera, Inc.
Photo	
  from	
  h,p://www.flickr.com/photos/vanf/6798065626/	
  Some	
  rights	
  reserved	
  
Not	
  a	
  Database	
  
31 ©2013 Cloudera, Inc.
RDBMS
 Hive
Language
Generally >= SQL-92

Subset of SQL-92 plus
Hive specific extensions
Update Capabilities
INSERT, UPDATE,
DELETE
INSERT OVERWRITE
no UPDATE, DELETE
Transactions
 Yes
 No
Latency
 Sub-second
 Minutes
Indexes
 Yes
 Yes
Data size
 Terabytes
 Petabytes
ETL	
  
•  Hive	
  works	
  great	
  for	
  SQL-­‐based	
  ETL	
  
•  JSON	
  -­‐>	
  SequenceFiles	
  
32 ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
IMPALA	
  
33	
   ©2013 Cloudera, Inc.
Cloudera	
  Impala	
  
34
Real-­‐Time	
  Query	
  for	
  Data	
  Stored	
  in	
  Hadoop.	
  
Supports	
  Hive	
  SQL	
  
4-­‐30X	
  faster	
  than	
  Hive	
  over	
  MapReduce	
  
Uses	
  exisRng	
  drivers,	
  integrates	
  with	
  exisRng	
  
metastore,	
  works	
  with	
  leading	
  BI	
  tools	
  
Flexible,	
  cost-­‐effecRve,	
  no	
  lock-­‐in	
  
Deploy	
  &	
  operate	
  with	
  
Cloudera	
  Enterprise	
  RTQ	
  
Supports	
  mulRple	
  storage	
  engines	
  &	
  	
  
file	
  formats	
  
©2013 Cloudera, Inc.
Benefits	
  of	
  Cloudera	
  Impala	
  
35
Real-­‐Time	
  Query	
  for	
  Data	
  Stored	
  in	
  Hadoop	
  
• Real-­‐Rme	
  queries	
  run	
  directly	
  on	
  source	
  data	
  
• No	
  ETL	
  delays	
  
• No	
  jumping	
  between	
  data	
  silos	
  
• 	
  No	
  double	
  storage	
  with	
  EDW/RDBMS	
  
• 	
  Unlock	
  analysis	
  on	
  more	
  data	
  
• 	
  No	
  need	
  to	
  create	
  and	
  maintain	
  complex	
  ETL	
  between	
  systems	
  
• 	
  No	
  need	
  to	
  preplan	
  schemas	
  
• 	
  All	
  data	
  available	
  for	
  interacRve	
  queries	
  
• No	
  loss	
  of	
  fidelity	
  from	
  fixed	
  data	
  schemas	
  
• Single	
  metadata	
  store	
  from	
  originaRon	
  	
  through	
  analysis	
  
• No	
  need	
  to	
  hunt	
  through	
  mulRple	
  data	
  silos	
  
©2013 Cloudera, Inc.
Cloudera	
  Impala	
  Details	
  
36	
   ©2013 Cloudera, Inc.
HDFS	
  DN	
  
Query	
  Exec	
  Engine	
  
Query	
  Coordinator	
  
Query	
  Planner	
  
HBase	
  
ODBC	
  
SQL	
  App	
  
HDFS	
  DN	
  
Query	
  Exec	
  Engine	
  
Query	
  Coordinator	
  
Query	
  Planner	
  
HBase	
  HDFS	
  DN	
  
Query	
  Exec	
  Engine	
  
Query	
  Coordinator	
  
Query	
  Planner	
  
HBase	
  
Fully	
  MPP	
  
Distributed	
  
Local	
  Direct	
  Reads	
  
State	
  Store	
  
HDFS	
  NN	
  
Hive	
  
Metastore	
   YARN	
  
Common	
  Hive	
  SQL	
  and	
  interface	
  
Unified	
  metadata	
  and	
  scheduler	
  
Low-­‐latency	
  scheduler	
  and	
  cache	
  
(low-­‐impact	
  failures)	
  
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
IMPALA	
  DEMO	
  
37	
   ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
OOZIE	
  AUTOMATION	
  
38	
   ©2013 Cloudera, Inc.
Oozie:	
  Everything	
  in	
  its	
  Right	
  Place	
  
Oozie	
  for	
  ParRRon	
  Management	
  
•  Once	
  an	
  hour,	
  add	
  a	
  parRRon	
  
•  Takes	
  advantage	
  of	
  advanced	
  Hive	
  funcRonality	
  
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
OOZIE	
  DEMO	
  
41	
   ©2013 Cloudera, Inc.
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
PUTTING	
  IT	
  ALL	
  TOGETHER	
  
42	
   ©2013 Cloudera, Inc.
Complete	
  Architecture	
  
43 ©2013 Cloudera, Inc.
Twi,er	
  
HDFS	
  Flume	
   Hive	
  
Custom	
  
Flume	
  
Source	
  
Sink	
  to	
  
HDFS	
  
JSON	
  SerDe	
  
Parses	
  Data	
  
Oozie	
  
Add	
  
ParRRons	
  
Hourly	
  
Impala	
  
Queries	
  
Queries	
  
and	
  ETL	
  
Analyzing	
  Twi,er	
  Data	
  with	
  Hadoop	
  
MORE	
  DEMOS	
  
44	
   ©2013 Cloudera, Inc.
What	
  next?	
  
•  Cloudera	
  University	
  
•  Download	
  Hadoop!	
  
•  CDH	
  available	
  at	
  www.cloudera.com	
  
•  Cloudera	
  provides	
  pre-­‐loaded	
  VMs	
  
•  h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera
+Manager+Free+EdiRon+Demo+VM	
  
•  Clone	
  the	
  source	
  repo	
  
•  h,ps://github.com/cloudera/cdh-­‐twi,er-­‐example	
  
My	
  personal	
  preference	
  
•  Cloudera	
  Manager	
  
•  h,ps://ccp.cloudera.com/display/SUPPORT/Downloads	
  
•  Free	
  up	
  to	
  50	
  	
  unlimited	
  nodes!	
  
Shout	
  Out	
  
•  Jon	
  Natkins	
  
•  @na,yice	
  
•  Blog	
  posts	
  
•  h,p://blog.cloudera.com/blog/2013/09/analyzing-­‐twi,er-­‐
data-­‐with-­‐hadoop/	
  
•  h,p://blog.cloudera.com/blog/2013/10/analyzing-­‐twi,er-­‐
data-­‐with-­‐hadoop-­‐part-­‐2-­‐gathering-­‐data-­‐with-­‐flume/	
  
•  h,p://blog.cloudera.com/blog/2013/11/analyzing-­‐twi,er-­‐
data-­‐with-­‐hadoop-­‐part-­‐3-­‐querying-­‐semi-­‐structured-­‐data-­‐
with-­‐hive/	
  
QuesRons?	
  
•  Contact	
  me!	
  
•  Joey	
  Echeverria	
  
•  joey@cloudera.com	
  
•  @fwiffo	
  
•  We’re	
  hiring!	
  
49 ©2013 Cloudera, Inc.

Contenu connexe

Tendances

Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...DataWorks Summit
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaLucidworks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera, Inc.
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Lucidworks
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Sematext Group, Inc.
 
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Lucidworks
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2Rafał Kuć
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning ElasticsearchAnurag Patel
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemCloudera, Inc.
 

Tendances (20)

Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
Realtime Analytics and Anomalities Detection using Elasticsearch, Hadoop and ...
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger Insights
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
 
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
Cross Data Center Replication for the Enterprise: Presented by Adam Williams,...
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
Elasticsearch speed is key
Elasticsearch speed is keyElasticsearch speed is key
Elasticsearch speed is key
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 

Similaire à Analyzing twitter data with hadoop

Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Bringing Trus and Visibility to Apache Hadoop
Bringing Trus and Visibility to Apache HadoopBringing Trus and Visibility to Apache Hadoop
Bringing Trus and Visibility to Apache HadoopDataWorks Summit
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoopNiel Dunnage
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
大数据数据治理及数据安全
大数据数据治理及数据安全大数据数据治理及数据安全
大数据数据治理及数据安全Jianwei Li
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsAaron Brooks
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Stefan Lipp
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Cloudera, Inc.
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 

Similaire à Analyzing twitter data with hadoop (20)

Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Bringing Trus and Visibility to Apache Hadoop
Bringing Trus and Visibility to Apache HadoopBringing Trus and Visibility to Apache Hadoop
Bringing Trus and Visibility to Apache Hadoop
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoop
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
大数据数据治理及数据安全
大数据数据治理及数据安全大数据数据治理及数据安全
大数据数据治理及数据安全
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 

Plus de Joey Echeverria

Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityJoey Echeverria
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

Plus de Joey Echeverria (11)

Debugging Apache Spark
Debugging Apache SparkDebugging Apache Spark
Debugging Apache Spark
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
The Future of Apache Hadoop Security
The Future of Apache Hadoop SecurityThe Future of Apache Hadoop Security
The Future of Apache Hadoop Security
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Analyzing twitter data with hadoop

  • 1. 1 Analyzing  Twi,er  Data  with  Hadoop   Data  Science  Maryland,  May  2013   Joey  Echeverria  |  Director  Federal  FTS   joey@cloudera.com  |  @fwiffo   ©2013 Cloudera, Inc.
  • 2. About  Joey   •  Director  Federal  FTS   •  Technical  guy  playing  manager   •  2  years  @  Cloudera   •  5  years  of  Hadoop   •  Local   2
  • 3. Analyzing  Twi,er  Data  with  Hadoop   BUILDING  A  BIG  DATA  SOLUTION   3   ©2013 Cloudera, Inc.
  • 4. Big  Data   •  Big   •  Larger  volume  than  you’ve  handled  before   •  No  litmus  test   •  High  value,  under  uRlized   •  Data   •  Structured   •  Unstructured   •  Semi-­‐structured   •  Hadoop   •  Distributed  file  system   •  Distributed,  batch  computaRon   4 ©2013 Cloudera, Inc.
  • 5. Data  Management  Systems   5 ©2013 Cloudera, Inc. Data  Source   Data  Storage   Data   IngesRon   Data   Processing  
  • 6. RelaRonal  Data  Management  Systems   6 ©2013 Cloudera, Inc. Data  Source   RDBMS  ETL   ReporRng  
  • 7. A  Canonical  Hadoop  Architecture   7 ©2013 Cloudera, Inc. Data  Source   HDFS  Flume   Hive   (Impala)  
  • 8. Analyzing  Twi,er  Data  with  Hadoop   AN  EXAMPLE  USE  CASE   8   ©2013 Cloudera, Inc.
  • 9. Analyzing  Twi,er   •  Social  media  popular  with  markeRng  teams   •  Twi,er  is  an  effecRve  tool  for  promoRon   •  Who  is  influenRal?   •  Tweets   •  Followers   •  Retweets   •  Similar  to  e-­‐mail  forwarding   •  Which  twi,er  user  gets  the  most  retweets?   •  Who  is  influenRal  in  our  industry?   9 ©2013 Cloudera, Inc.
  • 10. Analyzing  Twi,er  Data  with  Hadoop   HOW  DO  WE  ANSWER  THESE   QUESTIONS?   10   ©2013 Cloudera, Inc.
  • 11. Techniques   •  SQL   •  Filtering   •  AggregaRon   •  SorRng   •  Complex  data   •  Deeply  nested   •  Variable  schema   11
  • 12. Architecture   12 ©2013 Cloudera, Inc. Twi,er   HDFS  Flume   Hive   Custom   Flume   Source   Sink  to   HDFS   JSON  SerDe   Parses  Data   Oozie   Add   ParRRons   Hourly   Impala   Queries   Queries   and  ETL  
  • 13. Analyzing  Twi,er  Data  with  Hadoop   TWITTER  SOURCE   13   ©2013 Cloudera, Inc.
  • 14. Flume   •  Streaming  data  flow   •  Sources   •  Push  or  pull   •  Sinks   •  Event  based   14 ©2013 Cloudera, Inc.
  • 15. Pulling  Data  From  Twi,er   •  Custom  source,  using  twi,er4j   •  Sources  process  data  as  discrete  events  
  • 16. Loading  Data  Into  HDFS   •  HDFS  Sink  comes  stock  with  Flume   •  Easily  separate  files  by  creaRon  Rme   •  hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
  • 17. Flume  Source   17 ©2013 Cloudera, Inc. public class TwitterSource extends AbstractSource! implements EventDrivenSource, Configurable {! ...! // The initialization method for the Source. The context contains all ! // the Flume configuration info! @Override! public void configure(Context context) {! ...! }! ...! // Start processing events. Uses the Twitter Streaming API to sample! // Twitter, and process tweets.! @Override! public void start() {! ...! }! ...! // Stops Source's event processing and shuts down the Twitter stream.! @Override! public void stop() {! ...! }! }! !
  • 18. Twi,er  API   •  Callback  mechanism  for  catching  new  tweets   18 ©2013 Cloudera, Inc. /** The actual Twitter stream. It's set up to collect raw JSON data */! private final TwitterStream twitterStream = new TwitterStreamFactory(! new ConfigurationBuilder().setJSONStoreEnabled(true).build())! .getInstance();! ...! // The StatusListener is a twitter4j API that can be added to a stream,! // and will call a method every time a message is sent to the stream.! StatusListener listener = new StatusListener() {! // The onStatus method is executed every time a new tweet comes in.! public void onStatus(Status status) {! ... ! }! }! ...! // Set up the stream's listener (defined above), and set any necessary! // security information.! twitterStream.addListener(listener);! twitterStream.setOAuthConsumer(consumerKey, consumerSecret);! AccessToken token = new AccessToken(accessToken, accessTokenSecret);! twitterStream.setOAuthAccessToken(token);! !
  • 19. JSON  Data   •  JSON  data  is  processed  as  an  event  and  wri,en  to   HDFS   19 ©2013 Cloudera, Inc. public void onStatus(Status status) {! // The EventBuilder is used to build an event using the headers and! // the raw JSON of a tweet! ! headers.put("timestamp", String.valueOf(! status.getCreatedAt().getTime()));! Event event = EventBuilder.withBody(! DataObjectFactory.getRawJSON(status).getBytes(), headers);! ! channel.processEvent(event);! }!
  • 20. Analyzing  Twi,er  Data  with  Hadoop   FLUME  DEMO   20   ©2013 Cloudera, Inc.
  • 21. Analyzing  Twi,er  Data  with  Hadoop   HIVE   21   ©2013 Cloudera, Inc.
  • 22. What  is  Hive?   •  Created  at  Facebook   •  HiveQL   •  SQL  like  interface   •  Hive  interpreter   converts  HiveQL  to   MapReduce  code   •  Returns  results  to  the   client   22   ©2013 Cloudera, Inc.
  • 23. Hive  Details   •  Schema  on  read   •  Scalar  types  (int,  float,  double,  boolean,  string)   •  Complex  types  (struct,  map,  array)   •  Metastore  contains  table  definiRons   •  Stored  in  a  relaRonal  database   •  Similar  to  catalog  tables  in  other  DBs   23
  • 24. Complex  Data   24 ©2013 Cloudera, Inc. SELECT!  t.retweet_screen_name,!  sum(retweets) AS total_retweets,!  count(*) AS tweet_count! FROM (SELECT!   retweeted_status.user.screen_name AS retweet_screen_name,!     retweeted_status.text,!     max(retweeted_status.retweet_count) AS retweets! FROM tweets!   GROUP BY! retweeted_status.user.screen_name,!       retweeted_status.text) t! GROUP BY t.retweet_screen_name! ORDER BY total_retweets DESC! LIMIT 10;!
  • 25. Analyzing  Twi,er  Data  with  Hadoop   JSON  INTERLUDE   25   ©2013 Cloudera, Inc.
  • 26. What  is  JSON?   •  Complex,  semi-­‐structured  data   •  Based  on  JavaScript’s  data  syntax   •  Rich,  nested  data  types:   •  number   •  string   •  Array   •  object   •  true,  false   •  null   26 ©2013 Cloudera, Inc.
  • 27. What  is  JSON?   27 ©2013 Cloudera, Inc. {! "retweeted_status": {! "contributors": null,! "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",! "retweeted": false,! "entities": {! "hashtags": [! {! "text": "Crowdsourcing",! "indices": [0, 14]! },! {! "text": "bigdata",! "indices": [129,137]! }! ],! "user_mentions": []! }! }! }!
  • 28. Hive  Serializers  and  Deserializers     •  Instructs  Hive  on  how  to  interpret  data   •  JSONSerDe   28 ©2013 Cloudera, Inc.
  • 29. Analyzing  Twi,er  Data  with  Hadoop   HIVE  DEMO   29   ©2013 Cloudera, Inc.
  • 30. Analyzing  Twi,er  Data  with  Hadoop   IT’S  A  TRAP!   30   ©2013 Cloudera, Inc. Photo  from  h,p://www.flickr.com/photos/vanf/6798065626/  Some  rights  reserved  
  • 31. Not  a  Database   31 ©2013 Cloudera, Inc. RDBMS Hive Language Generally >= SQL-92 Subset of SQL-92 plus Hive specific extensions Update Capabilities INSERT, UPDATE, DELETE INSERT OVERWRITE no UPDATE, DELETE Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes
  • 32. ETL   •  Hive  works  great  for  SQL-­‐based  ETL   •  JSON  -­‐>  SequenceFiles   32 ©2013 Cloudera, Inc.
  • 33. Analyzing  Twi,er  Data  with  Hadoop   IMPALA   33   ©2013 Cloudera, Inc.
  • 34. Cloudera  Impala   34 Real-­‐Time  Query  for  Data  Stored  in  Hadoop.   Supports  Hive  SQL   4-­‐30X  faster  than  Hive  over  MapReduce   Uses  exisRng  drivers,  integrates  with  exisRng   metastore,  works  with  leading  BI  tools   Flexible,  cost-­‐effecRve,  no  lock-­‐in   Deploy  &  operate  with   Cloudera  Enterprise  RTQ   Supports  mulRple  storage  engines  &     file  formats   ©2013 Cloudera, Inc.
  • 35. Benefits  of  Cloudera  Impala   35 Real-­‐Time  Query  for  Data  Stored  in  Hadoop   • Real-­‐Rme  queries  run  directly  on  source  data   • No  ETL  delays   • No  jumping  between  data  silos   •   No  double  storage  with  EDW/RDBMS   •   Unlock  analysis  on  more  data   •   No  need  to  create  and  maintain  complex  ETL  between  systems   •   No  need  to  preplan  schemas   •   All  data  available  for  interacRve  queries   • No  loss  of  fidelity  from  fixed  data  schemas   • Single  metadata  store  from  originaRon    through  analysis   • No  need  to  hunt  through  mulRple  data  silos   ©2013 Cloudera, Inc.
  • 36. Cloudera  Impala  Details   36   ©2013 Cloudera, Inc. HDFS  DN   Query  Exec  Engine   Query  Coordinator   Query  Planner   HBase   ODBC   SQL  App   HDFS  DN   Query  Exec  Engine   Query  Coordinator   Query  Planner   HBase  HDFS  DN   Query  Exec  Engine   Query  Coordinator   Query  Planner   HBase   Fully  MPP   Distributed   Local  Direct  Reads   State  Store   HDFS  NN   Hive   Metastore   YARN   Common  Hive  SQL  and  interface   Unified  metadata  and  scheduler   Low-­‐latency  scheduler  and  cache   (low-­‐impact  failures)  
  • 37. Analyzing  Twi,er  Data  with  Hadoop   IMPALA  DEMO   37   ©2013 Cloudera, Inc.
  • 38. Analyzing  Twi,er  Data  with  Hadoop   OOZIE  AUTOMATION   38   ©2013 Cloudera, Inc.
  • 39. Oozie:  Everything  in  its  Right  Place  
  • 40. Oozie  for  ParRRon  Management   •  Once  an  hour,  add  a  parRRon   •  Takes  advantage  of  advanced  Hive  funcRonality  
  • 41. Analyzing  Twi,er  Data  with  Hadoop   OOZIE  DEMO   41   ©2013 Cloudera, Inc.
  • 42. Analyzing  Twi,er  Data  with  Hadoop   PUTTING  IT  ALL  TOGETHER   42   ©2013 Cloudera, Inc.
  • 43. Complete  Architecture   43 ©2013 Cloudera, Inc. Twi,er   HDFS  Flume   Hive   Custom   Flume   Source   Sink  to   HDFS   JSON  SerDe   Parses  Data   Oozie   Add   ParRRons   Hourly   Impala   Queries   Queries   and  ETL  
  • 44. Analyzing  Twi,er  Data  with  Hadoop   MORE  DEMOS   44   ©2013 Cloudera, Inc.
  • 45. What  next?   •  Cloudera  University   •  Download  Hadoop!   •  CDH  available  at  www.cloudera.com   •  Cloudera  provides  pre-­‐loaded  VMs   •  h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera +Manager+Free+EdiRon+Demo+VM   •  Clone  the  source  repo   •  h,ps://github.com/cloudera/cdh-­‐twi,er-­‐example  
  • 46. My  personal  preference   •  Cloudera  Manager   •  h,ps://ccp.cloudera.com/display/SUPPORT/Downloads   •  Free  up  to  50    unlimited  nodes!  
  • 47. Shout  Out   •  Jon  Natkins   •  @na,yice   •  Blog  posts   •  h,p://blog.cloudera.com/blog/2013/09/analyzing-­‐twi,er-­‐ data-­‐with-­‐hadoop/   •  h,p://blog.cloudera.com/blog/2013/10/analyzing-­‐twi,er-­‐ data-­‐with-­‐hadoop-­‐part-­‐2-­‐gathering-­‐data-­‐with-­‐flume/   •  h,p://blog.cloudera.com/blog/2013/11/analyzing-­‐twi,er-­‐ data-­‐with-­‐hadoop-­‐part-­‐3-­‐querying-­‐semi-­‐structured-­‐data-­‐ with-­‐hive/  
  • 48. QuesRons?   •  Contact  me!   •  Joey  Echeverria   •  joey@cloudera.com   •  @fwiffo   •  We’re  hiring!