SlideShare une entreprise Scribd logo
1  sur  47
Analyzing Twitter Data with Hadoop
    Open Analytics Summit, March 2013
    Joey Echeverria | Principal Solutions Architect
    joey@cloudera.com | @fwiffo




1                               ©2012 Cloudera, Inc.
About Joey

    • Principal Solutions Architect
    • 18 months
    • 4+ years
    • Local




2
Analyzing Twitter Data with Hadoop




     BUILDING A BIG DATA SOLUTION




3                   ©2012 Cloudera, Inc.
Big Data

    •   Big
        •   Larger volume than you’ve handled before
              •   No litmus test
        •   High value, under utilized
    •   Data
        •   Structured
        •   Unstructured
        •   Semi-structured
    •   Hadoop
        • Distributed file system
        • Distributed, batch computation


4                                  ©2012 Cloudera, Inc.
Data Management Systems




                                                               Data
                                                            Processing
       Data Source
                       Data
                                             Data Storage
                     Ingestion




5                                ©2012 Cloudera, Inc.
Relational Data Management Systems




                                                  Reporting

       Data Source   ETL                 RDBMS




6                          ©2012 Cloudera, Inc.
A Canonical Hadoop Architecture




                                                      Hive
                                                    (Impala)
       Data Source   Flume                  HDFS




7                            ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




     AN EXAMPLE USE CASE




8                   ©2012 Cloudera, Inc.
Analyzing Twitter

    • Social media popular with marketing teams
    • Twitter is an effective tool for promotion
    • Who is influential?
        •   Tweets
        •   Followers
        •   Retweets
             •   Similar to e-mail forwarding
    • Which twitter user gets the most retweets?
    • Who is influential in our industry?


9                                   ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      HOW DO WE ANSWER THESE
      QUESTIONS?




10                   ©2012 Cloudera, Inc.
Techniques

     •   SQL
         •   Filtering
         •   Aggregation
         •   Sorting
     •   Complex data
         •   Deeply nested
         •   Variable schema




11
Architecture



             Twitter                                                  Oozie

        Custom                                                             Add
        Flume                                                            Partitions
        Source                                                            Hourly
                       Sink to                          JSON SerDe
                        HDFS                            Parses Data
            Flume                        HDFS                         Hive




12                               ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      TWITTER SOURCE




13                   ©2012 Cloudera, Inc.
Flume

     • Streaming data flow
     • Sources
         •   Push or pull
     • Sinks
     • Event based




14                          ©2012 Cloudera, Inc.
Pulling Data From Twitter

• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS

• HDFS Sink comes stock with Flume
• Easily separate files by creation time
    •   hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
Flume Source
     public class TwitterSource extends AbstractSource
         implements EventDrivenSource, Configurable {
       ...
       // The initialization method for the Source. The context contains all
       // the Flume configuration info
       @Override
       public void configure(Context context) {
         ...
       }
       ...
       // Start processing events. Uses the Twitter Streaming API to sample
       // Twitter, and process tweets.
       @Override
       public void start() {
         ...
       }
       ...
       // Stops Source's event processing and shuts down the Twitter stream.
       @Override
       public void stop() {
         ...
       }
     }

17                                          ©2012 Cloudera, Inc.
Twitter API

     •   Callback mechanism for catching new tweets
     /** The actual Twitter stream. It's set up to collect raw JSON data */
     private final TwitterStream twitterStream = new TwitterStreamFactory(
       new ConfigurationBuilder().setJSONStoreEnabled(true).build())
         .getInstance();
     ...
     // The StatusListener is a twitter4j API that can be added to a stream,
     // and will call a method every time a message is sent to the stream.
     StatusListener listener = new StatusListener() {
       // The onStatus method is executed every time a new tweet comes in.
       public void onStatus(Status status) {
         ...
       }
     }
     ...
     // Set up the stream's listener (defined above), and set any necessary
     // security information.
     twitterStream.addListener(listener);
     twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
     AccessToken token = new AccessToken(accessToken, accessTokenSecret);
     twitterStream.setOAuthAccessToken(token);

18                                       ©2012 Cloudera, Inc.
JSON Data

     •     JSON data is processed as an event and written to
           HDFS
     public void onStatus(Status status) {
      // The EventBuilder is used to build an event using the headers and
      // the raw JSON of a tweet

         headers.put("timestamp", String.valueOf(
          status.getCreatedAt().getTime()));
         Event event = EventBuilder.withBody(
          DataObjectFactory.getRawJSON(status).getBytes(), headers);

         channel.processEvent(event);
     }




19                                          ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      FLUME DEMO




20                   ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      HIVE




21                   ©2012 Cloudera, Inc.
What is Hive?

     • Created at Facebook
     • HiveQL
         •   SQL like interface
     • Hive interpreter
       converts HiveQL to
       MapReduce code
     • Returns results to the
       client


22                                ©2012 Cloudera, Inc.
Hive Details

     • Schema on read
     • Scalar types (int, float, double, boolean, string)
     • Complex types (struct, map, array)
     • Metastore contains table definitions
         •   Stored in a relational database
         •   Similar to catalog tables in other DBs




23
Complex Data

     SELECT
      t.retweeted_screen_name,
      sum(retweets) AS total_retweets,
      count(*) AS tweet_count
     FROM (SELECT
           retweeted_status.user.screen_name AS retweet_screen_name,
           retweeted_status.text,
           max(retweet_count) AS retweets
         FROM tweets
         GROUP BY
             retweeted_status.user.screen_name,
          retweeted_status.text) t
     GROUP BY t.retweet_screen_name
     ORDER BY total_retweets DESC
     LIMIT 10;



24                                  ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      JSON INTERLUDE




25                   ©2012 Cloudera, Inc.
What is JSON?

     • Complex, semi-structured data
     • Based on JavaScript’s data syntax
     • Rich, nested data types:
         •   number
         •   string
         •   Array
         •   object
         •   true, false
         •   null


26                         ©2012 Cloudera, Inc.
What is JSON?
     {
       "retweeted_status": {
         "contributors": null,
         "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest
     alternative routes when a road is clogged. #bigdata",
         "retweeted": false,
         "entities": {
           "hashtags": [
             {
               "text": "Crowdsourcing",
               "indices": [0, 14]
             },
             {
               "text": "bigdata",
               "indices": [129,137]
             }
           ],
           "user_mentions": []
         }
       }
     }



27                                           ©2012 Cloudera, Inc.
Hive Serializers and Deserializers

     • Instructs Hive on how to interpret data
     • JSONSerDe




28                        ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      HIVE DEMO




29                   ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      IT’S A TRAP




30                   ©2012 Cloudera, Inc.
Not a Database

                       RDBMS                        Hive
                                                    Subset of SQL-92 plus
                       Generally >= SQL-92
 Language                                           Hive specific
                                                    extensions
                       INSERT, UPDATE,              INSERT OVERWRITE
 Update Capabilities
                       DELETE                       no UPDATE, DELETE
 Transactions          Yes                          No
 Latency               Sub-second                   Minutes
 Indexes               Yes                          Yes
 Data size             Terabytes                    Petabytes



31                           ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      IMPALA ASIDE




32                   ©2012 Cloudera, Inc.
Cloudera Impala
     Real-Time Query for Data Stored in Hadoop.

                                   Supports Hive SQL

                                   4-30X faster than Hive over MapReduce

                                   Supports multiple storage engines &
                                   file formats

                                   Uses existing drivers, integrates with existing
                                   metastore, works with leading BI tools

                                   Flexible, cost-effective, no lock-in

                                   Deploy & operate with
                                   Cloudera Enterprise RTQ

33                           ©2012 Cloudera, Inc.
Benefits of Cloudera Impala
     Real-Time Query for Data Stored in Hadoop

                            • Real-time queries run directly on source data
                            • No ETL delays
                            • No jumping between data silos

                            •   No double storage with EDW/RDBMS
                            •   Unlock analysis on more data
                            •   No need to create and maintain complex ETL between systems
                            •   No need to preplan schemas

                            • All data available for interactive queries
                            • No loss of fidelity from fixed data schemas


                            • Single metadata store from origination through analysis
                            • No need to hunt through multiple data silos



34                               ©2012 Cloudera, Inc.
Cloudera Impala Details
                                                          Unified metadata and scheduler
                                                      Hive
                                                    Metastore         YARN        HDFS NN
                         SQL App

                          ODBC                                     State Store

                                                                       Low-latency scheduler and cache
 Common Hive SQL and interface                                               (low-impact failures)

       Query Planner                 Query Planner              Fully MPP        Query Planner
                                                                Distributed
     Query Coordinator             Query Coordinator                          Query Coordinator

     Query Exec Engine             Query Exec Engine                          Query Exec Engine

     HDFS DN   HBase               HDFS DN       HBase                        HDFS DN       HBase
                                                                      Local Direct Reads


35                                 ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      OOZIE AUTOMATION




36                   ©2012 Cloudera, Inc.
Oozie: Everything in its Right Place
Oozie for Partition Management

• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
Analyzing Twitter Data with Hadoop




      OOZIE DEMO




39                   ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      PUTTING IT ALL TOGETHER




40                   ©2012 Cloudera, Inc.
Complete Architecture



             Twitter                                                  Oozie

        Custom                                                             Add
        Flume                                                            Partitions
        Source                                                            Hourly
                       Sink to                          JSON SerDe
                        HDFS                            Parses Data
            Flume                        HDFS                         Hive




41                               ©2012 Cloudera, Inc.
Analyzing Twitter Data with Hadoop




      MORE DEMOS




42                   ©2012 Cloudera, Inc.
What next?

• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
    •   https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma
        nager+Free+Edition+Demo+VM
•   Clone the source repo
    •   https://github.com/cloudera/cdh-twitter-example
My personal preference

•   Cloudera Manager
    •   https://ccp.cloudera.com/display/SUPPORT/Downloads
•   Free up to 50 nodes
Shout Out

• Jon Natkins
• @nattybnatkins
• Blog posts
    •   http://blog.cloudera.com/blog/2012/09/analyzing-twitter-
        data-with-hadoop/
    •   http://blog.cloudera.com/blog/2012/10/analyzing-twitter-
        data-with-hadoop-part-2-gathering-data-with-flume/
    •   http://blog.cloudera.com/blog/2012/11/analyzing-twitter-
        data-with-hadoop-part-3-querying-semi-structured-data-
        with-hive/
Questions?

• Contact me!
• Joey Echeverria
• joey@cloudera.com
• @fwiffo




•   We’re hiring!
47   ©2012 Cloudera, Inc.

Contenu connexe

En vedette

Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Big Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMBig Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMJeffrey Breen
 
Icons and Stencils for Hadoop
Icons and Stencils for HadoopIcons and Stencils for Hadoop
Icons and Stencils for HadoopHortonworks
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big DataDATAVERSITY
 
Alta White Paper D2C eCommerce Case Study 2016
Alta White Paper D2C eCommerce Case Study 2016Alta White Paper D2C eCommerce Case Study 2016
Alta White Paper D2C eCommerce Case Study 2016Patrick Nicholson
 

En vedette (13)

Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Big Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VMBig Data Step-by-Step: Infrastructure 1/3: Local VM
Big Data Step-by-Step: Infrastructure 1/3: Local VM
 
Icons and Stencils for Hadoop
Icons and Stencils for HadoopIcons and Stencils for Hadoop
Icons and Stencils for Hadoop
 
Data Modeling for Big Data
Data Modeling for Big DataData Modeling for Big Data
Data Modeling for Big Data
 
Alta White Paper D2C eCommerce Case Study 2016
Alta White Paper D2C eCommerce Case Study 2016Alta White Paper D2C eCommerce Case Study 2016
Alta White Paper D2C eCommerce Case Study 2016
 
"15 Business Story Ideas to Jump on Now"
"15 Business Story Ideas to Jump on Now""15 Business Story Ideas to Jump on Now"
"15 Business Story Ideas to Jump on Now"
 
Information från Läkemedelsverket #5 2013
Information från Läkemedelsverket #5 2013Information från Läkemedelsverket #5 2013
Information från Läkemedelsverket #5 2013
 
cathy resume
cathy resumecathy resume
cathy resume
 

Plus de Open Analytics

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Open Analytics
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Open Analytics
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)Open Analytics
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)Open Analytics
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)Open Analytics
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Open Analytics
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationOpen Analytics
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsOpen Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital EconomyOpen Analytics
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Open Analytics
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Open Analytics
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...Open Analytics
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Open Analytics
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Open Analytics
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)Open Analytics
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYCOpen Analytics
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupOpen Analytics
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupOpen Analytics
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalOpen Analytics
 

Plus de Open Analytics (20)

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
 

Analyzing twitter data with hadoop

  • 1. Analyzing Twitter Data with Hadoop Open Analytics Summit, March 2013 Joey Echeverria | Principal Solutions Architect joey@cloudera.com | @fwiffo 1 ©2012 Cloudera, Inc.
  • 2. About Joey • Principal Solutions Architect • 18 months • 4+ years • Local 2
  • 3. Analyzing Twitter Data with Hadoop BUILDING A BIG DATA SOLUTION 3 ©2012 Cloudera, Inc.
  • 4. Big Data • Big • Larger volume than you’ve handled before • No litmus test • High value, under utilized • Data • Structured • Unstructured • Semi-structured • Hadoop • Distributed file system • Distributed, batch computation 4 ©2012 Cloudera, Inc.
  • 5. Data Management Systems Data Processing Data Source Data Data Storage Ingestion 5 ©2012 Cloudera, Inc.
  • 6. Relational Data Management Systems Reporting Data Source ETL RDBMS 6 ©2012 Cloudera, Inc.
  • 7. A Canonical Hadoop Architecture Hive (Impala) Data Source Flume HDFS 7 ©2012 Cloudera, Inc.
  • 8. Analyzing Twitter Data with Hadoop AN EXAMPLE USE CASE 8 ©2012 Cloudera, Inc.
  • 9. Analyzing Twitter • Social media popular with marketing teams • Twitter is an effective tool for promotion • Who is influential? • Tweets • Followers • Retweets • Similar to e-mail forwarding • Which twitter user gets the most retweets? • Who is influential in our industry? 9 ©2012 Cloudera, Inc.
  • 10. Analyzing Twitter Data with Hadoop HOW DO WE ANSWER THESE QUESTIONS? 10 ©2012 Cloudera, Inc.
  • 11. Techniques • SQL • Filtering • Aggregation • Sorting • Complex data • Deeply nested • Variable schema 11
  • 12. Architecture Twitter Oozie Custom Add Flume Partitions Source Hourly Sink to JSON SerDe HDFS Parses Data Flume HDFS Hive 12 ©2012 Cloudera, Inc.
  • 13. Analyzing Twitter Data with Hadoop TWITTER SOURCE 13 ©2012 Cloudera, Inc.
  • 14. Flume • Streaming data flow • Sources • Push or pull • Sinks • Event based 14 ©2012 Cloudera, Inc.
  • 15. Pulling Data From Twitter • Custom source, using twitter4j • Sources process data as discrete events
  • 16. Loading Data Into HDFS • HDFS Sink comes stock with Flume • Easily separate files by creation time • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
  • 17. Flume Source public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable { ... // The initialization method for the Source. The context contains all // the Flume configuration info @Override public void configure(Context context) { ... } ... // Start processing events. Uses the Twitter Streaming API to sample // Twitter, and process tweets. @Override public void start() { ... } ... // Stops Source's event processing and shuts down the Twitter stream. @Override public void stop() { ... } } 17 ©2012 Cloudera, Inc.
  • 18. Twitter API • Callback mechanism for catching new tweets /** The actual Twitter stream. It's set up to collect raw JSON data */ private final TwitterStream twitterStream = new TwitterStreamFactory( new ConfigurationBuilder().setJSONStoreEnabled(true).build()) .getInstance(); ... // The StatusListener is a twitter4j API that can be added to a stream, // and will call a method every time a message is sent to the stream. StatusListener listener = new StatusListener() { // The onStatus method is executed every time a new tweet comes in. public void onStatus(Status status) { ... } } ... // Set up the stream's listener (defined above), and set any necessary // security information. twitterStream.addListener(listener); twitterStream.setOAuthConsumer(consumerKey, consumerSecret); AccessToken token = new AccessToken(accessToken, accessTokenSecret); twitterStream.setOAuthAccessToken(token); 18 ©2012 Cloudera, Inc.
  • 19. JSON Data • JSON data is processed as an event and written to HDFS public void onStatus(Status status) { // The EventBuilder is used to build an event using the headers and // the raw JSON of a tweet headers.put("timestamp", String.valueOf( status.getCreatedAt().getTime())); Event event = EventBuilder.withBody( DataObjectFactory.getRawJSON(status).getBytes(), headers); channel.processEvent(event); } 19 ©2012 Cloudera, Inc.
  • 20. Analyzing Twitter Data with Hadoop FLUME DEMO 20 ©2012 Cloudera, Inc.
  • 21. Analyzing Twitter Data with Hadoop HIVE 21 ©2012 Cloudera, Inc.
  • 22. What is Hive? • Created at Facebook • HiveQL • SQL like interface • Hive interpreter converts HiveQL to MapReduce code • Returns results to the client 22 ©2012 Cloudera, Inc.
  • 23. Hive Details • Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definitions • Stored in a relational database • Similar to catalog tables in other DBs 23
  • 24. Complex Data SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name, retweeted_status.text, max(retweet_count) AS retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweet_screen_name ORDER BY total_retweets DESC LIMIT 10; 24 ©2012 Cloudera, Inc.
  • 25. Analyzing Twitter Data with Hadoop JSON INTERLUDE 25 ©2012 Cloudera, Inc.
  • 26. What is JSON? • Complex, semi-structured data • Based on JavaScript’s data syntax • Rich, nested data types: • number • string • Array • object • true, false • null 26 ©2012 Cloudera, Inc.
  • 27. What is JSON? { "retweeted_status": { "contributors": null, "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", "retweeted": false, "entities": { "hashtags": [ { "text": "Crowdsourcing", "indices": [0, 14] }, { "text": "bigdata", "indices": [129,137] } ], "user_mentions": [] } } } 27 ©2012 Cloudera, Inc.
  • 28. Hive Serializers and Deserializers • Instructs Hive on how to interpret data • JSONSerDe 28 ©2012 Cloudera, Inc.
  • 29. Analyzing Twitter Data with Hadoop HIVE DEMO 29 ©2012 Cloudera, Inc.
  • 30. Analyzing Twitter Data with Hadoop IT’S A TRAP 30 ©2012 Cloudera, Inc.
  • 31. Not a Database RDBMS Hive Subset of SQL-92 plus Generally >= SQL-92 Language Hive specific extensions INSERT, UPDATE, INSERT OVERWRITE Update Capabilities DELETE no UPDATE, DELETE Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes 31 ©2012 Cloudera, Inc.
  • 32. Analyzing Twitter Data with Hadoop IMPALA ASIDE 32 ©2012 Cloudera, Inc.
  • 33. Cloudera Impala Real-Time Query for Data Stored in Hadoop. Supports Hive SQL 4-30X faster than Hive over MapReduce Supports multiple storage engines & file formats Uses existing drivers, integrates with existing metastore, works with leading BI tools Flexible, cost-effective, no lock-in Deploy & operate with Cloudera Enterprise RTQ 33 ©2012 Cloudera, Inc.
  • 34. Benefits of Cloudera Impala Real-Time Query for Data Stored in Hadoop • Real-time queries run directly on source data • No ETL delays • No jumping between data silos • No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas • All data available for interactive queries • No loss of fidelity from fixed data schemas • Single metadata store from origination through analysis • No need to hunt through multiple data silos 34 ©2012 Cloudera, Inc.
  • 35. Cloudera Impala Details Unified metadata and scheduler Hive Metastore YARN HDFS NN SQL App ODBC State Store Low-latency scheduler and cache Common Hive SQL and interface (low-impact failures) Query Planner Query Planner Fully MPP Query Planner Distributed Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Local Direct Reads 35 ©2012 Cloudera, Inc.
  • 36. Analyzing Twitter Data with Hadoop OOZIE AUTOMATION 36 ©2012 Cloudera, Inc.
  • 37. Oozie: Everything in its Right Place
  • 38. Oozie for Partition Management • Once an hour, add a partition • Takes advantage of advanced Hive functionality
  • 39. Analyzing Twitter Data with Hadoop OOZIE DEMO 39 ©2012 Cloudera, Inc.
  • 40. Analyzing Twitter Data with Hadoop PUTTING IT ALL TOGETHER 40 ©2012 Cloudera, Inc.
  • 41. Complete Architecture Twitter Oozie Custom Add Flume Partitions Source Hourly Sink to JSON SerDe HDFS Parses Data Flume HDFS Hive 41 ©2012 Cloudera, Inc.
  • 42. Analyzing Twitter Data with Hadoop MORE DEMOS 42 ©2012 Cloudera, Inc.
  • 43. What next? • Download Hadoop! • CDH available at www.cloudera.com • Cloudera provides pre-loaded VMs • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma nager+Free+Edition+Demo+VM • Clone the source repo • https://github.com/cloudera/cdh-twitter-example
  • 44. My personal preference • Cloudera Manager • https://ccp.cloudera.com/display/SUPPORT/Downloads • Free up to 50 nodes
  • 45. Shout Out • Jon Natkins • @nattybnatkins • Blog posts • http://blog.cloudera.com/blog/2012/09/analyzing-twitter- data-with-hadoop/ • http://blog.cloudera.com/blog/2012/10/analyzing-twitter- data-with-hadoop-part-2-gathering-data-with-flume/ • http://blog.cloudera.com/blog/2012/11/analyzing-twitter- data-with-hadoop-part-3-querying-semi-structured-data- with-hive/
  • 46. Questions? • Contact me! • Joey Echeverria • joey@cloudera.com • @fwiffo • We’re hiring!
  • 47. 47 ©2012 Cloudera, Inc.