BigData using Hadoop and
           Pig

                Sudar Muthu
             Research Engineer
                Yahoo Labs
          http://sudarmuthu.com
      http://twitter.com/sudarmuthu
Who am I?
   Research Engineer at Yahoo Labs
   Mines useful information from huge datasets
   Worked on both structured and unstructured
    data.
   Builds robots as a hobby ;)
What will we see today?
   What is BigData?
   Get our hands dirty with Hadoop
   See some code
   Try out Pig
   Glimpse of HBase and Hive
What is BigData?
“Big data is a collection of data sets so large
    and complex that it becomes difficult to
    process using on-hand database
    management tools”



        http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1GB today is not the same
as 1GB just 10 years ago
Anything that doesn’t fit
into the RAM of a single
         machine
Types of Big Data
Data in Movement (streams)
   Twitter/Facebook comments
   Stock market data
   Access logs of a busy web server
   Sensors: Vital signs of a newborn
Data at rest (Oceans)
   Collection of what has streamed
   Emails or IM messages
   Social Media
   Unstructured documents: forms, claims
We have all this data and
 need to find a way to
    process it
Traditional way of scaling
               (Scaling up)
   Make the machine more powerful
     Add more RAM
     Add more cores to CPU

   It is going to be very expensive
   Will be limited by disk seek and read time
   Single point of failure
New way to scale (Scaling out)
   Add more instances of the same machine
   Cost is less compared to scaling up
   Tolerates failure of a single node or a set of nodes
   Disk seek and write time is not going to be a
    bottleneck
   Future-proof (to some extent)
Is it fit for ALL types of
         problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant
 grid operating system for
data storage and processing
What is Hadoop?
   Runs on Commodity hardware
   HDFS: Fault-tolerant high-bandwidth clustered
    storage
   MapReduce: Distributed data processing
   Works with structured and unstructured data
   Open source, Apache license
   Master (NameNode) – Slave architecture
Design Principles
   System shall manage and heal itself
   Performance shall scale linearly
   Algorithm should move to data
       Lower latency, lower bandwidth
   Simple core, modular and extensible
Components of Hadoop
   HDFS
   Map Reduce
   Pig
   HBase
   Hive
Getting started with
      Hadoop
What am I not going to cover?
   Installation or setting up Hadoop
       Will be running all the code in a single node instance
   Monitoring of the clusters
   Performance tuning
   User authentication or quota
Before we get into code,
 let’s understand some
        concepts
Map Reduce
Framework for distributed
processing of large datasets
MapReduce
Consists of two functions
   Map
       Filter and transform the input into a form the
        reducer can understand
   Reduce
       Aggregate over the output produced by the Map
        function
Formal definition
Map
<k1, v1> -> list(<k2,v2>)



Reduce
<k2, list(v2)> -> list(<k3, v3>)
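The two signatures above can be sketched in plain Java as a pair of generic interfaces plus a tiny single-machine driver. This is only an illustration under stated assumptions: MapReduceSketch, MapFn, ReduceFn, run and demo are hypothetical names, not part of the Hadoop API, and the grouping step stands in for what a real cluster does between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the formal definition. All names here are
// hypothetical; none of them are part of the Hadoop API.
class MapReduceSketch {

    /** Map: <k1, v1> -> list(<k2, v2>) */
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    /** Reduce: <k2, list(v2)> -> list(<k3, v3>) */
    interface ReduceFn<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    /** Applies map to every input pair, groups by k2, then applies reduce per key. */
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        // Grouping step: list(<k2, v2>) becomes <k2, list(v2)>, sorted by key.
        Map<K2, List<V2>> grouped = new TreeMap<K2, List<V2>>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapper.map(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<V2>()).add(out.getValue());
            }
        }
        List<Map.Entry<K3, V3>> result = new ArrayList<Map.Entry<K3, V3>>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            result.addAll(reducer.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Toy job: sums the values 1..4 separately for even and odd numbers. */
    static List<Map.Entry<String, Integer>> demo() {
        List<Map.Entry<String, Integer>> input = new ArrayList<Map.Entry<String, Integer>>();
        for (int n = 1; n <= 4; n++) {
            input.add(new SimpleEntry<String, Integer>("record" + n, n));
        }
        MapFn<String, Integer, String, Integer> mapper = (name, n) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            out.add(new SimpleEntry<String, Integer>(n % 2 == 0 ? "even" : "odd", n));
            return out;
        };
        ReduceFn<String, Integer, String, Integer> reducer = (parity, values) -> {
            int sum = 0;
            for (int v : values) sum += v;
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            out.add(new SimpleEntry<String, Integer>(parity, sum));
            return out;
        };
        return run(input, mapper, reducer);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [even=6, odd=4]
    }
}
```

The driver keeps everything in memory on one machine; the point is only to show how the output type of Map lines up with the input type of Reduce.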
Let’s see some examples
Count number of words in files
Map
<file_name, file_contents> => list(<word, count>)

Reduce
<word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map
<“file1”, “to be or not to be”> =>
{<“to”,1>,
<“be”,1>,
<“or”,1>,
<“not”,1>,
<“to”,1>,
<“be”,1>}
Count number of words in files
Reduce
{<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>,
<“not”,<1>>}

=>

{<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
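The same walk-through can be traced in plain Java without a cluster. The sketch below is an assumption-laden illustration, not the Hadoop version of the job: WordCountSketch and its methods are hypothetical names, and the shuffle method stands in for the grouping that the framework performs between Map and Reduce. It keeps words in first-seen order so the result matches the example above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch; class and method names are hypothetical, not Hadoop's.
class WordCountSketch {

    /** Map: <file_name, file_contents> => list(<word, 1>) */
    static List<Map.Entry<String, Integer>> map(String fileName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer tokens = new StringTokenizer(contents);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: groups the mapper output by word, keeping first-seen order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: <word, list(count)> => sum_of_counts for that word */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) sum += count;
        return sum;
    }

    /** Chains map, shuffle and reduce for a single file. */
    static Map<String, Integer> countWords(String fileName, String contents) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(map(fileName, contents)).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {to=2, be=2, or=1, not=1}
        System.out.println(countWords("file1", "to be or not to be"));
    }
}
```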
Max temperature in a year
Map
<file_name, file_contents> => list(<year, temp>)

Reduce
<year, list(temp)> => <year, max_temp>
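A minimal plain-Java sketch of this job follows. The slides do not give the input format, so the "year,temp" line layout is an assumption, as are the names MaxTemperatureSketch, map, reduce and maxByYear. Lines that do not have exactly two comma-separated fields are skipped, which is the "filter" part of the Map step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch; names and the "year,temp" input format are assumptions.
class MaxTemperatureSketch {

    /** Map: <file_name, file_contents> => list(<year, temp>) */
    static List<Map.Entry<Integer, Integer>> map(String fileName, String contents) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<Map.Entry<Integer, Integer>>();
        for (String line : contents.split("\n")) {
            String[] fields = line.trim().split(",");
            if (fields.length != 2) continue; // filter out malformed records
            pairs.add(new SimpleEntry<Integer, Integer>(
                    Integer.parseInt(fields[0].trim()),
                    Integer.parseInt(fields[1].trim())));
        }
        return pairs;
    }

    /** Reduce: <year, list(temp)> => max_temp for that year */
    static int reduce(int year, List<Integer> temps) {
        return Collections.max(temps);
    }

    /** Groups the mapper output by year, then reduces each group. */
    static Map<Integer, Integer> maxByYear(String fileName, String contents) {
        Map<Integer, List<Integer>> grouped = new TreeMap<Integer, List<Integer>>();
        for (Map.Entry<Integer, Integer> pair : map(fileName, contents)) {
            grouped.computeIfAbsent(pair.getKey(), y -> new ArrayList<Integer>()).add(pair.getValue());
        }
        Map<Integer, Integer> result = new TreeMap<Integer, Integer>();
        for (Map.Entry<Integer, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String contents = "1950,22\n1950,31\n1951,18\n1951,-3\nbad line";
        System.out.println(maxByYear("temps.txt", contents)); // prints {1950=31, 1951=18}
    }
}
```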
HDFS
   Distributed file system
   Data is distributed over different nodes
   Data is replicated across nodes for failover
   Storage details are abstracted away from the algorithms
HDFS Commands
   hadoop fs -mkdir <dir_name>
   hadoop fs -ls <dir_name>
   hadoop fs -rmr <dir_name>
   hadoop fs -put <local_file> <remote_dir>
   hadoop fs -get <remote_file> <local_dir>
   hadoop fs -cat <remote_file>
   hadoop fs -help
Let’s write some code
Count Words Demo
   Create a mapper class
       Override map() method
   Create a reducer class
       Override reduce() method
   Create a main method
   Create JAR
   Run it on Hadoop
Map Method
// Emits <word, 1> for every whitespace-separated token in the input line.
@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);

  while (itr.hasMoreTokens()) {
    context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}
Reduce Method
// Sums all the counts emitted for one word and writes <word, sum>.
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {

  int sum = 0;
  for (IntWritable value : values) {
    sum += value.get();
  }
  context.write(key, new IntWritable(sum));
}
Main Method
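// Configure the job: name, input/output paths, mapper, reducer, output types.
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Submit the job and wait for it to finish; without this the job never runs.
System.exit(job.waitForCompletion(true) ? 0 : 1);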
Run it on Hadoop


hadoop jar dist/countwords.jar \
  com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at          1
be          3
can         7
can't       1
code        2
command     1
connect     1
consider    1
continued   1
control     4
could       1
couple      1
courtesy    1
desktop,    1
detailed    1
details     1
…..
…..
Pig
What is Pig?
Pig provides an abstraction for processing large
datasets

Consists of
 Pig Latin – Language to express data flows

 Execution environment
Why do we need Pig?
   MapReduce can get complex if your data needs
    a lot of processing/transformations
   MapReduce provides primitive data structures
   Pig provides rich data structures
   Supports complex operations like joins
Running Pig programs
   In an interactive shell called Grunt
   As a Pig Script
   Embedded into Java programs (like JDBC)
Grunt – Interactive Shell
Grunt shell
   fs commands – like hadoop fs
     fs -ls
     fs -mkdir

   fs -copyToLocal <file>
   fs -copyFromLocal <local_file> <dest>
   exec – execute Pig scripts
   sh – execute shell scripts
Let’s see them in action
Pig Latin
   LOAD – Read files
   DUMP – Dump data in the console
   JOIN – Do a join on data sets
   FILTER – Filter data sets
   ORDER BY – Sort data
   STORE – Store data back in files
Let’s see some code
Sort words based on count
Filter words present in a list
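The code for these two demos is not part of the slides. As a rough guide to the data flow they describe, here is a plain-Java sketch of the two steps: keep only the words present in a list, then sort by count. It is not Pig Latin and not the author's script; WordFlowSketch and its methods are hypothetical names, and the sample counts are taken from the Count Words output shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of the two data flows; all names are hypothetical.
class WordFlowSketch {

    /** Filter step: keeps only the words present in the given list. */
    static Map<String, Integer> filterWords(Map<String, Integer> counts, Set<String> wanted) {
        Map<String, Integer> kept = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> row : counts.entrySet()) {
            if (wanted.contains(row.getKey())) {
                kept.put(row.getKey(), row.getValue());
            }
        }
        return kept;
    }

    /** Sort step: orders rows by count, highest first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }

    /** Runs both steps on a few rows of the word-count output. */
    static List<Map.Entry<String, Integer>> demo() {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("at", 1);
        counts.put("be", 3);
        counts.put("can", 7);
        counts.put("code", 2);
        counts.put("control", 4);
        Set<String> wanted = new HashSet<String>(Arrays.asList("be", "can", "at"));
        return sortByCount(filterWords(counts, wanted));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [can=7, be=3, at=1]
    }
}
```

In Pig the same flow would be expressed with the operators listed earlier (load, filter or join, order, store) and would run as MapReduce jobs instead of in memory.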
HBase
What is HBase?
   Distributed, column-oriented database built on
    top of HDFS
   Useful when real-time read/write random-access
    to very large datasets is needed.
   Can handle billions of rows with millions of
    columns
Hive
What is Hive?
   Useful for managing and querying structured
    data
   Provides SQL like syntax
   Metadata is stored in an RDBMS
   Extensible with types, functions, scripts, etc.
Hadoop
   Affordable Storage/Compute
   Structured or Unstructured
   Resilient Auto Scalability
Relational Databases
   Interactive response times
   ACID
   Structured data
   Cost/Scale prohibitive
Thank You

"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Hands on Hadoop and Pig

  • 1. BigData using Hadoop and Pig Sudar Muthu Research Engineer Yahoo Labs http://sudarmuthu.com http://twitter.com/sudarmuthu
  • 2. Who am I?
      - Research Engineer at Yahoo Labs
      - Mines useful information from huge datasets
      - Worked on both structured and unstructured data
      - Builds robots as a hobby ;)
  • 3. What will we see today?
      - What is BigData?
      - Get our hands dirty with Hadoop
      - See some code
      - Try out Pig
      - Glimpse of HBase and Hive
  • 5. “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools” http://en.wikipedia.org/wiki/Big_data
  • 6. How big is BigData?
  • 7. 1GB today is not the same as 1GB just 10 years ago
  • 8. Anything that doesn’t fit into the RAM of a single machine
  • 10. Data in Movement (streams)
      - Twitter/Facebook comments
      - Stock market data
      - Access logs of a busy web server
      - Sensors: Vital signs of a newborn
  • 11. Data at rest (Oceans)
      - Collection of what has streamed
      - Emails or IM messages
      - Social Media
      - Unstructured documents: forms, claims
  • 12. We have all this data and need to find a way to process it
  • 13. Traditional way of scaling (Scaling up)
      - Make the machine more powerful
          - Add more RAM
          - Add more cores to CPU
      - It is going to be very expensive
      - Will be limited by disk seek and read time
      - Single point of failure
  • 14. New way to scale (Scale out)
      - Add more instances of the same machine
      - Cost is less compared to scaling up
      - Immune to failure of a single node or a set of nodes
      - Disk seek and write time is not going to be a bottleneck
      - Future-proof (to some extent)
  • 15. Is it fit for ALL types of problems?
  • 18. A scalable, fault-tolerant grid operating system for data storage and processing
  • 19. What is Hadoop?
      - Runs on commodity hardware
      - HDFS: Fault-tolerant high-bandwidth clustered storage
      - MapReduce: Distributed data processing
      - Works with structured and unstructured data
      - Open source, Apache license
      - Master (NameNode) – slave architecture
  • 20. Design Principles
      - System shall manage and heal itself
      - Performance shall scale linearly
      - Algorithm should move to data
          - Lower latency, lower bandwidth
      - Simple core, modular and extensible
  • 21. Components of Hadoop
      - HDFS
      - MapReduce
      - Pig
      - HBase
      - Hive
  • 23. What am I not going to cover?
      - Installation or setting up Hadoop
          - Will be running all the code in a single node instance
      - Monitoring of the clusters
      - Performance tuning
      - User authentication or quota
  • 24. Before we get into code, let’s understand some concepts
  • 27. MapReduce consists of two functions
      - Map
          - Filter and transform the input into a form the reducer can understand
      - Reduce
          - Aggregate over the input provided by the Map function
  • 28. Formal definition Map <k1, v1> -> list(<k2, v2>) Reduce <k2, list(v2)> -> list(<k3, v3>)
  • 29. Let’s see some examples
  • 30. Count number of words in files Map <file_name, file_contents> => list(<word, count>) Reduce <word, list(count)> => <word, sum_of_counts>
  • 31. Count number of words in files Map <“file1”, “to be or not to be”> => {<“to”,1>, <“be”,1>, <“or”,1>, <“not”,1>, <“to”,1>, <“be”,1>}
  • 32. Count number of words in files Reduce {<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>, <“not”,<1>>} => {<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
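The word-count flow on slides 30 to 32 can be tried without a cluster. The following is only a minimal in-memory sketch in plain Java, not Hadoop code: the class name `WordCountSketch` and the `shuffle` helper are made up for illustration, and the grouping step stands in for what the framework does between the Map and Reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration of map -> group by key -> reduce.
// No Hadoop classes are used here.
class WordCountSketch {

    // Map: <file_name, file_contents> => list(<word, 1>)
    static List<Map.Entry<String, Integer>> map(String fileName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(contents);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Group: collect all values emitted for the same key,
    // e.g. <"to",1>, <"to",1> => <"to", [1, 1]>
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: <word, list(count)> => sum_of_counts for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Run the three steps over a set of <file_name, file_contents> inputs.
    static Map<String, Integer> wordCount(Map<String, String> files) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            emitted.addAll(map(file.getKey(), file.getValue()));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Map.of("file1", "to be or not to be")));
    }
}
```

On a real cluster the map and reduce calls run on different machines and the grouping is done by the framework, but the shape of the data at each step is the same as in the slides.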
  • 33. Max temperature in a year Map <file_name, file_contents> => <year, temp> Reduce <year, list(temp)> => <year, max_temp>
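The max-temperature example follows the same pattern. Below is a rough plain-Java sketch of it. The slide does not say what the input files look like, so the `year,temp` line format, the class name `MaxTemperatureSketch` and the sample readings are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; assumes one "year,temp" reading per line.
class MaxTemperatureSketch {

    // Map: <file_name, file_contents> => list(<year, temp>)
    static List<Map.Entry<Integer, Integer>> map(String fileName, String contents) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (String rawLine : contents.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",");
            int year = Integer.parseInt(fields[0].trim());
            int temp = Integer.parseInt(fields[1].trim());
            pairs.add(Map.entry(year, temp));
        }
        return pairs;
    }

    // Reduce: <year, list(temp)> => max_temp for that year
    static int reduce(int year, List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int temp : temps) {
            max = Math.max(max, temp);
        }
        return max;
    }

    // Map, group by year, then reduce each group.
    static Map<Integer, Integer> maxTemperatures(String fileName, String contents) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, Integer> pair : map(fileName, contents)) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String contents = "1949,111\n1949,78\n1950,0\n1950,22\n1950,-11\n";
        // prints {1949=111, 1950=22}
        System.out.println(maxTemperatures("temps.csv", contents));
    }
}
```

Only the reduce step differs from word count: it takes a maximum instead of a sum.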
  • 34. HDFS
  • 35. HDFS
      - Distributed file system
      - Data is distributed over different nodes
      - Data is replicated for failover
      - Storage details are abstracted away from the algorithms
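To see why replication gives failover, here is a toy sketch in plain Java. It is not how HDFS actually places blocks (the real policy is rack-aware and more involved). The round-robin placement, the class name `BlockPlacementSketch` and the node names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model: each block of a file is copied to `replication` distinct nodes.
class BlockPlacementSketch {

    // Place block b on nodes b, b+1, ..., b+replication-1 (wrapping around).
    static Map<Integer, List<String>> place(int numBlocks, List<String> nodes, int replication) {
        if (replication < 1 || replication > nodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the node count");
        }
        Map<Integer, List<String>> placement = new TreeMap<>();
        for (int block = 0; block < numBlocks; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(nodes.get((block + r) % nodes.size()));
            }
            placement.put(block, replicas);
        }
        return placement;
    }

    // A file is still readable if every block has at least one replica on a live node.
    static boolean allBlocksAvailable(Map<Integer, List<String>> placement, Set<String> failedNodes) {
        for (List<String> replicas : placement.values()) {
            boolean hasLiveReplica = false;
            for (String node : replicas) {
                if (!failedNodes.contains(node)) {
                    hasLiveReplica = true;
                }
            }
            if (!hasLiveReplica) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node0", "node1", "node2", "node3");
        Map<Integer, List<String>> placement = place(6, nodes, 3);
        System.out.println(placement);
        // With 3 replicas per block, losing any 2 nodes leaves every block readable.
        System.out.println(allBlocksAvailable(placement, Set.of("node0", "node1")));
    }
}
```

With three copies of every block, any two nodes can fail and the file is still readable. Losing all three nodes that hold the same block is what makes data unavailable.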
  • 39. HDFS Commands
      - hadoop fs -mkdir <dir_name>
      - hadoop fs -ls <dir_name>
      - hadoop fs -rmr <dir_name>
      - hadoop fs -put <local_file> <remote_dir>
      - hadoop fs -get <remote_file> <local_dir>
      - hadoop fs -cat <remote_file>
      - hadoop fs -help
  • 41. Count Words Demo
      - Create a mapper class
          - Override map() method
      - Create a reducer class
          - Override reduce() method
      - Create a main method
      - Create JAR
      - Run it on Hadoop
  • 42. Map Method
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            // Emit <word, 1> for every token in the line
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new IntWritable(1));
            }
        }
  • 43. Reduce Method
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
  • 44. Main Method
        Job job = new Job();
        job.setJarByClass(CountWords.class);
        job.setJobName("Count Words");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(CountWordsMapper.class);
        job.setReducerClass(CountWordsReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish; exit non-zero on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
  • 45. Run it on Hadoop
        hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
  • 46. Output
        at 1
        be 3
        can 7
        can't 1
        code 2
        command 1
        connect 1
        consider 1
        continued 1
        control 4
        could 1
        couple 1
        courtesy 1
        desktop, 1
        detailed 1
        details 1
        ….. …..
  • 47. Pig
  • 48. What is Pig? Pig provides an abstraction for processing large datasets. It consists of
      - Pig Latin – Language to express data flows
      - Execution environment
  • 49. Why do we need Pig?
      - MapReduce can get complex if your data needs a lot of processing/transformations
      - MapReduce provides primitive data structures
      - Pig provides rich data structures
      - Supports complex operations like joins
  • 50. Running Pig programs
      - In an interactive shell called Grunt
      - As a Pig Script
      - Embedded into Java programs (like JDBC)
  • 52. Grunt shell
      - fs commands – like hadoop fs
          - fs -ls
          - fs -mkdir
          - fs -copyToLocal <file> <local_dest>
          - fs -copyFromLocal <local_file> <dest>
      - exec – execute Pig scripts
      - sh – execute shell commands
  • 53. Let’s see them in action
  • 54. Pig Latin
      - LOAD – Read files
      - DUMP – Dump data in the console
      - JOIN – Do a join on data sets
      - FILTER – Filter data sets
      - ORDER – Sort data
      - STORE – Store data back in files
  • 56. Sort words based on count
  • 57. Filter words present in a list
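The two demos above are data flows over (word, count) records. The Pig scripts themselves are not part of this transcript, so the following is only a plain-Java analogue of the two operations, not Pig Latin. The class name `WordFlowSketch`, the descending sort order, the tie-break by word, and the reading of "filter" as "keep only the words that appear in the list" are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical Java analogue of sorting and filtering (word, count) records.
class WordFlowSketch {

    // Sort records by count, highest first; ties are broken alphabetically by word.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }

    // Keep only the records whose word appears in the given list.
    static Map<String, Integer> filterByList(Map<String, Integer> counts, Set<String> wanted) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> row : counts.entrySet()) {
            if (wanted.contains(row.getKey())) {
                kept.put(row.getKey(), row.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A few rows taken from the word-count output shown earlier.
        Map<String, Integer> counts = Map.of("at", 1, "be", 3, "can", 7, "code", 2, "control", 4);
        System.out.println(sortByCount(counts));
        System.out.println(filterByList(counts, Set.of("can", "code")));
    }
}
```

In Pig each of these steps is a single statement over a relation, and the execution environment turns the script into MapReduce jobs.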
  • 58. HBase
  • 59. What is HBase?
      - Distributed, column-oriented database built on top of HDFS
      - Useful when real-time, random read/write access to very large datasets is needed
      - Can handle billions of rows with millions of columns
  • 60. Hive
  • 61. What is Hive?
      - Useful for managing and querying structured data
      - Provides SQL-like syntax
      - Metadata is stored in an RDBMS
      - Extensible with types, functions, scripts, etc.
  • 62.
        Hadoop                          Relational Databases
        Affordable Storage/Compute      Interactive response times
        Structured or Unstructured      ACID
        Resilient Auto Scalability      Structured data
                                        Cost/Scale prohibitive