Comparing Hadoop Data Storage
                (HDFS, HBase, Hive and Pig)

Rakesh Jadhav
SAS
Agenda

 •   Hadoop Ecosystem
 •   HDFS
 •   HBase
 •   Hive
 •   Pig
Hadoop Ecosystem
Hadoop Ecosystem Components
   HDFS:      Hadoop Distributed File System
   MapReduce: Hadoop Distributed Programming Paradigm
   HBase:     Hadoop Column-Oriented Database for Random Access Read/Write of Smaller Data
   Hive:      Hadoop Petabyte-Scalable Data Warehousing Infrastructure
   Pig:       Hadoop Data Flow/Analysis Infrastructure
   Zookeeper: Hadoop Co-ordination Service, Configuration Service Infrastructure
   Chukwa:    Hadoop Monitoring Service
   Avro:      Hadoop Data Serialization/Deserialization Infrastructure
   Mahout:    Hadoop Scalable Machine Learning Library
HDFS (Data Storage)
     Design Features

 •   Failure Is the Norm
 •   Designed for Large Datasets Rather than Small Ones
 •   Designed for Batch Processing Rather than Interactive Use
 •   Supports Write Once, Read Many
 •   Provides Interfaces to Move Processing Closer
     to the Data
HDFS

 APPLICATION AREAS
  • Large Log Processing
  • Web search indexing
 LIMITATIONS
  •   Small Files Problem
  •   Single Point of Failure (NameNode)
  •   No Random Access
  •   No In-Place Update Support (Write Once)
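
The small files limitation can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object and a 128 MiB block size; both figures are assumptions for illustration, not values taken from these slides, and `metadata_bytes` is a hypothetical helper.

```python
# ASSUMPTION: ~150 bytes of NameNode heap per namespace object (file or block).
BYTES_PER_OBJECT = 150
# ASSUMPTION: 128 MiB block size; clusters may be configured differently.
BLOCK_SIZE = 128 * 1024**2


def metadata_bytes(total_bytes, file_size, block_size=BLOCK_SIZE):
    """Estimate NameNode memory needed to track total_bytes of data
    stored as equally sized files of file_size bytes each."""
    num_files = -(-total_bytes // file_size)            # ceiling division
    blocks_per_file = max(1, -(-file_size // block_size))
    # One object for the file itself plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


TIB = 1024**4
small = metadata_bytes(TIB, 1024**2)   # 1 TiB stored as 1 MiB files
large = metadata_bytes(TIB, 1024**3)   # 1 TiB stored as 1 GiB files
print(small, large)
```

Under these assumptions the same terabyte costs a few hundred megabytes of metadata as small files and just over a megabyte as large files, which is why many small files strain the single NameNode.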
HBase (Data Storage)
  Design Features
 • Key-Value Store (Like Map)
 • Semi Structured Data
 • Column Family, Time Stamp
 • Key=RowKey.ColumnFamily.ColumnName.TimeStamp
 • De-normalized Data
 • Faster Data Retrieval Using Column Families
 • Static Column Families, Dynamic Columns
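
The key structure above can be pictured as a versioned map. The following is a minimal in-memory sketch, not the HBase client API: `TinyTable` and its methods are hypothetical names, and the family and column names are chosen only for illustration.

```python
class TinyTable:
    """Toy model of an HBase-style table (illustrative only, not the HBase API)."""

    def __init__(self, column_families):
        # Column families are static: fixed when the table is created.
        self.column_families = set(column_families)
        # (row key, column family, column name) -> {timestamp: value}
        self.cells = {}

    def put(self, row, family, column, timestamp, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Columns are dynamic: any column name is accepted inside a known family.
        self.cells.setdefault((row, family, column), {})[timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells[(row, family, column)]
        if timestamp is None:
            timestamp = max(versions)   # default to the newest version
        return versions[timestamp]


table = TinyTable(["personal", "other"])
table.put("1", "personal", "Age", 1, 25)
table.put("1", "personal", "Age", 2, 35)
latest_age = table.get("1", "personal", "Age")
first_age = table.get("1", "personal", "Age", timestamp=1)
```

Each cell keeps every timestamped version, so a read can ask for the newest value or for the value at a specific timestamp.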
RDBMS v/s HBase: Example
RDBMS
ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
1   Sam   35   Mumbai       Married         Pune      76      XYZ
2   Bob   56   Chicago      Married         New York  79      PQR
HBase
Row Key  Personal Information (Column Family)  Other Information (Column Family)
1        Name:T1=Sam                           Location:T2=Pune
         Age:T2=35                             Location:T1=Mumbai
         Age:T1=25                             Employer:T1=XYZ
         Birth-Place:T1=Mumbai
         Marital Status:T2=Married
         Marital Status:T1=Unmarried
         Weight:T2=76
         Weight:T1=65
2        …                                     …
HBase: Application Areas

 • Applications which need Store/Access/Search
   using Key
 • Need Fast Random Access/Update to scalable
   structured data
 • Applications Needing Flexible Table Schema
 • Applications Needing range-search capabilities
   supported by key ordering
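
Range search works because rows are kept sorted by key, so a scan only needs to locate the start and stop positions. A minimal sketch using Python's `bisect` on a sorted list follows; the `range_scan` helper and the sample keys are hypothetical, and real HBase compares keys as raw bytes.

```python
import bisect


def range_scan(sorted_keys, start, stop):
    """Return every key k with start <= k < stop from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Hypothetical row keys; numbers are zero-padded so that string order
# matches numeric order.
row_keys = sorted(["user#0042", "user#0007", "order#0001", "user#0100", "order#0002"])
matches = range_scan(row_keys, "user#0000", "user#0050")
print(matches)
```

Zero-padding matters here: without it, a key such as `user#100` would sort before `user#42` and fall into the wrong range.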
HBase: Limitations

 •   Expensive Full Row Read
 •   No Secondary Keys
 •   No SQL Support
 •   Not Efficient for Big Cell Values
Hive (Data Access)
  Design Features

  • Scalable data warehouse on top of Hadoop
    developed by Facebook
  • SQL-like query language (HiveQL)
  • Limited JDBC support
  • Support for rich data types
  • Ability to insert custom map-reduce jobs
Hive: Application Areas

 • Ad hoc analysis on huge structured data with
   no low-latency requirement
 • Log processing
 • Text mining
 • Document indexing
 • Customer-facing business intelligence (e.g. Google
   Analytics)
 • Predictive modeling, hypothesis testing
Hive: Limitations

 • No Support To Update Data
 • Only Bulk Load Support
 • Not Efficient For Small Data
Hive: Example

 • CREATE TABLE employee (id bigint, name string,
   age int…) ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t' STORED AS
   TEXTFILE;
 • LOAD DATA LOCAL INPATH
   '/sas/employee.txt' OVERWRITE INTO
   TABLE employee; 
 • INSERT OVERWRITE TABLE oldest_employee
   SELECT * FROM employee SORT BY age
   DESC LIMIT 100;
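
For readers less familiar with HiveQL, the result of the three statements above can be sketched in plain Python. This assumes the input file is tab-delimited with id, name and age as its first three fields; the helper names and the third sample row are hypothetical, and the sketch ignores how Hive distributes the work.

```python
import heapq


def parse_employees(lines):
    """Parse tab-delimited lines, keeping the first three fields: id, name, age."""
    rows = []
    for line in lines:
        emp_id, name, age = line.rstrip("\n").split("\t")[:3]
        rows.append({"id": int(emp_id), "name": name, "age": int(age)})
    return rows


def oldest(employees, n=100):
    """Return the n oldest employees, oldest first."""
    return heapq.nlargest(n, employees, key=lambda e: e["age"])


# Hypothetical sample data standing in for /sas/employee.txt.
sample_lines = ["1\tSam\t35", "2\tBob\t56", "3\tAnn\t29"]
top_two = [e["name"] for e in oldest(parse_employees(sample_lines), 2)]
print(top_two)
```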
Pig (Data Access)

  • Pig Latin: a high-level data flow language
  • Client-side library, no server-side deployment needed
  • Batch processing of large unstructured data
  • Procedural language
  • Runtime schema creation, checkpointing, split pipeline support
  • Custom code support
  • Rich data types
  • Support for joins
Pig: Application Areas

 • Extract Transform Load (ETL)
 • Unstructured Data Analysis
Pig: Limitations

 • Not efficient for processing small datasets
Pig: Example

 Load employee data from a text file, filter it by
  age and joining year, group it by joining year,
  and report the maximum age in each group.
 records = LOAD 'sas/input/files/employee.txt'
   AS (joiningYear:int, employeeId:int, age:int);
 filtered_records = FILTER records BY age > 30 AND
   (joiningYear >= 2000 AND joiningYear <= 2012);
 grouped_records = GROUP filtered_records BY joiningYear;
 max_age = FOREACH grouped_records GENERATE group,
   MAX(filtered_records.age);
 DUMP max_age;
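
The same data flow can be traced step by step in plain Python. This sketch assumes the intended filter keeps employees older than 30 whose joining year lies between 2000 and 2012 inclusive; the function name and the sample records are hypothetical.

```python
def max_age_by_joining_year(records, min_age=30, first_year=2000, last_year=2012):
    """records: iterable of (joining_year, employee_id, age) tuples.

    Mirrors the Pig steps: FILTER, then GROUP BY joining year, then MAX(age).
    """
    result = {}
    for joining_year, _employee_id, age in records:
        # FILTER: keep rows that satisfy both the age and the year conditions.
        if age > min_age and first_year <= joining_year <= last_year:
            # GROUP + MAX: track the largest age seen for each joining year.
            result[joining_year] = max(age, result.get(joining_year, age))
    return result


# Hypothetical sample rows: (joiningYear, employeeId, age)
sample = [(2005, 1, 35), (2005, 2, 41), (2010, 3, 29), (1998, 4, 50), (2012, 5, 33)]
summary = max_age_by_joining_year(sample)
print(summary)
```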
Conclusion

 Organizations should:
 • Revisit data strategy
 • Evaluate the Hadoop Ecosystem
 • Build economical, scalable solutions for Big Data problems
Thank You



