SlideShare a Scribd company logo
1 of 19
MapReduce@DirectI

     amkiray: ramki.g@directi.com
    uvdhray: dhruv.m@directi.com
Lets start with an example…

 access.log
 timestamp,url,response_code,response_time


 products.dat
 date, product_id, price
Requirement:

    Number of requests in the last 30 days.




    $> ls –rt *.log | tail -30 | xargs “wc –l”
Requirement:

    Busiest 30 minutes in last 30 days.





    $> ls –rt   *.log| tail -30 | xargs “./count_30min.sh“
Requirement:

    Number of failed buy requests for products worth

    more than $30 in the last 30 days .

    Import data to an RDBMS;

    SELECT COUNT(*) FROM logs, products
    WHERE GET_REQUEST_TYPE(logs.url)=„BUY‟
    AND
    GET_PRODUCT_ID(logs.url)=products.product_id
    AND product.price>30
    AND DATE(log.timestamp) = products.date;
Now gimme the number of
                  
                      failed buy requests for products
                      worth more than $30 in the
It might take a
                      last 1 Year.
     while!
2 days later….




               On its way!
             Inserting data
            into database…
5 days later…

                 $> mysqladmin processlist
           +-----------------------------------+
           |   Query   |   Copy to Temp Table |
           +-----------------------------------+

                       May be its Joining!!
                     Or may be its dead…
         Or may be my replacement will see the result .. 
Hadoop    god@internal.directi.com
Go use
         bloody!
But Why?!
    Distributed data processing cluster

    A distributed file system

    Data location sensitive Task scheduler

    MapReduce paradigm

    Handles Parallelizable and distributable tasks

    Failover capability

    Web based monitoring capabilities

    More value for your time!

Where to start??

    MapReduce:

    Q: Sum of all squares of a=[1,2,3,4,3,2,7]

    Simple….




    fold( map(a, square()),                          sum())



      You can do that in any functional programming language….
Now do it for an array
   of 100 million
  elements…….
This is where Hadoop comes in..
    Distributed File System : HDFS

    Distributed Computation/Task Processing: Hadoop

    Name Node + Data Node

    Task Tracker + Job Tracker

Task Tracker
How MapReduce Works…
An example
     Word Count


hadoop jar contrib/streaming/hadoop-0.*-streaming.jar
      -jobconf mapred.data.field.separator=quot;,quot;
      -input 'wc.eg.in'
      -output 'wc.eg.out'
      -mapper 'wc -w'
      -reducer quot;awk „{ sum+=$1 } END{ print sum}‟quot;




    Task Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp
    Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
Pig and Pig Latin
    A procedural language for MapReduce operations.


logs = LOAD 'access.logs' USING PigStorage(',') AS
         (ts:int, URL:chararray, resp:chararray, resp_time:int);

products = LOAD 'products.dat' USING PigStorage(',') AS
         (date:int, pid:int, price:int);

l1 = FOREACH logs GENERATE GetDate(ts) as req_date,
         GetProductID(URL) as prod_id,
         GetRequestType(URL) as rtype, resp, resp_time;

j1 = JOIN l1 BY prod_id, products by pid;

j2 = FILTER j1 BY req_date==date AND price>30.0F;

j3 = GROUP j2 ALL;

j4 = FOREACH j3 GENERATE COUNT(j2);

DUMP j4
Q&A

More Related Content

What's hot

What's hot (20)

Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlareClickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
 
PostgreSQL Meetup Berlin at Zalando HQ
PostgreSQL Meetup Berlin at Zalando HQPostgreSQL Meetup Berlin at Zalando HQ
PostgreSQL Meetup Berlin at Zalando HQ
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Unified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco SystemsUnified Data Platform, by Pauline Yeung of Cisco Systems
Unified Data Platform, by Pauline Yeung of Cisco Systems
 
Raw system logs processing with hive
Raw system logs processing with hiveRaw system logs processing with hive
Raw system logs processing with hive
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDB
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow Introduction
 
Pgcenter overview
Pgcenter overviewPgcenter overview
Pgcenter overview
 
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert HodgesWebinar: Secrets of ClickHouse Query Performance, by Robert Hodges
Webinar: Secrets of ClickHouse Query Performance, by Robert Hodges
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
Google App Engine BeCamp 2008
Google App Engine BeCamp 2008Google App Engine BeCamp 2008
Google App Engine BeCamp 2008
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
Our Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.doOur Story With ClickHouse at seo.do
Our Story With ClickHouse at seo.do
 
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEOClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
ClickHouse tips and tricks. Webinar slides. By Robert Hodges, Altinity CEO
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming Replication
 
Prashant de-ny-project-s1
Prashant de-ny-project-s1Prashant de-ny-project-s1
Prashant de-ny-project-s1
 

Similar to MapReduce@DirectI

Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
MongoDB APAC
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
Chester Chen
 

Similar to MapReduce@DirectI (20)

03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
Xadoop - new approaches to data analytics
Xadoop - new approaches to data analyticsXadoop - new approaches to data analytics
Xadoop - new approaches to data analytics
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4Why and How Powershell will rule the Command Line - Barcamp LA 4
Why and How Powershell will rule the Command Line - Barcamp LA 4
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Daniel Sikar: Hadoop MapReduce - 06/09/2010
Daniel Sikar: Hadoop MapReduce - 06/09/2010 Daniel Sikar: Hadoop MapReduce - 06/09/2010
Daniel Sikar: Hadoop MapReduce - 06/09/2010
 
Hadoop mapreduce user_group_daniel_sikar_presentation_06.09.2010
Hadoop mapreduce user_group_daniel_sikar_presentation_06.09.2010Hadoop mapreduce user_group_daniel_sikar_presentation_06.09.2010
Hadoop mapreduce user_group_daniel_sikar_presentation_06.09.2010
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
 
Gur1009
Gur1009Gur1009
Gur1009
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
 
Handout3o
Handout3oHandout3o
Handout3o
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Everything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the WebEverything is Awesome - Cutting the Corners off the Web
Everything is Awesome - Cutting the Corners off the Web
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Hive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TDHive dirty/beautiful hacks in TD
Hive dirty/beautiful hacks in TD
 
10 things I learned building Nomad packs
10 things I learned building Nomad packs10 things I learned building Nomad packs
10 things I learned building Nomad packs
 

More from Directi Group

Hr coverage directi 2012
Hr coverage directi 2012Hr coverage directi 2012
Hr coverage directi 2012
Directi Group
 
MDI - Mandevian Knights
MDI - Mandevian KnightsMDI - Mandevian Knights
MDI - Mandevian Knights
Directi Group
 
FMS - Riders on the Storm
FMS - Riders on the StormFMS - Riders on the Storm
FMS - Riders on the Storm
Directi Group
 
ISB - Beirut Film Fiesta
ISB - Beirut Film FiestaISB - Beirut Film Fiesta
ISB - Beirut Film Fiesta
Directi Group
 
Great Lakes - Synergy
Great Lakes - SynergyGreat Lakes - Synergy
Great Lakes - Synergy
Directi Group
 
Great Lakes - Fabulous Four
Great Lakes - Fabulous FourGreat Lakes - Fabulous Four
Great Lakes - Fabulous Four
Directi Group
 
IIM C - Baker Street
IIM C - Baker StreetIIM C - Baker Street
IIM C - Baker Street
Directi Group
 
Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012
Directi Group
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
Directi Group
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
Directi Group
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
Directi Group
 

More from Directi Group (20)

Hr coverage directi 2012
Hr coverage directi 2012Hr coverage directi 2012
Hr coverage directi 2012
 
IIM L - ConArtists
IIM L - ConArtistsIIM L - ConArtists
IIM L - ConArtists
 
MDI - Mandevian Knights
MDI - Mandevian KnightsMDI - Mandevian Knights
MDI - Mandevian Knights
 
ISB - Pikturewale
ISB - PikturewaleISB - Pikturewale
ISB - Pikturewale
 
FMS - Riders on the Storm
FMS - Riders on the StormFMS - Riders on the Storm
FMS - Riders on the Storm
 
IIM L - Inferno
IIM L - InfernoIIM L - Inferno
IIM L - Inferno
 
ISB - Beirut Film Fiesta
ISB - Beirut Film FiestaISB - Beirut Film Fiesta
ISB - Beirut Film Fiesta
 
Great Lakes - Synergy
Great Lakes - SynergyGreat Lakes - Synergy
Great Lakes - Synergy
 
Great Lakes - Fabulous Four
Great Lakes - Fabulous FourGreat Lakes - Fabulous Four
Great Lakes - Fabulous Four
 
IIM C - Baker Street
IIM C - Baker StreetIIM C - Baker Street
IIM C - Baker Street
 
Directi Case Study Contest - Team idate from MDI Gurgaon
Directi Case Study Contest -  Team idate from MDI GurgaonDirecti Case Study Contest -  Team idate from MDI Gurgaon
Directi Case Study Contest - Team idate from MDI Gurgaon
 
Directi Case Study Contest - Relationships Matter from ISB Hyderabad
Directi Case Study Contest - Relationships Matter from ISB HyderabadDirecti Case Study Contest - Relationships Matter from ISB Hyderabad
Directi Case Study Contest - Relationships Matter from ISB Hyderabad
 
Directi Case Study Contest - Team Goodfellas from ISB Hyderabad
Directi Case Study Contest - Team Goodfellas from ISB HyderabadDirecti Case Study Contest - Team Goodfellas from ISB Hyderabad
Directi Case Study Contest - Team Goodfellas from ISB Hyderabad
 
Directi Case Study Contest- Team Joka warriors from IIM C
Directi Case Study Contest- Team Joka warriors from IIM CDirecti Case Study Contest- Team Joka warriors from IIM C
Directi Case Study Contest- Team Joka warriors from IIM C
 
Directi Case Study Contest - Team Alkaline Jazz from IIFT
Directi Case Study Contest - Team Alkaline Jazz from IIFTDirecti Case Study Contest - Team Alkaline Jazz from IIFT
Directi Case Study Contest - Team Alkaline Jazz from IIFT
 
Directi Case Study Contest - Singles 360 by Team Awesome from IIM A
Directi Case Study Contest - Singles 360 by Team Awesome from IIM ADirecti Case Study Contest - Singles 360 by Team Awesome from IIM A
Directi Case Study Contest - Singles 360 by Team Awesome from IIM A
 
Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

MapReduce@DirectI

  • 1. MapReduce@DirectI amkiray: ramki.g@directi.com uvdhray: dhruv.m@directi.com
  • 2. Lets start with an example… access.log timestamp,url,response_code,response_time products.dat date, product_id, price
  • 3. Requirement:  Number of requests in the last 30 days.  $> ls –rt *.log | tail -30 | xargs “wc –l”
  • 4. Requirement:  Busiest 30 minutes in last 30 days.  $> ls –rt *.log| tail -30 | xargs “./count_30min.sh“
  • 5. Requirement:  Number of failed buy requests for products worth  more than $30 in the last 30 days . Import data to an RDBMS; SELECT COUNT(*) FROM logs, products WHERE GET_REQUEST_TYPE(logs.url)=„BUY‟ AND GET_PRODUCT_ID(logs.url)=products.product_id AND product.price>30 AND DATE(log.timestamp) = products.date;
  • 6. Now gimme the number of  failed buy requests for products worth more than $30 in the It might take a last 1 Year. while!
  • 7. 2 days later…. On its way! Inserting data into database…
  • 8. 5 days later… $> mysqladmin processlist +-----------------------------------+ | Query | Copy to Temp Table | +-----------------------------------+ May be its Joining!! Or may be its dead… Or may be my replacement will see the result .. 
  • 9. Hadoop god@internal.directi.com Go use bloody!
  • 10. But Why?! Distributed data processing cluster  A distributed file system  Data location sensitive Task scheduler  MapReduce paradigm  Handles Parallelizable and distributable tasks  Failover capability  Web based monitoring capabilities  More value for your time! 
  • 11. Where to start?? MapReduce:  Q: Sum of all squares of a=[1,2,3,4,3,2,7]  Simple….  fold( map(a, square()), sum()) You can do that in any functional programming language….
  • 12. Now do it for an array of 100 million elements…….
  • 13. This is where Hadoop comes in.. Distributed File System : HDFS  Distributed Computation/Task Processing: Hadoop  Name Node + Data Node  Task Tracker + Job Tracker 
  • 14.
  • 17. An example Word Count  hadoop jar contrib/streaming/hadoop-0.*-streaming.jar -jobconf mapred.data.field.separator=quot;,quot; -input 'wc.eg.in' -output 'wc.eg.out' -mapper 'wc -w' -reducer quot;awk „{ sum+=$1 } END{ print sum}‟quot; Task Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
  • 18. Pig and Pig Latin A procedural language for MapReduce operations.  logs = LOAD 'access.logs' USING PigStorage(',') AS (ts:int, URL:chararray, resp:chararray, resp_time:int); products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int); l1 = FOREACH logs GENERATE GetDate(ts) as req_date, GetProductID(URL) as prod_id, GetRequestType(URL) as rtype, resp, resp_time; j1 = JOIN l1 BY prod_id, products by pid; j2 = FILTER j1 BY req_date==date AND price>30.0F; j3 = GROUP j2 ALL; j4 = FOREACH j3 GENERATE COUNT(j2); DUMP j4
  • 19. Q&A