Apache Hadoop
  an introduction

    Todd Lipcon
   todd@cloudera.com
   @tlipcon @cloudera
     March 24, 2011
Introductions
Software Engineer at Cloudera
Apache Hadoop, HBase, Thrift
committer
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
The Hadoop Ecosystem
Use Cases
Experiences as a developer
Data is the difference.

    What's data?
Photo by C.C Chapman (CC BY-NC-ND)
http://www.flickr.com/photos/cc_chapman/3342268874/
“Every two days we create as
much information as we did
from the dawn of civilization
up until 2003.”

                    Eric Schmidt
“I keep saying that the sexy
job in the next 10 years will be
statisticians. And I'm not
kidding.”

                              Hal Varian
             (Google's chief economist)
Are you throwing
  away data?
 Data comes in many shapes and
 sizes: relational tuples, log files,
 semistructured textual data (e.g.,
 e-mail), …
 Are you throwing it away because it
 doesn't 'fit'?
So, what's
Hadoop?


    The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
Apache Hadoop is an
        open-source system
   to reliably store and process
       GOBS of data
across many commodity computers.
Two Core Components
    Store: HDFS
        Self-healing, high-bandwidth clustered storage.
    Process: Map/Reduce
        Fault-tolerant distributed processing.
What makes
Hadoop special?
Falsehood #1: Machines can be reliable…




Image: MadMan the Mighty CC BY-NC-SA
Hadoop separates
distributed system fault-tolerance
code from application logic.
[Diagram labels: Systems Programmers, Statisticians, Unicorns]
Falsehood #2: Machines deserve identities...




                                      Image: Laughing Squid CC BY-NC-SA
Hadoop lets you interact
                with a cluster, not a
                bunch of machines.




Image: Yahoo! Hadoop cluster [OSCON '07]
Falsehood #3: Your analysis fits on one machine…




                                    Image: Matthew J. Stinson CC-BY-NC
Hadoop scales linearly
     with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:

  Extensive machine learning on <100GB of image
  data

  Simple SQL queries on >100TB of clickstream
  data

Hadoop works for both applications!
Hadoop sounds like
     magic.

               Coincidentally, today is
               Houdini's birthday, though
               he was not a Hadoop
               committer.




How is it possible?
A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
 20-40 nodes per rack
Cluster nodes
Master nodes (1 each)
    NameNode (metadata server and database)
    JobTracker (scheduler)
Slave nodes (1-4000 each)
    DataNodes (block storage)
    TaskTrackers (task execution)
HDFS API
FileSystem fs = FileSystem.get(conf);
InputStream in = fs.open(new Path("/foo/bar"));
OutputStream os = fs.create(new Path("/baz"));
fs.delete(…), fs.listStatus(…)
HDFS Data Storage
[Diagram: the NameNode records that /logs/weblog.txt (158MB) consists of three blocks (64MB, 64MB, 30MB; IDs blk_29232, blk_19231, blk_329432), which are stored on DataNodes DN 1 through DN 4.]
HDFS Write Path
•   HDFS has split the file into
    64MB blocks and stored it on
    the DataNodes.


•   Now, we want to process that
    data.
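The block split described above comes down to simple size arithmetic. The following is a minimal sketch of that arithmetic only, not of how HDFS itself is implemented; the class and method names are hypothetical and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Hadoop API.
class BlockSplitSketch {
    /** Returns the sizes of the blocks that a file of fileSize units is cut into. */
    static List<Long> blockSizes(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            // Every block is full-sized except possibly the last one.
            long size = Math.min(blockSize, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // The 158MB /logs/weblog.txt from the storage slide, with 64MB blocks.
        System.out.println(blockSizes(158, 64)); // prints [64, 64, 30]
    }
}
```

Only the last block is smaller than the block size, which matches the 64MB, 64MB, 30MB layout on the storage slide.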
The MapReduce
 Programming
    Model
You specify map()
  and reduce()
   functions.

 The framework
 does the rest.
map()
       map: (K₁, V₁) → list(K₂, V₂)
Key:   byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”


Key:   userimage
Value: 2326 bytes

The map function runs, where possible, on the same node
that stores the data!
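As a rough illustration of the map step on this slide, the sketch below parses one access-log line and emits (first path segment, response bytes). It is plain Java without the Hadoop Mapper classes; the class name, the regular expression, and the choice of the first path segment as the key are assumptions made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a map() function; does not use the Hadoop API.
class LogMapSketch {
    // Matches the request part of a log line, e.g. "GET /userimage/123 HTTP/1.0" 200 2326
    private static final Pattern REQUEST =
            Pattern.compile("\"[A-Z]+ /([^/ ]+)[^\"]*\" \\d{3} (\\d+)");

    /** map: (byte offset, line) -> list of (first path segment, bytes sent). */
    static List<Map.Entry<String, Long>> map(long byteOffset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher m = REQUEST.matcher(line);
        if (m.find()) {
            out.add(new SimpleEntry<String, Long>(m.group(1), Long.valueOf(m.group(2))));
        }
        // Lines that do not match emit nothing; the input key is ignored here.
        return out;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        System.out.println(map(193284L, line)); // prints [userimage=2326]
    }
}
```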
Input Format
•    Wait! HDFS is not a Key-Value store!
•    InputFormat interprets bytes as a Key
     and Value
    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
    "GET /userimage/123 HTTP/1.0" 200 2326


    Key:   log offset 193284
    Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
    -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
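To make the idea of interpreting bytes as keys and values concrete, here is a minimal sketch that turns a byte array into (byte offset, line) records, in the spirit of a line-oriented input format. It is plain Java with hypothetical names, and it ignores the handling of lines that straddle block boundaries, which a real InputFormat has to deal with.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a line-oriented record reader; not the Hadoop API.
class LineRecordSketch {
    /** Maps the byte offset of each line start to the text of that line. */
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            if (atEnd || data[i] == '\n') {
                // Skip the empty remainder after a trailing newline.
                if (!atEnd || i > start) {
                    out.put((long) start,
                            new String(data, start, i - start, StandardCharsets.UTF_8));
                }
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "first line" is a made-up placeholder; the second line is from the slide.
        String text = "first line\n"
                + "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /userimage/123 HTTP/1.0\" 200 2326\n";
        // The keys are byte offsets: 0 for the first line, 11 for the second.
        records(text.getBytes(StandardCharsets.UTF_8))
                .forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```

The slide's key of 193284 simply means that the log line starts that many bytes into the file.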
The Shuffle

Each map output is assigned to a
“reducer” based on its key


map output is grouped and
sorted by key
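The two shuffle steps can be sketched in a few lines of plain Java. The partition function below uses a non-negative hash modulo the number of reducers, which is one common way to assign keys to reducers; the grouping step uses a sorted map. All names are hypothetical, and everything runs in one process here, whereas the real shuffle moves data between machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the shuffle; not the Hadoop API.
class ShuffleSketch {
    /** Picks a reducer for a key: non-negative hash modulo the reducer count. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Groups map output values by key; the TreeMap keeps keys sorted. */
    static Map<String, List<Long>> groupAndSort(
            List<? extends Map.Entry<String, Long>> mapOutput) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Map outputs for the key "userimage", as on the reduce() slide.
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<String, Long>("userimage", 2326L));
        mapOutput.add(new SimpleEntry<String, Long>("userimage", 1000L));
        mapOutput.add(new SimpleEntry<String, Long>("userimage", 3020L));

        System.out.println(groupAndSort(mapOutput)); // prints {userimage=[2326, 1000, 3020]}
        // The same key always lands on the same reducer.
        System.out.println(partition("userimage", 4) == partition("userimage", 4)); // prints true
    }
}
```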
reduce()
    reduce: (K₂, iter(V₂)) → list(K₃, V₃)
Key:     userimage
Value:   2326 bytes   (from map task 0001)
Value:   1000 bytes   (from map task 0008)
Value:   3020 bytes   (from map task 0120)
                          Reducer function
Key:   userimage
Value: 6346 bytes
                          TextOutputFormat
userimage \t 6346
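The reduce step on this slide is a sum over the values for one key, followed by writing a tab-separated line. A minimal sketch in plain Java, with hypothetical names and without the Hadoop Reducer or TextOutputFormat classes, might look like this:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

// Hypothetical stand-in for a reduce() function and a text output step.
class SumReduceSketch {
    /** reduce: (key, values) -> (key, sum of values). */
    static Map.Entry<String, Long> reduce(String key, Iterable<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return new SimpleEntry<String, Long>(key, total);
    }

    /** Formats one output record as key, tab, value. */
    static String formatLine(Map.Entry<String, Long> record) {
        return record.getKey() + "\t" + record.getValue();
    }

    public static void main(String[] args) {
        // The three values from map tasks 0001, 0008 and 0120 on the slide.
        Map.Entry<String, Long> result =
                reduce("userimage", Arrays.asList(2326L, 1000L, 3020L));
        System.out.println(result.getValue()); // prints 6346
        System.out.println(formatLine(result)); // prints userimage, a tab, then 6346
    }
}
```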
Putting it together...
Hadoop is
not just MapReduce
      (NoNoSQL!)

        Hive project adds SQL
        support to Hadoop
        HiveQL (SQL dialect)
        compiles to a query plan
        Query plan executes as
        MapReduce jobs
Hive Example
CREATE TABLE movie_rating_data (
  userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA INPATH '/datasets/movielens' INTO TABLE
movie_rating_data;

CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) FROM movie_rating_data
GROUP BY movieid;
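To connect the GROUP BY query to the map and reduce steps, the sketch below computes the same per-movie average in plain Java: group ratings by movieid, then average each group. It only illustrates the logic and is not how Hive actually plans or executes the query; the class name and the sample rows in main are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of what the GROUP BY / AVG query computes.
class AverageRatingSketch {
    /** rows are {userid, movieid, rating}; returns movieid -> average rating. */
    static Map<Integer, Double> averageByMovie(int[][] rows) {
        // "Map" and shuffle: emit (movieid, rating) and group by movieid.
        Map<Integer, List<Integer>> ratingsByMovie = new TreeMap<>();
        for (int[] row : rows) {
            ratingsByMovie.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[2]);
        }
        // "Reduce": one average per movieid.
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : ratingsByMovie.entrySet()) {
            long sum = 0;
            for (int rating : e.getValue()) {
                sum += rating;
            }
            averages.put(e.getKey(), (double) sum / e.getValue().size());
        }
        return averages;
    }

    public static void main(String[] args) {
        // Made-up sample rows: {userid, movieid, rating}.
        int[][] rows = { {1, 10, 4}, {2, 10, 5}, {1, 20, 3} };
        System.out.println(averageByMovie(rows)); // prints {10=4.5, 20=3.0}
    }
}
```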
The Hadoop
 Ecosystem
[Diagram of ecosystem projects; the only surviving label is "(Column DB)"]
Hadoop in the Wild
     (yes, it's used in production)

Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC '09)

Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster

Twitter: >1TB per day, ~120 nodes

Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
Use Cases
Product
         Recommendations
•   Naïve approach: Users who bought toothpaste bought
    toothbrushes.

•   Hadoop approach

    •   What products did a user browse, hover over, rate,
        add to cart (but not buy), etc. in the last 2 months?

    •   What are the attributes of the user?

    •   What are our margins, promotions, inventory, etc.?
Product
    Recommendations
•   A lot of data!

    •   Activity: ~20GB/day × ~60 days ≈ 1.2TB

    •   User Data: 2GB

    •   Purchase Data: ~5GB

•   Pre-aggregating loses fidelity for individual users.
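The sizes on this slide can be checked with a short calculation. It assumes decimal units (1 TB = 1000 GB) and treats the approximate figures as exact:

```latex
\begin{align*}
\text{Activity} &\approx 20\,\text{GB/day} \times 60\,\text{days} = 1200\,\text{GB} \approx 1.2\,\text{TB} \\
\text{Total} &\approx 1200\,\text{GB} + 2\,\text{GB} + 5\,\text{GB} = 1207\,\text{GB} \approx 1.2\,\text{TB}
\end{align*}
```

Activity data dominates: user and purchase data together add well under 1% to the total.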
Hadoop and Java
              (the good)


Integration, integration, integration!
Tooling: IDEs, JCarder, AspectJ,
Maven/Ivy
Developer accessibility
Hadoop and Java
             (the bad)


Java is great for applications. Hadoop is
systems programming.
JNI is our hammer
 Compression, Security, FS access
C++ wrapper for setuid task execution
Hadoop and Java
             (the ugly)


JVM bugs!
Garbage Collection pauses on 50GB
heaps
WORA (write once, run anywhere) is a giant lie for
systems: worst of both worlds?
Ok, fine, what next?
Get Hadoop!
 CDH - Cloudera's Distribution
 including Apache Hadoop
 http://cloudera.com/
 http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/
Thanks!
•   todd@cloudera.com

•   @tlipcon

•   (feedback? yes!)

•   (hiring? yes!)

EclipseCon Keynote: Apache Hadoop - An Introduction
