Apache Hadoop
                          an introduction

                             Todd Lipcon
                           todd@cloudera.com
                           @tlipcon @cloudera
                              May 27, 2010



Thursday, May 27, 2010
Hi there!
                         Software Engineer at Cloudera
                         Hadoop contributor, HBase committer
                         Previously: systems programming,
                         operations, large scale data analysis
                         I love data and data systems




Outline
                         Why should you care? (Intro)
                         What is Hadoop?
                         The MapReduce Model
                         HDFS, Hadoop Map/Reduce
                         The Hadoop Ecosystem
                         Questions

Data is everywhere.

                         Data is important.


“I keep saying that the sexy job
                 in the next 10 years will be
              statisticians, and I’m not kidding.”

                                             Hal Varian
                            (Google’s chief economist)




Are you throwing away data?
                         Data comes in many shapes and sizes:
                         relational tuples, log files, semistructured
                         textual data (e.g., e-mail), and more.
                         Are you throwing it away because it
                         doesn’t ‘fit’?
So, what’s Hadoop?


                                The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry




Apache Hadoop is an
                             open-source system
                         to reliably store and process
                           gobs of information
         across many commodity computers.




                             The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry



Two Core Components
                         HDFS: self-healing, high-bandwidth clustered storage.
                         Map/Reduce: fault-tolerant distributed computing.



What makes Hadoop special?


Assumption 1: Machines can be reliable...




Image: MadMan the Mighty CC BY-NC-SA
Hadoop separates distributed system fault-tolerance code from application logic.
                         [Diagram labels: Systems Programmers, Statisticians, Unicorns]


Assumption 2: Machines have identities...




Image: Laughing Squid CC BY-NC-SA
Hadoop lets you interact
                   with a cluster, not a bunch
                          of machines.




Image: Yahoo! Hadoop cluster [OSCON ’07]
Assumption 3: Your analysis fits on one machine




                                  Image: Matthew J. Stinson CC-BY-NC
Hadoop scales linearly
                               with data size
                           or analysis complexity.
                         Data-parallel or compute-parallel. For example:

                           Extensive machine learning on <100GB of image data

                           Simple SQL-style queries on >100TB of clickstream
                           data

                         One Hadoop works for both applications!


A Typical Look...
                         5-4000 commodity servers
                         (8-core, 8-24GB RAM, 4-12 TB, gig-E)
                         2-level network architecture
                          20-40 nodes per rack




STOP! REAL METAL?
                         Isn’t this some kind of “Cloud Computing” conference?

                         Hadoop runs as a cloud (a cluster)
                         and maybe in a cloud (e.g., EC2).
Image: Josh Hough CC BY-NC-SA
Hadoop sounds like magic.

                         How is it possible?
dramatis personae
                         Starring...
                           NameNode (metadata server and database)
                           SecondaryNameNode (assistant to NameNode)
                           JobTracker (scheduler)
                         The Chorus...
                           DataNodes (block storage)
                           TaskTrackers (task execution)
Thanks to Zak Stone for earmuff image!
HDFS
                         [Diagram: Namenode (fs metadata) and Datanodes]
                         Legend:
                           3x64MB file, 3 rep
                           4x64MB file, 3 rep
                           Small file, 7 rep




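A rough sketch of the arithmetic behind the HDFS legend above: a file is split into 64 MB blocks and every block is stored "rep" times. The class name HdfsBlockMath and the 1 MB size used for the "small file" are assumptions made for illustration; only the 64 MB block size and the replication factors come from the slide.

```java
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long BLOCK_SIZE = 64 * MB; // block size shown in the slide legend

    /** Number of blocks needed to hold fileBytes (the last block may be partial). */
    static long blockCount(long fileBytes) {
        if (fileBytes == 0) return 0;
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Total block replicas stored across the datanodes for one file. */
    static long replicaCount(long fileBytes, int replication) {
        return blockCount(fileBytes) * replication;
    }

    public static void main(String[] args) {
        System.out.println(replicaCount(3 * BLOCK_SIZE, 3)); // 3x64MB file, 3 rep -> 9
        System.out.println(replicaCount(4 * BLOCK_SIZE, 3)); // 4x64MB file, 3 rep -> 12
        System.out.println(replicaCount(1 * MB, 7));         // small (assumed 1 MB) file, 7 rep -> 7
    }
}
```

Each replica is one block that can live on a different datanode, which is what lets a client read another copy after a datanode crash.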
HDFS Write Path
                         [Diagram: One Rack, A Different Rack]




HDFS Failures?
                Datanode crash?
                         Clients read another copy
                         Background rebalance/rereplicate
                Namenode crash?
                         uh-oh
                         not responsible for
                         majority of downtime!



The M/R Programming Model




You specify map() and reduce() functions.

                         The framework does the rest.
fault-tolerance
                         (that’s what’s important)
                         (and that’s why Hadoop)




map()
                         map: (K₁, V₁) → list(K₂, V₂)

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      /**
        * Called once for each key/value pair in the input split. Most applications
        * should override this, but the default is the identity function.
        */
      protected void map(KEYIN key, VALUEIN value,
                          Context context) throws IOException,
                          InterruptedException {
         // context.write() can be called many times;
         // this is the default "identity mapper" implementation
         context.write((KEYOUT) key, (VALUEOUT) value);
      }
    }




(the shuffle)

              map output is assigned to a “reducer”

              map output is sorted by key




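The two shuffle steps above (assign each map output to a reducer, then sort by key) can be sketched in plain Java. This is an in-memory illustration under stated assumptions, not Hadoop's implementation: the class ShuffleSketch and its methods are hypothetical names, the partition rule is the common hash-modulo scheme, and a TreeMap stands in for the sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    /** Picks a reducer for a key; masking the sign bit keeps the index non-negative. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Groups map output per reducer; each reducer sees its keys in sorted order. */
    static List<TreeMap<String, List<Long>>> shuffle(
            List<Map.Entry<String, Long>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Long>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> kv : mapOutput) {
            perReducer.get(partition(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
            Map.entry("dunster", 1L), Map.entry("adams", 1L), Map.entry("dunster", 1L));
        // Every value for a given key lands in the same reducer's map.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because the reducer index depends only on the key, all values for one key reach the same reduce() call, whichever map task produced them.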
reduce()
                         reduce: (K₂, iter(V₂)) → list(K₃, V₃)

      public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
        /**
          * This method is called once for each key. Most applications will define
          * their reduce class by overriding this method. The default implementation
          * is an identity function.
          */
        @SuppressWarnings("unchecked")
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                               ) throws IOException, InterruptedException {
           for(VALUEIN value: values) {
             context.write((KEYOUT) key, (VALUEOUT) value);
           }
        }
      }




Putting it together...
                         [Diagrams: Logical, Physical Flow, Physical]




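To make the logical flow concrete, here is a minimal single-process sketch of map, shuffle and reduce chained together for a word count. It assumes nothing from Hadoop itself: MiniMapReduce, map, reduce and run are illustrative names, and the whole job runs in one JVM rather than across tasktrackers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    /** map: (offset, line) -> list of (word, 1) */
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1L));
        }
        return out;
    }

    /** reduce: (word, iter(counts)) -> sum of the counts */
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    static TreeMap<String, Long> run(List<String> lines) {
        // Map phase plus shuffle: group every emitted value under its sorted key.
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1;
        }
        // Reduce phase: one reduce() call per distinct key.
        TreeMap<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```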
Some samples...
                         Build an inverted index.
                         Summarize data grouped by a key.
                         Build map tiles from geographic data.
                         OCR many images.
                         Learn ML models (e.g., Naive Bayes
                         for text classification).
                         Augment traditional BI/DWH
                         technologies (by archiving raw data).
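As an illustration of the first sample, an inverted index maps each term to the documents that contain it. The sketch below is a plain-Java approximation, assuming a hypothetical InvertedIndexSketch class and made-up document ids; in a real job the "map" step would emit (term, docId) pairs and the "reduce" step would collect the ids per term.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    /** docs maps a document id to its text; the result maps each term to sorted doc ids. */
    static TreeMap<String, TreeSet<String>> build(Map<String, String> docs) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "map" step: split the document into lower-cased terms
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                // "reduce" step, folded in: collect the doc ids seen for each term
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "d1", "Hadoop stores data",
            "d2", "Data is everywhere");
        System.out.println(build(docs).get("data")); // [d1, d2]
    }
}
```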
M/R
                         Tasktrackers on the same machines as datanodes
                         [Diagram: One Rack, A Different Rack]
                         Legend: job on stars; different job; idle

M/R




M/R Failures
                         Task fails
                           Try again?
                           Try again somewhere else?
                           Report failure
                         Retries possible because of idempotence


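The retry idea can be sketched in a few lines. This is only an illustration of why idempotence matters, not the JobTracker's scheduling logic: TaskRetrySketch, runWithRetries and the attempt limit of 4 are all assumptions made for the example.

```java
import java.util.function.Supplier;

class TaskRetrySketch {
    /** Runs the task up to maxAttempts times; rethrows the last failure if all attempts fail. */
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                // Safe to try again only because the task is idempotent:
                // rerunning it produces the same output and no extra side effects.
                last = e;
            }
        }
        throw last; // report failure
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("simulated task failure");
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts"); // done after 3 attempts
    }
}
```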
There’s more than the Java API
                         Streaming: perl, python, ruby, whatever. stdin/stdout/stderr.
                         Pig: higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
                         Hive: SQL interface. Great for analysts. Developed at Facebook.

                         Many tasks actually require a series of M/R jobs; that’s ok!
The Hadoop Ecosystem
                         [Layered diagram, top to bottom]
                         ETL Tools, BI Reporting, RDBMS
                         Pig (Data Flow), Hive (SQL), Sqoop
                         MapReduce (Job Scheduling/Execution System)
                         HBase (Column DB) (Key-Value store)
                         HDFS (Hadoop Distributed File System)
                         Alongside the stack: ZooKeeper (Coordination), Avro (Serialization)



Hadoop in the Wild
                                 (yes, it’s used in production)

                         Yahoo! Hadoop Clusters: > 82PB, >25k machines
                         (Eric14, HadoopWorld NYC ’09)

                         Facebook: 15TB new data per day;
                         10000+ cores, 12+ PB

                         Twitter: ~1TB per day, ~80 nodes

                         Lots of 5-40 node clusters at companies without petabytes
                         of data (web, retail, finance, telecom, research)


Ok, fine, what next?
                         Get Hadoop!
                          Cloudera’s Distribution for Hadoop
                          http://hadoop.apache.org/
                         Try it out! (Locally, or on EC2)
                         Watch free training videos on
                         http://cloudera.com/
                         [Callout: Door Prize]

Questions?

                          todd@cloudera.com

                          (feedback? yes!)

                          (hiring? yes!)




Backup Slides




Important APIs
                         M/R Flow (→ is 1:many)
                           Input Format   data → K₁,V₁
                           Mapper         K₁,V₁ → K₂,V₂
                           Combiner       K₂,iter(V₂) → K₂,V₂
                           Partitioner    K₂,V₂ → int
                           Reducer        K₂,iter(V₂) → K₃,V₃
                           Out. Format    K₃,V₃ → data
                         Other
                           Writable
                           JobClient
                           *Context
                           Filesystem
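The Combiner row has the same shape as a reducer that keeps its key type: it pre-aggregates one map task's output before the shuffle. A small sketch of why that is safe for a sum follows; CombinerSketch and the sample counts are assumptions for illustration, and the equivalence relies on the operation being associative and commutative.

```java
import java.util.List;

class CombinerSketch {
    /** Sums the values for one key; usable both as combiner and as reducer. */
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        // Counts for one key as emitted by two different map tasks.
        List<Long> fromMapTask1 = List.of(1L, 1L, 1L);
        List<Long> fromMapTask2 = List.of(1L);

        // Without a combiner the reducer receives all four values.
        long direct = sum(List.of(1L, 1L, 1L, 1L));

        // With a combiner each map task ships a single partial sum.
        long combined = sum(List.of(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " " + combined); // 4 4
    }
}
```

The result is the same, but only two values cross the network instead of four.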
the “grep” example
    public int run(String[] args) throws Exception {
      if (args.length < 3) {
        System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
        ToolRunner.printGenericCommandUsage(System.out);
        return -1;
      }

      Path tempDir = new Path("grep-temp-" +
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      JobConf grepJob = new JobConf(getConf(), Grep.class);

      try {
        grepJob.setJobName("grep-search");

        FileInputFormat.setInputPaths(grepJob, args[0]);

        grepJob.setMapperClass(RegexMapper.class);
        grepJob.set("mapred.mapper.regex", args[2]);
        if (args.length == 4)
          grepJob.set("mapred.mapper.regex.group", args[3]);

        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);

        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormat(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);

        JobClient.runJob(grepJob);

        JobConf sortJob = new JobConf(Grep.class);
        sortJob.setJobName("grep-sort");

        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);

        sortJob.setMapperClass(InverseMapper.class);

        // write a single file
        sortJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        // sort by decreasing freq
        sortJob.setOutputKeyComparatorClass(
            LongWritable.DecreasingComparator.class);

        JobClient.runJob(sortJob);
      } finally {
        FileSystem.get(grepJob).delete(tempDir, true);
      }
      return 0;
    }




$ cat input.txt
    adams dunster kirkland dunster
    kirland dudley dunster
    adams dunster winthrop

    $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'

    $ cat output1/part-00000
    4 dunster
    2 adams



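For readers without a cluster at hand, the same result can be reproduced with a small plain-Java approximation of the two jobs: count every regex match, then sort the matches by decreasing count. GrepSketch is a hypothetical stand-in that runs in one process; it does not call any Hadoop API, and the tab separator in its output is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrepSketch {
    static List<String> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);

        // Job 1: emit (match, 1) for every regex match and sum per match.
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1L, Long::sum);
            }
        }

        // Job 2: order the (match, count) pairs by decreasing count.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            out.add(e.getValue() + "\t" + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "adams dunster kirkland dunster",
            "kirland dudley dunster",
            "adams dunster winthrop");
        // Prints "4<TAB>dunster" then "2<TAB>adams", matching output1/part-00000 above.
        for (String row : grep(input, "dunster|adams")) {
            System.out.println(row);
        }
    }
}
```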
Job 1 of 2
    JobConf grepJob = new JobConf(getConf(), Grep.class);
    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);
      grepJob.set("mapred.mapper.regex", args[2]);
      if (args.length == 4)
        grepJob.set("mapred.mapper.regex.group", args[3]);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormat(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      JobClient.runJob(grepJob);
    } ...


Job 2 of 2
      JobConf sortJob = new JobConf(Grep.class);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);
      // (implicit identity reducer)
      // write a single file
      sortJob.setNumReduceTasks(1);
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      // sort by decreasing freq
      sortJob.setOutputKeyComparatorClass(
          LongWritable.DecreasingComparator.class);

      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(grepJob).delete(tempDir, true);
    }
    return 0;
  }


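The types there...
                         ?, Text            input to RegexMapper (key unused)
                         Text, Long         RegexMapper output
                         Text, list(Long)   LongSumReducer input
                         Text, Long         LongSumReducer output
                         Long, Text         InverseMapper output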

Facebook Data Infrastructure
                         Facebook’s DWH, 2008
                         [Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]




il 1, 2009

 Thursday, May 27, 2010

More Related Content

What's hot

Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopEvert Lammerts
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATarak Tar
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Introduction to Hadoop
Introduction to Hadoop Introduction to Hadoop
Introduction to Hadoop Sudarshan Pant
 

What's hot (20)

Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
HDFS
HDFSHDFS
HDFS
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hive sq lfor-hadoop
Hive sq lfor-hadoopHive sq lfor-hadoop
Hive sq lfor-hadoop
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
THE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATATHE SOLUTION FOR BIG DATA
THE SOLUTION FOR BIG DATA
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Introduction to Hadoop
Introduction to Hadoop Introduction to Hadoop
Introduction to Hadoop
 

Viewers also liked

從 Java programmer 的觀點看 ruby
從 Java programmer 的觀點看 ruby從 Java programmer 的觀點看 ruby
從 Java programmer 的觀點看 ruby建興 王
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
全文搜尋引擎的進階實作與應用
全文搜尋引擎的進階實作與應用全文搜尋引擎的進階實作與應用
全文搜尋引擎的進階實作與應用建興 王
 
大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術Wei-Yu Chen
 
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事台灣資料科學年會
 
Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Wei-Yu Chen
 
MLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series TalkMLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series TalkJerry Wu
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程台灣資料科學年會
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (14)

從 Java programmer 的觀點看 ruby
從 Java programmer 的觀點看 ruby從 Java programmer 的觀點看 ruby
從 Java programmer 的觀點看 ruby
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Hadoop.TW : Now and Future
Hadoop.TW : Now and FutureHadoop.TW : Now and Future
Hadoop.TW : Now and Future
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
全文搜尋引擎的進階實作與應用
全文搜尋引擎的進階實作與應用全文搜尋引擎的進階實作與應用
全文搜尋引擎的進階實作與應用
 
大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術
 
海量視覺資料-孫民
海量視覺資料-孫民海量視覺資料-孫民
海量視覺資料-孫民
 
孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事孫民/從電腦視覺看人工智慧 : 下一件大事
孫民/從電腦視覺看人工智慧 : 下一件大事
 
Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計
 
MLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series TalkMLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series Talk
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar to An Introduction to Apache Hadoop and its Core Components HDFS and MapReduce

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Big Data @ Bodensee Barcamp 2010
Big Data @ Bodensee Barcamp 2010Big Data @ Bodensee Barcamp 2010
Big Data @ Bodensee Barcamp 2010c1sc0
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on HadoopDataWorks Summit
 
iOS Development. Some practices.
iOS Development. Some practices.iOS Development. Some practices.
iOS Development. Some practices.Alexander Lobunets
 
2010-06 - a smalltalk about salesforce.com with java architects at YaJuG
2010-06 - a smalltalk about salesforce.com with java architects at YaJuG2010-06 - a smalltalk about salesforce.com with java architects at YaJuG
2010-06 - a smalltalk about salesforce.com with java architects at YaJuGYves Leblond
 
Google App Engine - Devfest India 2010
Google App Engine -  Devfest India 2010Google App Engine -  Devfest India 2010
Google App Engine - Devfest India 2010Patrick Chanezon
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Experiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's MissingExperiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's MissingCloudera, Inc.
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
There's no such thing as big data
There's no such thing as big dataThere's no such thing as big data
There's no such thing as big dataAndrew Clegg
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4John Ballinger
 
Data Loading for Ext GWT
Data Loading for Ext GWTData Loading for Ext GWT
Data Loading for Ext GWTSencha
 

Similar to An Introduction to Apache Hadoop and its Core Components HDFS and MapReduce (20)

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Big Data @ Bodensee Barcamp 2010
Big Data @ Bodensee Barcamp 2010Big Data @ Bodensee Barcamp 2010
Big Data @ Bodensee Barcamp 2010
 
noSQL @ QCon SP
noSQL @ QCon SPnoSQL @ QCon SP
noSQL @ QCon SP
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
CSE509 Lecture 4
CSE509 Lecture 4CSE509 Lecture 4
CSE509 Lecture 4
 
iOS Development. Some practices.
iOS Development. Some practices.iOS Development. Some practices.
iOS Development. Some practices.
 
2010-06 - a smalltalk about salesforce.com with java architects at YaJuG
2010-06 - a smalltalk about salesforce.com with java architects at YaJuG2010-06 - a smalltalk about salesforce.com with java architects at YaJuG
2010-06 - a smalltalk about salesforce.com with java architects at YaJuG
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
Google App Engine - Devfest India 2010
Google App Engine -  Devfest India 2010Google App Engine -  Devfest India 2010
Google App Engine - Devfest India 2010
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Experiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's MissingExperiences Evolving a New Analytical Platform: What Works and What's Missing
Experiences Evolving a New Analytical Platform: What Works and What's Missing
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop
HadoopHadoop
Hadoop
 
There's no such thing as big data
There's no such thing as big dataThere's no such thing as big data
There's no such thing as big data
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4
 
Data Loading for Ext GWT
Data Loading for Ext GWTData Loading for Ext GWT
Data Loading for Ext GWT
 


An Introduction to Apache Hadoop and its Core Components HDFS and MapReduce

  • 1. Apache Hadoop an introduction Todd Lipcon todd@cloudera.com @tlipcon @cloudera May 27, 2010 Thursday, May 27, 2010
  • 2. Hi there! Software Engineer at Cloudera. Hadoop contributor, HBase committer. Previously: systems programming, operations, large scale data analysis. I love data and data systems.
  • 3. Outline: Why should you care? (Intro); What is Hadoop?; The MapReduce Model; HDFS, Hadoop Map/Reduce; The Hadoop Ecosystem; Questions
  • 4. Data is everywhere. Data is important.
  • 9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  • 10. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), … . Are you throwing it away because it doesn’t ‘fit’?
  • 11. So, what’s Hadoop? Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  • 12. Apache Hadoop is an open-source system to reliably store and process gobs of information across many commodity computers. Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  • 13. Two Core Components. HDFS: self-healing, high-bandwidth clustered storage. Map/Reduce: fault-tolerant distributed computing.
  • 14. What makes Hadoop special?
  • 15. Assumption 1: Machines can be reliable... Image: MadMan the Mighty CC BY-NC-SA
  • 16. Hadoop separates distributed system fault-tolerance code from application logic. (Diagram labels: Unicorns; Systems Programmers; Statisticians.)
  • 17. Assumption 2: Machines have identities... Image: Laughing Squid CC BY-NC-SA
  • 18. Hadoop lets you interact with a cluster, not a bunch of machines. Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 19. Assumption 3: Your analysis fits on one machine. Image: Matthew J. Stinson CC-BY-NC
  • 20. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL-style queries on >100TB of clickstream data. One Hadoop works for both applications!
  • 21. A Typical Look... 5-4000 commodity servers (8-core, 8-24GB RAM, 4-12 TB, gig-E); 2-level network architecture; 20-40 nodes per rack.
  • 22. STOP! REAL METAL? Isn’t this some kind of “Cloud Computing” conference? Hadoop runs as a cloud (a cluster) and maybe in a cloud (e.g., EC2). Image: Josh Hough CC BY-NC-SA
  • 23. Hadoop sounds like magic. How is it possible?
  • 24. dramatis personae. Starring... NameNode (metadata server and database); SecondaryNameNode (assistant to NameNode); JobTracker (scheduler). The Chorus… DataNodes (block storage); TaskTrackers (task execution). Thanks to Zak Stone for earmuff image!
  • 25. HDFS. Namenode (fs metadata); Datanodes. (Diagram: 3x64MB file, 3 rep; 4x64MB file, 3 rep; Small file, 7 rep. Racks: One Rack; A Different Rack.)
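The storage cost implied by the files on slide 25 can be worked out with a little arithmetic. The following is a minimal sketch, not HDFS code: the class and method names are hypothetical, the 64 MB block size is taken from the slide, and the 1 MB size used for the "small file" is an assumption made only for illustration.

```java
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of fixed-size blocks needed to hold a file (ceiling division).
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total block replicas kept on datanodes for one file.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        System.out.println(replicaCount(3 * 64 * MB, block, 3)); // prints 9
        System.out.println(replicaCount(4 * 64 * MB, block, 3)); // prints 12
        System.out.println(replicaCount(1 * MB, block, 7));      // prints 7 (assumed 1 MB small file)
    }
}
```

A file smaller than one block still occupies one block entry per replica, which is why the small file with 7 replicas counts as 7.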
  • 27. HDFS Failures? Datanode crash? Clients read another copy; background rebalance/rereplicate. Namenode crash? Uh-oh. Not responsible for majority of downtime!
  • 28. The M/R Programming Model
  • 29. You specify map() and reduce() functions. The framework does the rest.
  • 30. fault-tolerance (that’s what’s important) (and that’s why Hadoop)
  • 31. map() map: K₁,V₁ → list(K₂,V₂)
      public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        /**
         * Called once for each key/value pair in the input split. Most applications
         * should override this, but the default is the identity function.
         */
        protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
          // context.write() can be called many times
          // this is default “identity mapper” implementation
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
      }
  • 32. (the shuffle) Map output is assigned to a “reducer”. Map output is sorted by key.
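The two shuffle steps on slide 32 (assign each map output record to a reducer, sort by key) can be sketched in plain Java without any Hadoop dependency. This is an in-memory illustration under stated assumptions: the class and method names are hypothetical, keys are strings, and the reducer is chosen by hashing the key, which is one common way to do the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSketch {
    // Pick a reducer for a key; masking the sign bit keeps the index non-negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group map output per reducer; each TreeMap keeps its keys in sorted order.
    static List<TreeMap<String, List<Long>>> shuffle(
            List<Map.Entry<String, Long>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Long>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> kv : mapOutput) {
            perReducer.get(partition(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
                Map.entry("dunster", 1L), Map.entry("adams", 1L), Map.entry("dunster", 1L));
        // Every value for a given key lands in the same reducer's map.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because the reducer index depends only on the key, all values for one key reach the same reducer, which is what lets reduce() see the complete value list for that key.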
  • 33. reduce() reduce: K₂, iter(V₂) → list(K₃,V₃)
      public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        /**
         * This method is called once for each key. Most applications will define
         * their reduce class by overriding this method. The default implementation
         * is an identity function.
         */
        @SuppressWarnings("unchecked")
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
          for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
          }
        }
      }
  • 34. Putting it together... (Diagram labels: Logical; Physical Flow; Physical.)
  • 35. Some samples... Build an inverted index. Summarize data grouped by a key. Build map tiles from geographic data. OCRing many images. Learning ML models (e.g., Naive Bayes for text classification). Augment traditional BI/DWH technologies (by archiving raw data).
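The first sample on slide 35, building an inverted index, fits the map/reduce shape directly. The sketch below runs entirely in memory and uses no Hadoop classes; the class name, the tokenization rule (lower-case, split on non-word characters), and the two toy documents are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // map: (docId, text) -> list of (word, docId)
    static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, docId));
            }
        }
        return out;
    }

    // reduce: (word, iter(docId)) -> sorted, de-duplicated posting list
    static TreeSet<String> reduce(String word, Iterable<String> docIds) {
        TreeSet<String> postings = new TreeSet<>();
        for (String id : docIds) {
            postings.add(id);
        }
        return postings;
    }

    static TreeMap<String, TreeSet<String>> run(Map<String, String> docs) {
        // Stand-in for the shuffle: group values by key, keys kept sorted.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, String> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            index.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of("d1", "data is everywhere", "d2", "data is important");
        System.out.println(run(docs));
        // prints {data=[d1, d2], everywhere=[d1], important=[d2], is=[d1, d2]}
    }
}
```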
  • 36. M/R. Tasktrackers on the same machines as datanodes. (Diagram legend: Job on stars; Different job; Idle. Racks: One Rack; A Different Rack.)
  • 38. M/R Failures. Task fails: Try again? Try again somewhere else? Report failure. Retries possible because of idempotence.
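The retry idea on slide 38 can be illustrated with a small sketch. It is not how the JobTracker schedules tasks; the class name, the node names, and the simulated crash are hypothetical. The point it shows is that a task which is a pure function of its input can be re-run on another node without changing the result.

```java
import java.util.List;
import java.util.function.Function;

final class RetrySketch {
    // Try the task on each candidate node in turn until one attempt succeeds.
    static <T> T runWithRetries(List<String> nodes, Function<String, T> task) {
        RuntimeException last = null;
        for (String node : nodes) {
            try {
                return task.apply(node);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again somewhere else
            }
        }
        // Every attempt failed: report failure to the caller.
        throw new IllegalStateException("task failed on every node", last);
    }

    public static void main(String[] args) {
        Long result = runWithRetries(List.of("node-a", "node-b"), node -> {
            if (node.equals("node-a")) {
                throw new RuntimeException("simulated crash on " + node);
            }
            return 2L + 2L; // same answer no matter which node runs it
        });
        System.out.println(result); // prints 4
    }
}
```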
  • 39. There’s more than the Java API. Streaming: perl, python, ruby, whatever; stdin/stdout/stderr. Pig: higher-level dataflow language for easy ad-hoc analysis; developed at Yahoo!. Hive: SQL interface; great for analysts; developed at Facebook. Many tasks actually require a series of M/R jobs; that’s ok!
  • 40. The Hadoop Ecosystem: ETL Tools; BI Reporting; RDBMS; Pig (Data Flow); Hive (SQL); Sqoop; ZooKeeper (Coordination); Avro (Serialization); MapReduce (Job Scheduling/Execution System); HBase (Column DB, Key-Value store); HDFS (Hadoop Distributed File System)
  • 41. Hadoop in the Wild (yes, it’s used in production) Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09) Facebook: 15TB new data per day; 10000+ cores, 12+ PB Twitter: ~1TB per day, ~80 nodes Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
  • 42. Ok, fine, what next? Get Hadoop! Cloudera’s Distribution for Hadoop http://hadoop.apache.org/ Try it out! (Locally, or on EC2) Door Prize Watch free training videos on http://cloudera.com/
  • 43. Questions? todd@cloudera.com (feedback? yes!) (hiring? yes!)
  • 45. Important APIs (→ is 1:many). M/R Flow: Input Format: data→K₁,V₁; Mapper: K₁,V₁→K₂,V₂; Combiner: K₂,iter(V₂)→K₂,V₂; Partitioner: K₂,V₂→int; Reducer: K₂,iter(V₂)→K₃,V₃; Out. Format: K₃,V₃→data. Other: Writable; JobClient; *Context; Filesystem.
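The Combiner signature on slide 45 (K₂,iter(V₂)→K₂,V₂) describes local pre-aggregation of one map task's output before the shuffle. The sketch below shows that idea for (word, count) records in plain Java; the class name and sample records are hypothetical, and summation is used because it gives the same total whether it is applied once or in stages.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {
    // Collapse one map task's (word, count) records into a single record per word.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
                Map.entry("dunster", 1L), Map.entry("adams", 1L),
                Map.entry("dunster", 1L), Map.entry("dunster", 1L));
        // Four records go in, two come out, so less data crosses the network.
        System.out.println(combine(mapOutput)); // prints {adams=1, dunster=3}
    }
}
```

Summing partial sums yields the same total as summing the raw values, which is consistent with the deck's grep example registering the same class, LongSumReducer, as both combiner and reducer.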
  • 46. The “grep” example
      public int run(String[] args) throws Exception {
        if (args.length < 3) {
          System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
          ToolRunner.printGenericCommandUsage(System.out);
          return -1;
        }
        Path tempDir = new Path("grep-temp-"
            + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
        JobConf grepJob = new JobConf(getConf(), Grep.class);
        try {
          grepJob.setJobName("grep-search");
          FileInputFormat.setInputPaths(grepJob, args[0]);
          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);
          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);
          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);
          JobClient.runJob(grepJob);

          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");
          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);
          sortJob.setMapperClass(InverseMapper.class);
          // write a single file
          sortJob.setNumReduceTasks(1);
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          // sort by decreasing freq
          sortJob.setOutputKeyComparatorClass(
              LongWritable.DecreasingComparator.class);
          JobClient.runJob(sortJob);
        } finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
      }
  • 47.
      $ cat input.txt
      adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop
      $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'
      $ cat output1/part-00000
      4 dunster
      2 adams
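The session on slide 47 can be reproduced in memory to see what the two chained jobs compute. This is a sketch in plain Java, not the Hadoop example itself: the class and method names are hypothetical, and the tab between count and word is an assumption about the output separator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GrepSketch {
    // Job 1: emit (match, 1) for every regex match, then sum the counts per match.
    static Map<String, Long> countMatches(String input, String regex) {
        Map<String, Long> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            counts.merge(m.group(), 1L, Long::sum);
        }
        return counts;
    }

    // Job 2: order records by decreasing count and print the count first.
    static List<String> sortByDecreasingCount(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            lines.add(e.getValue() + "\t" + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop";
        // Prints "4<TAB>dunster" then "2<TAB>adams", matching the counts on the slide.
        for (String line : sortByDecreasingCount(countMatches(input, "dunster|adams"))) {
            System.out.println(line);
        }
    }
}
```

The second step mirrors why the deck chains two jobs: counting groups records by word, while ordering by frequency needs the count to become the sort key.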
  • 48. Job 1 of 2
      JobConf grepJob = new JobConf(getConf(), Grep.class);
      try {
        grepJob.setJobName("grep-search");
        FileInputFormat.setInputPaths(grepJob, args[0]);
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.set("mapred.mapper.regex", args[2]);
        if (args.length == 4)
          grepJob.set("mapred.mapper.regex.group", args[3]);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.setOutputFormat(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);
        JobClient.runJob(grepJob);
      } ...
  • 49. Job 2 of 2 (implicit identity reducer)
        JobConf sortJob = new JobConf(Grep.class);
        sortJob.setJobName("grep-sort");
        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);
        sortJob.setMapperClass(InverseMapper.class);
        // write a single file
        sortJob.setNumReduceTasks(1);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        // sort by decreasing freq
        sortJob.setOutputKeyComparatorClass(
            LongWritable.DecreasingComparator.class);
        JobClient.runJob(sortJob);
      } finally {
        FileSystem.get(grepJob).delete(tempDir, true);
      }
      return 0;
      }
  • 50. The types there... (?, Text); (Text, Long); (Text, list(Long)); (Text, Long); (Long, Text)
  • 51. Facebook Data Infrastructure. Facebook’s DWH 2008: Scribe Tier; MySQL Tier; Hadoop Tier; Oracle RAC Servers. il 1, 2009