Apache Hadoop in the Enterprise

Cloudera, Inc.
Amr Awadallah, Founder, CTO, VP of Engineering.
aaa@cloudera.com, twitter: @awadallah

MicroStrategy World – January 2011 – Las Vegas
Unstructured Data Explosion




[Chart series: Complex, Unstructured; Relational]




 • 2,500 exabytes of new information in 2012 with Internet as primary driver
 • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2
   “zettabytes” this year
 Source: IDC White Paper, sponsored by EMC: "As the Economy Contracts, the Digital Universe Expands", May 2009.
                           Copyright © 2011, Cloudera, Inc. All Rights Reserved.                             .   2
Dramatic Changes in Enterprise Data Needs

 Data Explosion
 • Any Type of Data
 • From Many Sources
 • Instrument Everything

 Hard Problems
 • Complex Analysis
 • At Lowest Granularity
 • Data Beats Algorithm


What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and
  processing (open source under the Apache license)

• Core Hadoop has two main components
   • Hadoop Distributed File System (HDFS): self-healing high-bandwidth
     clustered storage
   • MapReduce: fault-tolerant distributed processing


• Key business values
   •   Flexible -> Store any data, run any analysis (Mine First, Govern Later)
   •   Affordable -> Cost per TB at a fraction of traditional options
   •   Broadly adopted -> A large and active ecosystem
   •   Proven at scale -> Several petabyte deployments in production today
   •   Open Source -> No Lock-In, low cost, large developer community.
Cloudera’s Data Operating System (CDH)


[Diagram: CDH component stack: Hue, Hue SDK, Oozie, Pig, Hive, Avro, Flume, Sqoop, HBase, Zookeeper]


•   Open Source – 100% Apache licensed
•   Simplified – Component versions & dependencies managed for you
•   Reliable – Predictable release schedules, Patched with fixes to improve stability
•   Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.
•   Integrated – All components & functions interoperate through standard APIs
•   Supported – Founders, committers, contributors across all projects
Benefit #1: Agility

Schema-on-Write (RDBMS):
•   Schema must be created before data is loaded
•   Explicit load operation has to take place, which transforms data to the database's internal structure
•   New columns must be added explicitly before data for such columns can be loaded into the database
•   Read is Fast
•   Benefits: Standards/Governance

Schema-on-Read (Hadoop):
•   Data is simply copied to the file store, no special transformation is needed
•   A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
•   New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
•   Load is Fast
•   Benefits: Evolving Schemas/Agility
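
The schema-on-read idea can be sketched in a few lines of Python. This is an illustration only, not Hive's SerDe API: the JSON record layout, the `RAW_LINES` sample, and the `read_with_schema` helper are all assumptions made for the example. Raw lines are stored untouched, and the column list is applied only when the data is read.

```python
import json

# Raw records are stored exactly as they arrived; no schema was declared up front.
# (Hypothetical sample data for illustration.)
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "US"}',  # a new field starts flowing
]

def read_with_schema(lines, columns):
    """Parse each raw line at read time and extract only the requested columns."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Fields absent from a record come back as None instead of failing the read.
        rows.append(tuple(record.get(col) for col in columns))
    return rows

# Original "schema": two columns.
v1 = read_with_schema(RAW_LINES, ["user", "clicks"])
# Updated "schema": the new column appears retroactively, with no reload of the data.
v2 = read_with_schema(RAW_LINES, ["user", "clicks", "country"])
print(v1)
print(v2)
```

The stored lines never change between the two reads; only the column list passed to the reader does, which is the agility the slide contrasts with an explicit load step.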

Benefit #2: Data Consolidation

Complex Data:
•   Documents, SharePoint, Web feeds, Sensor data
•   System logs, EMB archives, Online forums, Images/Video

Structured Data (“relational”):
•   CRM, Inventory, Financials, Sales records
•   Logistics, HR records, Data Marts, Web Profiles

A single data system to enable processing across the universe of data types.
Benefit #3: Any Programming Language (Not Only SQL)
1. Java MapReduce: Gives the most flexibility and performance,
   but potentially long development cycle (the “assembly
   language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
   any programming language of your choice, but slightly lower
   performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce and
   lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data
   flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore
   mapping files to their schemas and associated SerDe.
6. Oozie: A PDL XML workflow server engine that enables creating
   a workflow of jobs composed of any of the above.
Benefit #4: Balancing Return on Investment (or Byte!)
    • Return on Byte = value to be extracted from that byte
    divided by the cost of storing that byte

    • If ROB is < 1 then it will be buried into tape wasteland,
    thus we need more economical active storage.
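
The Return on Byte ratio defined above can be written out as a small Python sketch. The function name and the dollar figures are hypothetical, chosen only to show how cheaper storage moves the same data from ROB < 1 to ROB > 1.

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data divided by the cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost

# Hypothetical figures: the same data set is worth 0.50 (in any currency unit).
value = 0.50
rob_expensive = return_on_byte(value, storage_cost=2.00)  # costly active storage
rob_cheap = return_on_byte(value, storage_cost=0.25)      # more economical storage
print(rob_expensive, rob_cheap)
```

With these assumed numbers the data would be archived on the expensive tier (ROB below 1) but is worth keeping active on the cheaper one (ROB above 1).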


[Figure labels: High ROB; Low ROB]



Use The Right Tool For The Right Job

Relational Databases, use when:
•   Interactive OLAP Analytics (<1sec)
•   Multistep ACID Transactions
•   SQL Compliance

Hadoop, use when:
•   Structured or Not (Agility)
•   Scalable Storage/Compute
•   Complex Data Processing
Where Does Hadoop Fit in the Enterprise Data Stack?

[Diagram: enterprise data stack.
 People and tools: Data Scientists (IDEs); Analysts (BI, Analytics); Business Users (Enterprise Reporting); System Administrators (Cloudera Mgmt Apps); Data Architects; Users (Web Application).
 Systems: Enterprise Data Warehouse; Low-Latency Serving Systems.
 Data sources: Logs; Files; Web Data; Relational Databases.]
Apache Hive Features

• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly.
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, MicroStrategy
  compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
Broad Adoption in Key Verticals
Financial Services
•   Example application, Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”
•   Stakeholders: Risk Analysts

Telecom
•   Example application, BSS: “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”
•   Stakeholders: Research

Retail
•   Example application, Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”
•   Stakeholders: Insight Team

Government
•   Example application, Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”
•   Stakeholders: Intelligence

Stakeholders across all verticals: IT: Operations; IT: Data Engineering



Customers




How are Customers Using Cloudera?
Answering Questions that Were Impossible to Ask Before
•   Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates
•   Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention
•   Model site visitor behavior with analytics that deliver better recommendations for new purchases
•   Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements
•   Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster
•   Correlate educational outcomes with programs and student histories to improve results
•   Big Bank: Examine customer behavior to improve loan risk scoring
     More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
Cloudera Offerings
Facilitating enterprise adoption of Hadoop




            Software                    Services                          Training




Cloudera Enterprise
Enterprise Support and Management Applications




 •   Improves conformance to important IT SLAs, policies and procedures
 •   Lowers the cost of management and administration
 •   Increases reliability and consistency of the platform
 •   Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems
Integrating with Existing IT Infrastructure

   BI/Analytics   ETL                     RDBMS                   Cloud/OS      Hardware




MicroStrategy (for interactive Dashboards)




Informatica (for Extract-Transform-Load, aka ETL)




Summary

• Cloudera’s Data OS (CDH) enables:
   •   Data Agility (Evolving Schemas)
   •   Consolidation (Structured or Not)
   •   Complex Data Processing (Any Language)
   •   Economical Storage (Enable Return-on-Byte > 1)
• Cloudera Enterprise enables:
   •   Conformance to important IT SLAs, policies and procedures
   •   Lower cost of management and administration
   •   Increased reliability and consistency
   •   Certified integration with existing IT infrastructure

Contact Information and Free Hadoop Book



         Amr Awadallah
       CTO, Cloudera, Inc.
       aaa@cloudera.com
          650-644-3921
     twitter.com/awadallah
      twitter.com/cloudera




Appendix



Cloudera Overview
Hadoop…              Jeff Hammerbacher, Chief Scientist
… meets enterprise   Amr Awadallah, CTO, VP Engineering
                     Doug Cutting, Chief Architect
                     Mike Olson, CEO
                     Omer Trajman, VP, Customer Solutions
                     John Kreisa, VP, Marketing
                     Charles Zedlewski, VP, Product Management
                     Ed Albanese, Head of Business Development

Investors            Accel Partners, Greylock Partners, Meritech Capital Partners

Product category     Data Management

Business model       Cloudera offers Software, Support, Training, and Professional Services

Employees            70+

Customers            75+

Headquarters         Palo Alto, California

Elevator pitch       The leading provider of Apache Hadoop-based software and services for the enterprise
Vision               We enable organizations to profit from all of their data



Why CDH (Cloudera Distribution for Hadoop)?
 Features                       Benefits
 It’s packaged                  Much easier for users to install CDH than any other form
                                of Hadoop.
 It’s patched                   This makes CDH more stable and secure than just
                                downloading an Apache branch
 It’s proven                    Thousands of organizations already use CDH today so risk
                                is lower
 It’s highly functional         CDH will cover more use cases and users will be more
                                productive than if they were just using core Hadoop.
 It’s integrated                Save time (of piecing a system together yourself) and
                                lower risk (of choosing the wrong combination of
                                versions or patches)
 It’s the accepted standard     More of your preexisting investments in RDBMS, ETL and
                                BI work best with CDH
 It’s supported                 CDH is one of only two distributions that have a
                                commercial entity standing behind them
 It’s 100% Apache licensed      Investment in this technology is insured.
Hadoop Timeline

Events along the 2002 to 2009 axis, in approximate chronological order:

•   Doug Cutting & Mike Cafarella started working on Nutch
•   Google publishes GFS & MapReduce papers
•   Cutting adds DFS & MapReduce support to Nutch
•   Yahoo! hires Cutting, Hadoop spins out of Nutch
•   NY Times converts 4TB of image archives over 100 EC2s
•   Web-scale deployments at Y!, Facebook, Last.fm
•   Fastest sort of a TB, 3.5mins over 910 nodes
•   Cloudera Founded
•   Fastest sort of a TB, 62secs over 1,460 nodes
•   Sorted a PB in 16.25hours over 3,658 nodes
•   Hadoop Summit 2009, 750 attendees
•   Cloudera hires Cutting


10 Common Hadoop-able Problems


 1. Modeling true risk
 2. Customer churn analysis
 3. Recommendation engine
 4. Ad targeting
 5. PoS transaction analysis
 6. Analyzing network data to predict failure
 7. Threat analysis
 8. Trade surveillance
 9. Search quality
 10. Data “sandbox”


Case Studies: Hadoop World 2009

 •VISA: Large Scale Transaction Analysis
 •JP Morgan Chase: Data Processing for Financial Services
 •China Mobile: Data Mining Platform for Telecom Industry
 •Rackspace: Cross Data Center Log Processing
 •Booz Allen Hamilton: Protein Alignment using Hadoop
 •eHarmony: Matchmaking in the Hadoop Cloud
 •General Sentiment: Understanding Natural Language
 •Yahoo!: Social Graph Analysis
 •Visible Technologies: Real-Time Business Intelligence

  Slides and Videos: http://www.cloudera.com/hadoop-world-nyc


Case Studies: Hadoop World 2010

  •eBay: Hadoop at eBay
  •Twitter: The Hadoop Ecosystem at Twitter
  •Yale University: MapReduce and Parallel Database Systems
  •General Electric: Sentiment Analysis powered by Hadoop
  •Facebook: HBase in Production
  •AOL: AOL’s Data Layer
  •Raytheon: SHARD: Storing and Querying Large-Scale Data
  •StumbleUpon: Mixing Real-Time and Batch Processing

   More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/



Hadoop Design Axioms



         1. System Shall Manage and Heal Itself

         2. Performance Shall Scale Linearly

         3. Compute Should Move to Data

         4. Simple Core, Modular and Extensible




HDFS: Hadoop Distributed File System
•   Block Size = 64MB
•   Replication Factor = 3
•   Cost/GB is a few ¢/month vs $/month
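
To make the block size and replication factor concrete, here is a rough Python sketch of the storage footprint of one file. The `hdfs_footprint` helper and the 1,000 MB file are assumptions for illustration; the calculation ignores metadata and any per-file overrides of the two settings.

```python
import math

BLOCK_SIZE_MB = 64       # block size stated on the slide
REPLICATION_FACTOR = 3   # replication factor stated on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The last block only occupies the bytes actually written,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb

# Hypothetical 1,000 MB file.
blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Tripling the raw storage is the price of self-healing: any single block replica can be lost and re-created from the other copies.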
MapReduce: Distributed Processing




MapReduce Example for Word Count
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Diagram: Splits 1..N of (docid, text), e.g. “To Be Or Not To Be?”, feed Map 1..M, which emit (words, counts) such as (Be, 5), (Be, 12), (Be, 7), (Be, 6). The Shuffle step delivers (sorted words, counts) to Reduce 1..R, which write (sorted words, sum of counts), e.g. (Be, 30), to Output Files 1..R.]

 Map(in_key, in_value) => list of (out_key, intermediate_value)
 Reduce(out_key, list of intermediate_values) => out_value(s)
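
The Map and Reduce signatures above can be imitated in a single Python process. This is a sketch of the data flow only, not Hadoop's API: `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical names, and a real job would run many map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    for word in text.lower().split():
        word = word.strip('.,?!"')
        if word:
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key and sort the keys, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

def word_count(docs):
    """Run map over every document, shuffle, then reduce each key."""
    intermediate = [pair
                    for doc_id, text in docs.items()
                    for pair in map_fn(doc_id, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]

result = word_count({"doc1": "To Be Or Not To Be?"})
print(result)
```

The shell pipeline on the slide has the same shape: the mapper emits pairs, `sort` plays the role of the shuffle, and the reducer sums each run of identical keys.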

Hadoop High-Level Architecture
•   Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
•   Name Node: Maintains mapping of file blocks to data node slaves
•   Job Tracker: Schedules jobs across task tracker slaves
•   Data Node: Stores and serves blocks of data
•   Task Tracker: Runs tasks (work units) within a job
•   Data Node and Task Tracker share a physical node



Hive vs Pig Example (count distinct values > 0)

• Hive syntax:
  SELECT COUNT(DISTINCT col1)
  FROM mytable
  WHERE col1 > 0;

• Pig syntax:
  mytable = LOAD 'myfile' AS (col1, col2, col3);
  mytable = FOREACH mytable GENERATE col1;
  mytable = FILTER mytable BY col1 > 0;
  mytable = DISTINCT mytable;
  grouped = GROUP mytable ALL;
  result = FOREACH grouped GENERATE COUNT(mytable);
  DUMP result;
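
For readers unfamiliar with either language, the quantity both queries compute can be stated as a short Python sketch. The sample `rows` and the `count_distinct_positive` helper are hypothetical; missing values are skipped, mirroring how SQL ignores NULLs in such a comparison.

```python
def count_distinct_positive(rows):
    """Count distinct values of the first column that are greater than 0."""
    return len({row[0] for row in rows if row[0] is not None and row[0] > 0})

# Hypothetical rows of (col1, col2, col3).
rows = [(3, "a", "x"), (0, "b", "y"), (3, "c", "z"), (7, "d", "w"), (None, "e", "v")]
count = count_distinct_positive(rows)  # distinct positive values are 3 and 7
print(count)
```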

Hive Agile Data Types

• STRUCTS:
   • SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
   • SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
   • SELECT mytable.mycolumn[5] FROM …
• JSON:
   • SELECT get_json_object(mycolumn, objpath) FROM …




                     Copyright © 2011, Cloudera, Inc. All Rights Reserved.   37
Copyright © 2011, Cloudera, Inc. All Rights Reserved.   38

Contenu connexe

Tendances

(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...
(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...
(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...Hiram Fleitas León
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]shuwutong
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
How to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostHow to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostAtScale
 
Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteMark van Rijmenam
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalCaserta
 
Microsoft Power BI: AI Powered Analytics
Microsoft Power BI: AI Powered AnalyticsMicrosoft Power BI: AI Powered Analytics
Microsoft Power BI: AI Powered AnalyticsJuan Alvarado
 
Better Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraBetter Together: The New Data Management Orchestra
Better Together: The New Data Management OrchestraCloudera, Inc.
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeVasu S
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for DinnerKent Graziano
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake PracticeSamanthaSwain7
 

Tendances (20)

(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...
(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...
(BI Advanced) Hiram Fleitas - SQL Server Machine Learning Predict Sentiment O...
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 
Data Lakehouse Symposium | Day 4
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011

  • 1. Apache Hadoop in the Enterprise Cloudera, Inc. Amr Awadallah, Founder, CTO, VP of Engineering. aaa@cloudera.com, twitter: @awadallah Microstrategy World – January 2011 – Las Vegas
  • 2. Unstructured Data Explosion (chart legend: Complex, Unstructured; Relational) • 2,500 exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year. Source: IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009. Copyright © 2011, Cloudera, Inc. All Rights Reserved.
  • 3. Dramatic Changes in Enterprise Data Needs
      Data Explosion: • Any Type of Data • From Many Sources • Instrument Everything
      Hard Problems: • Complex Analysis • At Lowest Granularity • Data Beats Algorithm
  • 4. What is Hadoop? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license) • Core Hadoop has two main components • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage • MapReduce: fault-tolerant distributed processing • Key business values • Flexible -> Store any data, run any analysis (Mine First, Govern Later) • Affordable -> Cost per TB at a fraction of traditional options • Broadly adopted -> A large and active ecosystem • Proven at scale -> Several petabyte deployments in production today • Open Source -> No Lock-In, low cost, large developer community.
  • 5. Cloudera’s Data Operating System (CDH) (stack diagram; components shown: Hue, Hue SDK, Oozie, Hive, Pig, Avro, Flume, Sqoop, HBase, Zookeeper) • Open Source – 100% Apache licensed • Simplified – Component versions & dependencies managed for you • Reliable – Predictable release schedules, patched with fixes to improve stability • Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc. • Integrated – All components & functions interoperate through standard APIs • Supported – Founders, committers, contributors across all projects
  • 6. Benefit #1: Agility
      Schema-on-Write (RDBMS):
        • Schema must be created before data is loaded
        • Explicit load operation has to take place which transforms data to database internal structure
        • New columns must be added explicitly before data for such columns can be loaded into the database
        • Read is Fast
        • Benefits: Standards/Governance
      Schema-on-Read (Hadoop):
        • Data is simply copied to the file store, no special transformation is needed
        • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
        • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
        • Load is Fast
        • Benefits: Evolving Schemas/Agility
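The schema-on-read idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON record layout, the field names, and the functions `serde_v1`, `serde_v2`, and `read_table` are hypothetical and are not part of Hive or Hadoop. Raw lines are stored untouched, and the parser (standing in for a SerDe) is applied only when the data is read.

```python
import json

# Raw records are copied to storage as-is; nothing is validated at load time.
# (Hypothetical sample data, not taken from the slides.)
RAW_LINES = [
    '{"user": "alice", "bytes": 120}',
    '{"user": "bob", "bytes": 300, "country": "EG"}',
]


def serde_v1(line):
    """Hypothetical first parser: exposes only two columns."""
    record = json.loads(line)
    return {"user": record.get("user"), "bytes": record.get("bytes")}


def serde_v2(line):
    """Hypothetical updated parser: also exposes 'country'."""
    record = json.loads(line)
    return {
        "user": record.get("user"),
        "bytes": record.get("bytes"),
        "country": record.get("country"),
    }


def read_table(lines, serde):
    """Apply the parser at read time; the stored lines never change."""
    return [serde(line) for line in lines]


if __name__ == "__main__":
    print(read_table(RAW_LINES, serde_v1))
    # Swapping in the updated parser surfaces the new column for data
    # that was already stored, without reloading anything.
    print(read_table(RAW_LINES, serde_v2))
```

Because only the parser changes, the `country` field that was already sitting in the stored lines becomes visible retroactively, which is the agility argument the slide makes.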
  • 7. Benefit #2: Data Consolidation
      Complex Data: Documents, SharePoint, Web feeds, Sensor data, System logs, EMB archives, Online forums, Images/Video
      Structured Data (“relational”): CRM, Inventory, Financials, Sales records, Logistics, HR records, Data Marts, Web Profiles
      A single data system to enable processing across the universe of data types.
  • 8. Benefit #3: Any Programming Language (Not Only SQL)
      1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
      2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.
      3. Cascading: A thin Java library that sits on top of MapReduce; it lets developers assemble complex processes.
      4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
      5. Hive: A SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDe.
      6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
  • 9. Benefit #4: Balancing Return on Investment (or Byte!) • Return on Byte = value to be extracted from that byte divided by the cost of storing that byte • If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage. (Chart labels: High ROB, Low ROB.)
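The Return-on-Byte ratio defined on this slide is simple arithmetic, and a small sketch makes the threshold at 1 concrete. The function names and the dollar figures below are hypothetical and chosen only for illustration; the slides give no concrete values.

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data / cost of storing that data."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def worth_keeping_online(value_extracted, storage_cost):
    """Per the slide, data with ROB < 1 ends up archived to tape."""
    return return_on_byte(value_extracted, storage_cost) >= 1.0


if __name__ == "__main__":
    value = 40.0           # hypothetical value of a data set, in dollars
    expensive_cost = 100.0  # hypothetical cost on costly storage
    cheap_cost = 10.0       # hypothetical cost on cheaper storage
    print(worth_keeping_online(value, expensive_cost))  # ROB is 0.4
    print(worth_keeping_online(value, cheap_cost))      # ROB is 4.0
```

The same data set flips from "archive" to "keep online" when only the storage cost drops, which is the case the slide makes for more economical active storage.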
  • 10. Use The Right Tool For The Right Job
      Relational Databases, use when: • Interactive OLAP Analytics (<1sec) • Multistep ACID Transactions • SQL Compliance
      Hadoop, use when: • Structured or Not (Agility) • Scalable Storage/Compute • Complex Data Processing
  • 11. Where Does Hadoop Fit in the Enterprise Data Stack? (Architecture diagram; labels as extracted: Data Scientists Analysts Business Users Enterprise IDEs BI, Analytics Reporting System Administrators Cloudera Mgmt Apps Enterprise Data Warehouse Data Users Architects Low-Latency Web Serving Application Relational Systems Logs Files Web Data Databases)
  • 12. Apache Hive Features • A subset of SQL covering the most common statements • JDBC/ODBC support • Agile data types: Array, Map, Struct, and JSON objects • Pluggable SerDe system to work on unstructured files directly. • User Defined Functions and Aggregates • Regular Expression support • MapReduce support • Partitions and Buckets (for performance optimization) • In The Works: Indices, Columnar Storage, Views, Microstrategy compatibility, Explode/Collect • More details: http://wiki.apache.org/hadoop/Hive
  • 13. Broad Adoption in Key Verticals
      Example Applications:
        • Financial Services, Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”
        • Telecom, BSS: “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”
        • Retail, Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”
        • Government, Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”
      Stakeholders (as extracted; column alignment lost): IT: Operations, IT: Data Engineering, Risk Analysts, Research Insight Team, Intelligence
  • 14. Customers
  • 15. How are Customers Using Cloudera? Answering Questions that Were Impossible to Ask Before
      • Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates
      • Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention
      • Model site visitor behavior with analytics that deliver better recommendations for new purchases
      • Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements
      • Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster
      • Correlate educational outcomes with programs and student histories to improve results
      • Big Bank: Examine customer behavior to improve loan risk scoring
      More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • 16. Cloudera Offerings: Facilitating enterprise adoption of Hadoop • Software • Services • Training
  • 17. Cloudera Enterprise: Enterprise Support and Management Applications • Improves conformance to important IT SLAs, policies and procedures • Lowers the cost of management and administration • Increases reliability and consistency of the platform • Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems
  • 18. Integrating with Existing IT Infrastructure: BI/Analytics, ETL, RDBMS, Cloud/OS, Hardware
  • 19. MicroStrategy (for interactive Dashboards)
  • 20. Informatica (for Extract-Transform-Load, aka ETL)
  • 21. Summary
      Cloudera’s Data OS (CDH) enables: • Data Agility (Evolving Schemas) • Consolidation (Structured or Not) • Complex Data Processing (Any Language) • Economical Storage (Enable Return-on-Byte > 1)
      Cloudera Enterprise enables: • Conformance to important IT SLAs, policies and procedures • Lower cost of management and administration • Increased reliability and consistency • Certified integration with existing IT infrastructure
  • 22. Contact Information and Free Hadoop Book: Amr Awadallah, CTO, Cloudera, Inc. • aaa@cloudera.com • 650-644-3921 • twitter.com/awadallah • twitter.com/cloudera
  • 24. Appendix
  • 25. Cloudera Overview: Hadoop… meets enterprise
      Team: Jeff Hammerbacher, Chief Scientist • Amr Awadallah, CTO, VP Engineering • Doug Cutting, Chief Architect • Mike Olson, CEO • Omer Trajman, VP, Customer Solutions • John Kreisa, VP, Marketing • Charles Zedlewski, VP, Product Management • Ed Albanese, Head of Business Development
      Investors: Accel Partners, Greylock Partners, Meritech Capital Partners
      Product category: Data Management
      Business model: Cloudera offers Software, Support, Training, and Professional Services
      Employees: 70+
      Customers: 75+
      Headquarters: Palo Alto, California
      Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
      Vision: We enable organizations to profit from all of their data
  • 26. Why CDH (Cloudera Distribution for Hadoop)? (Features and their Benefits)
      • It’s packaged: Much easier for users to install CDH than any other form of Hadoop.
      • It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.
      • It’s proven: Thousands of organizations already use CDH today so risk is lower.
      • It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.
      • It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).
      • It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.
      • It’s supported: CDH is one of only two distributions that has a commercial entity standing behind it.
      • It’s 100% Apache licensed: Investment in this technology is insured.
  • 27. Hadoop Timeline (axis: 2002 to 2009; events as labeled on the slide, year alignment lost in extraction)
      • Doug Cutting & Mike Cafarella started working on Nutch
      • Google publishes GFS & MapReduce papers
      • Cutting adds DFS & MapReduce support to Nutch
      • Yahoo! hires Cutting, Hadoop spins out of Nutch
      • NY Times converts 4TB of image archives over 100 EC2s
      • Web-scale deployments at Y!, Facebook, Last.fm
      • Fastest sort of a TB, 3.5mins over 910 nodes
      • Cloudera Founded
      • Cloudera hires Cutting
      • Fastest sort of a TB, 62secs over 1,460 nodes
      • Sorted a PB in 16.25hours over 3,658 nodes
      • Hadoop Summit 2009, 750 attendees
  • 28. 10 Common Hadoop-able Problems
      1. Modeling true risk
      2. Customer churn analysis
      3. Recommendation engine
      4. Ad targeting
      5. PoS transaction analysis
      6. Analyzing network data to predict failure
      7. Threat analysis
      8. Trade surveillance
      9. Search quality
      10. Data “sandbox”
  • 29. Case Studies: Hadoop World 2009 • VISA: Large Scale Transaction Analysis • JP Morgan Chase: Data Processing for Financial Services • China Mobile: Data Mining Platform for Telecom Industry • Rackspace: Cross Data Center Log Processing • Booz Allen Hamilton: Protein Alignment using Hadoop • eHarmony: Matchmaking in the Hadoop Cloud • General Sentiment: Understanding Natural Language • Yahoo!: Social Graph Analysis • Visible Technologies: Real-Time Business Intelligence • Slides and Videos: http://www.cloudera.com/hadoop-world-nyc
  • 30. Case Studies: Hadoop World 2010 • eBay: Hadoop at eBay • Twitter: The Hadoop Ecosystem at Twitter • Yale University: MapReduce and Parallel Database Systems • General Electric: Sentiment Analysis powered by Hadoop • Facebook: HBase in Production • AOL: AOL’s Data Layer • Raytheon: SHARD: Storing and Querying Large-Scale Data • StumbleUpon: Mixing Real-Time and Batch Processing • More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • 31. Hadoop Design Axioms 1. System Shall Manage and Heal Itself 2. Performance Shall Scale Linearly 3. Compute Should Move to Data 4. Simple Core, Modular and Extensible
  • 32. HDFS: Hadoop Distributed File System • Block Size = 64MB • Replication Factor = 3 • Cost/GB is a few ¢/month vs $/month
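To make the two numbers on this slide concrete, the sketch below estimates how many blocks and block replicas a file occupies under a 64MB block size and a replication factor of 3. The function name `hdfs_footprint` and the 1000MB example file are hypothetical; the calculation assumes the last block is not padded, so raw storage is simply file size times the replication factor.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The final block only takes as much space as it holds, so raw usage
    # is the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000)
    print(blocks, replicas, raw_mb)  # 16 blocks, 48 replicas, 3000 MB raw
```

The threefold raw-storage cost is the trade-off behind the slide's cost comparison: replication only pays off when the cost per GB of the underlying storage is low.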
  • 33. MapReduce: Distributed Processing
  • 34. MapReduce Example for Word Count
      cat *.txt | mapper.pl | sort | reducer.pl > out.txt
      (Data-flow diagram: Split 1..N (docid, text) -> Map 1..M (words, counts) -> Shuffle (sorted words, counts) -> Reduce 1..R (sorted words, sum of counts) -> Output File 1..R. Example input: “To Be Or Not To Be?”. Example for the key “Be”: map-side counts 5, 12, 7 and 6 are reduced to “Be, 30”.)
      Map(in_key, in_value) => list of (out_key, intermediate_value)
      Reduce(out_key, list of intermediate_values) => out_value(s)
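The Map and Reduce signatures on this slide can be imitated in a single Python process. This is a sketch of the programming model only, not of Hadoop itself: the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, the tokenizing rule (lowercase, letters and apostrophes) is an assumption, and everything runs in memory rather than across a cluster.

```python
import re
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key and sort the keys, like `sort`."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    intermediate = [
        pair for doc_id, text in docs.items() for pair in map_fn(doc_id, text)
    ]
    return dict(reduce_fn(word, counts) for word, counts in shuffle(intermediate))


if __name__ == "__main__":
    print(word_count({"doc1": "To Be Or Not To Be?"}))
```

In a real cluster the map calls run on many nodes, the shuffle moves data over the network, and each reducer handles a subset of the keys; the three functions here keep the same contracts in miniature.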
  • 35. Hadoop High-Level Architecture
      • Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
      • Name Node: Maintains mapping of file blocks to data node slaves
      • Job Tracker: Schedules jobs across task tracker slaves
      • Data Node: Stores and serves blocks of data
      • Task Tracker: Runs tasks (work units) within a job
      • Data Node and Task Tracker share a physical node
  • 36. Hive vs Pig Example (count distinct values > 0)
      • Hive syntax: SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0;
      • Pig syntax:
          mytable = LOAD 'myfile' AS (col1, col2, col3);
          mytable = FOREACH mytable GENERATE col1;
          mytable = FILTER mytable BY col1 > 0;
          mytable = DISTINCT mytable;
          mytable = GROUP mytable ALL;
          mytable = FOREACH mytable GENERATE COUNT(mytable);
          DUMP mytable;
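The query on this slide (count the distinct values of col1 that are greater than 0) can be traced step by step in plain Python, which may help readers see why the Pig version needs several statements where Hive needs one. The sample rows and the function name `count_distinct_positive` are hypothetical; the comments map each step to the dataflow stage it loosely corresponds to.

```python
# Hypothetical rows with three columns (col1, col2, col3).
ROWS = [
    (3, "a", "x"),
    (0, "b", "y"),
    (3, "c", "z"),
    (7, "d", "w"),
    (-1, "e", "v"),
    (None, "f", "u"),
]


def count_distinct_positive(rows):
    """Count distinct col1 values greater than 0."""
    col1 = (row[0] for row in rows)                           # project col1
    positive = (v for v in col1 if v is not None and v > 0)   # filter > 0
    distinct = set(positive)                                  # de-duplicate
    return len(distinct)                                      # single count


if __name__ == "__main__":
    print(count_distinct_positive(ROWS))  # 3 and 7 remain, so the count is 2
```

The declarative Hive query states the result wanted, while the Pig script and this sketch spell out the pipeline that produces it.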
  • 37. Hive Agile Data Types
      • STRUCTS: SELECT mytable.mycolumn.myfield FROM …
      • MAPS (Hashes): SELECT mytable.mycolumn[mykey] FROM …
      • ARRAYS: SELECT mytable.mycolumn[5] FROM …
      • JSON: SELECT get_json_object(mycolumn, objpath)