© Hortonworks Inc. 2013. Confidential and Proprietary.

Slide 1: Hadoop in London
July 9, 2013
Herb Cunitz, Hortonworks President (@hcunitz)
Slide 2: Why is Hadoop Important?
We believe that more than half the world's data will be processed by Apache Hadoop.

Slide 3:
"By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent."
– Gartner, Mark Beyer, "Information Management in the 21st Century"
Slide 4: Traditional Data Architecture, Pressured
[Diagram: traditional sources (RDBMS, OLTP, OLAP) and OLTP/POS systems feed traditional repositories (RDBMS, EDW, MPP), which serve business analytics, custom applications, and packaged applications; operational tools (manage & monitor) and dev & data tools (build & test) support the stack. New sources (sentiment, clickstream, geo, sensor, ...) put pressure on this architecture.]
Slide 5: Traditional Data Architecture, Pressured (Source: IDC)
- 2.8 ZB of data created and replicated in 2012; 40 ZB projected by 2020
- 85% of it from new data types
- Machine data expected to grow 15x by 2020
Slide 6: Modern Data Architecture Enabled
[Diagram: the same architecture with an Enterprise Hadoop Platform added alongside the traditional repositories (RDBMS, EDW, MPP). Both traditional sources (RDBMS, OLTP, OLAP) and new sources (sentiment, clickstream, geo, sensor, ...) feed the data systems, which serve business analytics, custom applications, and packaged applications, with operational tools (manage & monitor) and dev & data tools (build & test) spanning both.]
Slide 7: Agile "Data Lake" Solution Architecture
Four steps: (1) capture all data, (2) process & structure, (3) distribute results, (4) feedback & retain.
[Diagram: new sources (logs & text, sentiment, structured DB, clickstream, geo & tracking, sensor & machine data) and business transactions & interactions (web, mobile, CRM, ERP, point of sale) land in the Enterprise Hadoop Platform; classic data integration & ETL connects it to business intelligence & analytics (dashboards, reports, visualization, ...).]
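The four steps above can be sketched end to end in a few lines. This is an illustrative Python sketch only: the in-memory `raw_lake`, the pipe-delimited event format, and all function names are invented for the example and stand in for HDFS plus real ingestion and ETL tooling.

```python
# Illustrative sketch of the four "data lake" steps; the lake here is an
# in-memory list standing in for HDFS, and all names are invented.

raw_lake = []        # 1. Capture: land ALL data in one place, schema-free
curated = {}         # 2. Process & structure: schema applied when needed

def capture(event):
    """Step 1: land the raw event untouched -- no schema required yet."""
    raw_lake.append(event)

def process_and_structure():
    """Step 2: parse raw events into a structured, queryable form."""
    for line in raw_lake:
        source, _, payload = line.partition("|")
        curated.setdefault(source, []).append(payload)

def distribute(source):
    """Step 3: hand a curated subset to BI / analytics consumers."""
    return curated.get(source, [])

def feedback(results):
    """Step 4: retain derived results back in the lake for reuse."""
    for r in results:
        capture("derived|" + r)

# Usage: clickstream and sensor events flow through all four steps.
capture("clickstream|/home -> /pricing")
capture("sensor|temp=71F")
process_and_structure()
hits = distribute("clickstream")
feedback(hits)
print(hits)            # structured clickstream payloads
print(len(raw_lake))   # raw events plus the retained derived results
```

Note that structure is applied in step 2, after landing, which is the "schema on read, not as a prerequisite" point the speaker notes make about the data lake.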
Slide 8: Key Requirement of a "Data Lake"
Store ALL data in one place... and interact with that data in MULTIPLE ways: batch, interactive, streaming, graph, in-memory, HPC/MPI, online, and others, all on HDFS (redundant, reliable storage).
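HDFS earns the "redundant, reliable" label through block replication: files are split into blocks and each block is copied to several nodes, so one storage layer can survive failures while serving many engines. A minimal Python sketch of the placement idea follows; the round-robin chooser and node names are invented for illustration and are not the real HDFS placement policy (which is rack-aware).

```python
# Minimal sketch of HDFS-style block replication (NOT the real placement
# policy): every block is stored on `replication` distinct nodes, so the
# loss of any one node leaves each block readable elsewhere.

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_0", "blk_1"], nodes)
print(placement["blk_0"])   # three distinct nodes hold this block

# Any single node can fail and every block still has surviving replicas.
for block, holders in placement.items():
    assert len(set(holders)) == 3
    for failed in nodes:
        assert any(h != failed for h in holders)
```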
Slide 9: Applications Run Natively IN Hadoop
YARN takes Hadoop beyond batch: applications run "IN" Hadoop versus "ON" Hadoop, with predictable performance and quality of service.
Workloads and engines: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm), GRAPH (Giraph), IN-MEMORY (Spark), HPC/MPI (OpenMPI), ONLINE (HBase), OTHER (e.g. Search)
Stack: YARN (cluster resource management) on HDFS2 (redundant, reliable storage)
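The shift from "ON" to "IN" is that YARN, not each framework, owns the cluster's resources: every engine asks a central resource manager for containers out of one shared pool, which is what lets batch, streaming, and online workloads coexist with predictable quality of service. A toy Python sketch of that idea (invented class and method names, vastly simpler than the real YARN API):

```python
# Toy sketch of YARN-style resource management (invented API): many
# engines share one pool of cluster resources, and each runs inside
# containers granted by a central manager.

class ResourceManager:
    def __init__(self, total_memory_gb):
        self.free = total_memory_gb
        self.running = {}          # app name -> granted memory (GB)

    def allocate(self, app, memory_gb):
        """Grant a container if capacity remains, else refuse."""
        if memory_gb > self.free:
            return False           # app must wait; others keep their QoS
        self.free -= memory_gb
        self.running[app] = self.running.get(app, 0) + memory_gb
        return True

    def release(self, app):
        """App finished: return its containers to the shared pool."""
        self.free += self.running.pop(app, 0)

rm = ResourceManager(total_memory_gb=100)
rm.allocate("mapreduce-batch", 40)        # batch job
rm.allocate("storm-streaming", 30)        # streaming topology, same cluster
rm.allocate("hbase-online", 20)           # online serving, same cluster
print(rm.allocate("spark-inmemory", 20))  # refused: only 10 GB free
rm.release("mapreduce-batch")
print(rm.allocate("spark-inmemory", 20))  # granted once capacity returns
```

The design point the slide makes is exactly this separation: engines stop scheduling for themselves, so mixed workloads stop stepping on each other.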
© Hortonworks Inc. 2013. Confidential and Proprietary.
2.0 Architected for the
Broad Enterprise
Hadoop 2.0 Key Highlights
Rolling Upgrades
Disaster Recovery
Snapshots
Full Stack HA
Hive on Tez
YARN
HDP 2.0 Features
Single Cluster,
Many Workloads
BATCH
INTERACTIVE
ONLINE
STREAMING
ZERO downtime
Multi Data Center
Point in time Recovery
Reliability
Interactive Query
Mixed workloads
Enterprise Requirements
Slide 11: Making Hadoop Enterprise Ready
The Enterprise Hadoop Platform (deployable on OS/VM, cloud, or appliance) is layered:
- PLATFORM SERVICES: enterprise readiness (high availability, disaster recovery, security, and snapshots)
- OPERATIONAL SERVICES: manage and operate at scale
- DATA SERVICES: store, process, and access data
- CORE: distributed storage and processing
Slide 12: SQL-IN-Hadoop with Apache Hive
Stinger Initiative focus areas: make Hive 100x faster and make Hive SQL compliant.
[Diagram: business analytics and custom apps issue SQL to Hive, which executes via MapReduce or Tez on YARN over HDFS2.]
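Why does Hive sit on MapReduce (and, with Stinger, on Tez)? Because a SQL aggregate decomposes naturally into map, shuffle, and reduce phases. A hand-rolled Python sketch of `SELECT page, COUNT(*) FROM logs GROUP BY page` makes the mapping concrete; this is illustrative only, since Hive generates far more sophisticated plans than this.

```python
# Hand-rolled illustration of how a SQL aggregate maps onto
# map/shuffle/reduce -- the execution model Hive compiles SQL into.
# Query being mimicked: SELECT page, COUNT(*) FROM logs GROUP BY page

logs = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/pricing"},
    {"user": "c", "page": "/home"},
]

# Map: emit a (group key, 1) pair per row
mapped = [(row["page"], 1) for row in logs]

# Shuffle: bring all pairs with the same key together
shuffled = {}
for key, value in mapped:
    shuffled.setdefault(key, []).append(value)

# Reduce: aggregate each key's values (COUNT(*) is just a sum of 1s)
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)   # {'/home': 2, '/pricing': 1}
```

Tez speeds this up not by changing the decomposition but by executing such plans as a single DAG instead of chained MapReduce jobs.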
Slides 13-14: [Forrester Research slides: © 2013 Forrester Research, Inc. Reproduction Prohibited.]
Slide 15: Many Communities Must Work As One
Open source, end users, and vendors must innovate, participate, and integrate together.
Slide 16: Ecosystem Completes the Puzzle
- Applications, business tools, & dev tools
- Data systems
- Infrastructure & systems management
Slide 17: Hadoop Wave ONE: Web-Scale Batch Apps (2006 to 2012)
[Diagram: the "Crossing the Chasm" adoption curve (relative % of customers over time): innovators/technology enthusiasts and early adopters/visionaries, who want technology & performance; then the CHASM; then the early majority/pragmatists, late majority/conservatives, and laggards/skeptics, who want solutions & convenience. Source: Geoffrey Moore, Crossing the Chasm.]
Slide 18: Hadoop Wave TWO: Broad Enterprise Apps (2013 & Beyond)
[Diagram: the same "Crossing the Chasm" adoption curve; the second wave carries batch, interactive, online, streaming, and other workloads across the chasm. Source: Geoffrey Moore, Crossing the Chasm.]
Slide 19: Hortonworks – We Do Hadoop
Open source community, partner ecosystem, commercial adoption.
Slide 20: Thank You
© Hortonworks Inc. 2013

Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks


Editor's Notes

  1. Where are we? Where does it go from here? What's next? Community: new projects incubated (Falcon, Knox, and more); Hadoop 2 and the YARN-based architecture coming in for landing (beta vote); certification of YARN-based applications, which Hortonworks just announced. Ecosystem: VCs invested $1.4M in Big Data companies in 2012, and 2013 is even bigger (now huge investment into tools for accessing data in Hadoop, indicating "it has arrived"); virtually every provider that touches data in any shape has brought Hadoop in; job postings. Commercial adoption: projects going live at scale; amazing commercial use cases that you will hear more about, e.g. Cardinal Health and Home Depot, phenomenal examples of application to healthcare.
  2. To frame up my talk, I chose this quote from Mark Beyer of Gartner: "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent." Whether it's opening up new business opportunities or outperforming your competitors by 20% or more, the important point to be made is that big data technologies offer very real and compelling BUSINESS and FINANCIAL value to go along with TECHNOLOGY that is able to do things never before possible.
  3. So let's set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMSs and data warehouses; and a set of custom and packaged applications, as well as business analytics, that leverage the data stored in those systems. This architecture is tuned to handle TRANSACTIONS and data that fits into relational database tables. [CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED by new sources of data that aren't handled well by existing data systems. In the world of Big Data, we've got classic TRANSACTIONS as well as new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS. Interactions come from such things as web logs, user click streams, social interactions & feeds, and user-generated content including video, audio, and images. Observations tend to come from the "Internet of Things": sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATM machines, automobiles, and even farm tractors are just some of the "things" that output observation data.
  4. To get a sense of the scope of these NEW SOURCES of data, let's look at some stats from IDC. [CLICK] According to IDC, 2.8 ZB of data were created and replicated in 2012. A Zettabyte, for those unfamiliar with the term, is 1 BILLION Terabytes. [CLICK] 85% of that is from new sources of data. [CLICK] Out of that 85%, machine-generated data is a key driver of the growth; that one new source of data alone is expected to grow by 15x by 2020. [CLICK] Fast-forward to 2020 and we'll have 40 Zettabytes of data in the digital universe! This represents 50-fold growth from the beginning of 2010. [CLICK] Needless to say, wrestling that scale of data is like this poor guy trying to wrestle a champion sumo athlete: overwhelmed and outmatched, to say the least. Fortunately, your data architecture need not be outmatched.
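The note's figures can be sanity-checked with a line of arithmetic: 2.8 ZB in 2012 growing to 40 ZB in 2020 implies roughly 39% compound annual growth, and 50-fold growth from the beginning of 2010 implies a baseline of about 0.8 ZB. A quick check (the derived rates are mine, not IDC's):

```python
# Sanity-check of the IDC figures quoted in the note above.
zb_2012, zb_2020 = 2.8, 40.0

# Implied compound annual growth rate over the 8 years from 2012 to 2020
cagr = (zb_2020 / zb_2012) ** (1 / 8) - 1
print(round(cagr * 100, 1))   # ~39.4% per year

# "50-fold growth from the beginning of 2010" implies the 2010 baseline
implied_2010 = zb_2020 / 50
print(implied_2010)           # 0.8 ZB
```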
  5. As the volume of data has exploded, Enterprise Hadoop has emerged as a peer to traditional data systems. The momentum for Hadoop is NOT about revolutionary replacement of traditional databases. Rather, it’s about adding a data system uniquely capable of handling big data problems at scale, and doing so in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with every layer of the stack:
- Existing applications and BI tools
- Existing databases and data warehouses, for loading data to / from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring

Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
  6. So I’d like to walk you through a solution architecture focused on how new and existing data sources flow through this modern data architecture. The architecture starts with two major areas of data processing that are very familiar to enterprises:
1. Business Transactions & Interactions
2. Business Intelligence & Analytics

Enterprise IT has been connecting these systems via classic Data Integration and ETL processing for many years in order to deliver STRUCTURED and REPEATABLE business analytics. The business determines the questions to ask, and IT collects and structures the data needed to answer those questions.
[CLICK] As we’ve discussed, new data sources representing Interactions and Observations have come onto the scene, and Enterprise Hadoop has appeared as a new system capable of capturing ALL of this multi-structured data in one place. Hadoop acts as a “Data Lake”, if you will. Some call it a Data Reservoir, a Catch Basin, a Data Refinery, or the foundation for a Data Hub & Spoke architecture. Regardless of name, it’s a place where ALL data can be brought together and then flexibly aggregated and transformed into useful formats that help fuel new insights for the business. Structure and schema are applied when needed, NOT as a prerequisite before landing the data.
[CLICK] The next step is about getting the data into the right format for the people and applications that need it. Some folks will earmark subsets of the Data Lake for data scientists, researchers, or particular departments to interact with.
Tools like Hive and HBase are commonly used for interacting with Hadoop data directly. Others will directly integrate Enterprise Hadoop with Business Intelligence & Analytics solutions so they can obtain a 360° view of their customers and enhance their ability to more accurately understand the customer Interactions that lead to, or inhibit, their Transactions. Still others will run complex analytic models and calculations of key parameters in Hadoop and flow the results into online applications, with the goal of more accurately targeting customers with the best and most relevant offers, for example.
[CLICK] And to achieve a closed-loop analytics system, companies are leveraging Hadoop to cost-effectively retain large volumes of data for long periods of time. Keeping an active archive of the past 10 years of historical retail data enables companies to blend that data with 10 years of weather data so they can analyze the impact of weather on the “Black Friday” selling season, for example. The result? Customers now have an agile data architecture that enables them to maximize the value from ALL of their data: transactions + interactions + observations.
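The “structure and schema applied when needed” idea in these notes can be sketched in a few lines of Python: raw records land as-is, and a schema is applied only when a consumer asks a question. This is a conceptual illustration only; the record layout and field names are made up, and plain Python stands in for tools like Hive and HBase:

```python
# Minimal "schema on read" sketch: raw events are landed untouched,
# and a schema is applied only at query time, per consumer.
raw_events = [
    "2013-07-09|web|click|user42",   # a clickstream record
    "2013-07-09|sensor|98.6|rig7",   # a sensor reading
]

def read_with_schema(line, schema):
    """Apply a schema (an ordered list of field names) to a raw delimited line."""
    return dict(zip(schema, line.split("|")))

# Two consumers apply different schemas to the same landed data.
click = read_with_schema(raw_events[0], ["date", "source", "action", "user"])
reading = read_with_schema(raw_events[1], ["date", "source", "value", "device"])

print(click["action"])   # click
print(reading["value"])  # 98.6
```

The point of the sketch is the ordering: landing the data costs nothing up front, and each consumer decides later which structure to impose.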
  7. So as mainstream enterprises begin to store ALL of their data in one place, they will increasingly want to create applications that interact with that data in a wide variety of ways. While classic batch-oriented MapReduce is powerful, it’s just one of many application types people need.
[CLICK] Interactive SQL solutions running on or next to Hadoop have gotten lots of press over recent months. Online data systems that store their data in HDFS are on the rise, as are Streaming and Complex Event Processing solutions and Graph Processing. In-memory data processing is another area. Even classic HPC Message Passing Interface apps are storing data in HDFS.
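The classic batch MapReduce model mentioned above boils down to three steps: a map phase emitting key/value pairs, a shuffle that groups values by key, and a reduce phase. A toy pure-Python word count (no Hadoop required, purely illustrative) shows the shape of the model:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big insight", "big hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

In real Hadoop the map and reduce functions run on many machines in parallel and the shuffle moves data across the network, but the programming model is exactly this simple, which is why it scales so well for batch workloads.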
  8. The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop’s cluster resource management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop, versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They’re adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major re-architecture that has been underway for 3 or 4 years. And this slide shows just a sampling of the open source projects that are, or will be, leveraging YARN in the not-so-distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled. Spark is an in-memory data processing system built at Berkeley that has recently been contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
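The core idea behind YARN, that many frameworks negotiate containers from one shared resource manager instead of each owning its own stovepiped cluster, can be sketched as a toy scheduler. To be clear, this class and its numbers are invented for illustration and bear no resemblance to the real YARN API:

```python
class ToyResourceNegotiator:
    """Toy stand-in for a cluster resource manager: it hands out containers
    from a single shared pool to whichever framework asks, batch or otherwise."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}

    def request(self, framework, containers):
        """Grant up to `containers` from the free pool; return how many were granted."""
        granted = min(containers, self.free)
        self.free -= granted
        self.allocations[framework] = self.allocations.get(framework, 0) + granted
        return granted

    def release(self, framework, containers):
        """Return containers to the shared pool when a framework finishes work."""
        released = min(containers, self.allocations.get(framework, 0))
        self.allocations[framework] -= released
        self.free += released

# Several frameworks share one cluster instead of three stovepiped ones.
rm = ToyResourceNegotiator(total_containers=100)
rm.request("MapReduce", 60)
rm.request("Storm-on-YARN", 30)
print(rm.free)  # 10
```

The design point this illustrates is the one from the notes: because allocation is centralized, capacity freed by a batch job can immediately serve an interactive or streaming framework, which is what makes predictable quality of service on a shared cluster possible.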
  9. Hadoop 1.0 was architected for the large web properties; Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop that has been architected for broader use by mainstream enterprises. The main focus for this next generation has been the broader enterprise, which has very explicit requirements that are a little different from those of the typical web properties who first adopted Hadoop. Some of those requirements forced the community to rethink its approach. Plus, our experience running Hadoop at Yahoo provided much insight into how we could architect things to make them better. Some of the critical features are listed here; go through them. Highlight workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…
  10. Since Enterprise Hadoop lies at the heart of the next-generation data architecture, it needs to provide the services and features that make it an enterprise-viable data platform. At the center, we start with Apache Hadoop for distributed file storage and data processing (à la HDFS, MapReduce, and YARN).
[CLICK] Beyond that core, we need to address enterprise concerns such as high availability, disaster recovery, snapshots, security, etc. And the community has been hard at work in both the 1.0 and 2.0 lines of Hadoop addressing these needs.
[CLICK] On top of this, we need to provide data services that make it easy to move data in and out of the platform, process and transform the data into useful formats, and enable people and other systems to access the data easily. This is where components fit in like Apache Hive for SQL access, HCatalog for describing and managing your tables within Hadoop, Pig for script-based data processing, HBase for online data serving, and Sqoop and Flume for getting data into Hadoop.
[CLICK] It’s also important (I would argue equally important) to make the platform easy to operate. Components like Apache Ambari for provisioning, management, and monitoring of the cluster, Oozie for job & workflow scheduling, and a new framework called Apache Falcon for data lifecycle management fit here.
[CLICK] So all of that: Core and Platform Services, Data Services, and Operational Services come together into what I think of as “Enterprise Hadoop”.
[CLICK] Ensuring that Enterprise Hadoop can be flexibly deployed across operating systems and virtual environments like Linux, Windows, and VMware is important. Targeting cloud environments like Amazon Web Services, Microsoft Azure, Rackspace OpenCloud, and OpenStack is increasingly important. And the ability to provide Enterprise Hadoop pre-configured within a hardware appliance, like Teradata’s Big Analytics Appliance, helps enterprises deploy Hadoop quickly, easily, and in a familiar way.
  11. As mentioned previously, SQL for Hadoop has been a hot topic for the past 6 months or so, and rightly so. There are easily millions of people with SQL skills who would like to leverage those skills as they look to gain insight and value from data stored in Hadoop. With that as backdrop, at the beginning of the year the Stinger Initiative was rolled out. Its focus was to rally the Apache Hive community around two goals: making Hive 100X faster, so it can handle interactive querying use cases, and making Hive more SQL-compliant, so its BI use cases are richer. Oh, and by the way, this work needs to happen in a way that PRESERVES Hive’s awesome capability of processing ginormous data sets. As part of the Stinger Initiative, a new data processing framework has emerged as a sibling to MapReduce. This project is called Apache Tez, and it handles the interactive querying use cases for Hive by eliminating needless HDFS writes that have traditionally slowed Hive down. Since Tez is built on YARN, interactive SQL querying use cases can now run natively IN Hadoop and coexist nicely with classic MapReduce processing, yielding predictable performance and SLAs for apps running in the cluster.
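The performance point behind Tez can be illustrated with Python generators: a chain of MapReduce jobs materializes every intermediate result (in real Hadoop, a write to HDFS) before the next job can start, whereas a DAG engine pipelines stages so records flow straight through. This is a conceptual sketch of the two execution styles, not the Tez API:

```python
# Staged style (chained MapReduce jobs): each stage fully materializes
# its output before the next stage starts -- the list stands in for an
# intermediate HDFS write that the next job must re-read.
def staged(data):
    stage1 = [x * 2 for x in data]    # materialized "HDFS write"
    stage2 = [x + 1 for x in stage1]  # second job re-reads it all
    return sum(stage2)

# Pipelined style (Tez-like DAG): stages are composed lazily as
# generators, so each record flows through the whole pipeline without
# any intermediate result ever being materialized.
def pipelined(data):
    stage1 = (x * 2 for x in data)
    stage2 = (x + 1 for x in stage1)
    return sum(stage2)

data = range(5)
assert staged(data) == pipelined(data) == 25
```

Both functions compute the same answer; the difference is purely in how much intermediate state is written and re-read, which is exactly the overhead Tez removes for interactive Hive queries.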
  12. Everybody’s adopting Hadoop as a data processing platform because it accepts any kind of data and can process at almost any scale. But as people adopt Hadoop and throw all this data on it, they start to find other challenges. For example, how do you ensure data is being processed reliably? How do you know you’re not keeping data that is too old? If you process data globally, how do you deal with multi-datacenter replication?

The challenge is that the tools that exist for Hadoop, including Oozie, DistCp, and others, operate at a very low level, so you need expert developers to build and test data processing solutions. This sort of custom development takes a lot of time and money, and it is error-prone precisely because you are dealing at such a low level. Still, everybody does it this way because there aren’t real alternatives. I see a lot of people who use custom scripts to delete files when they get too old; this approach has a lot of drawbacks. Hadoop traditionally doesn’t provide native tools that solve problems like retention, anonymization, reprocessing, and other needs.

Falcon solves this by letting developers work at a much higher level of abstraction. Falcon provides native APIs for data processing, retention, replication, and more that abstract away low-level tools like schedulers and the mechanical details of replication. With Falcon, developers do more, do it more easily, and avoid common mistakes. Avoiding common mistakes is probably the most important thing. Data management on Hadoop is not easy, and Falcon was developed by engineers who worked on large-scale data management at Yahoo, complete with all the battle scars that brings. Falcon has a lot of those practical lessons learned baked into its APIs, ready for developers to simply use.

Question: What data lifecycle management needs do you have in your environment?
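The retention problem described above (ad hoc scripts deleting files when they get too old) reduces to a simple age check against a cutoff; what a tool like Falcon adds is a declarative policy so that every team isn’t reimplementing and mis-implementing this by hand. The sketch below uses an in-memory catalog and made-up dataset paths rather than real HDFS, purely to show the mechanics:

```python
from datetime import datetime, timedelta

def apply_retention(catalog, retention_days, now):
    """Drop every entry older than the retention window.
    `catalog` maps a dataset path to its creation timestamp."""
    cutoff = now - timedelta(days=retention_days)
    expired = [path for path, created in catalog.items() if created < cutoff]
    for path in expired:
        del catalog[path]  # a real tool would delete the HDFS partition here
    return expired

now = datetime(2013, 7, 9)
catalog = {
    "/data/clicks/2013-07-08": datetime(2013, 7, 8),
    "/data/clicks/2013-01-01": datetime(2013, 1, 1),
}
expired = apply_retention(catalog, retention_days=90, now=now)
print(expired)  # ['/data/clicks/2013-01-01']
```

The fragile part of hand-rolled versions is everything around this loop: time zones, partial writes, replication ordering, and retries, which is exactly the operational knowledge the notes say Falcon bakes into its APIs.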
  13. - Operators can firewall the cluster so that end users have access only to the “gateway node”
- Users see one cluster end-point that aggregates capabilities for data access, metadata, and job control
- Provides perimeter security to make Hadoop security setup easier
- Enables integration with enterprise and cloud identity management environments
- Verification: verify the identity token; SAML; propagation of identity
- Authentication: establish identity at the Gateway; authenticate with LDAP + AD
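The gateway pattern in these notes, a single authenticated end-point fronting the whole cluster, can be sketched as a dispatcher that verifies identity before routing anything inward. The user store, service names, and addresses below are invented for illustration; Knox itself delegates authentication to LDAP/AD rather than holding credentials like this:

```python
# Toy perimeter gateway: one entry point that authenticates callers
# before routing to internal cluster services, which stay firewalled.
USERS = {"alice": "s3cret"}                                   # stands in for LDAP/AD
SERVICES = {"hdfs": "namenode:8020", "jobs": "resourcemanager:8032"}

def gateway(user, password, service):
    """Return an (http_status, message) pair for a request at the perimeter."""
    if USERS.get(user) != password:
        return (401, "authentication failed at gateway")
    if service not in SERVICES:
        return (404, "unknown service")
    # A real gateway would proxy the request; we just report the routing.
    return (200, f"forwarded to {SERVICES[service]}")

print(gateway("alice", "s3cret", "hdfs"))
print(gateway("mallory", "guess", "hdfs"))
```

The operational win is the one on the slide: internal service addresses never leave the perimeter, and identity is established exactly once, at the gateway.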
  14. One thing I’ve learned in my last 10 years of working in the enterprise open source arena is that it’s best to think of “community” in a broad way. In the Hadoop space, there is clearly the open source community; without the innovative Apache open source technology, none of us would be here today. There’s also the end-user community, which spans the tech-savvy early-adopter types as well as the more pragmatic and conservative adopters. The third piece is the broader ecosystem that integrates with, extends, enhances, and builds on the platform.
  15. Now let’s expand the scope to include ALL of the sponsors! I love this slide because it is very BUSY! The cool thing is that we have almost 70 sponsors providing really nice coverage across all layers of the data stack. This is a great sign that the Hadoop market is maturing quite nicely!
  16. Hadoop Wave ONE started in 2006 and did a GREAT job at Web-scale Batch-oriented data processing. A vibrant community and strong enterprise interest propelled Hadoop across the Chasm at the end of 2012.
  17. The 2nd wave of Hadoop has started and it will continue to fuel Hadoop on its path through mainstream adoption. Everyone in this room is at the forefront of a movement that will have lasting impact across the industry. Hadoop has the opportunity to process half the world’s data. There’s still a lot of work to be done.
  18. Where are we? Where does it go from here? What’s next?

Community:
- New projects incubated: Falcon, Knox, and more
- Hadoop 2 and the YARN-based architecture coming in for a landing (beta vote)
- Certification of YARN-based applications – Hortonworks just announced

Ecosystem:
- VCs invested $1.4M in Big Data companies in 2012, and 2013 is even bigger (now huge investment into tools for accessing data in Hadoop, indicating “it has arrived”)
- Virtually every provider that touches data in any shape has brought Hadoop in
- Job postings

Commercial adoption:
- Projects going live at scale
- Amazing commercial use cases that you will hear more about
- e.g. Cardinal Health, Home Depot? phenomenal examples of application to healthcare