The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem
Alan Gates
Co-Founder
Hortonworks
@alanfgates
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Our Hadoop Journey Begins…
1 … N
HDFS
MapReduce
Batch apps
2006
Our Hadoop Journey: Ecosystem Innovation Accelerates
2006 2011 Today
6 Years of Apache Hive and Beyond
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching
2010 2011 2012 2013 2014 2015 2016
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
Hive 2 with LLAP: Architecture Overview
ODBC / JDBC SQL Queries
HiveServer2 (Query Endpoint)
Query Coordinators (3 shown)
YARN Cluster: LLAP Daemons (4 shown), each with Query Executors and an In-Memory Cache
Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems
Hive 2 with LLAP: 25+x Performance Boost
[Chart: per-query Hive 1 / Tez Time (s) and Hive 2 / LLAP Time (s), lower is better, with Speedup (x Factor)]
Hive 2 with LLAP averages 26x faster than Hive 1
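As a rough illustration of how a speedup factor like the one on this slide can be computed: the per-query factor is the Hive 1 / Tez time divided by the Hive 2 / LLAP time, and the headline figure is an average over queries. The timings below are invented for the example and are not the benchmark data behind the chart. The slide does not say which kind of average it uses, so an arithmetic mean of the per-query factors is assumed here.

```python
from statistics import mean

# Hypothetical timings in seconds; NOT the measurements behind the chart.
hive1_tez_seconds = {"q1": 40.0, "q2": 120.0, "q3": 250.0}
hive2_llap_seconds = {"q1": 2.0, "q2": 4.0, "q3": 10.0}


def speedup_factors(before, after):
    """Return the per-query speedup (x factor): old time divided by new time."""
    return {query: before[query] / after[query] for query in before}


factors = speedup_factors(hive1_tez_seconds, hive2_llap_seconds)
# Assumption: the average is an arithmetic mean of the per-query factors.
average_speedup = mean(factors.values())

for query, factor in sorted(factors.items()):
    print(f"{query}: {factor:.1f}x")
print(f"average speedup: {average_speedup:.1f}x")
```

Averaging per-query factors weights every query equally, so a short query with a large factor counts as much as a long-running one. Dividing total time by total time would give a different number.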
What’s new in Spark 2.0?
• API Improvements
– SparkSession – new entry point
– Unified DataFrame & Dataset API
– Structured Streaming/Continuous Application
• Performance Improvements
– Tungsten Phase 2 – Whole-stage code generation
• ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
• Spark SQL
– SQL:2003 support (new ANSI SQL parser, subquery support)
How to Secure and Govern Access to Your Data?
Policies: Classification, Prohibition, Time, Location
?
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
Secure and Govern Your Data with Tag-Based Access Policies
Policies: Classification, Prohibition, Time, Location
Ranger: Manage Access Policies and Audit Logs (PDP, Resource Cache)
Atlas: Track Metadata and Lineage (Metastore: Tags, Assets, Entities)
Atlas Client: Subscribers to Topic, Gets Metadata Updates
Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
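The idea of tag-based policies can be sketched in a few lines of plain Python. This is not the Ranger or Atlas API. Every name here (`TagPolicy`, `ASSET_TAGS`, `POLICIES`, `is_allowed`) is hypothetical, as are the tags, groups, and dates. The point is only that policies attach to tags rather than to individual tables or files, so tagging a new asset brings it under existing policies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TagPolicy:
    """A hypothetical policy attached to a tag rather than to one asset."""
    tag: str
    allowed_groups: frozenset
    allowed_locations: Optional[frozenset] = None  # None means any location
    expires_on: Optional[date] = None              # None means no expiry


# Tag catalog (the role Atlas plays on the slide): asset -> set of tags.
ASSET_TAGS = {
    "hive:sales.customers": {"PII"},
    "hdfs:/data/clickstream": {"PUBLIC"},
}

# Policy store (the role Ranger plays on the slide).
POLICIES = [
    TagPolicy("PII", frozenset({"compliance"}), frozenset({"EU"}), date(2017, 1, 1)),
    TagPolicy("PUBLIC", frozenset({"compliance", "analysts"})),
]


def is_allowed(asset, group, location, today):
    """Allow only if every tag on the asset has a policy permitting the request."""
    tags = ASSET_TAGS.get(asset, set())
    if not tags:
        return False  # Assumption: untagged or unknown assets are denied.
    for tag in tags:
        permitted = any(
            policy.tag == tag
            and group in policy.allowed_groups
            and (policy.allowed_locations is None or location in policy.allowed_locations)
            and (policy.expires_on is None or today <= policy.expires_on)
            for policy in POLICIES
        )
        if not permitted:
            return False
    return True


print(is_allowed("hive:sales.customers", "compliance", "EU", date(2016, 11, 1)))
print(is_allowed("hive:sales.customers", "analysts", "EU", date(2016, 11, 1)))
```

The location and expiry fields stand in for the slide's Location and Time policy types. A real deployment would also cover classification and prohibition rules and would write an audit log entry for each decision.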
Data In Motion
• Constrained
• High-latency
• Localized context
• Hybrid – cloud/on-premises
• Low-latency
• Global context
SOURCES
REGIONAL INFRASTRUCTURE
CORE INFRASTRUCTURE
Our Hadoop Journey: From the Data Center to the Cloud!
2006 Today
Why Hadoop in the Cloud?
• Unlimited Elastic Scale
• Ephemeral & Long-Running
• IT & Business Agility
• No Upfront HW Costs
Key Architectural Considerations for Hadoop in the Cloud
• Shared Data & Storage
• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance
Shared Data and Storage
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR
Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage
Enhance Performance via Caching
Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level
HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache
Workloads
Cache: LLAP R/W Tables, HDFS Files
Cloud Storage
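The caching behavior listed above (read and write-through, with a spill to SSD when memory fills) can be illustrated with a toy two-tier cache. This is a sketch under simplifying assumptions, not how LLAP or HDFS caching is implemented. Plain dictionaries stand in for the SSD tier and for cloud storage, eviction is least-recently-used, and the class name `WriteThroughCache` is made up for the example.

```python
from collections import OrderedDict


class WriteThroughCache:
    """Toy read + write-through cache; dicts stand in for SSD and cloud storage."""

    def __init__(self, backing_store, memory_capacity):
        self.backing_store = backing_store  # stands in for cloud storage
        self.memory = OrderedDict()         # in-memory tier, least recently used first
        self.ssd = {}                       # spill tier used when memory is full
        self.memory_capacity = memory_capacity

    def _put_in_memory(self, key, value):
        self.ssd.pop(key, None)             # avoid keeping a stale spilled copy
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.ssd[old_key] = old_value   # spill instead of dropping

    def write(self, key, value):
        # Write-through: the backing store is updated on every write.
        self.backing_store[key] = value
        self._put_in_memory(key, value)

    def read(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.ssd:
            value = self.ssd.pop(key)
        else:
            value = self.backing_store[key]  # cache miss: fetch from storage
        self._put_in_memory(key, value)
        return value


cloud = {"a": 1}
cache = WriteThroughCache(cloud, memory_capacity=2)
cache.write("b", 2)
cache.write("c", 3)
cache.read("a")  # miss: loaded from storage, pushes "b" out to the SSD tier
print(list(cache.memory), cache.ssd, cloud)
```

Because every write reaches the backing store immediately, losing the cache (for example when an ephemeral cluster is torn down) loses no data in this model.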
Prescriptive On-Demand Ephemeral Workloads
• Data Science: R/W Tables, Compute Fabric
• ETL: R/W Tables, Compute Fabric
• Warehouse: R/W Tables, Compute Fabric
• Search: R/W Tables, Compute Fabric
Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
• Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)
Policies: Classification, Prohibition, Time, Location
Shared Metadata: Streams, Pipelines, Feeds, Tables, Files, Objects
Elastic Resource Management in Context of Workload
Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
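A minimal sketch of what SLA-driven, pricing-aware resizing could look like. It assumes work scales linearly across nodes, that idle reserved instances are already paid for and should be used first, and that any remainder comes from spot instances. The function names, parameters, and numbers are hypothetical and do not come from any Hortonworks or cloud provider API.

```python
import math


def nodes_needed(work_units, units_per_node_hour, sla_hours):
    """Smallest node count that finishes the work within the SLA (linear scaling assumed)."""
    return math.ceil(work_units / (units_per_node_hour * sla_hours))


def plan_resize(current_nodes, target_nodes, idle_reserved, spot_price_per_hour):
    """Decide how many nodes to remove, or which instance kinds to add."""
    if target_nodes <= current_nodes:
        return {"remove": current_nodes - target_nodes, "reserved": 0,
                "spot": 0, "extra_cost_per_hour": 0.0}
    missing = target_nodes - current_nodes
    reserved = min(missing, idle_reserved)  # assumed already paid for, so used first
    spot = missing - reserved               # the rest comes from spot instances
    return {"remove": 0, "reserved": reserved, "spot": spot,
            "extra_cost_per_hour": spot * spot_price_per_hour}


# Hypothetical workload: 1200 units of work, 10 units per node-hour, 4-hour SLA.
target = nodes_needed(work_units=1200, units_per_node_hour=10, sla_hours=4)
plan = plan_resize(current_nodes=20, target_nodes=target,
                   idle_reserved=4, spot_price_per_hour=0.25)
print(target, plan)
```

A real planner would also account for the data and bandwidth costs named on the slide, and for the risk of spot instances being reclaimed before the workload finishes.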
Transformational Applications Require Connected Data
Data Center, Cloud, Edge Data, Edge Analytics, Data in Motion, Stream Analytics, Data at Rest, Deep Historical Analysis, Machine Learning
Thank You

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 

Dernier (20)

Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 

Big Data Spain keynote, November 2016

  • 1. The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem Alan Gates Co-Founder Hortonworks @alanfgates
  • 2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved. Our Hadoop Journey Begins… (2006): HDFS and MapReduce across nodes 1 … N, running batch apps
  • 3. Our Hadoop Journey: Ecosystem Innovation Accelerates (timeline: 2006, 2011, Today)
  • 4. 6 Years of Apache Hive and Beyond (timeline: 2010 to 2016)
      – Apache Hive becomes a Top-Level Project
      – HiveServer2 adds ODBC/JDBC
      – SQL breadth expands with windowing and more
      – Apache Tez enters incubation
      – Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
      – Standard SQL authorization, integration with Apache Ranger
      – ACID transactions introduced
      – Governance added with Apache Atlas integration
      – Hive 2 introduces LLAP and intelligent in-memory caching
    A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
      – Extensive SQL:2011 Support
      – Compatible with every major BI Tool
      – Proven at 300+ PB Scale
  • 5. Hive 2 with LLAP: Architecture Overview. [Diagram components: ODBC / JDBC SQL queries; HiveServer2 (query endpoint); query coordinators (three shown); a YARN cluster with four LLAP daemons, each holding query executors and an in-memory cache; deep storage (HDFS, S3, and other HDFS-compatible filesystems)]
  • 6. Hive 2 with LLAP: 25+x Performance Boost. Hive 2 with LLAP averages 26x faster than Hive 1. [Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, plus speedup (x factor)]
  • 7. What’s new in Spark 2.0?
      – API Improvements: SparkSession (new entry point); unified DataFrame & Dataset API; Structured Streaming / Continuous Applications
      – Performance Improvements: Tungsten Phase 2; whole-stage code generation
      – ML: ML model persistence; distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
      – SparkSQL: SQL 2003 support (new ANSI SQL parser, subquery support)
  • 8. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 9. How to Secure and Govern Access to Your Data?
      – Policies: Classification; Prohibition; Time; Location
      – Entities in Data Lake: Streams; Pipelines; Feeds; Hive Tables; HDFS Files; HBase Tables
  • 10. Secure and Govern Your Data with Tag-Based Access Policies
      – Policies: Classification; Prohibition; Time; Location
      – Ranger: manage access policies and audit logs (PDP, resource cache)
      – Atlas: track metadata and lineage
      – Atlas client: subscribes to topic, gets metadata updates
      – Atlas metastore: tags; assets; entities
      – Entities in Data Lake: Streams; Pipelines; Feeds; Hive Tables; HDFS Files; HBase Tables
  • 11. Data In Motion
      – Constrained; High-latency; Localized context
      – Hybrid (cloud/on-premises); Low-latency; Global context
      – Tiers: SOURCES; REGIONAL INFRASTRUCTURE; CORE INFRASTRUCTURE
  • 12. Our Hadoop Journey: From the Data Center to the Cloud! (timeline: 2006 to Today)
  • 13. Why Hadoop in the Cloud? Unlimited Elastic Scale; Ephemeral & Long-Running; IT & Business Agility; No Upfront HW Costs
  • 14. Key Architectural Considerations for Hadoop in the Cloud: Shared Data & Storage; On-Demand Ephemeral Workloads; Elastic Resource Management; Shared Metadata, Security & Governance
  • 15. Shared Data and Storage
      – Understand and Leverage Unique Cloud Properties: shared data lake is cloud storage accessible by all apps; cloud storage segregated from compute; built-in geo-distribution and DR
      – Focus Areas: address cloud storage consistency and performance; enhance performance via memory and local storage
  • 16. Enhance Performance via Caching
      – Tabular Data: LLAP Read + Write-thru Cache: shared across jobs / apps and across engines; cache only the needed columns; spills to SSD when memory is full (anti-caching); read & write-through cache; security: column-level and row-level
      – HDFS Caching for Non-tabular Data: cache data from cloud storage as needed; write-through cache
      – Diagram labels: Workloads; Cloud Storage; LLAP; R/W Tables; HDFS; Files; Cache
  • 17. Prescriptive On-Demand Ephemeral Workloads: Data Science; ETL; Warehouse; Search (each shown with R/W Tables and a Compute Fabric)
  • 18. Shared Data Requires Shared Metadata, Security, and Governance: Shared Metadata Across All Workloads
      – Metadata considerations: tabular data metastore; lineage and provenance metadata; pipeline and job management metadata; add upon ingest; update as processing modifies data
      – Access / tag-based policies and audit logs
      – Centrally stored to facilitate use across clusters (ex. backed by Cloud RDS or a shared DB)
      – Diagram labels: Policies (Classification; Prohibition; Time; Location); Shared Metadata; Streams; Pipelines; Feeds; Tables; Files; Objects
  • 19. Elastic Resource Management in Context of Workload: Workload Management vs. Cluster Management
      – Understand resource needs of different workload types
      – Add / remove resources to meet workload SLAs
      – Manage compute power and high-performance data access (ex. LLAP)
      – Pricing-aware: instances (spot, reserved), data, bandwidth
  • 20. Transformational Applications Require Connected Data. [Diagram labels: DATA CENTER; CLOUD; Data in Motion; Data at Rest; Edge Data; Edge Analytics; Stream Analytics; Machine Learning; Deep Historical Analysis]
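
The tag-based access policies on slide 10 can be pictured with a small sketch. This is not Ranger's or Atlas's API; every name here (`TagPolicy`, `Asset`, `is_allowed`) is hypothetical, and the rules (unknown tags are denied, untagged assets are unrestricted) are assumptions made only to keep the example short. It shows the idea that a policy attaches to a tag, so data is covered as soon as a pipeline tags it, including time-based and location-based conditions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, FrozenSet, Optional, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy bound to a tag rather than to a table or file."""
    tag: str
    allowed_groups: FrozenSet[str]
    max_age_days: Optional[int] = None              # time-based condition
    allowed_regions: Optional[FrozenSet[str]] = None  # location-based condition


@dataclass
class Asset:
    """Hypothetical catalog entry for a table, file, or stream."""
    name: str
    created: date
    tags: Set[str] = field(default_factory=set)


def is_allowed(asset: Asset, group: str, region: str, today: date,
               policies: Dict[str, TagPolicy]) -> bool:
    """Allow the request only if every tag on the asset permits it.

    Assumptions of this sketch: a tag without a policy denies access,
    and an asset with no tags is unrestricted.
    """
    for tag in asset.tags:
        policy = policies.get(tag)
        if policy is None or group not in policy.allowed_groups:
            return False
        if policy.max_age_days is not None:
            age_days = (today - asset.created).days
            if age_days > policy.max_age_days:
                return False
        if policy.allowed_regions is not None and region not in policy.allowed_regions:
            return False
    return True


# One policy for the (hypothetical) "PII" tag: analysts only, from the EU,
# and only while the data is at most 90 days old.
policies = {
    "PII": TagPolicy("PII", frozenset({"analysts"}),
                     max_age_days=90, allowed_regions=frozenset({"EU"})),
}

customers = Asset("hive:customers", created=date(2016, 10, 1))
before_tagging = is_allowed(customers, "marketing", "US", date(2016, 11, 15), policies)

# A landing pipeline tags the data; the existing policy applies immediately.
customers.tags.add("PII")
after_tagging = is_allowed(customers, "marketing", "US", date(2016, 11, 15), policies)
analyst_ok = is_allowed(customers, "analysts", "EU", date(2016, 11, 15), policies)
analyst_too_old = is_allowed(customers, "analysts", "EU", date(2017, 3, 1), policies)
```

The design point is that nothing about the policy names the table: adding the tag is the only step needed to bring a new data set under the existing rules.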

Editor's notes

  1. Speaker: Alan Gates, Hortonworks Co-Founder. Title: 10 Years of Apache Hadoop and Beyond. Duration: 40 minutes. Abstract: In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
  2. Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data. Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop. [NEXT]
  3. Fast forward to 2011 and the concept of YARN began to emerge. Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications. YARN further accelerated the innovation around Hadoop, with the emergence of Spark, Kafka, Storm, and many other projects that started life as Apache Incubator proposals. [NEXT]
  4. I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to an enterprise-ready EDW.
  5. Spark is the Swiss Army knife of big data: it can do streaming, SQL, ETL, and ML, and is available from multiple languages (Python, Java, Scala).
  6. So the enterprise has invested in integrating Hadoop into its data lake architecture, landing petabytes of data from streams, pipelines, and data feeds into HDFS files, Hive and HBase tables, etc. The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them. [NEXT]
  7. The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need. The result is a tag-based authorization model driven by the metadata catalog (i.e. Atlas) with access and audit policies applied to those tags (via Ranger). This enables a more flexible way to govern access to data and data sets than traditional role/group-based access policies. For example, as data pipelines land data, they can tag that data as it lands, and the access policies set up for those tags immediately apply. Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that’s older than 90 days (for example) or limit access to data from certain geographies. This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications with the confidence that their data is secure and well governed. [NEXT]
  8. TALK TRACK: People are no longer willing to wait until data is in the store before processing it. Hortonworks DataFlow is a platform for data in motion. It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing. MiNiFi/NiFi creates dynamic, configurable data pipelines. Kafka supports adaptation to differing rates of data creation and delivery. Storm provides real-time stream processing to create immediate insights at massive scale. There are scenarios where NiFi will provide all that you need, especially in situations that only require dataflow management, but you will notice that the orange and blue horizontal triangles provide a continuum of capability from edge to core, which indicates varying degrees of need for the different products.
  9. So after 10 years, the Hadoop ecosystem is available everywhere. In the Data Center, within appliances, across public and private clouds. This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases. [NEXT]
  10. 13
  11. While there is a range of great choices in the market today, there’s more that we, as a community, can and should do to make Hadoop in the cloud better and first class. I’ll spend the remainder of this talk on the key architectural considerations. Shared Data & Storage: the shared data lake is on cloud storage; it is not HDFS. Also, memory and local storage play a different role, that of caching. An important distinction in the cloud is On-Demand Ephemeral Workloads: this changes a number of things in fundamental ways. Shared Metadata, Security, and Governance remains important but needs to be adjusted in the face of ephemeral clusters. And finally, I’ll touch on Elastic Resource Management: we need to shift our thinking away from cluster resource management and more towards SLA-driven workloads. [NEXT]
  12. In the cloud, the shared data lake is on cloud storage. It is not the HDFS of a specific Hadoop cluster. Note this is very different from a traditional on-premises cluster, where each cluster has an internal shared store representing its internal data lake. Moreover, it’s desirable to have this shared data be accessible by all apps, not just Hadoop apps: cloud-native and third-party apps too. Good news: geo-distribution is built into the cloud storage and DR becomes simpler. Cloud storage has two limitations: eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps. Addressing these two issues is a key area of ongoing investment. I encourage you to attend today’s breakout session by my fellow Hortonworkers that’s focused on this topic. Cloud storage is designed for low cost and scale; unfortunately, performance is not its strong point due to its segregation from compute. Memory and local storage play a different role in the cloud: a cache to enhance performance. [NEXT]
  13. With respect to caching, we need to consider both tabular data and non-tabular data. For tabular data, LLAP comes to the rescue: it provides a tabular cache that’s shared across jobs, apps, and engines such as Hive and Spark. LLAP only caches the needed columns, so it’s very efficient in its use of memory. Further, data is stored in an internal serialized form to optimize compute. The design center is anti-caching: put it all in memory and spill to disk/SSD when memory is full. LLAP currently provides read caching, but is being extended to support a write-through cache. And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to address column-level and row-level access control that works across all kinds of engines, whether Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible. From a non-tabular data perspective, HDFS can be used to cache cloud data, as both a read cache and a write-through cache. This essentially evolves HDFS to play a different role: a place to store intermediate data and also a finely tuned caching layer between the applications and the cloud storage. [NEXT]
  14. Always-on multitenant clusters are important for a range of mission-critical use cases. However, bringing forth an ephemeral cluster to support a specific workload is game changing. The agile nature of the cloud allows us to create prescriptive workload environments. Someone interested in modeling and analyzing data sets simply wants to interact with a PRE-TUNED environment optimized for the application. The complexities of configuring Spark, Hive and Hadoop need to be hidden under the hood. Whether it’s data science, data warehouse, ETL, or other common workload types, we need to provide pre-configured and pre-tuned compute environments. Further, we need to be able to manage them in an ephemeral fashion. The NET: deliver user experiences that are focused on business agility rather than infinite configurability and cluster management. [NEXT]
  15. So far, I have covered shared data and storage and how to optimize performance by caching. Shared data fundamentally requires a shared approach to metadata, security and governance. The metadata is not just the classic Hive metadata that describes the tabular data; it is also about storing and tracking the lineage and provenance of data, and about details related to data pipeline processing and job management. Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is. Also, as data is ingested and processed, metadata needs to be created and adjusted. Governing and securing the data remain critical, and its metadata needs to be managed across all workloads. The work done by projects such as Ranger and Atlas needs to be evolved to fit the cloud environment. If we don’t do this, then the cloud will not be adopted aggressively for enterprise use. Getting back to the shared metadata: each ephemeral cluster cannot have its own private copy of the metadata. In the cloud world, metadata must be centrally stored so it is used across all ephemeral clusters. [NEXT]
  16. Final area: resource management. We up-level resource management. So far, YARN has focused on optimizing resources in the context of a cluster. The cloud is not about the cluster; it is about the workloads, and further, resources are elastic. The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload’s SLA. It may need to get extra resources from the cloud, and to get the right resources to match the needs of the workload. Sometimes adding compute power is not sufficient to meet an SLA, because latency/bandwidth to data may be the bottleneck; e.g., spin up LLAP memory in order to improve caching and hence meet the SLA. The cloud offers another dimension: that of cost and budgets. There are different costs tied to CPU, memory and data access bandwidth, so elasticity and spot pricing tradeoffs should be factored in. Resource management in this new dimension is important if you want to reap the benefit of low-cost cloud computing. To summarize: the better one understands the nature of a workload, the more we are able to take advantage of elasticity and spot pricing. CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud and also to take advantage of unique cloud features such as elasticity. We at Hortonworks have been working on this over the last few months, and Ram will show you a quick demo of the tech preview we are releasing this week. [NEXT]
  17. Today we have talked about evolving Hadoop to run well in the cloud. At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and the data center. This is illustrated on the screen. It stresses two important points. First, the connectedness of the cloud and the on-premises infrastructure and data. Second, the connectedness of data in motion and data at rest. The era of the Internet of Things demands that we manage the entire lifecycle of all data (data in motion and data at rest). It’s about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights. It’s also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it’s not just about point-to-point delivery; it’s also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data. So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge could represent data from the manufacturing line. Having a connected data architecture that enables you to deal with all of this data unlocks the ability to figure out, for example, what manufacturing line issues may be causing operational issues in cars in the field. In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models. [NEXT SLIDE]
  18. THANK YOU AND HAVE A GREAT CONFERENCE!
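
The SLA-driven, pricing-aware resource management described in the notes on elastic resource management can be illustrated with a toy planner. This is only a sketch under strong assumptions, not how YARN or any cloud scheduler works: the names (`InstanceOffer`, `plan`), the prices, and the throughput figures are all hypothetical, work is assumed to scale linearly across instances, and billing is assumed to be fractional. Given a workload size and an SLA, it picks the cheapest instance offer the workload can tolerate and the number of instances needed to finish in time.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InstanceOffer:
    """Hypothetical cloud instance offer."""
    name: str
    price_per_hour: float       # made-up price, for illustration only
    work_units_per_hour: float  # made-up throughput of one instance
    interruptible: bool         # True for spot-style capacity


def plan(work_units: float, sla_hours: float, offers: List[InstanceOffer],
         tolerates_interruption: bool) -> Tuple[InstanceOffer, int, float]:
    """Return (offer, instance_count, total_cost) for the cheapest eligible offer.

    Assumes perfect linear scaling: N instances finish N times faster.
    """
    eligible = [o for o in offers if tolerates_interruption or not o.interruptible]
    if not eligible:
        raise ValueError("no offer satisfies the workload's constraints")

    best = None
    for offer in eligible:
        # Fewest instances that still finish within the SLA.
        count = math.ceil(work_units / (offer.work_units_per_hour * sla_hours))
        hours = work_units / (offer.work_units_per_hour * count)
        cost = count * hours * offer.price_per_hour
        if best is None or cost < best[2]:
            best = (offer, count, cost)
    return best


offers = [
    InstanceOffer("on_demand", price_per_hour=1.0, work_units_per_hour=50, interruptible=False),
    InstanceOffer("spot", price_per_hour=0.3, work_units_per_hour=50, interruptible=True),
]

# An ETL job that can be retried may use spot capacity; a warehouse query
# with a hard SLA (in this sketch) may not.
etl_offer, etl_count, etl_cost = plan(1000, 4, offers, tolerates_interruption=True)
dw_offer, dw_count, dw_cost = plan(1000, 4, offers, tolerates_interruption=False)
tight_offer, tight_count, tight_cost = plan(1000, 2, offers, tolerates_interruption=False)
```

Under these simplifying assumptions, a tighter SLA changes the instance count but not the total cost, while knowing that a workload tolerates interruption is what unlocks the cheaper capacity. A real planner would also have to account for data-access bandwidth and cache memory, which this sketch leaves out.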