Scalable Data Warehousing on
Hadoop
Alan F. Gates, Co-founder, Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What Do You Expect in a Hadoop Data Warehouse?
Benchmarks focus on two questions:
– How much of the TPC-DS query set can it run?
– How fast can it run it?
What Do You Expect in a Data Warehouse?
High Performance
SQL 2011
High Storage Capacity
Security
Support for BI, Cubes, Data Science
Monitoring & Management
Governance
Data Lifecycle Management
Replication & D/R
Workload Management
Data Ingestion
So, back to TPC-DS...
High Performance
SQL 2011
Apache Hive Overview
Apache Hive is a SQL data warehouse engine that
delivers fast, scalable SQL processing on Hadoop and
in the Cloud.
Features:
• Extensive SQL:2011 Support
• ACID Transactions
• In-Memory Caching
• Cost-Based Optimizer
• User-Based Dynamic Security
• JDBC and ODBC Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale
Apache Hive: Fast Facts
• Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
• Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
• Largest Cluster: 4,500+ Nodes (Yahoo)
Apache Hive: Journey to SQL:2011 Analytics
Data Types
• Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
• String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
• Date, Time: DATE, TIMESTAMP, Interval Types
• Complex Types: ARRAY / MAP / STRUCT / UNION
• Nested Data Analytics: Nested Data Traversal; Lateral Views
• Procedural Extensions: HPL/SQL
SQL Features
• Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
• Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
• ACID Transactions: INSERT / UPDATE / DELETE
• Constraints: Primary / Foreign Key (Non Validated)
File Formats
• Columnar: ORCFile, Parquet
• Text: CSV, Logfile
• Nested / Complex: Avro, JSON, XML
• Custom Formats
• Other Features: XPath Analytics
Futures
• ACID MERGE
• Multi Subquery
• Scalar Subqueries
• Non-Equijoins
• INTERSECT / EXCEPT
• Recursive CTEs
• NOT NULL Constraints
• Default Values
• Multi-statement Transactions
Legend: New, Future work
Hive 2.next
Track Hive SQL:2011 Complete: HIVE-13554
Hive 2 with LLAP: Architecture Overview
[Diagram components: ODBC / JDBC clients send SQL queries to HiveServer2 (query endpoint); query coordinators; LLAP daemons on a YARN cluster, each running query executors, with an in-memory cache (shared across all users); deep storage on HDFS and compatible stores (S3, WASB, Isilon).]
Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
[Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with speedup (x factor) on a second axis.]
Hive 2 with LLAP averages 26x faster than Hive 1
Apache Hive vs. Apache Impala at 10TB
Highlights
• 10TB scale on 10 identical AWS nodes.
• Hive and Impala showed similar times on most smaller queries.
• Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s).
Apache Hive vs. Presto on a partitioned 1TB dataset
Highlights
• Presto lacks basic performance optimizations like dynamic partition pruning.
• On a real dataset / workload Presto performs poorly without full re-writes.
• Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.
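To make the dynamic partition pruning point concrete, here is a minimal sketch of the idea, not how Hive or Presto implement it. It assumes a fact table partitioned by a date key and a small date dimension; every table name and value below is hypothetical. The pruned scan evaluates the dimension filter first and then reads only the partitions whose keys survive it.

```python
# Hypothetical star-schema fragment: a fact table partitioned by date key
# and a small date dimension. None of these names come from the slides.
fact_partitions = {
    20170101: [("store_1", 120.0), ("store_2", 80.0)],
    20170102: [("store_1", 95.0)],
    20170103: [("store_2", 60.0), ("store_3", 40.0)],
    20170104: [("store_1", 10.0)],
}
date_dim = {
    20170101: {"is_weekend": True},
    20170102: {"is_weekend": False},
    20170103: {"is_weekend": False},
    20170104: {"is_weekend": False},
}


def is_weekend(attrs):
    """Dimension-side filter, only known once the dimension has been read."""
    return attrs["is_weekend"]


def scan_without_pruning(predicate):
    """Read every partition, then discard rows whose date fails the filter."""
    partitions_read = 0
    total = 0.0
    for date_key, rows in fact_partitions.items():
        partitions_read += 1
        if predicate(date_dim[date_key]):
            total += sum(amount for _, amount in rows)
    return total, partitions_read


def scan_with_dynamic_pruning(predicate):
    """Evaluate the dimension filter first, then read only matching partitions."""
    wanted = {key for key, attrs in date_dim.items() if predicate(attrs)}
    partitions_read = 0
    total = 0.0
    for date_key in wanted:
        partitions_read += 1
        total += sum(amount for _, amount in fact_partitions.get(date_key, []))
    return total, partitions_read


full_total, full_reads = scan_without_pruning(is_weekend)
pruned_total, pruned_reads = scan_with_dynamic_pruning(is_weekend)
print(full_total, full_reads)
print(pruned_total, pruned_reads)
```

Both scans return the same total; the difference is how many partitions each one has to read.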
Hive LLAP: Stable Performance under High Concurrency
Concurrent Queries    Average Runtime
5                     7.76s
25                    36.24s
100                   102.89s
• 5 to 25 concurrent queries: 5x Queries, 4.6x Runtime Difference
• 25 to 100 concurrent queries: 4x Queries, 2.8x Runtime Difference
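The scaling factors quoted on this slide can be rechecked from the three measurements. The short sketch below computes, for each step up in concurrency, the ratio of query counts and the ratio of average runtimes. The exact ratios come out slightly above the slide's figures, which suggests the slide truncated rather than rounded them.

```python
# (concurrent queries, average runtime in seconds), as listed on the slide.
measurements = [(5, 7.76), (25, 36.24), (100, 102.89)]


def scaling_steps(points):
    """Return (from, to, load ratio, runtime ratio) for each adjacent pair."""
    steps = []
    for (q0, t0), (q1, t1) in zip(points, points[1:]):
        steps.append((q0, q1, q1 / q0, t1 / t0))
    return steps


steps = scaling_steps(measurements)
for q0, q1, load_ratio, runtime_ratio in steps:
    print(f"{q0} -> {q1} queries: {load_ratio:.1f}x load, {runtime_ratio:.2f}x runtime")
```

In both steps the runtime grows more slowly than the number of concurrent queries, which is the sense in which the slide calls performance stable.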
How Much Can it Hold, and Where?
High Storage Capacity
Storage
• Of course HDFS, default in the Hadoop world
• More and more cloud
• Move is copy in S3, but current implementation assumes move is atomic and nearly free
– modifying Hadoop (HADOOP-11694) and Hive (HIVE-14535)
• ACID in the cloud
– Compactor moves a lot of files around, need to optimize
– Need to figure out how streaming ingest works in the cloud
• LLAP, caching much more valuable in the cloud
– Looking at flushing cache to SSD so misses are less costly
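The "move is copy" point can be illustrated with a toy model. This sketch is not the Hadoop or S3 API; both classes and all paths are hypothetical. It contrasts a file system where a directory rename only updates metadata with an object store where each object has to be copied to its new key and then deleted.

```python
class ToyFileSystem:
    """Rename only rewrites path metadata; no data bytes are copied."""

    def __init__(self):
        self.files = {}
        self.bytes_copied = 0

    def write(self, path, data):
        self.files[path] = data

    def rename_dir(self, src, dst):
        for path in [p for p in self.files if p.startswith(src)]:
            self.files[dst + path[len(src):]] = self.files.pop(path)


class ToyObjectStore(ToyFileSystem):
    """Rename is emulated as copy-then-delete, one object at a time."""

    def rename_dir(self, src, dst):
        for path in [p for p in self.files if p.startswith(src)]:
            data = self.files[path]
            self.files[dst + path[len(src):]] = bytes(data)  # full data copy
            self.bytes_copied += len(data)
            # Between the copy and the delete, a reader can observe both keys,
            # so the "move" is neither atomic nor free.
            del self.files[path]


fs, store = ToyFileSystem(), ToyObjectStore()
for target in (fs, store):
    target.write("/tmp/job/part-0", b"a" * 1000)
    target.write("/tmp/job/part-1", b"b" * 2000)
    target.rename_dir("/tmp/job/", "/warehouse/t/")

print(fs.bytes_copied, store.bytes_copied)
```

A job that writes to a temporary directory and commits by renaming therefore pays for every byte a second time on the object store.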
Is My Data Safe?
Security
Security today in Hadoop
Centralized Security Administration w/ Ranger & Knox
Authentication: Who am I/prove it?
• Kerberos
• API security with Apache Knox
Authorization: What can I do?
• Fine grain access control with Apache Ranger
Audit: What did I do?
• Centralized audit reporting w/ Apache Ranger
Data Protection: Can data be encrypted at rest and over the wire?
• Wire encryption
• HDFS encryption + Ranger KMS
Authentication—API Security with Knox
Apache Knox extends the reach of Hadoop REST API without Kerberos complexities
Single, simple point of access for a cluster
• Kerberos Encapsulation
• Single Hadoop access point
• REST API hierarchy
• Consolidated API calls
• Multi-cluster support
Centralized and consistent secure API across one or more clusters
• Eliminates SSH “edge node”
• Central API management
• Central audit control
• Service level authorization
Integrated with existing IdM systems
• SSO - SAMLv2, Siteminder and OAM
• LDAP and AD integration
• SSO for Hadoop UIs (Ranger, Ambari..)
Apache Ranger: Per-User Row Filtering by Region in Hive
LLAP Data Access
User ID    Region    Total Spend
1          East      5,131
2          East      27,828
3          West      55,493
4          West      7,193
5          East      18,193
Original Query:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
Query Rewrites based on Dynamic Ranger Policies
User 2 (East Region), Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'east'
User 1 (West Region), Dynamic Rewrite:
SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'west'
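The rewrite on this slide can be imitated in a few lines to show its effect on results. This is a sketch only: Ranger and Hive apply row filters inside the query planner, not by string concatenation, and the policy dictionary, the user names and the use of SQLite as a stand-in engine are all assumptions made for illustration. Region values follow the capitalization used in the slide's table.

```python
import sqlite3

# Rows copied from the slide's CUSTOMERS table.
ROWS = [
    (1, "East", 5131),
    (2, "East", 27828),
    (3, "West", 55493),
    (4, "West", 7193),
    (5, "East", 18193),
]

# Hypothetical policy store: the region each end user is allowed to see.
ROW_FILTER_POLICY = {"user1": "West", "user2": "East"}

ORIGINAL_QUERY = "SELECT * FROM customers WHERE total_spend > 10000"


def rewrite_for_user(query, user):
    """Append the user's row-filter predicate; returns (sql, parameters)."""
    region = ROW_FILTER_POLICY[user]  # unknown users raise KeyError: deny by default
    return f"{query} AND region = ?", (region,)


def run_as(conn, user):
    """Run the original query as `user`, with the row filter applied."""
    sql, params = rewrite_for_user(ORIGINAL_QUERY, user)
    return conn.execute(sql + " ORDER BY user_id", params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", ROWS)

east_rows = run_as(conn, "user2")
west_rows = run_as(conn, "user1")
print(east_rows)
print(west_rows)
```

Both users submit the same query text and each sees only the rows for their own region, which is why no per-region views are needed.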
Apache Ranger: Dynamic Data Masking of Hive Columns
Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
Goal: Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
⬢ Benefits
– Sensitive information never leaves database
– No changes are required at the application or Hive layer
– No need to produce additional protected duplicate versions of datasets
– Simple & easy to set up masking policies
⬢ Core Technologies: Ranger, Hive
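As a rough illustration of masking at query time, the sketch below applies per-column mask functions to result rows on the way out, leaving the stored rows untouched. It is not Ranger's implementation or API; the role names, column names, sample data and the two mask functions are hypothetical.

```python
import hashlib


def show_last_4(value, mask_char="x"):
    """Mask letters and digits in everything but the last four characters."""
    head = "".join(mask_char if ch.isalnum() else ch for ch in value[:-4])
    return head + value[-4:]


def hash_value(value):
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical policy: role -> {column name: mask function}.
MASK_POLICY = {
    "analyst": {"card_number": show_last_4, "email": hash_value},
}


def apply_masks(rows, role):
    """Return masked copies of the rows; the stored rows are not modified."""
    masks = MASK_POLICY.get(role, {})
    return [
        {col: masks[col](val) if col in masks else val for col, val in row.items()}
        for row in rows
    ]


stored = [{"name": "Ann", "card_number": "4111-1111-1111-1234", "email": "ann@example.com"}]
analyst_view = apply_masks(stored, "analyst")
admin_view = apply_masks(stored, "admin")  # no masking policy for this role
print(analyst_view[0]["card_number"])
```

Because the mask is applied to query output per role, a single copy of the dataset serves both the masked and the unmasked audience.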
Dynamic Tag-based Access Policies with Apache Atlas
• Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.
• Geo-based policy – Policy based on IP address, proxy IP substitution may be required. The rule enforcement must be geo aware.
• Time-based policy – Timer for data access, decoupled from deletion of data.
• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
Key Benefits:
• New scalable metadata based security paradigm
• Dynamic, real-time policy
• Active protection – fast updates to changes
• Centralized and simple to manage policy
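A tag-based check with a time limit might look roughly like the following. This is a sketch under stated assumptions, not the Atlas or Ranger API: the tag names, groups, columns and expiry date are hypothetical, and geo-based rules and prohibitions are left out. The key idea is that policies attach to tags rather than to individual columns, so tagging a new column is enough to protect it.

```python
from datetime import datetime

# Hypothetical tag assignments, as a metadata catalog might hold them.
COLUMN_TAGS = {
    "customers.ssn": {"PII"},
    "customers.region": set(),
    "campaign.leads": {"TIME_LIMITED"},
}

# Hypothetical policies keyed by tag, not by table or column.
TAG_POLICIES = {
    "PII": {"allowed_groups": {"compliance"}, "expires": None},
    "TIME_LIMITED": {
        "allowed_groups": {"analyst", "compliance"},
        "expires": datetime(2017, 6, 30),
    },
}


def is_allowed(column, group, now):
    """Allow access only if every tag on the column permits this group now."""
    for tag in COLUMN_TAGS.get(column, set()):
        policy = TAG_POLICIES.get(tag)
        if policy is None:
            continue  # a tag with no policy imposes no restriction in this sketch
        if group not in policy["allowed_groups"]:
            return False
        if policy["expires"] is not None and now >= policy["expires"]:
            return False  # access window closed; the data itself is not deleted
    return True


before = datetime(2017, 4, 1)
after = datetime(2017, 7, 1)
print(is_allowed("customers.ssn", "analyst", before))
print(is_allowed("campaign.leads", "analyst", before))
print(is_allowed("campaign.leads", "analyst", after))
```

Untagged columns are allowed by default here; a real deployment would combine tag policies with resource-based ones.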
What’s There and Where Did It Come From?
Governance
Apache Atlas: Cross-Component Dataset Lineage
[Lineage diagram components: RDBMS, Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository.]
• Any process using Sqoop is covered
• No other tool tracks IOT out of the box
Apache Atlas Enables Business Catalog for Ease of Use
• Organize data assets along business terms
– Authoritative: Hierarchical Taxonomy Creation
– Agile modeling: Model Conceptual, Logical, Physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)
• Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business Analysts
– Object auditing to track “Who did it”
– Metadata Versioning to track “What did they do”
• Faster Insight:
– Data Quality tab for profiling and sampling
– User Comments
Key Benefits:
• Organize data assets along business terms
• Compliance Features
• Faster Insight
How Will My Users Interact With It?
Support for BI, Cubes, Data Science
Druid: Deep Multidimensional Analytics
[Diagram: Druid Data Cubes (Ultra-Fast Analytics, Slice-and-Dice) with streaming sources (Storm, Kafka, Spark) and historical sources (HDFS, S3); inputs shown are events, logs, transactions and sensors; consumers shown are real-time analytics, Hive / Spark, BI tools, the REST API and the Superset UI. Some items are marked "Future" in the legend.]
• Deep, Fast Drilldown Across Any Dimension
• Scalably Ingest Historical Data from Transactional and Web Systems
Druid’s Role in Scalable Data Warehousing
[Diagram layers:
UI: SQL BI Tools, MDX Tools, Superset UI (Fast Exploration)
Unified SQL and MDX Layer: HiveServer2 (Hive SQL), Thrift Server (SparkSQL), MDX; Fast SQL, MDX
Core Platform: Hive, Druid (OLAP Indexes), Realtime Feeds (Kafka, Storm, etc.), S3 or HDFS
Management: Ranger, Atlas, Ambari]
Analytics at Scale with No Data Movement
Source Data Systems
Syncsort: High-Performance Data Movement
– High performance data import from all major EDW platforms
Hadoop: Scalable Storage and Compute
Hive LLAP: High Performance SQL
– Fast, scalable SQL analytics
– Intelligent in-memory caching
AtScale Intelligence Platform: OLAP Cubes for Higher Performance
– Define OLAP cubes for 10x faster queries
– Unified semantic layer for all BI tools
Pre-aggregated data ... Or, full-fidelity data
Spark Column Security with LLAP
• Fine-Grained Column Level Access Control for SparkSQL.
• Fully dynamic policies per user. Doesn’t require views.
• Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query.
2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
3. Spark gets a modified query plan based on dynamic security policy.
4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
[Diagram components: Spark Client; HiveServer2 (Authorization); Hive Metastore (Data Locations, View Definitions); Ranger Server (Dynamic Policies); LLAP (Data Read, Filter Pushdown). Arrows are numbered 1 to 4 to match the flow.]
Apache Zeppelin, Attaches to Hive and Spark
But Wait, There’s More
Monitoring & Management
Data Lifecycle Management
Replication & D/R
Data Ingestion
Scalable Data Warehousing on Hadoop
Capabilities and Applications
Batch SQL
• ETL
• Reporting
• Data Mining
• Deep Analytics
OLAP / Cube
• Multidimensional Analytics
• MDX Tools
• Excel
Interactive SQL
• Reporting
• BI Tools: Tableau, Microstrategy, Cognos
Sub-Second SQL
• Ad-Hoc
• Drill-Down
• BI Tools: Tableau, Excel
ACID / MERGE
• Continuous Ingestion from Operational DBMS
• Slowly Changing Dimensions
Core Platform
• Scale-Out Storage
• Petabyte Scale Processing
• Core SQL Engine
• Apache Tez: Scalable Distributed Processing
• Advanced Cost-Based Optimizer
• Connectivity
• Advanced Security
• JDBC / ODBC
• Comprehensive SQL:2011 Coverage
• MDX
Legend: Existing, Development, Emerging
For More Details
• Today:
– Interactive Analytics at Scale in Apache Hive Using Druid – 12:20
– Information is Beautiful: Apache Zeppelin Edition – 14:10
– LLAP: Sub-Second Analytical Queries in Hive – 15:00
– Apache Atlas: Governance for Your Data – 16:10
– An Overview on Optimization in Apache Hive: Past, Present, Future – 16:10
– An Approach for Multi-Tenancy Through Apache Knox – 17:00
• Tomorrow:
– Cloudy with a Chance of Hadoop – Real World Considerations – 11:30
– Row/Column-Level Security in SQL for Apache Spark – 14:10
– Unleashing the Power of Apache Atlas with Apache Ranger – 15:00
– Birds of a Feather sessions – 17:50

Contenu connexe

Tendances

Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Next Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache StormNext Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache StormDataWorks Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghaiYifeng Jiang
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3DataWorks Summit
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeDataWorks Summit
 

Tendances (20)

Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Next Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache StormNext Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache Storm
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
 

Similaire à Hive edw-dataworks summit-eu-april-2017

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...Big Data Spain
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...VMware Tanzu
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveAldrin Piri
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPHortonworks
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationAbdelkrim Hadjidj
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善HortonworksJapan
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData DayJohn Park
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionMilind Pandit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerDataWorks Summit
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 

Similaire à Hive edw-dataworks summit-eu-april-2017 (20)

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Dynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDPDynamic Column Masking and Row-Level Filtering in HDP
Dynamic Column Masking and Row-Level Filtering in HDP
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 

Plus de alanfgates

Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019alanfgates
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018alanfgates
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache trainingalanfgates
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016alanfgates
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016alanfgates
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016alanfgates
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015alanfgates
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015alanfgates
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013alanfgates
 

Plus de alanfgates (15)

Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache training
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 

Hive edw-dataworks summit-eu-april-2017

  • 1. Scalable Data Warehousing on Hadoop Alan F. Gates, Co-founder, Hortonworks
  • 2. What Do You Expect in a Hadoop Data Warehouse? Benchmarks focus on two questions:
      – How much of the TPC-DS query set can it run?
      – How fast can it run it?
  • 3. What Do You Expect in a Data Warehouse? High Performance; SQL 2011; High Storage Capacity; Security; Support for BI, Cubes, Data Science; Monitoring & Management; Governance; Data Lifecycle Management; Replication & D/R; Workload Management; Data Ingestion
  • 4. So, back to TPC-DS... High Performance; SQL 2011
  • 5. Apache Hive Overview. Apache Hive is a SQL data warehouse engine that delivers fast, scalable SQL processing on Hadoop and in the Cloud. Features:
      • Extensive SQL:2011 Support
      • ACID Transactions
      • In-Memory Caching
      • Cost-Based Optimizer
      • User-Based Dynamic Security
      • JDBC and ODBC Support
      • Compatible with every major BI Tool
      • Proven at 300+ PB Scale
  • 6. Apache Hive: Fast Facts
      – Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
      – Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
      – Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
      – Largest Cluster: 4,500+ Nodes (Yahoo)
  • 7. Apache Hive: Journey to SQL:2011 Analytics
      – Data Types:
          Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
          String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
          Date, Time: DATE, TIMESTAMP, Interval Types
          Complex Types: ARRAY / MAP / STRUCT / UNION
          Nested Data Analytics: Nested Data Traversal; Lateral Views
          Procedural Extensions: HPL/SQL
      – SQL Features:
          Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
          Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
          ACID Transactions: INSERT / UPDATE / DELETE
          Constraints: Primary / Foreign Key (Non Validated)
      – File Formats:
          Columnar: ORCFile; Parquet
          Text: CSV; Logfile
          Nested / Complex: Avro; JSON; XML
          Custom Formats
          Other Features: XPath Analytics
      – Futures: ACID MERGE; Multi Subquery; Scalar Subqueries; Non-Equijoins; INTERSECT / EXCEPT; Recursive CTEs; NOT NULL Constraints; Default Values; Multi-statement Transactions
      – Legend: New; Future work (Hive 2.next)
      – Track Hive SQL:2011 Complete: HIVE-13554
  • 8. Hive 2 with LLAP: Architecture Overview. [Diagram labels: ODBC / JDBC SQL Queries; HiveServer2 (Query Endpoint); Query Coordinators (three Coordinators); YARN Cluster with four LLAP Daemons (Query Executors); In-Memory Cache (Shared Across All Users); Deep Storage: HDFS and Compatible, S3, WASB, Isilon]
  • 9. Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale. Hive 2 with LLAP averages 26x faster than Hive 1. [Chart: Query Time (s) (Lower is Better) and Speedup (x Factor); series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor)]
  • 10. Apache Hive vs. Apache Impala at 10TB. Highlights:
      • 10TB scale on 10 identical AWS nodes.
      • Hive and Impala showed similar times on most smaller queries.
      • Hive scaled better, with many queries completing in under 2 minutes where Impala ran to timeout (3000s).
  • 11. Apache Hive vs. Presto on a partitioned 1TB dataset. Highlights:
      • Presto lacks basic performance optimizations like dynamic partition pruning.
      • On a real dataset / workload Presto performs poorly without full rewrites.
      • Example: Query 55 without rewrites = 185.17s, with rewrites = 16s. LLAP = 1.37s.
  • 12. Hive LLAP: Stable Performance under High Concurrency
      Concurrent Queries    Average Runtime
      5                     7.76s
      25                    36.24s
      100                   102.89s
      – 5 to 25 concurrent queries: 5x Queries, 4.6x Runtime Difference
      – 25 to 100 concurrent queries: 4x Queries, 2.8x Runtime Difference
  • 13. How Much Can it Hold, and Where? High Storage Capacity
  • 14. Storage
      • Of course HDFS, default in the Hadoop world
      • More and more cloud
      • Move is copy in S3, but the current implementation assumes move is atomic and nearly free; Hadoop (HADOOP-11694) and Hive (HIVE-14535) are being modified
      • ACID in the cloud
          – Compactor moves a lot of files around, need to optimize
          – Need to figure out how streaming ingest works in the cloud
      • LLAP, caching much more valuable in the cloud
          – Looking at flushing cache to SSD so misses are less costly
  • 15. Is My Data Safe? Security
  • 16. Security today in Hadoop. Centralized Security Administration w/ Ranger & Knox
      – Authentication (Who am I/prove it?): Kerberos; API security with Apache Knox
      – Authorization (What can I do?): Fine grain access control with Apache Ranger
      – Audit (What did I do?): Centralized audit reporting w/ Apache Ranger
      – Data Protection (Can data be encrypted at rest and over the wire?): Wire encryption; HDFS encryption + Ranger KMS
  • 17. Authentication: API Security with Knox. Apache Knox extends the reach of Hadoop REST API without Kerberos complexities
      – Integrated with existing IdM systems
      – Single, simple point of access for a cluster
      – Centralized and consistent secure API across one or more clusters
      • Eliminates SSH “edge node”
      • Central API management
      • Central audit control
      • Service level authorization
      • SSO - SAMLv2, Siteminder and OAM
      • LDAP and AD integration
      • SSO for Hadoop UIs (Ranger, Ambari..)
      • Kerberos Encapsulation
      • Single Hadoop access point
      • REST API hierarchy
      • Consolidated API calls
      • Multi-cluster support
  • 18. Apache Ranger: Per-User Row Filtering by Region in Hive (LLAP Data Access)
      User ID    Region    Total Spend
      1          East      5,131
      2          East      27,828
      3          West      55,493
      4          West      7,193
      5          East      18,193
      – Original Query: SELECT * from CUSTOMERS WHERE total_spend > 10000
      – Query Rewrites based on Dynamic Ranger Policies
      – User 2 (East Region), Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'east'
      – User 1 (West Region), Dynamic Rewrite: SELECT * from CUSTOMERS WHERE total_spend > 10000 AND region = 'west'
  • 19. Apache Ranger: Dynamic Data Masking of Hive Columns. Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation! [Diagram labels: Ranger, Atlas, Hive]
      Goal: Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
      • Benefits
          – Sensitive information never leaves database
          – No changes are required at the application or Hive layer
          – No need to produce additional protected duplicate versions of datasets
          – Simple & easy to set up masking policies
      • Core Technologies: Ranger, Hive
  • 20. Dynamic Tag-based Access Policies with Apache Atlas
      • Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.
      • Geo-based policy – Policy based on IP address, proxy IP substitution may be required. The rule enforcement must be geo aware.
      • Time-based policy – Timer for data access, decoupled from deletion of data.
      • Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
      Key Benefits:
      – New scalable metadata based security paradigm
      – Dynamic, real-time policy
      – Active protection – fast updates to changes
      – Centralized and simple to manage policy
  • 21. What’s There and Where Did It Come From? Governance
  • 22. Apache Atlas: Cross-Component Dataset Lineage. [Diagram labels: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository, RDBMS]
      – Any process using Sqoop is covered
      – No other tool tracks IoT out of the box
  • 23. Apache Atlas Enables Business Catalog for Ease of Use
      • Organize data assets along business terms
          – Authoritative: Hierarchical Taxonomy Creation
          – Agile modeling: Model Conceptual, Logical, Physical assets
          – Definition and assignment of tags like PII (Personally Identifiable Information)
      • Comprehensive features for compliance
          – Multiple user profiles including Data Steward and Business Analysts
          – Object auditing to track “Who did it”
          – Metadata Versioning to track “what did they do”
      • Faster Insight:
          – Data Quality tab for profiling and sampling
          – User Comments
      Key Benefits: Organize data assets along business terms; Compliance Features; Faster Insight
  • 24. How Will My Users Interact With It? Support for BI, Cubes, Data Science
  • 25. Druid: Deep Multidimensional Analytics. [Diagram labels: Real-Time Analytics; Hive / Spark; BI Tools; REST API; Superset UI; Events, Logs, Transactions, Sensors; Historical Sources: HDFS, S3; Druid Data Cubes; Ultra-Fast Analytics; Slice-and-Dice; Streaming Sources: Storm, Kafka, Spark; legend: = Future]
      – Deep, Fast Drilldown Across Any Dimension
      – Scalably Ingest Historical Data from Transactional and Web Systems
  • 26. Druid’s Role in Scalable Data Warehousing. [Diagram labels: UI; Core Platform; S3 or HDFS; HiveServer2; MDX; Unified SQL and MDX Layer; SQL BI Tools; MDX Tools; Hive; Realtime Feeds (Kafka, Storm, etc.); Druid OLAP Indexes; HiveServer2: Hive SQL; Thrift Server: SparkSQL; Fast SQL; MDX; Superset UI: Fast Exploration; Management: Ranger, Atlas, Ambari]
  • 27. Analytics at Scale with No Data Movement
      – Source Data Systems
      – Syncsort: High-Performance Data Movement
      – Hadoop: Scalable Storage and Compute
      – Hive LLAP: High Performance SQL
      – AtScale Intelligence Platform: OLAP Cubes for Higher Performance
      • Fast, scalable SQL analytics
      • Intelligent in-memory caching
      • Define OLAP cubes for 10x faster queries
      • Unified semantic layer for all BI tools
      • High performance data import from all major EDW platforms
      • Pre-aggregated data ... Or, full-fidelity data
  • 28. Spark Column Security with LLAP
      • Fine-Grained Column Level Access Control for SparkSQL.
      • Fully dynamic policies per user. Doesn’t require views.
      • Use Standard Ranger policies and tools to control access and masking policies.
      Flow:
      1. SparkSQL gets data locations known as “splits” from HiveServer2 and plans query.
      2. HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.
      3. Spark gets a modified query plan based on dynamic security policy.
      4. Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.
      [Diagram labels: Spark Client; HiveServer2: Authorization; Hive Metastore: Data Locations, View Definitions; LLAP: Data Read, Filter Pushdown; Ranger Server: Dynamic Policies; arrows numbered 1 to 4]
  • 29. Apache Zeppelin, Attaches to Hive and Spark
  • 30. But Wait, There’s More: Monitoring & Management; Data Lifecycle Management; Replication & D/R; Data Ingestion
  • 31. Scalable Data Warehousing on Hadoop
      Capabilities and Applications:
      – Batch SQL: ETL; Reporting; Data Mining; Deep Analytics
      – OLAP / Cube: Multidimensional Analytics; MDX Tools; Excel
      – Interactive SQL: Reporting; BI Tools: Tableau, Microstrategy, Cognos
      – Sub-Second SQL: Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
      – ACID / MERGE: Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions
      Legend: Existing; Development; Emerging
      Core Platform: Scale-Out Storage; Petabyte Scale Processing
      Core SQL Engine: Apache Tez: Scalable Distributed Processing; Advanced Cost-Based Optimizer; Connectivity; Advanced Security; JDBC / ODBC; Comprehensive SQL:2011 Coverage; MDX
  • 32. For More Details
      • Today:
          – Interactive Analytics at Scale in Apache Hive Using Druid – 12:20
          – Information is Beautiful: Apache Zeppelin Edition – 14:10
          – LLAP: Sub-Second Analytical Queries in Hive – 15:00
          – Apache Atlas: Governance for Your Data – 16:10
          – An Overview on Optimization in Apache Hive: Past, Present, Future – 16:10
          – An Approach for Multi-Tenancy Through Apache Knox – 17:00
      • Tomorrow:
          – Cloudy with a Chance of Hadoop – Real World Considerations – 11:30
          – Row/Column-Level Security in SQL for Apache Spark – 14:10
          – Unleashing the Power of Apache Atlas with Apache Ranger – 15:00
          – Birds of a Feather sessions – 17:50
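
The per-user row filtering described on slide 18 can be sketched roughly as follows. This is a toy illustration only, using Python and an in-memory SQLite table; it is not how Ranger or Hive implement the rewrite. The names `ROW_FILTERS`, `rewrite_for_user` and `run_as`, the user names, and the deny-by-default behaviour for unknown users are all assumptions of this sketch. The predicates match the stored capitalization ('East', 'West') because SQLite compares strings case-sensitively by default.

```python
import sqlite3

# Hypothetical per-user row-filter predicates, standing in for Ranger policies.
ROW_FILTERS = {
    "user1": "region = 'West'",
    "user2": "region = 'East'",
}

# The original query from slide 18 (table name lower-cased for this sketch).
QUERY = "SELECT * FROM customers WHERE total_spend > 10000"


def make_db() -> sqlite3.Connection:
    """Build an in-memory table holding the five rows shown on slide 18."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "East", 5131), (2, "East", 27828), (3, "West", 55493),
         (4, "West", 7193), (5, "East", 18193)],
    )
    return conn


def rewrite_for_user(query: str, user: str) -> str:
    """Append the user's row filter to a query that already has a WHERE clause."""
    predicate = ROW_FILTERS.get(user)
    if predicate is None:
        # Assumption of this sketch: users without a policy see no rows.
        predicate = "1 = 0"
    return f"{query} AND {predicate}"


def run_as(conn: sqlite3.Connection, user: str, query: str) -> list:
    """Run the query as the given user, after applying the rewrite."""
    return conn.execute(rewrite_for_user(query, user)).fetchall()
```

With the slide's data, `run_as(make_db(), "user2", QUERY)` returns only the East rows with spend above 10,000 (user IDs 2 and 5), and `"user1"` gets only user ID 3. The point of the design is that the application sends the same query for every user and the filter is added centrally.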

Editor's notes

  1. Notes: This being Apache, there are 50 ways to assemble this; I’m going to cover one. There are a lot of parts in the picture, and I won’t be able to cover them all. For several of these I want to look at what’s there now, and what communities are working on to improve this experience.
  2. What tools/processes have you tried to attach to Hadoop and have been unable to do so? Why?
  3. Hive version = Hive 2.1; Presto version = Presto 0.163
  4. Additional details: Queries are 20 interactive queries taken from TPC-DS, using 1 TB of data from the TPC-DS schema. Queries are run at random using a JMeter test harness.
  5. Extend reach of Hadoop APIs. Gateway for Hadoop’s REST API. Different REST APIs have varying levels of AuthN, AuthZ, SSL, SSO capability. Enterprise authentication: apply enterprise capabilities to all REST APIs – IdM Integration, SSO, OAuth, SAML. Avoid exposing cluster ports and hostnames to all users.
  6. Show – clearly identify customer metadata. Change: add customer classification example – Aetna – make the use case story have continuity. Use DX procedures to diagnosis ** bring meta from external systems into Hadoop – keep it together
  7. Maps well to Yahoo case
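
The concurrency figures on slide 12 (the benchmark described in note 4) can be checked with a little arithmetic. The following is a minimal Python sketch that uses only the three data points reported on the slide; the names `RESULTS` and `scaling` are hypothetical.

```python
# (concurrent queries, average runtime in seconds), as reported on slide 12.
RESULTS = [(5, 7.76), (25, 36.24), (100, 102.89)]


def scaling(results):
    """For each consecutive pair of measurements, return
    (increase factor in concurrent queries, increase factor in average runtime)."""
    steps = []
    for (q0, t0), (q1, t1) in zip(results, results[1:]):
        steps.append((q1 / q0, t1 / t0))
    return steps


for load, runtime in scaling(RESULTS):
    print(f"{load:.0f}x queries -> {runtime:.2f}x average runtime")
```

This prints a runtime factor of 4.67x for the 5x step and 2.84x for the 4x step, which the slide reports as 4.6x and 2.8x. In both steps the average runtime grows more slowly than the number of concurrent queries.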