SlideShare a Scribd company logo
1 of 42
Grill
HQL Tiered Data Warehouse
Amareshwari Sriramadasu
Suma Shivaprasad
Amareshwari
Sriramadasu
• Engineer at Inmobi
• Working in Hadoop and eco
systems since 2007
• Apache Hadoop – PMC
• Apache Hive– Committer
Suma
Shivaprasad
• Engineer at Inmobi
• Working in Hadoop and eco
systems since 2011
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
Digital advertising at Inmobi
Courtesy: http://www.liesdamnedlies.com/
Owns & Sells
Real estate
on digital
inventory
Has reach to
users
Wants to
target Users
Brings money
Market place
Consumer
Publisher/Advertiser specific
analytics
Understanding trends & inference
Feedback to improve Ad
Relevance in Real Time
Sizing & Estimation (Ex: Inventory
sizing, Forecasting)
Tracking metrics and Anamoly
detection
Debugging / Postmortem of issues
(troubleshooting)
Analytics Use cases
Advertisers and publishers
Account/Regional managers
Executive team
Business/Product analysts
Data scientists
Engineering systems
Developers
• Canned/Dashboard queries
• Adhoc queries
• Interactive/Batch queries
• Scheduled queries
• Infer insights through ML algorithms
Categorize the use cases
Adhoc querying system Internal Dashboards
Customer facing
Dashboards and
Reporting
Analytics systems at Inmobi
Analytics Warehouses at Inmobi and Scale
 Billions of Ad Requests and Impressions per day
 170 TB Hadoop warehouse
 5 TB SQL Columnar Datawarehouse
 Hbase cluster
 DBMS(Postgres, Mysql)
 Spark/Shark in near future
Why both Hadoop and SQL warehouse?
Canned
Adhoc
Response Times
IO (Input
Records)
Adhoc
Canned Adhoc
Query Engine
Query
Dashboard queries
are mostly canned
and Interactive
Adhoc queries can be
Interactive or batch
depending on the
data volumes and
query complexity
• Disparate user experience
• Disparate data storage systems
• Schema management across storages
• Data discovery
• Not leveraging ‘SQL on Hadoop’ community
Problems
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
Associates structure to data
Provides Metastore and
catalog service – Hcatalog
Provides pluggable storage
interface
Accepts SQL like queries
HQL is widely adopted
language by systems like
Shark, Impala
Has strong apache
community
Data warehouse features like
facts, dimensions
Logical table associated with
multiple physical storages
Pluggable execution engine
Interactive queries
Query lifecycle management
Query quota management
Scheduling queries
Plugging in Machine learning
models
WhatdoesHiveprovide
WhatismissinginHive
Apache Hive to the rescue
Data Layout – OLAP Cube
Dimensions, Measures
Cost
Data Layout – Snowflake
Aggr Factk :
measures (mak <=
ma(k-1)), dims (dak <
da(k-1))
…..
Aggr Fact2 : measures (ma2 <=
ma1), dimensions (da2 < da1)
Aggr Fact1 : measures (ma1 <= mr),
dimensions (da1 < dr)
Raw Fact: measures (mr), dimensions(dr)
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
Data Model
Cube Storage Fact Table
Physical
Fact tables
Dimension
Table
Physical
Dimension
tables
Data Model - Cube
Dimension
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Timed dimension
Measure
• Column Measure
• Expression Measure
Cube
Measures Dimensions
Note : Some of the concepts are borrowed from
http://community.pentaho.com/projects/mondrian/
Data Model – Storage
Storage
• Name
• End point
• Properties
• Ex : ProdCluster, StagingCluster, Postgres1,
HBase1, HBase2
Data Model – Fact Table
Fact
table
Cube
Fact
table
Storage
FactTable
• Columns
• Cube that it belongs
• Storages on which it is
present and the
associated update
periods
Data Model – Dimension table
DimensionTable
• Columns
• Dimension references
• Storages on which it is
present and associated
snapshot dump period, if
any.
Cube
Dimension
table
Dimension
table
Dimension
table
Storage
Data Model – Storage tables and partitions
Storagetable
• Belongs to fact/dimension
• Associated storage descriptor
• Partitioned by columns
• Naming convention – storage
name followed by
fact/dimension name
• Partition can override its
storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
Data Model
Aggrk Fact
Table
…..
Aggr2 Fact Table
Aggr1 Fact Table
Raw FactTable
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
CUBE
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
CUBE SELECT [DISTINCT] select_expr,
select_expr, ...
FROM cube_table_reference
WHERE [where_condition AND]
TIME_RANGE_IN(colName , from, to)
[GROUP BY col_list]
[HAVING having_expr]
[ORDER BY colList]
[LIMIT number]
cube_table_reference:
cube_table_factor
| join_table
join_table:
cube_table_reference JOIN cube_table_factor
[join_condition]
| cube_table_reference {LEFT|RIGHT|FULL} [OUTER]
JOIN cube_table_reference [join_condition]
cube_table_factor:
cube_name [alias]
| ( cube_table_reference )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
colOrder: ( ASC | DESC )
colList : colName colOrder? (',' colName colOrder?)*
Queries on OLAP cubes
Resolve candidate tables and storages
Automatically resolve joins
Resolve aggregates and groupby expressions
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multi time range queries
All Hive QL features
Query features
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable
citytable LIMIT 100
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable
citytable WHERE (citytable.dt = 'latest') LIMIT 100
cube select name, stateid from citytable limit 100
Example query
Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER
JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE ((
testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-
10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR
(testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-
10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR
(testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-
10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR
(testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-
10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR
(testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-
00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt
= 'latest')
GROUP BY(citytable.name)
cube select citytable.name, msr2 from testcube where
timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
Available in Hive
• Data warehouse features
like facts, dimensions
• Logical table associated
with multiple physical
storages
Available in Grill
• Pluggable execution engine for
HQL
• Query life cycle management
• Scheduling queries
Where is it available
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
GRILL
Unify the Query layer for Adhoc/Canned
Batch/Interactive
Reports on single Interface
Grill Architecture
Implements an interface
• execute
• explain
• executeAsynchronously
• fetchResults
• Specify all storages it can support
Pluggable execution engine
OLAP Cube QL query
Rewrite query for available execution engine’s
supported storages
Get cost of the rewritten query from each
execution engine
Pick up execution engine with least cost and
fire the query
Cube query with multiple execution engines
Grill Server
Grill server – code status
• Richer statistics
• Query scheduler service
• Query caching
• Authentication and authorization
• Machine learning through Grill
• User query quota management
Grill roadmap
• Number of queries - 700 to 900 per day
• Number of dimension tables - 125
• Number of fact tables – 30
• Number cubes – 22
• Size of the data
• Total size – 170 TB
• Dimension data – 400 MB compressed per hour
• Raw data - 1.2 TB per day
• Aggregated facts- 53GB per day
Data ware house statistics
Jira reference
• https://issues.apache.org/jira/browse/HIVE-4115
Github source for Hive
• https://github.com/InMobi/hive
Github source for grill
• https://github.com/InMobi/grill
Documentation
• http://inmobi.github.io/grill
Mailing lists
• grill-users@googlegroups.com
• grill-dev@googlegroups.com
References
Analytics at Inmobi - Problem areas and Motivation
Why Apache Hive
OLAP Model in Apache Hive
Query examples
Grill – Unified analytics
Demo
Agenda
Demo
Thank you!
• amareshwari@apache.org
• suma.shivaprasad@inmobi.com

More Related Content

What's hot

Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecYang Li
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataLuke Han
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine TourLuke Han
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupLuke Han
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataShi Shao Feng
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanLuke Han
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanLuke Han
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin IntroductionLuke Han
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting startedShubham Shirude
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Seshu Adunuthula
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Never late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNNever late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNDataWorks Summit
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...Luke Han
 

What's hot (20)

Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big Data
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big data
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and Japan
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin Introduction
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting started
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Never late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARNNever late again! Job-Level deadline SLOs in YARN
Never late again! Job-Level deadline SLOs in YARN
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
 

Viewers also liked

Hibernate Session 4
Hibernate Session 4Hibernate Session 4
Hibernate Session 4b_kathir
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernatehr1383
 
Hibernate Presentation
Hibernate  PresentationHibernate  Presentation
Hibernate Presentationguest11106b
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 

Viewers also liked (6)

Hibernate Session 4
Hibernate Session 4Hibernate Session 4
Hibernate Session 4
 
14 hql
14 hql14 hql
14 hql
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Introduction to hibernate
Introduction to hibernateIntroduction to hibernate
Introduction to hibernate
 
Hibernate Presentation
Hibernate  PresentationHibernate  Presentation
Hibernate Presentation
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 

Similar to Unified analytics with Apache Hive and Grill

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apacheconamarsri
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData WebinarSnappyData
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingLuke Han
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud confamarsri
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 

Similar to Unified analytics with Apache Hive and Grill (20)

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apachecon
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application code
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Unified analytics with Apache Hive and Grill

  • 1. Grill HQL Tiered Data Warehouse Amareshwari Sriramadasu Suma Shivaprasad
  • 2. Amareshwari Sriramadasu • Engineer at Inmobi • Working in Hadoop and eco systems since 2007 • Apache Hadoop – PMC • Apache Hive– Committer
  • 3. Suma Shivaprasad • Engineer at Inmobi • Working in Hadoop and eco systems since 2011
  • 4. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 5. Digital advertising at Inmobi Courtesy: http://www.liesdamnedlies.com/ Owns & Sells Real estate on digital inventory Has reach to users Wants to target Users Brings money Market place Consumer
  • 6. Publisher/Advertiser specific analytics Understanding trends & inference Feedback to improve Ad Relevance in Real Time Sizing & Estimation (Ex: Inventory sizing, Forecasting) Tracking metrics and Anamoly detection Debugging / Postmortem of issues (troubleshooting) Analytics Use cases Advertisers and publishers Account/Regional managers Executive team Business/Product analysts Data scientists Engineering systems Developers
  • 7. • Canned/Dashboard queries • Adhoc queries • Interactive/Batch queries • Scheduled queries • Infer insights through ML algorithms Categorize the use cases
  • 8. Adhoc querying system Internal Dashboards Customer facing Dashboards and Reporting Analytics systems at Inmobi
  • 9. Analytics Warehouses at Inmobi and Scale  Billions of Ad Requests and Impressions per day  170 TB Hadoop warehouse  5 TB SQL Columnar Datawarehouse  Hbase cluster  DBMS(Postgres, Mysql)  Spark/Shark in near future
  • 10. Why both Hadoop and SQL warehouse? Canned Adhoc Response Times IO (Input Records) Adhoc Canned Adhoc Query Engine Query Dashboard queries are mostly canned and Interactive Adhoc queries can be Interactive or batch depending on the data volumes and query complexity
  • 11. • Disparate user experience • Disparate data storage systems • Schema management across storages • Data discovery • Not leveraging ‘SQL on Hadoop’ community Problems
  • 12. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 13. Associates structure to data Provides Metastore and catalog service – Hcatalog Provides pluggable storage interface Accepts SQL like queries HQL is widely adopted language by systems like Shark, Impala Has strong apache community Data warehouse features like facts, dimensions Logical table associated with multiple physical storages Pluggable execution engine Interactive queries Query lifecycle management Query quota management Scheduling queries Plugging in Machine learning models WhatdoesHiveprovide WhatismissinginHive Apache Hive to the rescue
  • 14. Data Layout – OLAP Cube Dimensions, Measures Cost
  • 15. Data Layout – Snowflake Aggr Factk : measures (mak <= ma(k-1)), dims (dak < da(k-1)) ….. Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw Fact: measures (mr), dimensions(dr) Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1
  • 16. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 17. Data Model Cube Storage Fact Table Physical Fact tables Dimension Table Physical Dimension tables
  • 18. Data Model - Cube Dimension • Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension Measure • Column Measure • Expression Measure Cube Measures Dimensions Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
  • 19. Data Model – Storage Storage • Name • End point • Properties • Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
  • 20. Data Model – Fact Table Fact table Cube Fact table Storage FactTable • Columns • Cube that it belongs • Storages on which it is present and the associated update periods
  • 21. Data Model – Dimension table DimensionTable • Columns • Dimension references • Storages on which it is present and associated snapshot dump period, if any. Cube Dimension table Dimension table Dimension table Storage
  • 22. Data Model – Storage tables and partitions Storagetable • Belongs to fact/dimension • Associated storage descriptor • Partitioned by columns • Naming convention – storage name followed by fact/dimension name • Partition can override its storage descriptor • Fact storage table Fact table • Dimension storage table Dimension table
  • 23. Data Model Aggrk Fact Table ….. Aggr2 Fact Table Aggr1 Fact Table Raw FactTable Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1 CUBE
  • 24. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 25. CUBE SELECT [DISTINCT] select_expr, select_expr, ... FROM cube_table_reference WHERE [where_condition AND] TIME_RANGE_IN(colName , from, to) [GROUP BY col_list] [HAVING having_expr] [ORDER BY colList] [LIMIT number] cube_table_reference: cube_table_factor | join_table join_table: cube_table_reference JOIN cube_table_factor [join_condition] | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition] cube_table_factor: cube_name [alias] | ( cube_table_reference ) join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression colOrder: ( ASC | DESC ) colList : colName colOrder? (',' colName colOrder?)* Queries on OLAP cubes
  • 26. Resolve candidate tables and storages Automatically resolve joins Resolve aggregates and groupby expressions Allows SQL over Cube QL Queries can span multiple storages Accepts multi time range queries All Hive QL features Query features
  • 27. • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100 cube select name, stateid from citytable limit 100 Example query
  • 28. Example query • SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03- 10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03- 10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03- 10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03- 10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12- 00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest') GROUP BY(citytable.name) cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
  • 29. Available in Hive • Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages Available in Grill • Pluggable execution engine for HQL • Query life cycle management • Scheduling queries Where is it available
  • 30. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 31. GRILL Unify the Query layer for Adhoc/Canned Batch/Interactive Reports on single Interface
  • 33. Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support Pluggable execution engine
  • 34. OLAP Cube QL query Rewrite query for available execution engine’s supported storages Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query Cube query with multiple execution engines
  • 36. Grill server – code status
  • 37. • Richer statistics • Query scheduler service • Query caching • Authentication and authorization • Machine learning through Grill • User query quota management Grill roadmap
  • 38. • Number of queries - 700 to 900 per day • Number of dimension tables - 125 • Number of fact tables – 30 • Number cubes – 22 • Size of the data • Total size – 170 TB • Dimension data – 400 MB compressed per hour • Raw data - 1.2 TB per day • Aggregated facts- 53GB per day Data ware house statistics
  • 39. Jira reference • https://issues.apache.org/jira/browse/HIVE-4115 Github source for Hive • https://github.com/InMobi/hive Github source for grill • https://github.com/InMobi/grill Documentation • http://inmobi.github.io/grill Mailing lists • grill-users@googlegroups.com • grill-dev@googlegroups.com References
  • 40. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
  • 41. Demo
  • 42. Thank you! • amareshwari@apache.org • suma.shivaprasad@inmobi.com

Editor's Notes

  1. Inmobi provides marketplace, where it buys the space on mobile from publishers and sells it to advertisers, meanwhile it acquires users.
  2. Adhoc querying system Adhoc and Batch queries Scheduled queries Based on Hadoop Mapreduce Provides UI and custom api Data is stored in HDFS Dashboard system Canned reports Interactive and adhoc queries Provides UI and Custom api Data is stored in columnar DWH Customer facing system Face to the outside world (Advertisers and publishers) Interactive and adhoc queries Provides UI and custom api Data is stored in relational DB, Postgres Conventional columnar databases (RDBMS) systems lend themselves well for interactive SQL queries over reasonably small datasets in the order of 10-100s of GB, while hadoop based warehouses operate well over large datasets in the order of TBs and PBs and scales fairly linearly. Though there have been some improvements recently in storage structures in the Hadoop warehouses such as ORC, queries over hadoop still typically adopts a full scan approach. Choosing between these different data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
  3. Inmobi has 130TB hadoop warehouse and 5TB SQL warehouse. Let us see an example of reporting page. This is the dashboard a publisher sees.
  4. Individually all the systems we just saw work really great! They provide best time responses to user queries. Disparate user experience because of multiple reporting systems Involves a learning curve for systems and their api Disparate data storage systems causing inability to scale Altering schema involves different systems Data discovery Cannot leverage data in other systems Not leveraging community around Cannot experiment with new storage/execution engine out of the box
  5. Column Measure : name, type, default aggregate, format string, start date, end date Expression Measure : Associated Expression Simple Dimension: name, type, start date, end date Referenced Dimension : Referencing table and column Hierarchical Dimension :hierarchy Expression Dimension : Associated expression
  6. The grammar is subset of HQL Resolve candidate dimension tables and the storage tables . Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid. Resolve fact storage tables for the queried time range. Automatically resolve joins using the relationships between cubes and dimension. Automatically add aggregate functions to measures. Add expression to group by clause, if projected; and project group by clause, if it is not.
  7. Resolve candidate dimension tables and the storage tables . Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid. Resolve fact storage tables for the queried time range. Automatically resolve joins using the relationships between cubes and dimension. Automatically add aggregate functions to measures. Add expression to group by clause, if projected; and project group by clause, if it is not.