Run your aggregation queries at a
speed of 14x without spending $$$
By: Bhavya Aggarwal (CTO, Knoldus)
Sangeeta Gulia (Software Consultant, Knoldus)
Agenda
● What are aggregates?
● Why are aggregates slow?
● SQL queries and big data
● Techniques to prevent a full scan
● Partitioning and bucketing
● What is pre-aggregation?
● Advantages of pre-aggregation
● Trade-offs with pre-aggregation
● How can we pre-aggregate data?
● Suggestions for pre-aggregation
“The single most dramatic way to affect performance in a large
data warehouse is to provide a proper set of aggregate
(summary) records that coexist with the primary base records.
Aggregates can have a very significant effect on performance, in
some cases speeding queries by a factor of one hundred or even
one thousand. No other means exist to harvest such spectacular
gains.” - Ralph Kimball
Commonly Used Aggregates
Function    Description
MIN         returns the smallest value in a given column
MAX         returns the largest value in a given column
SUM         returns the sum of the numeric values in a given column
AVG         returns the average value of a given column
COUNT       returns the number of non-NULL values in a given column
COUNT(*)    returns the number of rows in a table
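As a small illustration of the functions listed above, the following sketch runs all of them over one column using Python's built-in sqlite3 module. The table `t` and its three rows are hypothetical; the NULL row is there to show the difference between COUNT(column) and COUNT(*).

```python
import sqlite3

# Hypothetical one-column table with one NULL value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (price REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(10.0,), (30.0,), (None,)])

# All aggregates skip the NULL except COUNT(*), which counts rows.
row = conn.execute(
    "SELECT MIN(price), MAX(price), SUM(price), AVG(price), "
    "COUNT(price), COUNT(*) FROM t"
).fetchone()
print(row)
```

Here COUNT(price) sees two values while COUNT(*) sees three rows, and AVG divides by the two non-NULL values only.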
Why are Aggregates Slow
● Computing an aggregate requires iterating over the whole data set and
reading every record.
● When the data is large, computing aggregate functions takes a long
time.
Interactive SQL Queries
● When we need to extract meaningful information from stored data, the
first thing that comes to mind is SQL.
● Almost everyone is comfortable using SQL to explore data and extract
meaningful information from it.
● The problem starts when we have to work on large data
(terabytes/petabytes).
How we store data is important
● We need to analyse the application first, before deciding how to store
the data
● We should understand the use cases and which data will make sense for
the business users
● Which storage format is best suited (columnar storage)?
− ORC
− Parquet
− CarbonData
Techniques to prevent full scan
● Partitioning
● Bucketing
● Preaggregation
− Compaction
Partitioning
● Partitioning divides a table's data into folders based on the values of
the partition columns. This improves performance and helps organize the
data in a logical fashion.
● Always partition on a low-cardinality column.
Understanding Partitioning
● Example: suppose we are dealing with a large employee table and often
run queries with WHERE clauses that restrict the results to a particular
country or department.
CREATE TABLE employees
(id INT, name STRING, dob STRING)
PARTITIONED BY (country STRING, DEPT STRING);
● For a faster query response, the Hive table can be partitioned.
Partitioning changes how Hive structures the data storage: Hive creates
subdirectories reflecting the partitioning structure, such as
.../employees/country=ABC/DEPT=XYZ.
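The effect of that directory layout on a filtered query can be sketched in a few lines of Python. This is an in-memory stand-in, not Hive itself: the `partition_path`, `write_partitioned` and `scan` helpers and the sample rows are hypothetical, and only the `country=.../DEPT=...` path convention comes from the slide.

```python
from collections import defaultdict

# Hypothetical stand-in for a partitioned table: one "directory" per
# (country, dept) pair, mirroring .../employees/country=ABC/DEPT=XYZ.
def partition_path(country, dept):
    return f"employees/country={country}/DEPT={dept}"

def write_partitioned(rows):
    """Group rows into partition directories keyed by their path."""
    store = defaultdict(list)
    for row in rows:
        store[partition_path(row["country"], row["dept"])].append(row)
    return store

def scan(store, country=None, dept=None):
    """Return matching rows and the number of partitions actually read."""
    matched, partitions_read = [], 0
    for path, rows in store.items():
        # Prune: skip whole directories whose path cannot match the filter.
        if country is not None and f"country={country}/" not in path:
            continue
        if dept is not None and not path.endswith(f"DEPT={dept}"):
            continue
        partitions_read += 1
        matched.extend(rows)
    return matched, partitions_read

rows = [
    {"id": 1, "name": "a", "country": "IN", "dept": "ENG"},
    {"id": 2, "name": "b", "country": "IN", "dept": "HR"},
    {"id": 3, "name": "c", "country": "US", "dept": "ENG"},
    {"id": 4, "name": "d", "country": "US", "dept": "HR"},
]
store = write_partitioned(rows)
result, partitions_read = scan(store, country="IN", dept="ENG")
print(partitions_read, [r["id"] for r in result])
```

With a filter on both partition columns, only one of the four directories is read; the other three are skipped without looking at their rows.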
Bucketing
● Hive's bucketing feature distributes the data of a table or partition
into multiple files, so that similar records end up in the same file. The
file is chosen by some logic, usually a hash of the bucketing column.
● Bucketing works well when the field has high cardinality and the data
is evenly distributed among the buckets.
Understanding Bucketing
● For example, suppose a table uses date as the top-level partition and
employee_id as the second-level partition. This leads to too many small
partitions.
● Instead, if we bucket the employee table and use employee_id as the
bucketing column, the value of this column is hashed into a user-defined
number of buckets.
CREATE TABLE employees
(id INT, name STRING, age INT)
PARTITIONED BY (dob STRING)
CLUSTERED BY (id) INTO 5 BUCKETS;
Understanding Bucketing (cont...)
Hive will then store the data in a directory hierarchy like
/user/hive/warehouse/employees/dob=2001-02-01
with 5 bucket files inside that directory:
00000_0
00001_0
........
00004_0
Here dob=2001-02-01 is the partition and the 0000N_0 files are the buckets
in that partition.
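How a row lands in one of the five bucket files can be sketched as follows. This is a simplified model, assuming an integer key hashes to itself so that the bucket is the id modulo the bucket count; the helper names are hypothetical and the file-name pattern simply copies the one shown on the slide.

```python
NUM_BUCKETS = 5  # matches CLUSTERED BY (id) INTO 5 BUCKETS

def bucket_for(employee_id, num_buckets=NUM_BUCKETS):
    # Assumption: an integer key hashes to itself, so the bucket is id mod N.
    return employee_id % num_buckets

def bucket_file(bucket):
    # File naming follows the slide's 00000_0 ... 00004_0 pattern.
    return f"{bucket:05d}_0"

def assign(ids):
    """Place every id into its bucket and return the bucket contents."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for i in ids:
        buckets[bucket_for(i)].append(i)
    return buckets

buckets = assign(range(1, 21))
sizes = [len(v) for v in buckets.values()]
print(sizes, bucket_file(bucket_for(9)))
```

Because the same id always maps to the same bucket, two tables bucketed on the same column into the same number of buckets can be joined bucket by bucket, which is what makes map-side joins and sampling cheap.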
How Bucketing Improves Performance
● Fast Map side Joins
● Efficient Group by
● Sampling
Dimension Table
● In a dimensional model, the context of the measurements is represented
in dimension tables. You can think of the context of a measurement as its
characteristics: the who, what, where, when, and how.
● In a Sales business process, the characteristics of the 'monthly sales
number' measurement can be Location (where), Time (when), and Product
Sold (what).
● A dimension table captures the information about each entity that is
referenced in a fact table.
Fact Table
● Table that contains the measures of interest.
● Fact tables contain the data corresponding to a particular
business process.
● Each row represents a single event associated with a process and
contains the measurement data associated with that event.
Order_ID  Order_Date  Customer_Id  Product_Id  Quantity  Price  Total
1212      12/12/2017  13243        1           2         20     40
1213      12/12/2017  13289        456         8         60     50
Dimension Aggregates
● Methods to define aggregates
● Include or leave out entire dimensions
● Include some columns of each dimension, but not others
− Second approach is more common / more useful
● Example: Several versions of Date dimension
− Base Date dimension
● 1 row per day
− Monthly aggregate dimension
● 1 row per month
− Yearly aggregate dimension
● 1 row per year
Choosing Dimension Aggregates
● Dimension aggregates often roll up along hierarchies
− Day → month → year
− Store → area_code → state → country
● Any subset of dimension attributes can be used
− Customer aggregate might include only a few
frequently-queried columns (age, gender, income, marital
status)
● Goal: reduced number of distinct combinations
− Results in fewer rows in aggregated fact
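The row-count reduction from rolling up along the date hierarchy can be sketched as below. The daily facts (90 days, two stores, quantity 1 per row) and the `roll_up` helper are hypothetical and exist only to show how the number of distinct combinations shrinks at each level.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily facts: (day, store, quantity) for 90 days and 2 stores.
start = date(2017, 1, 1)
facts = [(start + timedelta(days=d), store, 1)
         for d in range(90) for store in ("S1", "S2")]

def roll_up(rows, key):
    """Sum quantity per distinct key; one output row per key value."""
    totals = defaultdict(int)
    for day, store, qty in rows:
        totals[key(day, store)] += qty
    return dict(totals)

by_day = roll_up(facts, lambda d, s: (d, s))
by_month = roll_up(facts, lambda d, s: (d.year, d.month, s))
by_year = roll_up(facts, lambda d, s: (d.year, s))
print(len(by_day), len(by_month), len(by_year))
```

The same 180 base rows collapse to 6 rows at month level and 2 rows at year level, while the totals stay consistent across levels.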
Pre-Aggregation
● Aggregate fact tables are simple numeric rollups of atomic fact
table data built solely to accelerate query performance
● It is important to remember that aggregation is a form of data
redundancy, because the aggregations are computed from other
warehouse values.
● For obvious performance reasons, the aggregations are
pre-calculated and loaded into the warehouse during off-hours.
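A minimal sketch of the idea, using Python's sqlite3 module rather than a real warehouse: the table names `sales` and `agg_sales` and the columns follow the example queries later in the deck, while the five rows and the `sum_*`/`count_*` column names are hypothetical. AVG is stored as a SUM and a COUNT, an assumption made here so that the average can still be derived correctly at a coarser grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (country TEXT, sex TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("IN", "F", 2, 10.0), ("IN", "F", 3, 20.0),
     ("IN", "M", 1, 30.0), ("US", "F", 4, 40.0), ("US", "M", 5, 50.0)],
)

# Batch step: materialise the rollup once, at (country, sex) level.
conn.execute("""
    CREATE TABLE agg_sales AS
    SELECT country, sex,
           SUM(quantity) AS sum_quantity,
           SUM(price)    AS sum_price,
           COUNT(price)  AS count_price
    FROM sales GROUP BY country, sex
""")

# Query time: the same answer from the base table and from the aggregate.
from_base = conn.execute(
    "SELECT country, SUM(quantity), AVG(price) FROM sales "
    "GROUP BY country ORDER BY country").fetchall()
from_agg = conn.execute(
    "SELECT country, SUM(sum_quantity), SUM(sum_price) / SUM(count_price) "
    "FROM agg_sales GROUP BY country ORDER BY country").fetchall()
print(from_base == from_agg)
```

The second query reads four pre-computed rows instead of every sale, which is where the saving comes from once the base table is large. The cost is the redundancy noted above: `agg_sales` has to be rebuilt or updated whenever `sales` changes.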
Advantages of Pre-Aggregated Tables
● Reduce input/output, CPU, RAM, and swapping requirements
● Minimize the amount of data that must be aggregated and sorted
at run time thereby reducing memory requirements for joins
● Move time-intensive calculations with complicated logic or
significant computations into a batch routine from dynamic SQL
executed at report run time
Trade-offs for Pre Aggregation
● Query performance vs. load performance
● Building and updating the aggregate data structures is costly.
● Load time will be slower.
● Pre-aggregation improves the performance of only those queries for
which pre-computed data exists.
Pre Aggregation Choices
● Most (R)OLAP tools today support practical pre-aggregation
− IBM DB2 uses Materialized Query Tables (MQTs)
− Oracle 9iR2 uses Materialized Views
− Hyperion Essbase (DB2 OLAP Services)
− CarbonData pre-aggregates
Creating Pre-aggregate Table
(carbondata)
● CarbonData supports pre-aggregating data so that OLAP-style queries
can fetch data much faster.
Pre-Aggregation Flow
How pre-aggregate tables are selected
● For the main table sales and pre-aggregate table agg_sales
created above, queries of the kind:
− SELECT country, sex, sum(quantity), avg(price) from sales
GROUP BY country, sex
− SELECT sex, sum(quantity) from sales GROUP BY sex
− SELECT avg(price), country from sales GROUP BY country
● will be transformed by Query Planner to fetch data from
pre-aggregate table agg_sales
How pre-aggregate tables are selected
● But queries of the kind:
− SELECT user_id, country, sex, sum(quantity), avg(price) from
sales GROUP BY user_id, country, sex
− SELECT sex, avg(quantity) from sales GROUP BY sex
− SELECT country, max(price) from sales GROUP BY country
● will fetch the data from the main table sales
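The selection behaviour in these two slides can be approximated by a simple coverage rule. This is a simplified sketch, not CarbonData's actual planner: the definition of `agg_sales` below (grouped by country and sex, storing sum(quantity) and avg(price)) is an assumption inferred from the example queries, and the function names are hypothetical.

```python
# Assumed definition of agg_sales, inferred from the example queries:
# grouped by (country, sex), storing sum(quantity) and avg(price).
AGG_GROUP_BY = {"country", "sex"}
AGG_MEASURES = {("sum", "quantity"), ("avg", "price")}

def can_use_aggregate(group_by, measures):
    """True when every grouping column and every aggregate is covered."""
    return set(group_by) <= AGG_GROUP_BY and set(measures) <= AGG_MEASURES

def choose_table(group_by, measures):
    return "agg_sales" if can_use_aggregate(group_by, measures) else "sales"

# The six example queries, as (group-by columns, aggregate expressions).
queries = [
    (["country", "sex"], [("sum", "quantity"), ("avg", "price")]),
    (["sex"], [("sum", "quantity")]),
    (["country"], [("avg", "price")]),
    (["user_id", "country", "sex"], [("sum", "quantity"), ("avg", "price")]),
    (["sex"], [("avg", "quantity")]),
    (["country"], [("max", "price")]),
]
chosen = [choose_table(g, m) for g, m in queries]
print(chosen)
```

The first three queries are covered and go to `agg_sales`. The last three fall back to `sales` because they need a column (user_id) or an aggregate (avg(quantity), max(price)) that the pre-aggregate does not hold.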
Datamap
Queries
● Query1: Find out the maximum quantity ordered in each order
placed.
● Query2: Find out the total amount spent by a customer for all
the orders he has placed.
● Query3: Find out the average amount spent by a customer for a
particular order
Performance
● Query: Find the maximum item quantity ordered for each order placed.
>select L_ORDERKEY, max(L_QUANTITY) from lineitem group by
L_ORDERKEY
● Performance on the main table [data size: TPC-H data at a scale of 50 GB]
● Average: 9.384
Performance(cont...)
● Now let us first create a pre-aggregate:
● >create datamap max_order_quantity on table lineitem using
'preaggregate' as select L_ORDERKEY, max(L_QUANTITY) from
lineitem group by L_ORDERKEY
● Performance after creating the pre-aggregate
[data size: TPC-H data at a scale of 50 GB]:
● Average: 1.190
Logical Plan(Max)
Logical Plan(Average)
Logical Plan(Sum)
Pre-aggregate with Timeseries
(carbondata)
Datamap property    Description
event_time          The event time column in the schema, which will be used
                    for rollup. The column must be of timestamp type.
time granularity    The aggregate dimension hierarchy. The property value is
                    a key-value pair specifying the aggregate granularity for
                    the time level.
                    CarbonData supports "year_granularity",
                    "month_granularity", "day_granularity",
                    "hour_granularity", "minute_granularity", and
                    "second_granularity".
                    The granularity value currently only supports 1 when
                    creating a datamap. For example, 'hour_granularity'='1'
                    means aggregate every 1 hour.
Understanding Datamap with Timeseries
● Query before creating the aggregate:
● >select DOB, max(DOUBLE_COLUMN1) from uniqdata group by
DOB
● Creating the pre-aggregate:
● >create datamap timeseries_agg on table uniqdata using
'timeseries'
dmproperties('event_time'='DOB','year_granularity'='1') as select
DOB, max(DOUBLE_COLUMN1) from uniqdata group by DOB
Understanding Datamap with Timeseries
● After creating the pre-aggregate:
● >select timeseries(DOB,'year'),max(DOUBLE_COLUMN1) from
uniqdata group by timeseries(DOB,'year')
● will map to the aggregate, but the query below will be executed against
the main table:
● >select timeseries(DOB,'month'),max(DOUBLE_COLUMN1) from
uniqdata group by timeseries(DOB,'month')
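Why a year-level aggregate cannot serve a month-level query can be seen in a small Python sketch. It only models the rollup idea; the `truncate` and `rollup_max` helpers and the three sample rows are hypothetical and are not CarbonData's implementation.

```python
from collections import defaultdict
from datetime import datetime

def truncate(ts, level):
    """Truncate a timestamp to the start of its year or month."""
    if level == "year":
        return datetime(ts.year, 1, 1)
    if level == "month":
        return datetime(ts.year, ts.month, 1)
    raise ValueError(f"unsupported level: {level}")

def rollup_max(rows, level):
    """Max value per time bucket at the given granularity."""
    out = defaultdict(lambda: float("-inf"))
    for ts, value in rows:
        key = truncate(ts, level)
        out[key] = max(out[key], value)
    return dict(out)

# Hypothetical (DOB, DOUBLE_COLUMN1) rows.
rows = [
    (datetime(1970, 1, 5), 1.5),
    (datetime(1970, 3, 9), 4.0),
    (datetime(1971, 2, 1), 2.5),
]
yearly = rollup_max(rows, "year")    # what a year-level aggregate would hold
monthly = rollup_max(rows, "month")  # needs the base rows
print(len(yearly), len(monthly))
```

The yearly result keeps one maximum per year, so the per-month maxima for 1970 are no longer recoverable from it. A query grouped by month therefore has to go back to the base rows unless a month-level aggregate also exists.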
Logical Plan of Datamap with Timeseries
Goals for Aggregation Strategy
● Do not get bogged down with too many aggregates.
● Try to cater to a wide range of user groups.
● Keep the aggregates hidden from end users. That is, the
aggregate must be transparent to the end user query. The query
tool must be the one to be aware of the aggregates to direct the
queries for proper access.
Suggestions for Preaggregation
● Before doing any calculations, spend time determining which
pre-aggregates you need and what the common queries will be.
● Spend time understanding the levels of the hierarchies and identify the
important hierarchy.
● In each dimension, check the attributes required to group the fact
table metrics.
● Next, determine which attributes are used in combination and which
combinations are most common.
References
● https://carbondata.apache.org/data-management-on-carbondata.html
● https://www.slideshare.net/siddiqueibrahim37/aggregate-fact-tables
● https://www.ibm.com/support/knowledgecenter/en/SSCRW7_6.3.0/com.ibm.redbrick.doc6.3/vista/vista20.htm
● http://myitlearnings.com/bucketing-in-hive/
Run your queries 14X faster without any investment!

Contenu connexe

Tendances

Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQLtomflemingh2
 
Survey On Temporal Data And Change Management in Data Warehouses
Survey On Temporal Data And Change Management in Data WarehousesSurvey On Temporal Data And Change Management in Data Warehouses
Survey On Temporal Data And Change Management in Data WarehousesEtisalat
 
Cassandra Table Modeling - an alternate approach
Cassandra Table Modeling - an alternate approachCassandra Table Modeling - an alternate approach
Cassandra Table Modeling - an alternate approachDevopam Mittra
 
Oracle Database 12c features for DBA
Oracle Database 12c features for DBAOracle Database 12c features for DBA
Oracle Database 12c features for DBAKaran Kukreja
 
Partitioning your Oracle Data Warehouse - Just a simple task?
Partitioning your Oracle Data Warehouse - Just a simple task?Partitioning your Oracle Data Warehouse - Just a simple task?
Partitioning your Oracle Data Warehouse - Just a simple task?Trivadis
 
Time Travelling With DB2 10 For zOS
Time Travelling With DB2 10 For zOSTime Travelling With DB2 10 For zOS
Time Travelling With DB2 10 For zOSLaura Hood
 

Tendances (8)

Dbms schemas for decision support
Dbms schemas for decision supportDbms schemas for decision support
Dbms schemas for decision support
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
Survey On Temporal Data And Change Management in Data Warehouses
Survey On Temporal Data And Change Management in Data WarehousesSurvey On Temporal Data And Change Management in Data Warehouses
Survey On Temporal Data And Change Management in Data Warehouses
 
05 OLAP v6 weekend
05 OLAP  v6 weekend05 OLAP  v6 weekend
05 OLAP v6 weekend
 
Cassandra Table Modeling - an alternate approach
Cassandra Table Modeling - an alternate approachCassandra Table Modeling - an alternate approach
Cassandra Table Modeling - an alternate approach
 
Oracle Database 12c features for DBA
Oracle Database 12c features for DBAOracle Database 12c features for DBA
Oracle Database 12c features for DBA
 
Partitioning your Oracle Data Warehouse - Just a simple task?
Partitioning your Oracle Data Warehouse - Just a simple task?Partitioning your Oracle Data Warehouse - Just a simple task?
Partitioning your Oracle Data Warehouse - Just a simple task?
 
Time Travelling With DB2 10 For zOS
Time Travelling With DB2 10 For zOSTime Travelling With DB2 10 For zOS
Time Travelling With DB2 10 For zOS
 

Similaire à Run your queries 14X faster without any investment!

Data Enginering from Google Data Warehouse
Data Enginering from Google Data WarehouseData Enginering from Google Data Warehouse
Data Enginering from Google Data Warehousearungansi
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingAmir Reza Hashemi
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedShubham Tagra
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-FeaturesNavneet Upneja
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016Niko Neugebauer
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala
 
Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Huy Nguyen
 
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseBest Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseEdgar Alejandro Villegas
 
How to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxHow to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxSadeka Islam
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLJim Mlodgenski
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud EraMydbops
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009HawaDia
 

Similaire à Run your queries 14X faster without any investment! (20)

mod 2.pdf
mod 2.pdfmod 2.pdf
mod 2.pdf
 
Data Enginering from Google Data Warehouse
Data Enginering from Google Data WarehouseData Enginering from Google Data Warehouse
Data Enginering from Google Data Warehouse
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / Sharding
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
3dw
3dw3dw
3dw
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-Features
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016Columnstore improvements in SQL Server 2016
Columnstore improvements in SQL Server 2016
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Date Analysis .pdf
Date Analysis .pdfDate Analysis .pdf
Date Analysis .pdf
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?Why PostgreSQL for Analytics Infrastructure (DW)?
Why PostgreSQL for Analytics Infrastructure (DW)?
 
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle DatabaseBest Practices – Extreme Performance with Data Warehousing on Oracle Database
Best Practices – Extreme Performance with Data Warehousing on Oracle Database
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
How to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptxHow to Cost-Optimize Cloud Data Pipelines_.pptx
How to Cost-Optimize Cloud Data Pipelines_.pptx
 
3dw
3dw3dw
3dw
 
Scaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQLScaling PostgreSQL With GridSQL
Scaling PostgreSQL With GridSQL
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud Era
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009
 

Plus de Knoldus Inc.

Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxKnoldus Inc.
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingKnoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionKnoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxKnoldus Inc.
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptxKnoldus Inc.
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfKnoldus Inc.
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxKnoldus Inc.
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesKnoldus Inc.
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxKnoldus Inc.
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxKnoldus Inc.
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxKnoldus Inc.
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxKnoldus Inc.
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxKnoldus Inc.
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationKnoldus Inc.
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationKnoldus Inc.
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIsKnoldus Inc.
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II PresentationKnoldus Inc.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAKnoldus Inc.
 

Plus de Knoldus Inc. (20)

Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 
Introduction to Redis and its features.pptx
Introduction to Redis and its features.pptxIntroduction to Redis and its features.pptx
Introduction to Redis and its features.pptx
 
GraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdfGraphQL with .NET Core Microservices.pdf
GraphQL with .NET Core Microservices.pdf
 
NuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptxNuGet Packages Presentation (DoT NeT).pptx
NuGet Packages Presentation (DoT NeT).pptx
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
 
K8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose KubernetesK8sGPTThe AI​ way to diagnose Kubernetes
K8sGPTThe AI​ way to diagnose Kubernetes
 
Introduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptxIntroduction to Circle Ci Presentation.pptx
Introduction to Circle Ci Presentation.pptx
 
Robusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptxRobusta -Tool Presentation (DevOps).pptx
Robusta -Tool Presentation (DevOps).pptx
 
Optimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptxOptimizing Kubernetes using GOLDILOCKS.pptx
Optimizing Kubernetes using GOLDILOCKS.pptx
 
Azure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptxAzure Function App Exception Handling.pptx
Azure Function App Exception Handling.pptx
 
CQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptxCQRS Design Pattern Presentation (Java).pptx
CQRS Design Pattern Presentation (Java).pptx
 
ETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake PresentationETL Observability: Azure to Snowflake Presentation
ETL Observability: Azure to Snowflake Presentation
 
Scripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics PresentationScripting with K6 - Beyond the Basics Presentation
Scripting with K6 - Beyond the Basics Presentation
 
Getting started with dotnet core Web APIs
Getting started with dotnet core Web APIsGetting started with dotnet core Web APIs
Getting started with dotnet core Web APIs
 
Introduction To Rust part II Presentation
Introduction To Rust part II PresentationIntroduction To Rust part II Presentation
Introduction To Rust part II Presentation
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Configuring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRAConfiguring Workflows & Validators in JIRA
Configuring Workflows & Validators in JIRA
 

Dernier

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Run your queries 14X faster without any investment!

  • 1. Run your aggregation queries at a speed of 14x without spending $$$ By: Bhavya Aggarwal(CTO, Knoldus) Sangeeta Gulia(Software Consultant, Knoldus)
  • 2. Agenda ● What are aggregates? ● Why are aggregates slow? ● SQL queries and Big Data ● Techniques to prevent full scan ● Partitioning and Bucketing ● What is pre-aggregation? ● Advantages of pre-aggregation ● Trade-offs with pre-aggregation ● How can we pre-aggregate data? ● Suggestions for pre-aggregation
  • 3. “The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.” - Ralph Kimball
  • 4. Commonly Used Aggregates (Function: Description) ● MIN: returns the smallest value in a given column ● MAX: returns the largest value in a given column ● SUM: returns the sum of the numeric values in a given column ● AVG: returns the average value of a given column ● COUNT: returns the number of non-NULL values in a given column ● COUNT(*): returns the number of rows in a table
  • 5. Why are Aggregates Slow ● An aggregate has to iterate over the whole dataset, traversing each record. ● When the data size is large, computing aggregate functions takes a long time.
  • 6. Interactive SQL Queries ● When it comes to extracting meaningful information from stored data, the first thing that comes to mind is SQL. ● Almost everyone is comfortable using SQL to explore data and extract meaningful information from it. ● Problems start when we have to work on large data (terabytes/petabytes).
  • 7. How we store data is important ● We need to analyse the application first, rather than just storing the data. ● We should understand the use cases and which data will make sense for the business users. ● What will be the optimized storage format? (Columnar storage) − ORC − Parquet − CarbonData
  • 8. Techniques to prevent full scan ● Partitioning ● Bucketing ● Preaggregation − Compaction
  • 9. Partitioning ● Partitioning divides the data into a number of folders based on the values of table columns. This has a performance benefit and helps organize data in a logical fashion. ● Always partition on a low-cardinality column.
  • 10. Understanding Partitioning ● Example: we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department. create table employees (id int, name string, dob string) PARTITIONED BY (country STRING, DEPT STRING) ● For a faster query response, the Hive table can be partitioned. Partitioning changes how Hive structures the data storage: Hive will now create subdirectories reflecting the partitioning structure, like .../employees/country=ABC/DEPT=XYZ.
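The directory layout described on this slide can be sketched in plain Python. This is only an illustration of partition pruning, not Hive's implementation: the sample rows and the helper names (`partition_path`, `write_partitions`, `query`) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows: (id, name, dob, country, dept)
ROWS = [
    (1, "Asha", "1990-01-01", "IN", "ENG"),
    (2, "Bob", "1985-05-12", "US", "ENG"),
    (3, "Chen", "1992-09-30", "IN", "HR"),
    (4, "Dina", "1988-03-17", "IN", "ENG"),
]


def partition_path(country, dept):
    """Directory name in the country=.../DEPT=... style shown on the slide."""
    return f"employees/country={country}/DEPT={dept}"


def write_partitions(rows):
    """Group rows by partition directory; the partition columns live in the path."""
    partitions = defaultdict(list)
    for emp_id, name, dob, country, dept in rows:
        partitions[partition_path(country, dept)].append((emp_id, name, dob))
    return dict(partitions)


def query(partitions, country, dept):
    """Read only the directory matching the WHERE clause, skipping all others."""
    return partitions.get(partition_path(country, dept), [])


partitions = write_partitions(ROWS)
matches = query(partitions, "IN", "ENG")
print(sorted(partitions))  # one directory per (country, dept) combination
print(matches)             # rows read without touching the other directories
```

A low-cardinality partition column keeps the number of directories small; partitioning on something like an employee id would create one tiny directory per row.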
  • 11. Bucketing ● The bucketing feature of Hive can be used to distribute/organize the table or partition data into multiple files, such that similar records are present in the same file, based on some logic (usually a hashing algorithm). ● Bucketing works well when the field has high cardinality and data is evenly distributed among buckets.
  • 12. Understanding Bucketing ● For example, suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. ● Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets. create table employees (id int, name string, age int) PARTITIONED BY (dob string) CLUSTERED BY(id) INTO 5 BUCKETS
  • 13. Understanding Bucketing ● Hive will then store the data in a directory hierarchy like /user/hive/warehouse/mytable/dob=2001-02-01, with 5 bucket files inside that directory: 00000_0 00001_0 ........ 00004_0 ● Here dob=2001-02-01 is the partition and the files starting with 000 are the buckets in that partition.
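How rows end up in one of the five bucket files can be sketched as follows. The scheme here (non-negative hash of the key modulo the bucket count, with an integer key hashing to itself) is an assumption for illustration; the function names are hypothetical and the file names simply follow the pattern shown on the slide.

```python
NUM_BUCKETS = 5  # matches "INTO 5 BUCKETS" in the slide's DDL


def bucket_for(emp_id, num_buckets=NUM_BUCKETS):
    """Assumed scheme: non-negative hash of the key, modulo the bucket count.

    For an integer key the hash is taken to be the value itself.
    """
    return (emp_id & 0x7FFFFFFF) % num_buckets


def bucket_file(bucket):
    """File name following the 0000N_0 pattern shown on the slide."""
    return f"{bucket:05d}_0"


def assign(ids):
    """Map each bucket file to the ids that hash into it."""
    files = {}
    for emp_id in ids:
        files.setdefault(bucket_file(bucket_for(emp_id)), []).append(emp_id)
    return files


layout = assign(range(1, 11))
for name in sorted(layout):
    print(name, layout[name])
```

Because the same id always lands in the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket, which is the idea behind the map-side join benefit.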
  • 14. How Bucketing Improves Performance ● Fast Map side Joins ● Efficient Group by ● Sampling
  • 15. Dimension Table ● In a dimensional model, the context of the measurements is represented in dimension tables. You can think of the context of a measurement as its characteristics: the who, what, where, when, and how of a measurement. ● In the business process Sales, the characteristics of the ‘monthly sales number’ measurement can be a Location (where), Time (when), and Product Sold (what). ● A dimension table captures the information about each entity that is referenced in a fact table.
  • 16. Fact Table ● Table that contains the measures of interest. ● Fact tables contain the data corresponding to a particular business process. ● Each row represents a single event associated with a process and contains the measurement data associated with that event.
      Order_ID  Order_Date  Customer_Id  Product_Id  Quantity  Price  Total
      1212      12/12/2017  13243        1           2         20     40
      1213      12/12/2017  13289        456         8         60     50
  • 17.
  • 18. Dimension Aggregates ● Methods to define aggregates ● Include or leave out entire dimensions ● Include some columns of each dimension, but not others − Second approach is more common / more useful ● Example: Several versions of Date dimension − Base Date dimension ● 1 row per day − Monthly aggregate dimension ● 1 row per month − Yearly aggregate dimension ● 1 row per year
  • 19. Choosing Dimension Aggregates ● Dimension aggregates often roll up along hierarchies − Day : month – year − Store : area_code - state – country ● Any subset of dimension attributes can be used − Customer aggregate might include only a few frequently-queried columns (age, gender, income, marital status) ● Goal: reduced number of distinct combinations − Results in fewer rows in aggregated fact
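The effect of rolling up along the date hierarchy, or leaving a dimension out entirely, can be seen with a few rows. This is a minimal sketch with made-up data; `DAILY` and `roll_up` are hypothetical names, not part of any tool mentioned in the slides.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily fact rows: (order_date, store, quantity)
DAILY = [
    (date(2017, 12, 11), "S1", 2),
    (date(2017, 12, 12), "S1", 8),
    (date(2017, 12, 12), "S2", 1),
    (date(2018, 1, 3), "S1", 4),
]


def roll_up(rows, key):
    """Sum quantity per aggregate key; key() picks the dimension columns to keep."""
    totals = defaultdict(int)
    for order_date, store, qty in rows:
        totals[key(order_date, store)] += qty
    return dict(totals)


by_day = roll_up(DAILY, lambda d, s: (d, s))                  # base grain
by_month = roll_up(DAILY, lambda d, s: (d.year, d.month, s))  # day rolled up to month
by_year = roll_up(DAILY, lambda d, s: (d.year,))              # store dimension left out

print(len(by_day), len(by_month), len(by_year))
```

Each step up the hierarchy reduces the number of distinct key combinations, which is what makes the aggregated fact table smaller than the base table.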
  • 20. Pre-Aggregation ● Aggregate fact tables are simple numeric rollups of atomic fact table data built solely to accelerate query performance ● It is important to remember that aggregation is a form of data redundancy, because the aggregations are computed from other warehouse values. ● For obvious performance reasons, the aggregations are pre-calculated and loaded into the warehouse during off-hours.
  • 21. Advantages of Pre-Aggregated Tables ● Reduce input/output, CPU, RAM, and swapping requirements ● Minimize the amount of data that must be aggregated and sorted at run time thereby reducing memory requirements for joins ● Move time-intensive calculations with complicated logic or significant computations into a batch routine from dynamic SQL executed at report run time
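A pre-aggregated table can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the rows are invented, SQLite stands in for the warehouse, and the rollup stores a sum and a count for `price` so that averages can still be derived at coarser levels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (country TEXT, sex TEXT, quantity INTEGER, price REAL);
INSERT INTO sales VALUES
  ('IN', 'F', 2, 20.0), ('IN', 'F', 3, 40.0),
  ('IN', 'M', 1, 10.0), ('US', 'F', 8, 60.0);
-- Batch step: compute the rollup once and store it next to the base table
CREATE TABLE agg_sales AS
  SELECT country, sex,
         SUM(quantity) AS sum_quantity,
         SUM(price)    AS sum_price,
         COUNT(price)  AS count_price
  FROM sales GROUP BY country, sex;
""")

# Report-time query against the base table: scans every sales row
from_base = conn.execute(
    "SELECT country, SUM(quantity), AVG(price) FROM sales "
    "GROUP BY country ORDER BY country").fetchall()

# Same answer from the smaller rollup: AVG is rebuilt from sum and count
from_agg = conn.execute(
    "SELECT country, SUM(sum_quantity), SUM(sum_price) / SUM(count_price) "
    "FROM agg_sales GROUP BY country ORDER BY country").fetchall()

agg_rows = conn.execute("SELECT COUNT(*) FROM agg_sales").fetchone()[0]
print(from_base)
print(from_agg)
print(agg_rows)
```

Storing the average itself would not work for the second query, because an average of averages is wrong when groups have different sizes; keeping sum and count is the usual way around that.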
  • 22. Trade-offs for Pre Aggregation ● Query Performance V/S Load Performance ● Building and updating data structures will be costly. ● Load Time will be slower. ● Pre-aggregation will improve performance of only those queries for which pre-computed data exist.
  • 23. Pre-Aggregation Choices ● Most (R)OLAP tools today support practical pre-aggregation − IBM DB2 uses Materialized Query Tables (MQTs) − Oracle 9iR2 uses Materialised Views − Hyperion Essbase (DB2 OLAP Services) − CarbonData pre-aggregates
  • 24. Creating Pre-aggregate Table (carbondata) ● CarbonData supports pre-aggregating data so that OLAP-style queries can fetch data much faster.
  • 26. How pre-aggregate tables are selected ● For the main table sales and pre-aggregate table agg_sales created above, queries of the kind: − SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex − SELECT sex, sum(quantity) from sales GROUP BY sex − SELECT avg(price), country from sales GROUP BY country ● will be transformed by Query Planner to fetch data from pre-aggregate table agg_sales
  • 27. How pre-aggregate tables are selected ● But queries of kind : − SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex − SELECT sex, avg(quantity) from sales GROUP BY sex − SELECT country, max(price) from sales GROUP BY country ● will fetch the data from the main table sales Datamap
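The six example queries on the two slides above follow a simple coverage rule, sketched here in Python. This is not CarbonData's planner. It assumes that `agg_sales` was defined with GROUP BY country, sex and the measures sum(quantity) and avg(price); the slide that creates it is missing from this transcript, so that definition is inferred from the queries that do hit the aggregate.

```python
# Assumed definition of agg_sales (inferred, not taken from the transcript)
AGG_GROUP_BY = {"country", "sex"}
AGG_MEASURES = {("sum", "quantity"), ("avg", "price")}


def can_use_aggregate(group_by, measures):
    """Simplified rule: every grouping column and every measure must be covered."""
    return set(group_by) <= AGG_GROUP_BY and set(measures) <= AGG_MEASURES


# (group-by columns, measures) for the six queries on the slides
QUERIES = [
    (["country", "sex"], [("sum", "quantity"), ("avg", "price")]),
    (["sex"], [("sum", "quantity")]),
    (["country"], [("avg", "price")]),
    (["user_id", "country", "sex"], [("sum", "quantity"), ("avg", "price")]),
    (["sex"], [("avg", "quantity")]),
    (["country"], [("max", "price")]),
]

results = [can_use_aggregate(g, m) for g, m in QUERIES]
print(results)
```

The last three fail for different reasons: `user_id` is not a grouping column of the aggregate, and neither avg(quantity) nor max(price) was pre-computed.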
  • 28. Queries ● Query1: Find out the maximum quantity ordered in each order placed. ● Query2: Find out the total amount spent by a customer for all the orders he has placed. ● Query3: Find out the average amount spent by a customer for a particular order
  • 29. Performance ● Query: Find out the maximum item quantity, ordered for each order placed. >select L_ORDERKEY, max(L_QUANTITY) from lineitem group by L_ORDERKEY ● Performance on Main Table: [Datasize: TPCH data with scale of 50gb ] ● Average: 9.384
  • 30. Performance(cont...) ● Now let us first create a preaggregate, ● >create datamap max_order_quantity on table lineitem using 'preaggregate' as select L_ORDERKEY, max(L_QUANTITY) from lineitem group by L_ORDERKEY ● Performance after creating pre-aggregate: ● [Datasize: TPCH data with scale of 50gb ] ● Average: 1.190
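For this one query, the speedup implied by the two reported averages is a single division. The slides do not state the unit of the averages; the ratio does not depend on it.

```python
main_table_avg = 9.384  # average reported for the main table
preagg_avg = 1.190      # average reported after creating the pre-aggregate

speedup = main_table_avg / preagg_avg
print(f"{speedup:.2f}x")
```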
  • 35. Pre-aggregate with Timeseries (carbondata) Datamap properties: ● event_time: the event time column in the schema, which will be used for rollup. The column needs to be of timestamp type. ● time granularity: the aggregate dimension hierarchy. The property value is a key-value pair specifying the aggregate granularity for the time level. CarbonData supports "year_granularity", "month_granularity", "day_granularity", "hour_granularity", "minute_granularity", and "second_granularity". The granularity value only supports 1 when creating the datamap; for example, 'hour_granularity'='1' means aggregate every 1 hour.
  • 36. Understanding Datamap with Timeseries ● Query before creating aggregate: ● >select DOB, max(DOUBLE_COLUMN1) from uniqdata group by DOB ● Creating Preaggregate: ● >create datamap timeseries_agg on table uniqdata using 'timeseries' dmproperties('event_time'='DOB','year_granularity'='1') as select DOB, max(DOUBLE_COLUMN1) from uniqdata group by DOB
  • 37. Understanding Datamap with Timeseries ● After Creating Preaggregate: ● >select timeseries(DOB,'year'),max(DOUBLE_COLUMN1) from uniqdata group by timeseries(DOB,'year') ● will map to aggregate, but below query will be executed using main table. ● >select timeseries(DOB,'month'),max(DOUBLE_COLUMN1) from uniqdata group by timeseries(DOB,'month')
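Why the year-level datamap serves the first query but not the month-level one can be shown with a small rollup. This is an illustrative sketch, not CarbonData code: the rows are invented, and `truncate`, `rollup_max`, and `answer` are hypothetical helpers.

```python
from datetime import datetime

# Hypothetical rows: (DOB timestamp, DOUBLE_COLUMN1)
ROWS = [
    (datetime(2001, 2, 1, 10, 0), 1.5),
    (datetime(2001, 7, 9, 8, 30), 4.0),
    (datetime(2002, 3, 3, 12, 0), 2.5),
]


def truncate(ts, level):
    """Cut a timestamp down to the start of its year or month."""
    if level == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if level == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported level: {level}")


def rollup_max(rows, level):
    """MAX(value) per truncated timestamp, like grouping by timeseries(DOB, level)."""
    out = {}
    for ts, value in rows:
        key = truncate(ts, level)
        out[key] = max(value, out.get(key, value))
    return out


# Built once, when the datamap with 'year_granularity'='1' is created
year_agg = rollup_max(ROWS, "year")


def answer(level):
    """Use the stored rollup only when the query asks for the granularity it holds."""
    if level == "year":
        return "aggregate", year_agg
    return "main table", rollup_max(ROWS, level)


print(answer("year"))
print(answer("month"))
```

The year rollup has already merged February and July 2001 into one row, so the per-month maxima cannot be recovered from it and the month query has to scan the main table.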
  • 38. Logical Plan of Datamap with Timeseries
  • 39. Goals for Aggregation Strategy ● Do not get bogged down with too many aggregates. ● Try to cater to a wide range of user groups. ● Keep the aggregates hidden from end users. That is, the aggregate must be transparent to the end user query. The query tool must be the one to be aware of the aggregates to direct the queries for proper access.
  • 40. Suggestions for Preaggregation ● Before doing any calculations, spend enough time determining which pre-aggregates you need and what the common queries will be. ● Spend time understanding the levels of the hierarchies and identify the important hierarchy. ● In each dimension, check the attributes required to group the fact table metrics. ● The next step is to determine which attributes are used in combination and what the most common combinations are.
  • 41.
  • 42. References ● https://carbondata.apache.org/data-management-on-carbondata.html ● https://www.slideshare.net/siddiqueibrahim37/aggregate-fact-tables ● https://www.ibm.com/support/knowledgecenter/en/SSCRW7_6.3.0/com.ibm.redbrick.doc6.3/vista/vista20.htm ● http://myitlearnings.com/bucketing-in-hive/