How to Guarantee Exact COUNT DISTINCT Queries
with Sub-Second Latency on Massive Datasets
Kaige Liu
2020.5
© Kyligence Inc. 2019, Confidential.
Agenda
• Business Scenarios
• Technical Principles
• Demo
• Use Cases
• Q&A
Business Scenarios
What Is Count Distinct?
Count Distinct is used to compute the number of
unique values in a data set.
• PV (Page View)
• UV (Unique Visitors)
ID Username Page
1 Alice /kyligence
2 Alice /Kyligence/Blog
3 Carol /Kyligence/Events
4 Bob /Kyligence/Resources
5 Alice /Kyligence/Downloads
Distinct usernames: Alice, Bob, Carol
Count Distinct (UV) = 3
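A minimal sketch of the difference between PV and UV, using the rows from the table above. Plain Python sets stand in for a query engine here; the variable names are illustrative only.

```python
# Rows from the slide's example table: (id, username, page).
rows = [
    (1, "Alice", "/kyligence"),
    (2, "Alice", "/Kyligence/Blog"),
    (3, "Carol", "/Kyligence/Events"),
    (4, "Bob", "/Kyligence/Resources"),
    (5, "Alice", "/Kyligence/Downloads"),
]

# PV: every row is one page view, so PV is a plain COUNT.
pv = len(rows)

# UV: collect usernames into a set so repeat visitors are counted once.
unique_visitors = {username for _, username, _ in rows}
uv = len(unique_visitors)

print(pv, uv, sorted(unique_visitors))  # 5 3 ['Alice', 'Bob', 'Carol']
```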
Approximate and Exact Count Distinct
• Approximate Count Distinct
• Quick, less memory/CPU
• Not accurate
• Trend analysis, small errors are acceptable
• Exact Count Distinct
• Slow, more memory/CPU
• Accurate
• Transaction relevant. Paid Advertising, Precision Marketing, etc.
Error Rate   On $1 Million   On $1 Billion
1.22%        $12,200         $12,200,000
2.44%        $24,400         $24,400,000
9.75%        $97,500         $97,500,000
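The dollar figures in the table are simply the error rate multiplied by the amount at stake. A small sketch of that arithmetic follows; the function name and the use of basis points (1% = 100 bps, chosen to keep the math in integers) are assumptions made for illustration.

```python
def error_cost(error_rate_bps: int, amount: int) -> int:
    """Dollar value that may be miscounted at a given error rate.

    The rate is given in basis points (1% = 100 bps) so the
    calculation stays in exact integer arithmetic.
    """
    return amount * error_rate_bps // 10_000


# The three error rates from the table: 1.22%, 2.44%, 9.75%.
for bps in (122, 244, 975):
    on_million = error_cost(bps, 1_000_000)
    on_billion = error_cost(bps, 1_000_000_000)
    print(f"{bps / 100:.2f}%  ${on_million:,}  ${on_billion:,}")
```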
Scenarios - Web/App Analytics
• Who are my visitors?
• Where are they coming from?
• How many unique visitors?
• How many new users?
• How many active users?
• Which page lost the most users?
Scenarios - User Behavior Analytics
Retention Analysis
Funnel Analysis
Technical Principles
Challenges with Exact Count Distinct
• Approximate Count Distinct is easy: use HyperLogLog
• Exact Count Distinct is a big challenge for all query engines at massive scale
Challenges
• Poor performance: every query needs to scan all the data
• Non-additive: per-group results cannot be rolled up or combined with AND/OR operations
• Hard to optimize across multiple columns
• A single analysis typically requires more than one Count Distinct operation
Count Distinct Performance on Different Platforms
• Google BigQuery
• Snowflake
• Athena
• Apache Kylin
• Kyligence
Kyligence = Kylin + Intelligence
• Founded in 2016 by the creators of Apache Kylin
• Built around Kylin and augmented with AI to deliver
unprecedented enterprise analytics performance
• Named among CRN's top 10 big data startups in 2018
• Global Presence: San Jose, Seattle, New York, Shanghai, Beijing
• VCs: Fidelity International, Shunwei Capital, Broadband Capital,
Redpoint, Cisco, Coatue
Accelerate Critical Business Decisions with AI-Augmented Data Management
and Analytics
Funding timeline:
2016: Founded; Pre-A round (Redpoint, Cisco)
2017: Series A (CBC, Shunwei)
2018: Series B (8Roads)
2019: Series C (Coatue)
How Does Apache Kylin Achieve This?
Pre-Aggregation
• Pre-aggregate count distinct in cubes
• Fetch results directly without on-the-fly calculations
Bitmap
• Supports rollup
• Reduces memory/storage significantly
Dictionary
• Supports String type and detail queries
Pre-Aggregation
Source table:
Date        UID  Page
2020-04-01  1    /kyligence
2020-04-01  1    /Kyligence/Blog
2020-04-01  2    /Kyligence/News
2020-04-02  3    /Kyligence/Events
2020-04-02  2    /Kyligence/Resources
2020-04-02  1    /Kyligence/Downloads
Pre-aggregated by date:
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           2
2020-04-02  3           3
Rolled up across both dates:
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           ??
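The "??" can be made concrete with a short sketch: plain counts add up across dates, but per-date distinct counts do not, because the same UID can appear on both days. The data is the slide's table; everything else is illustrative.

```python
from collections import defaultdict

# (date, uid) pairs from the source table above.
events = [
    ("2020-04-01", 1), ("2020-04-01", 1), ("2020-04-01", 2),
    ("2020-04-02", 3), ("2020-04-02", 2), ("2020-04-02", 1),
]

# Pre-aggregate: the set of distinct UIDs seen on each date.
users_by_day = defaultdict(set)
for day, uid in events:
    users_by_day[day].add(uid)

daily_distinct = {day: len(uids) for day, uids in users_by_day.items()}

# Naive rollup: add the per-date distinct counts.
sum_of_daily = sum(daily_distinct.values())

# Correct answer: distinct UIDs over the whole date range.
true_distinct = len({uid for _, uid in events})

print(daily_distinct)               # {'2020-04-01': 2, '2020-04-02': 3}
print(sum_of_daily, true_distinct)  # 5 3
```

Storing only the numbers 2 and 3 loses the information about which users overlap, so the pre-aggregated counts alone cannot answer the two-day query.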
Bitmap
Table (UID): 1, 2, 4, 5, 7, 9, 10, 11, 13
Bitmap (bit n is set when UID n is present):
Bit position  7 6 5 4 3 2 1 0
UIDs 0-7      1 0 1 1 0 1 1 0
UIDs 8-15     0 0 1 0 1 1 1 0
• Saves storage significantly
• Supports logical operations directly
• Contains the information needed to do aggregation
• RoaringBitmap
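A rough sketch of the idea, assuming an uncompressed bitmap held in a plain Python integer. RoaringBitmap, named on the slide, is a compressed bitmap format; this sketch does not reproduce it, and the helper names are hypothetical.

```python
def to_bitmap(uids):
    """Build a bitmap in which bit n is set when UID n is present."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Count Distinct is the number of set bits."""
    return bin(bitmap).count("1")


uids = [1, 2, 4, 5, 7, 9, 10, 11, 13]
bitmap = to_bitmap(uids)

# Show the bitmap one byte at a time, highest bit position on the left.
low_byte = format(bitmap & 0xFF, "08b")          # UIDs 0-7
high_byte = format((bitmap >> 8) & 0xFF, "08b")  # UIDs 8-15
print(low_byte, high_byte, cardinality(bitmap))
```

Nine UIDs fit in two bytes here, which is where the storage saving comes from when IDs are dense.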
Bitmap
Source table:
Date        UID  Page
2020-04-01  1    /kyligence
2020-04-01  1    /Kyligence/Blog
2020-04-01  2    /Kyligence/News
2020-04-02  3    /Kyligence/Events
2020-04-02  2    /Kyligence/Resources
2020-04-02  1    /Kyligence/Downloads
Pre-aggregated by date, storing only the numbers:
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           2
2020-04-02  3           3
Pre-aggregated by date, storing the distinct UIDs as a bitmap:
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           Bitmap(1,2)
2020-04-02  3           Bitmap(1,2,3)
Rolled up across both dates:
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
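The rollup above can be sketched as a bitwise OR over the pre-aggregated per-date bitmaps. This is an illustration with Python integers as bitmaps, not Kylin's cube storage; `pre_aggregated`, `bitmap_of`, `members`, and `distinct_users` are hypothetical names.

```python
from functools import reduce


def bitmap_of(*uids):
    """Bitmap with bit n set for every UID n given."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap


def members(bitmap):
    """List the UIDs whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Pre-aggregated measure per date, as in the table above.
pre_aggregated = {
    "2020-04-01": bitmap_of(1, 2),
    "2020-04-02": bitmap_of(1, 2, 3),
}


def distinct_users(days):
    """Roll up any set of dates by OR-ing their bitmaps; no raw rows are scanned."""
    merged = reduce(lambda a, b: a | b, (pre_aggregated[d] for d in days), 0)
    return members(merged)


both_days = distinct_users(["2020-04-01", "2020-04-02"])
print(both_days, len(both_days))  # [1, 2, 3] 3
```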
Operations in Bitmap
• Two bitmaps, each representing a different data set:
[1, 3, 4, 5]
[2, 3, 4, 6]
• AND: all elements contained in both bitmaps:
[1, 3, 4, 5] AND [2, 3, 4, 6] = [3, 4]
Scenarios: Retention Analysis, Funnel Analysis
• OR: all elements contained in either bitmap:
[1, 3, 4, 5] OR [2, 3, 4, 6] = [1, 2, 3, 4, 5, 6]
Scenarios: Cross-Dimension Analysis
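A sketch of the two operations on the example sets, again with Python integers as bitmaps. Reading the first set as "users active on day 1" and the second as "users active on day 2" is an assumption added here to show how AND maps to retention; the slide does not label the sets.

```python
def bitmap_of(uids):
    """Bitmap with bit n set for every UID n in the list."""
    bitmap = 0
    for uid in uids:
        bitmap |= 1 << uid
    return bitmap


def members(bitmap):
    """List the UIDs whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


day1 = bitmap_of([1, 3, 4, 5])  # assumed: users active on day 1
day2 = bitmap_of([2, 3, 4, 6])  # assumed: users active on day 2

retained = members(day1 & day2)  # AND: users present in both bitmaps
reached = members(day1 | day2)   # OR: users present in either bitmap

# Retention: share of day-1 users who came back on day 2.
retention_rate = len(retained) / len(members(day1))

print(retained)        # [3, 4]
print(reached)         # [1, 2, 3, 4, 5, 6]
print(retention_rate)  # 0.5
```

A funnel works the same way: AND the bitmaps of consecutive steps and compare the cardinalities.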
Dictionary
Bitmaps can only store int values. What about String columns? Encode them with a dictionary.
Source table:
Date        USERNAME  Page
2020-04-01  Alice     /kyligence
2020-04-01  Alice     /Kyligence/Blog
2020-04-01  Bob       /Kyligence/News
2020-04-02  Carol     /Kyligence/Events
2020-04-02  Bob       /Kyligence/Resources
2020-04-02  Alice     /Kyligence/Downloads
Dictionary:
USERNAME  ENCODED
Alice     1
Bob       2
Carol     3
Pre-aggregated by date:
Date        Count(UID)  Count(distinct UID)
2020-04-01  3           Bitmap(1,2)
2020-04-02  3           Bitmap(1,2,3)
Rolled up across both dates:
Date                       Count(UID)  Count(distinct UID)
2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
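A minimal sketch of dictionary encoding, assuming IDs are handed out in order of first appearance and using the usernames from the visitor example (Alice, Bob, Carol). Kylin's own dictionary implementation is not described in these slides; `build_dictionary` and `decode_bitmap` are hypothetical helpers.

```python
def build_dictionary(values):
    """Assign each distinct string an int ID, starting at 1, in order of first appearance."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary


# (date, username) pairs from the example table.
rows = [
    ("2020-04-01", "Alice"), ("2020-04-01", "Alice"), ("2020-04-01", "Bob"),
    ("2020-04-02", "Carol"), ("2020-04-02", "Bob"), ("2020-04-02", "Alice"),
]

dictionary = build_dictionary(name for _, name in rows)

# Per-date bitmap over the encoded IDs (bit n set when ID n is present).
bitmaps = {}
for day, name in rows:
    bitmaps[day] = bitmaps.get(day, 0) | (1 << dictionary[name])

# Reverse mapping, so a bitmap can be turned back into usernames.
decode = {code: name for name, code in dictionary.items()}


def decode_bitmap(bitmap):
    return [decode[i] for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


merged = bitmaps["2020-04-01"] | bitmaps["2020-04-02"]
print(dictionary)             # {'Alice': 1, 'Bob': 2, 'Carol': 3}
print(decode_bitmap(merged))  # ['Alice', 'Bob', 'Carol']
```

The reverse mapping is what makes detail queries possible: the bitmap answers "how many", and the dictionary answers "who".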
Use Cases
Manbang Group
• The largest Chinese truck logistics startup
• 7 million+ trucks
• 2.25 million active users
• 8 apps and 10 TB+ data
Requirements
• Retention analysis on a wide range of dimensions and date ranges
• Funnel analysis with customizable funnels
• User profile analysis
Architecture with Apache Kylin
Retention Analysis for Manbang Group
• Users can choose any column and any date range to do the retention analysis
Funnel Analysis for Manbang Group
• Users can customize funnels with any number of steps
• Can identify the specific users lost between steps
DiDi
• #1 ride-share company in China
• 92 million monthly active users
(as of Dec. 2019)
• 24 million rides per day in 2019
Requirements
• User profile analysis
• Precision marketing
Scenarios – Apache Kylin in DiDi
• Precision Marketing
o Send coupons to precisely targeted users
o Upgrade cars for specific users
• Promotion Activity Analysis
o How many new/returning users were gained from this activity?
o Which kinds of users are most interested in this activity?
• Optimize User Experience
o Which stages lost the most users?
o How to increase customer stickiness?
Diagram labels: User Profile, Precision Marketing, User Behavior Analysis, User Tags, Workflow Analysis, Promotion Activity Analysis
DiDi Kylin Usage
• 200 TB+ of data
• 5,000+ cubes
• 7,000+ jobs per day
• 7 clusters
Join the Community
https://github.com/apache/kylin
apache-kylin.slack.com
user@kylin.apache.org
THANK YOU

Introduction-to-Machine-Learning (1).pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Massive Datasets

  • 1. How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Massive Datasets Kaige Liu 2020.5
  • 2. Agenda: Business Scenarios, Technical Principles, Demo, Use Cases, Q&A
  • 3. Business Scenarios
  • 4. What Is Count Distinct? Count Distinct is used to compute the number of unique values in a data set.
       o PV (Page View)
       o UV (Unique Visitors)
       ID  Username  Page
       1   Alice     /kyligence
       2   Alice     /Kyligence/Blog
       3   Carol     /Kyligence/Events
       4   Bob       /Kyligence/Resources
       5   Alice     /Kyligence/Downloads
       Distinct usernames: Alice, Bob, Carol (count distinct = 3)
  • 5. Approximate and Exact Count Distinct
       Approximate Count Distinct
       o Quick, less memory/CPU
       o Not accurate
       o Trend analysis, where small errors are acceptable
       Exact Count Distinct
       o Slow, more memory/CPU
       o Accurate
       o Transaction relevant: Paid Advertising, Precision Marketing, etc.
       Error Rate  $1 Million  $1 Billion
       1.22%       $12,200     $12,200,000
       2.44%       $24,400     $24,400,000
       9.75%       $97,500     $97,500,000
  • 6. Scenarios - Web/App Analytics
       o Who are my visitors?
       o Where are they coming from?
       o How many unique visitors?
       o How many new users?
       o How many active users?
       o Which page lost the most users?
  • 7. Scenarios - User Behavior Analytics: Retention Analysis, Funnel Analysis
  • 8. Technical Principles
  • 9. Challenges with Exact Count Distinct
       o Approximate Count Distinct is easy: HyperLogLog
       o Exact Count Distinct is a big challenge for all query engines at massive scale
       Challenges
       o Bad performance: need to scan all data
       o Non-cumulative: hard to do rollup and/or operations
       o Hard to optimize on multiple columns
       o Analysis always requires more than one count distinct operation
  • 10. Count Distinct Performance on Different Platforms: Google BigQuery, Snowflake, Athena, Apache Kylin, Kyligence
  • 11. Kyligence = Kylin + Intelligence. Accelerate Critical Business Decisions with AI-Augmented Data Management and Analytics
       o Founded in 2016 by the creators of Apache Kylin
       o Built around Kylin, with augmented AI and enhanced to deliver unprecedented enterprise analytic performance
       o CRN Top-10 big data startups in 2018
       o Global Presence: San Jose, Seattle, New York, Shanghai, Beijing
       o VCs: Fidelity International, Shunwei Capital, Broadband Capital, Redpoint, Cisco, Coatue
       Timeline: 2016 Founded, Pre-A (Redpoint, Cisco); 2017 Series A (CBC, Shunwei); 2018 Series B (8Roads); 2019 Series C (Coatue)
  • 12. How Does Apache Kylin Achieve This?
       Pre-Aggregation
       o Pre-aggregate count distinct in cubes
       o Fetch results directly without on-the-fly calculations
       Bitmap
       o Supports rollup
       o Reduces memory/storage significantly
       Dictionary
       o Supports String type and detail queries
  • 13. Pre-Aggregation
       Source rows:
       Date        UID  Page
       2020-04-01  1    /kyligence
       2020-04-01  1    /Kyligence/Blog
       2020-04-01  2    /Kyligence/News
       2020-04-02  3    /Kyligence/Events
       2020-04-02  2    /Kyligence/Resources
       2020-04-02  1    /Kyligence/Downloads
       Pre-aggregated by date:
       Date        Count(UID)  Count(distinct UID)
       2020-04-01  3           2
       2020-04-02  3           3
       Rolled up across both dates:
       Date                       Count(UID)  Count(distinct UID)
       2020-04-01 and 2020-04-02  6           ??
  • 14. Bitmap
       Table (UID): 1, 2, 4, 5, 7, 9, 10, 11, 13
       Bitmap (bit positions 7 6 5 4 3 2 1 0):
       1 0 0 1 0 1 1 0
       0 0 1 0 1 1 1 0
       o Saves storage significantly
       o Supports logical operations directly
       o Contains information needed to do aggregation
       o RoaringBitmap
  • 15. Bitmap
       Source rows:
       Date        UID  Page
       2020-04-01  1    /kyligence
       2020-04-01  1    /Kyligence/Blog
       2020-04-01  2    /Kyligence/News
       2020-04-02  3    /Kyligence/Events
       2020-04-02  2    /Kyligence/Resources
       2020-04-02  1    /Kyligence/Downloads
       Pre-aggregated by date, storing plain counts:
       Date        Count(UID)  Count(distinct UID)
       2020-04-01  3           2
       2020-04-02  3           3
       Pre-aggregated by date, storing bitmaps:
       Date        Count(UID)  Count(distinct UID)
       2020-04-01  3           Bitmap(1,2)
       2020-04-02  3           Bitmap(1,2,3)
       Rolled up across both dates:
       Date                       Count(UID)  Count(distinct UID)
       2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
  • 16. Operations in Bitmap
       Two bitmaps, each containing a different data set: [1, 3, 4, 5] and [2, 3, 4, 6]
       o And: all elements contained in both bitmaps. [1, 3, 4, 5] and [2, 3, 4, 6] = [3, 4]. Scenarios: Retention Analysis, Funnel Analysis
       o Or: all elements in either bitmap. [1, 3, 4, 5] or [2, 3, 4, 6] = [1, 2, 3, 4, 5, 6]. Scenarios: Cross-Dimension Analysis
  • 17. Dictionary
       Bitmap can only support int values. How about String columns? Dictionary.
       Source rows:
       Date        USERNAME  Page
       2020-04-01  Alice     /kyligence
       2020-04-01  Alice     /Kyligence/Blog
       2020-04-01  Bob       /Kyligence/News
       2020-04-02  Coral     /Kyligence/Events
       2020-04-02  Bob       /Kyligence/Resources
       2020-04-02  Alice     /Kyligence/Downloads
       Dictionary:
       USERNAME  ENCODED
       Alice     1
       Bob       2
       Coral     3
       Pre-aggregated by date:
       Date        Count(UID)  Count(distinct UID)
       2020-04-01  3           Bitmap(1,2)
       2020-04-02  3           Bitmap(1,2,3)
       Rolled up across both dates:
       Date                       Count(UID)  Count(distinct UID)
       2020-04-01 and 2020-04-02  6           Bitmap(1,2,3)
  • 18. Use Cases
  • 19. Manbang Group
       o The largest Chinese truck logistics startup
       o 7 million+ trucks
       o 2.25 million active users
       o 8 apps and 10 TB+ data
       Requirements
       o Retention analysis on a wide range of dimensions and date ranges
       o Funnel analysis with the ability to customize funnels
       o User profile analysis
  • 20. Architecture with Apache Kylin
  • 21. Retention Analysis for Manbang Group
       o Users can choose any column and any date range to do the retention analysis
  • 22. Funnel Analysis for Manbang Group
       o Users can customize funnels with any number of steps
       o Can identify the specific users lost between steps
  • 23. DiDi
       o #1 ride-share company in China
       o 92 million monthly active users (as of Dec. 2019)
       o 24 million rides per day in 2019
       Requirements
       o User profile analysis
       o Precision marketing
  • 24. Scenarios - Apache Kylin in DiDi
       Areas: User Profile, Precision Marketing, User Behavior Analysis, User Tags, Workflow Analysis, Promotion Activity Analysis
       Precision Marketing
       o Send coupons to exact target users
       o Upgrade cars for specific users
       Promotion Activity Analysis
       o How many new/returning users are gained in this activity?
       o Which kind of users are most interested in this activity?
       Optimize User Experience
       o Which stages lost the most users?
       o How to increase customer stickiness?
  • 25. DiDi Kylin Usage: 200 TB+ data, 5,000+ cubes, 7,000+ jobs per day, 7 clusters
  • 26. Join the Community: https://github.com/apache/kylin, apache-kylin.slack.com, user@kylin.apache.org
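
The pre-aggregation, bitmap, and dictionary slides (13 to 17) can be sketched together as follows. This is a minimal illustration, not Apache Kylin's implementation: a plain Python integer stands in for the RoaringBitmap the slides mention, and the helper names (`build_dictionary`, `pre_aggregate`, `cardinality`, `members`) are hypothetical. The rows are the ones shown on the Dictionary slide.

```python
from collections import defaultdict

# Rows from the Dictionary slide: (date, username, page).
ROWS = [
    ("2020-04-01", "Alice", "/kyligence"),
    ("2020-04-01", "Alice", "/Kyligence/Blog"),
    ("2020-04-01", "Bob", "/Kyligence/News"),
    ("2020-04-02", "Coral", "/Kyligence/Events"),
    ("2020-04-02", "Bob", "/Kyligence/Resources"),
    ("2020-04-02", "Alice", "/Kyligence/Downloads"),
]


def build_dictionary(values):
    """Assign each distinct string a dense integer ID (from 1, first-seen order)."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    return dictionary


def pre_aggregate(rows, dictionary):
    """Build one bitmap per date; bit i is set when the user with ID i appeared."""
    bitmaps = defaultdict(int)
    for date, username, _page in rows:
        bitmaps[date] |= 1 << dictionary[username]
    return dict(bitmaps)


def cardinality(bitmap):
    """Number of set bits, which is the exact distinct count."""
    return bin(bitmap).count("1")


def members(bitmap):
    """Decode a bitmap back into the sorted list of IDs it contains."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


dictionary = build_dictionary(username for _date, username, _page in ROWS)
per_date = pre_aggregate(ROWS, dictionary)

# Rollup across both dates: OR the stored bitmaps instead of rescanning the rows.
rolled_up = per_date["2020-04-01"] | per_date["2020-04-02"]

print(dictionary)
print({date: members(bitmap) for date, bitmap in per_date.items()})
print(cardinality(rolled_up))
```

Adding the per-date distinct counts would give 2 + 3 = 5, which overcounts because Alice and Bob appear on both dates. Keeping the bitmap rather than the count is what makes the rollup exact.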

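The And/Or operations on slide 16 map directly onto bitwise operators. The sketch below reuses the two example sets from that slide and again uses a plain Python integer as an uncompressed bitmap; the names `to_bitmap`, `to_ids`, `step1`, and `step2` are hypothetical, and reading the two sets as consecutive funnel steps is an assumption made for illustration.

```python
def to_bitmap(ids):
    """Set bit i for every ID i in the input."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def to_ids(bitmap):
    """Decode a bitmap back into the sorted list of IDs it contains."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# The two data sets from the "Operations in Bitmap" slide.
step1 = to_bitmap([1, 3, 4, 5])
step2 = to_bitmap([2, 3, 4, 6])

# And: users present in both sets (retention and funnel analysis).
retained = step1 & step2

# Or: users present in either set (cross-dimension analysis).
either = step1 | step2

# And-not: users in the first set who are missing from the second,
# i.e. the specific users lost between two funnel steps.
lost = step1 & ~step2

print(to_ids(retained))
print(to_ids(either))
print(to_ids(lost))
```

Because each result is itself a bitmap, it can feed the next And/Or step, so a funnel with any number of steps is a chain of intersections.
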
Editor's notes

  1. UV/PV put some words in the slide
  2. Put a static image instead of gif
  3. Link And OR to analysis scenarios