SlideShare une entreprise Scribd logo
1  sur  25
Real-time Analytics with
Trino and Apache Pinot
Xiang Fu, Elon Azoulay
Oct 22, 2021
Speaker Intro
Xiang Fu
● Co-founder - StarTree
● Previously: Streaming Platform @ Uber
● PMC & Committer: Apache Pinot
Elon Azoulay
● Software Engineer - Stealth Mode Startup
● Previously: Data @ Facebook
Agenda
● Today’s Compromises: Latency vs. Flexibility
● Not trade-off using Apache Pinot
● Trino Pinot Connector
● Benchmark
Today’s Compromises: Latency vs Flexibility
Flexibility takes time: Join on the Fly
customers
orders
customers.state =
‘California’ AND
customers.gender
= ‘Female’
JOIN customers ON
(customers.customer_id
= orders.customer_id)
Group By
customers.city,
Month(orders.date)
sum(orders.amount)
FILTER JOIN GROUP BY AGGREGATION
- Flexible to do any computation
- High query cost: disk & network
I/O, Data Partitioning, Data Serde
ETL Trade-offs: Pre-joined Table
customers
orders
state =
‘California’
AND gender
= ‘Female’
JOIN customers ON
(customers.customer_id
= orders.customer_id)
Group By city,
Month(orders.
date)
sum(amount)
FILTER
JOIN GROUP BY AGGREGATION
user_orders
_joined
Pre-Joined
Table
- Flexible to explore user dimensions
- Query time is still proportional to the
data scan, not predictable
ETL Trade-offs: Pre-aggregated Table
state =
‘California’
AND gender
= ‘Female’
Group By city,
Month(orders.
date)
sum(sum_amount)
FILTER GROUP BY AGGREGATION
user_orders
_joined
Pre-Joined
Table
user_orders_
aggregated
Pre-Aggregated
Table
SELECT
sum(amount) as
sum_amount, date,
city GROUP BY
date, city
Aggregation
+ GroupBy
- Reduced query runtime workload
- Query time is still proportional to the
multiplication of non-groupBy columns
ETL Trade-offs: Pre-cubed Table
state =
‘California’
AND gender
= ‘Female’
sum_amount,
month, city
FILTER
user_orders
_joined
Pre-Joined
Table
user_orders
_cubed
Pre-Aggregated
Table
SELECT
sum(amount) as
sum_amount, date,
Month(date) as
month, city
GROUP BY CUBE
(date, city,
Month(date))
Cubing
PROJECTION
- Predictable query runtime
- Storage overhead: one raw record
translates to multiple records
- Dimension explosion
Fact Table
Dimension Table Pre-Join Pre-Aggregation Pre-Cube
Latency
Flexibility
low
high
low
high
Not to Trade-off Using Apache Pinot
Throughput high
low
User Facing Applications Business Facing Metrics
Apache Pinot Overview
Anomaly Detection
- Ingestion: Millions of events/sec
- Workload: Thousands of queries/sec
- Performance: Millisecond
- Operation: Thousands Nodes Cluster
ADLS
GCS
Real-Time Offline
BI Visualization Data Products Anomaly Detection
ADLS
GCS
Real-Time Offline
OLTP
Server1 Server2 Server3
Zookeeper
Broker 1 Broker 2
Controller
Secrets Behind Apache Pinot
Scan
Aggregation
Filter
Storage
Bloom
Filter
Inverted Index
Columnar Store
Byte
Encoding
Sorted
Index ❏Common Techniques
❏Pinot
Compression
Star-Tree Pre-aggregation
Star-
Tree
Index
Bit/RLE
Encoding
Per-segment flexible query planning
Range
Index
Text
Index
Apache Pinot - StarTree Index
• Configurable trade-off between latency and space by partial pre-aggregation
technique
• Be able to achieve a hard upper bound for query latencies
No pre-computation
Latency
Storage
Full Pre-Cube
(KV Store)
Partial pre-computation
(Startree Index)
T= 10000
T= 100
Trino Overview
Source: Trino Architecture
BI Visualization Data Products Anomaly Detection Ad hoc Analysis
ADLS
GCS
Trino Pinot Connector
Trino Pinot Connector
Trino Pinot Connector: Aggregation Pushdown
Chasing the light: Aggregation pushdown
- Issue single Pinot broker request
- Best-effort push down for aggregations like
count/sum/min/max/distinct/approximate_distinct, etc
- 10~100x latency improvement
Passthrough Broker Queries
SELECT CASE WHEN team = ‘Giants’ then ‘BIG’ else ‘SMALL’
END AS size, team, count(*) FROM
pinot.default.”SELECT team
FROM baseball_stats WHERE conference = ‘America East’”
GROUP BY CASE WHEN team = ‘Giants’ then ‘BIG’ else
‘SMALL’ END, 2
Group by Expression Support
Passthrough Broker Queries
SELECT team, count(*) FROM
pinot.default.”SELECT team, player
FROM baseball_stats WHERE conference = ‘America East’”
ORDER BY CASE WHEN team = ‘Giants’ then ‘BIG’ else
‘SMALL’ END, 2
Order by Expression Support
Trino Pinot Connector: Server Query + Pinot Streaming API
Pinot Streaming(gRpc) Connector
- Distributed workload in parallel among Trino workers
- Configurable memory footprint for data pulling from Pinot
- Open the gate of queries requires full table scan or join
Ongoing and Future Work on the Connector
● Data Insertion
○ Push segments to the controller
○ Adds or replaces segments.
Ongoing and Future Work on the Connector
● Pinot Segments Deletion
● Table & Column Creation/Alter/Drop
CREATE TABLE DIMTABLE
(LONG_COL bigint, STRING_COL varchar)
WITH (
PRIMARY_KEY_COLUMNS = ARRAY['long_col'],
OFFLINE_CONFIG = '{
"tableName": "dimtable",
"tableType": "OFFLINE",
"isDimTable": true,
"segmentsConfig": {
"segmentPushType": "REFRESH",
"replication": "1"
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"tableIndexConfig": {
"loadMode": "MMAP"
}
}’
);
Perf Benchmark
Benchmark Config:
- 1 Pinot Controllers (4 cores/8GB)
- 1 Pinot Brokers (4 cores/8GB)
- 3 Pinot Servers (4 cores/8GB)
- 1 Trino Coordinator (4 cores/8GB)
- 1 Trino Workers (4 cores/8GB)
Data Set:
- 40 Million rows data set
Query:
- Aggregation GroupBy + Predicate
Trino:
- Aggregation Pushdown enable/disable
Perf Benchmark
Query Type: Aggregation Group By + Predicate pushdown
Thank you
- Getting Started: https://tinyurl.com/trinoPinotTutorial
- Run Trino in Kubernetes: https://github.com/trinodb/charts
- StarTree: https://www.startree.ai/
- Apache Pinot: https://pinot.apache.org/
- Pinot on github: https://github.com/apache/pinot
- Pinot slack: https://tinyurl.com/pinotSlackChannel
- Apache Pinot Twitter: https://twitter.com/ApachePinot
- Apache Pinot Meetup: https://www.meetup.com/apache-pinot
- Starburst: https://www.starburst.io/
- Trino: https://trino.io/
- Trino on github: https://github.com/trinodb/trino
- Trino slack: https://trino.io/slack.html

Contenu connexe

Tendances

Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Mayank Shrivastava
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotXiang Fu
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino ProjectMartin Traverso
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 

Tendances (20)

Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 

Similaire à Real-time Analytics with Trino and Apache Pinot

Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaRicardo Bravo
 
A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...HostedbyConfluent
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationVengata Guruswamy
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache HiveMurtaza Doctor
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...confluent
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appNeil Avery
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019Intel® Software
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData
 
Comprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioComprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioFred Moyer
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleScyllaDB
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleSaurabh Verma
 
Dynamic Authorization & Policy Control for Docker Environments
Dynamic Authorization & Policy Control for Docker EnvironmentsDynamic Authorization & Policy Control for Docker Environments
Dynamic Authorization & Policy Control for Docker EnvironmentsTorin Sandall
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...DataStax
 

Similaire à Real-time Analytics with Trino and Apache Pinot (20)

Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Sprint 73
Sprint 73Sprint 73
Sprint 73
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...A Practical Deep Dive into Observability of Streaming Applications with Kosta...
A Practical Deep Dive into Observability of Streaming Applications with Kosta...
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Comprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioComprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istio
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
 
Dynamic Authorization & Policy Control for Docker Environments
Dynamic Authorization & Policy Control for Docker EnvironmentsDynamic Authorization & Policy Control for Docker Environments
Dynamic Authorization & Policy Control for Docker Environments
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Real-time Analytics with Trino and Apache Pinot

  • 1. Real-time Analytics with Trino and Apache Pinot Xiang Fu, Elon Azoulay Oct 22, 2021
  • 2. Speaker Intro Xiang Fu ● Co-founder - StarTree ● Previously: Streaming Platform @ Uber ● PMC & Committer: Apache Pinot Elon Azoulay ● Software Engineer - Stealth Mode Startup ● Previously: Data @ Facebook
  • 3. Agenda ● Today’s Compromises: Latency vs. Flexibility ● Not trade-off using Apache Pinot ● Trino Pinot Connector ● Benchmark
  • 5. Flexibility takes time: Join on the Fly customers orders customers.state = ‘California’ AND customers.gender = ‘Female’ JOIN customers ON (customers.customer_id = orders.customer_id) Group By customers.city, Month(orders.date) sum(orders.amount) FILTER JOIN GROUP BY AGGREGATION - Flexible to do any computation - High query cost: disk & network I/O, Data Partitioning, Data Serde
  • 6. ETL Trade-offs: Pre-joined Table customers orders state = ‘California’ AND gender = ‘Female’ JOIN customers ON (customers.customer_id = orders.customer_id) Group By city, Month(orders. date) sum(amount) FILTER JOIN GROUP BY AGGREGATION user_orders _joined Pre-Joined Table - Flexible to explore user dimensions - Query time is still proportional to the data scan, not predictable
  • 7. ETL Trade-offs: Pre-aggregated Table state = ‘California’ AND gender = ‘Female’ Group By city, Month(orders. date) sum(sum_amount) FILTER GROUP BY AGGREGATION user_orders _joined Pre-Joined Table user_orders_ aggregated Pre-Aggregated Table SELECT sum(amount) as sum_amount, date, city GROUP BY date, city Aggregation + GroupBy - Reduced query runtime workload - Query time is still proportional to the multiplication of non-groupBy columns
  • 8. ETL Trade-offs: Pre-cubed Table state = ‘California’ AND gender = ‘Female’ sum_amount, month, city FILTER user_orders _joined Pre-Joined Table user_orders _cubed Pre-Aggregated Table SELECT sum(amount) as sum_amount, date, Month(date) as month, city GROUP BY CUBE (date, city, Month(date)) Cubing PROJECTION - Predictable query runtime - Storage overhead: one raw record translates to multiple records - Dimension explosion
  • 9. Fact Table Dimension Table Pre-Join Pre-Aggregation Pre-Cube Latency Flexibility low high low high Not to Trade-off Using Apache Pinot Throughput high low
  • 10. User Facing Applications Business Facing Metrics Apache Pinot Overview Anomaly Detection - Ingestion: Millions of events/sec - Workload: Thousands of queries/sec - Performance: Millisecond - Operation: Thousands Nodes Cluster ADLS GCS Real-Time Offline
  • 11. BI Visualization Data Products Anomaly Detection ADLS GCS Real-Time Offline OLTP Server1 Server2 Server3 Zookeeper Broker 1 Broker 2 Controller
  • 12. Secrets Behind Apache Pinot Scan Aggregation Filter Storage Bloom Filter Inverted Index Columnar Store Byte Encoding Sorted Index ❏Common Techniques ❏Pinot Compression Star-Tree Pre-aggregation Star- Tree Index Bit/RLE Encoding Per-segment flexible query planning Range Index Text Index
  • 13. Apache Pinot - StarTree Index • Configurable trade-off between latency and space by partial pre-aggregation technique • Be able to achieve a hard upper bound for query latencies No pre-computation Latency Storage Full Pre-Cube (KV Store) Partial pre-computation (Startree Index) T= 10000 T= 100
  • 15. BI Visualization Data Products Anomaly Detection Ad hoc Analysis ADLS GCS Trino Pinot Connector
  • 17. Trino Pinot Connector: Aggregation Pushdown Chasing the light: Aggregation pushdown - Issue single Pinot broker request - Best-effort push down for aggregations like count/sum/min/max/distinct/approximate_distinct, etc - 10~100x latency improvement
  • 18. Passthrough Broker Queries SELECT CASE WHEN team = ‘Giants’ then ‘BIG’ else ‘SMALL’ END AS size, team, count(*) FROM pinot.default.”SELECT team FROM baseball_stats WHERE conference = ‘America East’” GROUP BY CASE WHEN team = ‘Giants’ then ‘BIG’ else ‘SMALL’ END, 2 Group by Expression Support
  • 19. Passthrough Broker Queries SELECT team, count(*) FROM pinot.default.”SELECT team, player FROM baseball_stats WHERE conference = ‘America East’” ORDER BY CASE WHEN team = ‘Giants’ then ‘BIG’ else ‘SMALL’ END, 2 Order by Expression Support
  • 20. Trino Pinot Connector: Server Query + Pinot Streaming API Pinot Streaming(gRpc) Connector - Distributed workload in parallel among Trino workers - Configurable memory footprint for data pulling from Pinot - Open the gate of queries requires full table scan or join
  • 21. Ongoing and Future Work on the Connector ● Data Insertion ○ Push segments to the controller ○ Adds or replaces segments.
  • 22. Ongoing and Future Work on the Connector ● Pinot Segments Deletion ● Table & Column Creation/Alter/Drop CREATE TABLE DIMTABLE (LONG_COL bigint, STRING_COL varchar) WITH ( PRIMARY_KEY_COLUMNS = ARRAY['long_col'], OFFLINE_CONFIG = '{ "tableName": "dimtable", "tableType": "OFFLINE", "isDimTable": true, "segmentsConfig": { "segmentPushType": "REFRESH", "replication": "1" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": { "loadMode": "MMAP" } }’ );
  • 23. Perf Benchmark Benchmark Config: - 1 Pinot Controllers (4 cores/8GB) - 1 Pinot Brokers (4 cores/8GB) - 3 Pinot Servers (4 cores/8GB) - 1 Trino Coordinator (4 cores/8GB) - 1 Trino Workers (4 cores/8GB) Data Set: - 40 Million rows data set Query: - Aggregation GroupBy + Predicate Trino: - Aggregation Pushdown enable/disable
  • 24. Perf Benchmark Query Type: Aggregation Group By + Predicate pushdown
  • 25. Thank you - Getting Started: https://tinyurl.com/trinoPinotTutorial - Run Trino in Kubernetes: https://github.com/trinodb/charts - StarTree: https://www.startree.ai/ - Apache Pinot: https://pinot.apache.org/ - Pinot on github: https://github.com/apache/pinot - Pinot slack: https://tinyurl.com/pinotSlackChannel - Apache Pinot Twitter: https://twitter.com/ApachePinot - Apache Pinot Meetup: https://www.meetup.com/apache-pinot - Starburst: https://www.starburst.io/ - Trino: https://trino.io/ - Trino on github: https://github.com/trinodb/trino - Trino slack: https://trino.io/slack.html

Notes de l'éditeur

  1. Realtime OLAP Database Columnar, Indexed Storage Low latency analytics Distributed – highly available, reliable, scalable Lambda architecture Offline data pushes Real-time stream ingestion Open Source
  2. Explain controller - single coordinator controlling all actions cluster state - to maintain partition assignment - partition to server mapping Cluster state maintains start and end offsets
  3. For certain query pattern (slice and dice on a given list of dimensions), we allow users to configure a upper bound of documents to scan. Pinot will intelligently partial pre-aggregate the records to achieve the requirement, but without exploding the storage
  4. Components: Coordinator (query endpoint, metadata), workers (process query results) Divides the work into splits which are processed in parallel Results returned from coordinator in final phase of processing
  5. Join 2 pinot tables
  6. Build Broker Query Pushdown filter, aggregation, limit Produce a single broker split Submit broker request Produce Results to Trino Process joins, other filters, aggregations and final limit Return results to client
  7. Build Broker Query Pushdown filter, aggregation, limit Produce a single broker split Submit broker request Produce Results to Trino Process joins, other filters, aggregations and final limit Return results to client
  8. First step is to get the metadata from the pinot controller, Talk about how this is configurable with cache ttl config
  9. Pinot - Fast single table OLAP Trino - Powerful connector ecosystem Complete system - covers entire landscape Get the best of Trino and Pinot Proven stack at Uber and many more
  10. Pinot - Fast single table OLAP Trino - Powerful connector ecosystem Complete system - covers entire landscape Get the best of Trino and Pinot Proven stack at Uber and many more