Case Study from Treasure Data
- How do users run queries on Trino?
Toru Takahashi
Director, Customer Experience
Treasure Data
Trino Japan Virtual Meetup
November 17, 2021
© 2021 Treasure Data, Inc. - Proprietary & Confidential
About Me
Toru Takahashi
- Twitter: @nora96o / GitHub: toru-takahashi
- Started as the first Technical Support Engineer at Treasure Data
- Current: Director, Customer Experience
- Disclaimer
- Treasure Data uses Trino 312/350 but keeps using “Presto” as its term for backwards compatibility
Today’s Talk
- What I will talk about today
- Share a case study of how our customers use Trino as part of a customer data platform
- Give some insights for when you start providing Trino to your internal/external end users
- What I will not talk about today
- Details of how our service is implemented on Trino
- The Trino configuration in our environment
Overall Architecture
How can users use Trino on TD?
Custom SQL Segment Builder
Trino Usage
- Number of queries
- Executed 280 million queries (number of unique query_id values) in 2020
- 350 million queries (1.25x YoY) are projected for 2021
- Roughly 1 million queries per day
- Number of users
- Total registered users: 50K
Let’s see some interesting usages
How do users configure a schedule?
Scheduling Queries
A scheduled query, which runs automatically once a schedule is set, is a common feature in many BI and data analytics services.
Examples: Redash, Treasure Data
How do users configure a schedule?
- 45% of scheduled queries run at exactly 00 minutes past the hour
- Users love daily, hourly, and every-30-minutes schedules
- We had expected a multi-tenant system to help spread out job execution. However, schedules are tightly associated with the data location
- Clusters always need to be sized for the daily peak workload because 1 hour is too short for auto-scaling
[Figure: how people configure a query schedule, by minute of the hour; 00 and 30 are labeled]
Side Note...
One good idea I have seen
- The Google Sheets trigger function is nice from a system workload point of view. Users specify a time period, not an exact time.
- But, generally speaking, providing this feature is a challenge…
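One way to picture the "period, not exact time" idea is the minimal sketch below. It is an illustration only, not how Google Sheets or Treasure Data implement scheduling: the function name `start_offset_seconds` and the `schedule-N` identifiers are hypothetical. Each schedule gets a stable offset inside its window, derived from a hash of its ID, so jobs that would all start at minute 00 are spread across the hour.

```python
import hashlib


def start_offset_seconds(schedule_id: str, window_seconds: int) -> int:
    """Map a schedule to a stable start offset inside its window.

    Hypothetical helper: the same schedule_id always gets the same offset,
    and different IDs are spread roughly evenly over the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(schedule_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then fold into the window.
    return int.from_bytes(digest[:8], "big") % window_seconds


# Spread 1,000 hourly schedules over a one-hour window instead of minute 00.
offsets = [start_offset_seconds(f"schedule-{i}", 3600) for i in range(1000)]
```

Hashing rather than random jitter keeps each job's start time predictable from run to run, which matters when downstream jobs depend on it.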
Query Execution Time
- 99% of jobs complete within 10 minutes; however, 0.005% of queries (hundreds of queries per day) run for over 2 hours.
- These long-running jobs are critical for cluster stability
- At the beginning, we had a 24-hour query execution time limit; however, it was too long. Thus, we changed it to 4 hours.
- Long-running jobs make rotating a cluster hard, because re-running a long-running job on retry may double its execution time
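As a rough illustration of the 4-hour limit, the sketch below flags running queries that have exceeded an execution time limit. The function name `queries_over_limit` and the query IDs are hypothetical. Trino itself has a `query.max-execution-time` configuration property for this kind of limit; the talk does not say which mechanism Treasure Data uses.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# 4-hour limit from the talk; treated here as a plain constant.
EXECUTION_LIMIT = timedelta(hours=4)


def queries_over_limit(
    started_at: Dict[str, datetime],
    now: datetime,
    limit: timedelta = EXECUTION_LIMIT,
) -> List[str]:
    """Return the IDs of queries that have been running longer than the limit."""
    return sorted(qid for qid, start in started_at.items() if now - start > limit)


now = datetime(2021, 11, 17, 12, 0, tzinfo=timezone.utc)
running = {
    "q-short": now - timedelta(minutes=5),  # hypothetical query IDs
    "q-long": now - timedelta(hours=5),
}
print(queries_over_limit(running, now))  # ['q-long']
```

The same check also hints at the rotation problem: a cluster cannot be drained safely until every query in this list has finished or been cancelled.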
SQL Text Length
- 96% of queries are shorter than 10,000 characters
- 0.1% of queries (hundreds at most) have over 1 million characters
= Text of over 1 megabyte is sent to Trino …
- Examples…
- WHERE col IN ('x', 'y', 'z', … millions of IN conditions)
- UNION ALL .. UNION ALL .. UNION ALL ... over 10,000 times
- Having a gateway in front of Trino that blocks a job by query text size
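A gateway check of this kind could look like the minimal sketch below. It is not Treasure Data's gateway: the 1,000,000-byte limit, `check_query_size`, and `QueryTooLargeError` are assumptions made for illustration. The check measures the UTF-8 byte length, since that is what travels over the wire, and rejects the query before it reaches Trino.

```python
MAX_QUERY_BYTES = 1_000_000  # hypothetical limit, not a value from the talk


class QueryTooLargeError(ValueError):
    """Raised when a query text exceeds the gateway's size limit."""


def check_query_size(sql: str, max_bytes: int = MAX_QUERY_BYTES) -> int:
    """Return the query size in bytes, or raise if it exceeds max_bytes."""
    size = len(sql.encode("utf-8"))
    if size > max_bytes:
        raise QueryTooLargeError(
            f"query text is {size} bytes; the limit is {max_bytes} bytes"
        )
    return size


# A query with a very long IN list, like the example on the slide.
huge_query = (
    "SELECT * FROM t WHERE col IN ("
    + ", ".join("'x'" for _ in range(300_000))
    + ")"
)
```

Calling `check_query_size(huge_query)` raises `QueryTooLargeError`, while a short query such as `SELECT 1` passes and returns its size.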
Measuring Presto Query Cost in $$$
- It’s important to understand how much money we are spending on processing individual Trino queries, even though our pricing model doesn’t bill for query processing.
- Calculated from the machine-hour usage of individual queries (e.g., CPU, memory hours, split execution time, network capacity occupancy hours), as well as more detailed stats, such as S3 GET count, processed rows and bytes, etc.
- It motivates us to improve customer queries through education, suggestions, etc.
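A per-query cost estimate along these lines might be sketched as follows. This is a simplified illustration, not Treasure Data's cost model: it covers only CPU hours, memory hours, and S3 GET count, and the `QueryStats` fields and the CPU and memory rates are hypothetical. The S3 GET price is the one quoted later in this talk ($0.0004 per 1,000 requests).

```python
from dataclasses import dataclass

S3_GET_USD_PER_1000 = 0.0004  # S3 GET price quoted later in the talk


@dataclass
class QueryStats:
    """Hypothetical per-query usage record."""
    cpu_hours: float
    memory_gb_hours: float
    s3_get_requests: int


def estimate_cost_usd(
    stats: QueryStats,
    cpu_hour_rate: float,
    memory_gb_hour_rate: float,
) -> float:
    """Sum compute, memory, and S3 GET costs for one query.

    The two rates are assumptions supplied by the caller; they depend on
    the instance types and pricing of your own environment.
    """
    compute = stats.cpu_hours * cpu_hour_rate
    memory = stats.memory_gb_hours * memory_gb_hour_rate
    s3_get = stats.s3_get_requests / 1000 * S3_GET_USD_PER_1000
    return compute + memory + s3_get


stats = QueryStats(cpu_hours=2.0, memory_gb_hours=10.0, s3_get_requests=1_000_000)
cost = estimate_cost_usd(stats, cpu_hour_rate=0.05, memory_gb_hour_rate=0.005)
```

With these made-up rates, the S3 GET share (about $0.40) dominates the compute and memory shares, which is the kind of finding that points at fragmented partitions.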
Re-Partitioning Is a Key to Performance and Cost
- Our tables require a specific key for partitioning, and users can’t manage the partitioning in detail. And we can’t regularly re-merge partitions due to the massive data volume
- Scanning customers’ fragmented S3 partitions was quite expensive in terms of S3 GET cost.
- We re-merge fragmented partitions based on findings from the query cost data
S3 Access Stability Improvement
- Switched our internal S3 access method from the old path-style requests:
https://s3.amazonaws.com/(bucket_name)/
to the virtual-hosted-style requests:
https://(bucket_name).s3.amazonaws.com/
- According to AWS, it enables various internal optimizations for managing S3 traffic.
- This improvement has stabilized the table scan performance of Presto.
- Cost efficiency
- S3 GET request cost: $0.0004 / 1,000 requests.
$X/day * 365 days * multiple regions >= $10K
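The difference between the two request styles on this slide can be shown with a small sketch that builds both URL forms for the same object. The helper names and the bucket and key values are hypothetical; in practice the style is usually selected through the S3 client's configuration rather than by building URLs by hand.

```python
from urllib.parse import quote


def path_style_url(bucket: str, key: str) -> str:
    """Old style: the bucket name is the first path segment."""
    return f"https://s3.amazonaws.com/{bucket}/{quote(key)}"


def virtual_hosted_url(bucket: str, key: str) -> str:
    """Virtual-hosted style: the bucket name is part of the host name."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# Hypothetical bucket and object key.
print(path_style_url("my-bucket", "data/part-0.gz"))
print(virtual_hosted_url("my-bucket", "data/part-0.gz"))
```

Because the bucket name becomes part of the host name, the virtual-hosted style only works for bucket names that are valid DNS labels.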
Learn from Our Customers
- Users will do things beyond your worst expectations.
- SQL is one of the most popular and common tools for accessing data. But people don’t yet know how to write efficient SQL.
- And inefficient queries are a huge pain from both the operations and cost points of view.
- In B2B, tightening an existing limit is much harder than relaxing one.
- My recommendation is to establish a strict limit at first, and then relax it gradually.
We are Hiring!
https://jobs.lever.co/treasure-data
In Japan, US, UK, Canada, +α

Introduction to Decentralized Applications (dApps)
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 

Learn from Case Study; How do people run query on Trino? / Trino japan virtual meetup

  • 1. Case Study from Treasure Data - How do users run query on Trino?
       Toru Takahashi, Director, Customer Experience, Treasure Data
       Trino Japan Virtual Meetup, November 17, 2021
  • 2. About Me
       © 2021 Treasure Data, Inc. - Proprietary & Confidential
       - Toru Takahashi
       - Twitter @nora96o / GitHub: toru-takahashi
       - Started as the 1st Technical Support Engineer at Treasure Data
       - Current: Director, Customer Experience
       - Disclaimer: Treasure Data uses Trino 312/350 but keeps using “Presto” as its term for backwards compatibility
  • 3. Today’s Talk
       - What I will talk about today
         - A case study of our customers’ Trino usage on a customer data platform
         - Some insights for when you start providing Trino to your internal/external end users
       - What I will not talk about today
         - Details of how our service is implemented with Trino
         - Trino configuration in our environment
  • 5. How can users use Trino on TD?
       - Custom SQL
       - Segment Builder
  • 6. Trino Usage
       - Number of queries
         - Executed 280 million queries (number of unique query_id values) in 2020; 350 million queries (1.25x YoY) are expected in 2021
         - Roughly 1 million queries per day
       - Number of users
         - 50K total registered users
  • 7. Let’s see some interesting usages
  • 8. Scheduling Queries: How do users configure a schedule?
       - Scheduled queries, which run automatically once a schedule is set, are a common feature of many BI / data analytics services.
       - Ex. Redash, Treasure Data
  • 9. How do users configure a schedule?
       The graph shows how people configure a query schedule (marked minutes: 00 and 30).
       - 45% of scheduled queries run at exactly 00 minutes
       - Users love daily / hourly / every-30-minutes schedules
       - We had expected the multi-tenant system to help spread out job execution. However, schedules are tightly associated with the data location
       - Clusters always need to adjust to the daily peak workload, because 1 hour is too short for auto-scaling
  • 10. Side Note... One Good Idea I Have Seen
        - The Google Sheets trigger function is nice from a system workload point of view: users specify a time period, not an exact time.
        - But generally speaking, providing this feature is a challenge…
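
The "specify a period, not an exact time" idea from slide 10 can be sketched as follows. This is a minimal illustration, not Treasure Data's or Google's implementation: the function name `start_offset_seconds`, the schedule IDs, and the use of SHA-256 are assumptions made for the example. Each schedule is mapped to a stable offset inside its window, so jobs declared as "hourly" spread across the hour instead of all starting at minute 00.

```python
import hashlib


def start_offset_seconds(schedule_id: str, window_seconds: int) -> int:
    """Return a stable start offset (in seconds) inside the schedule window.

    The same schedule_id always yields the same offset, so a job runs at a
    predictable time, while different schedules land at different points in
    the window. Hashing is an illustrative choice, not a documented method.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(schedule_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and fold it
    # into the window length.
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical schedule IDs; a one-hour window is 3600 seconds.
    for i in range(5):
        schedule_id = f"schedule-{i}"
        offset = start_offset_seconds(schedule_id, 3600)
        print(f"{schedule_id}: starts {offset // 60:02d}m{offset % 60:02d}s into the hour")
```

A hash-based offset needs no shared state between schedulers; the trade-off is that users no longer control the exact minute, which is part of why offering such a feature is hard.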
  • 11. Query Execution Time
        - 99% of jobs complete within 10 minutes; however, 0.005% of queries (hundreds of queries per day) run for over 2 hours.
        - These long-running jobs are critical for cluster stability
        - At the beginning, we had a 24-hour query execution time limit, but it was too long, so we changed it to 4 hours.
        - This makes rotating a cluster hard, because a long-running job that is retried may take double the execution time
  • 12. SQL Text Length
        - 96% of queries are shorter than 10,000 characters
        - 0.1% of queries (hundreds at most) have over 1 million characters, i.e. text of over 1 megabyte is sent to Trino…
        - Examples:
          - WHERE col IN ('x', 'y', 'z', … millions of IN conditions)
          - UNION ALL … UNION ALL … UNION ALL … over 10,000 times
        - Countermeasure: a gateway in front of Trino that blocks a job by query text size
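
A gateway check of the kind slide 12 mentions might look like the sketch below. It is only an illustration under stated assumptions: the limit `MAX_QUERY_BYTES`, the exception `QueryTooLargeError`, and the function `validate_query_size` are hypothetical names, and the talk does not say what threshold or mechanism Treasure Data uses.

```python
MAX_QUERY_BYTES = 1_000_000  # hypothetical limit; tune it to your own workload


class QueryTooLargeError(ValueError):
    """Raised when the SQL text exceeds the configured size limit."""


def validate_query_size(sql: str, max_bytes: int = MAX_QUERY_BYTES) -> int:
    """Return the UTF-8 size of the query, or raise if it is over the limit.

    Size is measured in encoded bytes rather than characters, because bytes
    are what actually travel to the query engine.
    """
    size = len(sql.encode("utf-8"))
    if size > max_bytes:
        raise QueryTooLargeError(
            f"query text is {size} bytes; the limit is {max_bytes} bytes"
        )
    return size


if __name__ == "__main__":
    print(validate_query_size("SELECT 1"))

    # A generated IN list of the kind described on the slide.
    in_list = ", ".join(f"'{i}'" for i in range(200_000))
    huge_query = f"SELECT * FROM t WHERE col IN ({in_list})"
    try:
        validate_query_size(huge_query)
    except QueryTooLargeError as exc:
        print(f"rejected: {exc}")
```

Rejecting the text before it reaches the coordinator keeps the check cheap: no parsing or planning happens for queries that would be refused anyway.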
  • 13. Measuring Presto Query Cost in $$$
        - It’s important to understand how much money we spend processing individual Trino queries, even though our pricing model doesn’t bill for query processing.
        - Cost is calculated from the machine-hour usage of individual queries (e.g., CPU, memory hours, split execution time, network capacity occupancy hours), as well as more detailed stats such as S3 GET count, processed rows and bytes, etc.
        - It motivates us to improve customer queries through education, suggestions, etc.
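
The per-query cost idea from slide 13 can be illustrated with a small sketch. The talk does not give the actual formula, so everything here is an assumption: the `QueryStats` and `Rates` types, the choice of three cost components, and the CPU and memory unit prices are hypothetical. Only the S3 GET price ($0.0004 per 1,000 requests) is the figure quoted on slide 15.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    """Resource usage recorded for one query (hypothetical schema)."""
    cpu_hours: float
    memory_gb_hours: float
    s3_get_requests: int


@dataclass(frozen=True)
class Rates:
    """Unit prices in USD. CPU and memory values are placeholders."""
    usd_per_cpu_hour: float = 0.05          # hypothetical
    usd_per_memory_gb_hour: float = 0.005   # hypothetical
    usd_per_1000_s3_gets: float = 0.0004    # figure quoted in the talk


def estimate_query_cost_usd(stats: QueryStats, rates: Rates = Rates()) -> float:
    """Sum the compute, memory, and S3 GET components of a query's cost."""
    compute = stats.cpu_hours * rates.usd_per_cpu_hour
    memory = stats.memory_gb_hours * rates.usd_per_memory_gb_hour
    s3_gets = stats.s3_get_requests / 1000 * rates.usd_per_1000_s3_gets
    return compute + memory + s3_gets


if __name__ == "__main__":
    stats = QueryStats(cpu_hours=2.0, memory_gb_hours=10.0, s3_get_requests=1_000_000)
    print(f"estimated cost: ${estimate_query_cost_usd(stats):.4f}")
```

Even a rough model like this makes it possible to rank queries by cost and decide which customers to approach with tuning suggestions.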
  • 14. Re-Partitioning Is a Key to Performance and Cost
        - Our tables require a specific key for partitioning, and users can’t manage the partitioning in detail. Also, we can’t regularly re-merge partitions because of the massive data volume
        - Scanning customers’ fragmented S3 partitions was quite expensive in terms of S3 GET cost.
        - We re-merge fragmented partitions based on findings from the query cost data
  • 15. S3 Access Stability Improvement
        - Switched our internal S3 access method from the old path-style requests, https://s3.amazonaws.com/(bucket_name)/, to virtual-hosted-style requests, https://(bucket_name).s3.amazonaws.com/
        - According to AWS, this enables various internal optimizations for managing S3 traffic.
        - This improvement has stabilized the table scan performance of Presto.
        - Cost efficiency
          - S3 GET request cost: $0.0004 / 1000 requests. $X/day * 365d * multiple regions >= $10K
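
The difference between the two S3 request styles on slide 15 is only where the bucket name sits in the URL. The sketch below rewrites a path-style URL into the virtual-hosted style using the Python standard library. The function name `to_virtual_hosted` and the example bucket and key are hypothetical; in practice the request style is normally selected through S3 client configuration rather than by rewriting URLs by hand.

```python
from urllib.parse import urlsplit, urlunsplit


def to_virtual_hosted(path_style_url: str) -> str:
    """Move the bucket name from the first path segment into the hostname.

    https://s3.amazonaws.com/<bucket>/<key>
        becomes
    https://<bucket>.s3.amazonaws.com/<key>
    """
    parts = urlsplit(path_style_url)
    # The first path segment is the bucket; the rest is the object key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("path-style URL must contain a bucket name")
    return urlunsplit(
        (parts.scheme, f"{bucket}.{parts.netloc}", "/" + key, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Hypothetical bucket and object key.
    print(to_virtual_hosted("https://s3.amazonaws.com/my-bucket/data/part-0.gz"))
```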
  • 16. Learn from Our Customers
        - Users will do things beyond your worst expectations.
        - SQL is one of the most popular and common tools for accessing data, but people don’t yet know how to write efficient SQL.
        - Inefficient queries become a huge pain from the operations and cost points of view.
        - In B2B, tightening an existing limit is much harder than relaxing one.
        - My recommendation is to establish a strict limit at first and then relax it gradually.
  • 17. We are Hiring!
        - https://jobs.lever.co/treasure-data
        - In Japan, US, UK, Canada, +α