1. Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
2. Taro L. Saito @taroleo
• 2007: Ph.D., University of Tokyo
– XML DBMS, transaction processing
• Relational-Style XML Query [SIGMOD 2008]
• ~ 2014: Assistant Professor at University of Tokyo
– Genome science research
• Distributed computing, personal genome analysis
• March 2014 ~: Treasure Data
– Software Engineer, MPP Team Leader
• Open-source projects on GitHub
– snappy-java, msgpack-java, sqlite-jdbc
– sbt-pack, sbt-sonatype, larray
– silk
• Distributed workflow engine
3. Treasure Data Architecture
• (diagram) Queries arrive through the TD API / Web Console
– Hive: batch queries
– Presto: interactive queries
• Both access PlazmaDB (MessagePack columnar storage) through the td-presto connector
4. What is Presto?
• A distributed SQL engine developed by Facebook
– For interactive analysis on petabyte-scale datasets
– A replacement for Hive
– Nov. 2013: open sourced on GitHub
• Presto
– Written in Java
– In-memory query layer
– CPU-efficient for ad-hoc analysis
– Based on ANSI SQL
– Isolates the query layer from the storage access layer
• A connector provides data access (reading schemas and records)
5. Presto: Distributed SQL Engine
• (diagram) Hive is fault tolerant and tailored to throughput; Presto is CPU-intensive with faster response times
• TD Presto adds its own query retry mechanism
9. MessageBuffer
• msgpack-java v0.6 was the bottleneck
– Inefficient buffer access
• v0.7: fast memory access via sun.misc.Unsafe
– Direct access to heap memory
– Extracts primitive values from byte[] with a cast
– No boxing
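The Unsafe-based access described above can be sketched as follows. This is a hedged illustration, not the actual msgpack-java code: it reads a big-endian int directly out of a `byte[]` without a ByteBuffer and without boxing.

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

// Illustrative sketch of v0.7-style buffer access: extract a primitive
// int from byte[] via sun.misc.Unsafe (class and method names here are
// our own, not msgpack-java's).
public class UnsafeReadDemo {
    static final Unsafe UNSAFE;
    static final long BYTE_ARRAY_OFFSET;
    static {
        try {
            // sun.misc.Unsafe has no public constructor; grab the singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Read a big-endian (MessagePack byte order) int at the given index.
    static int getInt(byte[] buf, int index) {
        int v = UNSAFE.getInt(buf, BYTE_ARRAY_OFFSET + index);
        // Unsafe reads in native byte order; swap on little-endian CPUs.
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                ? Integer.reverseBytes(v) : v;
    }

    public static void main(String[] args) {
        byte[] buf = {0x00, 0x00, 0x00, 0x2A}; // 42 in big-endian
        System.out.println(getInt(buf, 0));    // prints 42
    }
}
```

Note that `sun.misc.Unsafe` is an internal JDK API; it works on Java 8 and remains reachable on later JDKs only through the `jdk.unsupported` module.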
10. Unsafe memory access performance is comparable to C
• http://frsyuki.hatenablog.com/entry/2014/03/12/155231
11. Why is ByteBuffer slow?
• It follows good programming practice
– Define an interface, then implement classes
• The ByteBuffer API has HeapByteBuffer and DirectByteBuffer implementations
• In reality, the TypeProfile slows down method access
– The JVM generates a look-up table of method implementations
– Simply loading more than one implementation class generates a TypeProfile
• v0.7 avoids TypeProfile generation
– Loads a single implementation class through reflection
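The reflection trick above can be sketched like this. It is a hypothetical illustration (the interface and class names are ours, not msgpack-java's): by choosing and loading only one concrete implementation at startup, every call site stays monomorphic, so the JIT can inline it without a TypeProfile lookup.

```java
// Sketch of the v0.7 approach: select ONE buffer implementation via
// reflection so the JVM never loads a second implementation class and
// call sites remain monomorphic.
public class BufferLoader {
    public interface Buffer { int getInt(int index); }

    public static Buffer newBuffer(byte[] data) {
        try {
            // Only this one class is ever loaded; loading a second
            // implementation (e.g. a direct-memory one) would make the
            // getInt() call site polymorphic.
            Class<?> impl = Class.forName(chooseImplName());
            return (Buffer) impl.getConstructor(byte[].class).newInstance((Object) data);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static String chooseImplName() {
        // In msgpack-java the choice depends on the platform; here we
        // always pick the heap-based implementation.
        return "BufferLoader$HeapBuffer";
    }

    public static class HeapBuffer implements Buffer {
        private final byte[] data;
        public HeapBuffer(byte[] data) { this.data = data; }
        public int getInt(int index) {
            // Big-endian read, as MessagePack requires.
            return ((data[index] & 0xFF) << 24) | ((data[index + 1] & 0xFF) << 16)
                 | ((data[index + 2] & 0xFF) << 8) | (data[index + 3] & 0xFF);
        }
    }

    public static void main(String[] args) {
        Buffer b = newBuffer(new byte[]{0, 0, 0, 42});
        System.out.println(b.getInt(0)); // prints 42
    }
}
```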
12. Format Type Detection
• MessageUnpacker
– Reads a 1-byte prefix
– Detects the format type
• switch-case
– ANTLR generates this type of code
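The detection step can be sketched as a switch over the prefix byte. This is a simplified illustration covering only a few representative format bytes from the MessagePack specification; the real MessageUnpacker handles all 256 values, typically via a generated table.

```java
// Sketch of MessagePack format detection from the 1-byte prefix.
// Ranges and constants follow the MessagePack spec; the method name
// and string labels are illustrative.
public class FormatDetector {
    static String detect(byte prefix) {
        int b = prefix & 0xFF;
        if (b <= 0x7F) return "POSFIXINT";              // 0x00 - 0x7f
        if (b >= 0xE0) return "NEGFIXINT";              // 0xe0 - 0xff
        if (b >= 0x80 && b <= 0x8F) return "FIXMAP";    // 0x80 - 0x8f
        if (b >= 0x90 && b <= 0x9F) return "FIXARRAY";  // 0x90 - 0x9f
        if (b >= 0xA0 && b <= 0xBF) return "FIXSTR";    // 0xa0 - 0xbf
        switch (b) {
            case 0xC0: return "NIL";
            case 0xC2: case 0xC3: return "BOOLEAN";
            case 0xCA: return "FLOAT32";
            case 0xCB: return "FLOAT64";
            case 0xCC: return "UINT8";
            case 0xCD: return "UINT16";
            case 0xCE: return "UINT32";
            case 0xCF: return "UINT64";
            default:   return "OTHER"; // remaining formats omitted here
        }
    }

    public static void main(String[] args) {
        System.out.println(detect((byte) 0x2A)); // prints POSFIXINT
        System.out.println(detect((byte) 0xC0)); // prints NIL
    }
}
```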
16. Claremont Report on Database Research
• Discussion on the future of DBMSs
– Top researchers, vendors, and practitioners
– CACM, Vol. 52 No. 6, 2009
• Predicts the emergence of Cloud Data Services
– SQL has an important role
• Limited functionality
• Well suited to service providers
– A difficult example: Spark
• Needs a secure application container to run arbitrary Scala code
17. Beckman Report on Database Research
• 2013
– http://beckman.cs.wisc.edu/beckman-report2013.pdf
– Topics in Big Data
• End-to-end service
– From data collection to knowledge
• Cloud services have become popular
– IaaS, PaaS, SaaS
– The challenge is migrating all of the functionality of a DBMS into the cloud
18. Big Data Simplified: The Treasure Data Approach
• (diagram) Multi-structured events (register, login, start_event, purchase, etc.) are collected by mobile SDKs, a web SDK, embedded SDKs, and server-side agents (Treasure Agent)
• Events land in an infinite and economical cloud data store
– App log data, mobile event data, sensor data, telemetry
• SQL-based ad-hoc queries and dashboards: familiar and table-oriented
• Results are pushed to app servers, DBs & data marts, and other apps
19. Challenges in Database as a Service
• Tradeoffs
– Cost vs. service level objectives (SLOs)
– (diagram) Running each query set on an independent cluster is fast but costly ($$$); running all queries together on the smallest possible cluster is reasonably priced but gives only a limited performance guarantee
• Reference
– Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD 2013 Tutorial]
20. Shift of Presto Query Usage
• Initial phase
– Trial and error with queries
• Many syntax errors and semantic errors
• Next phase
– Scheduled query execution
• Increased Presto query usage
– Some customers submit more than 1,000 Presto queries per day
– Typical query patterns become established
• Hourly and daily reports
• Query templates
• Advanced phase: more elaborate data analysis
– Complex queries
• Written by data scientists and data analysts
– High resource usage
24. Query Collection in TD
• SQL query logs
– Query text, detailed query plan, elapsed time, processed rows, etc.
• Presto itself is used to analyze the query history
29. Collecting Recoverable Error Patterns
• Presto has no fault tolerance
• Error types
– User errors
• Syntax errors: SQL syntax, missing functions
• Semantic errors: missing tables/columns
– Insufficient resources
• Exceeded task memory size
– Internal failures (TD Presto retries these queries)
• I/O errors (S3/Riak CS)
• Worker failures
• etc.
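The retry policy implied by this classification can be sketched as follows. This is an illustration of the idea only; the categories and names are ours, not Presto's actual exception hierarchy.

```java
// Sketch of the retry decision described above: only internal failures
// (transient I/O or worker errors) are retried; user errors and
// resource-limit errors are returned to the customer as-is.
public class RetryPolicy {
    enum ErrorType { USER_ERROR, INSUFFICIENT_RESOURCE, INTERNAL_FAILURE }

    static boolean shouldRetry(ErrorType error, int attempt, int maxAttempts) {
        // Syntax/semantic errors and memory-limit errors would just fail
        // again, so retrying them wastes cluster resources.
        return error == ErrorType.INTERNAL_FAILURE && attempt < maxAttempts;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(ErrorType.INTERNAL_FAILURE, 1, 3)); // true
        System.out.println(shouldRetry(ErrorType.USER_ERROR, 1, 3));       // false
        System.out.println(shouldRetry(ErrorType.INTERNAL_FAILURE, 3, 3)); // false
    }
}
```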
30. Query Retry on Internal Errors
• More than 99.8% of queries finish without errors
31. Query Retry on Internal Errors (log scale)
• With retries, queries eventually succeed
32. Multi-tenancy: Resource Allocation
• Price-plan-based resource allocation
• Parameters
– The number of worker nodes to use (min-candidates)
– The number of hash partitions (initial-hash-partitions)
– The maximum number of running tasks per account
• If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
• Presto: SqlQueryExecution class
– Controls the query execution state: planning -> running -> finished
• No resource allocation policy
– Our extended TDSqlQueryExecution class monitors running tasks and limits resource usage
• SqlQueryExecutionFactory is rewritten at run time using the ASM library
33. Query Queue
• Presto 0.97
– Introduces per-user query queues
• Can limit the number of concurrent queries per user
• Problem
– Running too many queries at once degrades overall query performance
34. Customer Feedback
• One piece of feedback:
– We don't care if large queries take a long time
– But interactive queries should run immediately
• Challenges
– How do we allocate resources when preceding queries already occupy the customer's share of resources?
– How do we know whether a submitted query is an interactive one?
35. Admission Control is Necessary
• Adjust resource utilization
– Running drivers (splits)
– MPL (multiprogramming level)
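One simple form of MPL-based admission control can be sketched with a semaphore: cap how many drivers/splits run concurrently and make further work wait. This is an illustration of the concept, not Presto's actual scheduler.

```java
import java.util.concurrent.Semaphore;

// Sketch of admission control: a semaphore bounds the multiprogramming
// level (MPL) so at most `mpl` splits run at once; extra splits block
// until a slot frees up.
public class AdmissionControl {
    private final Semaphore slots;

    public AdmissionControl(int mpl) {
        this.slots = new Semaphore(mpl);
    }

    // Run a split only when a slot is available; the caller blocks otherwise.
    public void runSplit(Runnable split) throws InterruptedException {
        slots.acquire();
        try {
            split.run();
        } finally {
            slots.release(); // always free the slot, even if the split fails
        }
    }

    public int availableSlots() {
        return slots.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        AdmissionControl ac = new AdmissionControl(2);
        ac.runSplit(() -> System.out.println("split done"));
        System.out.println(ac.availableSlots()); // prints 2 (slot released)
    }
}
```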
36. Challenge: Auto Scaling
• Sizing the cluster for peak usage is expensive
• But predicting customer usage is difficult
37. Typical Query Patterns [Li Juang]
• Q: What are a customer's typical queries?
– Customers feel that some queries are slow
– But we don't know what to compare them with, except scheduled queries
• Approach: clustering customer SQL
• TF-IDF measure: TF x IDF vector
– Split SQL statements into tokens
– Term frequency (TF) = the number of occurrences of a term in a query
– Inverse document frequency (IDF) = log(# of queries / # of queries that contain the token)
• k-means clustering
– Over TF-IDF vectors
– Generates clusters of similar queries
• x-means clustering decides the number of clusters automatically
– D. Pelleg [ICML 2000]
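The TF-IDF step above can be sketched as follows. This is a minimal illustration: tokenization is a naive split on non-identifier characters rather than a real SQL lexer, and the k-means/x-means clustering over the resulting vectors is omitted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TF-IDF vectorization of SQL queries: TF = term count in one
// query, IDF = log(#queries / #queries containing the term).
public class SqlTfIdf {
    static List<String> tokenize(String sql) {
        return Arrays.asList(sql.toLowerCase().split("[^a-z0-9_]+"));
    }

    // TF-IDF vector of one query against a corpus of queries.
    static Map<String, Double> tfIdf(String query, List<String> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(query)) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long docsWithTerm = corpus.stream()
                    .filter(q -> tokenize(q).contains(e.getKey())).count();
            double idf = Math.log((double) corpus.size() / Math.max(1, docsWithTerm));
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "select count(1) from access",
                "select count(1) from purchase",
                "select user_id from access where time > 0");
        Map<String, Double> v = tfIdf(corpus.get(0), corpus);
        // "select" appears in every query, so its IDF (and TF-IDF) is 0,
        // while rarer tokens like "access" get positive weight.
        System.out.println(v.get("select")); // prints 0.0
    }
}
```

Tokens shared by every query (select, from, …) get weight 0, so clusters form around the distinctive parts of each query, such as table names and filter conditions.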
38. Problematic Queries
• 90% of queries finish within 2 minutes
– But the remaining 10% is still a lot
• 10% of 10,000 queries is 1,000
• Long-running queries
• Hog queries
39. Long-Running Queries
• Typical bottlenecks
– Cross joins
– IN (a, b, c, …)
• The semi-join filtering process is slow
– Complex scan conditions
• Pushing down selections
• But this delays the column scan
– Tuple materialization
• The coordinator generates JSON data
– Many aggregation columns
• group by 1, 2, 3, 4, 5, 6, …
– Full scans
• Scanning 100 billion rows…
• Adding more resources does not always make queries faster
• Storing intermediate data to disk is necessary
• (diagram) Scans are fast, but materialization is a slow process; results are buffered while waiting to be fetched
40. Hog Queries
• Queries consuming a lot of CPU/memory resources
– Term coined in S. Krompass et al. [EDBT 2009]
• Example:
– select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all
– …
– (up to 190 days)
• More than 1,000 query stages
• Presto tries to run all of the stages at once
– High CPU usage at the coordinator
41. Query Rewriting? Plan Optimization?
• Query rewriting (better)
– Using group by and window functions
– Not a perfect solution
• Requires understanding the meaning of the query
• Semantic changes are not allowed
– e.g., we cannot rewrite UNION to UNION ALL
– UNION includes duplicate elimination
• Workaround idea
– Bushy plan -> deep plan
– Introduce stage-wise resource assignment
42. Future Work
• Reducing queuing/response time
– Introducing a shared queue between customers
• To utilize remaining cluster resources
– Fair scheduling: C. Gupta [EDBT 2009]
– Self-tuning DBMS: S. Chaudhuri [VLDB 2007]
• Adjusting running query size (hard)
– Keeping driver resources as small as possible for hog queries
– Query-plan-based cost estimation
• Predicting query running time
– J. Duggan [SIGMOD 2011], A.C. Konig [VLDB 2011]
43. Summary: Treasures in Treasure Data
• Treasures for our customers
– Data collected by fluentd (td-agent)
– The query analysis platform
– Query results: the values they need
• For Treasure Data
– SQL query logs
• Stored in Treasure Data itself
– We know how customers use SQL
• Typical queries and failures
– We know which parts of queries can be improved