Presto meetup 2015-03-19 @Facebook
1. Copyright ©2015 Treasure Data. All Rights Reserved.
Presto as a Service
Tips for operation and monitoring
Dongmin Yu
Treasure Data, Inc.
min@treasure-data.com
JeroMQ / ZeroMQ committer & maintainer
Mar 19, 2015
Presto Meetup @ Facebook
Topics
• Presto as a Service in Treasure Data
– Error Recovery
– Presto Deployment
• Tips for Monitoring Presto
– JSON API
– Presto + Fluentd
• Custom changes
[Diagram: Treasure Data architecture]
• TD API / Web Console
– Interactive query: Presto
– Batch query: Hive
• td-presto connector
• PlazmaDB: MessagePack Columnar Storage
Deployment
• Building Presto takes more than 20 minutes.
• Facebook frequently releases new versions
• Let CircleCI build Presto
– Deploy jar files to private Maven repository
– We sometimes use non-release versions
• for fixing serious bugs
• hot-fix patches
• Integration Test
– td-presto connector
• PlazmaDB, Multi-tenant query scheduler
• Query optimizer
– Run test queries on staging cluster
– Presto Verifier
Production: Blue-Green Deployment
• http://martinfowler.com/bliki/BlueGreenDeployment.html
• 2 Presto Coordinators (Blue/Green)
– Route Presto queries to the active cluster
– No down-time upon deployment
• Launch Presto worker instances with Chef (less than 5 min. on AWS)
• The inactive cluster is used for pre-production testing and customer support
– Investigation and tuning of customer query performance
– Troubleshooting
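The routing idea above can be sketched as a small switch between two coordinators. This is only an illustration, not TD's router: the class name, the URLs, and the `switch` method are hypothetical, and a real deployment would verify the inactive cluster before flipping.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Routes queries to whichever coordinator is currently active (hypothetical)."""
    blue_url: str
    green_url: str
    active: str = "blue"

    def active_url(self) -> str:
        # Production queries go here.
        return self.blue_url if self.active == "blue" else self.green_url

    def inactive_url(self) -> str:
        # Pre-production testing and customer-support investigation go here.
        return self.green_url if self.active == "blue" else self.blue_url

    def switch(self) -> str:
        """Flip the active side once the newly deployed cluster is ready."""
        self.active = "green" if self.active == "blue" else "blue"
        return self.active_url()


# Placeholder host names, not real endpoints.
router = BlueGreenRouter("http://presto-blue.example:8080",
                         "http://presto-green.example:8080")
```

Because new queries only follow `active_url()`, the other cluster can be upgraded and tested without interrupting production traffic.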
Error Recovery
• Presto has no fault tolerance
• Error types
– User error
• Syntax errors
– SQL syntax, missing function
• Semantic errors
– missing tables/columns
– Insufficient resource
• Exceeded task memory size
– Internal failure
• I/O error
– S3/Riak CS
• worker failure
• etc.
Worth A Retry!
Query Retry Patterns used in TD
• Error code + message pattern
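A minimal sketch of matching on error code plus message pattern might look like the following. The actual patterns used in TD are not shown in the slides, so the error type names, the regular expressions, and the `QueryError` class are all assumptions made for illustration.

```python
import re


class QueryError(Exception):
    """Hypothetical error carrying an error type and a message."""

    def __init__(self, error_type, message):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


# Illustrative (error type, message regex) pairs; not the real TD list.
RETRYABLE = [
    ("INTERNAL_ERROR", re.compile(r"i/o error|connection reset|timed? ?out", re.IGNORECASE)),
    ("INTERNAL_ERROR", re.compile(r"worker .*(failed|gone)", re.IGNORECASE)),
]


def is_retryable(error_type, message):
    """True when both the error type and the message match a known pattern."""
    return any(error_type == t and p.search(message) for t, p in RETRYABLE)


def run_with_retry(run_query, sql, max_attempts=3):
    """Re-run the query on retryable failures; re-raise everything else at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql)
        except QueryError as err:
            if attempt == max_attempts or not is_retryable(err.error_type, err.message):
                raise
```

User errors such as syntax or missing-column failures never match a pattern, so they surface immediately instead of being retried.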
Monitoring Presto
• REST API for monitoring Presto state
– JSON format
• (presto server IP):8080/v1/query
– List of recent queries (BasicQueryInfo class)
• (presto server IP):8080/v1/query/(query id)
– Detailed query state information
– Query plan, tasks and running worker IDs
– Processed rows/data size
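As a rough sketch, the query list endpoint can be polled and summarized as below. The `/v1/query` path and port come from the slide; the `state` field name in each entry and the helper names are assumptions, so check them against the JSON your Presto version returns.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_queries(coordinator, timeout=5.0):
    """GET http://<coordinator>/v1/query and decode the JSON array of recent queries."""
    with urlopen(f"http://{coordinator}/v1/query", timeout=timeout) as resp:
        return json.load(resp)


def count_by_state(queries):
    """Count queries per state; 'state' is an assumed field name."""
    return Counter(q.get("state", "UNKNOWN") for q in queries)


# Usage against a live coordinator (address is a placeholder):
#   print(count_by_state(fetch_queries("10.0.0.1:8080")))
```

Keeping the fetch and the summary separate makes the summary easy to test on saved JSON without a running cluster.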
Presto Coordinator
• Organizes query execution pipelines
– Coordinates presto workers
• Retrieves table partition and split location from connectors
– Creates distributed query plans
• Full GC
– Stalls coordinator
• When memory is insufficient
– Use memory-rich machine
– GC Tuning
• UseG1GC
presto-metrics (Ruby)
• https://github.com/xerial/presto-metrics
Query Collection in TD
• SQL query logs
– query, detailed query plan, elapsed time, processed rows, etc.
– newSetBinder(binder, EventClient.class).addBinding().to(FluentEventClient.class)
• Presto is used for analyzing the query history
Query Running Time
• More than 90% of queries finish within 2 min.
≒ expected response time for interactive queries
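The share of queries finishing within a threshold is simple to compute from collected elapsed times. The sketch below uses made-up sample values, not TD's data, and the function name is hypothetical.

```python
def fraction_within(elapsed_seconds, threshold_seconds=120.0):
    """Fraction of queries whose elapsed time is at most the threshold."""
    if not elapsed_seconds:
        return 0.0
    within = sum(1 for t in elapsed_seconds if t <= threshold_seconds)
    return within / len(elapsed_seconds)


# Made-up elapsed times in seconds, for illustration only.
sample_elapsed = [3, 10, 45, 60, 90, 110, 119, 20, 5, 300]
share = fraction_within(sample_elapsed)
```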
Detecting Anomaly
• Started Query Rate (in 5min/15min)
– If no query has started, cluster may be down (or not started properly)
• Processed rows in a query
– Sum up the number of the processed rows from all of the sub stages
– Simple, but the most reliable measure
• Send an alert
– Slack notification
– PagerDuty call
• JP/US team rotation
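Both checks above can be sketched in a few lines. The stage dictionary layout (`rows`, `subStages`) and the function names are assumptions for illustration; the real query info JSON uses its own field names, and alert delivery to Slack or PagerDuty is left out.

```python
def total_processed_rows(stage):
    """Sum processed rows over a stage and, recursively, all of its sub-stages."""
    rows = stage.get("rows", 0)  # assumed field name
    for sub in stage.get("subStages", []):  # assumed field name
        rows += total_processed_rows(sub)
    return rows


def started_query_count(start_times, now, window_seconds):
    """Number of queries started within the last window_seconds (epoch seconds)."""
    return sum(1 for t in start_times if now - window_seconds <= t <= now)


def cluster_may_be_down(start_times, now, window_seconds=300):
    """No started query in the window suggests the cluster is down or not started."""
    return started_query_count(start_times, now, window_seconds) == 0
```

The 300-second default corresponds to the 5-minute window on the slide; pass 900 for the 15-minute one.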
Benchmarking
• Query performance comparison
– between two versions of Presto
• Benchmark
– Run query set multiple times
– Store the results to TD
– Report the result with Presto
• Aggregation query
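A benchmark loop along these lines could be sketched as follows. The query runner is passed in as a callable, so nothing here assumes a particular Presto client; using the median as the summary statistic is a choice made for this sketch, not something stated in the slides.

```python
import statistics
import time


def benchmark(run_query, queries, repeats=3, clock=time.perf_counter):
    """Run each named query `repeats` times and keep the median elapsed time."""
    results = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = clock()
            run_query(sql)
            samples.append(clock() - start)
        results[name] = statistics.median(samples)
    return results


def compare(baseline, candidate):
    """Ratio candidate/baseline per query; values above 1.0 mean the candidate is slower."""
    return {
        name: candidate[name] / baseline[name]
        for name in baseline
        if name in candidate and baseline[name] > 0
    }
```

Running `benchmark` once per Presto version and feeding both result dictionaries to `compare` gives a per-query slowdown or speedup factor.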
Presto Operation Tool
• Prestop
– Our internal tool for managing multiple Presto clusters
• written in Scala
– Query monitoring
– Benchmarking
– Workload simulation
• stress testing
• Monitoring
– Datadog
– PagerDuty
– ChartIO (query stats)
Optimizing Scan Performance – Storage Manager
• Fully utilize the network bandwidth from S3
• TD Presto becomes CPU-bound
[Diagram: Storage Manager scan pipeline]
• TableScanOperators
– Inputs: S3 file list, table schema
– Pull records from MessageUnpacker
– release(Buffer)
• Request Queue
– Priority queue
– Max connections limit
• Buffer
– Buffer size limit
– Reuse allocated buffers
• HeaderReader
– Header request to S3 / Riak CS
– Callback to HeaderParser
• HeaderParser
– Parses the MPC file header
– Column block offsets
– Column names
– Issues column block requests
• ColumnBlockReader
– Prepares a buffer for each column block, then S3 read
• S3 read
– Retry GET request on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
• MessageUnpacker
– Decompression
– msgpack-java v07
– On-demand deserialization
• MPC1 file layout
– Header
– Column Block 0 (column names)
– Column Block 1 … Column Block i … Column Block m
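The retry rule for GET requests (500, 503, and 404) can be sketched as below. The `fetch` callable, the attempt limit, and the exponential backoff are assumptions for illustration; the slides name only the status codes worth retrying.

```python
import time

# Status codes the slide lists as worth retrying.
RETRYABLE_STATUS = {500, 503, 404}


def get_with_retry(fetch, key, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch(key) -> (status, body); retry retryable statuses with backoff."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(key)
        if status in (200, 206):  # 206 covers ranged reads of a column block
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            raise IOError(f"GET {key} failed with status {status} after {attempt} attempt(s)")
        # Exponential backoff is an assumption of this sketch.
        sleep(base_delay * 2 ** (attempt - 1))
```

Retrying on 404 only makes sense where an object that was just written may not be visible yet; for a key that truly does not exist, the attempt limit bounds the wasted time.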
Multi-tenancy: Resource Allocation
• Price-plan based resource allocation
• Parameters
– The number of worker nodes to use (min-candidates)
– The number of hash partitions (initial-hash-partitions)
– The maximum number of running tasks per account
• If running queries exceed the allowed number of tasks, subsequent queries have to wait (queued)
• Presto: SqlQueryExecution class
– Controls query execution state: planning -> running -> finished
• No resource allocation policy
– The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
• Rewriting SqlQueryExecutionFactory at run-time by using ASM library
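The per-account task limit with queuing can be illustrated with a small admission controller. This is a sketch only: it is not the TD implementation (which extends Presto's Java classes), and the class name, the admission rule (running tasks plus new tasks must fit within the limit), and the first-come queue order are all assumptions.

```python
from collections import deque


class AccountTaskLimiter:
    """Hypothetical per-account cap on running tasks; excess queries are queued."""

    def __init__(self, max_tasks):
        self.max_tasks = dict(max_tasks)  # account -> allowed running tasks
        self.running = {}                 # query_id -> (account, tasks)
        self.queue = deque()              # (account, query_id, tasks)

    def _used(self, account):
        return sum(t for a, t in self.running.values() if a == account)

    def submit(self, account, query_id, tasks):
        """Start the query if it fits the account's limit; otherwise queue it."""
        if self._used(account) + tasks <= self.max_tasks[account]:
            self.running[query_id] = (account, tasks)
            return True
        self.queue.append((account, query_id, tasks))
        return False

    def finish(self, query_id):
        """Release a query's tasks and start queued queries that now fit."""
        del self.running[query_id]
        started, waiting = [], deque()
        while self.queue:
            account, qid, tasks = self.queue.popleft()
            if self._used(account) + tasks <= self.max_tasks[account]:
                self.running[qid] = (account, tasks)
                started.append(qid)
            else:
                waiting.append((account, qid, tasks))
        self.queue = waiting
        return started
```

A price plan then maps to the values in `max_tasks`: a larger plan simply gets a higher limit for its account.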