Procella is a distributed SQL query engine built for flexible workloads within YouTube. It is highly scalable and designed primarily to serve high volumes of queries at low latencies while ingesting real-time data. It serves video/channel statistics for users watching videos as well as OLAP-style queries for video analytics (youtube.com/analytics) and public dashboards (artists.youtube.com). Procella also supports complex SQL operations over structured data and is used by YouTube analysts for ad-hoc analysis.
Procella runs on the Google distributed computing stack, working directly on data residing in accessible columnar formats on Colossus, the Google distributed file system. The underlying data is thus producible and directly consumable by other tools such as MapReduce and Dremel. The compute runs directly on shared machines in Borg clusters and does not need dedicated virtual (or physical) machines. These features allow Procella to fit nicely into the Google ecosystem, to scale compute and storage independently, and to gracefully handle evictions and machine failures without compromising availability or performance.
Procella has been in production for over two years and is currently serving billions of SQL queries per day across various workloads at YouTube and several other Google product areas.
Speaker
Aniket Mokashi, Google, Senior Software Engineer
2. About me
● Tech Lead on YouTube Data Team
○ Procella
○ YouTube Data Warehouse
○ YouTube Analytics Backend
○ Data Quality: Anomaly Detection
● Prior to Google
○ Data Infra teams at Twitter, Netflix
○ Apache Parquet PMC
○ Apache Pig PMC
○ Contributor: Hadoop, Hive
3. Agenda
● Products at YouTube and Google
● Use cases and features
● Architecture
● Key features deep dive
● Q&A
6. And more...
● YouTube Metrics
● Firebase Performance Console
● Google Play Console
● YouTube Internal Analytics
● ...
7. About Procella
A widely used SQL query engine at Google
● Fully featured: Most of SQL++: Structured, joins, set ops, analytic functions
● Super fast: Most queries are sub-second.
● Highly scalable: Petabytes of data, thousands of tables, trillions of rows, ...
● High QPS: Millions of point queries @ msec, thousands of reporting queries @
sub-second, hundreds of ad-hoc queries @ seconds
● Designed for real-time: Millions QPS instant ingestion, highly efficient, …
● Easy to use: SQL, DDL, Virtual tables, …
● Compatible: Capacitor, Dremel, ACLs, Encryption, …
8. External Reporting/Analytics
● Use cases
○ YouTube Analytics
○ Firebase Perf Monitoring
○ Google Play Console
○ Data Studio
● Properties
○ High QPS, low latency
ingestion and queries
○ Hundreds of metrics across
dozens of dimensions
○ SQL
● Technology:
○ On the fly aggregations
○ Indexing & partitioning
○ Real-time ingestion
○ Batch & Real-time Stitching
■ Lambda architecture
○ Caching
9. Internal Reporting/Dashboards
● Use cases
○ Dashboards
○ Experiments Analysis
○ Custom reporting
● Properties
○ Low qps
○ Speed & scale
○ SQL
● Technology
○ Stateful caching
○ Columnar optimized data
format (Artus)
○ Vector evaluation
○ Virtual tables
○ Tools (Dremel) and format
(Capacitor) compatibility
10. Real Time Insights/Monitoring
● Use cases
○ YTA Real Time (External)
○ YTA Insights (External)
○ YT Site Health (Internal)
● Properties
○ Native time-series support
○ Complex compaction, complex
metrics
○ High scan rate
○ Very high real time ingestion rate
● Technology
○ TTL
○ SQL based compaction
○ Approx algorithms
(Quantiles, Top, ..)
11. Real Time Stats Serving
● Use cases
○ YouTube metrics
(subscriptions, likes, etc.) on
various YT pages
● Properties
○ Millions of rows ingested per
sec
○ Millions of queries per sec
○ Milliseconds response time
○ Simple point queries
○ Many replicas
● Technology
○ Indexable columnar format
(Artus)
○ Heavily optimized serving
path
○ Pubsub based ingestion
15. Procella Scale
Property Value
Queries executed 450+ million per day
Queries on real-time data 200+ million per day
Leaf plans executed 90+ billion per day
Rows scanned 20+ quadrillion per day
Rows returned 100+ billion per day
Peak scan rate per query 100+ billion rows per second
Schema size 200+ metrics and dimensions
Tables 100+
16. Procella Latencies
Property p50 p99 p99.9
E2E Latency 25 ms 450 ms 3000 ms
Rows Scanned 17 M 270 M 1000 M
MDS Latency 6 ms 120 ms 250 ms
DS Latency 1 ms 75 ms 220 ms
Query Size 300B 4.7KB 10KB
18. Procella Secret Sauce
● Stateful Caching
○ Cached indexes - zone-maps, dictionaries, bloom filters etc.
○ File handle, metadata caching
○ Data segments caching
○ Data segments have data server affinity (Primary DS, Secondary DS,...)
○ Cached expressions
● Tail Latency Reduction
○ Backup RPCs - at about 90% of total RPCs.
○ Minibatch backup RPCs - one per n RPCs as a backup for slowest of
those n RPCs.
19. Procella Secret Sauce
● Indexing
○ Separate metadata server optimized for compact storage of zone maps
○ Dynamic range partitioning
○ Use of dictionaries and bloom filters
○ Posting lists for repeated data
● Distributed Execution
○ N tier execution, X000 parallel
● Vectorized Evaluation: Superluminal
○ Performs block eval
○ Natively understands RLE and dictionary
○ Columnar evaluation
20. Procella Secret Sauce
● Artus File Format
○ Multi-pass adaptive encodings to select the best encoding for a column
○ Uses adaptive encodings vs generalized compression
○ O(log n) lookups, O(1) seeks
○ Length based representation for repeated data types (arrays)
○ Exposes dictionary, RLE information to execution engine
○ Allows for rich metadata, inverted index
21. Procella Secret Sauce
● Real-time Ingestion
○ Data ingested into memory buffers
○ Periodically checkpointed to files
○ Small files are periodically aggregated and compacted into large files
● Virtual Tables
○ Index aware aggregate selection
○ Stitched Queries
○ Lambda architecture awareness
○ Normalization friendly (dimensional joins)
22. TPC-H queries*
* Run of TPC-H queries on:
● Static Artus data
● Large shared instance
● Manually optimized queries
Geomean: 9.6 Seconds
Hi everyone, I’m Aniket Mokashi, a tech lead at Google. Let me give you a little background about myself. I work on YouTube’s Data team. At YouTube, I primarily contribute to Procella, which is what this talk is about. In addition, I work on the YouTube Analytics backend and on a variety of projects under the YouTube Data Warehouse. Prior to Google, I worked on data infra teams at Twitter and Netflix. I’m an Apache Parquet PMC member; I’ve contributed a few encodings and the Pig integration to Parquet, and I was responsible for rolling out the first production use of Parquet at Twitter. I’m also an Apache Pig PMC member; my main contributions have been support for UDFs in scripting languages, native MapReduce, scalar values, auto local mode, and jar caching. I enjoy working on large-scale data processing systems, and in this talk I will cover Procella, a versatile analytics engine we’ve built at YouTube to serve most of our analytics needs.
We will cover a number of topics in this talk. First, we will look at products at YouTube and Google that are powered by Procella. Next, we will explore a variety of use cases enabled by Procella and discuss the features required to support them. After that, I will give you an overview of Procella’s architecture. Lastly, we will spend some time on a deep dive into some of the key features that make Procella work so well. In the interest of time, I will be taking questions at the end.
Let’s look at some products that are powered by Procella. But, before we get started, let me ask you all a few questions:
So, by a raise of hands,
How many of you use YouTube regularly, say at least once a week… Perfect, that makes all of you.
How many of you are YouTube creators, that is, you have uploaded at least one video to YouTube? Those of you who haven’t done that yet, I highly encourage you to try it out. Especially to get rich analytics... :)
How many of you have used YouTube Analytics for tracking your videos, or at least seen YouTube Analytics before? Excellent.
So, YouTube Analytics is this amazingly rich analytics dashboard that lets YouTube content creators and content owners analyze the usage and popularity of their videos and channels on the YouTube platform. It lets them track all of their success metrics, such as number of views, amount of watch time, and revenue from monetized videos, across various dimensions such as demographics, geo, devices, playback location, etc. We also have a mobile application, which you can see on the right, called Creator Studio, that has similar details. Some of this information is also available in real time.
Three years ago, we started building Procella primarily to serve as a backend for YouTube Analytics, which is often referred to as YTA. We wanted to make sure that the backend of YTA would continue to scale horizontally as YouTube grows and as we add more features to the product. We had a few design goals in mind when we started working on Procella. First, we wanted to provide a widely understood SQL interface so that navigating through this data would be easy and integrating various frontends with the backend would be less cumbersome. Second, we wanted to provide the flexibility to extend the product without a lot of involvement from the backend team, so we set out to build as generic a product as possible. We launched Procella as the backend for YouTube Analytics about two and a half years ago. Since then, we have been able to make several changes to the product seamlessly. For example, recently we started showing impressions and impressions CTR in YTA, and it required almost no involvement of the Procella team.
In addition to YouTube Analytics, over the last few years Procella has been used for several other internal and external facing analytics dashboards and products. One example is the YouTube Artists product that you can see on the slides. It lets everyone track the popularity of artists and their music videos in a region on the YouTube platform. If you haven’t already, I recommend checking it out.
There are a few other fairly large products at Google that are powered by Procella.
For example:
Firebase Performance Console lets app developers track various growth and success metrics for apps that are integrated with the Firebase platform.
Google Play Console lets Android application developers track the performance and popularity of their apps.
Also, the various metric numbers like subscriptions, likes, and dislikes that you see on the youtube.com website or app are now powered by Procella. Given the popularity of YouTube, as you can imagine, this was a significant achievement in terms of scale. Later in the talk, I will describe the system properties that allowed us to achieve it.
So, now that you have seen these interesting products that use Procella, let’s look at Procella itself.
What is Procella?
Procella is a widely used SQL query engine at Google.
It supports most of the standard SQL functionality including joins, set operations and analytic functions. It also supports queries on top of complex or structured data types like arrays, structs and maps.
What makes it unique is that it’s built to be super fast at scale - so most of the queries running on Procella have sub-second latencies.
It’s highly scalable - works on petabytes of data, on thousands of tables storing trillions of rows.
It can handle a high query rate: millions of point-lookup queries per second at millisecond latencies, thousands of dashboard queries per second at sub-second latencies, and hundreds of ad-hoc queries running in seconds.
It is also designed to serve data ingested in real time, supporting ingestion of millions of rows per second.
It’s easy to use with SQL interface that supports DDL statements to create tables and partitions.
It also supports virtual tables, which are used to hide the complexity of the data model required to serve a use case.
Procella was developed to be compatible with existing tools at Google so that it can be adopted easily. It exposes the query interface of Dremel, a popular query system at Google, and it can process files in Capacitor, Dremel’s native file format.
Procella is a composable and versatile system. It powers a variety of use cases ranging from external facing metrics reporting applications to data pipelines. Let’s go through these use cases one by one to understand the motivations behind the architecture and various features in Procella.
External reporting applications are public-facing products like YouTube Analytics, Firebase Performance Console, and Google Play Console, which I showed before. Being public facing, these applications have high QPS and blink-of-an-eye latency requirements. These products typically surface over a hundred metrics sliced and diced by a few dozen dimensions for a given entity or customer.
Having a SQL interface instead of an API interface makes it easy to build the backends and frontends independently.
These applications usually work on large datasets. Precomputing all the metric values needed to power these dashboards is expensive or sometimes even infeasible, so the ability to perform fast aggregations of these metrics on the fly is important.
To enable these applications, Procella provides efficient indexing and tablet-pruning functionality, as most of these queries need to process a small number of rows out of a very large amount of data. These applications also need the ability to ingest real-time data to enable fast insights, and in most cases the ability to stitch between real-time and batch datasets using a lambda architecture is crucial. These applications can leverage temporal locality of data, so caching at different layers helps performance.
Another use case is internal reporting and dashboards. Some examples of internal reporting are product dashboards, experiment analysis dashboards, and custom reports.
Internal reporting has much lower QPS and relatively less aggressive latency requirements compared to external reporting. However, internal dashboards typically work on much larger and more complex datasets, so data scan efficiency and the ability to handle complex data types are important for serving their queries. To enable these use cases, we have developed features such as a fast, optimized data format and a vectorized evaluation library.
Real-time time-series analysis and monitoring is another use case. YTA provides insights in real time. In addition, we also track YouTube site health using Procella. These applications require native support for a time dimension, especially to filter on different time boundaries like the last 60 minutes or the last 2 days.
As I mentioned previously, using Procella to power metrics on various pages of the YouTube platform is one of the most exciting use cases. YouTube gets millions of activity updates per second, so this requires reliable real-time ingestion of millions of rows per second. This is mainly done through Pub/Sub, a persistent message queue. Updates are replicated to multiple ingestion servers for reliability. On the serving side, this use case requires the ability to serve millions of simple point-lookup queries per second. These queries take only a few milliseconds to run on the ingested data. To make this possible, we have heavily optimized the query serving path with additional features that look up and scan the required data efficiently.
Ad-hoc analytics essentially covers interactive querying at low QPS. Data scientists and analysts use tables stored in Procella to identify trends, growth factors, and other important custom business metrics. This typically requires querying complex data types like arrays and nested structs, and the ability to join arbitrary datasets.
We support a number of join strategies required for these queries to work. In particular, we support broadcast joins, which broadcast the right side of the join to all the leaf nodes so that they can construct local hash maps for lookups during the join. We support remote lookup joins that leverage our lookup-friendly data format. We support pipelined joins that run in stages: they first compute a small amount of join data from the right side so that it can be used as a filter on the left side in subsequent stages. Shuffled joins are essentially reduce-side joins that can scale to arbitrarily large datasets. Lastly, co-partitioned joins leverage the partitioning of the left and right sides to efficiently merge them during the join.
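To make the broadcast strategy concrete, here is a minimal sketch of a broadcast hash join. This illustrates the general technique, not Procella's implementation; the row layout and function name are invented for the example.

```python
from collections import defaultdict

def broadcast_hash_join(left_rows, right_rows, key):
    """Sketch of a broadcast join: the (small) right side is replicated to
    every leaf, which builds a local hash map and probes it with its share
    of the left rows."""
    # Build phase: hash the broadcast (right) side by the join key.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    # Probe phase: each left row looks up matching right rows locally.
    out = []
    for lrow in left_rows:
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)
            merged.update({k: v for k, v in rrow.items() if k != key})
            out.append(merged)
    return out
```

The right side must be small enough to replicate to every leaf; the other strategies above (shuffled, pipelined, co-partitioned) exist precisely for when it is not.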
Supporting SQL-based data pipelines requires the ability to handle large amounts of data and process them using complex business logic. These are typically multi-stage queries that join various data sources, and these joins and aggregations require large shuffles. For logic that cannot be expressed in SQL, UDF support is necessary.
To enable this use case, we have a pluggable shuffle component and leverage the efficient RDMA-based shuffle service available across Google. We have also implemented an adaptive cost-based optimizer that can optimize parts of these queries on the fly.
All of the categories of use cases I mentioned so far are powered by Procella in production at YouTube. Procella’s uniquely scalable, composable architecture makes that possible. Let’s dive into the architecture now. This diagram roughly shows the architecture of Procella. All rectangular boxes in the diagram are essentially independent services in the Procella query engine.
Towards your left are the metadata server, the metadata store, and the registration server. Users register their tables with Procella using DDL commands like CREATE/ALTER TABLE, or programmatically using RPCs; this defines the schemas of the tables. Users then set up upstream batch processes to periodically create partitions of datasets. These datasets are stored on Google’s distributed file system, Colossus. After datasets are generated, users register them with Procella using ALTER TABLE commands or programmatically with RPCs. The datasets are typically stored in columnar file formats such as Capacitor or Artus. Each dataset consists of many blocks of rows called tablets; a tablet typically has tens of thousands to millions of rows. Data is generally range or hash partitioned to achieve clustering by the columns that are frequently filtered on. During data registration, the registration server extracts metadata about all tablets, such as stats, zone maps, dictionaries, and bloom filters, and stores it in the metadata store in a highly compact and organized way.
In the query path, shown on your right with yellow arrows, dashboard clients or human users send their SQL queries as strings. These are compiled into distributed multi-level query plans to be executed on the data servers. The root server is responsible for compiling the query and coordinating its execution. It also runs the last stage of query execution before returning results to the users.
The metadata server is primarily responsible for pruning tablets based on the partition metadata stored in the metadata store. To make this efficient, metadata is cached in the metadata servers in an LRU cache. We use zone maps, dictionaries, and bloom filters to prune down to the tablets required for the query.
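As a rough picture of zone-map pruning, consider this sketch. The zone-map layout and names are invented for illustration; Procella's actual metadata format is far more compact.

```python
def prune_tablets(zone_maps, column, lo, hi):
    """Keep only tablets whose [min, max] zone map for `column` can overlap
    the query's filter range [lo, hi]; everything else is skipped without
    ever touching the underlying data."""
    kept = []
    for tablet_id, stats in zone_maps.items():
        tmin, tmax = stats[column]
        if tmax >= lo and tmin <= hi:   # the two ranges overlap
            kept.append(tablet_id)
    return kept
```

A query filtered to a narrow date range thus reads only the few tablets whose zone maps intersect it, which is what makes "small number of rows out of a very large amount of data" cheap.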
Once the tablets to be processed are identified, the distributed query plan is executed on the data servers for those tablets. Data servers load the columns required for query execution on demand and cache them in large local caches at every data server. Data shuffles between stages of the query are handled using the remote RDMA shuffle service.
In real-time ingestion, shown with blue arrows, data producers send rows of data directly via RPC or through a persistent Pub/Sub queue. Based on the pre-configured partitioning on selected columns, each row is ingested into two ingestion servers, a primary and a secondary. This data is held in lookup-efficient memory buffers and is periodically checkpointed to Colossus. The checkpointed files are eventually compacted or aggregated into larger files for query efficiency. Data ingested into Procella is servable from the moment it enters the system, so queries on real-time tables are served from the buffers, the small files, and the large compacted files. The lifecycle of these buffers, small files, and large files is coordinated with transactions on the metadata store via the registration server.
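The buffer-and-checkpoint lifecycle can be pictured with a toy model like the following; the class, the size threshold, and the in-memory "files" are invented stand-ins for the real ingestion servers and Colossus files.

```python
class IngestionServer:
    """Toy model of the write path: rows land in an in-memory buffer that
    is queryable immediately, and the buffer is checkpointed to a
    (simulated) file once it reaches a size threshold."""
    def __init__(self, checkpoint_every=3):
        self.buffer = []
        self.files = []                 # stands in for checkpointed files
        self.checkpoint_every = checkpoint_every

    def ingest(self, row):
        self.buffer.append(row)         # servable from this point on
        if len(self.buffer) >= self.checkpoint_every:
            self.files.append(list(self.buffer))   # checkpoint
            self.buffer = []

    def scan(self):
        # A query sees checkpointed files plus the live buffer.
        return [r for f in self.files for r in f] + list(self.buffer)
```

In the real system, checkpointing is also time-driven, small files get compacted into larger ones, and the hand-off between buffers and files is transactional via the metadata store.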
Now that we have looked at the architecture. Let’s look at some scale and latency numbers.
These are numbers from one of our instances serving analytics on YouTube. On this instance, we execute about 450+ million queries per day, among which 200+ million are queries on real-time datasets. This results in over 90 billion leaf plans running on the data servers. In total, this scans over 20 quadrillion rows per day and returns over 100 billion rows. We can achieve a peak scan rate of 100+ billion rows per second. This instance has over 100 tables with more than 200 metrics and dimensions.
Let’s look at the latency numbers for our analytics instance. As you can see, our queries have just 25 ms of end-to-end latency at the median. Our slowest queries take about 3 seconds, likely because of query size and cache misses. At the median, a query scans about 17M rows.
On range-partitioned data, Procella’s metadata server can prune a large number of the tablets to be scanned. This is done by using efficient data structures to store tablet boundaries in memory and then binary searching through them to identify the tablets that satisfy the given filter clause.
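A minimal version of that binary search, assuming each tablet is described by a sorted list of lower-bound keys (an invented layout for illustration):

```python
import bisect

def tablets_for_range(boundaries, lo, hi):
    """Tablet i covers keys in [boundaries[i], boundaries[i+1]); the last
    tablet is open-ended.  Two binary searches over the sorted boundary
    list find the tablets a filter [lo, hi] can touch, without examining
    every tablet's metadata."""
    first = max(bisect.bisect_right(boundaries, lo) - 1, 0)
    last = min(bisect.bisect_right(boundaries, hi) - 1, len(boundaries) - 1)
    return list(range(first, last + 1))
```

With thousands of tablets per table, two O(log n) searches replace a linear scan of per-tablet metadata for every query.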
Now, let’s deep dive into some of the features that make Procella work so well. These few features are essentially the secret sauce that makes Procella so fast and efficient.
Stateful caching:
For fast and efficient processing, we cache at various layers in Procella.
On the metadata server, constraints or zone maps and other indexes are cached in an LRU cache to avoid going to the metadata store. These are stored in a versioned cache so that cache invalidation is easy.
As I mentioned before, data processed by Procella is stored remotely on Colossus, the distributed file system. To allow efficient fetching of data, we cache file handles, which essentially hold the location of the remote Colossus server and metadata such as the number, size, and location of chunks. We also cache column-level metadata, including the sizes of various columns and their offsets in the file. In addition, in a separate cache, we store blocks of columns required during query processing; for this cache we use a smart age- and cost-aware caching policy. Data segments have data server affinity to achieve a high cache hit rate: any data segment is loaded and processed by only two data servers, a primary and a secondary for that segment. A request to process a data segment first goes to the primary, and if the primary does not respond fast enough, the secondary handles the request. In addition, we support expression caching, which allows users to cache expensive expressions in the query.
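Procella's actual assignment scheme isn't spelled out here, but rendezvous (highest-random-weight) hashing is one standard way to get this kind of stable per-segment primary/secondary affinity; the following is a sketch under that assumption.

```python
import hashlib

def affinity(segment_id, servers, replicas=2):
    """Rendezvous hashing: score every server by a hash of
    (segment, server) and take the top `replicas` as primary and
    secondary.  The assignment is deterministic per segment, so repeated
    requests for a segment hit the same servers' caches."""
    def score(server):
        digest = hashlib.md5(f"{segment_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    ranked = sorted(servers, key=score, reverse=True)
    return ranked[:replicas]
```

A nice property of this family of schemes is that adding or removing one server only reassigns the segments that ranked that server first or second, so most caches stay warm.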
Now, let’s talk about tail latency reduction.
To reduce tail latency, we use speculative execution for data server RPCs. During the execution of any query, after about 90% of its RPCs complete, we send backup RPCs to the secondary data servers for the remaining 10%. We also send minibatch backup RPCs at regular intervals during query execution: one backup RPC is sent for the slowest of every set of n RPCs.
Indexing is one of the things that makes Procella architecturally different from other query engines. Procella uses a separate metadata server to store indexing information and does a separate, efficient processing pass over it before query execution to prune and select tablets. Procella uses dynamic range partitioning, for both batch and real-time tables, to achieve a uniform distribution of data across tablets. Dictionary pruning is helpful when there is a relatively small number of values compared to the number of rows, and bloom filters help significantly for needle-in-the-haystack queries, for example getting metrics for less popular videos, like views for the videos on my channel over the last 28 days.
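For intuition, here is a minimal Bloom filter of the kind that makes those needle-in-the-haystack lookups cheap; the sizes and hash scheme are illustrative, not what Procella actually stores.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array.  A miss
    proves a value is absent, so whole tablets can be skipped for rare
    keys; a hit may be a false positive and still requires a real scan."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                       # big int as a bit array

    def _probes(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for p in self._probes(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all(self.bits & (1 << p) for p in self._probes(value))
```

With one small filter per tablet, a query for an unpopular video id can rule out nearly every tablet without reading any column data.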
Our query execution is fully distributed and massively parallel, so we can leverage the power of all the data servers during the execution of a query.
We also have a vectorized evaluation engine called Superluminal that works on blocks of values and leverages modern CPU instructions during evaluation. Superluminal is a columnar evaluation library that natively understands and leverages RLE and dictionary encodings to make group-by and filtering operations much cheaper to execute.
Procella supports querying data stored in two columnar file formats: Capacitor and Artus. Capacitor is Dremel’s native file format, widely used at Google. It supports RLE- and dictionary-encoded columns and can store bloom filters for cheaply checking whether a value exists in a column.
We have also developed another file format, Artus, which in addition to the features of Capacitor provides efficient lookup and seek APIs on columns, making Procella’s evaluation significantly more efficient. In Artus, we use multi-pass adaptive encoding to select the best encoding for a column. Artus gets its space savings from these adaptive encodings instead of the generalized compression used by other file formats, which avoids the cost of fully decompressing columns. It allows lookups using binary search and O(1) seek APIs to move between rows. It also simplifies the handling of complex data types by using a length-based representation for arrays instead of repetition and definition levels. Artus exposes dictionary and RLE information to the execution engine so that operations like filters and group-bys can be made more efficient.
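Multi-pass encoding selection can be pictured like this; the candidate encodings and byte-cost models below are crude, invented stand-ins for Artus's real cost estimates.

```python
def choose_encoding(column):
    """Sketch of adaptive encoding selection: take a pass over the column
    to gather stats, estimate the encoded size under a few candidate
    encodings, and keep the cheapest."""
    n = len(column)
    distinct = set(column)
    runs = 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)
    candidates = {
        "plain": n * 8,                        # fixed 8 bytes per value
        "dictionary": len(distinct) * 8 + n,   # dict entries + 1-byte codes
        "rle": runs * 9,                       # (value, run-length) pairs
    }
    return min(candidates, key=candidates.get)
```

Deciding per column, from the data itself, is what lets a format keep columns directly seekable while still getting most of the size benefit of general-purpose compression.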
We’ve already talked about real-time ingestion, but one more thing to add: once the data is persisted to files, queries on those files are served by the data servers. So ingestion servers only manage the memory buffers and the queries on them.
Lastly, Procella supports virtual tables. These are used for several purposes.
One: to simplify development and allow flexibility, they enable efficient aggregate selection. Users can write queries that select metrics and dimensions from the virtual table without knowing which physical aggregates contain them, and Procella will rewrite the query to use the most efficient aggregate that can correctly serve it. This also allows backend teams to add necessary aggregates and change backend data models independently, without modifying any incoming queries.
Two: virtual tables provide automatic stitching between batch and real-time tables. For pipelines using a lambda architecture, as the batch-processed data arrives, queries automatically add the necessary filters on real-time data to show a stitched view of the datasets.
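The stitch itself can be pictured as a watermark-based split between the two tables. This is a simplified sketch; in the real system the equivalent filters are injected by the query rewrite, not applied by hand.

```python
def stitched_scan(batch_rows, realtime_rows, batch_watermark):
    """Lambda-architecture stitch: serve everything up to the batch
    watermark from the (exact) batch table and only newer rows from the
    realtime table, so overlapping data is never double counted."""
    return ([r for r in batch_rows if r["ts"] <= batch_watermark] +
            [r for r in realtime_rows if r["ts"] > batch_watermark])
```

As new batch partitions land, the watermark advances and approximate real-time rows are silently replaced by their exact batch counterparts.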
And lastly, virtual tables are normalization friendly. They present a denormalized view of the tables to users and add the necessary joins during the rewrite to perform denormalization on the fly.
I haven’t covered all the features here but these key features are primarily responsible for making Procella a success.
Here are some results of running the TPC-H benchmark queries on Procella. As you can see, using the techniques discussed above, we achieve some of the best results for this benchmark compared to other query engines. Note that these are slightly older numbers, from running the queries on a static Artus TPC-H dataset, using a large shared instance, with manually rewritten queries to add join hints, etc.
With that, let me conclude by summarizing what we covered. In this talk, we learned about Procella, a composable, scalable, and versatile query engine we have built at YouTube. We learned about its architecture and what makes it work so well. If this is something that excites you, feel free to come talk to me later to learn about opportunities. I can take questions now. Before that, I would like to mention that I will not be sharing any numbers beyond the ones on the slides, and I will not comment on comparisons of Procella with other existing Google products.