Procella is a distributed SQL query engine built for flexible workloads within YouTube. It is highly scalable and designed primarily to serve high volumes of queries at low latencies while ingesting real-time data. It serves video/channel statistics for users watching videos as well as OLAP-style queries for video analytics (youtube.com/analytics) and public dashboards (artists.youtube.com). Procella also supports complex SQL operations over structured data and is used by YouTube analysts for ad-hoc analysis.
Procella runs on the Google distributed computing stack, working directly on data residing in accessible columnar formats on Colossus, the Google distributed file system. The underlying data is thus producible and directly consumable by other tools such as MapReduce and Dremel. The compute runs directly on shared machines in Borg clusters and does not need dedicated virtual (or physical) machines. These features allow Procella to fit nicely into the Google ecosystem, to scale compute and storage independently, and to gracefully handle evictions and machine failures without compromising availability or performance.
Procella has been in production for over two years and is currently serving billions of SQL queries per day across various workloads at YouTube and several other Google product areas.
Speaker
Aniket Mokashi, Google, Senior Software Engineer
2. About me
● Tech Lead on YouTube Data Team
○ Procella
○ YouTube Data Warehouse
○ YouTube Analytics Backend
○ Data Quality: Anomaly Detection
● Prior to Google
○ Data Infra teams at Twitter, Netflix
○ Apache Parquet PMC
○ Apache Pig PMC
○ Contributor: Hadoop, Hive
3. Agenda
● Products at YouTube and Google
● Use cases and features
● Architecture
● Key features deep dive
● Q&A
6. And more...
● YouTube Metrics
● Firebase Performance Console
● Google Play Console
● YouTube Internal Analytics
● ...
7. About Procella
A widely used SQL query engine at Google
● Fully featured: Most of SQL++: Structured, joins, set ops, analytic functions
● Super fast: Most queries are sub-second.
● Highly scalable: Petabytes of data, thousands of tables, trillions of rows, ...
● High QPS: Millions of point queries @ msec, thousands of reporting queries @
sub-second, hundreds of ad-hoc queries @ seconds
● Designed for real-time: Millions QPS instant ingestion, highly efficient, …
● Easy to use: SQL, DDL, Virtual tables, …
● Compatible: Capacitor, Dremel, ACLs, Encryption, …
8. External Reporting/Analytics
● Use cases
○ YouTube Analytics
○ Firebase Perf Monitoring
○ Google Play Console
○ Data Studio
● Properties
○ High QPS, low latency
ingestion and queries
○ Hundreds of metrics across
dozens of dimensions
○ SQL
● Technology:
○ On the fly aggregations
○ Indexing & partitioning
○ Real-time ingestion
○ Batch & Real-time Stitching
■ Lambda architecture
○ Caching
9. Internal Reporting/Dashboards
● Use cases
○ Dashboards
○ Experiments Analysis
○ Custom reporting
● Properties
○ Low qps
○ Speed & scale
○ SQL
● Technology
○ Stateful caching
○ Columnar optimized data
format (Artus)
○ Vector evaluation
○ Virtual tables
○ Tools (Dremel) and format
(Capacitor) compatibility
10. Real Time Insights/Monitoring
● Use cases
○ YTA Real Time (External)
○ YTA Insights (External)
○ YT Site Health (Internal)
● Properties
○ Native time-series support
○ Complex compaction, complex
metrics
○ High scan rate
○ Very high real time ingestion rate
● Technology
○ TTL
○ SQL based compaction
○ Approx algorithms
(Quantiles, Top, ..)
11. Real Time Stats Serving
● Use cases
○ YouTube metrics
(subscriptions, likes, etc.) on
various YT pages
● Properties
○ Millions of rows ingested per
sec
○ Millions of queries per sec
○ Milliseconds response time
○ Simple point queries
○ Many replicas
● Technology
○ Indexable columnar format
(Artus)
○ Heavily optimized serving
path
○ Pubsub based ingestion
15. Procella Scale
Property Value
Queries executed 450+ million per day
Queries on real-time data 200+ million per day
Leaf plans executed 90+ billion per day
Rows scanned 20+ quadrillion per day
Rows returned 100+ billion per day
Peak scan rate per query 100+ billion rows per second
Schema size 200+ metrics and dimensions
Tables 100+
16. Procella Latencies
Property p50 p99 p99.9
E2E Latency 25 ms 450 ms 3000 ms
Rows Scanned 17 M 270 M 1000 M
MDS Latency 6 ms 120 ms 250 ms
DS Latency 1 ms 75 ms 220 ms
Query Size 300B 4.7KB 10KB
18. Procella Secret Sauce
● Stateful Caching
○ Cached indexes - zone-maps, dictionaries, bloom filters etc.
○ File handle, metadata caching
○ Data segments caching
○ Data segments have data server affinity (Primary DS, Secondary DS,...)
○ Cached expressions
● Tail Latency Reduction
○ Backup RPCs - at about 90% of total RPCs.
○ Minibatch backup RPCs - one per n RPCs as a backup for slowest of
those n RPCs.
19. Procella Secret Sauce
● Indexing
○ Separate metadata server optimized for compact storage of zone maps
○ Dynamic range partitioning
○ Use of dictionaries and bloom filters
○ Posting lists for repeated data
● Distributed Execution
○ N tier execution, X000 parallel
● Vectorized Evaluation: Superluminal
○ Performs block eval
○ Natively understands RLE and dictionary
○ Columnar evaluation
20. Procella Secret Sauce
● Artus File Format
○ Multi-pass adaptive encodings to select the best encoding for a column
○ Uses adaptive encodings vs generalized compression
○ O(log n) lookups, O(1) seeks
○ Length based representation for repeated data types (arrays)
○ Exposes dictionary, RLE information to execution engine
○ Allows for rich metadata, inverted index
21. Procella Secret Sauce
● Real-time Ingestion
○ Data ingested into memory buffers
○ Periodically checkpointed to files
○ Small files are periodically aggregated and compacted into large files
● Virtual Tables
○ Index aware aggregate selection
○ Stitched Queries
○ Lambda architecture awareness
○ Normalization friendly (dimensional joins)
22. TPC-H queries*
* Run of TPC-H queries on:
● Static Artus data
● Large shared instance
● Manually optimized queries
Geomean: 9.6 Seconds
Hi everyone, I’m Aniket Mokashi, a tech lead at Google. Let me give you a little background about myself. I work on YouTube’s Data team. At YouTube, I primarily contribute to Procella, which is what this talk is about. In addition, I work on the YouTube Analytics backend and on a variety of projects under the YouTube Data Warehouse. Prior to Google, I worked on data infra teams at Twitter and Netflix. I’m an Apache Parquet PMC member; I’ve contributed a few encodings and the Pig integration to Parquet, and I was responsible for rolling out the first production use of Parquet at Twitter. I’m also an Apache Pig PMC member; my main contributions have been support for UDFs in scripting languages, native MapReduce, scalar values, auto local mode, and jar caching. I enjoy working on large-scale data processing systems, and in this talk I will cover Procella, a versatile analytics engine we’ve built at YouTube to serve most of our analytics needs.
We will cover a number of topics in this talk. First, we will look at products at YouTube and Google that are powered by Procella. Next, we will explore a variety of use cases enabled by Procella and discuss the features required to support them. After that, I will give you an overview of Procella’s architecture. Lastly, we will spend some time on a deep dive into some of the key features that make Procella work so well. In the interest of time, I will be taking questions at the end.
Let’s look at some products that are powered by Procella. But, before we get started, let me ask you all a few questions:
So, by a raise of hands,
How many of you use YouTube regularly, say at least once a week… Perfect, that makes all of you.
How many of you are YouTube creators, that is, you have uploaded at least one video to YouTube? Those of you who haven’t done that yet, I highly encourage you to try it out. Especially to get rich analytics... :)
How many of you have used YouTube Analytics for tracking your videos, or at least seen YouTube Analytics before? Excellent.
So, YouTube Analytics is this amazingly rich analytics dashboard that lets YouTube content creators and content owners analyze the usage and popularity of their videos and channels on the YouTube platform. It lets them track all of their success metrics, such as number of views, amount of watch time, and revenue from monetized videos, across various dimensions such as demographics, geo, devices, playback location, etc. We also have a mobile application, which you can see on the right, called Creator Studio, that has similar details. Some of this information is also available in real time.
Three years ago, we started building Procella primarily to serve as a backend for YouTube Analytics, which is often referred to as YTA. We wanted to make sure that the backend of YTA would continue to scale horizontally as YouTube grows and as we add more features to the product. We had a few design goals in mind when we started working on Procella. First, we wanted to provide a widely understood SQL interface so that navigating through this data would be easy and integrating various frontends with the backend would be less cumbersome. Second, we wanted to provide the flexibility to extend the product without a lot of involvement from the backend team, so we set out to build as generic a product as possible. We launched Procella as the backend for YouTube Analytics about two and a half years ago. Since then, we have been able to make several changes to the product seamlessly. For example, recently we started showing impressions and impressions CTR in YTA, and it required almost no involvement of the Procella team.
In addition to YouTube Analytics, over the last few years Procella has been used for several other internal and external facing analytics dashboards and products. One example is the YouTube Artists product that you can see on the slides. It lets everyone track the popularity of artists and their music videos in a region on the YouTube platform. If you haven’t already, I recommend checking it out.
There are a few other fairly large products at Google that are powered by Procella.
For example:
Firebase Performance Console lets app developers track various growth and success metrics for apps that are integrated with the Firebase platform.
Google Play Console lets Android application developers track the performance and popularity of their apps.
Also, the various metric numbers like subscriptions, likes, and dislikes that you see on the youtube.com website or app are now powered by Procella. Given the popularity of YouTube, as you can imagine, this was a significant achievement in terms of scale. Later in the talk, I will describe the system properties that allowed us to achieve it.
So, now that you have seen these interesting products that use Procella, let’s look at Procella itself.
What is Procella?
Procella is a widely used SQL query engine at Google.
It supports most of the standard SQL functionality including joins, set operations and analytic functions. It also supports queries on top of complex or structured data types like arrays, structs and maps.
What makes it unique is that it’s built to be super fast at scale - so most of the queries running on Procella have sub-second latencies.
It’s highly scalable - works on petabytes of data, on thousands of tables storing trillions of rows.
It can handle a high query rate: millions of point-lookup queries per second at millisecond latencies, thousands of dashboard queries per second at sub-second latencies, and hundreds of ad-hoc queries running in seconds.
It is also designed to serve data ingested in real time, supporting ingestion of millions of rows per second.
It’s easy to use with SQL interface that supports DDL statements to create tables and partitions.
It also supports virtual tables, which are used to hide the complexity of the data model required to serve a use case.
Procella was developed to be compatible with existing tools at Google so that it can be adopted easily. It exposes the query interface of Dremel, a popular query system at Google, and it can process files in Capacitor, Dremel’s native file format.
Procella is a composable and versatile system. It powers a variety of use cases ranging from external facing metrics reporting applications to data pipelines. Let’s go through these use cases one by one to understand the motivations behind the architecture and various features in Procella.
External reporting applications are public-facing products like YouTube Analytics, Firebase Performance Console, and Google Play Console, which I showed before. Being public facing, these applications have high QPS and blink-of-an-eye latency requirements. These products typically surface over a hundred metrics sliced and diced by a few dozen dimensions for a given entity or customer.
Having a SQL interface instead of an API interface makes it easy to build the backends and frontends independently.
These applications usually work on large datasets. Precomputing all the metric values needed to power these dashboards is expensive or sometimes even infeasible, so the ability to perform fast aggregations of these metrics on the fly is important.
To enable these applications, Procella provides efficient indexing and tablet-pruning functionality, as most of these queries need to process a small number of rows out of a very large amount of data. These applications also need the ability to ingest real-time data to enable fast insights, and in most cases the ability to stitch between real-time and batch datasets using a lambda architecture is crucial. These applications can leverage temporal locality of data, so caching at different layers helps performance.
Another use case is internal reporting and dashboards. Some examples of internal reporting are product dashboards, experiment analysis dashboards, and custom reports.
Internal reporting has much lower QPS and relatively less aggressive latency requirements compared to external reporting. However, internal dashboards typically work on much larger and more complex datasets, so data scan efficiency and the ability to handle complex data types are important for serving their queries. To enable these use cases, we have developed features such as a fast, optimized data format and a vectorized evaluation library.
Real-time time-series analysis and monitoring is another use case. YTA provides insights in real time. In addition, we also track YouTube site health using Procella. These applications require native support for a time dimension, especially to filter on different time boundaries like the last 60 minutes or the last 2 days.
As I mentioned previously, using Procella to power metrics on various pages of the YouTube platform is one of the most exciting use cases. YouTube gets millions of activity updates per second, so this requires reliable real-time ingestion of millions of rows per second. This is mainly done through Pub/Sub, a persistent message queue. Updates are replicated to multiple ingestion servers for reliability. On the serving side, this use case requires the ability to serve millions of simple point-lookup queries per second. These queries take only a few milliseconds to run on the ingested data. To make this possible, we have heavily optimized the query serving path with additional features that look up and scan the required data efficiently.
Ad-hoc analytics essentially covers interactive querying at low QPS. Data scientists and analysts use tables stored in Procella to identify trends, growth factors, and other important custom business metrics. This typically requires querying complex data types like arrays and nested structs, and the ability to join arbitrary datasets.
We support a number of join strategies required for these queries to work. In particular, we support broadcast joins, which broadcast the right side of the join to all the leaf nodes so that they can construct local hash maps for lookups during the join. We support remote lookup joins that leverage our lookup-friendly data format. We support pipelined joins that run in stages: they first compute a small amount of join data from the right side so that it can be used as a filter on the left side in subsequent stages. Shuffled joins are essentially reduce-side joins that can scale to arbitrarily large datasets. Lastly, co-partitioned joins leverage the partitioning of the left and right sides to efficiently merge them during the join.
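To make the broadcast strategy concrete, here is a minimal sketch of a broadcast hash join. This illustrates the general technique, not Procella's implementation; the row layout and function name are invented for the example.

```python
from collections import defaultdict

def broadcast_hash_join(left_rows, right_rows, key):
    """Sketch of a broadcast join: the (small) right side is replicated to
    every leaf, which builds a local hash map and probes it with its share
    of the left rows."""
    # Build phase: hash the broadcast (right) side by the join key.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    # Probe phase: each left row looks up matching right rows locally.
    out = []
    for lrow in left_rows:
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)
            merged.update({k: v for k, v in rrow.items() if k != key})
            out.append(merged)
    return out
```

The right side must be small enough to replicate to every leaf; the other strategies above (shuffled, pipelined, co-partitioned) exist precisely for when it is not.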
Supporting SQL-based data pipelines requires the ability to handle large amounts of data and process them using complex business logic. These are typically multi-stage queries that join various data sources, and these joins and aggregations require large shuffles. For logic that cannot be expressed in SQL, UDF support is necessary.
To enable this use case, we have a pluggable shuffle component and leverage the efficient RDMA-based shuffle service available across Google. We have also implemented an adaptive cost-based optimizer that can optimize parts of these queries on the fly.
All of the categories of use cases I mentioned so far are powered by Procella in production at YouTube. Procella’s uniquely scalable, composable architecture makes that possible. Let’s dive into the architecture now. This diagram roughly shows the architecture of Procella. All rectangular boxes in the diagram are essentially independent services in the Procella query engine.
Towards your left are the metadata server, the metadata store, and the registration server. Users register their tables with Procella using DDL commands like CREATE/ALTER TABLE, or programmatically using RPCs; this defines the schemas of the tables. Users then set up upstream batch processes to periodically create partitions of datasets. These datasets are stored on Google’s distributed file system, Colossus. After datasets are generated, users register them with Procella using ALTER TABLE commands or programmatically with RPCs. The datasets are typically stored in columnar file formats such as Capacitor or Artus. Each dataset consists of many blocks of rows called tablets; a tablet typically has tens of thousands to millions of rows. Data is generally range or hash partitioned to achieve clustering by the columns that are frequently filtered on. During data registration, the registration server extracts metadata about all tablets, such as stats, zone maps, dictionaries, and bloom filters, and stores it in the metadata store in a highly compact and organized way.
In the query path, shown on your right with yellow arrows, dashboard clients or human users send their SQL queries as strings. These are compiled into distributed multi-level query plans to be executed on the data servers. The root server is responsible for compiling the query and coordinating its execution. It also runs the last stage of query execution before returning results to the users.
The metadata server is primarily responsible for pruning tablets based on the partition metadata stored in the metadata store. To make this efficient, metadata is cached in the metadata servers in an LRU cache. We use zone maps, dictionaries, and bloom filters to prune down to the tablets required for the query.
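As a rough picture of zone-map pruning, consider this sketch. The zone-map layout and names are invented for illustration; Procella's actual metadata format is far more compact.

```python
def prune_tablets(zone_maps, column, lo, hi):
    """Keep only tablets whose [min, max] zone map for `column` can overlap
    the query's filter range [lo, hi]; everything else is skipped without
    ever touching the underlying data."""
    kept = []
    for tablet_id, stats in zone_maps.items():
        tmin, tmax = stats[column]
        if tmax >= lo and tmin <= hi:   # the two ranges overlap
            kept.append(tablet_id)
    return kept
```

A query filtered to a narrow date range thus reads only the few tablets whose zone maps intersect it, which is what makes "small number of rows out of a very large amount of data" cheap.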
Once the tablets to be processed are identified, the distributed query plan is executed on the data servers for those tablets. Data servers load the columns required for query execution on demand and cache them in large local caches at every data server. Data shuffles between stages of the query are handled using the remote RDMA shuffle service.
In real-time ingestion, shown with blue arrows, data producers send rows of data directly via RPC or through a persistent Pub/Sub queue. Based on the pre-configured partitioning on selected columns, each row is ingested into two ingestion servers, a primary and a secondary. This data is held in lookup-efficient memory buffers and is periodically checkpointed to Colossus. The checkpointed files are eventually compacted or aggregated into larger files for query efficiency. Data ingested into Procella is servable from the moment it enters the system, so queries on real-time tables are served from the buffers, the small files, and the large compacted files. The lifecycle of these buffers, small files, and large files is coordinated with transactions on the metadata store via the registration server.
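The buffer-and-checkpoint lifecycle can be pictured with a toy model like the following; the class, the size threshold, and the in-memory "files" are invented stand-ins for the real ingestion servers and Colossus files.

```python
class IngestionServer:
    """Toy model of the write path: rows land in an in-memory buffer that
    is queryable immediately, and the buffer is checkpointed to a
    (simulated) file once it reaches a size threshold."""
    def __init__(self, checkpoint_every=3):
        self.buffer = []
        self.files = []                 # stands in for checkpointed files
        self.checkpoint_every = checkpoint_every

    def ingest(self, row):
        self.buffer.append(row)         # servable from this point on
        if len(self.buffer) >= self.checkpoint_every:
            self.files.append(list(self.buffer))   # checkpoint
            self.buffer = []

    def scan(self):
        # A query sees checkpointed files plus the live buffer.
        return [r for f in self.files for r in f] + list(self.buffer)
```

In the real system, checkpointing is also time-driven, small files get compacted into larger ones, and the hand-off between buffers and files is transactional via the metadata store.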
Now that we have looked at the architecture. Let’s look at some scale and latency numbers.
These are numbers from one of our instances serving analytics on YouTube. On this instance, we execute about 450+ million queries per day, among which 200+ million are queries on real-time datasets. This results in over 90 billion leaf plans running on the data servers. In total, this scans over 20 quadrillion rows per day and returns over 100 billion rows. We can achieve a peak scan rate of 100+ billion rows per second. This instance has over 100 tables with more than 200 metrics and dimensions.
Let’s look at the latency numbers for our analytics instance. As you can see, our queries have just 25 ms of end-to-end latency at the median. Our slowest queries take about 3 seconds, likely because of query size and cache misses. At the median, a query scans about 17M rows.
On range-partitioned data, Procella’s metadata server can prune a large number of the tablets to be scanned. This is done by using efficient data structures to store tablet boundaries in memory and then binary searching through them to identify the tablets that satisfy the given filter clause.
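A minimal version of that binary search, assuming each tablet is described by a sorted list of lower-bound keys (an invented layout for illustration):

```python
import bisect

def tablets_for_range(boundaries, lo, hi):
    """Tablet i covers keys in [boundaries[i], boundaries[i+1]); the last
    tablet is open-ended.  Two binary searches over the sorted boundary
    list find the tablets a filter [lo, hi] can touch, without examining
    every tablet's metadata."""
    first = max(bisect.bisect_right(boundaries, lo) - 1, 0)
    last = min(bisect.bisect_right(boundaries, hi) - 1, len(boundaries) - 1)
    return list(range(first, last + 1))
```

With thousands of tablets per table, two O(log n) searches replace a linear scan of per-tablet metadata for every query.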
Now, let’s deep dive into some of the features that make Procella work so well. These few features are essentially the secret sauce that makes Procella so fast and efficient.
Stateful caching:
For fast and efficient processing, we cache at various layers in Procella.
On the metadata server, constraints or zone maps and other indexes are cached in an LRU cache to avoid going to the metadata store. These are stored in a versioned cache so that cache invalidation is easy.
As I mentioned before, data processed by Procella is stored remotely on Colossus, the distributed file system. To allow efficient fetching of data, we cache file handles, which essentially hold the location of the remote Colossus server and metadata such as the number, size, and location of chunks. We also cache column-level metadata, including the sizes of various columns and their offsets in the file. In addition, in a separate cache, we store blocks of columns required during query processing; for this cache we use a smart age- and cost-aware caching policy. Data segments have data server affinity to achieve a high cache hit rate: any data segment is loaded and processed by only two data servers, a primary and a secondary for that segment. A request to process a data segment first goes to the primary, and if the primary does not respond fast enough, the secondary handles the request. In addition, we support expression caching, which allows users to cache expensive expressions in the query.
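Procella's actual assignment scheme isn't spelled out here, but rendezvous (highest-random-weight) hashing is one standard way to get this kind of stable per-segment primary/secondary affinity; the following is a sketch under that assumption.

```python
import hashlib

def affinity(segment_id, servers, replicas=2):
    """Rendezvous hashing: score every server by a hash of
    (segment, server) and take the top `replicas` as primary and
    secondary.  The assignment is deterministic per segment, so repeated
    requests for a segment hit the same servers' caches."""
    def score(server):
        digest = hashlib.md5(f"{segment_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    ranked = sorted(servers, key=score, reverse=True)
    return ranked[:replicas]
```

A nice property of this family of schemes is that adding or removing one server only reassigns the segments that ranked that server first or second, so most caches stay warm.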
Now, let’s talk about tail latency reduction.
To reduce tail latency, we use speculative execution for data server RPCs. During the execution of any query, after about 90% of its RPCs complete, we send backup RPCs to the secondary data servers for the remaining 10%. We also send minibatch backup RPCs at regular intervals during query execution: one backup RPC is sent for the slowest of every set of n RPCs.
Indexing is one of the things that makes Procella architecturally different from other query engines. Procella uses a separate metadata server to store indexing information and does a separate, efficient processing pass over it before query execution to prune and select tablets. Procella uses dynamic range partitioning, for both batch and real-time tables, to achieve a uniform distribution of data across tablets. Dictionary pruning is helpful when there is a relatively small number of values compared to the number of rows, and bloom filters help significantly for needle-in-the-haystack queries, for example getting metrics for less popular videos, like views for the videos on my channel over the last 28 days.
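For intuition, here is a minimal Bloom filter of the kind that makes those needle-in-the-haystack lookups cheap; the sizes and hash scheme are illustrative, not what Procella actually stores.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array.  A miss
    proves a value is absent, so whole tablets can be skipped for rare
    keys; a hit may be a false positive and still requires a real scan."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                       # big int as a bit array

    def _probes(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for p in self._probes(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all(self.bits & (1 << p) for p in self._probes(value))
```

With one small filter per tablet, a query for an unpopular video id can rule out nearly every tablet without reading any column data.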
Our query execution is fully distributed and massively parallel, so we can leverage the power of all the data servers during the execution of a query.
We also have a vectorized evaluation engine called Superluminal that works on blocks of values and leverages modern CPU instructions during evaluation. Superluminal is a columnar evaluation library that natively understands and leverages RLE and dictionary encodings to make group-by and filtering operations much cheaper to execute.
Procella supports querying data stored in two columnar file formats: Capacitor and Artus. Capacitor is Dremel’s native file format, widely used at Google. It supports RLE- and dictionary-encoded columns and can store bloom filters for cheaply checking whether a value exists in a column.
We have also developed another file format, Artus, which in addition to the features of Capacitor provides efficient lookup and seek APIs on columns, making Procella’s evaluation significantly more efficient. In Artus, we use multi-pass adaptive encoding to select the best encoding for a column. Artus gets its space savings from these adaptive encodings instead of the generalized compression used by other file formats, which avoids the cost of fully decompressing columns. It allows lookups using binary search and O(1) seek APIs to move between rows. It also simplifies the handling of complex data types by using a length-based representation for arrays instead of repetition and definition levels. Artus exposes dictionary and RLE information to the execution engine so that operations like filters and group-bys can be made more efficient.
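Multi-pass encoding selection can be pictured like this; the candidate encodings and byte-cost models below are crude, invented stand-ins for Artus's real cost estimates.

```python
def choose_encoding(column):
    """Sketch of adaptive encoding selection: take a pass over the column
    to gather stats, estimate the encoded size under a few candidate
    encodings, and keep the cheapest."""
    n = len(column)
    distinct = set(column)
    runs = 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)
    candidates = {
        "plain": n * 8,                        # fixed 8 bytes per value
        "dictionary": len(distinct) * 8 + n,   # dict entries + 1-byte codes
        "rle": runs * 9,                       # (value, run-length) pairs
    }
    return min(candidates, key=candidates.get)
```

Deciding per column, from the data itself, is what lets a format keep columns directly seekable while still getting most of the size benefit of general-purpose compression.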
We’ve already talked about real-time ingestion, but one more thing to add: once the data is persisted to files, queries on those files are served by the data servers. So ingestion servers only manage the memory buffers and the queries on them.
Lastly, Procella supports virtual tables. These are used for several purposes.
One: to simplify development and allow flexibility, they enable efficient aggregate selection. Users can write queries that select metrics and dimensions from the virtual table without knowing which physical aggregates contain them, and Procella will rewrite the query to use the most efficient aggregate that can correctly serve it. This also allows backend teams to add necessary aggregates and change backend data models independently, without modifying any incoming queries.
Two: virtual tables provide automatic stitching between batch and real-time tables. For pipelines using a lambda architecture, as the batch-processed data arrives, queries automatically add the necessary filters on real-time data to show a stitched view of the datasets.
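The stitch itself can be pictured as a watermark-based split between the two tables. This is a simplified sketch; in the real system the equivalent filters are injected by the query rewrite, not applied by hand.

```python
def stitched_scan(batch_rows, realtime_rows, batch_watermark):
    """Lambda-architecture stitch: serve everything up to the batch
    watermark from the (exact) batch table and only newer rows from the
    realtime table, so overlapping data is never double counted."""
    return ([r for r in batch_rows if r["ts"] <= batch_watermark] +
            [r for r in realtime_rows if r["ts"] > batch_watermark])
```

As new batch partitions land, the watermark advances and approximate real-time rows are silently replaced by their exact batch counterparts.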
And lastly, virtual tables are normalization friendly. They present a denormalized view of the tables to users and add the necessary joins during the rewrite to perform denormalization on the fly.
I haven’t covered all the features here but these key features are primarily responsible for making Procella a success.
Here are some results of running the TPC-H benchmark queries on Procella. As you can see, using the techniques discussed above, we achieve some of the best results for this benchmark compared to other query engines. Note that these are slightly older numbers, from running the queries on a static Artus TPC-H dataset, using a large shared instance, with manually rewritten queries to add join hints, etc.
With that, let me conclude by summarizing what we covered. In this talk, we learned about Procella, a composable, scalable, and versatile query engine we have built at YouTube. We learned about its architecture and what makes it work so well. If this is something that excites you, feel free to come talk to me later to learn about opportunities. I can take questions now. Before that, I would like to mention that I will not be sharing any numbers beyond the ones on the slides, and I will not comment on comparisons of Procella with other existing Google products.