SlideShare a Scribd company logo
1 of 25
Download to read offline
A Production Quality Sketching
Library for the Analysis of Big Data
Lee Rhodes
Distinguished Architect, Verizon Media (Yahoo), Inc.
Outline
Problematic Queries of Big Data
Where traditional analysis methods don’t work well
Approximate Analysis
Using Sketches
How using stochastic processes and probabilistic
analysis wins in a systems architecture context
The Open Source Apache
DataSketches Library
A quick overview of this unique library
dedicated to production systems that
process big data
The Data Analysis Challenge …
… Analyze This Data in Near-Real Time
Time
Stamp
User
ID
Device
ID
Site Time Spent
Sec
Items
Viewed
9:00 AM U1 D1 Apps 59 5
9:30 AM U2 D2 Apps 179 15
10:00 AM U3 D3 Music 29 3
1:00 PM U1 D4 Music 89 10
Billions of Rows or Key, Value Pairs …
Example: Web Site Logs
Some Very Common Queries …
Unique Identifiers
with Set Expressions:
(AUB) ∩ (CUD)  E
Quantiles, CDFs
Histograms, PMFs
Frequent Items /
Heavy Hitters
Vector & Matrix
Operations:
SVD, etc.
Graph Analysis
Uniform
Weighted
Reservoir Sampling
5 ⋯ 2
⋮ ⋱ ⋮
4 ⋯ 3
All Non-Additive
When The Data Gets Large Or Resources Are Limited,
All Of These Queries Become Problematic,
Because the Aggregations are Non-Additive.
A
B
Result A
Result B
+ Combined
Result
Current
Result
New
Item
Updated
Result+
Col1, ..., Item, ..., Coln
… Billions of rows ...
Ω(u) size: ~ Big Data
Exact
Results
Difficult
Query
Local Item Copies
Query processing
often requires sorting
…which is very slow.
Query Engine
Traditional, Exact Analysis Methods Require Local
Copies
Note: Micro-batch “Streaming Platforms”, e.g., Storm
Do Not Solve The Fundamental Problem!
Big Data
Parallelization Does Not Help Much
Because of Non-Additivity.
You have to keep the copies somewhere!
Col1, ..., Item, ..., Coln
… Billions of rows ...
𝝨
Exact
Results
Expensive Shuffle
Copy
Copy
Copy
Copy
Copy
Copy
Example: Map-Reduce
Every dataset is processed N times for a rolling N-day window!
Traditional, Exact Time Windowing
Requires Multiple Touches of Every Item
Let’s challenge a fundamental premise:
… that our results must be exact!
If we can allow for approximation,
along with some accuracy guarantees,
we can achieve orders-of-magnitude improvement in
• speed and
• reduction of resources.
Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm)
Stream
Processor
Data
Structure
Size = f(k)
Query
Processor
Stochastic Process
Merge / Set
Operations
Results +/- ε
𝛆 = f(1/k)
Probabilistic Analysis
Result Sketch
Sketch
Stream
Data
Stream
Query
Model the Problem as a Stochastic Process
Analyze using Probability & Statistics
Random Selection Sizing, Storing
How & Why Sketches Achieve Superior Performance
For Systems Processing Massive Data
Major Sketch Properties
• Small Stored Size
• Sub-linear in Space
• Single-pass, “One-Touch”
• Data Insensitive
• Mergeable
• Approximate, Probabilistic
• Mathematically Proven Error Properties
Sub-linear
Stream Size
Linear
SketchSize
Sketching Sampling
Sketches Overlap
with Sampling
Based on the
Specific Sketch
Win #1: Small Query Space
O(k) size: ~ Kilobytes
Approximate
Answer ± ε
Difficult
Query
Minimal or
no sorting required!
Ideal for Streaming & Batch
Query Engine
Sketch
Col1, ..., Item, ..., Coln
… Billions of rows ...
Sketches Start Small
Sublinear Means they Stay Small
Single Pass Simplifies Processing
Win #2: Mergeability
Sketch Approx. ± ε
Sketch
Sketch
… , Item
… many rows ...
… , Item
… many rows ...
Query
Query
Partitions
Merge
Mergeability Enables Parallelism … With No Additional Loss of Accuracy!
Sketches Transform Non-Additive Metrics Into Additive Objects
The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches
Col1, ..., Item, ..., Coln
… Billions of rows ...
Intermediate Hyper-Cube Staging Enables Query Speed
Additivity Enables Simpler Architecture
Win 3: Near-Real Time Query Speed
Win 4: Simpler Architecture
Win #5: Simplified Time Windowing
Plus Late Data Processing
Near-Real time Results, with History!
Win #6: Lower System Cost ($)
Case Study: Real-time Flurry, Before and After
• Customers: >250K Mobile App Developers
• Data: 40-50 TB per day
• Platform: 2 clusters X 80 Nodes = 160 Nodes
– Node: 24 CPUs, 250GB RAM
Before Sketches After Sketches
VCS* / Mo. ~80B ~20B
Result Freshness
Daily: 2 to 8 hours; Weekly: ~3 days
Real-time Unique Counts Not Feasible
15 seconds!
Big Wins!
Near-Real Time
Lower System $
* VCS: Virtual Core Seconds
Introducing
The DataSketches Team
Core Team / Committers
• Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012
• Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
• Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016
• Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015
• Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015
• Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018
• Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018
Extended Team & Consultants
• Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
• Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
• Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019
… And our Community is Growing!
* VM = Verizon Media
Our Mission…
Combine Deep Science with Exceptional Engineering
To Develop Production Quality Sketches
That Address These Difficult Queries
Cardinality, 4 Families
• HLL (on/off Heap): A very high performing implementation of this well-known sketch
• CPC: The best accuracy per space
• Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap
• Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches:
Quantiles Sketches, 2 Families
• Quantiles, Histograms, PMF’s and CDF’s of streams of comparable objects, on/off Heap.
KLL, highly optimized for accuracy-space.
• Relative Error Quantiles (under development)
Frequent Items (Heavy-Hitters) Sketches, 2 Families
• Frequent Items: Weighted or Unweighted
• Frequent Directions: Approximate SVD (a Vector Sketch)
Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families
• Uniform and weighted sampling to fixed-k sized buckets
Specialty Sketches
• Customer Engagement, Frequent Distinct Tuples, Maps, etc.
The Apache DataSketches Library
Languages Supported:
• Java, C++, Python
• Binary Compatibility
Bright Future for Sketching Technology & Solutions
Items (words, IDs, events, clicks, …)
• Count Distinct
• Frequent Items, Heavy-Hitters, etc
• Quantiles, Ranks, PMFs, CDFs, Histograms
• Set Operations
• Sampling
• Mobile (IoT)
• Moment and Entropy Estimation
Graphs (Social Networks, Communications, …)
• Connectivity
• Cut Sparsification
• Weighted Matching
• …
Areas where we have sketch implementations
Areas of research (World-wide)
Vectors (text docs, images, features, …) &
Matrices (text corpora, recommendations, …)
• Dimensionality Reduction (SVD)
• Covariance Estimation
• Low Rank Approximation
• Sparsification
• Clustering (k-means, k-median, …)
• Linear Regression
• Machine Learning (in some areas)
• Density Estimation
THANK YOU!
Open Invitation for
Collaboration
Learn More About Apache DataSketches
Come And Visit Us!
https://datasketches.apache.org

More Related Content

What's hot

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0Databricks
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...Databricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerDatabricks
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be StreamedDatabricks
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Databricks
 
Ray: Enterprise-Grade, Distributed Python
Ray: Enterprise-Grade, Distributed PythonRay: Enterprise-Grade, Distributed Python
Ray: Enterprise-Grade, Distributed PythonDatabricks
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkDatabricks
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueDatabricks
 

What's hot (20)

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be Streamed
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Ray: Enterprise-Grade, Distributed Python
Ray: Enterprise-Grade, Distributed PythonRay: Enterprise-Grade, Distributed Python
Ray: Enterprise-Grade, Distributed Python
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 

Similar to A Production Quality Sketching Library for the Analysis of Big Data

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOTAWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOTAmazon Web Services
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesDavid Martínez Rego
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
Intro to graphs for HR analytics
Intro to graphs for HR analyticsIntro to graphs for HR analytics
Intro to graphs for HR analyticsRik Van Bruggen
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, whenEugenio Minardi
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design PatternsMongoDB
 

Similar to A Production Quality Sketching Library for the Analysis of Big Data (20)

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOTAWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Scalable web architecture
Scalable web architectureScalable web architecture
Scalable web architecture
 
Intro to graphs for HR analytics
Intro to graphs for HR analyticsIntro to graphs for HR analytics
Intro to graphs for HR analytics
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, when
 
Advanced Schema Design Patterns
Advanced Schema Design PatternsAdvanced Schema Design Patterns
Advanced Schema Design Patterns
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 

Recently uploaded (20)

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

A Production Quality Sketching Library for the Analysis of Big Data

  • 1.
  • 2. A Production Quality Sketching Library for the Analysis of Big Data Lee Rhodes Distinguished Architect, Verizon Media (Yahoo), Inc.
  • 3. Outline Problematic Queries of Big Data Where traditional analysis methods don’t work well Approximate Analysis Using Sketches How using stochastic processes and probabilistic analysis wins in a systems architecture context The Open Source Apache DataSketches Library A quick overview of this unique library dedicated to production systems that process big data
  • 4. The Data Analysis Challenge … … Analyze This Data in Near-Real Time Time Stamp User ID Device ID Site Time Spent Sec Items Viewed 9:00 AM U1 D1 Apps 59 5 9:30 AM U2 D2 Apps 179 15 10:00 AM U3 D3 Music 29 3 1:00 PM U1 D4 Music 89 10 Billions of Rows or Key, Value Pairs … Example: Web Site Logs
  • 5. Some Very Common Queries … Unique Identifiers with Set Expressions: (AUB) ∩ (CUD) E Quantiles, CDFs Histograms, PMFs Frequent Items / Heavy Hitters Vector & Matrix Operations: SVD, etc. Graph Analysis Uniform Weighted Reservoir Sampling 5 ⋯ 2 ⋮ ⋱ ⋮ 4 ⋯ 3 All Non-Additive
  • 6. When The Data Gets Large Or Resources Are Limited, All Of These Queries Become Problematic, Because the Aggregations are Non-Additive. A B Result A Result B + Combined Result Current Result New Item Updated Result+
  • 7. Col1, ..., Item, ..., Coln … Billions of rows ... Ω(u) size: ~ Big Data Exact Results Difficult Query Local Item Copies Query processing often requires sorting …which is very slow. Query Engine Traditional, Exact Analysis Methods Require Local Copies Note: Micro-batch “Streaming Platforms”, e.g., Storm Do Not Solve The Fundamental Problem! Big Data
  • 8. Parallelization Does Not Help Much Because of Non-Additivity. You have to keep the copies somewhere! Col1, ..., Item, ..., Coln … Billions of rows ... 𝝨 Exact Results Expensive Shuffle Copy Copy Copy Copy Copy Copy Example: Map-Reduce
  • 9. Every dataset is processed N times for a rolling N-day window! Traditional, Exact Time Windowing Requires Multiple Touches of Every Item
  • 10. Let’s challenge a fundamental premise: … that our results must be exact! If we can allow for approximation, along with some accuracy guarantees, we can achieve orders-of-magnitude improvement in • speed and • reduction of resources.
  • 11. Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm) Stream Processor Data Structure Size = f(k) Query Processor Stochastic Process Merge / Set Operations Results +/- ε 𝛆 = f(1/k) Probabilistic Analysis Result Sketch Sketch Stream Data Stream Query Model the Problem as a Stochastic Process Analyze using Probability & Statistics Random Selection Sizing, Storing
  • 12. How & Why Sketches Achieve Superior Performance For Systems Processing Massive Data
  • 13. Major Sketch Properties • Small Stored Size • Sub-linear in Space • Single-pass, “One-Touch” • Data Insensitive • Mergeable • Approximate, Probabilistic • Mathematically Proven Error Properties Sub-linear Stream Size Linear SketchSize Sketching Sampling Sketches Overlap with Sampling Based on the Specific Sketch
  • 14. Win #1: Small Query Space O(k) size: ~ Kilobytes Approximate Answer ± ε Difficult Query Minimal or no sorting required! Ideal for Streaming & Batch Query Engine Sketch Col1, ..., Item, ..., Coln … Billions of rows ... Sketches Start Small Sublinear Means they Stay Small Single Pass Simplifies Processing
  • 15. Win #2: Mergeability Sketch Approx. ± ε Sketch Sketch … , Item … many rows ... … , Item … many rows ... Query Query Partitions Merge Mergeability Enables Parallelism … With No Additional Loss of Accuracy! Sketches Transform Non-Additive Metrics Into Additive Objects The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches Col1, ..., Item, ..., Coln … Billions of rows ...
  • 16. Intermediate Hyper-Cube Staging Enables Query Speed Additivity Enables Simpler Architecture Win 3: Near-Real Time Query Speed Win 4: Simpler Architecture
  • 17. Win #5: Simplified Time Windowing Plus Late Data Processing
  • 18. Near-Real time Results, with History!
  • 19. Win #6: Lower System Cost ($) Case Study: Real-time Flurry, Before and After • Customers: >250K Mobile App Developers • Data: 40-50 TB per day • Platform: 2 clusters X 80 Nodes = 160 Nodes – Node: 24 CPUs, 250GB RAM Before Sketches After Sketches VCS* / Mo. ~80B ~20B Result Freshness Daily: 2 to 8 hours; Weekly: ~3 days Real-time Unique Counts Not Feasible 15 seconds! Big Wins! Near-Real Time Lower System $ * VCS: Virtual Core Seconds
  • 21. The DataSketches Team Core Team / Committers • Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012 • Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015 • Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016 • Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015 • Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015 • Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018 • Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018 Extended Team & Consultants • Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017 • Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019 • Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019 … And our Community is Growing! * VM = Verizon Media
  • 22. Our Mission… Combine Deep Science with Exceptional Engineering To Develop Production Quality Sketches That Address These Difficult Queries
  • 23. Cardinality, 4 Families • HLL (on/off Heap): A very high performing implementation of this well-known sketch • CPC: The best accuracy per space • Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap • Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches: Quantiles Sketches, 2 Families • Quantiles, Histograms, PMF’s and CDF’s of streams of comparable objects, on/off Heap. KLL, highly optimized for accuracy-space. • Relative Error Quantiles (under development) Frequent Items (Heavy-Hitters) Sketches, 2 Families • Frequent Items: Weighted or Unweighted • Frequent Directions: Approximate SVD (a Vector Sketch) Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families • Uniform and weighted sampling to fixed-k sized buckets Specialty Sketches • Customer Engagement, Frequent Distinct Tuples, Maps, etc. The Apache DataSketches Library Languages Supported: • Java, C++, Python • Binary Compatibility
  • 24. Bright Future for Sketching Technology & Solutions Items (words, IDs, events, clicks, …) • Count Distinct • Frequent Items, Heavy-Hitters, etc • Quantiles, Ranks, PMFs, CDFs, Histograms • Set Operations • Sampling • Mobile (IoT) • Moment and Entropy Estimation Graphs (Social Networks, Communications, …) • Connectivity • Cut Sparsification • Weighted Matching • … Areas where we have sketch implementations Areas of research (World-wide) Vectors (text docs, images, features, …) & Matrices (text corpora, recommendations, …) • Dimensionality Reduction (SVD) • Covariance Estimation • Low Rank Approximation • Sparsification • Clustering (k-means, k-median, …) • Linear Regression • Machine Learning (in some areas) • Density Estimation
  • 25. THANK YOU! Open Invitation for Collaboration Learn More About Apache DataSketches Come And Visit Us! https://datasketches.apache.org