High-Performance Analytics
with spark-alchemy
Sim Simeonov, Founder & CTO, Swoop
@simeons / sim at swoop dot com
Improving patient outcomes
LEADING HEALTH DATA LEADING CONSUMER DATA
Lifestyle
Magazine subscriptions
Catalog purchases
Psychographics
Animal lover
Fisherman
Demographics
Property records
Internet transactions
• 280M unique US patients
• 7 years longitudinal data
• De-identified, HIPAA-safe
1st Party Data
Proprietary tech to
integrate data
NPI Data
Attributed to the
patient
Claims
ICD 9 or 10, CPT,
Rx and J codes
• 300M US Consumers
• 3,500+ consumer attributes
• De-identified, privacy-safe
Petabyte scale privacy-preserving ML/AI
http://bit.ly/spark-records
http://bit.ly/spark-alchemy
process fewer rows of data
The key to high-performance analytics
the most important attribute of a
high-performance analytics system
is the reaggregatability of its data
count(distinct …)
is the bane of high-performance analytics
because it is not reaggregatable
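A toy illustration in plain Python (not Spark) of why count(*) reaggregates across partitions while count(distinct …) does not:

```python
# Rows split across two hypothetical partitions.
part1 = ["advil", "tylenol", "advil"]   # partition 1
part2 = ["advil", "motrin"]             # partition 2

# count(*): summing per-partition counts gives the global count.
assert len(part1) + len(part2) == len(part1 + part2)

# count(distinct): summing per-partition distinct counts double-counts
# "advil", which appears on both partitions.
per_partition = len(set(part1)) + len(set(part2))
global_distinct = len(set(part1 + part2))
assert per_partition != global_distinct
```

To get the true distinct count, every value has to be brought together (shuffled) before deduplication, which is exactly what makes it expensive at scale.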
Reaggregatability
root
|-- date: date
|-- generic: string
|-- brand: string
|-- product: string
|-- patient_id: long
|-- doctor_id: long
Demo system: prescriptions in 2018
• Narrow sample
• 10.7 billion rows / 150Gb
• Small-ish Spark 2.4 cluster
• 80 cores, 600Gb RAM
• Delta Lake, fully cached
select * from prescriptions
Generic name | Brand name | National Drug Code (NDC)
select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescriptions
group by 1
order by 1
Count scripts, generics & brands by month
Time: 145 secs
Input: 10.7B rows / 10Gb
Shuffle: 39M rows / 1Gb
decompose aggregate(…) into
reaggregate(preaggregate(…))
Divide & conquer
Preaggregate once; reaggregate many times
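A plain-Python sketch of the divide-and-conquer pattern (the slides use Spark SQL tables; the brand names here are made up): preaggregate the raw rows once, then answer queries from the small table.

```python
from collections import Counter

# Raw rows: (month, brand) per prescription.
scripts = [("2018-01", "advil"), ("2018-01", "advil"),
           ("2018-01", "tylenol"), ("2018-02", "advil")]

# Preaggregate once: counts by (month, brand),
# like prescription_counts_by_month in the slides.
pre = Counter(scripts)

# Reaggregate many times: per-month script totals and distinct brand
# counts, reading only the preaggregated rows.
totals, brands = {}, {}
for (month, brand), n in pre.items():
    totals[month] = totals.get(month, 0) + n
    brands.setdefault(month, set()).add(brand)

# totals["2018-01"] == 3, len(brands["2018-01"]) == 2
```

count(*) decomposes into a sum of partial counts, and count(distinct brand) works on the small table because brand is a grouping key of the preaggregation.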
Preaggregate by generic & brand by month
create table prescription_counts_by_month
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts
from prescriptions
group by 1, 2, 3
select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescription_counts_by_month
group by 1
order by 1
Count scripts, generics & brands by month v2
Time: 3 secs (50x faster)
Input: 2.6M rows / 100Mb
Shuffle: 2.6M rows / 100Mb
select *, raw_count / agg_count as row_reduction
from
(select count(*) as raw_count from prescriptions)
cross join
(select count(*) as agg_count from prescription_counts_by_month)
Only 50x faster because of job startup cost
high row reduction is only possible when
preaggregating low cardinality dimensions,
such as generic (7K) and brand (20K), but not
product (350K) or patient_id (300+M)
The curse of high cardinality (1 of 2)
small shuffles are only possible with
low cardinality count(distinct …)
The curse of high cardinality (2 of 2)
select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(distinct patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Adding a high-cardinality distinct count
Time: 370 secs :(
Input: 10.7B rows / 21Gb
Shuffle: 7.5B rows / 102Gb
Maybe approximate counting can help?
select
to_date(date_trunc("month", date)) as date,
approx_count_distinct(generic) as generics,
approx_count_distinct(brand) as brands,
approx_count_distinct(patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Approximate counting, default 5% error
Time: 120 secs (3x faster)
Input: 10.7B rows / 21Gb
Shuffle: 6K rows / 7Mb
approx_count_distinct()
still has to look at every row of data
3x faster is not good enough
How do we preaggregate
high cardinality data
to compute distinct counts?
1. Preaggregate
Create an HLL sketch from data for distinct counts
2. Reaggregate
Merge HLL sketches (into HLL sketches)
3. Present
Compute cardinality of HLL sketches
Divide & conquer using HyperLogLog
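The three steps above can be sketched in plain Python. This is a minimal HyperLogLog-style sketch for illustration only: it is not spark-alchemy's implementation and omits the bias corrections a production HLL applies.

```python
import hashlib
import math

class HLL:
    def __init__(self, p=10):
        self.p, self.m = p, 1 << p        # m = 2^p registers
        self.regs = [0] * self.m

    def _hash(self, item):                # deterministic 64-bit hash
        return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")

    def add(self, item):                  # 1. preaggregate: fold item into sketch
        h = self._hash(item)
        idx = h >> (64 - self.p)          # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):               # 2. reaggregate: register-wise max
        self.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return self

    def cardinality(self):                # 3. present: harmonic-mean estimator
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if est <= 2.5 * self.m and zeros: # linear counting for small cardinalities
            est = self.m * math.log(self.m / zeros)
        return est

# Two overlapping "partitions" of patient ids; merging the sketches
# estimates the distinct count of their union.
left, right = HLL(), HLL()
for pid in range(50_000):
    left.add(pid)
for pid in range(30_000, 80_000):
    right.add(pid)
estimate = left.merge(right).cardinality()   # true distinct union is 80,000
```

With m = 1024 registers the typical relative error is about 1.04/√m ≈ 3%, at a fixed sketch size of ~1 KB regardless of how many values are added.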
HLL in spark-alchemy
https://github.com/swoop-inc/spark-alchemy
Preaggregate with HLL sketches
create table prescription_counts_by_month_hll
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts,
hll_init_agg(patient_id) as patient_ids
from prescriptions
group by 1, 2, 3
select
to_date(date_trunc("month", date)) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_merge(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
Reaggregate and present with HLL sketches
Time: 7 secs (50x faster)
Input: 2.6M rows / 200Mb
Shuffle: 2.6M rows / 100Mb
the intuition behind HyperLogLog
Distribute n items randomly in k buckets
E(distance) ≅ k / n
E(min) ≅ k / n
⇒ n ≅ k / E(min)
more buckets == greater precision
HLL sketch ≅ a distribution of mins
(figure: histogram of bucket minimums, centered near the true mean)
HyperLogLog sketches are reaggregatable
because min reaggregates with min
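The intuition can be checked with a toy simulation in plain Python (hash values of distinct items behave like uniform draws; this is not real HLL, just the minimum-based estimator from the slides):

```python
import random

random.seed(7)              # fixed seed so the toy run is repeatable
n, k = 5000, 256            # n distinct items, k buckets

values = [random.random() for _ in range(n)]    # "hashed" items in [0, 1)
buckets = [[] for _ in range(k)]
for i, v in enumerate(values):
    buckets[i % k].append(v)                    # partition into k buckets
bucket_mins = [min(b) for b in buckets]         # the "sketch": k minimums

# Each bucket holds ~n/k items, so E(min) ≈ 1 / (n/k + 1);
# invert the mean of the bucket minimums to estimate n.
mean_min = sum(bucket_mins) / k
n_est = k * (1 / mean_min - 1)
```

Averaging over k buckets is what shrinks the variance of the estimate; a single minimum would be far too noisy.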
Making it work in the real world
• Data is not uniformly distributed…
• Hash it!
• How do we get many “samples” from one set of hashes?
• Partition them!
• Can we get a good estimate for the mean?
• Yes, with some fancy math & empirical corrections.
• Do we actually have to keep the minimums?
• No, just keep the number of 0s before the first 1 in binary form.
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
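The last point above is what makes sketches tiny: each "minimum" collapses to a small integer. A hypothetical helper (not from spark-alchemy) showing the idea:

```python
# The "minimum" survives only as the position of the first 1 bit in the
# hash, i.e. the number of leading zeros plus one.
def rank(hash_value, width=32):
    return width - hash_value.bit_length() + 1

# 0b1000... has no leading zeros (rank 1); 0b...0001 has width - 1 of them.
```

Storing a max-of-ranks per bucket needs only a few bits per register instead of a full floating-point minimum.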
my boss wants me to count precisely
Sketch sizes affect estimation errors
• ClearSpring HLL++ https://github.com/addthis/stream-lib
• No known interoperability
• Neustar (Aggregate Knowledge) HLL https://github.com/aggregateknowledge/java-hll
• Postgres & JavaScript interop
• BigQuery HLL++ https://github.com/google/zetasketch
• BigQuery interop (PRs welcome!)
spark-alchemy & HLL interoperability
hll_convert(hll_sketch, from, to)
• High-performance interactive analytics
• Preaggregate in Spark, push to Postgres / Citus, reaggregate there
• Better privacy
• HLL sketches contain no identifiable information
• Unions across columns
• No added error
• Intersections across columns
• Use inclusion/exclusion principle; increases estimate error
Other benefits of using HLL sketches
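The union and intersection bullets can be made concrete with exact sets (hypothetical patient ids). The same inclusion/exclusion identity applies to HLL estimates, except that each estimated term then contributes its own error:

```python
# Exact sets standing in for HLL sketches.
drug_a = {"p1", "p2", "p3", "p4"}   # patients on drug A
drug_b = {"p3", "p4", "p5"}         # patients on drug B

union = len(drug_a | drug_b)              # HLL unions add no extra error
both = len(drug_a) + len(drug_b) - union  # inclusion/exclusion: |A ∩ B|
```

With sketches, |A|, |B|, and |A ∪ B| are each estimates, so the derived intersection is noisier than any of them, especially when the true intersection is small.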
• Experiment with the HLL functions in spark-alchemy
• Can you keep big data in Spark only and interop with HLL sketches?
• We’d welcome a PR that adds BigQuery support to spark-alchemy
• Last but not least, do you want to build tools to make Spark great
while improving the lives of millions of patients?
Calls to Action
sim at swoop dot com
