Unafraid of change
Optimizing ETL, ML & AI in fast-paced environments
Sim Simeonov, CTO, Swoop
sim at swoop.com / @simeons
#SAISDD13
omni-channel marketing for your ideal population
supported by privacy-preserving petabyte-scale
ML/AI over rich data (thousands of columns)
e.g., we improve health outcomes by increasing the
diagnosis rate of rare diseases through doctor/patient education
exploratory data science
the myth of data science
The reality of data science is
the curse of too many choices
(most of which are no good)
• 10? Too slow to train.
• 2? Insufficient differentiation.
• 5? False differentiation.
• 3? Insufficient differentiation.
• 4? Looks OK to me. Ship it!
exploratory data science is an
interactive search process
(with fuzzy termination criteria)
driven by the discovery of new information
Exploration tree (figure): each node is a notebook cell evaluation; backtracking is code modification.
Backtracking changes production plans
df.sample(0.1, seed = 0).select('x)
df.where('x.isNotNull).sample(0.1, seed = 0).select('x, 'y)
df.where('x > 0).sample(0.1, seed = 0).select('x, 'y)
df.where('x > 0).sample(0.2, seed = 0).select('x, 'y)
df.where('x > 0).sample(0.2, seed = 123).select('x, 'y)
Backtracking is not easy with Spark & notebooks
Go back & change code
• Past state is lost
Duplicate code
val df1 = …
val df2 = …
val df3 = …
// No, extensibility through functions doesn’t work
Team view (figure): across the team’s exploration trees, many take some branches while only one takes others; thick branches benefit from caching.
Spark’s df.cache() has many limitations
• Single cluster only
– cached data cannot be shared by the team
• Lost after cluster crash or restart
– hours of work gone due to OOM or spot instance churn
• Requires df.unpersist() to release resources
– but backtracking means df (the cached plan) is lost
• No dependency checks
– inconsistent computation across team; stale data inputs
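For instance, a minimal sketch of the unpersist() problem during backtracking (the dataset is a stand-in; in a notebook, the rebinding happens when a cell is edited and re-evaluated):

import spark.implicits._

// A stand-in dataset with columns x and y
val df = spark.range(1000000).select(('id % 100).as("x"), 'id.as("y"))

// A cell caches an exploration branch...
var branch = df.where('x > 0).sample(0.1, seed = 0).cache()

// ...then the cell is edited and re-evaluated with a new filter (backtracking).
// Rebinding drops the only reference to the previously cached plan, so the
// old cache entry can never be released with unpersist().
branch = df.where('x > 50).sample(0.1, seed = 0).cache()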
Ideal exploratory data production requires…
• Automatic cross-cluster reuse based on the Spark plan
– Configuration-addressed production (CAP)
• Not having to worry about saving/updating/deleting
– Automated lifecycle management (ALM)
• Easy control over the freshness/staleness of data used
– Just-in-time dependency resolution (JDR)
df.cache() → df.capCache()
part of https://github.com/swoop-inc/spark-alchemy
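As a usage sketch, continuing the backtracking example above (df and the spark-alchemy implicit import are assumed; capCache() is the library’s method, the rest is illustration):

// With spark-alchemy's capCache() implicits in scope (import omitted)
// and df as in the backtracking example above.

// cache(): cluster-local, lost on restart, released only via unpersist()
val local = df.where('x > 0).sample(0.1, seed = 0).cache()

// capCache(): materialized once under the hash of its canonicalized plan (CAP),
// expired automatically via TTL (ALM), dependency-checked on reuse (JDR)
val shared = df.where('x > 0).sample(0.1, seed = 0).capCache()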
Example: resampling with data enhancement (figure)
• large production table (updated frequently) → filter(condition) & sample(rate, seed) → data sample
• data sample → 1..k: sample(rate, seed) → sub-sample i
• sub-sample i → project → join with data enrichment dimension (updated sometimes) → project
• union of the k projected sub-samples → result
Are primes % 10,000 interesting numbers?
// Get the data
val inum = spark.table("interesting_numbers")
val primes = spark.table("primes")
val limit = 10000
// Sample the primes
val sample = primes.sample(0.1, seed = 0)
// Resample, calculate stats & union
val stats = (1 to 30).map { i =>
  sample.select((col("prime") % limit).as("n"))
    .sample(0.2, seed = i).distinct()
    .join(inum.select(col("n")), Seq("n"))
    .select(lit(i).as("sample"), count(col("*")).as("cnt_ns"))
}.reduce(_ union _)
// Show distribution of the percentage of interesting prime modulos
display(stats.select(('cnt_ns * 100 / limit).as("pct_ns_of_limit")))
The same computation, with .capCache() added at the two reuse points (the shared base sample and each per-iteration result):

// Get the data
val inum = spark.table("interesting_numbers")
val primes = spark.table("primes")
val limit = 10000
// Sample the primes
val sample = primes.sample(0.1, seed = 0).capCache()
// Resample, calculate stats & union
val stats = (1 to 30).map { i =>
  sample.select((col("prime") % limit).as("n"))
    .sample(0.2, seed = i).distinct()
    .join(inum.select(col("n")), Seq("n"))
    .select(lit(i).as("sample"), count(col("*")).as("cnt_ns")).capCache()
}.reduce(_ union _)
// Show distribution of the percentage of interesting prime modulos
display(stats.select(('cnt_ns * 100 / limit).as("pct_ns_of_limit")))
behind the curtain…
primes.sample(0.1, seed=0).explain(true)
== Analyzed Logical Plan ==
ordinal: int, prime: int, delta: int
Sample 0.0, 0.1, false, 0
+- SubqueryAlias primes
+- Relation[ordinal#2034,prime#2035,delta#2036] parquet
== Optimized Logical Plan ==
Sample 0.0, 0.1, false, 0
+- Relation[ordinal#2034,prime#2035,delta#2036] parquet
== Physical Plan ==
*(1) Sample 0.0, 0.1, false, 0
+- *(1) FileScan parquet default.primes[ordinal#2034,prime#2035,delta#2036]
Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]
The same plan, before and after cross-cluster canonicalization (expression IDs such as #2034 are session-specific; canonicalization rewrites them to stable IDs before hashing):

ordinal: int, prime: int, delta: int
Sample 0.0, 0.1, false, 0
+- Relation[ordinal#2034,prime#2035,delta#2036] parquet
+- InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]

ordinal: int, prime: int, delta: int
Sample 0.0, 0.1, false, 0
+- Relation[ordinal#0,prime#1,delta#2] parquet
+- InMemoryFileIndex[dbfs:/user/hive/warehouse/primes]

CAP (hash: 4bb0070bd5f50bec95dd95f76130f55cd2d0c6bc)
ALM (createdAt: {now}, expiresAt: {based on TTL})
JDR (dependencies: ["table:default.primes"])
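A minimal sketch of how the CAP key might be derived, assuming we simply hash the string form of Spark’s canonicalized logical plan (which normalizes expression IDs); capKey is a hypothetical helper, not spark-alchemy’s actual code:

import java.security.MessageDigest
import org.apache.spark.sql.DataFrame

// Hypothetical helper: a stable, cross-cluster key for a DataFrame's plan.
// The canonicalized plan normalizes expression IDs (#2034 -> #0), so the
// same computation hashes to the same key on any cluster or session.
def capKey(df: DataFrame): String = {
  val plan = df.queryExecution.analyzed.canonicalized.toString
  MessageDigest.getInstance("SHA-1")
    .digest(plan.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
}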
Spark’s user-level data reading/writing APIs
are inconsistent, inflexible and not extensible.
To implement CAP, we had to design a parallel set of APIs.
Reading
• Inconsistent
spark.table(tableName)
spark.read + configuration + .load(path)
.load(path1, path2, …)
.jdbc(…)
• Consistent
spark.reader + configuration + .read()
Writing
• Inconsistent
df.write + configuration + .save()
.saveAsTable(tableName)
.jdbc(…)
• Consistent
df.writer + configuration + .write()
// You can read() from the result of write()
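A minimal, self-contained sketch of what such symmetric builders could look like (the reader/read() and writer/write() names come from the slides; everything else is assumed and is not spark-alchemy’s actual design):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Every source/sink is configured the same way and finished with a verb.
class Reader(spark: SparkSession, conf: Map[String, String]) {
  def option(k: String, v: String): Reader = new Reader(spark, conf + (k -> v))
  def read(): DataFrame =
    spark.read
      .format(conf("format"))
      .options(conf - "format" - "path")
      .load(conf("path"))
}

class Writer(df: DataFrame, conf: Map[String, String]) {
  def option(k: String, v: String): Writer = new Writer(df, conf + (k -> v))
  // write() returns a Reader over the same configuration,
  // which is why you can read() from the result of write().
  def write(): Reader = {
    df.write
      .format(conf("format"))
      .options(conf - "format" - "path")
      .save(conf("path"))
    new Reader(df.sparkSession, conf)
  }
}

// e.g., new Writer(df, Map("format" -> "parquet", "path" -> "/tmp/out")).write().read()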
Spark has two implicit mechanisms
to manage where data is written & read from:
the Hive metastore & “you’re on your own”.
We made authorities explicit, first-class objects.
df.capCache() is
df.writer.authority("CAP").readOrCreate()
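A sketch of what readOrCreate() against an explicit authority might do (the Authority trait and PathAuthority below are assumptions based on the slide, not the actual spark-alchemy types):

import org.apache.spark.sql.{AnalysisException, DataFrame, SparkSession}

// An authority is an explicit, named place where data lives, replacing the
// implicit choice between the Hive metastore and ad hoc paths.
trait Authority {
  def name: String
  def url(key: String): String // where data for a given CAP key is stored
}

case class PathAuthority(name: String, root: String) extends Authority {
  def url(key: String): String = s"$root/$key"
}

// Return existing data for this key if the authority already has it;
// otherwise produce it once and persist it for everyone else.
def readOrCreate(spark: SparkSession, auth: Authority, key: String)
                (produce: => DataFrame): DataFrame =
  try spark.read.parquet(auth.url(key))
  catch {
    case _: AnalysisException =>
      produce.write.parquet(auth.url(key))
      spark.read.parquet(auth.url(key))
  }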
Spark offers no ability to inject behaviors into reading/writing.
We added reading & writing interceptors.
Dependency management works via an interceptor.
trait InterceptorSupport[In, Out] {
  def intercept(f: In => Out): (In => Out)
}
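For example, dependency tracking for JDR could be implemented as a read interceptor (a sketch; ReadRequest and the recording mechanism are assumptions):

import org.apache.spark.sql.DataFrame
import scala.collection.mutable

// Hypothetical request type: what a read is asked to produce.
case class ReadRequest(source: String)

// Wrap every read to record which sources a computation depends on,
// so JDR can later check their freshness.
class DependencyRecorder extends InterceptorSupport[ReadRequest, DataFrame] {
  val dependencies = mutable.Set.empty[String]
  def intercept(f: ReadRequest => DataFrame): (ReadRequest => DataFrame) =
    req => {
      dependencies += s"table:${req.source}" // e.g., "table:default.primes"
      f(req) // delegate to the real read
    }
}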
It’s not hard to fix the user-level reading/writing APIs.
Let’s not waste the chance to do it for Spark 3.0.
Interested in challenging data engineering, ML & AI on big data?
I’d love to hear from you. sim at swoop.com / @simeons
Data science productivity matters.
https://github.com/swoop-inc/spark-alchemy
