Running Spark in Production
in the Cloud is not easy
Nayur Khan
#SAISEnt12
Who am I?
Nayur Khan
Head of Platform Engineering
Born and proven in F1, where the smallest margins are the difference between winning and losing.
Data has emerged as a fundamental element of competitive advantage.

QuantumBlack
"Exploit Data, Analytics and Design to help our clients be the best they can be"
Not just Formula One…
Advanced Industries, Financial Services, Healthcare, Infrastructure, Telecoms, Natural Resources, Sport, Consumer
1 – Background
2 – Observations of using Cloud Object Storage
3 – Design To Operate
4 – Pipelines!
Background
1
Some of the Platforms we use
Nerve Live
Typical Flow for production

[Diagram: Raw Data → Prepared Data (Cleaned, Enriched and Linked) → Advanced analytics → User modules, all running on an Analytics Platform over Infrastructure / Hosting. Leverage unconnected, unstructured data by preparing and linking data and creating features; apply Advanced Analytics and ML to features; the outcome is better & more consistent decision making.]

1. Data is ingested into a Raw storage area
2. Data is Prepared
   • Cleansed
   • Enriched
   • Linked
   • etc.
3. Models are run
4. User Tools consume outputs
   • Rich custom-built tools
   • Self-service BI tools
Example of physical architecture of Nerve Live in AWS

[Diagram: Master Servers running Zookeeper and Mesos / Marathon; Slave Servers running Mesos Slave / Chronos with Analytic Modules and Spark Analytics; Data Buckets; Published Data Storage; Services; Logs; Metastore]

1. Mesos-based Spark cluster
   • Open Source versions
2. Data stored mainly in S3
   • Raw
   • Prepared
   • Features
   • Models
3. Data published at end of Pipeline (RDS)
4. Spark Analytic workloads
   • Scheduled via Marathon or Chronos
Example of how we package Spark Jobs

[Image layers: O/S Image → Java-Base Image (+ config) → Spark-Base-2.3.1 Image (dependencies + config) → App Image (Spark code + config)]

• Use Docker
• Provide a Spark-Base-2.x.x Image for teams to package code into
• Teams package up their code into the provided image
• Can tell Spark to tell Mesos which image to use for executors:
  spark.mesos.executor.docker.image
• Bonus – we can run different versions of Spark workloads in the cluster at the same time!
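As a minimal sketch of wiring this up (the Mesos master URL and image name below are hypothetical placeholders, not our real registry), the executor image can be set when the session is created:

import org.apache.spark.sql.SparkSession

// Sketch only: point Spark-on-Mesos executors at a team's Docker image.
// The master URL and image name are hypothetical placeholders.
val spark = SparkSession.builder()
  .appName("my-analytics-job")
  .master("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
  .config("spark.mesos.executor.docker.image", "registry.example.com/team/spark-app:2.3.1")
  .getOrCreate()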
Observations of using Cloud Object Storage
2
Background – File Storage

Directory → Physical:
/
  foo/   a.csv
  bar/   b.csv, c.avro
  temp/  ...

File Storage:
• Organises (file) data using a hierarchical structure (folders & files)
• Allows you to:
  (+) Traverse a path
  (+) Quickly "list" files in a directory
  (+) Quickly "rename" files
Background – Object Storage

Key → Physical:
/foo/a.csv
/bar/b.csv
/bar/c.avro

Object Storage:
• No concept of organisation: keys → data
• Unlike File Storage:
  (–) Slow to "list" items in a "directory"
  (–) Slow to "rename" files
  (–) Uses REST calls
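To make the "rename" cost concrete, here is an illustrative Scala sketch (bucket and key names are hypothetical placeholders): on S3A a rename is implemented client-side as a copy followed by a delete, so it costs REST calls proportional to the data rather than the constant-time metadata update of a real file system.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch: on S3A there is no server-side rename, so this call
// turns into COPY + DELETE requests against the object store.
// Bucket and key names are hypothetical placeholders.
val fs = FileSystem.get(new URI("s3a://my-bucket/"), spark.sparkContext.hadoopConfiguration)
fs.rename(new Path("/tmp/part-00000.csv"), new Path("/data/part-00000.csv"))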
First Observation – The "Intent" to Read Data

// Define a dataframe
val df = spark.read.format( "csv" )
  .schema( csvSchema ).load( "s3a://..." )
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...

[Flow: Spark Code → Datasource API → Hadoop Libs → AWS Libs → S3. load( path ) resolves the key and issues HTTPS HEAD / GET REST calls, each taking milliseconds; how many depends on the number of files/partitions.]
First Observation – The "Intent" to Read Data – How many HTTPS calls?

With Schema:
// Define a dataframe
val df = spark.read.format( "csv" )
  .schema( csvSchema ).load( "s3a://..." )
// Do some work

vs. Infer Schema:
// Define a dataframe
val df = spark.read.format( "csv" )
  .option( "inferSchema", true )
  .load( "s3a://..." )
// Do some work

With Schema – Path to 1 x CSV File:
         2.1   2.2   2.3
TOTAL      7     6     4
HEAD       6     5     4
GET        1     1     0

With Schema – Path to 200 x CSV Files:
         2.1   2.2   2.3
TOTAL     15    11    14
HEAD       9     7     9
GET        6     4     5

Infer Schema – Path to 1 x CSV File:
         2.1   2.2   2.3
TOTAL     13    14    14
HEAD      10    10    10
GET        3     4     4

Infer Schema – Path to 200 x CSV Files:
         2.1   2.2   2.3
TOTAL    817   613  1082
HEAD     610   408   643
GET      207   205   439

Spark Versions:
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
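For completeness, a minimal sketch of what a csvSchema like the one referenced above might look like (the column names are hypothetical); supplying it up front is what avoids the extra read pass that inferSchema triggers:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

// Hypothetical example schema: defining it explicitly means Spark does not
// have to scan the CSV files (extra GETs) just to infer column types.
val csvSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))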
Second Observation – Writing Data – why does it take so long?

// Define a dataframe
df.write.format( "parquet" ).save( "s3a://..." )
// Do some work...

# Custom log4j properties
...
log4j.logger.com.amazonaws.request=DEBUG
...

[Flow: Spark Code → Datasource API → Hadoop Libs → AWS Libs → S3. save( path ) resolves the key and issues HTTPS PUT / HEAD / GET / DELETE REST calls, each taking milliseconds; how many depends on the number of files/partitions.]
Second Observation – Writing Data – How many HTTPS calls?

Default:
// Define a dataframe
df.write.format( "parquet" ).save( "s3a://..." )
// Do some work...

Default – 1 Partition:
          2.1   2.2   2.3
TOTAL     165   151   152
HEAD      103    95    96
GET        53    47    47
PUT         7     7     7
DELETE      2     2     2

Default – 10 Partitions:
          2.1   2.2   2.3
TOTAL    1092  1078   381
HEAD      697   689   244
GET       323   317   114
PUT        52    52    17
DELETE     20    20     6

Go Faster:
spark.conf.set(
  "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
df.write.format( "parquet" ).save( "s3a://..." )

Go Faster – 1 Partition:
          2.1   2.2   2.3
TOTAL     126   112   113
HEAD       79    71    72
GET        40    34    34
PUT         5     5     5
DELETE      2     2     2

Go Faster – 10 Partitions:
          2.1   2.2   2.3
TOTAL     738   724   272
HEAD      475   467   176
GET       220   214    81
PUT        32    32    11
DELETE     11    11     4

Spark Versions:
Spark 2.1 - spark-2.1.0-hadoop2.7
Spark 2.2 - spark-2.2.1-hadoop2.7
Spark 2.3 - spark-2.3.1-hadoop2.7
Conclusion

• Cloud object storage has tradeoffs
  • It gives incredible storage capacity and scalability
  • However, do not expect the same performance or characteristics as a file system
• There are lots of configuration settings that can affect your performance when reading/writing to cloud object storage
  Note: some are Hadoop version specific!

spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.connection.maximum=xxx
spark.hadoop.fs.s3a.multipart.size=xxx
spark.hadoop.fs.s3a.attempts.maximum=xxx
spark.hadoop.fs.s3a.multipart.threshold=xxx
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
etc ...
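As an illustrative sketch (the concrete values below are hypothetical placeholders, not recommendations; the right numbers depend on workload and Hadoop version), such settings can be baked in once at session creation rather than left to individual jobs:

import org.apache.spark.sql.SparkSession

// Sketch only: example S3A tuning applied once when the session is built.
// All values here are hypothetical placeholders.
val spark = SparkSession.builder()
  .appName("tuned-s3a-job")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .config("spark.hadoop.fs.s3a.connection.maximum", "100")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()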
Design to Operate
3
Design to Operate – Choices

• Our teams have faced choices for loading/saving data:
  • What URI scheme to use?
    e.g. s3://, s3n://, s3a://, jdbc://, etc.
  • What file format?
    e.g. Parquet, Avro, CSV, etc.
  • What compression to use?
    e.g. Snappy, LZO, Gzip, etc.
  • What additional options?
    e.g. inferSchema, mergeSchema, pushDownPredicate, etc.
  • Even what path to load/save data
• Choices end up being "embedded" into code:

val path = "s3n://my-bucket/csv-data/people"
val peopleDF =
  spark.read.format( "csv" )
    .option( "sep", ";" )
    .option( "inferSchema", "true" )
    .option( "header", "true" )
    .load( path )
...
Design to Operate – Challenges

• Different teams working in the same cluster
  • Working on different use cases
  • Likely that Spark Jobs are running with different settings in the same cluster
• Those that build may not be the ones that end up running/operating Spark Jobs
  • Support Ops
  • Or client IT teams
• Difficult for those operating to understand:
  • What configuration settings are in play?
  • What does a job depend on? i.e. the data it needs
  • What depends on it? i.e. the data it writes
  • How can I change things quickly?
Design to Operate – Our Approach

• Borrow ideas from the DevOps world
  • "Strict separation of config from code"
• Custom library and config for loading and saving
  • Spark Code USES Nerve Load/Save Library USES Config
• The config captures all the load/save choices:
  • What URI scheme to use? e.g. s3://, s3n://, s3a://, jdbc://, etc.
  • What format? e.g. Parquet, Avro, CSV, etc.
  • What compression to use? e.g. Snappy, LZO, Gzip, etc.
  • What additional options? e.g. inferSchema, mergeSchema, overwrite or append, etc.
  • Even what path to load/save data

Example config:

"system": "bank",
"table": "customers",
"type": "input",
"description": "A Table with banking customer details.",
"format": {
  "type": "csv",
  "compression": "snappy",
  "partitions": "20",
  "mode": "overwrite"
},
"location": {
  "type": "s3",
  "properties": { ... }
},
"schema": [
  {
    "name": "id",
    "type": "int",
    "doc": "User Id"
  },
  ...
Design to Operate – Our Approach

// Load a dataframe
val df = nerve.load( "system", "table" )

// Save a dataframe
nerve.save( df, "system", "table" )

• Engineers use this library in their code
• Allows engineers to focus on business logic and not worry about concerns such as:
  • Data formats
  • Compressions
  • Schemas
  • Partitions
  • Weird optimisation settings
• Allows teams that focus on operating/running to focus on:
  • Consistency
  • Runtime optimisations
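The Nerve library itself is internal, but a minimal sketch of the idea (all names and the config-lookup mechanism here are hypothetical; the real library reads the JSON config shown earlier) looks something like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of a config-driven load/save wrapper.
// `lookup` stands in for reading the JSON config; here it is stubbed
// with a simple in-memory map for illustration.
object nerve {
  private case class TableConf(format: String, path: String, mode: String)

  private val conf = Map(
    ("bank", "customers") -> TableConf("csv", "s3a://my-bucket/bank/customers", "overwrite")
  )

  private def lookup(system: String, table: String): TableConf = conf((system, table))

  def load(system: String, table: String)(implicit spark: SparkSession): DataFrame = {
    val c = lookup(system, table)
    spark.read.format(c.format).load(c.path)
  }

  def save(df: DataFrame, system: String, table: String): Unit = {
    val c = lookup(system, table)
    df.write.format(c.format).mode(c.mode).save(c.path)
  }
}

Because format, compression, path and mode all live in config, operators can change them without touching (or redeploying) the job's business logic.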
Pipelines!
4
Pipelines

• Pipelines mean different things to different people
• Ultimately, it's about the flow of data:
  Ingest → Prepare → Analytics → Publish
• We need to:
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example

[Diagram: Events flow from Sources → Raw → Prepared → Published]
Pipelines – Example – Agreed Schedule

[Diagram: Sources → Raw (0500) → Prepared (0700) → Published (0900)]

• Agree a schedule with the client for when to send data
• Configure a schedule to kick off Spark Jobs
• Subsequent Spark Jobs kick off when the previous ones have completed
Pipelines – Example – Challenge 1 – Untimely Data

• Data failed to be sent on time
• Impact to the rest of the Pipeline
• Impact to the end user!
Pipelines – Example – Challenge 2 – Bad Data

• Bad/unexpected data:
  • Schema change
  • Change in business rules
  • Data quality issue
• Impact to the Pipeline
• Impact to the end user!
Pipelines – Example – Challenge 3 – Jobs Taking Longer

• Spark Jobs taking longer and longer
• Impact to the Pipeline
• Impact to the end user!
Pipelines – Example – Challenges

[Diagram: Sources → Raw (0500) → Prepared (0700) → Published (0900)]

• We need to:
  • Monitor
  • Understand
  • React
• It looks simple, but in fact it's not
Pipelines – Example – Our Approach

[Diagram: Monitoring + Events + Metadata over Sources → Raw → Prepared → Published]

• Monitor all Raw sources for timely data
• Monitor quality of data
  • Config driven
  • Agreed business rules
• Monitor Spark metrics
  • At least how long jobs take
  • Additional metrics, including:
    • Intent to load data :)
    • Saving data :)
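As a small illustrative sketch of the timeliness check (bucket, prefix and threshold are hypothetical placeholders), it can be as simple as comparing the newest Raw object's modification time against the agreed schedule:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: alert if no new data has landed in the Raw area
// since the agreed drop time. Paths and the 4-hour threshold are placeholders.
val fs = FileSystem.get(new URI("s3a://my-bucket/"), spark.sparkContext.hadoopConfiguration)
val newest = fs.listStatus(new Path("/raw/source-a/")).map(_.getModificationTime).max
val ageMillis = System.currentTimeMillis() - newest
if (ageMillis > 4 * 60 * 60 * 1000L) {
  println(s"ALERT: raw/source-a has received no data for ${ageMillis / 60000} minutes")
}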
Recap

1 – Observations of using Cloud Object Storage
2 – Design To Operate
3 – Pipelines!

Let's Talk!

Contenu connexe

Tendances

Common Application Architecture Patterns – Dan Zoltak
Common Application Architecture Patterns – Dan ZoltakCommon Application Architecture Patterns – Dan Zoltak
Common Application Architecture Patterns – Dan ZoltakAmazon Web Services
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureAntoine Poliakov
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Working with Terraform on Azure
Working with Terraform on AzureWorking with Terraform on Azure
Working with Terraform on Azuretombuildsstuff
 
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...Amazon Web Services
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017Amazon Web Services
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Databricks
 
Gartner evaluation criteria_for_clou_security_networking
Gartner evaluation criteria_for_clou_security_networkingGartner evaluation criteria_for_clou_security_networking
Gartner evaluation criteria_for_clou_security_networkingYerlin Sturdivant
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...Altinity Ltd
 
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...Amazon Web Services
 
Consul 1.6: Layer 7 Traffic Management and Mesh Gateways
Consul 1.6: Layer 7 Traffic Management and Mesh GatewaysConsul 1.6: Layer 7 Traffic Management and Mesh Gateways
Consul 1.6: Layer 7 Traffic Management and Mesh GatewaysMitchell Pronschinske
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectK.Mohamed Faizal
 
Openstack Summit Vancouver 2018 - Multicloud Networking
Openstack Summit Vancouver 2018 - Multicloud NetworkingOpenstack Summit Vancouver 2018 - Multicloud Networking
Openstack Summit Vancouver 2018 - Multicloud NetworkingShannon McFarland
 

Tendances (20)

Common Application Architecture Patterns – Dan Zoltak
Common Application Architecture Patterns – Dan ZoltakCommon Application Architecture Patterns – Dan Zoltak
Common Application Architecture Patterns – Dan Zoltak
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Working with Terraform on Azure
Working with Terraform on AzureWorking with Terraform on Azure
Working with Terraform on Azure
 
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...
Automating Backup & Archiving with AWS and CommVault – Chris Gondek, Principa...
 
Hadoop on-mesos
Hadoop on-mesosHadoop on-mesos
Hadoop on-mesos
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017
Container Networking Deep Dive with Amazon ECS - CON401 - re:Invent 2017
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
 
Gartner evaluation criteria_for_clou_security_networking
Gartner evaluation criteria_for_clou_security_networkingGartner evaluation criteria_for_clou_security_networking
Gartner evaluation criteria_for_clou_security_networking
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...
How we broke Apache Ignite by adding persistence, by Stephen Darlington (Grid...
 
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...
ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) ...
 
Apache ignite v1.3
Apache ignite v1.3Apache ignite v1.3
Apache ignite v1.3
 
Consul 1.6: Layer 7 Traffic Management and Mesh Gateways
Consul 1.6: Layer 7 Traffic Management and Mesh GatewaysConsul 1.6: Layer 7 Traffic Management and Mesh Gateways
Consul 1.6: Layer 7 Traffic Management and Mesh Gateways
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
 
Openstack Summit Vancouver 2018 - Multicloud Networking
Openstack Summit Vancouver 2018 - Multicloud NetworkingOpenstack Summit Vancouver 2018 - Multicloud Networking
Openstack Summit Vancouver 2018 - Multicloud Networking
 

Similaire à Running Spark In Production in the Cloud is Not Easy with Nayur Khan

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorDatabricks
 
Infra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKInfra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKRob Mueller
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksTheofilos Kakantousis
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Evention
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksDustin Vannoy
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullJim Dowling
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 

Similaire à Running Spark In Production in the Cloud is Not Easy with Nayur Khan (20)

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Infra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKInfra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASK
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in HopsworksSecure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
Secure Streaming-as-a-Service with Kafka/Spark/Flink in Hopsworks
 
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
Hopsworks Secure Streaming as-a-service with Kafka Flinkspark - Theofilos Kak...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
 
Spark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim DowlingSpark Summit EU talk by Jim Dowling
Spark Summit EU talk by Jim Dowling
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Faster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-JobserverFaster Data Integration Pipeline Execution using Spark-Jobserver
Faster Data Integration Pipeline Execution using Spark-Jobserver
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjurptikerjasaptiker
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schscnajjemba
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
Running Spark In Production in the Cloud is Not Easy with Nayur Khan

  • 14. Background – Object Storage • Object storage maps keys to physical data, e.g. keys /foo/a.csv, /bar/b.csv, /bar/c.avro • No concept of hierarchical organisation: keys → data • Unlike file storage: slow to "list" items in a "directory", slow to "rename" files, all access is via REST calls
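A note on why "rename" is slow: on a real file system a rename is a cheap metadata operation, but the s3a connector implements it as a copy of every object under the path followed by a delete. A minimal sketch, assuming s3a and credentials are configured elsewhere; the bucket and paths are hypothetical:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object RenameOnObjectStorage {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val fs = FileSystem.get(new URI("s3a://example-bucket"), conf) // hypothetical bucket

        // Cheap on HDFS or local disk; on S3 this is copy-then-delete per object,
        // so its cost grows with the number and size of objects under the path.
        val ok = fs.rename(new Path("s3a://example-bucket/temp/out"),
                           new Path("s3a://example-bucket/final/out"))
        println(s"rename returned $ok")
      }
    }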
  • 15. First Observation – The "Intent" to Read Data // Define a dataframe val df = spark.read.format( "csv" ) .schema( csvSchema ).load( "s3a://..." ) // Do some work... # Custom log4j properties ... log4j.logger.com.amazonaws.request=DEBUG ...
  • 16. First Observation – The "Intent" to Read Data // Define a dataframe val df = spark.read.format( "csv" ) .schema( csvSchema ).load( "s3a://..." ) // Do some work... # Custom log4j properties ... log4j.logger.com.amazonaws.request=DEBUG ... Call flow: Spark Code → Datasource API → Hadoop Libs → AWS Libs → S3, issuing HEAD / GET requests over HTTPS (REST); how many depends on the number of files/partitions
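To reproduce this request counting yourself, the com.amazonaws.request logger shown on the slide can be switched on with a small log4j properties file. A minimal sketch, assuming the AWS SDK v1 that ships with hadoop-aws 2.7, which logs each request under com.amazonaws.request:

    # log4j.properties: count the S3 requests made by a Spark job
    log4j.rootLogger=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss} %p %c: %m%n
    # Every HEAD/GET/PUT/DELETE issued against S3 is logged at DEBUG here
    log4j.logger.com.amazonaws.request=DEBUG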
  • 17. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema // Define a dataframe val df = spark.read.format( "csv" ) .schema( csvSchema ).load( "s3a://..." ) // Do some work. Spark versions tested: Spark 2.1 (spark-2.1.0-hadoop2.7), Spark 2.2 (spark-2.2.1-hadoop2.7), Spark 2.3 (spark-2.3.1-hadoop2.7)
  • 18. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema (same code as slide 17). Chart, path to 1 x CSV file: Spark 2.1: 7 calls total (6 HEAD, 1 GET); Spark 2.2: 6 (5 HEAD, 1 GET); Spark 2.3: 4 (4 HEAD, 0 GET)
  • 19. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema (same code as slide 17). Chart, path to 200 x CSV files: Spark 2.1: 15 calls total (9 HEAD, 6 GET); Spark 2.2: 11 (7 HEAD, 4 GET); Spark 2.3: 14 (9 HEAD, 5 GET); 1-file chart as slide 18
  • 20. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema vs. Infer Schema // Define a dataframe val df = spark.read.format( "csv" ) .option( "inferSchema", true ) .load( "s3a://..." ) // Do some work (with-schema code and charts as slides 17–19)
  • 21. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema vs. Infer Schema. Infer Schema chart, path to 1 x CSV file: Spark 2.1: 13 calls total (10 HEAD, 3 GET); Spark 2.2: 14 (10 HEAD, 4 GET); Spark 2.3: 14 (10 HEAD, 4 GET); with-schema charts as slides 18–19
  • 22. First Observation – The "Intent" to Read Data – How many HTTPS calls? With Schema vs. Infer Schema. Infer Schema chart, path to 200 x CSV files: Spark 2.1: 817 calls total (610 HEAD, 207 GET); Spark 2.2: 613 (408 HEAD, 205 GET); Spark 2.3: 1082 (643 HEAD, 439 GET); earlier charts as slides 18–21
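The gap above comes from inferSchema forcing Spark to read the data before the job proper starts, where an explicit schema only needs a file listing. A minimal sketch of the two read paths side by side; the schema, bucket and path are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object SchemaVsInfer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("schema-vs-infer").getOrCreate()

        // Hypothetical schema for the CSV files
        val csvSchema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = true)
        ))

        // Explicit schema: Spark only lists the files up front
        val withSchema = spark.read.format("csv").schema(csvSchema)
          .load("s3a://example-bucket/csv-data/")

        // inferSchema: Spark scans the data first, which on S3 means many more
        // HEAD/GET requests, growing sharply with the number of files
        val inferred = spark.read.format("csv").option("inferSchema", true)
          .load("s3a://example-bucket/csv-data/")

        println(withSchema.schema.treeString)
        println(inferred.schema.treeString)
        spark.stop()
      }
    }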
  • 23. Second Observation – Writing Data – why does it take so long? // Define a dataframe df.write.format( "parquet" ).save( "s3a://..." ) // Do some work... # Custom log4j properties ... log4j.logger.com.amazonaws.request=DEBUG ...
  • 24. Second Observation – Writing Data – why does it take so long? // Define a dataframe df.write.format( "parquet" ).save( "s3a://..." ) // Do some work... # Custom log4j properties ... log4j.logger.com.amazonaws.request=DEBUG ... Call flow: Spark Code → Datasource API → Hadoop Libs → AWS Libs → S3, issuing PUT / HEAD / GET / DELETE requests over HTTPS (REST); how many depends on the number of files/partitions
  • 25. Second Observation – Writing Data – How many HTTPS calls? Default // Define a dataframe df.write.format( "parquet" ).save( "s3a://..." ) // Do some work... Spark versions tested: Spark 2.1 (spark-2.1.0-hadoop2.7), Spark 2.2 (spark-2.2.1-hadoop2.7), Spark 2.3 (spark-2.3.1-hadoop2.7)
  • 26. Second Observation – Writing Data – How many HTTPS calls? Default (same code as slide 25). Chart, 1 partition: Spark 2.1: 165 calls total (103 HEAD, 53 GET, 7 PUT, 2 DELETE); Spark 2.2: 151 (95 HEAD, 47 GET, 7 PUT, 2 DELETE); Spark 2.3: 152 (96 HEAD, 47 GET, 7 PUT, 2 DELETE)
  • 27. Second Observation – Writing Data – How many HTTPS calls? Default (same code as slide 25). Chart, 10 partitions: Spark 2.1: 1092 calls total (697 HEAD, 323 GET, 52 PUT, 20 DELETE); Spark 2.2: 1078 (689 HEAD, 317 GET, 52 PUT, 20 DELETE); Spark 2.3: 381 (244 HEAD, 114 GET, 17 PUT, 6 DELETE); 1-partition chart as slide 26
  • 28. Second Observation – Writing Data – How many HTTPS calls? Go Faster: spark.conf.set( "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2 ) df.write.format( "parquet" ).save( "s3a://..." ) (default charts as slides 26–27)
  • 29. Second Observation – Writing Data – How many HTTPS calls? Go Faster chart, 1 partition: Spark 2.1: 126 calls total (79 HEAD, 40 GET, 5 PUT, 2 DELETE); Spark 2.2: 112 (71 HEAD, 34 GET, 5 PUT, 2 DELETE); Spark 2.3: 113 (72 HEAD, 34 GET, 5 PUT, 2 DELETE); default charts as slides 26–27
  • 30. Second Observation – Writing Data – How many HTTPS calls? Go Faster chart, 10 partitions: Spark 2.1: 738 calls total (475 HEAD, 220 GET, 32 PUT, 11 DELETE); Spark 2.2: 724 (467 HEAD, 214 GET, 32 PUT, 11 DELETE); Spark 2.3: 272 (176 HEAD, 81 GET, 11 PUT, 4 DELETE); earlier charts as slides 26–29 (a combined write sketch follows)
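Putting the committer change together with a write, a minimal sketch; the data and output path are illustrative, and where the slide sets the property via spark.conf.set at runtime, here it is set when the session is built, which also works for spark.hadoop.* options:

    import org.apache.spark.sql.SparkSession

    object WriteWithV2Committer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-v2-committer")
          // v2 commits task output straight to the destination instead of
          // renaming it on job commit, avoiding many copy-heavy S3 "renames"
          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
          .getOrCreate()

        val df = spark.range(1000000L).toDF("id") // toy data, for illustration

        df.write.format("parquet").mode("overwrite")
          .save("s3a://example-bucket/output/ids") // hypothetical path
        spark.stop()
      }
    }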
  • 31. Conclusion • Cloud object storage has trade-offs: it gives incredible storage capacity and scalability, but do not expect the same performance or characteristics as a file system • There are lots of configuration settings that can affect your performance when reading from/writing to cloud object storage (note: some are Hadoop-version specific!), e.g.: spark.hadoop.fs.s3a.fast.upload=true spark.hadoop.fs.s3a.connection.maximum=xxx spark.hadoop.fs.s3a.multipart.size=xxx spark.hadoop.fs.s3a.attempts.maximum=xxx spark.hadoop.fs.s3a.multipart.threshold=xxx spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 etc ... (a submit-time sketch follows)
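One way to apply such settings is at submit time, so they stay out of the job code. A minimal sketch; the values are illustrative rather than recommendations, and the class and jar names are hypothetical:

    spark-submit \
      --conf spark.hadoop.fs.s3a.fast.upload=true \
      --conf spark.hadoop.fs.s3a.connection.maximum=100 \
      --conf spark.hadoop.fs.s3a.multipart.size=104857600 \
      --conf spark.hadoop.fs.s3a.attempts.maximum=10 \
      --conf spark.hadoop.fs.s3a.multipart.threshold=2147483647 \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --class com.example.MyJob my-job.jar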
  • 33. Design to Operate – Choices • Our teams have faced choices for loading / saving data • What URI scheme to use? ─ e.g. s3://… s3n://… s3a://…, jdbc:// etc • What file format? ─ e.g. Parquet, Avro, CSV, etc • What compression to use? ─ e.g. Snappy, LZO, Gzip, etc • What additional options? ─ e.g. inferSchema, mergeSchema, pushDownPredicate, etc • Even what path to load/save data
  • 34. Design to Operate – Choices • Our teams have faced choices for loading / saving data: URI scheme, file format, compression, additional options, even what path to load/save data (as slide 33) • Choices end up being "embedded" into code: val path = "s3n://my-bucket/csv-data/people" val peopleDF = spark.read.format( "csv" ) .option( "sep", ";" ) .option( "inferSchema", "true" ) .option( "header", "true" ) .load( path ) ...
  • 35. Design to Operate – Challenges • Different teams working in the same cluster • Working on different use cases • Likely that Spark Jobs are running with different settings in the same cluster
  • 36. Design to Operate – Challenges • Different teams working in the same cluster, on different use cases; likely that Spark Jobs are running with different settings in the same cluster • Those that build may not be the ones that end up running/operating the Spark Jobs • Support Ops • Or client IT teams
  • 37. Design to Operate – Challenges • Different teams working in the same cluster, on different use cases; Spark Jobs likely running with different settings • Those that build may not be the ones that end up running/operating the Spark Jobs (Support Ops, or client IT teams) • Difficult for those operating to understand: what are the configuration settings? what does the job depend on, i.e. the data it needs? what depends on it, i.e. the data it writes? how can I change things quickly?
  • 38. Design to Operate – Our Approach • Borrow ideas from the DevOps world • "Strict separation of config from code"
  • 39. Design to Operate – Our Approach • Borrow ideas from the DevOps world • "Strict separation of config from code" • Custom library and config for loading and saving: Spark code USES the Nerve Load/Save Library, which USES config such as: "system": "bank", "table": "customers", "type": "input", "description": "A table with banking customer details.", "format": { "type": "csv", "compression": "snappy", "partitions": "20", "mode": "overwrite" }, "location": { "type": "s3", "properties": { ... } }, "schema": [ { "name": "id", "type": "int", "doc": "User Id" }, ...
  • 40. Design to Operate – Our Approach • Borrow ideas from the DevOps world • "Strict separation of config from code" • Custom library and config for loading and saving • All the loading / saving choices now live in config rather than code: URI scheme (s3://… s3n://… s3a://…, jdbc:// etc), format (Parquet, Avro, CSV, etc), compression (Snappy, LZO, Gzip, etc), additional options (inferSchema, mergeSchema, overwrite or append, etc), even what path to load/save data (config example as slide 39)
  • 41. Design to Operate – Our Approach // Load a dataframe val df = nerve.load( "system", "table" ) // Save a dataframe nerve.save( df, "system", "table" ) • Engineers use this library in their code • Allows engineers to focus on business logic and not worry about concerns such as: data formats, compressions, schemas, partitions, weird optimisation settings • Allows teams that focus on operating/running to focus on: consistency, runtime optimisations (a sketch of the idea follows)
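The Nerve library itself is internal, but a minimal sketch of the shape such a config-driven wrapper might take is below; the class name, the config representation and the supported keys are all assumptions, not the real API:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Catalog maps (system, table) -> an entry with "format", "path", "mode",
    // plus any extra reader/writer options; in Nerve this lives in JSON config.
    class CatalogIO(spark: SparkSession,
                    catalog: Map[(String, String), Map[String, String]]) {

      def load(system: String, table: String): DataFrame = {
        val entry = catalog((system, table))
        spark.read
          .format(entry("format"))                      // e.g. "csv", "parquet"
          .options(entry - "format" - "path" - "mode")  // any extra reader options
          .load(entry("path"))                          // URI scheme lives in config, not code
      }

      def save(df: DataFrame, system: String, table: String): Unit = {
        val entry = catalog((system, table))
        df.write
          .format(entry("format"))
          .mode(entry.getOrElse("mode", "overwrite"))
          .save(entry("path"))
      }
    }

With a catalog entry for ("bank", "customers") pointing at, say, a CSV path under s3a://example-bucket, a job reduces to val df = io.load("bank", "customers"), mirroring the nerve.load call above.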
  • 43. Pipelines • Pipelines mean different things to different people • Ultimately, it's about the flow of data: Ingest → Prepare → Analytics → Publish • We need to: Monitor, Understand, React • It looks simple, but in fact it's not
  • 44–50. Pipelines – Example: Events flow from Sources → Raw → Prepared → Published
  • 51. Pipelines – Example – Agreed Schedule • Agree a schedule with the client for when to send data (0500)
  • 52. Pipelines – Example – Agreed Schedule • Agree a schedule with the client for when to send data (0500) • Configure a schedule to kick off Spark Jobs (0700)
  • 53. Pipelines – Example – Agreed Schedule • Agree a schedule with the client for when to send data (0500) • Configure a schedule to kick off Spark Jobs (0700) • Subsequent Spark Jobs kick off when the previous ones have completed
  • 54. Pipelines – Example – Agreed Schedule • Agree a schedule with the client for when to send data (0500) • Configure a schedule to kick off Spark Jobs (0700) • Subsequent Spark Jobs kick off when the previous ones have completed (0900)
  • 55. Pipelines – Example – Challenge 1 – Untimely Data • Data failed to be sent on time • Impact to the rest of the Pipeline • Impact to the end user!
  • 56. Pipelines – Example – Challenge 2 – Bad Data • Bad/unexpected data: a schema change, a change in business rules, a data quality issue • Impact to the Pipeline • Impact to the end user!
  • 57. Pipelines – Example – Challenge 3 – Jobs Taking Longer • Spark Jobs taking longer and longer • Impact to the Pipeline • Impact to the end user!
  • 58. Pipelines – Example – Challenges • We need to: Monitor, Understand, React • It looks simple, but in fact it's not
  • 59. Pipelines – Example – Our Approach • Monitor all Raw sources for timely data
  • 60. Pipelines – Example – Our Approach • Monitor all Raw sources for timely data • Monitor the quality of data: config driven, against agreed business rules (a rule sketch follows)
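As a flavour of what a config-driven quality rule might look like, a minimal sketch; the rule shape and thresholds are illustrative assumptions, not the production implementation:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // One agreed business rule: a column may contain at most a given
    // fraction of nulls; rules like this can be loaded from config.
    case class NotNullRule(column: String, maxNullFraction: Double)

    def checkRules(df: DataFrame, rules: Seq[NotNullRule]): Seq[String] = {
      val total = df.count().toDouble
      rules.flatMap { rule =>
        val nulls = df.filter(col(rule.column).isNull).count()
        if (total > 0 && nulls / total > rule.maxNullFraction)
          Some(s"${rule.column}: $nulls nulls breaches allowed fraction ${rule.maxNullFraction}")
        else None
      }
    }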
  • 61. Pipelines – Example – Our Approach • Monitor all Raw sources for timely data • Monitor the quality of data (config driven, agreed business rules) • Monitor Spark metrics: at least how long jobs take, plus additional metrics including the intent to load data and the saving of data (a listener sketch follows)
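For the "at least how long" part, Spark's listener API can capture job durations without touching job code. A minimal sketch; shipping the numbers to a metrics backend is left out, and the println sink is a stand-in:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
    import scala.collection.concurrent.TrieMap

    class JobDurationListener extends SparkListener {
      private val starts = TrieMap.empty[Int, Long]

      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        starts.put(jobStart.jobId, jobStart.time)

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        starts.remove(jobEnd.jobId).foreach { startedAt =>
          // Replace println with a push to your metrics system
          println(s"job ${jobEnd.jobId} took ${jobEnd.time - startedAt} ms")
        }
    }

    // Registered once per application:
    //   spark.sparkContext.addSparkListener(new JobDurationListener)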
  • 62. 1 – Observations of using Cloud Object Storage 2 – Design To Operate 3 – Pipelines! Recap
  • 63. 1 – Observations of using Cloud Object Storage 2 – Design To Operate 3 – Pipelines! Let's Talk! Recap