In-Memory Logical Data Warehouse for accelerating
Machine Learning Pipelines on top of Spark and Alluxio
@ODSC - Open Data Science Conference
Gianmario Spacagna
Senior Data Scientist, Pirelli Tyre
London | October 8-9th 2016
Takeaways
- Main concepts of a “logical data warehouse”
- A way to handle governance issues
- An agile workflow made of both iterative exploratory analysis and production-quality development
- A fully in-memory stack for fast computation and storage on top of Spark and Alluxio
- How to successfully do data science if data resides in an RDBMS and you don’t have a data lake… yet!
About me
- Engineering background in Distributed Systems (Polytechnic of Turin, KTH of Stockholm)
- Data-relevant experience:
  - Predictive Marketing (AgilOne, StreamSend)
  - Cyber Security (Cisco)
  - Financial Services (Barclays) <- where I started this use case
  - Automotive (Pirelli Tyre)
@gm_spacagna
- Team of:
  - Data Scientists
  - Data Engineers
  - Solution Architects
  - Full-stack Developers
  - Industrial Experts
  - Business Analysts
Smart Manufacturing - Industry 4.0
- Anomaly detection
- Predictive maintenance
- Real-time analytics of machine status and industrial KPIs on screens in the plant
Pirelli Cloud
Cyber Tyre predictive services
Demand Insight Platform
- Long-term and short-term forecasting of sales
- Prestige cars segment
My areas of interest
- Functional Programming and Apache Spark
- Contributor to the Professional Data Science Manifesto
- Founder of the Data Science Milan meetup community
- Co-authoring the Python Deep Learning book, coming soon…
Building production-ready and scalable machine
learning systems
Data Science Agile cycle
1. Get access to data
2. Explore
3. Transform
4. Train
5. Evaluate
6. Analyze results
Even dozens of iterations per day!!!
Successful development of new data products requires proper infrastructure and tools
Start by building a toy model with a small snapshot of data that can fit in your laptop memory, and eventually ask your organization for cluster resources
- You can’t solve problems with data science if data is not largely available
- Data processing should be fast and reactive to allow quick iterations
- The core DS team cannot depend on IT folks
Data Lake in a legacy enterprise environment
Great, but a few technical issues
- Engineering effort
  - dedicated infrastructure team => expensive
- Synchronization with new data from the source
  - must track what portion of the data has been exported and what has not
- Consistency / data versioning / duplication
  - ETL logic and requirements change very often
  - memory is cheap, but hundreds of sparse copies of the same data are confusing and error-prone
- I/O cost
  - reading/writing is expensive for iterative and explorative jobs (machine learning)
“Logical Data Warehouse”
- View and access cleaned versions of the data
- Always show the latest version by default
- Apply transformations on-the-fly (discovery-oriented analytics)
- Abstract the data representation from the rigid structures of the DB’s persistence store
- Simply add new data sources using virtualization
- Flexible, faster time-to-market, lower costs
What about governance issues?
- Large corporations can’t move data before a governance plan is approved
- Data can only be stored in a safe environment administered by a few authorized people, who are not necessarily aligned with data scientists’ needs
- Data-leakage paranoia
- As a result, data cannot be easily or quickly pulled from the central data warehouse and stored in an external infrastructure
Long time and large investment for setting up a new project.
That’s not Agile!
Wait a moment, analysts don’t seem to have this problem…
From disk to volatile memory
Distribute and make data temporarily available in-memory in an ad-hoc development cluster
- In-memory engine for distributed data processing
- JDBC drivers to connect to relational databases
- Structured data represented using the DataFrame API
- Fully-functional data manipulation via the RDD API
- Machine learning libraries (ML/MLlib)
- Interaction and visualization through Spark Notebook or Zeppelin
In-memory Agile workflow
JDBC Drivers
- The JDBC drivers must be visible to the primordial class loader on the client session and on all executors
  - Copy the jar files to each node, e.g. using rsync
- Add the following to the spark-submit command:
  - --driver-class-path "~/drivers/driver1.jar:~/drivers/driver2.jar"
  - --conf "spark.executor.extraClassPath=/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar"
  - Alternatively, modify the compute_classpath.sh script
- If you are using the Spark Notebook, also do:
  - export EXTRA_CLASSPATH="~/drivers/driver1.jar:~/drivers/driver2.jar"
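For reference, the flags above combine into a single spark-submit invocation along these lines (the main class and jar names are illustrative placeholders, not from the original deck):

```shell
# Make the JDBC driver jars visible to both the driver and the executors.
# The executor paths must exist on every node (e.g. synced beforehand with rsync).
spark-submit \
  --driver-class-path "/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar" \
  --conf "spark.executor.extraClassPath=/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar" \
  --class com.example.MyPipeline \
  my-pipeline-assembly.jar
```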
DataFrame single partition
- Available since Spark 1.4.0
- sqlContext.read.jdbc(
  - url: String = "jdbc:teradata://db.mycompany.com/"
  - table: String =
    - "AWESOME_PEOPLE" => reads the whole table (SELECT *)
    - "(SELECT name FROM AWESOME_PEOPLE WHERE awesomeness > 90)" => reads the result of the subquery
  - connectionProperties =
    - Map("user" -> "pluto", "password" -> "E_2Utpff"))
DataFrame uniform partitioning
- sqlContext.read.jdbc(url, table,
  - columnName: String = "PERSON_ID",
  - lowerBound: Long = 0,
  - upperBound: Long = 1200045,
  - numPartitions: Int = 20,
  - connectionProperties)
- Requires some prior knowledge (min/max value) and a uniformly distributed numerical column
- Will create 20 partitions, dividing the specified column into adjacent ranges of equal size
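To see what the uniform strategy does, here is an illustrative pure-Scala re-implementation of the range splitting (uniformRanges is a hypothetical helper for explanation only, not Spark's actual code):

```scala
// Split [lower, upper] into numPartitions adjacent ranges of (roughly) equal size,
// mimicking how Spark turns lowerBound/upperBound/numPartitions into per-partition
// WHERE clauses on the partitioning column.
def uniformRanges(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    val end   = if (i == numPartitions - 1) upper else start + stride
    (start, end)
  }
}

// 20 contiguous ranges covering the whole [0, 1200045] interval
val rs = uniformRanges(0L, 1200045L, 20)
```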
DataFrame custom partitioning
- sqlContext.read.jdbc(url, table,
  - predicates: Array[String],
  - connectionProperties)
- Will create as many partitions as the number of SQL predicates, e.g.:

Array("2015-06-20" -> "2015-06-30",
      "2015-07-01" -> "2015-07-10").map { case (start, end) =>
  s"cast(DAT_TME as date) >= date '$start' AND " +
  s"cast(DAT_TME as date) <= date '$end'"
}

- Each element is appended to the query as a WHERE clause, e.g.:
  WHERE cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
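Since the predicates are plain strings, their construction can be checked outside Spark; a minimal self-contained version of the snippet above:

```scala
// One WHERE predicate per date range; Spark creates one partition per predicate.
val predicates: Array[String] =
  Array("2015-06-20" -> "2015-06-30",
        "2015-07-01" -> "2015-07-10").map { case (start, end) =>
    s"cast(DAT_TME as date) >= date '$start' AND " +
    s"cast(DAT_TME as date) <= date '$end'"
  }

predicates.foreach(println)
```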
Union of tables
- Concatenate many tables with the same schema into a single DataFrame
- .par will spin up as many concurrent threads as the number of idle CPUs
  - be careful not to overload the DB
  - don’t use .par if you want to query the DB sequentially

def readTable(table: String): DataFrame =
  sqlContext.read.jdbc(url, table, connectionProperties)

List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
  .map(readTable)
  .reduce(_ unionAll _)
DataFrame to typed RDD conversion
(using Spark 1.x, no Dataset*)

case class MyClass(a: Long, b: String, c: Int,
                   d: String, e: String)

dataframe.map {
  case Row(a: java.math.BigDecimal, b: String, c: Int,
           _: String, d: java.sql.Date, e: java.sql.Date,
           _: java.sql.Timestamp, _: java.sql.Timestamp,
           _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = c,
            d = d.toString, e = e.toString)
}

- Must map every single column, even the ones not needed
- Will fail if any of them is null, due to the casting in the unapply method of class Row
- Remove rows containing at least one null value first:

dataframe.na.drop()

(*) See the latest documentation at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
Null handling using Scala Options

case class MyClass(a: Long, b: String, c: Option[Int],
                   d: String, e: String)

- All-columns handling (convert the Row to a List[Any]):

dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int,
            _: String, d: java.sql.Date, e: java.sql.Date,
            _: java.sql.Timestamp, _: java.sql.Timestamp,
            _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = Option(c),
            d = d.toString, e = e.toString)
})

- Sparse-columns handling:

row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)
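The null-safety hinges on scala.Option's factory, which maps null to None. A standalone sketch (toOpt is an illustrative helper, not from the deck):

```scala
// Option(x) yields None when x is null, Some(x) otherwise, turning nullable
// JDBC values into Options that can be pattern matched safely downstream.
def toOpt(x: java.lang.Integer): Option[Int] = Option(x).map(_.intValue)

val missing = toOpt(null) // None
val present = toOpt(42)   // Some(42)
```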
Just Spark cache is not enough
- Data is dropped from memory at each context restart, caused by:
  - updating a dependency jar (common for mixed IDE development / notebook analysis)
  - re-submitting the job execution
  - Kerberos ticket expiry!?!
- Fetching 600M rows can take ~1 hour on a 5-node cluster
Dozens of iterations per day => most of the time is spent waiting for data to reload at each iteration!
Distribute and make data persistently available in-memory in the development cluster, shared among multiple concurrent applications
From volatile memory to persistent memory storage
- Formerly known as Tachyon
- In-memory distributed storage system
- Long-term caching of raw data and intermediate results
- Spark can read/write to Alluxio seamlessly instead of using HDFS
- A 1-tier configuration safely leaves no traces on disk
- Data is loaded once and stays available to multiple applications for the whole development period
Alluxio as the Key Enabling Technology
1-tier configuration
- ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
- alluxio.worker.memory.size=24GB
- alluxio.worker.tieredstore
  - levels=1
  - level0.alias=MEM
  - level0.dirs.path=${ALLUXIO_RAM_FOLDER}
  - level0.dirs.quota=24G
- We leave the under-FS configuration empty
- Deploy without mount (no root access required):
  - ./bin/alluxio-start.sh all NoMount
Spark read/write APIs
- DataFrame
  - dataframe.write.save("alluxio://master_ip:port/mydata/mydataframe.parquet")
  - val dataframe: DataFrame = sqlContext.read.load("alluxio://master_ip:port/mydata/mydataframe.parquet")
- RDD
  - rdd.saveAsObjectFile("alluxio://master_ip:port/mydata/myrdd.object")
  - val rdd: RDD[MyClass] = sc.objectFile[MyClass]("alluxio://master_ip:port/mydata/myrdd.object")
“Data Lake” as a Service
- A dev cluster spun up on-demand for each project
- Only the data required for development is loaded
- Data lives in cluster memory only for as long as it is needed
- At the end of each iteration we are ready to go to production
Making the impossible possible
- An agile workflow combining Spark, Scala, DataFrames, JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from the source and develop production-quality machine learning pipelines
- Data available from day 1 and at every iteration
  - Alluxio decreased loading time from hours to seconds
- Avoids complicated and time-consuming data-plumbing operations
Further developments
1. Memory size limitation
   - Add external in-memory tiers?
2. Set-up overhead
   - JDBC drivers, partitioning strategy, and DataFrame from/to case class conversion (Spark 2 and frameless aim to solve this)
3. Shared memory resources between Spark and Alluxio
   - Set Alluxio as OFF_HEAP memory as well and divide memory into storage and cache
4. In-memory replication for read availability
   - If an Alluxio node fails, data is lost due to the absence of an underlying file system
5. It would be nice if Alluxio could handle this and mount a relational table/view in the form of data files (csv, parquet…)
Follow-up links
- Original article on DZone:
  - dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
- Professional Data Science Manifesto:
  - datasciencemanifesto.org
- Vademecum of Practical Data Science:
  - Handbook and recipes for data-driven solutions: datasciencevademecum.wordpress.com

Advanced Computer Architecture – An IntroductionDilum Bandara
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Dernier (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines on top of Spark and Alluxio - ODSC London 9th October 2016

  • 11. Start by building a toy model with a small snapshot of data that can fit in your laptop memory and eventually ask your organization for cluster resources
  • 12.
¨ You can’t solve problems with data science if data is not readily available
¨ Data processing should be fast and reactive to allow quick iterations
¨ The core DS team cannot depend on IT folks
Start by building a toy model with a small snapshot of data that can fit in your laptop memory and eventually ask your organization for cluster resources
  • 13. Data Lake in a legacy enterprise environment
  • 14. Great, but a few technical issues
¨ Engineering effort
  ¤ dedicated infrastructure team => expensive
¨ Synchronization with new data from the source
  ¤ Report what portion of data has been exported and what not
¨ Consistency / Data Versioning / Duplication
  ¤ ETL logic and requirements change very often
  ¤ Memory is cheap, but hundreds of sparse copies of the same data are confusing and error-prone
¨ I/O cost
  ¤ Reading/writing is expensive for iterative and explorative jobs (machine learning)
  • 15. “Logical Data Warehouse”
¨ View and access cleaned versions of data
¨ Always show the latest version by default
¨ Apply transformations on-the-fly (discovery-oriented analytics)
¨ Abstract the data representation from the rigid structures of the DB’s persistence store
¨ Simply add new data sources using virtualization
¨ Flexible, faster time-to-market, lower costs
  • 16. What about governance issues?
¨ Large corporations can’t move data before an approved governance plan
¨ Data can only be stored in a safe environment administered by a few authorized people, who are not necessarily aligned with data scientists’ needs
¨ Data leakage paranoia
¨ As a result, data cannot be easily/quickly pulled from the central data warehouse and stored into an external infrastructure
  • 17. Setting up a new project takes a long time and a large investment. That’s not Agile!
  • 18. Wait a moment, analysts don’t seem to have this problem…
  • 19. From disk to volatile memory
Distribute and make data temporarily available in-memory in an ad-hoc development cluster
  • 20.
¨ In-memory engine for distributed data processing
¨ JDBC drivers to connect to relational databases
¨ Structured data represented using the DataFrame API
¨ Fully-functional data manipulation via the RDD API
¨ Machine learning libraries (ML/MLlib)
¨ Interaction and visualization through Spark Notebook or Zeppelin
  • 22. JDBC Drivers
¨ The JDBC drivers must be visible to the primordial class loader on the client session and on all executors
  ¤ Copy the jar files to each node using rsync
¨ To the spark-submit command add the following:
  ¤ --driver-class-path "~/drivers/driver1.jar:~/drivers/driver2.jar"
  ¤ --conf "spark.executor.extraClassPath=/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar"
  ¤ Alternatively, modify the compute_classpath.sh script
¨ If you are using the SparkNotebook also do:
  ¤ export EXTRA_CLASSPATH="~/drivers/driver1.jar:~/drivers/driver2.jar"
  • 23. DataFrame single partition
¨ Since Spark 1.4.0
¨ sqlContext.read.jdbc(
  ¤ url: String = "jdbc:teradata://db.mycompany.com/"
  ¤ table: String =
    – "AWESOME_PEOPLE" => SELECT *
    – "(SELECT name FROM AWESOME_PEOPLE where awesomeness > 90)"
  ¤ connectionProperties: Map[String, String] =
    – Map("username" -> "pluto", "password" -> "E_2Utpff"))
  • 24. DataFrame uniform partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
  ¤ columnName: String = "PERSON_ID",
  ¤ lowerBound: String = "0",
  ¤ upperBound: String = "1200045",
  ¤ numPartitions: Int = 20)
¨ Must have some prior knowledge (min/max value) and a numerical column uniformly distributed
¨ Will create 20 partitions dividing the specified column into adjacent ranges of equal size
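The partitioning above can be made concrete with a small pure-Scala sketch. The helper name `partitionPredicates` is mine, not Spark's, and it is a simplified version of what Spark does internally: derive one adjacent range clause per partition from the column name, bounds and partition count (real Spark leaves the first and last ranges unbounded to catch values outside [lower, upper]).

```scala
// Hypothetical helper: illustrates how a partitioned JDBC read splits the
// [lower, upper) range of a numeric column into adjacent per-partition clauses.
def partitionPredicates(column: String,
                        lower: Long,
                        upper: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    if (i == numPartitions - 1)
      s"$column >= $start" // last partition also takes the remainder
    else
      s"$column >= $start AND $column < ${start + stride}"
  }
}
```

Each resulting string becomes the WHERE clause of one concurrent query, which is why a uniformly distributed column matters: skewed values produce unbalanced partitions.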
  • 25. DataFrame custom partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
  ¤ predicates: Array[String] =
    Array("2015-06-20" -> "2015-06-30", "2015-07-01" -> "2015-07-10").map {
      case (start, end) =>
        s"cast(DAT_TME as date) >= date '$start' " +
        s"AND cast(DAT_TME as date) <= date '$end'"
    })
¨ Will create as many partitions as the number of SQL queries:
  "WHERE cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'"
  • 26. Union of tables
¨ Concatenate many tables with the same schema into a single DataFrame:
  def readTable(table: String): DataFrame
  List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
    .map(readTable)
    .reduce(_ unionAll _)
¨ .par will spin up as many concurrent threads as the number of idle CPUs
  ¤ be careful not to overload the DB
  ¤ don’t use .par if you want to query the DB sequentially
  • 27. DataFrame to typed RDD conversion (using Spark 1.x, no Dataset*)
case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
dataframe.map {
  case Row(a: java.math.BigDecimal, b: String, c: Int, _: String,
           d: java.sql.Date, e: java.sql.Date, _: java.sql.Timestamp,
           _: java.sql.Timestamp, _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
}
¨ Must map every single column, even ones not needed
¨ Will fail if any of those are null due to casting in the unapply method of class Row
¨ Remove rows containing at least one null value:
  dataframe.na.drop()
(*) See latest documentation at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
  • 28. Null handling using Scala Options
case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
¨ All columns handling (convert Row to a List[Any]):
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int, _: String,
            d: java.sql.Date, e: java.sql.Date, _: java.sql.Timestamp,
            _: java.sql.Timestamp, _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
})
¨ Sparse columns handling:
row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)
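To see why wrapping with `Option` matters, here is a self-contained sketch of the same pattern with no Spark required. `Person` and `fromRow` are illustrative names of mine: the nullable column is left untyped in the match and wrapped with `Option`, so a null becomes `None` instead of failing the pattern.

```scala
// Typed record: the nullable column becomes an Option field.
case class Person(id: Long, name: String, age: Option[Int])

// Convert a raw row (as produced by row.toSeq.toList) into a typed record.
def fromRow(row: List[Any]): Person = row match {
  case List(id: Long, name: String, age) =>
    // Option(null) == None, Option(42) == Some(42): nulls survive the match
    Person(id, name, Option(age).map(_.asInstanceOf[Int]))
}
```

The key detail is that `age` carries no type annotation in the pattern, so both a boxed Int and a null match it; the cast only runs inside `map`, which `None` skips.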
  • 29. Just Spark cache is not enough
¨ Data is dropped from memory at each context restart due to:
  ¤ Updating a dependency jar (common for mixed IDE development / notebook analysis)
  ¤ Re-submitting the job execution
  ¤ Kerberos ticket expires!?!
¨ Fetching 600M rows can take ~1 hour in a 5-node cluster
Dozens of iterations per day => spending most of the time waiting for data to reload at each iteration!
  • 30. From volatile memory to persistent memory storage
Distribute and make data temporarily available in persistent in-memory storage in the development cluster, shared among multiple concurrent applications
  • 31.
¨ Formerly known as Tachyon
¨ In-memory distributed storage system
¨ Long-term caching of raw data and intermediate results
¨ Spark can read/write in Alluxio seamlessly instead of using HDFS
¨ The 1-tier configuration safely leaves no traces on disk
¨ Data is loaded once and available for the whole development period to multiple applications
  • 32. Alluxio as the Key Enabling Technology
  • 33.
  • 34. 1-tier configuration
¨ ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
¨ alluxio.worker.memory.size=24GB
¨ alluxio.worker.tieredstore
  ¤ levels=1
  ¤ level0.alias=MEM
  ¤ level0.dirs.path=${ALLUXIO_RAM_FOLDER}
  ¤ level0.dirs.quota=24G
¨ We leave the under-FS configuration empty
¨ Deploy without mount (no root access required)
  ¤ ./bin/alluxio-start.sh all NoMount
  • 35. Spark read/write APIs
¨ DataFrame
  ¤ dataframe.write.save("alluxio://master_ip:port/mydata/mydataframe.parquet")
  ¤ val dataframe: DataFrame = sqlContext.read.load("alluxio://master_ip:port/mydata/mydataframe.parquet")
¨ RDD
  ¤ rdd.saveAsObjectFile("alluxio://master_ip:port/mydata/myrdd.object")
  ¤ val rdd: RDD[MyClass] = sc.objectFile[MyClass]("alluxio://master_ip:port/mydata/myrdd.object")
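In practice these calls combine into a cache-or-load pattern: try the fast store first, fall back to the slow source, then persist the result for the next application. A minimal generic sketch (the helper `loadCached` is mine, not an Alluxio or Spark API; in the real workflow `read` would be `sqlContext.read.load`, `write` would be `dataframe.write.save`, and `source` the expensive JDBC read):

```scala
// Generic cache-or-load: serve from the fast store (e.g. Alluxio) when the
// path exists, otherwise hit the slow source (e.g. JDBC) and persist the
// result so subsequent applications reload in seconds instead of hours.
def loadCached[A](read: String => A,        // fast store lookup, throws on miss
                  write: (A, String) => Unit, // persist into the fast store
                  source: () => A)           // expensive load from origin
                 (path: String): A =
  try read(path)
  catch {
    case _: Exception =>
      val data = source()
      write(data, path)
      data
  }
```

Because the store outlives any single Spark context, a restarted notebook or re-submitted job hits the cached branch immediately, which is exactly the property plain Spark caching lacks (slide 29).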
  • 36. “Data Lake” as a Service
¨ A dev cluster spun up on-demand for each project
¨ Only the data required for development is loaded
¨ Data lives in cluster memory only for as long as it is needed
¨ At the end of each iteration we are ready to go to production
  • 37. Making the impossible possible
¨ An agile workflow combining Spark, Scala, DataFrame, JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from the source and develop production-quality machine learning pipelines
¨ Data available since day 1 and at every iteration
  ¤ Alluxio decreased loading time from hours to seconds
¨ Avoid complicated and time-consuming data plumbing operations
  • 38. Further developments
1. Memory size limitation
  ¤ Add external in-memory tiers?
2. Set-up overhead
  ¤ JDBC drivers, partitioning strategy and DataFrame from/to case class conversion (Spark 2 and frameless aim to solve this)
3. Shared memory resources between Spark and Alluxio
  ¤ Set Alluxio as OFF_HEAP memory as well and divide memory into storage and cache
4. In-memory replication for read availability
  ¤ If an Alluxio node fails, data is lost due to the absence of an underlying file system
5. It would be nice if Alluxio could handle this and mount a relational table/view in the form of data files (csv, parquet…)
  • 39. Follow-up links
¨ Original article on DZone:
  ¤ dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
¨ Professional Data Science Manifesto:
  ¤ datasciencemanifesto.org
¨ Vademecum of Practical Data Science:
  ¤ Handbook and recipes for data-driven solutions
  ¤ datasciencevademecum.wordpress.com