Abstract:
Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called "Data Lake", where raw data is stored and periodically ingested into a distributed file system such as HDFS.
Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example for regulatory reasons or to reduce latency; in these cases Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.
We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.
In this talk we will:
* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet
* Explain the iterative exploratory process and advantages of adopting functional programming
* Critically analyse the issues faced with the existing methodology
* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds
* Discuss some future improvements to the overall architecture
Bio:
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications.
His main expertise is on building production-oriented machine learning systems.
Co-author of the Professional Manifesto for Data Science (datasciencemanifesto.com), founder of the Data Science Milan Meetup group and currently writing the "Python Deep Learning" book (to be published soon).
He loves evangelising his passion for best practices and effective methodologies amongst the community.
Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
1. In-Memory Logical Data Warehouse for accelerating
Machine Learning Pipelines on top of Spark and Alluxio
@ODSC
OPEN
DATA
SCIENCE
CONFERENCE
Gianmario Spacagna
Senior Data Scientist, Pirelli Tyre
London | October 8-9th 2016
2. Takeaways
¨ Main concepts of a “logical data warehouse”
¨ A way to handle governance issues
¨ An agile workflow made of both iterative exploratory
analysis and production-quality development
¨ A fully in-memory stack for fast computation and
storage on top of Spark and Alluxio
¨ How to successfully do data science if data reside in
an RDBMS and you don’t have a data lake… yet!
3. About me
¨ Engineering background in Distributed Systems
¤ (Polytechnic of Turin, KTH of Stockholm)
¨ Data-relevant experience
¤ Predictive Marketing (AgilOne, StreamSend)
¤ Cyber Security (Cisco)
¤ Financial Services (Barclays) <- Where I started this use case
¤ Automotive (Pirelli Tyre)
@gm_spacagna
4. ¨ Team of:
¤ Data Scientists
¤ Data Engineers
¤ Solutions architect
¤ Full-stack developers
¤ Industrial Experts
¤ Business Analysts
5. Smart Manufacturing - Industry 4.0
¨ Anomaly detection
¨ Predictive maintenance
¨ Real-time analytics of
machine status and
industrial KPIs on
screens in the plant
Pirelli Cloud
7. Demand Insight Platform
¨ Long-term and
short-term
forecasting of
sales
¨ Prestige cars
segment
Pirelli Cloud
8. My areas of interest
¨ Functional Programming and Apache Spark
¨ Contributor of the
Professional Data Science Manifesto
¨ Founder of Data Science Milan meetup
community
¨ Co-authoring Python Deep Learning book,
coming soon…
Building production-ready and scalable machine
learning systems
(continue with list of principles...)
9. Data Science Agile cycle
1. Get access to data
2. Explore
3. Transform
4. Train
5. Evaluate
6. Analyze results
Even dozens of iterations per day!!!
11. Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
12. ¨ You can’t solve problems with data science if
data is not readily available
¨ Data processing should be fast and reactive to
allow quick iterations
¨ The core DS team cannot depend on IT folks
Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
14. Great, but a few technical issues
¨ Engineering effort
¤ dedicated infrastructure team => expensive
¨ Synchronization with new data from source
¤ Keep track of what portion of the data has been exported and what hasn’t
¨ Consistency / Data Versioning / Duplication
¤ ETL logic and requirements change very often
¤ Memory is cheap, but having hundreds of sparse copies
of the same data is confusing and error-prone
¨ I/O cost
¤ Reading/writing is expensive for iterative and explorative jobs
(machine learning)
15. “Logical Data Warehouse”
¨ View and access cleaned versions of data
¨ Always show latest version by default
¨ Apply transformations on-the-fly
(discovery-oriented analytics)
¨ Abstract data representation from rigid structures
of the DB’s persistence store
¨ Simply add new data sources using virtualization
¨ Flexible, faster time-to-market, lower costs
16. What about governance issues?
¨ Large corporations can’t move data before an
approved governance plan
¨ Data can only be stored in a safe environment
administered by only a few authorized people,
who are not necessarily aligned with data
scientists’ needs
¨ Data leakage paranoia
¨ As a result, data cannot be easily/quickly pulled
from the central data warehouse and stored
into an external infrastructure
17. Long time and large investment for
setting up a new project
That’s not Agile!
18. Wait a moment, analysts don’t seem to
have this problem…
19. From disk to volatile memory
Distribute and make data temporarily available in-
memory in an ad-hoc development cluster
20. ¨ In-memory engine for distributed data processing
¨ JDBC drivers to connect to relational databases
¨ Structured data represented using DataFrame API
¨ Fully-functional data manipulation via RDD API
¨ Machine learning libraries (ML/MLlib)
¨ Interaction and visualization through
Spark Notebook or Zeppelin
22. JDBC Drivers
¨ The JDBC drivers must be visible to the primordial
class loader on the client session and on all executors
¤ Copy the jar files to each node using rsync
¨ To the spark-submit command add the following:
¤ --driver-class-path "~/drivers/driver1.jar:~/drivers/driver2.jar"
¤ --conf
"spark.executor.extraClassPath=/usr/local/lib/drivers/driver1.jar:/usr/local/lib
/drivers/driver2.jar"
¤ Alternatively, modify the compute_classpath.sh script
¨ If you are using the Spark Notebook, also do:
¤ export EXTRA_CLASSPATH="~/drivers/driver1.jar:~/drivers/driver2.jar"
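A minimal Scala sketch of the equivalent executor-side setting done programmatically (jar paths and app name are placeholders; the driver class path itself still has to be provided before the driver JVM starts, i.e. via spark-submit or spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

// Executor class path can be passed through SparkConf;
// the paths below are placeholders for the actual JDBC driver jars
val conf = new SparkConf()
  .setAppName("jdbc-ingestion")
  .set("spark.executor.extraClassPath",
       "/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar")

val sc = new SparkContext(conf)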
23. DataFrame single partition
¨ Since Spark 1.4.0
¨ sqlContext.read.jdbc(
¤ url: String = "jdbc:teradata://db.mycompany.com/"
¤ table: String =
n "AWESOME_PEOPLE" => SELECT * FROM the whole table
n "(SELECT name FROM AWESOME_PEOPLE WHERE awesomeness > 90)" => any sub-query
¤ connectionProperties: Map[String, String] =
n Map("username" -> "pluto", "password" -> "E_2Utpff"))
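A minimal end-to-end sketch of the call above, assuming Spark 1.x with sqlContext in scope and the same placeholder URL, table and credentials (the connection properties go through java.util.Properties, using the standard JDBC key "user"):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "pluto")
props.setProperty("password", "E_2Utpff")

// Whole table in a single partition...
val people = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/", "AWESOME_PEOPLE", props)

// ...or an arbitrary sub-query wrapped as a derived table
val awesome = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/",
  "(SELECT name FROM AWESOME_PEOPLE WHERE awesomeness > 90) t", props)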
24. DataFrame uniform partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
¤ columnName: String = “PERSON_ID”,
¤ lowerBound: Long = 0,
¤ upperBound: Long = 1200045,
¤ numPartitions: Int = 20)
¨ Must have some prior knowledge (min/max value) and a
numerical column uniformly distributed
¨ Will create 20 partitions dividing the specified column in
adjacent ranges of equal size
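A sketch of the same call with the actual Spark signature, which takes the bounds as Long and the connection properties last (props is the java.util.Properties object from the previous sketch):

val people = sqlContext.read.jdbc(
  url = "jdbc:teradata://db.mycompany.com/",
  table = "AWESOME_PEOPLE",
  columnName = "PERSON_ID",   // numeric, roughly uniformly distributed column
  lowerBound = 0L,            // known minimum of PERSON_ID
  upperBound = 1200045L,      // known maximum of PERSON_ID
  numPartitions = 20,         // 20 adjacent ranges of equal size
  connectionProperties = props)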
25. DataFrame custom partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
¤ predicates: Array[String] =
Array("2015-06-20" -> "2015-06-30",
  "2015-07-01" -> "2015-07-10").map { case (start, end) =>
  s"cast(DAT_TME as date) >= date '$start' " +
  s"AND cast(DAT_TME as date) <= date '$end'"
})
¨ Will create as many partitions as the number of SQL predicates, each one
appended as a WHERE clause, e.g.:
"WHERE cast(DAT_TME as date) >= date '2015-06-20' AND
cast(DAT_TME as date) <= date '2015-06-30'"
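Put together, a hedged sketch of the predicate-based read (EVENTS_TABLE is a placeholder table name; props as in the earlier sketch):

// One SQL predicate per partition: two date ranges => two partitions
val predicates = Array(
  "2015-06-20" -> "2015-06-30",
  "2015-07-01" -> "2015-07-10").map { case (start, end) =>
  s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
}

val events = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/", "EVENTS_TABLE", predicates, props)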
26. Union of tables
¨ Concatenate many tables with same schema in a single
DataFrame
¨ .par will spin as many concurrent threads as the number of idle
CPUs
¤ be careful with overloading the DB
¤ don’t use .par if you want to query the DB sequentially
def readTable(table: String): DataFrame
List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
.map(readTable)
.reduce(_ unionAll _)
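A self-contained sketch of the pattern above, assuming readTable simply wraps the JDBC read shown earlier (the table-name placeholders are kept as in the slide):

import org.apache.spark.sql.DataFrame

def readTable(table: String): DataFrame =
  sqlContext.read.jdbc("jdbc:teradata://db.mycompany.com/", table, props)

// .par reads the tables concurrently; drop it to query the DB sequentially
val all: DataFrame =
  List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
    .map(readTable)
    .reduce(_ unionAll _)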
27. DataFrame to typed RDD conversion
(Using Spark 1.x, no DataSet*)
case class MyClass(a: Long, b: String, c: Int,
d: String, e: String)
dataframe.map {
case Row(a: java.math.BigDecimal, b: String, c: Int,
_: String, d: java.sql.Date, e: java.sql.Date,
_: java.sql.Timestamp, _: java.sql.Timestamp,
_: java.math.BigDecimal) =>
MyClass(a = a.longValue(), b = b, c = c, d =
d.toString, e = e.toString)
}
¨ Must map every single column, even ones not needed
¨ Will fail if any of those are null due to casting in the unapply method of class Row
¨ Remove rows containing at least one null value:
dataframe.na.drop()
(*) See latest documentation at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
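For comparison, a hedged Spark 2.x sketch using the DataSet API, assuming a SparkSession named spark and source columns named "a" … "e" (illustrative names): cast and alias just the needed columns, then let the Encoder do the typed conversion.

import org.apache.spark.sql.Dataset
import spark.implicits._

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)

val typed: Dataset[MyClass] = dataframe
  .select(
    $"a".cast("long").as("a"),
    $"b".as("b"),
    $"c".cast("int").as("c"),
    $"d".cast("string").as("d"),
    $"e".cast("string").as("e"))
  .na.drop()        // drop rows containing nulls before the typed conversion
  .as[MyClass]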
28. Null handling using Scala Options
case class MyClass(a: Long, b: String, c: Option[Int],
d: String, e: String)
¨ All columns handling (convert Row to a List[Any]):
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int,
    _: String, d: java.sql.Date, e: java.sql.Date,
    _: java.sql.Timestamp, _: java.sql.Timestamp,
    _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = Option(c), d =
    d.toString, e = e.toString)
})
¨ Sparse columns handling:
row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)
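A sketch of the sparse-column approach in full: pick only the needed fields by name and wrap the nullable ones in Option (column names are illustrative; using java.lang.Integer avoids silently turning a SQL NULL into 0):

dataframe.map { row =>
  MyClass(
    a = row.getAs[java.math.BigDecimal]("a").longValue(),
    b = row.getAs[String]("b"),
    c = Option(row.getAs[java.lang.Integer]("c")).map(_.intValue),  // None if NULL
    d = row.getAs[java.sql.Date]("d").toString,
    e = row.getAs[java.sql.Date]("e").toString)
}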
29. Just Spark cache is not enough
¨ Data is dropped from memory
at each context restart due to
¤ Update dependency jar
(common for mixed IDE
development / notebook analysis)
¤ Re-submit the job execution
¤ Kerberos ticket expires!?!
¨ Fetching 600M rows can take
~ 1 hour on a 5-node cluster
Dozens of iterations per day => spending most of the time
waiting for data to reload at each iteration!
30. From volatile memory to
persistent memory storage
Distribute and make data temporarily, yet persistently,
available in-memory in the development cluster and
shared among multiple concurrent applications
31. ¨ Formerly known as Tachyon
¨ In-memory distributed storage system
¨ Long-term caching of raw data and intermediate
results
¨ Spark can read from / write to Alluxio seamlessly instead
of using HDFS (see the sketch below)
¨ 1-tier configuration safely leaves no traces on disk
¨ Data is loaded once and available for the whole
development period to multiple applications
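A minimal sketch of this step, assuming the Alluxio client jar is on the Spark class path and with a placeholder master host/port and path: write the ingested DataFrame once as Parquet into Alluxio, then reload it in seconds from any subsequent Spark session.

val alluxioPath = "alluxio://alluxio-master:19998/ldw/awesome_people.parquet"

// Write once, right after the slow JDBC ingestion...
people.write.parquet(alluxioPath)

// ...then every new Spark application / notebook session reloads it in seconds
val cached = sqlContext.read.parquet(alluxioPath)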
36. “Data Lake” as a Service
¨ A dev cluster spun on-demand for each project
¨ Only data required for development is loaded
¨ Data lives in cluster memory only for as long as it is needed
¨ At the end of each iteration we are ready to go into production
37. Making the impossible possible
¨ Agile workflow combining Spark, Scala, DataFrame,
JDBC, Parquet, Kryo and Alluxio to create a
scalable, in-memory, reactive stack to explore data
directly from source and develop production-quality
machine learning pipelines
¨ Data available since day 1 and at every iteration
¤ Alluxio decreased loading time from hours to seconds
¨ Avoid complicated and time-consuming
Data Plumbing operations
38. Further developments
1. Memory size limitation
¤ Add external in-memory tiers?
2. Set-up overhead
¤ JDBC drivers, partitioning strategy and data frame from/to case
class conversion (Spark 2 and frameless aim to solve this)
3. Shared memory resources between Spark and Alluxio
¤ Set Alluxio as OFF_HEAP memory as well and divide memory in
storage and cache
4. In-Memory replication for read availability
¤ If an Alluxio node fails, data is lost due to the absence of an
underlying file system
5. It would be nice if Alluxio could handle this and mount a
relational table/view in the form of data files
(csv, parquet…)
39. Follow-up links
¨ Original article on DZone:
¤ dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
¨ Professional Data Science Manifesto:
¤ datasciencemanifesto.org
¨ Vademecum of Practical Data Science:
¤ Handbook and recipes for data-driven solutions
datasciencevademecum.wordpress.com