Abstract:
Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called "Data Lake", where raw data is stored and periodically ingested into a distributed file system such as HDFS.
Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, for example for regulatory reasons or to reduce latency; in these cases Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.
We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.
In this talk we will:
* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet
* Explain the iterative exploratory process and advantages of adopting functional programming
* Critically analyse the issues faced with the existing methodology
* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds
* Discuss some future improvements to the overall architecture
Bio:
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications.
His main expertise is on building production-oriented machine learning systems.
Co-author of the Professional Manifesto for Data Science (datasciencemanifesto.com), founder of the Data Science Milan Meetup group and currently writing the "Python Deep Learning" book (to be published soon).
He loves evangelising his passion for best practices and effective methodologies amongst the community.
Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
1. In-Memory Logical Data Warehouse for accelerating
Machine Learning Pipelines on top of Spark and Alluxio
@ODSC
OPEN
DATA
SCIENCE
CONFERENCE
Gianmario Spacagna
Senior Data Scientist, Pirelli Tyre
London | October 8-9th 2016
2. Takeaways
¨ Main concepts of a “logical data warehouse”
¨ A way to handle governance issues
¨ An agile workflow made of both iterative exploratory
analysis and production-quality development
¨ A fully in-memory stack for fast computation and
storage on top of Spark and Alluxio
¨ How to successfully do data science if data reside in
an RDBMS and you don’t have a data lake… yet!
3. About me
¨ Engineering background in Distributed Systems
¤ (Polytechnic of Turin, KTH of Stockholm)
¨ Data-relevant experience
¤ Predictive Marketing (AgilOne, StreamSend)
¤ Cyber Security (Cisco)
¤ Financial Services (Barclays) <- Where I started this use case
¤ Automotive (Pirelli Tyre)
@gm_spacagna
4. ¨ Team of:
¤ Data Scientists
¤ Data Engineers
¤ Solutions architect
¤ Full-stack developers
¤ Industrial Experts
¤ Business Analysts
5. Smart Manufacturing - Industry 4.0
¨ Anomaly detection
¨ Predictive maintenance
¨ Real-time analytics of
machine status and
industrial KPIs on
screens in the plant
Pirelli Cloud
7. Demand Insight Platform
¨ Long-term and
short-term
forecasting of
sales
¨ Prestige cars
segment
Pirelli Cloud
8. My areas of interest
¨ Functional Programming and Apache Spark
¨ Contributor of the
Professional Data Science Manifesto
¨ Founder of Data Science Milan meetup
community
¨ Co-authoring Python Deep Learning book,
coming soon…
Building production-ready and scalable machine
learning systems
(continue with list of principles...)
9. Data Science Agile cycle
1. Get access to data
2. Explore
3. Transform
4. Train
5. Evaluate
6. Analyze results
Even dozens of iterations per day!!!
11. Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
12. ¨ You can’t solve problems with data science if
data is not readily available
¨ Data processing should be fast and reactive to
allow quick iterations
¨ The core DS team cannot depend on IT folks
Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
14. Great, but a few technical issues
¨ Engineering effort
¤ dedicated infrastructure team => expensive
¨ Synchronization with new data from source
¤ Keep track of what portion of the data has been exported and what hasn’t
¨ Consistency / Data Versioning / Duplication
¤ ETL logic and requirements change very often
¤ Memory is cheap, but having hundreds of sparse copies
of the same data is confusing and error-prone
¨ I/O cost
¤ Reading/writing is expensive for iterative and explorative jobs
(machine learning)
15. “Logical Data Warehouse”
¨ View and access cleaned versions of data
¨ Always show latest version by default
¨ Apply transformations on-the-fly
(discovery-oriented analytics)
¨ Abstract data representation from rigid structures
of the DB’s persistence store
¨ Simply add new data sources using virtualization
¨ Flexible, faster time-to-market, lower costs
16. What about governance issues?
¨ Large corporations can’t move data before an
approved governance plan
¨ Data can only be stored in a safe environment
administered by only a few authorized people,
who are not necessarily aligned with data
scientists’ needs
¨ Data leakage paranoia
¨ As a result, data cannot be easily/quickly pulled
from the central data warehouse and stored
into an external infrastructure
17. Long time and large investment for
setting up a new project
That’s not Agile!
18. Wait a moment, analysts don’t seem to
have this problem…
19. From disk to volatile memory
Distribute and make data temporarily available in-
memory in an ad-hoc development cluster
20. ¨ In-memory engine for distributed data processing
¨ JDBC drivers to connect to relational databases
¨ Structured data represented using DataFrame API
¨ Fully-functional data manipulation via RDD API
¨ Machine learning libraries (ML/MLlib)
¨ Interaction and visualization through
Spark Notebook or Zeppelin
22. JDBC Drivers
¨ The JDBC drivers must be visible to the primordial
class loader on the client session and on all executors
¤ Copy the jar files to each node using rsync
¨ To the spark-submit command add the following:
¤ --driver-class-path "~/drivers/driver1.jar:~/drivers/driver2.jar"
¤ --conf
"spark.executor.extraClassPath=/usr/local/lib/drivers/driver1.jar:/usr/local/lib
/drivers/driver2.jar"
¤ Alternatively, modify the compute_classpath.sh script
¨ If you are using the Spark Notebook, also do:
¤ export EXTRA_CLASSPATH="~/drivers/driver1.jar:~/drivers/driver2.jar"
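A minimal Scala sketch of the equivalent executor-side setting done programmatically (jar paths and app name are placeholders; the driver class path itself still has to be provided before the driver JVM starts, i.e. via spark-submit or spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

// Executor class path can be passed through SparkConf;
// the paths below are placeholders for the actual JDBC driver jars
val conf = new SparkConf()
  .setAppName("jdbc-ingestion")
  .set("spark.executor.extraClassPath",
       "/usr/local/lib/drivers/driver1.jar:/usr/local/lib/drivers/driver2.jar")

val sc = new SparkContext(conf)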
23. DataFrame single partition
¨ Since Spark 1.4.0
¨ sqlContext.read.jdbc(
¤ url: String = "jdbc:teradata://db.mycompany.com/"
¤ table: String =
n "AWESOME_PEOPLE" => SELECT * FROM the whole table
n "(SELECT name FROM AWESOME_PEOPLE WHERE awesomeness > 90)" => any sub-query
¤ connectionProperties: Map[String, String] =
n Map("username" -> "pluto", "password" -> "E_2Utpff"))
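A minimal end-to-end sketch of the call above, assuming Spark 1.x with sqlContext in scope and the same placeholder URL, table and credentials (the connection properties go through java.util.Properties, using the standard JDBC key "user"):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "pluto")
props.setProperty("password", "E_2Utpff")

// Whole table in a single partition...
val people = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/", "AWESOME_PEOPLE", props)

// ...or an arbitrary sub-query wrapped as a derived table
val awesome = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/",
  "(SELECT name FROM AWESOME_PEOPLE WHERE awesomeness > 90) t", props)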
24. DataFrame uniform partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
¤ columnName: String = “PERSON_ID”,
¤ lowerBound: Long = 0,
¤ upperBound: Long = 1200045,
¤ numPartitions: Int = 20)
¨ Must have some prior knowledge (min/max value) and a
numerical column uniformly distributed
¨ Will create 20 partitions dividing the specified column in
adjacent ranges of equal size
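A sketch of the same call with the actual Spark signature, which takes the bounds as Long and the connection properties last (props is the java.util.Properties object from the previous sketch):

val people = sqlContext.read.jdbc(
  url = "jdbc:teradata://db.mycompany.com/",
  table = "AWESOME_PEOPLE",
  columnName = "PERSON_ID",   // numeric, roughly uniformly distributed column
  lowerBound = 0L,            // known minimum of PERSON_ID
  upperBound = 1200045L,      // known maximum of PERSON_ID
  numPartitions = 20,         // 20 adjacent ranges of equal size
  connectionProperties = props)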
25. DataFrame custom partitioning
¨ sqlContext.read.jdbc(url, table, connectionProperties,
¤ predicates: Array[String] =
Array("2015-06-20" -> "2015-06-30",
  "2015-07-01" -> "2015-07-10").map { case (start, end) =>
  s"cast(DAT_TME as date) >= date '$start' " +
  s"AND cast(DAT_TME as date) <= date '$end'"
})
¨ Will create as many partitions as the number of SQL predicates, each one
appended as a WHERE clause, e.g.:
"WHERE cast(DAT_TME as date) >= date '2015-06-20' AND
cast(DAT_TME as date) <= date '2015-06-30'"
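Put together, a hedged sketch of the predicate-based read (EVENTS_TABLE is a placeholder table name; props as in the earlier sketch):

// One SQL predicate per partition: two date ranges => two partitions
val predicates = Array(
  "2015-06-20" -> "2015-06-30",
  "2015-07-01" -> "2015-07-10").map { case (start, end) =>
  s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
}

val events = sqlContext.read.jdbc(
  "jdbc:teradata://db.mycompany.com/", "EVENTS_TABLE", predicates, props)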
26. Union of tables
¨ Concatenate many tables with same schema in a single
DataFrame
¨ .par will spin as many concurrent threads as the number of idle
CPUs
¤ be careful with overloading the DB
¤ don’t use .par if you want to query the DB sequentially
def readTable(table: String): DataFrame
List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
.map(readTable)
.reduce(_ unionAll _)
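A self-contained sketch of the pattern above, assuming readTable simply wraps the JDBC read shown earlier (the table-name placeholders are kept as in the slide):

import org.apache.spark.sql.DataFrame

def readTable(table: String): DataFrame =
  sqlContext.read.jdbc("jdbc:teradata://db.mycompany.com/", table, props)

// .par reads the tables concurrently; drop it to query the DB sequentially
val all: DataFrame =
  List("<TABLE1>", "<TABLE2>", "<TABLE3>").par
    .map(readTable)
    .reduce(_ unionAll _)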
27. DataFrame to typed RDD conversion
(Using Spark 1.x, no DataSet*)
case class MyClass(a: Long, b: String, c: Int,
d: String, e: String)
dataframe.map {
case Row(a: java.math.BigDecimal, b: String, c: Int,
_: String, d: java.sql.Date, e: java.sql.Date,
_: java.sql.Timestamp, _: java.sql.Timestamp,
_: java.math.BigDecimal) =>
MyClass(a = a.longValue(), b = b, c = c, d =
d.toString, e = e.toString)
}
¨ Must map every single column, even ones not needed
¨ Will fail if any of those are null due to casting in the unapply method of class Row
¨ Remove rows containing at least one null value:
dataframe.na.drop()
(*) See latest documentation at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
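For comparison, a hedged Spark 2.x sketch using the DataSet API, assuming a SparkSession named spark and source columns named "a" … "e" (illustrative names): cast and alias just the needed columns, then let the Encoder do the typed conversion.

import org.apache.spark.sql.Dataset
import spark.implicits._

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)

val typed: Dataset[MyClass] = dataframe
  .select(
    $"a".cast("long").as("a"),
    $"b".as("b"),
    $"c".cast("int").as("c"),
    $"d".cast("string").as("d"),
    $"e".cast("string").as("e"))
  .na.drop()        // drop rows containing nulls before the typed conversion
  .as[MyClass]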
28. Null handling using Scala Options
case class MyClass(a: Long, b: String, c: Option[Int],
d: String, e: String)
¨ All columns handling (convert Row to a List[Any]):
dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal, b: String, c: Int,
    _: String, d: java.sql.Date, e: java.sql.Date,
    _: java.sql.Timestamp, _: java.sql.Timestamp,
    _: java.math.BigDecimal) =>
    MyClass(a = a.longValue(), b = b, c = Option(c), d =
    d.toString, e = e.toString)
})
¨ Sparse columns handling:
row.getAs[SQLPrimitiveType](columnIndex: Int)
row.getAs[SQLPrimitiveType](columnName: String)
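A sketch of the sparse-column approach in full: pick only the needed fields by name and wrap the nullable ones in Option (column names are illustrative; using java.lang.Integer avoids silently turning a SQL NULL into 0):

dataframe.map { row =>
  MyClass(
    a = row.getAs[java.math.BigDecimal]("a").longValue(),
    b = row.getAs[String]("b"),
    c = Option(row.getAs[java.lang.Integer]("c")).map(_.intValue),  // None if NULL
    d = row.getAs[java.sql.Date]("d").toString,
    e = row.getAs[java.sql.Date]("e").toString)
}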
29. Just Spark cache is not enough
¨ Data is dropped from memory
at each context restart due to
¤ Update dependency jar
(common for mixed IDE
development / notebook analysis)
¤ Re-submit the job execution
¤ Kerberos ticket expires!?!
¨ Fetching 600M rows can take
~ 1 hour on a 5-node cluster
Dozens of iterations per day => spending most of the time
waiting for data to reload at each iteration!
30. From volatile memory to
persistent memory storage
Distribute and make data temporarily, yet persistently,
available in-memory in the development cluster and
shared among multiple concurrent applications
31. ¨ Formerly known as Tachyon
¨ In-memory distributed storage system
¨ Long-term caching of raw data and intermediate
results
¨ Spark can read from / write to Alluxio seamlessly instead
of using HDFS (see the sketch below)
¨ 1-tier configuration safely leaves no traces on disk
¨ Data is loaded once and available for the whole
development period to multiple applications
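A minimal sketch of this step, assuming the Alluxio client jar is on the Spark class path and with a placeholder master host/port and path: write the ingested DataFrame once as Parquet into Alluxio, then reload it in seconds from any subsequent Spark session.

val alluxioPath = "alluxio://alluxio-master:19998/ldw/awesome_people.parquet"

// Write once, right after the slow JDBC ingestion...
people.write.parquet(alluxioPath)

// ...then every new Spark application / notebook session reloads it in seconds
val cached = sqlContext.read.parquet(alluxioPath)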
36. “Data Lake” as a Service
¨ A dev cluster spun on-demand for each project
¨ Only data required for development is loaded
¨ Data lives in cluster memory only for as long as it is needed
¨ At the end of each iteration we are ready to go into production
37. Making the impossible possible
¨ Agile workflow combining Spark, Scala, DataFrame,
JDBC, Parquet, Kryo and Alluxio to create a
scalable, in-memory, reactive stack to explore data
directly from source and develop production-quality
machine learning pipelines
¨ Data available since day 1 and at every iteration
¤ Alluxio decreased loading time from hours to seconds
¨ Avoid complicated and time-consuming
Data Plumbing operations
38. Further developments
1. Memory size limitation
¤ Add external in-memory tiers?
2. Set-up overhead
¤ JDBC drivers, partitioning strategy and data frame from/to case
class conversion (Spark 2 and frameless aim to solve this)
3. Shared memory resources between Spark and Alluxio
¤ Set Alluxio as OFF_HEAP memory as well and divide memory in
storage and cache
4. In-Memory replication for read availability
¤ If an Alluxio node fails, data is lost due to the absence of an
underlying file system
5. It would be nice if Alluxio could handle this and mount a
relational table/view in the form of data files
(csv, parquet…)
39. Follow-up links
¨ Original article on DZone:
¤ dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
¨ Professional Data Science Manifesto:
¤ datasciencemanifesto.org
¨ Vademecum of Practical Data Science:
¤ Handbook and recipes for data-driven solutions
datasciencevademecum.wordpress.com