Slides from the Data Engineering meetup @ Flatiron School in Houston.
Jupyter notebook examples: https://github.com/spraguesy/spark-ncaa-bb
About iland cloud: https://www.iland.com/
OUTLINE
● What is Apache Spark?
● The issue of Big Data
● Past, Present and Future of Spark
● Spark architecture quick overview
● Spark’s languages and APIs
● PySpark
● Community & Ecosystem
● Examples and demos using Jupyter notebooks from Andrew Sprague
What is Apache Spark?
● Unified computing engine
● Libraries for parallel data processing on clusters
● Supports multiple programming languages: SQL, Scala, Java, Python, and R
● Provides libraries for streaming, machine learning and graph computing
● Spark UI: Web UI to monitor and inspect jobs and tasks
● Can run anywhere: laptop to clusters (on-prem or cloud)
● De facto standard for Big Data processing across all industries and use cases
● Open Source @ Apache Software Foundation
Unified
● One (1) compute engine
● One (1) set of APIs
● One (1) way of developing and deploying applications (or jobs)
● Data loading, machine learning, streaming computation, etc.
● Interactive or traditional application deployment
● Code reuse and access to multiple libraries
Compute engine
● Compute engine only: not a persistent data storage system
● Supports a wide range of persistent storage systems: Amazon S3, Azure Storage, Apache Hadoop (HDFS), Apache Cassandra, etc. (see the sketch after this list)
● Easier to deploy and maintain without a persistent storage layer
● Also supports message buses such as Apache Kafka
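A minimal sketch of reading from external storage with PySpark (the bucket, path, and reader options are illustrative, and the S3 example assumes the Hadoop S3A connector is available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# Spark reads from external systems rather than persisting data itself;
# swap the URI for HDFS, a local file, a Cassandra table, etc.
df = spark.read.csv("s3a://my-bucket/events.csv", header=True, inferSchema=True)
df.show(5)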
The issue of Big Data
● Applications used to run mostly on a single processor
● Processor clock speeds have largely plateaued since 2005
● The amount of data keeps increasing
● The price of storage keeps decreasing: it is cheap to store data
● Solution: clusters and parallel CPU cores
● Performance
○ In-memory processing is faster
○ Hadoop / MapReduce is on its way out, at least for real-time analytics and streaming
Past, Present and Future of Spark
● Started in 2009 as a research project at UC Berkeley, CA
● Hadoop MapReduce was the first open-source parallel computing engine for clusters
● Its problem: multiple passes over the data require multiple jobs, each pass writing its results to disk
● Spark enabled multistep applications with efficient in-memory data sharing between steps (batch only at first)
● Interactive and ad-hoc queries (data scientists)
● Became an Apache Software Foundation project in 2013
● Spark 1.0 in 2014 introduced Spark SQL and structured data
● Then came structured streaming, machine learning pipelines and graphs
● Now the de facto standard in all industries: Netflix, Uber, CERN, MIT, Harvard, etc.
Spark’s Languages
● Scala: default language
● Java: available but not popular
● Python: supports nearly all constructs that Scala supports
● SQL: subset of SQL 2003 Standard
● R: SparkR and sparklyr (community), but Python is steadily displacing R in the data science community
Spark’s high-level structured APIs (1/2)
● SparkSession
○ The driver process that controls the Spark application across the cluster
○ One-to-one correspondence between a SparkSession and a Spark application
● DataFrames
○ The most common structured API
○ A table of data with rows and columns and a schema (column names and value types)
○ Think distributed!
○ Columns / Rows and Spark types
○ Essentially the same as the tables and views you execute SQL against with Spark SQL
○ An “untyped” Dataset (a Dataset of Row objects)
● Partitions
○ Chunks of data processed in parallel; Spark partitions DataFrames for you by default (see the sketch after this list)
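To make the DataFrame / SQL equivalence and the default partitioning concrete, here is a minimal PySpark sketch (the data and column names are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-api-demo").getOrCreate()

# A small DataFrame with a schema: column names plus value types.
df = spark.createDataFrame(
    [("Houston", 2.3), ("Austin", 0.9), ("Dallas", 1.3)],
    ["city", "population_m"],
)

# The same question asked two ways: DataFrame API vs. Spark SQL.
df.where(df.population_m > 1.0).select("city").show()
df.createOrReplaceTempView("cities")
spark.sql("SELECT city FROM cities WHERE population_m > 1.0").show()

# Spark picked the number of partitions for us; we never managed them.
print(df.rdd.getNumPartitions())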
Spark’s high-level structured APIs (2/2)
● Transformations
○ Data structures are immutable: you apply instructions called transformations to derive new ones
○ Narrow transformations (one-to-one between partitions) stay in memory and can be pipelined
○ Wide transformations (one-to-many between partitions) shuffle data, writing to disk
● Lazy evaluation
○ Transformations are collected into a streamlined plan rather than executed immediately
○ An optimized graph of computation instructions: the logical plan
● Actions
○ count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
○ An action triggers computation of the plan of transformations (see the sketch after this list)
○ A single job, broken down into stages and tasks executed across the cluster
○ Physical plan (clustered RDD manipulations)
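A short sketch of narrow vs. wide transformations and lazy evaluation (the data is invented; nothing executes until the final action):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
    ["key", "value"],
)

# Narrow transformation: each input partition feeds exactly one output
# partition, so Spark can pipeline it in memory.
doubled = df.withColumn("value", F.col("value") * 2)

# Wide transformation: grouping shuffles rows between partitions.
totals = doubled.groupBy("key").agg(F.sum("value").alias("total"))

# Lazy evaluation: only this action triggers a job, broken into stages
# and tasks across the cluster.
totals.show()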
Diagrams: the Catalyst Optimizer, the Structured API logical planning process, and the physical planning process.
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
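Although the diagrams are not reproduced here, Catalyst’s work can be inspected directly: explain(True) prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1000).withColumn("bucket", F.col("id") % 7)

# Prints each stage of the planning process shown in the diagrams above.
df.groupBy("bucket").count().explain(True)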
PySpark
● Use cases:
○ exploratory data analysis at scale
○ building machine learning pipelines
○ creating ETLs for a data platform
○ streaming data pipelines
● Interactive shell: pyspark
● Available as a PyPI package since Spark 2.2 (pip install pyspark)
● Differences from native Scala?
● Catalyst engine optimizations apply to any language when using the structured APIs
● pandas DataFrame & UDF integration (pandas itself is single-node); see the sketch below
● Python is the fastest growing language for data science & machine learning
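A minimal sketch of the pandas UDF integration, assuming Spark 2.3+ (where vectorized pandas UDFs were introduced); the column names and conversion function are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

df = spark.createDataFrame([(1, 20.0), (2, 25.5), (3, 31.0)], ["id", "celsius"])

# A vectorized UDF: Spark hands the Python worker whole pandas Series
# batches (exchanged efficiently via Apache Arrow) instead of one row
# at a time.
@pandas_udf("double")
def fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

df.withColumn("fahrenheit", fahrenheit("celsius")).show()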