Slides from the Data Engineering meetup @ Flatiron School in Houston.
Jupyter notebook examples: https://github.com/spraguesy/spark-ncaa-bb
About iland cloud: https://www.iland.com/
OUTLINE
● What is Apache Spark?
● The issue of Big Data
● Past, Present and Future of Spark
● Spark architecture quick overview
● Spark’s languages and APIs
● PySpark
● Community & Ecosystem
● Examples and demos using Jupyter notebooks from Andrew Sprague
What is Apache Spark?
● Unified computing engine
● Libraries for parallel data processing on clusters
● Supports multiple programming languages: SQL, Scala, Java, Python, and R
● Provides libraries for streaming, machine learning and graph computing
● Spark UI: Web UI to monitor and inspect jobs and tasks
● Can run anywhere: laptop to clusters (on-prem or cloud)
● De facto standard for Big Data processing across all industries and use cases
● Open Source @ Apache Software Foundation
Unified
● One (1) compute engine
● One (1) set of APIs
● One (1) way of developing and deploying applications (or jobs)
● Data loading, machine learning, streaming computation, etc.
● Interactive or traditional application deployment
● Code reuse and access to multiple libraries
Compute engine
● Compute engine only: not a persistent data storage system
● Supports a wide range of persistent storage systems: Amazon S3, Azure Storage, Apache Hadoop (HDFS), Apache Cassandra, etc. (see the sketch after this list)
● Easier to deploy and maintain without a persistent storage layer
● Also supports message buses such as Apache Kafka
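A minimal sketch of reading from external storage with PySpark (the bucket, path, and reader options are illustrative, and the S3 example assumes the Hadoop S3A connector is available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# Spark reads from external systems rather than persisting data itself;
# swap the URI for HDFS, a local file, a Cassandra table, etc.
df = spark.read.csv("s3a://my-bucket/events.csv", header=True, inferSchema=True)
df.show(5)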
The issue of Big Data
● Applications used to run mostly on a single processor
● Processor clock speeds have largely plateaued since 2005
● The amount of data keeps increasing
● The price of storage keeps decreasing: it is cheap to store data
● Solution: clusters and parallel CPU cores
● Performance
○ In-memory processing is faster
○ Hadoop / MapReduce is on its way out, at least for real-time analytics and streaming
Past, Present and Future of Spark
● Started in 2009 as a research project at UC Berkeley, CA
● Hadoop MapReduce was the first open-source parallel computing engine for clusters
● Its problem: multiple passes over the data require multiple jobs, each pass writing its results to disk
● Spark enabled multistep applications with efficient in-memory data sharing between steps (batch only at first)
● Interactive and ad-hoc queries (data scientists)
● Became an Apache Software Foundation project in 2013
● Spark 1.0 in 2014 introduced Spark SQL and structured data
● Then came structured streaming, machine learning pipelines and graphs
● Now the de facto standard in all industries: Netflix, Uber, CERN, MIT, Harvard, etc.
Spark’s Languages
● Scala: default language
● Java: available but not popular
● Python: supports nearly all constructs that Scala supports
● SQL: subset of SQL 2003 Standard
● R: SparkR and sparklyr (community), but Python is steadily displacing R in the data science community
Spark’s high-level structured APIs (1/2)
● SparkSession
○ The driver process that controls the Spark application across the cluster
○ One-to-one correspondence between a SparkSession and a Spark application
● DataFrames
○ The most common structured API
○ A table of data with rows and columns and a schema (column names and value types)
○ Think distributed!
○ Columns / Rows and Spark types
○ Essentially the same as the tables and views you execute SQL against with Spark SQL
○ An “untyped” Dataset (a Dataset of Row objects)
● Partitions
○ Chunks of data processed in parallel; Spark partitions DataFrames for you by default (see the sketch after this list)
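To make the DataFrame / SQL equivalence and the default partitioning concrete, here is a minimal PySpark sketch (the data and column names are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-api-demo").getOrCreate()

# A small DataFrame with a schema: column names plus value types.
df = spark.createDataFrame(
    [("Houston", 2.3), ("Austin", 0.9), ("Dallas", 1.3)],
    ["city", "population_m"],
)

# The same question asked two ways: DataFrame API vs. Spark SQL.
df.where(df.population_m > 1.0).select("city").show()
df.createOrReplaceTempView("cities")
spark.sql("SELECT city FROM cities WHERE population_m > 1.0").show()

# Spark picked the number of partitions for us; we never managed them.
print(df.rdd.getNumPartitions())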
Spark’s high-level structured APIs (2/2)
● Transformations
○ Data structures are immutable: you apply instructions called transformations to derive new ones
○ Narrow transformations (one-to-one between partitions) stay in memory and can be pipelined
○ Wide transformations (one-to-many between partitions) shuffle data, writing to disk
● Lazy evaluation
○ Transformations are collected into a streamlined plan rather than executed immediately
○ An optimized graph of computation instructions: the logical plan
● Actions
○ count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach()
○ An action triggers computation of the plan of transformations (see the sketch after this list)
○ A single job, broken down into stages and tasks executed across the cluster
○ Physical plan (clustered RDD manipulations)
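A short sketch of narrow vs. wide transformations and lazy evaluation (the data is invented; nothing executes until the final action):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
    ["key", "value"],
)

# Narrow transformation: each input partition feeds exactly one output
# partition, so Spark can pipeline it in memory.
doubled = df.withColumn("value", F.col("value") * 2)

# Wide transformation: grouping shuffles rows between partitions.
totals = doubled.groupBy("key").agg(F.sum("value").alias("total"))

# Lazy evaluation: only this action triggers a job, broken into stages
# and tasks across the cluster.
totals.show()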
Diagrams: the Catalyst Optimizer, the Structured API logical planning process, and the physical planning process.
Graphics from the book “Spark: The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
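Although the diagrams are not reproduced here, Catalyst’s work can be inspected directly: explain(True) prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.range(1000).withColumn("bucket", F.col("id") % 7)

# Prints each stage of the planning process shown in the diagrams above.
df.groupBy("bucket").count().explain(True)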
PySpark
● Use cases:
○ exploratory data analysis at scale
○ building machine learning pipelines
○ creating ETLs for a data platform
○ streaming data pipelines
● Interactive shell: pyspark
● Available as a PyPI package since Spark 2.2 (pip install pyspark)
● Differences from native Scala?
● Catalyst engine optimizations apply to any language when using the structured APIs
● pandas DataFrame & UDF integration (pandas itself is single-node); see the sketch below
● Python is the fastest growing language for data science & machine learning
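A minimal sketch of the pandas UDF integration, assuming Spark 2.3+ (where vectorized pandas UDFs were introduced); the column names and conversion function are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

df = spark.createDataFrame([(1, 20.0), (2, 25.5), (3, 31.0)], ["id", "celsius"])

# A vectorized UDF: Spark hands the Python worker whole pandas Series
# batches (exchanged efficiently via Apache Arrow) instead of one row
# at a time.
@pandas_udf("double")
def fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0

df.withColumn("fahrenheit", fahrenheit("celsius")).show()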