SlideShare a Scribd company logo
1 of 21
Download to read offline
@anguenot @ilandcloud
Apache Spark and Python
unified big data analytics
Flatiron school, Houston - July 2019
@anguenot @ilandcloud
OUTLINE
● What is Apache Spark?
● The issue of Big Data
● Past, Present and Future of Spark
● Spark architecture quick overview
● Spark’s languages and APIs
● PySpark
● Community & Ecosystem
● Examples and demo using Jupyter notebooks from Andrew Spargue
@anguenot @ilandcloud
What is Apache Spark?
● Unified computing engine
● Libraries for parallel data processing on clusters
● Support multiple programming language: SQL, Scala, Java, Python, R
● Provides libraries for streaming, machine learning and graph computing
● Spark UI: Web UI to monitor and inspect jobs and tasks
● Can run anywhere: laptop to clusters (on-prem or cloud)
● De facto standard for Big Data processing across all industries and use
cases
● Open Source @ Apache Software Foundation
@anguenot @ilandcloud
Unified
● One (1) compute engine
● One (1) set of APIs
● One (1) way of developing and deploying applications (or jobs)
● Data loading, machine learning, streaming computation, etc.
● Interactive or traditional application deployment
● Code reuse and access to multiple libraries
@anguenot @ilandcloud
Compute engine
● Compute engine only: not a persistent data storage system
● Supports a wide range of persistent storage systems: Amazon S3, Azure
Storage, Apache Hadoop, Apache Cassandra, etc.
● Easier to deploy and maintain without persistent storage layer
● Also supports message buses such as Apache Kafka
@anguenot @ilandcloud
The issue of Big Data
● Before applications mainly using single processor
● Processors stopped going faster since 2005
● Amount of data increases
● Price of storage decreases: cheap to store data
● Solution: cluster and parallel CPU cores
● Performances
○ In-memory processing faster
○ Exit Hadoop / MapReduce at least for real-time analytics and streaming
@anguenot @ilandcloud
Past, Present and future of Spark
● UC Berkeley, CA in 2009 @ Spark research project
● Hadoop / MapReduce first Open Source parallel computing engine for clusters
● Issue of multiple passes over the data requiring multiple jobs with each pass writing
results on disks
● Spark enabled multistep applications with efficient in-memory data sharing in
between steps (batch only at first)
● Interactive and ad-hoc queries (data scientist)
● Apache foundation project in 2013
● Spark 1.0 in 2014 introduced Spark SQL and structured data
● Then came: structured streaming, machine learning pipelines and graphs
● Now: de facto standard in all industries: Netflix, Uber, CERN, MIT, Harvard, etc.
@anguenot @ilandcloud
Spark Architecture Overview
Graphics from https://spark.apache.org
@anguenot @ilandcloud
Spark Language’s
● Scala: default language
● Java: available but not popular
● Python: supports nearly all constructs that Scala supports
● SQL: subset of SQL 2003 Standard
● R: SparkR and sparklyr (community) but Python in the process of eating R in
the data scientist communities
@anguenot @ilandcloud
Spark’s high level structured APIs (1/2)
● Spark Session
○ driver process controlling the Spark application throughout the cluster
○ One-to-one correspondence
● DataFrames
○ Most common structured API
○ Table of data with rows and columns with a schema (column name and value type)
○ Think distributed!
○ Columns / Rows and Spark Types
○ Basically the same as Table and views against which you execute SQL w/ Spark SQL
○ “Untyped” DataSet
● Partitions
○ Parallel executions of chunked data. Spark does it for you by default if using Dataframes
@anguenot @ilandcloud
● Transformations
○ Data structures are immutables: you need to apply instructions called transformations
○ In-memory (narrow transformations) with pipelining when one-to-one partition transformation
○ Shuffled (wide transformations) with disk writes when one-to-many partitions transformation
● Lazy Evaluation
○ Streamlined plan of transformations
○ Optimized graph of computation instructions: logical plan
● Actions
○ count(), collect(), take(n), top(), countByValues(), reduce(), fold(), aggregate(), foreach()
○ Triggers computation against the plan of transformations
○ Single job, broken down in multiple stages and tasks to execute across the cluster
○ Physical plan (clustered RDD manipulations)
Spark’s high level structured APIs (2/2)
@anguenot @ilandcloud
Spark’s application
Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
@anguenot @ilandcloud
Spark’s toolkit
Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
@anguenot @ilandcloud
The Catalyst Optimizer
Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
@anguenot @ilandcloud
The Structured API logical planning process
Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
@anguenot @ilandcloud
The Physical Planning Process
Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
@anguenot @ilandcloud
PySpark
● Use cases:
○ exploratory data analysis at scale
○ building machine learning pipelines
○ creating ETLs for a data platform
○ streaming data pipelines
● Interactive pyspark-shell
● Since Spark 2.2 available as a PyPI package
● Differences with native Scala?
● Spark Catalyst Engine optimization w/ structured APIs
● Pandas Dataframes & UDFs integration (single node)
● Python is the fastest growing language for data science & machine learning
@anguenot @ilandcloud
● Apache Spark website: https://spark.apache.org/
● Mailing lists:
user@spark.apache.org
dev@spark.apache.org
● Community resources:
○ https://spark.apache.org/community.html
● Spark Packages: https://spark-packages.org/
● Spark Summit
● Local meetups:
○ @ Houston: https://www.meetup.com/Houston-Spark-Meetup/
Ecosystem & Community
@anguenot @ilandcloud
“Software is like sex: it's better when it's free.”
-Linus Torvalds, creator of Linux (and GIT)
@anguenot @ilandcloud
Examples and demo with
Andrew Spargue
https://github.com/spraguesy/spark-ncaa-bb
@anguenot @ilandcloud
Q&A

More Related Content

What's hot

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Python Programming and GIS
Python Programming and GISPython Programming and GIS
Python Programming and GISJohn Reiser
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathSpark Summit
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to datasetdatamantra
 
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...Henry Saputra
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Python in geospatial analysis
Python in geospatial analysisPython in geospatial analysis
Python in geospatial analysisSakthivel R
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
Apache spark vs hadoop
Apache spark vs hadoopApache spark vs hadoop
Apache spark vs hadoopPrwaTech
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Sparkdatamantra
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Databricks
 
Python and GIS: Improving Your Workflow
Python and GIS: Improving Your WorkflowPython and GIS: Improving Your Workflow
Python and GIS: Improving Your WorkflowJohn Reiser
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsDatabricks
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow ManagementRomi Kuntsman
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIdatamantra
 

What's hot (20)

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Python Programming and GIS
Python Programming and GISPython Programming and GIS
Python Programming and GIS
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Python in geospatial analysis
Python in geospatial analysisPython in geospatial analysis
Python in geospatial analysis
 
Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
Apache spark vs hadoop
Apache spark vs hadoopApache spark vs hadoop
Apache spark vs hadoop
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
 
Python and GIS: Improving Your Workflow
Python and GIS: Improving Your WorkflowPython and GIS: Improving Your Workflow
Python and GIS: Improving Your Workflow
 
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data AnalyticsFugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
 

Similar to Apache Spark and Python: unified Big Data analytics

Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark ScalaKnoldus Inc.
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Holden Karau
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 

Similar to Apache Spark and Python: unified Big Data analytics (20)

Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache spark
Apache sparkApache spark
Apache spark
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Getting Started with Spark Scala
Getting Started with Spark ScalaGetting Started with Spark Scala
Getting Started with Spark Scala
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 

More from Julien Anguenot

Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...
Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...
Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...Julien Anguenot
 
Architecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.KArchitecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.KJulien Anguenot
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsJulien Anguenot
 
Cassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsCassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsJulien Anguenot
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsJulien Anguenot
 
Real-time data analytics with Cassandra at iland
Real-time data analytics with Cassandra at ilandReal-time data analytics with Cassandra at iland
Real-time data analytics with Cassandra at ilandJulien Anguenot
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersJulien Anguenot
 

More from Julien Anguenot (7)

Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...
Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...
Traitement temps réel de flux réseaux IPFIX/Netflow avec PySpark, Kafka et Ca...
 
Architecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.KArchitecture Big Data open source S.M.A.C.K
Architecture Big Data open source S.M.A.C.K
 
Apache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentialsApache Cassandra multi-datacenter essentials
Apache Cassandra multi-datacenter essentials
 
Cassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsCassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentials
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Real-time data analytics with Cassandra at iland
Real-time data analytics with Cassandra at ilandReal-time data analytics with Cassandra at iland
Real-time data analytics with Cassandra at iland
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 

Recently uploaded

Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 

Recently uploaded (20)

Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 

Apache Spark and Python: unified Big Data analytics

  • 1. @anguenot @ilandcloud Apache Spark and Python unified big data analytics Flatiron school, Houston - July 2019
  • 2. @anguenot @ilandcloud OUTLINE ● What is Apache Spark? ● The issue of Big Data ● Past, Present and Future of Spark ● Spark architecture quick overview ● Spark’s languages and APIs ● PySpark ● Community & Ecosystem ● Examples and demo using Jupyter notebooks from Andrew Spargue
  • 3. @anguenot @ilandcloud What is Apache Spark? ● Unified computing engine ● Libraries for parallel data processing on clusters ● Support multiple programming language: SQL, Scala, Java, Python, R ● Provides libraries for streaming, machine learning and graph computing ● Spark UI: Web UI to monitor and inspect jobs and tasks ● Can run anywhere: laptop to clusters (on-prem or cloud) ● De facto standard for Big Data processing across all industries and use cases ● Open Source @ Apache Software Foundation
  • 4. @anguenot @ilandcloud Unified ● One (1) compute engine ● One (1) set of APIs ● One (1) way of developing and deploying applications (or jobs) ● Data loading, machine learning, streaming computation, etc. ● Interactive or traditional application deployment ● Code reuse and access to multiple libraries
  • 5. @anguenot @ilandcloud Compute engine ● Compute engine only: not a persistent data storage system ● Supports a wide range of persistent storage systems: Amazon S3, Azure Storage, Apache Hadoop, Apache Cassandra, etc. ● Easier to deploy and maintain without persistent storage layer ● Also supports message buses such as Apache Kafka
  • 6. @anguenot @ilandcloud The issue of Big Data ● Before applications mainly using single processor ● Processors stopped going faster since 2005 ● Amount of data increases ● Price of storage decreases: cheap to store data ● Solution: cluster and parallel CPU cores ● Performances ○ In-memory processing faster ○ Exit Hadoop / MapReduce at least for real-time analytics and streaming
  • 7. @anguenot @ilandcloud Past, Present and future of Spark ● UC Berkeley, CA in 2009 @ Spark research project ● Hadoop / MapReduce first Open Source parallel computing engine for clusters ● Issue of multiple passes over the data requiring multiple jobs with each pass writing results on disks ● Spark enabled multistep applications with efficient in-memory data sharing in between steps (batch only at first) ● Interactive and ad-hoc queries (data scientist) ● Apache foundation project in 2013 ● Spark 1.0 in 2014 introduced Spark SQL and structured data ● Then came: structured streaming, machine learning pipelines and graphs ● Now: de facto standard in all industries: Netflix, Uber, CERN, MIT, Harvard, etc.
  • 8. @anguenot @ilandcloud Spark Architecture Overview Graphics from https://spark.apache.org
  • 9. @anguenot @ilandcloud Spark Language’s ● Scala: default language ● Java: available but not popular ● Python: supports nearly all constructs that Scala supports ● SQL: subset of SQL 2003 Standard ● R: SparkR and sparklyr (community) but Python in the process of eating R in the data scientist communities
  • 10. @anguenot @ilandcloud Spark’s high level structured APIs (1/2) ● Spark Session ○ driver process controlling the Spark application throughout the cluster ○ One-to-one correspondence ● DataFrames ○ Most common structured API ○ Table of data with rows and columns with a schema (column name and value type) ○ Think distributed! ○ Columns / Rows and Spark Types ○ Basically the same as Table and views against which you execute SQL w/ Spark SQL ○ “Untyped” DataSet ● Partitions ○ Parallel executions of chunked data. Spark does it for you by default if using Dataframes
  • 11. @anguenot @ilandcloud ● Transformations ○ Data structures are immutables: you need to apply instructions called transformations ○ In-memory (narrow transformations) with pipelining when one-to-one partition transformation ○ Shuffled (wide transformations) with disk writes when one-to-many partitions transformation ● Lazy Evaluation ○ Streamlined plan of transformations ○ Optimized graph of computation instructions: logical plan ● Actions ○ count(), collect(), take(n), top(), countByValues(), reduce(), fold(), aggregate(), foreach() ○ Triggers computation against the plan of transformations ○ Single job, broken down in multiple stages and tasks to execute across the cluster ○ Physical plan (clustered RDD manipulations) Spark’s high level structured APIs (2/2)
  • 12. @anguenot @ilandcloud Spark’s application Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
  • 13. @anguenot @ilandcloud Spark’s toolkit Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
  • 14. @anguenot @ilandcloud The Catalyst Optimizer Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
  • 15. @anguenot @ilandcloud The Structured API logical planning process Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
  • 16. @anguenot @ilandcloud The Physical Planning Process Graphics from the book “Spark The Definitive Guide”: http://shop.oreilly.com/product/0636920034957.do
  • 17. @anguenot @ilandcloud PySpark ● Use cases: ○ exploratory data analysis at scale ○ building machine learning pipelines ○ creating ETLs for a data platform ○ streaming data pipelines ● Interactive pyspark-shell ● Since Spark 2.2 available as a PyPI package ● Differences with native Scala? ● Spark Catalyst Engine optimization w/ structured APIs ● Pandas Dataframes & UDFs integration (single node) ● Python is the fastest growing language for data science & machine learning
  • 18. @anguenot @ilandcloud ● Apache Spark website: https://spark.apache.org/ ● Mailing lists: user@spark.apache.org dev@spark.apache.org ● Community resources: ○ https://spark.apache.org/community.html ● Spark Packages: https://spark-packages.org/ ● Spark Summit ● Local meetups: ○ @ Houston: https://www.meetup.com/Houston-Spark-Meetup/ Ecosystem & Community
  • 19. @anguenot @ilandcloud “Software is like sex: it's better when it's free.” -Linus Torvalds, creator of Linux (and GIT)
  • 20. @anguenot @ilandcloud Examples and demo with Andrew Spargue https://github.com/spraguesy/spark-ncaa-bb