SlideShare a Scribd company logo
1 of 31
Download to read offline
INTRO TO PYSPARK
Jon Haddad, Technical Evangelist, DataStax
@rustyrazorblade
WHAT TOOLS ARE YOU ALREADY
USING FOR DATA ANALYSIS?
NumPy / SciPy
Pandas
iPython Notebooks
scikit-learn
hdf5
pybrain
WHAT'S THE PROBLEM?
GREAT TOOLS
BUT NOT BUILT FOR BIG DATA SETS
And not real time...
LIMITED TO 1 MACHINE
What if we have a lot of data?
What if we use Cassandra?
We need distributed computing
Use when we have more data what fits on a single machine
WHAT IS SPARK?
Fast and general purpose cluster computing system
LANGUAGES
Scala
Java
R (version >= 1.4)
Python
WHAT CAN I DO WITH IT?
Read and write data in bulk to and from Cassandra
Batch processing
Stream processing
Machine Learning
Distributed SQL
Operate on entire dataset (or at least a big chunk of it)
BATCH PROCESSING
RDD
Resilliant Distributed Dataset (it's a big list)
Use functional concepts like map, filter, reduce
Caveat: Will always pay penalty going from JVM <> Python
DATA MIGRATIONS
USERS
name favorite_food
jon bacon
luke pie
patrick pizza
rachel pizza
SET UP OUR KEYSPACE
create KEYSPACE demo WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1};
use demo ;
CREATE OUR DEMO USER TABLE
create TABLE user ( name text PRIMARY KEY,
favorite_food text );
insert into user (name, favorite_food) values ('jon', 'bacon');
insert into user (name, favorite_food) values ('luke', 'pie');
insert into user (name, favorite_food) values ('patrick', 'pizza');
insert into user (name, favorite_food) values ('rachel', 'pizza');
create table favorite_foods ( food text, name text,
primary key (food, name));
MAPPING FOODS TO USERS
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
conf = SparkConf() 
.setAppName("User Food Migration") 
.setMaster("spark://127.0.0.1:7077") 
.set("spark.cassandra.connection.host", "127.0.0.1")
sc = CassandraSparkContext(conf=conf)
users = sc.cassandraTable("demo", "user")
favorite_foods = users.map(lambda x:
{"food":x['favorite_food'],
"name":x['name']} )
favorite_foods.saveToCassandra("demo", "favorite_foods")
MIGRATION RESULTS
cqlsh:demo> select * from favorite_foods ;
food | name
-------+---------
pizza | patrick
pizza | rachel
pie | luke
bacon | jon
(4 rows)
cqlsh:demo> select * from favorite_foods where food = 'pizza';
food | name
-------+---------
pizza | patrick
pizza | rachel
AGGREGATIONS
u = sc.cassandraTable("demo", "user")
u.map(lambda x: (x['favorite_food'], 1)).
reduceByKey(lambda x, y: x + y).collect()
[(u'bacon', 1), (u'pie', 1), (u'pizza', 2)]
RDDS ARE COOL
And very powerful
But kind of annoying
DATAFRAMES
From R language
Available in Python via Pandas
DataFrames allow for optimized filters, sorting, grouping
With Spark, all the data stays in the JVM
With Cassandra it's still expensive due to JVM <> Python
But it can be fixed
DATAFRAMES EXAMPLE
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext # needed for toDF()
users = sc.cassandraTable("demo", "user").toDF()
food_count = users.select("favorite_food").
groupBy("favorite_food").count()
food_count.collect()
[Row(favorite_food=u'bacon', count=1),
Row(favorite_food=u'pizza', count=2),
Row(favorite_food=u'pie', count=1)]
SPARKSQL
Register dataframes as tables
JOIN, GROUP BY
SPARKSQL IN ACTION
sql = SQLContext(sc)
users = sc.cassandraTable("demo", "user").toDF()
users.registerTempTable("users")
sql.sql("""select favorite_food, count(favorite_food)
from users group by favorite_food """).collect()
[Row(favorite_food=u'bacon', c1=1),
Row(favorite_food=u'pizza', c1=2),
Row(favorite_food=u'pie', c1=1)]
STREAMING
Operate on batch windows
Each batch is a small RDD
PRETTY PICTURE
STREAMING
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
stream = StreamingContext(sc, 1) # 1 second window
kafka_stream = KafkaUtils.createStream(stream, 
"localhost:2181", 
"raw-event-streaming-consumer",
{"pageviews":1})
# manipulate kafka_stream as an RDD
stream.start()
stream.awaitTermination()
MACHINE LEARNING
Supervised learning
Unsupervised learning
SUPERVISED LEARNING
When we know the inputs and outputs
Example: Real estate prices
Take existing knowledge about houses and prices
Build a model to predict the future
UNSUPERVISED LEARNING
When we don't know the output
Popular usage: discover groups
INTERACTIVE IPYTHON NOTEBOOKS
Iterate quickly
Visualize your data
GET STARTED!
Open Source:
Download Cassandra
Download Spark
Cassandra PySpark Repo:
https://github.com/TargetHolding/pyspark-cassandra
Integrated solution
Download DataStax Enterprise

More Related Content

What's hot

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookDatabricks
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinDataStax Academy
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDatabricks
 
SQL Queries
SQL QueriesSQL Queries
SQL QueriesNilt1234
 
Oracle Table Partitioning - Introduction
Oracle Table Partitioning  - IntroductionOracle Table Partitioning  - Introduction
Oracle Table Partitioning - IntroductionMyOnlineITCourses
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache CassandraPatrick McFadin
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexingYoshinori Matsunobu
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and OptimizationPgDay.Seoul
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 

What's hot (20)

Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
SQL Queries
SQL QueriesSQL Queries
SQL Queries
 
Oracle Table Partitioning - Introduction
Oracle Table Partitioning  - IntroductionOracle Table Partitioning  - Introduction
Oracle Table Partitioning - Introduction
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexing
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
CockroachDB
CockroachDBCockroachDB
CockroachDB
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
[Pgday.Seoul 2021] 2. Porting Oracle UDF and Optimization
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 

Similar to Intro to py spark (and cassandra)

Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
PYTHON PANDAS.pptx
PYTHON PANDAS.pptxPYTHON PANDAS.pptx
PYTHON PANDAS.pptxXICSStudents
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentJim Mlodgenski
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Applicationsupertom
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 

Similar to Intro to py spark (and cassandra) (20)

Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Data Science
Data ScienceData Science
Data Science
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Pig
PigPig
Pig
 
PYTHON PANDAS.pptx
PYTHON PANDAS.pptxPYTHON PANDAS.pptx
PYTHON PANDAS.pptx
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL Environment
 
Using Cassandra with your Web Application
Using Cassandra with your Web ApplicationUsing Cassandra with your Web Application
Using Cassandra with your Web Application
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 

More from Jon Haddad

Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsJon Haddad
 
Performance tuning
Performance tuningPerformance tuning
Performance tuningJon Haddad
 
Cassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day TorontoCassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day TorontoJon Haddad
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Jon Haddad
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core ConceptsJon Haddad
 
Enter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkEnter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkJon Haddad
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessJon Haddad
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Jon Haddad
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to CassandraJon Haddad
 
Python and cassandra
Python and cassandraPython and cassandra
Python and cassandraJon Haddad
 
Python performance profiling
Python performance profilingPython performance profiling
Python performance profilingJon Haddad
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraJon Haddad
 
Python & Cassandra - Best Friends
Python & Cassandra - Best FriendsPython & Cassandra - Best Friends
Python & Cassandra - Best FriendsJon Haddad
 
Introduction to Cassandra - Denver
Introduction to Cassandra - DenverIntroduction to Cassandra - Denver
Introduction to Cassandra - DenverJon Haddad
 
Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Jon Haddad
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandraJon Haddad
 
Cassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica ColoftCassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica ColoftJon Haddad
 

More from Jon Haddad (17)

Cassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten YearsCassandra Performance Tuning Like You've Been Doing It for Ten Years
Cassandra Performance Tuning Like You've Been Doing It for Ten Years
 
Performance tuning
Performance tuningPerformance tuning
Performance tuning
 
Cassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day TorontoCassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day Toronto
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy SparkEnter the Snake Pit for Fast and Easy Spark
Enter the Snake Pit for Fast and Easy Spark
 
Cassandra 3.0 Awesomeness
Cassandra 3.0 AwesomenessCassandra 3.0 Awesomeness
Cassandra 3.0 Awesomeness
 
Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)Spark and cassandra (Hulu Talk)
Spark and cassandra (Hulu Talk)
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Python and cassandra
Python and cassandraPython and cassandra
Python and cassandra
 
Python performance profiling
Python performance profilingPython performance profiling
Python performance profiling
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
Python & Cassandra - Best Friends
Python & Cassandra - Best FriendsPython & Cassandra - Best Friends
Python & Cassandra - Best Friends
 
Introduction to Cassandra - Denver
Introduction to Cassandra - DenverIntroduction to Cassandra - Denver
Introduction to Cassandra - Denver
 
Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014Diagnosing Problems in Production: Cassandra Summit 2014
Diagnosing Problems in Production: Cassandra Summit 2014
 
Crash course intro to cassandra
Crash course   intro to cassandraCrash course   intro to cassandra
Crash course intro to cassandra
 
Cassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica ColoftCassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica Coloft
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Intro to py spark (and cassandra)

  • 1. INTRO TO PYSPARK Jon Haddad, Technical Evangelist, DataStax @rustyrazorblade
  • 2. WHAT TOOLS ARE YOU ALREADY USING FOR DATA ANALYSIS? NumPy / SciPy Pandas iPython Notebooks scikit-learn hdf5 pybrain
  • 3. WHAT'S THE PROBLEM? GREAT TOOLS BUT NOT BUILT FOR BIG DATA SETS And not real time...
  • 4. LIMITED TO 1 MACHINE What if we have a lot of data? What if we use Cassandra? We need distributed computing
  • 5. Use when we have more data what fits on a single machine WHAT IS SPARK? Fast and general purpose cluster computing system
  • 7. WHAT CAN I DO WITH IT? Read and write data in bulk to and from Cassandra Batch processing Stream processing Machine Learning Distributed SQL
  • 8. Operate on entire dataset (or at least a big chunk of it) BATCH PROCESSING
  • 9. RDD Resilliant Distributed Dataset (it's a big list) Use functional concepts like map, filter, reduce Caveat: Will always pay penalty going from JVM <> Python
  • 11. USERS name favorite_food jon bacon luke pie patrick pizza rachel pizza
  • 12. SET UP OUR KEYSPACE create KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; use demo ;
  • 13. CREATE OUR DEMO USER TABLE create TABLE user ( name text PRIMARY KEY, favorite_food text ); insert into user (name, favorite_food) values ('jon', 'bacon'); insert into user (name, favorite_food) values ('luke', 'pie'); insert into user (name, favorite_food) values ('patrick', 'pizza'); insert into user (name, favorite_food) values ('rachel', 'pizza'); create table favorite_foods ( food text, name text, primary key (food, name));
  • 14. MAPPING FOODS TO USERS from pyspark_cassandra import CassandraSparkContext, Row from pyspark import SparkContext, SparkConf conf = SparkConf() .setAppName("User Food Migration") .setMaster("spark://127.0.0.1:7077") .set("spark.cassandra.connection.host", "127.0.0.1") sc = CassandraSparkContext(conf=conf) users = sc.cassandraTable("demo", "user") favorite_foods = users.map(lambda x: {"food":x['favorite_food'], "name":x['name']} ) favorite_foods.saveToCassandra("demo", "favorite_foods")
  • 15. MIGRATION RESULTS cqlsh:demo> select * from favorite_foods ; food | name -------+--------- pizza | patrick pizza | rachel pie | luke bacon | jon (4 rows) cqlsh:demo> select * from favorite_foods where food = 'pizza'; food | name -------+--------- pizza | patrick pizza | rachel
  • 16. AGGREGATIONS u = sc.cassandraTable("demo", "user") u.map(lambda x: (x['favorite_food'], 1)). reduceByKey(lambda x, y: x + y).collect() [(u'bacon', 1), (u'pie', 1), (u'pizza', 2)]
  • 17. RDDS ARE COOL And very powerful But kind of annoying
  • 18. DATAFRAMES From R language Available in Python via Pandas DataFrames allow for optimized filters, sorting, grouping With Spark, all the data stays in the JVM With Cassandra it's still expensive due to JVM <> Python But it can be fixed
  • 19. DATAFRAMES EXAMPLE from pyspark_cassandra import CassandraSparkContext, Row from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext # needed for toDF() users = sc.cassandraTable("demo", "user").toDF() food_count = users.select("favorite_food"). groupBy("favorite_food").count() food_count.collect() [Row(favorite_food=u'bacon', count=1), Row(favorite_food=u'pizza', count=2), Row(favorite_food=u'pie', count=1)]
  • 20. SPARKSQL Register dataframes as tables JOIN, GROUP BY
  • 21. SPARKSQL IN ACTION sql = SQLContext(sc) users = sc.cassandraTable("demo", "user").toDF() users.registerTempTable("users") sql.sql("""select favorite_food, count(favorite_food) from users group by favorite_food """).collect() [Row(favorite_food=u'bacon', c1=1), Row(favorite_food=u'pizza', c1=2), Row(favorite_food=u'pie', c1=1)]
  • 22. STREAMING Operate on batch windows Each batch is a small RDD
  • 24. STREAMING from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils stream = StreamingContext(sc, 1) # 1 second window kafka_stream = KafkaUtils.createStream(stream, "localhost:2181", "raw-event-streaming-consumer", {"pageviews":1}) # manipulate kafka_stream as an RDD stream.start() stream.awaitTermination()
  • 26. SUPERVISED LEARNING When we know the inputs and outputs Example: Real estate prices Take existing knowledge about houses and prices Build a model to predict the future
  • 27. UNSUPERVISED LEARNING When we don't know the output Popular usage: discover groups
  • 28.
  • 29. INTERACTIVE IPYTHON NOTEBOOKS Iterate quickly Visualize your data
  • 30.
  • 31. GET STARTED! Open Source: Download Cassandra Download Spark Cassandra PySpark Repo: https://github.com/TargetHolding/pyspark-cassandra Integrated solution Download DataStax Enterprise