SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
May, 2015
Douglas Eisenstein - Advanti
Stanislav Seltser - Advanti
BOSTON 2015
@opendatasci
O P E N
D A T A
S C I E N C E
C O N F E R E N C E_
Spark, Python, and Parquet
Learn How to Use Spark, Python, and Parquet for
Loading and Transforming Data in 45 Minutes
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Agenda
Use Case Background
What’s Spark and Parquet?
Demo: code + data = fun part
Questions
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
TAKEAWAYS: WHAT WILL YOU LEARN TODAY?
Learn new technologies and be motivated to start using them
3	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
In about the same time it will take you to mow your lawn…
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Takeaways: you will learn …
1.  What is Spark and Parquet and why to consider it?
2.  A live demo showing you 5 useful transforms in Spark
3.  Instructions for DIY cluster: Spark, Hadoop, Hive, Parquet, and C*
4.  “Take home” fund holdings open dataset from Morningstar
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
USE CASE: DATASET, TRADEOFFS, SOLUTION
What’s our use case? Why choose Spark?
6	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Holdings Data
“The contents of an investment portfolio held by an individual or entity such as a mutual
fund or pension fund.” – Investopedia 2015
Sources:
o  3rd Parties: Morningstar, Lipper, FactSet, Bloomberg, Thomson Reuters
o  Internal: Specialized portfolio accounting systems per asset type
o  Public: SEC 13F’s when AUM is >= $100M, includes hedge funds
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Use Case: Challenges
1.  Millions of files, Reshape data, pre-aggregate holdings,
disambiguate securities, unstack/stack data
2.  Create wide tables (think columnar) for storing time series data
about 1M instruments over 12 months for 381k funds
3.  Rules are abstracted to work on “holdings” making them vendor-
agnostic and reusable (13F’s, 3rd parties, proprietary, etc)
4.  All sorts of “data usability” issues: missing positions, missing
identifiers, sparse derivative data, irregular fund reporting, etc
5.  Need to create your own “ownership views” by asset class and
period (ex. Fixed Income Ownership View)
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Open Dataset Description
Key Points
o  Holdings are electively contributed for fresher ranks
o  Long and Short positions included, unique…
o  Datasets: free open data, trial data, paid licensed data
o  Our dataset starts in 2014-01 across all fund types
Descriptive Statistics
o  History starts in 2000+
o  Fund types: Open, Closed, ETFs, SMAs
o  381k funds since 2014
o  1M unique securities since 2014
o  82 legal types, ex Corporate Bond or Equity Swap
Open Dataset: open-ended funds for report period 2014-06-10
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Common Usages
1.  Ownership: Top N largest long/short positions per asset class
2.  Flow: Compute holdings flow across various financial assets
3.  Comparison: Peer group analysis through holdings attribution
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What’s the best part about starting a new project? You get to “play”
with new tech toys right? But which ones….?
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Data Processing Landscape
o  f
Too
many
choices
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Modern General Purpose Data Pipeline
Why:
1.  Spark for in-memory transforms/analytics and fast exploration
2.  Parquet for fast columnar data retrieval
3.  Python as the glue layer and to re-use data transforms
Data Pipeline:
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Granular Options
Remember… Technology moves quickly!
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
SPARK: WHAT IS IT?
Everyone hears about it, but what is Spark?
15	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
If you do any Googling, don’t search for Spark 1.4
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Spark?
Highlights
o  General purpose data processing engine
o  In-memory data persistence
o  Interactive shells in Scala and Python
Key Concepts
o  RDD: Collections of objects stored in RAM or on Disk
o  Transforms: map(), filter(), reduceByKey(), join()
o  Actions: count(), collect(), save()
Spark SQL
o  Runs SQL or HQL queries
o  In-line SQL UDF
o  Distributed DataFrame
o  Parquet, Hive, Cassandra
Graphic	
  sourced:	
  hAps://goo.gl/fpLrSs	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is a Spark DataFrame?
What’s DataFrame?
•  A collection of rows organized into named columns
•  Expressive language for wrangling data into something usable
•  Ability to select, filter, and aggregate structured data
What’s a Spark DataFrame?
•  Distributed collection of data organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Relational data processing: project, filter, aggregate, join, etc
•  Operations: groupBy(),
join(), sql(),
unique()
•  Create UDF’s and push them
into SparkSQL
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Distributed Data Aggregation
o  Distributed high-performance data, once only available by
expensive appliances: Netezza, GreenPlum, Exadata, Teradata, etc
o  Code 1-liner doing a SUM() between Pandas (standalone) and
Spark (distributed)
Pandas DataFrame Spark DataFrame
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
PARQUET: WHAT IS IT?
No, it’s not your Grandma’s flooring…
20	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Parquet?
Parquet File Format Parquet in HDFS
“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or programming language.” –
parquet.apache.org
•  Columnar File Format
•  Supports Nested Data Structures
•  Not tied to any commercial framework
•  Accessible by HIVE, Spark, Pig, Drill, MR
•  R/W in HDFS or local file system
•  Gaining Strong Usage…
Features	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
What is Columnar Data?
•  Limit’s IO to data needed
•  Columnar compresses better
•  Type specific encodings available
•  Enables vectorized execution engines
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
How is Columnar Data Read?
Projection à
9 columns
Predicate à
50 Columns Wide
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Spark DataFrame è Parquet
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
DEMO: HIVE, PARQUET, SPARK SQL, DATAFRAME
Prepare CSV’s in HIVE, persist in Parquet, show Spark SQL and
DataFrame transforms using interactive shell in PySpark
25	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Spark SQL is Experimental
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Demo Outline
T
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
OBSTACLES: NOTHING IS EASY
Expect workarounds and time spent hacking, but consider this, it’s
a learning opportunity…
28	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Nothing is easy – unless you’re Ricky Bobby – “Shake and Bake”
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Obstacles
Spark
•  Don’t expect Pandas-convenient Spark DataFrames (yet at least)
(e.g. no upsampling, backfilling, etc)
•  RDD’s are powerful, but you’ll need to get your hands dirty writing
lower-level code (not a bad thing)
•  Allocate enough memory on all of your nodes, it’s a hog!
Parquet
•  Date and binary support are pending although timestamp,
decimal, char, and varchar are now supported in Hive 0.14.0
•  You won’t get Vertica or Cassandra level response times (maybe in
the future – that’s my speculation)
Getting Started
•  Learn by doing: reading is useful to gain the basics but just “jump
right in and do it”
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
QUESTIONS
Douglas Eisenstein
@dougeisenstein
doug.eisenstein@advantisolutions.com
Helping to create fast, reliable, and transparent modern data pipelines for financial analytics
Talk feedback: https://goo.gl/Y0UuWy
31	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
APPENDIX
Using Spark, Python, and Cassandra for Loading and
Transformations at Scale
32	
  
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
Resources
hAp://training.databricks.com/workshop/itas_workshop.pdf	
  (Intro	
  to	
  Spark	
  PDF)	
  	
  
hAps://databricks-­‐training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf	
  (Spark	
  SQL	
  Training	
  Module)	
  	
  
hAp://training.databricks.com/workshop/itas_workshop.pdf	
  (Spark	
  101)	
  	
  
hAp://spark-­‐summit.org/2014/training	
  (General	
  Spark	
  Videos)	
  	
  
hAps://spark.apache.org/docs/1.3.0/sql-­‐programming-­‐guide.html	
  (Spark	
  SQL	
  and	
  DataFrame’s	
  —	
  awesome)	
  	
  
hAp://www.slideshare.net/EvanChan2/2014-­‐07olapcassspark	
  (OLAP	
  with	
  Cassandra	
  and	
  Spark)	
  	
  
hAps://databricks.com/blog/2015/03/24/spark-­‐sql-­‐graduates-­‐from-­‐alpha-­‐in-­‐spark-­‐1-­‐3.html	
  (Latest	
  Spark/DataFrame/Parquet)	
  	
  
hAps://spark.apache.org/docs/latest/programming-­‐guide.html	
  (Basics	
  of	
  Spark	
  Development)	
  	
  
hAps://spark.apache.org/docs/latest/configura=on.html	
  (Spark	
  Configura=on)	
  	
  
hAps://academy.datastax.com/demos/datastax-­‐enterprise-­‐joining-­‐tables-­‐apache-­‐spark	
  (Joining	
  Cassandra	
  Tables	
  in	
  Spark)	
  	
  
hAp://www.infoobjects.com/author/rishi/	
  (Spark	
  /	
  Parquet	
  Integra=on)	
  	
  
hAps://parquet.incubator.apache.org/presenta=ons/	
  (Parquet	
  Videos/Slides,	
  awesome)	
  	
  
hAp://www.slideshare.net/databricks/spark-­‐sqlsse2015public	
  (All	
  about	
  DataFrame	
  by	
  the	
  author)	
  	
  
hAp://www.infoobjects.com/category/spark_cookbook/	
  (Good	
  walkthrough	
  of	
  Spark	
  Demos)	
  	
  
hAp://tobert.github.io/post/2014-­‐07-­‐15-­‐installing-­‐cassandra-­‐spark-­‐stack.html	
  (Spark	
  /	
  Cassandra	
  Integra=on	
  from	
  scratch)	
  	
  
hAp://blog.cloudera.com/blog/2015/05/working-­‐with-­‐apache-­‐spark-­‐or-­‐how-­‐i-­‐learned-­‐to-­‐stop-­‐worrying-­‐and-­‐love-­‐the-­‐
shuffle/	
  (Spark	
  by	
  Data	
  Engineer)	
  	
  
hAp://blog.cloudera.com/blog/2015/03/how-­‐to-­‐tune-­‐your-­‐apache-­‐spark-­‐jobs-­‐part-­‐1/	
  (Tuning	
  Spark)	
  	
  
hAps://spark-­‐summit.org/2015-­‐east/wp-­‐content/uploads/2015/03/SSE15-­‐21-­‐Sandy-­‐Ryza.pdf	
  ()	
  	
  
hAps://www.youtube.com/watch?v=0OM68k3np0E&list=PL-­‐x35fyliRwiiYSXHyI61RXdHlYR3QjZ1&index=5	
  ()	
  
http://www.slideshare.net/dataera/parquet-format
Open	
  Data	
  Science	
  Conference	
  2015	
  –	
  Douglas	
  Eisenstein	
  of	
  Advan=	
  
About Me
o  My vision is to make data preparation fast and reliable
o  I help financial firms with data-intensive processes
o 
o  In my spare time: CrossFit, Baseball, No Gardening!
@dougeisenstein

Contenu connexe

Tendances

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 

Tendances (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
PySaprk
PySaprkPySaprk
PySaprk
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 

Similaire à Spark, Python and Parquet

Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Spark Intro @ analytics big data summit
Spark  Intro @ analytics big data summitSpark  Intro @ analytics big data summit
Spark Intro @ analytics big data summitSujee Maniyam
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
Ischools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryIschools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryARDC
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsYannick Pouliot
 
Data Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into DatabricksData Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into DatabricksKnoldus Inc.
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackDenodo
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark ZaranTech LLC
 
How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)Rainer Sternfeld
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSchool of Data
 
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationDenodo
 
IIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data ScienceIIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data Scienceds4good
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesAlessandro Adamou
 

Similaire à Spark, Python and Parquet (20)

Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Spark Intro @ analytics big data summit
Spark  Intro @ analytics big data summitSpark  Intro @ analytics big data summit
Spark Intro @ analytics big data summit
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Ischools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryIschools workshop - 4 - data discovery
Ischools workshop - 4 - data discovery
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Data Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into DatabricksData Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into Databricks
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data Journalism
 
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
IIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data ScienceIIPGH Webinar 1: Getting Started With Data Science
IIPGH Webinar 1: Getting Started With Data Science
 
Databases for Data Science
Databases for Data ScienceDatabases for Data Science
Databases for Data Science
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiences
 

Plus de odsc

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer odsc
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discoveryodsc
 
API Driven Development
API Driven Development API Driven Development
API Driven Development odsc
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysisodsc
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Upodsc
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hiveodsc
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depthodsc
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Informationodsc
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLodsc
 
Beyond Names
Beyond NamesBeyond Names
Beyond Namesodsc
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500odsc
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Dataodsc
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Scienceodsc
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions odsc
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learnodsc
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Toolsodsc
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypseodsc
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science odsc
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Researchodsc
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering odsc
 

Plus de odsc (20)

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discovery
 
API Driven Development
API Driven Development API Driven Development
API Driven Development
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Up
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depth
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Information
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure ML
 
Beyond Names
Beyond NamesBeyond Names
Beyond Names
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Data
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Tools
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypse
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 

Dernier

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Spark, Python and Parquet

  • 1. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   May, 2015 Douglas Eisenstein - Advanti Stanislav Seltser - Advanti BOSTON 2015 @opendatasci O P E N D A T A S C I E N C E C O N F E R E N C E_ Spark, Python, and Parquet Learn How to Use Spark, Python, and Parquet for Loading and Transforming Data in 45 Minutes
  • 2. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Agenda Use Case Background What’s Spark and Parquet? Demo: code + data = fun part Questions
  • 3. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   TAKEAWAYS: WHAT WILL YOU LEARN TODAY? Learn new technologies and be motivated to start using them 3  
  • 4. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   In about the same time it will take you to mow your lawn…
  • 5. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Takeaways: you will learn … 1.  What is Spark and Parquet and why to consider it? 2.  A live demo showing you 5 useful transforms in Spark 3.  Instructions for DIY cluster: Spark, Hadoop, Hive, Parquet, and C* 4.  “Take home” fund holdings open dataset from Morningstar
  • 6. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   USE CASE: DATASET, TRADEOFFS, SOLUTION What’s our use case? Why choose Spark? 6  
  • 7. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Holdings Data “The contents of an investment portfolio held by an individual or entity such as a mutual fund or pension fund.” – Investopedia 2015 Sources: o  3rd Parties: Morningstar, Lipper, FactSet, Bloomberg, Thomson Reuters o  Internal: Specialized portfolio accounting systems per asset type o  Public: SEC 13F’s when AUM is >= $100M, includes hedge funds
  • 8. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Use Case: Challenges 1.  Millions of files, Reshape data, pre-aggregate holdings, disambiguate securities, unstack/stack data 2.  Create wide tables (think columnar) for storing time series data about 1M instruments over 12 months for 381k funds 3.  Rules are abstracted to work on “holdings” making them vendor- agnostic and reusable (13F’s, 3rd parties, proprietary, etc) 4.  All sorts of “data usability” issues: missing positions, missing identifiers, sparse derivative data, irregular fund reporting, etc 5.  Need to create your own “ownership views” by asset class and period (ex. Fixed Income Ownership View)
  • 9. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Open Dataset Description Key Points o  Holdings are electively contributed for fresher ranks o  Long and Short positions included, unique… o  Datasets: free open data, trial data, paid licensed data o  Our dataset starts in 2014-01 across all fund types Descriptive Statistics o  History starts in 2000+ o  Fund types: Open, Closed, ETFs, SMAs o  381k funds since 2014 o  1M unique securities since 2014 o  82 legal types, ex Corporate Bond or Equity Swap Open Dataset: open-ended funds for report period 2014-06-10
  • 10. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Common Usages 1.  Ownership: Top N largest long/short positions per asset class 2.  Flow: Compute holdings flow across various financial assets 3.  Comparison: Peer group analysis through holdings attribution
  • 11. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What’s the best part about starting a new project? You get to “play” with new tech toys right? But which ones….?
  • 12. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Data Processing Landscape o  f Too many choices
  • 13. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Modern General Purpose Data Pipeline Why: 1.  Spark for in-memory transforms/analytics and fast exploration 2.  Parquet for fast columnar data retrieval 3.  Python as the glue layer and to re-use data transforms Data Pipeline:
  • 14. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Granular Options Remember… Technology moves quickly!
  • 15. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   SPARK: WHAT IS IT? Everyone hears about it, but what is Spark? 15  
  • 16. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   If you do any Googling, don’t search for Spark 1.4
  • 17. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Spark? Highlights o  General purpose data processing engine o  In-memory data persistence o  Interactive shells in Scala and Python Key Concepts o  RDD: Collections of objects stored in RAM or on Disk o  Transforms: map(), filter(), reduceByKey(), join() o  Actions: count(), collect(), save() Spark SQL o  Runs SQL or HQL queries o  In-line SQL UDF o  Distributed DataFrame o  Parquet, Hive, Cassandra Graphic  sourced:  hAps://goo.gl/fpLrSs  
  • 18. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is a Spark DataFrame? What’s DataFrame? •  A collection of rows organized into named columns •  Expressive language for wrangling data into something usable •  Ability to select, filter, and aggregate structured data What’s a Spark DataFrame? •  Distributed collection of data organized into named columns •  Conceptually equivalent to a table in a relational database •  Relational data processing: project, filter, aggregate, join, etc •  Operations: groupBy(), join(), sql(), unique() •  Create UDF’s and push them into SparkSQL
  • 19. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Distributed Data Aggregation o  Distributed high-performance data, once only available by expensive appliances: Netezza, GreenPlum, Exadata, Teradata, etc o  Code 1-liner doing a SUM() between Pandas (standalone) and Spark (distributed) Pandas DataFrame Spark DataFrame
  • 20. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   PARQUET: WHAT IS IT? No, it’s not your Grandma’s flooring… 20  
  • 21. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Parquet? Parquet File Format Parquet in HDFS “Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.” – parquet.apache.org •  Columnar File Format •  Supports Nested Data Structures •  Not tied to any commercial framework •  Accessible by HIVE, Spark, Pig, Drill, MR •  R/W in HDFS or local file system •  Gaining Strong Usage… Features  
  • 22. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   What is Columnar Data? •  Limit’s IO to data needed •  Columnar compresses better •  Type specific encodings available •  Enables vectorized execution engines
  • 23. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   How is Columnar Data Read? Projection à 9 columns Predicate à 50 Columns Wide
  • 24. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Spark DataFrame è Parquet
  • 25. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   DEMO: HIVE, PARQUET, SPARK SQL, DATAFRAME Prepare CSV’s in HIVE, persist in Parquet, show Spark SQL and DataFrame transforms using interactive shell in PySpark 25  
  • 26. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Spark SQL is Experimental
  • 27. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Demo Outline T
  • 28. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   OBSTACLES: NOTHING IS EASY Expect workarounds and time spent hacking, but consider this, it’s a learning opportunity… 28  
  • 29. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Nothing is easy – unless you’re Ricky Bobby – “Shake and Bake”
  • 30. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Obstacles Spark •  Don’t expect Pandas-convenient Spark DataFrames (yet at least) (e.g. no upsampling, backfilling, etc) •  RDD’s are powerful, but you’ll need to get your hands dirty writing lower-level code (not a bad thing) •  Allocate enough memory on all of your nodes, it’s a hog! Parquet •  Date and binary support are pending although timestamp, decimal, char, and varchar are now supported in Hive 0.14.0 •  You won’t get Vertica or Cassandra level response times (maybe in the future – that’s my speculation) Getting Started •  Learn by doing: reading is useful to gain the basics but just “jump right in and do it”
  • 31. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   QUESTIONS Douglas Eisenstein @dougeisenstein doug.eisenstein@advantisolutions.com Helping to create fast, reliable, and transparent modern data pipelines for financial analytics Talk feedback: https://goo.gl/Y0UuWy 31  
  • 32. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   APPENDIX Using Spark, Python, and Cassandra for Loading and Transformations at Scale 32  
  • 33. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   Resources hAp://training.databricks.com/workshop/itas_workshop.pdf  (Intro  to  Spark  PDF)     hAps://databricks-­‐training.s3.amazonaws.com/slides/SparkSQLTraining.Summit.July2014.pdf  (Spark  SQL  Training  Module)     hAp://training.databricks.com/workshop/itas_workshop.pdf  (Spark  101)     hAp://spark-­‐summit.org/2014/training  (General  Spark  Videos)     hAps://spark.apache.org/docs/1.3.0/sql-­‐programming-­‐guide.html  (Spark  SQL  and  DataFrame’s  —  awesome)     hAp://www.slideshare.net/EvanChan2/2014-­‐07olapcassspark  (OLAP  with  Cassandra  and  Spark)     hAps://databricks.com/blog/2015/03/24/spark-­‐sql-­‐graduates-­‐from-­‐alpha-­‐in-­‐spark-­‐1-­‐3.html  (Latest  Spark/DataFrame/Parquet)     hAps://spark.apache.org/docs/latest/programming-­‐guide.html  (Basics  of  Spark  Development)     hAps://spark.apache.org/docs/latest/configura=on.html  (Spark  Configura=on)     hAps://academy.datastax.com/demos/datastax-­‐enterprise-­‐joining-­‐tables-­‐apache-­‐spark  (Joining  Cassandra  Tables  in  Spark)     hAp://www.infoobjects.com/author/rishi/  (Spark  /  Parquet  Integra=on)     hAps://parquet.incubator.apache.org/presenta=ons/  (Parquet  Videos/Slides,  awesome)     hAp://www.slideshare.net/databricks/spark-­‐sqlsse2015public  (All  about  DataFrame  by  the  author)     hAp://www.infoobjects.com/category/spark_cookbook/  (Good  walkthrough  of  Spark  Demos)     hAp://tobert.github.io/post/2014-­‐07-­‐15-­‐installing-­‐cassandra-­‐spark-­‐stack.html  (Spark  /  Cassandra  Integra=on  from  scratch)     hAp://blog.cloudera.com/blog/2015/05/working-­‐with-­‐apache-­‐spark-­‐or-­‐how-­‐i-­‐learned-­‐to-­‐stop-­‐worrying-­‐and-­‐love-­‐the-­‐ shuffle/  (Spark  by  Data  Engineer)     hAp://blog.cloudera.com/blog/2015/03/how-­‐to-­‐tune-­‐your-­‐apache-­‐spark-­‐jobs-­‐part-­‐1/  (Tuning  Spark)     hAps://spark-­‐summit.org/2015-­‐east/wp-­‐content/uploads/2015/03/SSE15-­‐21-­‐Sandy-­‐Ryza.pdf  ()     hAps://www.youtube.com/watch?v=0OM68k3np0E&list=PL-­‐x35fyliRwiiYSXHyI61RXdHlYR3QjZ1&index=5  ()   http://www.slideshare.net/dataera/parquet-format
  • 34. Open  Data  Science  Conference  2015  –  Douglas  Eisenstein  of  Advan=   About Me o  My vision is to make data preparation fast and reliable o  I help financial firms with data-intensive processes o  o  In my spare time: CrossFit, Baseball, No Gardening! @dougeisenstein