Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
In this talk I describe my recent experience working with Spark DataFrames in Python. The focus is on usability: much of the documentation does not cover common use cases such as the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick-and-dirty analytics.

1. Frustration-Reduced PySpark: Data engineering with DataFrames (Ilya Ganelin)

2. Why are we here?
• Spark for quick and easy batch ETL (no streaming)
• Actually using data frames
  • Creation
  • Modification
  • Access
  • Transformation
• Lab!
• Performance tuning and operationalization

3. What does it take to solve a data science problem?
• Data Prep
  • Ingest
  • Cleanup
  • Error-handling & missing values
  • Data munging: transformation, formatting, splitting
• Modeling
  • Feature extraction
  • Algorithm selection
  • Data creation: train, test, validate
  • Model building
  • Model scoring

4. Why Spark?
• Batch/micro-batch processing of large datasets
• Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
• Super fast if properly configured
• Bridges the gap between the old (SQL, single-machine analytics) and the new (declarative/functional distributed programming)

5. Why not Spark?
• Breaks easily with poor usage or improperly specified configs
• Scaling up to larger datasets (500 GB -> TB scale) requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms
• While there are lots of ML algorithms, a lot of them simply don’t work, don’t work at scale, or have poorly defined interfaces / documentation

6. Scala
• Yes, I recommend Scala
• Python API is underdeveloped, especially for MLlib
• Java (until Java 8) is a second-class citizen as far as convenience vs. Scala
• Spark is written in Scala, so understanding Scala helps you navigate the source
• Can leverage the spark-shell to rapidly prototype new code and constructs
• http://www.scala-lang.org/docu/files/ScalaByExample.pdf

7. Why DataFrames?
• Iterate on datasets MUCH faster
• Column access is easier
• Data inspection is easier
• groupBy and join are faster due to under-the-hood optimizations
• Some chunks of MLlib are now optimized to use data frames

8. Why not DataFrames?
• RDD API is still much better developed
• Getting data into DataFrames can be clunky
• Transforming data inside DataFrames can be clunky
• Many of the algorithms in MLlib still depend on RDDs

9. Creation
• Read in a file with an embedded header
• http://stackoverflow.com/questions/24718697/pyspark-drop-rows

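The usual pattern for the embedded-header problem (as in the StackOverflow link above) is to read the file as text, grab the first line, and filter it out. A minimal sketch, assuming the pyspark shell's built-in sc and a placeholder file data.csv:

    # Read the raw text file; the first line is an embedded header.
    lines = sc.textFile("data.csv")   # "data.csv" is a placeholder path
    header = lines.first()

    # Drop the header row and split the remaining lines into fields.
    rows = (lines.filter(lambda line: line != header)
                 .map(lambda line: line.split(",")))
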
10. DataFrame Creation
• Create a DF
  • Option A: inferred types from a Rows RDD
  • Option B: specify the schema as strings

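A sketch of both options, assuming the rows RDD from the previous sketch holds (name, age) string pairs and sqlContext is the shell's SQLContext:

    from pyspark.sql import Row

    # Option A: build Row objects and let Spark infer the column types.
    people = rows.map(lambda r: Row(name=r[0], age=int(r[1])))
    df_a = sqlContext.createDataFrame(people)

    # Option B: pass the schema as a list of column-name strings
    # (types are then inferred from the data).
    df_b = sqlContext.createDataFrame(rows, ["name", "age"])
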
11. DataFrame Creation
• Option C: define the schema explicitly
• Check your work with df.show()

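A sketch of the explicit-schema route, using the same placeholder (name, age) data:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Option C: spell out every column so nothing is inferred.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    typed = rows.map(lambda r: (r[0], int(r[1])))
    df = sqlContext.createDataFrame(typed, schema)

    # Check your work.
    df.show()
    df.printSchema()
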
12. Column Manipulation
• Selection
• GroupBy
  • Confusing! You get a GroupedData object, not an RDD or DataFrame
  • Use agg or built-ins to get back to a DataFrame
  • Can convert to an RDD with dataFrame.rdd

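For example, against the placeholder df from the earlier sketch:

    from pyspark.sql import functions as F

    # Selection: pick columns by name.
    names = df.select("name")

    # groupBy returns a GroupedData object, not a DataFrame...
    grouped = df.groupBy("name")

    # ...so use agg (or a built-in like count/avg) to get a DataFrame back.
    avg_age = grouped.agg(F.avg("age").alias("avg_age"))

    # Drop down to the RDD API when you need it.
    as_rdd = avg_age.rdd
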
13. Custom Column Functions
• Add a column with a custom function:
• http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe

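The usual route is a UDF plus withColumn, as in the linked answer. A minimal sketch (the column name is a placeholder):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap an ordinary Python function as a UDF, declaring its return type.
    blank_to_null = F.udf(lambda s: None if s == "" else s, StringType())

    # Add (or overwrite) a column computed from an existing one.
    cleaned = df.withColumn("name", blank_to_null(df["name"]))
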
14. Row Manipulation
• Filter
  • Range
  • Equality
  • Column functions
• https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column

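For instance, with the placeholder columns used earlier:

    # Range filter on a column.
    adults = df.filter(df["age"] >= 18)

    # Equality filter.
    ilyas = df.filter(df["name"] == "Ilya")

    # Column functions from the pyspark.sql.Column docs linked above.
    i_names = df.filter(df["name"].startswith("I"))
    known_ages = df.filter(df["age"].isNotNull())
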
15. Joins
• Option A (inner join)
• Option B (explicit)
• Join types: inner, outer, left_outer, right_outer, leftsemi
• DataFrame joins benefit from Tungsten optimizations
• Note: PySpark will not drop columns for outer joins

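A sketch of both forms, assuming a second placeholder DataFrame ages that shares the name column:

    # Option A: inner join on a shared column name.
    joined_a = df.join(ages, "name")

    # Option B: explicit join condition and join type.
    joined_b = df.join(ages, df["name"] == ages["name"], "left_outer")
    # For outer joins PySpark keeps both name columns,
    # so drop or rename one of them afterwards.
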
16. Null Handling
• Built-in support for handling nulls/NA in data frames
• Drop, fill, replace
• https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions

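For example, via the DataFrameNaFunctions linked above (placeholder columns and values):

    # Drop rows with a null in any column (use subset=["age"] to check one column).
    no_nulls = df.na.drop()

    # Fill missing values per column.
    filled = df.na.fill({"age": 0, "name": "unknown"})

    # Replace specific values in chosen columns.
    replaced = df.na.replace(["N/A"], [""], ["name"])
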
17. What does it take to solve a data science problem?
• Data Prep
  • Ingest
  • Cleanup
  • Error-handling & missing values
  • Data munging: transformation, formatting, splitting
• Modeling
  • Feature extraction
  • Algorithm selection
  • Data creation: train, test, validate
  • Model building
  • Model scoring

18. Lab Rules
• Ask Google and StackOverflow before you ask me :)
• You do not have to use my code.
• Use DataFrames until you can’t.
• Keep track of what breaks!
• There are no stupid questions.

19. Lab
• Ingest Data
• Remove invalid entries or fill missing entries
• Split into test, train, validate
• Reformat a single column, e.g. map IDs or change format
• Add a custom metric or feature based on other columns
• Run a classification algorithm on this data to figure out who will survive!

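One way to do the "split into test, train, validate" step above, with arbitrary proportions:

    # Random 60/20/20 split into train, test, and validation DataFrames.
    train, test, validate = df.randomSplit([0.6, 0.2, 0.2], seed=42)
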
20. What problems did you encounter? What are you still confused about?

21. Spark Architecture

22. Partitions, Caching, and Serialization
• Partitions
  • How data is split on disk
  • Affects memory usage, shuffle size
  • Count ~ speed, Count ~ 1/memory
• Caching
  • Persist RDDs in distributed memory
  • Major speedup for repeated operations
• Serialization
  • Efficient movement of data
  • Java vs. Kryo

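A few one-liners that touch each of these, using the placeholder df from earlier (the partition count and serializer choice are examples, not recommendations):

    # Control the partition count explicitly: higher counts give more
    # parallelism and smaller per-task memory use.
    repartitioned = df.repartition(200)

    # Cache a dataset that will be reused across several actions.
    repartitioned.cache()
    repartitioned.count()   # the first action materializes the cache

    # Kryo serialization is normally chosen at launch time, e.g.:
    #   --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
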
23. Shuffle!
• All-to-all operations
  • reduceByKey, groupByKey
• Data movement
  • Serialization
  • Akka
• Memory overhead
  • Dumps to disk when OOM
  • Garbage collection
• EXPENSIVE!
• (Slide diagram: Map -> Reduce)

24. What else?
• Save your work => write completed datasets to file
• Work on small data first, then go to big data
• Create test data to capture edge cases
• LMGTFY

25. By popular demand:

    screen pyspark \
      --driver-memory 100g \
      --num-executors 60 \
      --executor-cores 5 \
      --master yarn-client \
      --conf "spark.executor.memory=20g" \
      --conf "spark.io.compression.codec=lz4" \
      --conf "spark.shuffle.consolidateFiles=true" \
      --conf "spark.dynamicAllocation.enabled=false" \
      --conf "spark.shuffle.manager=tungsten-sort" \
      --conf "spark.akka.frameSize=1028" \
      --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"

26. Any Spark on YARN
• E.g. deploy Spark 1.6 on CDH 5.4
• Download your Spark binary to the cluster and untar
• In $SPARK_HOME/conf/spark-env.sh:
  • export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
    • This tells Spark where Hadoop is deployed and gives it the link it needs to run on YARN
  • export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
    • This defines the location of the Hadoop binaries used at run time

27. References
• http://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (by Sandy Ryza)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (by Sandy Ryza)
• http://www.slideshare.net/ilganeli/frustrationreduced-pyspark-data-engineering-with-dataframes
