
New Developments in Spark




1. New Developments in Spark. Matei Zaharia, August 18th, 2015.
2. About Databricks: Founded by the creators of Spark in 2013 and remains the top contributor. End-to-end service for Spark on EC2: interactive notebooks, dashboards, and production jobs.
3. Our Goal for Spark: A unified engine across data workloads and platforms: SQL, streaming, ML, graph, batch, and more.
4. Past 2 Years: Fast growth in libraries and integration points: a new library for SQL + DataFrames, 10x growth of the ML library, a pluggable data source API, and an R language API. Result: very diverse use of Spark. Only 40% of users run on Hadoop YARN; most users use at least 2 of Spark's built-in libraries; 98% of Databricks customers use SQL and 60% use Python.
5. Beyond Libraries: The best thing about basing Spark's libraries on a high-level API is that we can also make big changes underneath them. We are now working on some of the largest changes to Spark Core since the project began.
6. This Talk: Project Tungsten (CPU and memory efficiency); network and disk I/O; adaptive query execution.
7. Hardware Trends: storage, network, CPU.
8. Hardware Trends, 2010: Storage 50+ MB/s (HDD); Network 1 Gbps; CPU ~3 GHz.
9. Hardware Trends, 2010 vs. 2015: Storage 50+ MB/s (HDD) vs. 500+ MB/s (SSD); Network 1 Gbps vs. 10 Gbps; CPU ~3 GHz vs. ~3 GHz.
10. Hardware Trends: storage and network are each roughly 10x faster than in 2010, while CPU clock speed has stayed flat.
11. Tungsten: Preparing Spark for the Next 5 Years: Substantially speed up execution by optimizing CPU efficiency, via: (1) off-heap memory management, (2) runtime code generation, (3) cache-aware algorithms.
12. Interfaces to Tungsten [diagram]: DataFrames (Python, Java, Scala, R), RDDs, Spark SQL, and other frontends produce a data schema + query plan, which Tungsten can execute on multiple backends (JVM, LLVM, GPU, NVRAM, ...).
13. DataFrame API: A single-node tabular structure in R and Python, with APIs for relational algebra (filter, join, ...), math and stats, and input/output (CSV, JSON, ...). [Chart: Google Trends interest for "data frame".]
14. DataFrame: lingua franca for "small data"

    head(flights)
    #> Source: local data frame [6 x 16]
    #>
    #>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
    #> 1 2013     1   1      517         2      830        11      UA  N14228
    #> 2 2013     1   1      533         4      850        20      UA  N24211
    #> 3 2013     1   1      542         2      923        33      AA  N619AA
    #> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
    #> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
15. Spark DataFrames: Structured data collections with a similar API to R/Python (DataFrame = RDD + schema). Many operations are captured as expressions in a DSL, which enables rich optimizations:

    df = jsonFile("tweets.json")
    df(df("user") === "matei")
      .groupBy("date")
      .sum("retweets")

[Chart: running time of the same computation as a Python RDD, a Scala RDD, and a DataFrame.]
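As a sketch, here is a self-contained Scala version of that computation against the Spark 1.x API (the file name and the user, date, and retweets fields are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object TweetRollup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("TweetRollup").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // read.json infers the schema from the JSON records.
        val df = sqlContext.read.json("tweets.json")

        // The filter and aggregation are DSL expressions, so the optimizer
        // sees the whole plan before execution.
        df.filter(df("user") === "matei")
          .groupBy("date")
          .sum("retweets")
          .show()

        sc.stop()
      }
    }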
16. How does Tungsten help?
17. (1) Off-Heap Memory Management: Store data outside the JVM heap to avoid object overhead and GC. For RDDs: fast serialization libraries. For DataFrames and SQL: a binary format we compute on directly. This yields a 2-10x space saving, especially for strings and nested objects, and can use new RAM-like devices, e.g. flash and 3D XPoint.
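A configuration sketch of where these pieces surface to users in the Spark 1.5 era (property names as documented for that release; defaults may differ in your version):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("TungstenMemorySketch")
      // RDD path: swap Java serialization for the faster Kryo library.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // DataFrame/SQL path: Tungsten's binary row format and memory
      // management (on by default in 1.5; shown here for visibility).
      .set("spark.sql.tungsten.enabled", "true")

    val sc = new SparkContext(conf)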
18. (2) Runtime Code Generation: Generate Java code for the DataFrame and SQL expressions requested by the user. This avoids virtual calls and generics/boxing. We can do the same in core, ML, and graph: code-gen serializers, fused functions, and math expressions. [Chart: evaluating "SELECT a+a+a", time in seconds: hand-written 9.3, code generation 9.4, interpreted projection 36.7.]
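A toy illustration (not Spark's actual generated code) of why generation wins: the interpreted path walks an expression tree with a virtual call and boxing per node per row, while the generated path is straight-line primitive arithmetic the JIT can optimize:

    // Interpreted: an expression tree evaluated via virtual dispatch per row.
    sealed trait Expr { def eval(row: Array[Any]): Any }
    case class Col(i: Int) extends Expr {
      def eval(row: Array[Any]): Any = row(i)
    }
    case class Add(l: Expr, r: Expr) extends Expr {
      def eval(row: Array[Any]): Any =
        l.eval(row).asInstanceOf[Long] + r.eval(row).asInstanceOf[Long] // boxes every call
    }

    // What code generation effectively emits for "a + a + a": no virtual
    // calls, no boxing, just primitive arithmetic.
    def generated(row: Array[Long]): Long = row(0) + row(0) + row(0)

    val tree = Add(Add(Col(0), Col(0)), Col(0))
    println(tree.eval(Array[Any](7L)))  // 21, the slow way
    println(generated(Array(7L)))       // 21, the fast way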
19. (3) Cache-Aware Algorithms: Use a custom memory layout to better leverage the CPU cache. Example: AlphaSort-style prefix sort. Store prefixes of the sort key inside the pointer array, and compare prefixes to avoid full record fetches and comparisons. [Diagram: the naive layout keeps pointer -> record; the cache-friendly layout keeps (key prefix, pointer) pairs contiguous in the pointer array.]
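A minimal sketch of the AlphaSort-style idea (illustrative only, not Spark's implementation): keep a fixed-size numeric prefix of each key next to the record's index so most comparisons never touch the record itself:

    // Records live somewhere expensive to reach; here, just strings.
    val records = Array("banana", "apple", "cherry", "apricot")

    // Pack the first 8 bytes of a key into a Long, so comparing two
    // prefixes is a single integer comparison.
    def prefixOf(s: String): Long = {
      var p = 0L
      var i = 0
      while (i < 8) {
        p = (p << 8) | (if (i < s.length) s.charAt(i) & 0xffL else 0L)
        i += 1
      }
      p
    }

    // The "pointer array": (prefix, record index) pairs, stored together.
    val entries = records.indices.map(i => (prefixOf(records(i)), i))

    // Compare prefixes first; fetch the full record only on a tie.
    val sorted = entries.sortWith((a, b) =>
      if (a._1 != b._1) java.lang.Long.compareUnsigned(a._1, b._1) < 0
      else records(a._2) < records(b._2))

    sorted.foreach { case (_, i) => println(records(i)) }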
20. Tungsten Performance Results [Chart: run time in seconds vs. relative data set size (1x-16x) for four configurations: default, code gen, Tungsten on-heap, Tungsten off-heap.]
21. This Talk: Project Tungsten (CPU and memory efficiency); network and disk I/O; adaptive query execution.
22. Motivation: Network and storage speeds have improved 10x, but this speed isn't always easy to leverage! There are many challenges with keeping disk operations large (even on SSDs), keeping network connections busy and balanced across the cluster, and doing all this on many cores and many disks.
23. Sort Benchmark: Started by Jim Gray in 1987 to measure HW+SW advances; many entrants use purpose-built hardware and software. We participated in the largest category, Daytona GraySort: sort 100 TB of 100-byte records in a fault-tolerant manner. We set a new world record (tied with UCSD), saturating 8 SSDs and a 10 Gbps network per node: the first time a public cloud + open source entry won.
24. On-Disk Sort Record (time to sort 100 TB): 2013 record, Hadoop: 2100 machines, 72 minutes. 2014 record, Spark: 207 machines, 23 minutes. Spark also sorted 1 PB in 4 hours. Source: Daytona GraySort benchmark, sortbenchmark.org.
25. Saturating the Network: 1.1 GB/sec per node.
26. This Talk: Project Tungsten (CPU and memory efficiency); network and disk I/O; adaptive query execution.
27. Motivation: Query planning is crucial to performance in a distributed setting: it determines the level of parallelism in operations and the choice of algorithm (e.g. broadcast vs. shuffle join). This is hard to do well for big data even with cost-based optimization: unindexed data means we don't have statistics, and user-defined functions are hard to predict. Solution: let Spark change the query plan adaptively.
28. Traditional Spark Scheduling: the whole map -> reduce -> sort DAG for the job below is planned up front.

    file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()
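For context, a self-contained sketch of the job being scheduled ("input.txt" is a stand-in path; any RDD of words works):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountSort {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("WordCountSort").setMaster("local[*]"))

        // An RDD of words, built by splitting a text file on whitespace.
        val file = sc.textFile("input.txt").flatMap(_.split("\\s+"))

        // map -> reduce -> sort: stages separated by shuffles, which is
        // exactly the DAG the scheduler in the slide plans up front.
        val counts = file.map(word => (word, 1))
          .reduceByKey(_ + _)
          .sortByKey()

        counts.take(10).foreach(println)
        sc.stop()
      }
    }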
29.-34. Adaptive Planning (built up over six slides): the same job, file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey(), is planned one stage at a time. Spark runs the map stage first, then plans and runs the reduce stage using statistics from the map output, and finally the sort stage.
35. Advanced Example: Join. Goal: bring together data items with the same key.
36. Advanced Example: Join. Shuffle join: good if both datasets are large.
37. Advanced Example: Join. Broadcast join: good if one dataset is small enough to send to every node.
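A sketch of requesting the broadcast strategy from user code with the 1.5-era DataFrame API (the broadcast hint in org.apache.spark.sql.functions; the file names and the "key" column are assumptions, and sqlContext is as in the earlier sketch):

    import org.apache.spark.sql.functions.broadcast

    // Hypothetical inputs: a large fact table and a small lookup table.
    val large = sqlContext.read.json("events.json")
    val small = sqlContext.read.json("lookup.json")

    // Without a hint, the planner picks shuffle vs. broadcast from size
    // estimates; broadcast() marks the right side as small.
    val joined = large.join(broadcast(small), large("key") === small("key"))
    joined.explain()  // the physical plan should show a broadcast join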
38. Advanced Example: Join. Hybrid join: broadcast the popular keys, shuffle the rest.
39. Advanced Example: Join. Hybrid join: broadcast the popular keys, shuffle the rest. More details: SPARK-9850.
40. Impact of Adaptive Planning: level of parallelism, 2-3x; choice of join algorithm, as much as 10x. Follow it at SPARK-9850.
41. Effect of Optimizations in Core: Often, when we made one optimization, we saw all of the Spark components get faster. A scheduler optimization for Spark Streaming made SQL 2x faster; network optimizations sped up all communication-intensive libraries; Tungsten accelerated DataFrames, SQL, and parts of ML. The same applies to other changes in core, e.g. debug tools.
42. Conclusion: Spark has grown a lot, but it still remains the most active open source project in big data. A small core + high-level API means we can make changes quickly. New hardware enables exciting optimizations at all levels.
43. Learn More: sparkhub.databricks.com
