Migrating Complex Data Aggregations from Hadoop to Spark (Ashish Singh and Puneet Kumar, PubMatic)

Presentation at Spark Summit 2015

  1. Migrating Complex Data Aggregations from Hadoop to Spark (puneet.kumar@pubmatic.com, ashish.singh@pubmatic.com)
  2. Agenda • Aggregation and Scale @PubMatic • Problem Statement: Why Spark when we already run Hadoop? • 3 Use Cases • Configuration Tuning • Challenges & Learnings
  3. Who Are We? • Marketing Automation Software Company • Developed the industry's first Real-Time Analytics Solution
  4. PubMatic Analytics
  5. Data: Scale & Complexity
  6. Challenges on Current Stack • Ever-Increasing Hardware Costs • Complex Data Flows • Cardinality Estimation: estimating billions of distinct users • Multiple Grouping Sets • Different flows for Real-Time and Batch Analytics (Figure: Data Flow Diagram, Batch)
  7. 3 Use Cases: Cardinality Estimation, Multi-Stage Workflows, Grouping Sets (Figure: Data Flow Diagram, Batch)
  8. Why Spark? • Efficient for dependent Data Flows • Memory: Cheaper (Moore's Law) • Optimized Hardware Usage • Unified stack for Real-Time & Batch • Awesome Scala APIs
  9. Case 1: Cardinality Estimation (chart: runtime in seconds vs. input size) Spark is ~25-30% faster than Hive on MR
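The distinct-user counting in this case maps naturally onto Spark SQL's approximate distinct-count aggregate. A minimal sketch, assuming a spark-shell session (`sc` in scope) and an illustrative `impressions` table with `publisher_id` and `user_id` columns; none of these names come from the talk:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.approxCountDistinct

val sqlContext = new SQLContext(sc) // sc: SparkContext from spark-shell

// Approximate distinct users per publisher; Spark computes this with
// a HyperLogLog-style sketch rather than an exact distinct count.
val uniques = sqlContext.table("impressions") // illustrative table
  .groupBy("publisher_id")
  .agg(approxCountDistinct("user_id").as("unique_users"))

uniques.show()
```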
  10. Case 2: Multi-Stage Data Flow (chart: runtime in seconds vs. input size) Spark is ~85% faster than Hive on MR
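The multi-stage speedup comes from parsing the input once and keeping it in memory for every downstream aggregation, whereas Hive on MR re-reads the raw data per stage. A minimal RDD sketch; the path, record layout, and parser are illustrative assumptions:

```scala
import org.apache.spark.storage.StorageLevel

case class Event(publisherId: String, country: String, userId: String)

// Hypothetical tab-separated layout: publisherId, country, userId.
def parseEvent(line: String): Event = {
  val f = line.split('\t')
  Event(f(0), f(1), f(2))
}

// Stage 1: read and parse once, then cache the result.
val events = sc.textFile("hdfs:///logs/day=2015-06-01/*") // illustrative path
  .map(parseEvent)
  .persist(StorageLevel.MEMORY_ONLY_SER)

// Stages 2..n reuse the cached parse; Hive on MR would launch a
// separate job that re-reads the raw files for each of these.
val byPublisher = events.map(e => (e.publisherId, 1L)).reduceByKey(_ + _)
val byCountry   = events.map(e => (e.country, 1L)).reduceByKey(_ + _)
```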
  11. Case 3: Grouping Sets (chart: runtime in seconds for Spark vs. Hive queries at 192 GB, 384 GB, and 768 GB inputs) Spark is ~150% faster than Hive on MR
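Grouping sets compute several aggregation granularities in one pass over the data instead of one Hive query per granularity, which is where a gap of this size comes from. A hedged sketch using HiveQL GROUPING SETS through a HiveContext (available in the Spark 1.2+ era); the table and columns are illustrative:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// One shuffle produces per-(publisher, country), per-publisher,
// per-country, and grand-total aggregates.
val rollups = hiveContext.sql("""
  SELECT publisher_id, country, SUM(revenue) AS revenue
  FROM impressions
  GROUP BY publisher_id, country
  GROUPING SETS ((publisher_id, country), (publisher_id), (country), ())
""")
```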
  12. Challenges faced • Spark on YARN: executors did not use full memory • Reading nested Avro schemas was tedious until Spark 1.2 • Had to rewrite code to leverage Spark-Avro with Spark 1.3 (DataFrames) • Joins and deduplication were slower in Spark than in Hive
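The DataFrame rewrite mentioned above amounts to loading Avro through the external com.databricks:spark-avro package rather than hand-rolling record parsing. A minimal Spark 1.3-style sketch; the path and nested field names are illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Requires the spark-avro package on the classpath, e.g.
//   spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
val impressions = sqlContext.load(
  "hdfs:///data/impressions.avro",   // illustrative path
  "com.databricks.spark.avro")

// Nested Avro records surface as nested DataFrame columns.
impressions.select("user.id", "geo.country").show()
```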
  13. Important Performance Params • SET spark.default.parallelism • SET spark.serializer: Kryo serialization improved the runtime • SET spark.sql.inMemoryColumnarStorage.compressed: set to true for Snappy compression • SET spark.sql.inMemoryColumnarStorage.batchSize: increase to a higher optimum value • SET spark.shuffle.memoryFraction
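The same knobs expressed as SparkConf settings; the values below are illustrative starting points, not the tuned numbers from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("aggregation-pipeline") // illustrative name
  .set("spark.default.parallelism", "200") // roughly 2-3x total executor cores
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.sql.inMemoryColumnarStorage.compressed", "true") // compressed cached columns
  .set("spark.sql.inMemoryColumnarStorage.batchSize", "20000") // larger column batches
  .set("spark.shuffle.memoryFraction", "0.4") // Spark 1.x shuffle-memory knob

val sc = new SparkContext(conf)
```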
  14. Memory-Based Architecture (diagram: Flows 1-3 read from an In-Memory Distributed Store backed by HDFS and S3)
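A sketch of the pattern in this diagram, assuming Tachyon (now Alluxio) as the in-memory distributed store commonly paired with Spark at the time: stage the prepared input there once, then point each flow at the in-memory copy instead of HDFS/S3. The URI, port, and table name are illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc from spark-shell

// Stage the prepared input once in the in-memory store.
sqlContext.table("impressions") // illustrative source table
  .saveAsParquetFile("tachyon://tachyon-master:19998/staging/impressions")

// Flows 1..3 (even separate applications) then read the in-memory
// copy instead of going back to HDFS/S3 for every job.
val shared = sqlContext.parquetFile("tachyon://tachyon-master:19998/staging/impressions")
val flow1  = shared.groupBy("publisher_id").count()
```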
  15. Conclusions: • Spark multi-stage workflows were faster by 85% over Hive on MR • Single-stage workflows did not see huge benefits • HLL mask generation and heavy jobs finished 20-30% faster • Use in-memory distributed storage with Spark for multiple jobs on the same input • Overall hardware cost is expected to decrease by ~35% due to Spark usage (more memory, fewer nodes)
  16. THANK YOU!
