Airbnb primarily leverages Spark to power mission-critical data applications. In this talk, we would like to share our major production use cases, including both streaming and batch processing applications. In addition, we would like to share the optimizations that improved the throughput of the Spark Kafka connector by 10x. Furthermore, we plan to share our journey and the lessons learned while upgrading Spark 1+ applications to Spark 2+. The key takeaways include best practices learned from building and scaling production Spark applications, as well as tips for and benefits of migrating to Spark 2.x. We hope to share our experience of making Spark successful at Airbnb with a broader audience of Spark users.
7. Shared room · Private room · Entire home · Bed & breakfast · Boutique · Unique · Vacation home
Hosted by Magdalena · Madrid, Spain
Stunning at old city center
From $96 per night
8. Shared room · Private room · Entire home · Vacation home · Bed & breakfast · Boutique · Unique
Hosted by Remy · Telfs, Austria
Romantic Chalet with fantastic view
From $165 per night
9. Shared room · Private room · Entire home · Bed & breakfast · Boutique · Unique · Vacation home
Hosted by Frances and Dennis · Cottonwood, Idaho
Dog Bark Park Inn
From $132 per night
15. Airbnb Spark Use Cases
• Search
• Pricing
• Machine Learning
• Data Ingestion
• Near real-time applications
• …
16. Search Ranking
• More powerful models (DNNs) require more training data!
• To process large amounts of data, we need to leverage tools like Spark
• Allows for more complex error handling and unit tests
• Can re-use code between online system (Java) and offline data pipeline (Scala)
• Use broadcast-join to optimize data pipelines to only extract raw logs for searches of interest
arXiv paper: Applying Deep Learning To Airbnb Search
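The broadcast-join bullet on this slide can be illustrated with a small plain-Python sketch. It does not use Spark and is not the pipeline from the talk; it only shows the idea of copying a small table (here, a hypothetical list of searches of interest) to every worker as an in-memory lookup, so each partition of the large raw-log table can be filtered and joined locally without shuffling the logs. The names `broadcast_join`, `log_partitions`, and `searches_of_interest` are assumptions made for illustration. In Spark itself, this behaviour is typically requested with the `broadcast` hint from the SQL functions module.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

LogRow = Tuple[str, str]      # (search_id, raw log payload), hypothetical schema
SearchRow = Tuple[str, str]   # (search_id, search metadata), hypothetical schema


def broadcast_join(
    log_partitions: Iterable[List[LogRow]],
    searches_of_interest: Iterable[SearchRow],
) -> Iterator[Tuple[str, str, str]]:
    """Join a large partitioned table against a small table held in memory."""
    # The "broadcast" side: small enough that every worker keeps a full copy.
    lookup: Dict[str, str] = dict(searches_of_interest)

    # Each partition of the large side is scanned locally; rows whose key is
    # not in the lookup are dropped, so only logs for searches of interest
    # are extracted.
    for partition in log_partitions:
        for search_id, payload in partition:
            if search_id in lookup:
                yield search_id, payload, lookup[search_id]


if __name__ == "__main__":
    searches = [("s1", "madrid"), ("s3", "telfs")]
    logs = [
        [("s1", "click"), ("s2", "view")],
        [("s3", "book"), ("s1", "view")],
    ]
    for row in broadcast_join(logs, searches):
        print(row)
```

The large side is never regrouped by key, which is what makes this cheaper than a shuffle-based join when one side is small.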
18. Financial Intelligence
• Support Finance and Accounting Functions
• Process all the data in the company
- Yes, ALL of it
- Varying quality and “cleanliness”
- Immense Scale - 10+ years of transactional data
• Output clean financial data to power various business functions
- Treasury
- Revenue Ops
- Financial Systems and Technologies Group
20. Spark Upgrade from 1.x to 2.3
• Up to 2.5X performance improvement
- 60% reduction of batch processing time in a production job
• Better SQL support. Spark 2.x can run all 99 TPC-DS queries, which require many of the SQL:2003 features
• Vectorized reader for ORC and Parquet
• Reduced cost due to better performance
• Better support for and integration with the Spark and Hadoop ecosystem
• Numerous improvements and bug fixes have gone into Spark since the 1.6 release in 2016
23. Challenges and Pain Points
• Fast growth
- Topics from dozens to thousands
- Bytes grew 6x in 2018
• Bottlenecks (e.g., Spark parallelism determined by Kafka partitions)
• Skew in event size and QPS
32. Balanced Partitioning Algorithm
• Shuffle the list of offset ranges
• Starting from split 1, for each offset range:
- Assign it to the current split if the total weight is less than weight-per-split
- If it doesn’t fit, break it apart and assign the subset of the offset range that fits
- If the current split is more than bytes-per-split, move to the next split
Blog Post: Scaling Spark Streaming for Logging Event Ingestion
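The steps on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation from the talk or the blog post: the weight of an offset range is assumed to be its event count times an estimated average event size in bytes, and "weight-per-split" and "bytes-per-split" are treated as the same threshold (total weight divided by the number of splits). The names `OffsetRange`, `bytes_per_event`, and `balanced_splits` are hypothetical, as is the choice to let the last split absorb any remainder.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    start: int               # inclusive offset
    end: int                 # exclusive offset
    bytes_per_event: float   # assumed average event size; must be > 0

    @property
    def weight(self) -> float:
        return (self.end - self.start) * self.bytes_per_event


def balanced_splits(
    ranges: Iterable[OffsetRange], num_splits: int, seed: Optional[int] = None
) -> List[List[OffsetRange]]:
    """Distribute offset ranges over num_splits splits of roughly equal weight."""
    shuffled = list(ranges)
    random.Random(seed).shuffle(shuffled)            # step 1: shuffle the ranges

    weight_per_split = sum(r.weight for r in shuffled) / num_splits
    splits: List[List[OffsetRange]] = [[] for _ in range(num_splits)]
    current, used = 0, 0.0                           # start from the first split

    for r in shuffled:
        start = r.start
        while start < r.end:
            is_last = current == num_splits - 1
            remaining = r.end - start
            if is_last:
                take = remaining                     # assumption: last split takes the rest
            else:
                room = weight_per_split - used
                take = min(remaining, int(room // r.bytes_per_event))
                if take == 0 and used > 0:
                    current, used = current + 1, 0.0  # nothing fits: next split
                    continue
                take = max(take, 1)                  # always make progress on an empty split
            # Assign the whole range, or only the sub-range that fits.
            splits[current].append(
                OffsetRange(r.topic, r.partition, start, start + take, r.bytes_per_event)
            )
            used += take * r.bytes_per_event
            start += take
            if not is_last and used >= weight_per_split:
                current, used = current + 1, 0.0     # split is full: move on
    return splits


if __name__ == "__main__":
    demo = [
        OffsetRange("small_events", 0, 0, 1000, 100.0),
        OffsetRange("large_events", 0, 0, 10, 1000.0),
    ]
    for i, split in enumerate(balanced_splits(demo, num_splits=4, seed=7)):
        print(i, sum(r.weight for r in split), split)
```

Because ranges are cut by weight rather than by Kafka partition, the number of splits is no longer tied to the number of partitions, and a single heavy partition can be spread over several splits.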