Apache Spark at Airbnb

Spark@Airbnb
HAO WANG • APRIL 24, 2019 • SPARK SUMMIT

Outline
• What is Airbnb
• Spark Use Cases at Airbnb
• Upgrade to Spark 2.3
• Near Real-Time Data Ingestion with Spark Streaming

[Map of Airbnb listings, 2009. Legend: 50-100, 101-300, 301-1000, and 1001+ listings]

6 Million
TOTAL HOMES ON AIRBNB

Shared room · Private room · Entire home · Bed & breakfast · Boutique · Unique · Vacation home
HOSTED BY MAGDALENA · MADRID, SPAIN
Stunning at old city center
FROM $96 PER NIGHT

Shared room · Private room · Entire home · Vacation home · Bed & breakfast · Boutique · Unique
HOSTED BY REMY · TELFS, AUSTRIA
Romantic Chalet with fantastic view
FROM $165 PER NIGHT

Shared room · Private room · Entire home · Bed & breakfast · Boutique · Unique · Vacation home
HOSTED BY FRANCES AND DENNIS · COTTONWOOD, IDAHO
Dog Bark Park Inn
FROM $132 PER NIGHT

SPORTS · FASHION · NATURE · PHOTOGRAPHY · CLASSES & WORKSHOPS · HISTORY

SPORTS · PHOTOGRAPHY · HISTORY · FOOD & DRINK · ENTERTAINMENT · SOCIAL IMPACT · CONCERTS · HEALTH & WELLNESS · ARTS

WHAT DOES IT MEAN FOR
DATA INFRASTRUCTURE?

Data Warehouse Storage (2017-2019)

Airbnb Spark Use Cases
• Search
• Pricing
• Machine Learning
• Data Ingestion
• Near real-time applications
• …

Search Ranking
• More powerful models (DNNs) require more training data!
• To process large amounts of data, we need to leverage tools like Spark
• Allows for more complex error handling and unit tests
• Can reuse code between the online system (Java) and the offline data pipeline (Scala)
• Use broadcast joins so that data pipelines extract raw logs only for searches of interest
arXiv paper: Applying Deep Learning To Airbnb Search
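
The broadcast-join idea in the last bullet can be sketched in plain Python. This is only an illustration of the technique, not Airbnb's pipeline: the record shapes, the sample data, and the function name `broadcast_join` are hypothetical. The small table of searches of interest is held in memory (in a cluster, each worker would hold its own copy), so the large log side is filtered where it sits and never has to be shuffled. In PySpark the corresponding hint is `pyspark.sql.functions.broadcast`.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (search_id, payload).
RawLog = Tuple[str, str]


def broadcast_join(raw_logs: Iterable[RawLog],
                   small_table: Dict[str, str]) -> Iterator[Tuple[str, str, str]]:
    """Inner-join each log row against an in-memory copy of the small table.

    Only the small side is replicated; the large side is streamed once and
    rows without a match are dropped immediately.
    """
    for search_id, payload in raw_logs:
        label = small_table.get(search_id)
        if label is not None:
            yield (search_id, payload, label)


# Small side: the searches we care about (made-up data).
searches_of_interest = {"s2": "booked", "s4": "clicked"}

# Large side: raw logs (made-up data).
raw_logs = [
    ("s1", "log-a"),
    ("s2", "log-b"),
    ("s3", "log-c"),
    ("s4", "log-d"),
    ("s2", "log-e"),
]

joined = list(broadcast_join(raw_logs, searches_of_interest))
print(joined)
```

The design choice is that the lookup table must be small enough to fit in memory on every worker; when it is, the join costs one pass over the logs.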

Smart Pricing
Blog Post: Learning Market Dynamics for Optimal Pricing

Financial Intelligence
• Support Finance and Accounting Functions
• Process all the data in the company
- Yes, ALL of it
- Varying quality and “cleanliness”
- Immense scale: 10+ years of transactional data
• Output clean financial data to power various business functions
- Treasury
- Revenue Ops
- Financial Systems and Technologies Group

Spark Upgrade from 1.x to 2.3
• Up to 2.5X performance improvement
- 60% reduction of batch processing time in a production job (a 60% reduction corresponds to a 2.5X speedup)
• Better SQL support. Spark 2.x can run all 99 TPC-DS queries, which require many of the SQL:2003 features
• Vectorized reader for ORC and Parquet
• Reduced cost due to better performance
• Better support and integration with the Spark and Hadoop ecosystem
• Numerous improvements and bug fixes have gone into Spark since the 1.6 release in 2016

Scaling Near Real-Time Data Ingestion

Architecture of Logging Data Flow
• Bridge between online and offline data
• Mission critical
• Near real-time
• High throughput
• SLA & recovery
• Efficiency & cost

Challenges and Pain Points
• Fast growth
- Topics from dozens to thousands
- Bytes grew 6x in 2018
• Bottlenecks (e.g., Spark parallelism determined by Kafka partitions)
• Skew in event size and QPS
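
To make the bottleneck concrete, here is a small sketch of what happens when each Kafka partition maps to exactly one Spark task. The topic names and byte counts are invented for illustration, and `skew_ratio` is a hypothetical helper. Because a batch finishes only when its slowest task does, the ratio of the largest task to the mean task is a rough measure of how much time is lost to skew.

```python
from statistics import mean

# Hypothetical bytes pending per (topic, partition) in one batch.
pending_bytes = {
    ("events_a", 0): 900_000_000,
    ("events_a", 1): 850_000_000,
    ("events_b", 0): 40_000_000,
    ("events_c", 0): 10_000_000,
}


def skew_ratio(task_bytes):
    """Largest task divided by the mean task; 1.0 means perfectly balanced."""
    sizes = list(task_bytes)
    return max(sizes) / mean(sizes)


# One task per Kafka partition: parallelism is fixed by the partition count,
# and the heaviest partition sets the batch duration.
one_task_per_partition = list(pending_bytes.values())
ratio = skew_ratio(one_task_per_partition)
print(f"tasks={len(one_task_per_partition)} skew={ratio:.2f}")
```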

Spark Task Running Time Skew
Image from https://silverpond.com.au/2016/10/06/balancing-spark/

More Challenges and Pain Points
• Stability & SLA depend on many systems
- Near real-time ingestion
- Headroom for catch-up
- SLA suffers
- Operational nightmare
- On-call burnout
• Efficiency & cost

Solution From Spark Community
• Outstanding issue and PR
• Does not handle data skew among topics

Balanced Partitioning Algorithm
• Pre-compute average event size (bytes) per topic
• Compute the ideal bytes per split
• For new topics, use the average size of all topics
Blog Post: Scaling Spark Streaming for Logging Event Ingestion
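
A minimal sketch of the pre-computation, under stated assumptions: the slide does not say how the ideal bytes per split is derived, so here it is taken to be the estimated total bytes in the batch divided by the number of splits. The function name `bytes_per_split`, the tuple layout `(topic, partition, from_offset, until_offset)`, and the sample numbers are all hypothetical.

```python
def bytes_per_split(offset_ranges, avg_event_size, num_splits):
    """Estimate the target number of bytes each split should read.

    offset_ranges: list of (topic, partition, from_offset, until_offset)
    avg_event_size: pre-computed average event size in bytes, per topic
    num_splits: desired number of Spark tasks (assumed to be configured)
    """
    # New topics have no history yet: fall back to the average over all topics.
    default_size = sum(avg_event_size.values()) / len(avg_event_size)

    total_bytes = 0.0
    for topic, _partition, start, end in offset_ranges:
        size = avg_event_size.get(topic, default_size)
        total_bytes += (end - start) * size

    # Assumption: "ideal" means an even share of the estimated total.
    return total_bytes / num_splits


avg_sizes = {"a": 100, "b": 300}      # bytes per event, made-up values
ranges = [
    ("a", 0, 0, 1000),                # 1000 events * 100 B
    ("b", 0, 0, 100),                 # 100 events * 300 B
    ("new", 0, 0, 50),                # unseen topic: 50 events * 200 B (fallback)
]
target = bytes_per_split(ranges, avg_sizes, num_splits=4)
print(target)
```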

Balanced Partitioning Algorithm
• Shuffle the list of offset ranges
• Starting from split 1, for each offset range:
- Assign it to the current split if the split's total weight stays below bytes-per-split
- If it doesn’t fit, break it apart and assign the subset of the offset range that fits
- Once the current split reaches bytes-per-split, move to the next split
Blog Post: Scaling Spark Streaming for Logging Event Ingestion
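
The steps above can be sketched as follows. This is one possible reading of the slide, not the actual Balanced Kafka Reader code: the function name `balanced_splits`, the tuple layout `(topic, partition, from_offset, until_offset)`, the even-share target, the rule that the last split absorbs any remainder, and the rule that an oversized event still gets placed are all assumptions made for illustration.

```python
import random


def balanced_splits(offset_ranges, avg_event_size, num_splits, seed=None):
    """Pack Kafka offset ranges into splits of roughly equal estimated bytes."""
    default_size = sum(avg_event_size.values()) / len(avg_event_size)

    def size_of(topic):
        # Unseen topics fall back to the average size over all known topics.
        return avg_event_size.get(topic, default_size)

    ranges = list(offset_ranges)
    random.Random(seed).shuffle(ranges)  # step 1: shuffle the offset ranges

    total = sum((end - start) * size_of(t) for t, _p, start, end in ranges)
    target = total / num_splits          # assumed definition of bytes-per-split

    splits = [[] for _ in range(num_splits)]
    current, used = 0, 0.0
    for topic, partition, start, end in ranges:
        size = size_of(topic)
        while start < end:
            remaining = end - start
            is_last = current == num_splits - 1
            if is_last:
                fit = remaining          # assumption: last split takes the rest
            else:
                fit = max(0, min(remaining, int((target - used) // size)))
                if fit == 0 and used == 0.0:
                    fit = 1              # an event larger than the target must go somewhere
            if fit > 0:
                # Assign the whole range, or only the part that fits.
                splits[current].append((topic, partition, start, start + fit))
                used += fit * size
                start += fit
            if start < end and not is_last:
                current += 1             # current split is full: move on
                used = 0.0
    return splits


avg_sizes = {"a": 100, "b": 300}
ranges = [("a", 0, 0, 1000), ("b", 0, 0, 100), ("new", 0, 0, 50)]
for i, split in enumerate(balanced_splits(ranges, avg_sizes, num_splits=4, seed=0), 1):
    print(i, split)
```

Because a large offset range can be cut into several pieces, one heavy Kafka partition no longer pins a single long-running task, and many small topics can share one split.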

Balanced Kafka Reader Performance

Finally
• Support 20X higher throughput with imbalanced topics
• Better SLA (happy customers! happy engineers!)
• Faster recovery and less hand holding
• Ever-increasing throughput

Open Source
• Balanced Kafka Reader for Spark will be available on Airbnb.io and Airbnb GitHub soon!

DEREK & IMAN  
AMSTERDAM, NETHERLANDS

Airbnb Celebrates Half a Billion Guest Arrivals

We are hiring!
hao.wang@airbnb.com
