Yahoo’s next generation user profile platform

1
Yahoo’s Next Generation
User Profile Platform
Kai Liu, Lu Niu
Yahoo Inc.

2
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work

3
Agenda
- Definition
- Use Cases
- Logical View
- User ID Type
- Schema Design
- Optimization
- Future Work

4
What is User Profile
A User Profile is a visual display of personal data associated with a specific user.
(Wikipedia)

7
User ID Type
- Desktop
- BID: for anonymous users
- SID: for registered users
- Mobile
- IDFA: for iOS devices
- GPSAID: for Android devices

8
Agenda
- Old architecture
- Problems
- New architecture
- Schema Design
- Optimization
- Future Work

9
Classic Architecture of Data System
Data
Preparation
(ETL)
Computation
(Hadoop)
Deep Storage
(HDFS)

10
Old Architecture
HDFS
(full)
Hive
AggregationETL
Batch Data
(daily, hourly,
minutely)
Ad Serving
HDFS
(incre)
1 day 1 day Modeling
Insights

11
Problems
- Aggregation is very expensive
- HDFS follows Write Once Read Many approach.
- Actually only ~30% of users get updates every day.
- Impossible to support multiple update frequencies
- Lack of capability to process event stream

12
- Spark
- Fast
- Consistent stack (batch/streaming)
- HBase
- Random read/write capabilities
- Flexible schema
- Hive
- Large scale ad-hoc query engine
- SQL like interface
New Architecture Components

13
New Architecture
HBase
Hive
HDFS
Kafka
Batch Data
Stream Data
10 mins - 1 day
1 sec - 10 mins
Ad Serving
Spark Batch
Spark
Streaming
Modeling
Insights

14
How problems get solved
- Incremental updates avoid full data load.
- Multiple Spark jobs with different frequencies running
concurrently.
- Spark streaming for event stream processing.

15
Agenda
- Schema Design
- Understand the data
- Table design
- Optimization
- Future Work

16
Understand the data
* (1) Ad serving; (2) User Modeling; (3) Audience Insights
Split user profile into multiple HBase tables.
Data Type Update Pattern Use Cases
Properties K/V pairs Overwrite (1)(2)(3)
Events Time Series Append only (3)
Segments List of K/V pairs Read-Modify-Write (1)(3)
Features Hybrid Overwrite + Read-Modify-Write (1)(2)

18
Table Design - Properties
Row Key
Column Family: Properties
c: age c: gender c: device1 c: device2 …...
0_284386766
1_1877933007
id_type + user_id
val 1 val 2 val 3

19
Table Design - Events
Row Key
Column Family: Events
c: event
0_284386766_1463848639
0_284386766_1463935039
id_type + user_id + event_type + timestamp
value
Rows are sorted
by timestamp

20
Table Design - Segments
Row Key
Column Family: Segments
c: type1 c: type2 c: type3 …...
0_284386766
1_1877933007
id_type + user_id
* Different segments in different column to avoid atomic operation
value

21
Properties Events
Query “Get age, gender of user A”
“Get events of user A from 05/21/2016
to 05/22/2016”
Write Pattern
❏ Write only
❏ Keep multiple versions
❏ Append only
❏ Use TTL to auto-remove records
Rollback
❏ Set TIMERANGE to
fetch last version in
application layer
❏ Filtered out bad records in
application layer
❏ Deletion based on timestamp if
necessary
Different Access Patterns

22
Agenda
- Schema Design
- Optimization
- Pre-split tables
- Pre-aggregation in Spark
- Lazy aggregation for inactive users
- Sequential read on Hive
- Future Work

24
Pre-Split Tables
- Data Skew: User data is not evenly distributed across different id types
- Pre-split tables based on data distribution
{SPLITS =>
["x00x00x00x01x50",
"x00x00x00x01xA0",
"x00x00x00x02x00",
"x00x00x00x02x40",
"x00x00x00x02x80",
"x00x00x00x02xC0", ,
"x00x00x00x03x00",
"x00x00x00x04x00"]
}

25
- 1 Billion native ads events per day on 0.1 Billion users
- Group by (user id, time interval)
- Reduce the writes by 10X
Pre-Aggregate events in Spark

26
Pre-Aggregate features in Spark
- 5 Billion app activities per day on 0.5 Billion devices
- 1 Billion search keywords per day on 0.06 Billion devices
- Aggregate on user id for both features. One Spark job instead of two.

27
Lazy aggregation for inactive users
- Problem: read-modify-write is expensive
- Facts:
- A large portion of the users might not be accessed frequently
- Update jobs are not evenly distributed over time
- Solution: Lazy aggregation for inactive users

28
- Maintain a set of users as active users
- Active users
- read-modify-write
- Inactive users
- Append updates only
- Merging updates:
- Batch job
- Upon request
Lazy aggregation for inactive users
Spark
r-m-w
w
HBase
r-m-w
Active Users
Inactive Users
update1
update2

29
Sequential read on Hive
- HBase to Hive
- Sync data to Hive using HBase snapshots without
impact Region Servers.
- Hive access the data using HBaseStorageHandler.
- Move sequential reads to Hive
- User modeling
- Audience insights

30
Agenda
- Schema Design
- Optimization
- Future Work

31
Future Work
- Explore Impala/Presto for better query performance;
- Expose API for incremental modeling capability.

34
More optimization
- Less column family as possible
- Turn off autoflush
- Throttling writes if necessary
- Compress data before sending to Hbase
- Kryo for serialization

Yahoo’s next generation user profile platform

Recommandé

Recommandé

Contenu connexe

En vedette

En vedette (16)

Similaire à Yahoo’s next generation user profile platform

Similaire à Yahoo’s next generation user profile platform (20)

Dernier

Dernier (20)

Yahoo’s next generation user profile platform