1
Yahoo’s Next Generation
User Profile Platform
Kai Liu, Lu Niu
Yahoo Inc.
2
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
3
Agenda
- What is User Profile
  - Definition
  - Use Cases
  - Logical View
  - User ID Type
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
4
What is User Profile
A User Profile is a visual display of personal data associated with a specific user.
(Wikipedia)
5
Use Cases
6
Logical View
7
User ID Type
- Desktop
  - BID: for anonymous users
  - SID: for registered users
- Mobile
  - IDFA: for iOS devices
  - GPSAID: for Android devices
8
Agenda
- What is User Profile
- Architecture Evolution
  - Old architecture
  - Problems
  - New architecture
- Schema Design
- Optimization
- Future Work
9
Classic Architecture of Data System
- Data Preparation (ETL)
- Computation (Hadoop)
- Deep Storage (HDFS)
10
Old Architecture
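[Diagram components]
- Source: Batch Data (daily, hourly, minutely)
- Processing: ETL, Hive, Aggregation
- Storage: HDFS (incremental), HDFS (full)
- Latency labels: 1 day, 1 day
- Consumers: Ad Serving, Modeling, Insights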
11
Problems
- Aggregation is very expensive
  - HDFS follows a write-once, read-many approach.
  - Only ~30% of users actually get updates every day.
- Impossible to support multiple update frequencies
- No capability to process event streams
12
New Architecture Components
- Spark
  - Fast
  - Consistent stack (batch/streaming)
- HBase
  - Random read/write capabilities
  - Flexible schema
- Hive
  - Large-scale ad hoc query engine
  - SQL-like interface
13
New Architecture
[Diagram components]
- Batch Data (10 mins - 1 day): HDFS, Spark Batch
- Stream Data (1 sec - 10 mins): Kafka, Spark Streaming
- Storage: HBase, Hive
- Consumers: Ad Serving, Modeling, Insights
14
How problems get solved
- Incremental updates avoid a full data load.
- Multiple Spark jobs with different frequencies run concurrently.
- Spark Streaming handles event stream processing.
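
A minimal sketch of why incremental updates help, assuming a plain dict as a stand-in for the HBase profile table. The function name `apply_incremental` and the sample data are hypothetical and not from the talk. Only users present in the update batch are touched, so the write cost follows the size of the batch rather than the size of the store.

```python
def apply_incremental(profiles, updates):
    """Merge a batch of per-user updates into the profile store in place.

    Returns the number of rows written, which equals the number of users
    in the batch, not the size of the whole store.
    """
    for user_id, props in updates.items():
        profiles.setdefault(user_id, {}).update(props)
    return len(updates)


# Hypothetical data: ten stored users, three of them updated today.
profiles = {f"user{i}": {"age": "unknown"} for i in range(10)}
todays_updates = {f"user{i}": {"age": "25-34"} for i in range(3)}
rows_written = apply_incremental(profiles, todays_updates)
```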
15
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
  - Understand the data
  - Table design
- Optimization
- Future Work
16
Understand the data
Split user profile into multiple HBase tables.

Table       Data Type          Update Pattern                 Use Cases*
Properties  K/V pairs          Overwrite                      (1)(2)(3)
Events      Time series        Append only                    (3)
Segments    List of K/V pairs  Read-Modify-Write              (1)(3)
Features    Hybrid             Overwrite + Read-Modify-Write  (1)(2)

* (1) Ad serving; (2) User modeling; (3) Audience insights
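
The three update patterns named above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the platform's code: the helper names (`overwrite`, `append_event`, `read_modify_write`) and the sample values are hypothetical.

```python
def overwrite(row, column, value):
    """Properties: the new value replaces the old one; no read is needed."""
    row[column] = value


def append_event(events, timestamp, value):
    """Events: every write adds a new record; nothing is modified."""
    events.append((timestamp, value))


def read_modify_write(row, column, modify):
    """Segments: read the current value, change it, then write it back."""
    current = row.get(column, [])
    row[column] = modify(current)


properties, events, segments = {}, [], {}
overwrite(properties, "age", "18-24")
overwrite(properties, "age", "25-34")
append_event(events, 1463848639, "click")
append_event(events, 1463935039, "view")
read_modify_write(segments, "type1", lambda segs: segs + ["sports"])
read_modify_write(segments, "type1", lambda segs: segs + ["travel"])
```

Only the third pattern needs a read before the write, which is why it is the expensive one.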
17
HBase Data Model
18
Table Design - Properties
- Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007)
- Column family: Properties
- Columns: age, gender, device1, device2, ...
- Cells: val 1, val 2, val 3, ...
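
A small sketch of the `id_type + user_id` row key, assuming the underscore separator seen in the example keys. The helper names and the property values are hypothetical, and a dict stands in for the Properties table.

```python
def make_row_key(id_type: int, user_id: int) -> str:
    """Compose a row key as id_type + user_id, joined by an underscore."""
    return f"{id_type}_{user_id}"


def parse_row_key(row_key: str) -> tuple[int, int]:
    """Split a row key back into (id_type, user_id)."""
    id_type, user_id = row_key.split("_", 1)
    return int(id_type), int(user_id)


# In-memory stand-in for the Properties table: row key -> {column: value}.
properties = {
    make_row_key(0, 284386766): {"age": "25-34", "gender": "f"},
}
```

Putting the ID type first keeps all users of one ID type in a contiguous key range.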
19
Table Design - Events
- Row key: id_type + user_id + event_type + timestamp (e.g. 0_284386766_1463848639, 0_284386766_1463935039)
- Column family: Events
- Column: event (holds the event value)
- Rows are sorted by timestamp
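
Because rows are stored in key order, a time-range query for one user becomes a range scan between two keys. The sketch below assumes the three-part key layout of the example keys (id_type, user_id, Unix timestamp in seconds) and uses a sorted Python list as a stand-in for the table. `event_key` and `scan` are hypothetical helpers.

```python
from bisect import bisect_left


def event_key(id_type, user_id, timestamp):
    """Build an event row key; the timestamp is zero-padded so string order matches time order."""
    return f"{id_type}_{user_id}_{timestamp:010d}"


def scan(sorted_keys, start_key, stop_key):
    """Return keys in [start_key, stop_key), like a scan with start and stop rows."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


keys = sorted([
    event_key(0, 284386766, 1463848639),  # 2016-05-21 (UTC)
    event_key(0, 284386766, 1463935039),  # 2016-05-22 (UTC)
    event_key(0, 284386767, 1463848639),  # a different, hypothetical user
])

# Events of user 284386766 on 2016-05-21 (UTC day boundaries).
may_21 = scan(
    keys,
    event_key(0, 284386766, 1463788800),
    event_key(0, 284386766, 1463875200),
)
```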
20
Table Design - Segments
- Row key: id_type + user_id (e.g. 0_284386766, 1_1877933007)
- Column family: Segments
- Columns: type1, type2, type3, ... (one value per column)
* Different segments go in different columns to avoid atomic operations
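
One way to read the note about separate columns: writers that own different columns can issue blind writes without coordinating. The sketch below contrasts that with packing all segments into one cell. It uses a dict as a stand-in for the table, and the `put` helper, column names, and segment values are hypothetical.

```python
def put(table, row_key, column, value):
    """Blind write of one cell; other columns of the row are untouched."""
    table.setdefault(row_key, {})[column] = value


# One column per segment type: two jobs write without reading first.
segments = {}
put(segments, "0_284386766", "type1", "sports")   # job A
put(segments, "0_284386766", "type2", "premium")  # job B

# Contrast: all segments packed into a single cell. Both jobs read the
# same snapshot, and the later blind write silently drops the earlier one.
packed = {"0_284386766": {"all": []}}
snapshot = list(packed["0_284386766"]["all"])
put(packed, "0_284386766", "all", snapshot + ["sports"])   # job A
put(packed, "0_284386766", "all", snapshot + ["premium"])  # job B
```

In the packed layout, avoiding the lost update would require an atomic check-and-put or a lock.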
21
Different Access Patterns

Properties
- Query: “Get age, gender of user A”
- Write pattern: write only; keep multiple versions
- Rollback: set TIMERANGE to fetch the last version in the application layer

Events
- Query: “Get events of user A from 05/21/2016 to 05/22/2016”
- Write pattern: append only; use TTL to auto-remove records
- Rollback: filter out bad records in the application layer; delete based on timestamp if necessary
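
The Properties rollback relies on cells keeping several timestamped versions, so a read restricted to an earlier time range returns the value from before a bad write. Below is a minimal in-memory sketch of that idea, assuming an exclusive upper bound as in HBase time ranges. The helper names, timestamps, and values are hypothetical.

```python
def put(versions, timestamp, value):
    """Store one version of a cell, keyed by its write timestamp."""
    versions[timestamp] = value


def get_latest(versions, max_ts=None):
    """Return the newest version with timestamp < max_ts (or the newest overall)."""
    candidates = [ts for ts in versions if max_ts is None or ts < max_ts]
    return versions[max(candidates)] if candidates else None


age = {}
put(age, 100, "18-24")
put(age, 200, "25-34")
put(age, 300, "garbage")  # a bad write we want to ignore

current = get_latest(age)                  # sees the bad write
rolled_back = get_latest(age, max_ts=300)  # reads as of just before it
```

Nothing is deleted here; the rollback is purely a choice of read window made by the application.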
22
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
  - Pre-split tables
  - Pre-aggregation in Spark
  - Lazy aggregation for inactive users
  - Sequential read on Hive
- Future Work
23
Pre-Split Tables
24
Pre-Split Tables
- Data Skew: User data is not evenly distributed across different id types
- Pre-split tables based on data distribution
{SPLITS =>
["\x00\x00\x00\x01\x50",
"\x00\x00\x00\x01\xA0",
"\x00\x00\x00\x02\x00",
"\x00\x00\x00\x02\x40",
"\x00\x00\x00\x02\x80",
"\x00\x00\x00\x02\xC0",
"\x00\x00\x00\x03\x00",
"\x00\x00\x00\x04\x00"]
}
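
To see how split points partition the key space, the sketch below maps a row key to a region index. It assumes the split points are 5-byte strings written with `\x` escapes (for example `\x00\x00\x00\x01\x50`) and that each region covers keys from its start split (inclusive) to the next split (exclusive). `region_index` is a hypothetical helper, not an HBase API.

```python
from bisect import bisect_right

# The eight split points from the slide, as raw bytes.
SPLITS = [bytes.fromhex(h) for h in (
    "0000000150", "00000001A0", "0000000200", "0000000240",
    "0000000280", "00000002C0", "0000000300", "0000000400",
)]


def region_index(row_key: bytes) -> int:
    """Index of the region holding row_key; eight splits give nine regions (0..8)."""
    return bisect_right(SPLITS, row_key)


first = region_index(bytes.fromhex("0000000100"))     # below the first split
boundary = region_index(bytes.fromhex("0000000150"))  # equal to the first split
last = region_index(bytes.fromhex("0000000500"))      # above the last split
```

The splits are denser between `\x01\x50` and `\x03\x00`, which would suit a distribution where most keys fall in that range.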
25
Pre-Aggregate events in Spark
- 1 billion native ads events per day across 0.1 billion users
- Group by (user id, time interval)
- Reduce writes by 10X
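
A plain-Python sketch of the grouping step, assuming events arrive as `(user_id, timestamp, payload)` tuples and a one-hour interval; both are illustrative choices, and `pre_aggregate` is a hypothetical name. In Spark the same grouping would be expressed as a key-based aggregation; the point here is only that one row is written per (user, interval) instead of one per event.

```python
from collections import defaultdict


def pre_aggregate(events, interval_secs):
    """Group events by (user_id, start of time interval)."""
    buckets = defaultdict(list)
    for user_id, timestamp, payload in events:
        bucket_start = timestamp - timestamp % interval_secs
        buckets[(user_id, bucket_start)].append(payload)
    return dict(buckets)


events = [
    ("u1", 1463848639, "impression"),
    ("u1", 1463848700, "click"),
    ("u1", 1463852300, "impression"),
    ("u2", 1463848639, "impression"),
]
rows = pre_aggregate(events, interval_secs=3600)
```

With the numbers on the slide, 1 billion events over 0.1 billion users is about 10 events per user per day on average.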
26
Pre-Aggregate features in Spark
- 5 Billion app activities per day on 0.5 Billion devices
- 1 Billion search keywords per day on 0.06 Billion devices
- Aggregate on user id for both features. One Spark job instead of two.
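
A sketch of combining both feature sources in one pass keyed by user id, so each user produces a single merged record. The input shapes, the field names `apps` and `keywords`, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_features(app_activities, search_keywords):
    """Merge two (user_id, value) streams into one feature record per user."""
    profiles = defaultdict(lambda: {"apps": [], "keywords": []})
    for user_id, app in app_activities:
        profiles[user_id]["apps"].append(app)
    for user_id, keyword in search_keywords:
        profiles[user_id]["keywords"].append(keyword)
    return dict(profiles)


features = aggregate_features(
    app_activities=[("u1", "mail"), ("u1", "news"), ("u2", "weather")],
    search_keywords=[("u1", "flights"), ("u3", "recipes")],
)
```

A user who appears in both sources is written once rather than once per feature job.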
27
Lazy aggregation for inactive users
- Problem: read-modify-write is expensive
- Facts:
  - A large portion of users might not be accessed frequently
  - Update jobs are not evenly distributed over time
- Solution: lazy aggregation for inactive users
28
Lazy aggregation for inactive users
- Maintain a set of users as active users
- Active users
  - Read-modify-write
- Inactive users
  - Append updates only
- Merging updates:
  - Batch job
  - Upon request
[Diagram: Spark applies read-modify-write (r-m-w) to active users in HBase and write-only (w) appends (update1, update2) for inactive users]
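
The scheme can be sketched with an in-memory store, assuming updates are numeric deltas that merge by addition. The class name `LazyProfileStore` and its methods are hypothetical, and the real merge logic for segments or features would be richer than a sum.

```python
from collections import defaultdict


class LazyProfileStore:
    """Active users are merged on write; inactive users are merged lazily."""

    def __init__(self, active_users):
        self.active = set(active_users)
        self.merged = {}                  # user -> aggregated value
        self.pending = defaultdict(list)  # user -> appended, unmerged updates

    def update(self, user, delta):
        if user in self.active:
            # read-modify-write
            self.merged[user] = self.merged.get(user, 0) + delta
        else:
            # append only: no read on the write path
            self.pending[user].append(delta)

    def merge(self, user):
        """Fold a user's pending updates into the aggregated value."""
        deltas = self.pending.pop(user, [])
        self.merged[user] = self.merged.get(user, 0) + sum(deltas)

    def merge_all(self):
        """Batch job: merge every user that has pending updates."""
        for user in list(self.pending):
            self.merge(user)

    def read(self, user):
        """Merge upon request, then return the aggregated value."""
        if user in self.pending:
            self.merge(user)
        return self.merged.get(user, 0)


store = LazyProfileStore(active_users={"alice"})
store.update("alice", 1)
store.update("bob", 2)
store.update("bob", 3)
pending_before_read = len(store.pending["bob"])
bob_total = store.read("bob")
```

The cost of the read is deferred to the batch job or to the first request, which shifts work away from peak update times.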
29
Sequential read on Hive
- HBase to Hive
  - Sync data to Hive using HBase snapshots, without impacting RegionServers.
  - Hive accesses the data using HBaseStorageHandler.
- Move sequential reads to Hive
  - User modeling
  - Audience insights
30
Agenda
- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work
31
Future Work
- Explore Impala/Presto for better query performance.
- Expose an API for incremental modeling.
32
Questions?
33
Appendix
34
More optimization
- Use as few column families as possible
- Turn off autoflush
- Throttle writes if necessary
- Compress data before sending it to HBase
- Use Kryo for serialization
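
Turning off autoflush means writes are buffered on the client and sent in batches instead of one request per put. The sketch below imitates that behavior with a hypothetical `BufferedWriter` class and a callback standing in for the network call; it is not the HBase client API.

```python
class BufferedWriter:
    """Collect puts client-side and send them in batches."""

    def __init__(self, send_batch, max_buffered):
        self.send_batch = send_batch      # callable taking a list of (key, value)
        self.max_buffered = max_buffered  # flush once this many puts are queued
        self.buffer = []

    def put(self, row_key, value):
        self.buffer.append((row_key, value))
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this before shutting down."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []


batches = []  # records each simulated request
writer = BufferedWriter(send_batch=batches.append, max_buffered=3)
for i in range(7):
    writer.put(f"0_{i}", "value")
writer.flush()
```

Seven puts result in three requests here (3 + 3 + 1), at the price of losing buffered writes if the client dies before a flush.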
