We at Advertising.com (a division of AOL Networks) use a dedicated Hadoop cluster to process petabytes of online display advertising data. The data powers customer and audience understanding, look-alike audience prediction, ad effectiveness measurement, and ad-hoc research. In this talk, we cover a few use cases, success stories, and lessons learned.
2. Massive cross-screen network reaching 600M+ consumers worldwide
Premium programmatic demand side platform
Leading premium video network with 67M+ uniques
Premium programmatic video platform
Branded and content entertainment platform
Premium programmatic supply side platform
3. 5Vs IN BIG DATA
VELOCITY
• Doesn’t always work well with “volume”… leading to silos. Technical challenge.
VOLUME
• Petabytes are the norm. Thanks, Hadoop! Bottlenecks and hotspots occur in unexpected places.
VERACITY
• “Where shall clean metadata be found?” Organizational challenge (culture and process).
VARIETY
• Diverse data sources… leading to silos. Engineering resource / architectural challenge.
VALUE
• Not to be forgotten. “Why we fight?”
4. IT’S BEEN A
GREAT 10 YEARS
(Taken from http://www.slideshare.net/larsgeorge/hadoop-is-dead-lars-george-bi-data2013 and http://techblog.baghel.com/index.php?itemid=132 )
5. AOL NETWORKS
DATA IN HADOOP
USE CASES
Aggregates : Easy via Hive
Ad hoc queries : Harder via Pig/Hive
User level analysis : Hardest
1. Customer / audience understanding,
2. Predicting look-alike audiences,
3. Measuring ad effectiveness,
4. User time-series analysis,
5. Stream analysis,
6. Ad-hoc research,
7. ...
SCALE
• > 1 Billion events / day
• > 100 million web users
• Hundreds of advertisers
• Thousands of ad campaigns
• Thousands of pixels
• Petabytes of data
6. CHALLENGES
VARIETY
• Acquisitions happen
• New, diverse data sources
• Speed of ingestion is the key
NEED FOR USER LEVEL ANALYSIS
Answering such questions as:
• “What are prominent behavioral segments of
those who purchased product A?”
• “What do users do in the 2 weeks prior to
purchasing product B?”
• “What is the likelihood of a user purchasing
product C over the next week?”
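The second question above can be sketched on a toy event log. This is only an illustration of the windowing logic, not the production pipeline: the event tuples, the action strings such as "purchase:B", and the function name actions_before_purchase are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event records: (user_id, timestamp, action). The field layout
# and values are illustrative only, not the actual AOL Networks schema.
EVENTS = [
    ("u1", datetime(2013, 5, 1), "view:sports"),
    ("u1", datetime(2013, 5, 10), "click:ad_b"),
    ("u1", datetime(2013, 5, 12), "purchase:B"),
    ("u2", datetime(2013, 4, 1), "view:news"),
    ("u2", datetime(2013, 5, 20), "view:sports"),
    ("u2", datetime(2013, 5, 25), "purchase:B"),
    ("u3", datetime(2013, 5, 3), "view:sports"),
]


def actions_before_purchase(events, product, window_days=14):
    """Count actions buyers took in the window before their first purchase."""
    target = "purchase:" + product
    # Pass 1: find each user's first purchase time for the product.
    first_purchase = {}
    for user, ts, action in events:
        if action == target and (user not in first_purchase or ts < first_purchase[user]):
            first_purchase[user] = ts
    # Pass 2: count actions that fall inside [purchase - window, purchase).
    window = timedelta(days=window_days)
    counts = Counter()
    for user, ts, action in events:
        bought_at = first_purchase.get(user)
        if bought_at is not None and bought_at - window <= ts < bought_at:
            counts[action] += 1
    return counts


result = actions_before_purchase(EVENTS, "B")
print(result.most_common())
```

Non-buyers (u3 here) and events older than the window are ignored, which is why this kind of question needs each user's full event history in one place rather than pre-computed aggregates.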
UNSTRUCTURED
DATA
7. MAD, MAD, MAD
Magnetic: “attracting all
the data sources that
crop up within an
organization regardless of
data quality niceties.”
Agile: “allow analysts to
easily ingest, digest,
produce and adapt data at
a rapid pace.”
Deep: “... increasingly
sophisticated statistical methods
... beyond the rollups and
drilldowns of traditional BI. ...
need to see both the forest and
the trees in running these
algorithms - they want to study
enormous datasets without
resorting to samples and
extracts. The modern data
warehouse should serve both as
a deep data repository and as
a sophisticated algorithmic
runtime engine.”
MAD Skills: New Analysis Practices for Big Data (2009, Cohen et al.)
M A D
8. USER PROFILE
• A daily user profile is built for all
anonymous cookie ids seen on a given
day.
• Multiple days’ worth of user profiles are
assembled via map-side join.
• The processing framework is built so that the
map-side join and other machinery are
hidden from researchers and (most)
developers.
• Supports almost all advanced use cases.
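The idea behind assembling multi-day profiles with a map-side join can be sketched in a few lines. A map-side join relies on every input already being sorted (and partitioned) by the join key, so the inputs can be stream-merged without a shuffle. The record layout, the sample data, and the function name merge_daily_profiles below are assumptions for illustration, not the actual framework.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical daily profiles: lists of (cookie_id, day, features), each
# already sorted by cookie_id, as a map-side join requires of its inputs.
DAY_1 = [("c1", "2013-05-01", {"clicks": 1}), ("c3", "2013-05-01", {"clicks": 0})]
DAY_2 = [("c1", "2013-05-02", {"clicks": 2}), ("c2", "2013-05-02", {"clicks": 1})]
DAY_3 = [("c2", "2013-05-03", {"clicks": 0}), ("c3", "2013-05-03", {"clicks": 4})]


def merge_daily_profiles(*daily_profiles):
    """Stream-merge sorted daily profiles into one multi-day record per cookie."""
    # heapq.merge reads the sorted inputs lazily and keeps the output sorted.
    merged = heapq.merge(*daily_profiles, key=itemgetter(0))
    # Consecutive rows with the same cookie_id belong to the same user.
    for cookie_id, rows in groupby(merged, key=itemgetter(0)):
        yield cookie_id, {day: features for _, day, features in rows}


profiles = dict(merge_daily_profiles(DAY_1, DAY_2, DAY_3))
print(profiles["c1"])
```

Because the merge is lazy, only one row per input needs to be held in memory at a time, which is what makes the approach practical when each daily file is large.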
CHOICES WE (ALMOST) HAD:
• Flat file on HDFS,
• Pig,
• Hive,
• HBase,
• Custom “user profile”
• Ended up with the user profile
approach and never looked back...
• ... so far.
9. USE CASE #1:
CUSTOMER UNDERSTANDING
User profile supports AOL Networks’ audience analytics system that answers such
questions as:
• “Are very young and old customers better clickers?”
o “Yes, but young adults are better purchasers”
• “Are people who saw display advertising more likely to come to the online store?”
o “Yes. About twice as likely.”
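An answer such as "about twice as likely" is a ratio of visit rates between users who saw the advertising and users who did not. A minimal sketch of that arithmetic follows; the counts are made up for illustration, the function names are hypothetical, and a naive comparison like this does not by itself establish that the ads caused the visits.

```python
def visit_rate(visitors, users):
    """Share of a group that came to the online store."""
    return visitors / users


def lift(exposed_visitors, exposed_users, control_visitors, control_users):
    """Ratio of store-visit rates: exposed group vs. unexposed group."""
    return visit_rate(exposed_visitors, exposed_users) / visit_rate(
        control_visitors, control_users
    )


# Made-up counts chosen only to illustrate the arithmetic:
# 4% of exposed users visited, 2% of unexposed users visited.
example_lift = lift(400, 10_000, 300, 15_000)
print(round(example_lift, 2))
```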
10. USE CASE #2:
LOOKALIKE AUDIENCE MODEL
User profile supports AOL
Networks’ Lookalike audience
offering, which lets you reach new
people who are likely to be
interested in an advertiser’s offering
due to their similarity to existing
customers.
Predictive Analytics
and Optimization
Logistic Regression
Neural Networks
Random Forest
Gradient Boosting Machine
…
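As a toy illustration of the first technique listed, the sketch below fits a logistic regression on a handful of labeled seed users and ranks prospects by predicted similarity to existing customers. The two binary features, the seed data, and all function names are assumptions for the example; it is not the production model and says nothing about how the other listed methods are used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Predicted probability that a user resembles existing customers."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = score(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Hypothetical features: [visited product pages, clicked an ad].
# Label 1 = existing customer, 0 = not a customer.
SEED_ROWS = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
SEED_LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(SEED_ROWS, SEED_LABELS)

# Rank unseen prospects from most to least customer-like.
prospects = {"p1": [1, 1], "p2": [0, 0]}
ranked = sorted(prospects, key=lambda p: score(weights, bias, prospects[p]), reverse=True)
print(ranked)
```

In practice the feature vector would come from the user profile and the top-ranked prospects would form the lookalike audience; the cutoff is a business decision rather than part of the model.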
VALUE UNSTRUCTURED
DATA
11. MORE CHALLENGES...
Cluster Ops
Tuning of Cluster / Jobs
Velocity / real-time: Want more real-time updates to the user profile. Hard.
Veracity: Organizational challenge. High-quality metadata.
Good “Data Scientists” specializing in “Big Data” are hard to find.
12. LOOKING FORWARD TO MORE
EXCITING DEVELOPMENTS
(Taken from http://www.slideshare.net/larsgeorge/hadoop-is-dead-lars-george-bi-data2013 and http://techblog.baghel.com/index.php?itemid=132 )