CleverDATA, Data Science Week: Spark vs Hadoop in online audience segmentation
1. Spark. User segmentation in online advertising
Sergey Zhemzhitsky, CTO, CleverDATA, for Data Science Week 2015
DATA MINING
2. Company Profile
User segmentation in online advertising: Spark vs Hadoop
Sergey Zhemzhitsky, CTO, CleverDATA, 28 August 2015
3. cleverdata.ru | info@cleverdata.ru
• International market business development since 2012
• One of three leading IT companies in Russia
• 43 branches in Russia and abroad
• 5500+ employees
• 100K projects for 10K customers
• Data management innovative platform (Data Exchange Service)
• Cloud Service
• In-house development
• Internet advertising solutions
• Data Management Platforms
• Customers Base Management
• Web Analytics
• Marketing automation
• Big Data
• Data Mining
• Digital Intelligence
• Operational Intelligence
• Low Latency and NoSQL
• Cloud Computing
4. Agenda
• About the task;
• Hadoop vs. Spark;
• Specifics;
• What's next.
5. Real Time Bidding (RTB)
[Diagram labels: publishers, SSP, ad networks, DSP, advertisers]
6. Data Management Platform (DMP)
[Diagram labels: publishers, SSP, DSP, advertisers, DMP; tracking data: cookie syncs, access logs, partner's data, 3rd party data, click streams]
7. Typical data flows
[Diagram labels: 3rd party data, raw data, Relational Data Store, Raw Data Store & Processing, RealTime Data Store, user profiles, aggregates]
8. Typical data flows :: RTB
[Diagram labels: Exchange, SSP, RTB SRV, bid req., bid resp., bid requests, pixels :: impressions :: clicks, 3rd party data, raw data, Relational Data Store, Raw Data Store & Processing, RealTime Data Store, user profiles, aggregates]
9. [Diagram labels: 1st-party data, 3rd party data, Exchange, SSP, RTB SRV, bid req., bid resp., bid requests, pixels :: impressions :: clicks, raw data, Relational Data Store, Raw Data Store & Processing, RealTime Data Store, user profiles, aggregates]
10. Task
Find all users who
• took part in the “Star Wars” advertising campaign
• [and] saw one of the banners “Darth Vader” or “Luke Skywalker” during the last 6 days
• [and] clicked on that banner
• [and] visited the purchase page for Darth Vader's lightsaber
• [and] still did not buy anything
The goal is to retarget them with a personalized banner offering a 40% discount on the lightsaber.
11. Task
find all users who
• have taken part in campaign[s] “Star Wars”
• [and] viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s]
• [and] clicked banner[s] “Darth Vader's lightsaber”
• [and] visited buying area of “Darth Vader's lightsaber”
• [and] not visited order confirmed area of “Darth Vader's lightsaber”
Event types behind the criteria: [impression], [click], [tr. pixel], [tr. pixel]; users are identified by [cookies].
id | cookie | event_id | event_type | campaign_id | timestamp | …
1 | c1 | “Darth Vader” | impression | “Star Wars” | 2015-04-20 14:25:11.462 | …
2 | c1 | “Darth Vader's lightsaber” | click | “Star Wars” | 2015-04-21 06:31:12.157 | …
3 | c1 | “Darth Vader's lightsaber” | tr. pixel | “Star Wars” | 2015-04-22 18:57:19.628 | …
12. Task
find all users who
• (0) have taken part in campaign[s] “Star Wars”
• (1) viewed banner[s] “Darth Vader” or “Luke Skywalker” during [last] 6 day[s]
• (2) clicked banner[s] “Darth Vader's lightsaber”
• (3) visited buying area of “Darth Vader's lightsaber”
• (4) not visited order confirmed area of “Darth Vader's lightsaber”
map: every event emits (cookie, i) for each criterion i it matches: (c1, 0), (c1, 1), (c1, 2), (c1, 3); nothing (Ø) is emitted for criterion 4.
reduce: (c1, 0;1;2;3) is evaluated as true(0) and true(1) and true(2) and true(3) and not false(4), so C1 is selected.
id | cookie | event_id | event_type | campaign_id | timestamp | …
1 | c1 | “Darth Vader” | impression | “Star Wars” | 2015-04-20 14:25:11.462 | …
2 | c1 | “Darth Vader's lightsaber” | click | “Star Wars” | 2015-04-21 06:31:12.157 | …
3 | c1 | “Darth Vader's lightsaber” | tr. pixel | “Star Wars” | 2015-04-22 18:57:19.628 | …
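The map and reduce steps on this slide can be sketched in plain Python (no Hadoop or Spark involved). This is only an illustration under assumptions: the `area` field is hypothetical, standing in for whatever elided column distinguishes the "buying" pixel from the "order confirmed" pixel; cookie `c2` is invented to show an excluded user; and the 6-day time window is left out for brevity.

```python
from collections import defaultdict


def make_event(cookie, event_id, event_type, campaign_id, area=None):
    # `area` is a hypothetical field, not a column shown on the slide.
    return {"cookie": cookie, "event_id": event_id, "event_type": event_type,
            "campaign_id": campaign_id, "area": area}


SABER = "Darth Vader's lightsaber"

EVENTS = [
    # The three c1 rows follow the slide's table.
    make_event("c1", "Darth Vader", "impression", "Star Wars"),
    make_event("c1", SABER, "click", "Star Wars"),
    make_event("c1", SABER, "tr. pixel", "Star Wars", area="buying"),
    # c2 is invented: this user also reached the order-confirmed page.
    make_event("c2", "Luke Skywalker", "impression", "Star Wars"),
    make_event("c2", SABER, "click", "Star Wars"),
    make_event("c2", SABER, "tr. pixel", "Star Wars", area="buying"),
    make_event("c2", SABER, "tr. pixel", "Star Wars", area="order confirmed"),
]

# Criteria 0..4, one predicate per condition in the task.
CRITERIA = [
    lambda e: e["campaign_id"] == "Star Wars",
    lambda e: e["event_type"] == "impression"
    and e["event_id"] in ("Darth Vader", "Luke Skywalker"),
    lambda e: e["event_type"] == "click" and e["event_id"] == SABER,
    lambda e: e["event_type"] == "tr. pixel" and e["event_id"] == SABER
    and e["area"] == "buying",
    lambda e: e["event_type"] == "tr. pixel" and e["event_id"] == SABER
    and e["area"] == "order confirmed",
]

REQUIRED = {0, 1, 2, 3}   # criteria that must have matched
FORBIDDEN = {4}           # criterion that must not have matched


def map_event(event):
    """Map step: emit (cookie, criterion index) for each criterion the event matches."""
    for index, predicate in enumerate(CRITERIA):
        if predicate(event):
            yield event["cookie"], index


def reduce_cookie(indices):
    """Reduce step: true(0) and true(1) and true(2) and true(3) and not (4)."""
    return REQUIRED <= indices and not (FORBIDDEN & indices)


def segment(events):
    """Group emitted indices by cookie and keep the cookies that pass the reduce check."""
    matched = defaultdict(set)
    for event in events:
        for cookie, index in map_event(event):
            matched[cookie].add(index)
    return sorted(c for c, idx in matched.items() if reduce_cookie(idx))


print(segment(EVENTS))
```

Note that a single event may match several criteria: the first `c1` impression contributes both index 0 (campaign) and index 1 (banner view), which is how three events produce four emitted pairs.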
24. MR vs Spark :: Secondary Sort
MR
✓ setSortComparatorClass
✓ setGroupingComparatorClass
✓ setPartitionerClass
Spark
✓ repartitionAndSortWithinPartitions
✓ mapPartitions
✓ Entire partition processing result must be able to fit in memory
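The secondary-sort idea behind both lists can be sketched in plain Python. This does not call Hadoop or Spark; it only mimics the shape of the approach: route records to partitions by cookie, sort each partition by (cookie, timestamp), then walk each sorted partition once. The helper names, the partition count, and the `c2` record are assumptions made for illustration.

```python
import zlib
from itertools import groupby
from operator import itemgetter

# (cookie, timestamp, event_type); the c1 rows reuse timestamps from the task slides,
# the c2 row is invented.
RECORDS = [
    ("c1", "2015-04-22 18:57:19.628", "tr. pixel"),
    ("c2", "2015-04-21 09:00:00.000", "impression"),
    ("c1", "2015-04-20 14:25:11.462", "impression"),
    ("c1", "2015-04-21 06:31:12.157", "click"),
]


def partition_of(cookie, num_partitions):
    """Deterministic partitioner: all records of one cookie land in one partition."""
    return zlib.crc32(cookie.encode("utf-8")) % num_partitions


def partition_and_sort(records, num_partitions):
    """Partition by cookie, then sort inside each partition by (cookie, timestamp)."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_of(record[0], num_partitions)].append(record)
    for part in partitions:
        part.sort(key=itemgetter(0, 1))
    return partitions


def process_partition(partition):
    """Single pass over a sorted partition, yielding each cookie's events in time order."""
    for cookie, group in groupby(partition, key=itemgetter(0)):
        yield cookie, [event_type for _, _, event_type in group]


def secondary_sort(records, num_partitions=2):
    """Collect the per-cookie, time-ordered event lists from all partitions."""
    result = {}
    for part in partition_and_sort(records, num_partitions):
        result.update(process_partition(part))
    return result


print(secondary_sort(RECORDS))
```

Because each partition is already sorted, `process_partition` only needs the events of one cookie at a time; what it returns per partition still has to be held by the caller, which is the constraint the last bullet points at.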
25. MR vs Spark :: Statistics
MR
✓ Counters
Spark
✓ Accumulators: use in actions only
Spark guarantees that an accumulator update is applied exactly once only for actions, not for transformations.
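Why this matters can be shown with a small plain-Python simulation. It is not Spark code: `Counter`, `run_task` and `run_with_retry` are hypothetical stand-ins, and the "retry" simply re-runs the same function the way a re-executed task would re-run a transformation.

```python
class Counter:
    """Stand-in for an accumulator: a plain mutable counter."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def run_task(records, counter):
    """A 'transformation' that counts records as a side effect while mapping them."""
    output = []
    for record in records:
        counter.add(1)
        output.append(record.upper())
    return output


def run_with_retry(records, counter, first_attempt_lost):
    """Run the task; if the first attempt is lost, run it again from scratch."""
    result = run_task(records, counter)
    if first_attempt_lost:
        # Re-execution repeats the side effect, so the counter is updated twice.
        result = run_task(records, counter)
    return result


records = ["impression", "click", "tr. pixel"]

clean = Counter()
run_with_retry(records, clean, first_attempt_lost=False)

retried = Counter()
run_with_retry(records, retried, first_attempt_lost=True)

print(clean.value, retried.value)
```

The mapped output is the same in both runs, but the counter is not: statistics gathered inside a transformation may be inflated whenever work is re-executed.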
27. What's next, and why Spark?
• Spark Streaming;
• Micro Batches;
• λ-architecture.
All without major surgical intervention.