go.indeed.com/IndeedEngTalks
Machine Learning
at Indeed
Scaling Decision Trees
Andrew Hudson
CTO
I help
people
get jobs.
Indeed is a
Search Engine for Jobs
Which jobs to show?
18,749 jobs
Which jobs to show?
Maximize job seeker’s chance to get the job
Which jobs to show?
Maximize job seeker’s chance to get the job
● Will job seeker click on the job?
● Is the job still available?
● Will job seeker apply to the job?
● Is job seeker qualified for the job?
How?
Log job seeker behavior
Analyze logs: what best explains why they
clicked on some jobs and not on others?
May help predict future behavior
Supervised learning
Supervised Learning Approaches
Neural networks
Bayesian methods
Decision trees
Genetic programming
Logistic model tree
Nearest neighbor
Support Vector Machines
Random forests
Boosting
Bagging
Regression
Ensemble methods
Supervised Learning Approaches
Decision trees
Genetic programming
Logistic model tree
Random forests
Bagging
Boosting
Ensemble methods
Decision Trees
What is a Decision Tree?
A tree-like structure that presents a relevant
sequence of questions which determine a path
and ultimately some outcome or prediction
I’m Thinking About Buying a Laptop
I’m Thinking About Buying a Laptop
Is quality important?
I’m Thinking About Buying a Laptop
Is quality important?

NO

ASUS
I’m Thinking About Buying a Laptop
Is quality important?

NO

ASUS or whatever woot has
I’m Thinking About Buying a Laptop
Is quality important?
YES

Want to run linux?

NO

ASUS or whatever woot has
I’m Thinking About Buying a Laptop
Is quality important?

NO

ASUS or whatever woot has

YES

Want to run linux?

NO

MACB...
I’m Thinking About Buying a Laptop
Is quality important?

NO

ASUS or whatever woot has

YES

Want to run linux?
YES

LENO...
I’m Thinking About Buying a Laptop
Is quality important?
YES

NO

ASUS or whatever woot has

IDGAF

DELL

Want to run linu...
Benefits of Decision Trees
Algorithm relatively simple to understand and
implement
Model produced also human understandable
Decision Tree Learning
Programmatic creation of decision trees
Given a set of documents, split it into two or
more subsets that optimize some criterion
Repeat this process until a set can no longer be split
Titanic Example
1309 passengers
500 survivors
38.2% survival rate
What best explains who survived?
class: class of ticket; first, second or third
fsize: family size; number of family members onboard
gender: male or female
1309 passengers
500 survivors
38.2% survival
class = 1
323 passengers
200 survivors
61.9% survival
class ≠ 1
986 passengers
300 survivors
30.4% survival
Score = ?
Score
conditional entropy
Conditional Entropy as Score
lower conditional entropy
↓
less uncertainty about prediction based on term
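The score can be reproduced from the passenger counts on the slides. The following is a minimal sketch, not the talk's actual code: the function names are hypothetical, and the use of natural logarithms is an assumption (with it, the class = 1 and class ≤ 2 splits land close to the 0.6267 and 0.6244 shown on the slides).

```python
import math

def binary_entropy(positives, total):
    """Entropy, in nats, of a yes/no outcome seen `positives` times out of `total`."""
    if total == 0:
        return 0.0
    p = positives / total
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def conditional_entropy(count, vsum, total_count, total_vsum):
    """Entropy of the outcome given which side of the split a document is on.

    count/vsum describe the side that matches the condition; the other
    side is obtained by subtracting from the totals.
    """
    rest_count = total_count - count
    rest_vsum = total_vsum - vsum
    matching = (count / total_count) * binary_entropy(vsum, count)
    rest = (rest_count / total_count) * binary_entropy(rest_vsum, rest_count)
    return matching + rest

# Counts taken from the Titanic slides: 1309 passengers, 500 survivors.
class_eq_1 = conditional_entropy(323, 200, 1309, 500)
class_le_2 = conditional_entropy(600, 319, 1309, 500)
print(class_eq_1, class_le_2)
```

Lower is better, so under this scoring class ≤ 2 is a slightly better first split than class = 1.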
1309 passengers
500 survivors
38.2% survival
class = 1
323 passengers
200 survivors
61.9% survival
class ≠ 1
986 passengers
300 survivors
30.4% survival
Score = 0.6267
Best Score: 0.6267, class = 1
class = 1

1309 passengers
500 survivors
38.2% survival

Best Score:
0.6267, class = 1
1309 passengers
500 survivors
38.2% survival
class ≤ 2
600 passengers
319 survivors
53.2% survival
class > 2
709 passengers
181 survivors
25.5% survival
Score = 0.6244
Best Score: 0.6267, class = 1
class ≠ 3
600 passengers
319 survivors
53.2% survival

1309 passengers
500 survivors
38.2% survival

class = 3
709 passeng...
gender = female

1309 passengers
500 survivors
38.2% survival

Best Score:
0.6244, class ≤ 2
gender = female
466 passengers
339 survivors
72.7% survival

1309 passengers
500 survivors
38.2% survival

Best Score:
0.6...
gender ≠ female
843 passengers
161 survivors
19.1% survival

1309 passengers
500 survivors
38.2% survival

gender = female...
fsize ≠ 0
519 passengers
261 survivors
50.3% survival

1309 passengers
500 survivors
38.2% survival

fsize = 0
790 passeng...
Best Score:
0.5525, gender=f
19.1% survival

72.7% survival
gender=male
843 passengers
161 survivors
19.1% survival
class = 1
179 passengers
61 survivors
34.1% survival

gender=male
843 passengers
161 survivors
19.1% survival
class = 1
179 passengers
61 survivors
34.1% survival

gender=male
843 passengers
161 survivors
19.1% survival

class ≠ 1
6...
class = 1

class ≠ 1
class = 1

class ≠ 1
34.1% survival

15.1% survival
[Figure: the tree built so far, grown over several slides. Root: 38.2% survival; MALE 19.1%, FEMALE 72.7%. Deeper levels add splits labeled CLASS=1 / CLASS≠1 (34.1% / 15.1% for males), CLASS<=2 / CLASS>2, and FSIZE=2 / FSIZE≠2, with further rates 49.1%, 93.2%, 13.1%, 33.9% and 24.4% shown; the last slides are truncated.]
Predicting Click Probabilities
Passenger → Job Impression
Survived → Clicked on Job
For each candidate job, follow path th...
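Following a path through a tree to get a prediction can be sketched as below. This is an illustration only: the nested-dict layout and the `predict` function are hypothetical, and the tree uses the Titanic rates from the earlier slides (male/female, then class = 1 for males) rather than a real click model.

```python
# Each inner node tests one field against one value; each leaf holds a rate.
TREE = {
    "test": ("gender", "m"),
    "yes": {
        "test": ("class", 1),
        "yes": {"rate": 0.341},   # male, class = 1
        "no": {"rate": 0.151},    # male, class != 1
    },
    "no": {"rate": 0.727},        # female
}

def predict(node, doc):
    """Walk from the root to a leaf and return the rate stored there."""
    while "rate" not in node:
        field, value = node["test"]
        node = node["yes"] if doc[field] == value else node["no"]
    return node["rate"]

candidates = [
    {"id": "a", "gender": "m", "class": 3},
    {"id": "b", "gender": "f", "class": 2},
    {"id": "c", "gender": "m", "class": 1},
]
# Rank candidates by predicted rate, highest first.
ranked = sorted(candidates, key=lambda d: predict(TREE, d), reverse=True)
print([d["id"] for d in ranked])
```

For job search the same walk would run once per candidate job impression, and the leaf value would be read as a predicted click-through rate.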
Simplified Decision Tree for query="sales"
[Figure: a tree with nodes labeled "sales", "account", "manager" and "representative", YES/NO branches, and leaf click-through rates including 1.9%, 2.1% and 3.8%]
[Figure sequence: the same tree, repeated once per candidate job title: “sales representative”, “account executive”, “outside sales representative”, “sales associate”, “inside sales representative”, “sales manager”, “sales consultant”, “store manager”, “service sales representative”, “customer service representative”]
Final CTR Predictions
5.1%
4.6%
4.4%
3.8%
2.9%
2.9%
2.6%
2.1%
1.9%
1.8%

outside sales representative
sales representative...
Single Machine
Implementation
Overview
Tree Building Strategies
One node at a time
- depth first
- breadth first
[Figure: depth-first construction, animated over several slides; nodes are numbered 1 to 9 in the order they are built]
[Figure: breadth-first construction, animated over several slides; nodes are numbered 1 to 13 in the order they are built]
Tree Building Strategies
One node at a time
- depth first
- breadth first
One layer at a time, all nodes simultaneous
[Figure: one layer at a time, animated over several slides; iter #1 splits node 1 into nodes 2 and 3, iter #2 produces nodes 4 to 7, iter #3 produces the next layer (nodes 8 to 13, plus one node labeled 0), and iter #4 continues from there]
Data Format
id    class    fsize    gender    survived
0     1        0        f         1
10    1        1        m         0
1     1        3...
Data Format
Create an inverted index
Key to efficiently building one layer at a time
Inverted Index
Maps terms to the list of documents that
contain that term
Terms and docs stored in sorted order
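A minimal sketch of building such an index from row data follows. It is an in-memory illustration, not Lucene or Flamdex; `build_inverted_index` is a hypothetical helper, rows 0 and 10 come from the Data Format slide, and rows 323 and 600 are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(rows, fields):
    """Map field -> term -> sorted list of doc ids that contain the term."""
    index = {field: defaultdict(list) for field in fields}
    # Visiting doc ids in ascending order keeps every posting list sorted.
    for doc_id in sorted(rows):
        for field in fields:
            index[field][rows[doc_id][field]].append(doc_id)
    # Store the terms of each field in sorted order as well.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}

rows = {
    0: {"class": 1, "fsize": 0, "gender": "f", "survived": 1},
    10: {"class": 1, "fsize": 1, "gender": "m", "survived": 0},
    323: {"class": 2, "fsize": 0, "gender": "f", "survived": 1},  # made-up row
    600: {"class": 3, "fsize": 0, "gender": "m", "survived": 0},  # made-up row
}
index = build_inverted_index(rows, ["class", "fsize", "gender", "survived"])
print(index["class"])
```

Because both terms and doc ids are sorted, a whole field can be scanned term by term in one sequential pass, which is what the layer-at-a-time loop relies on.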
Inverted Index
class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13….
class=2 → 323,324,325,326,327,328,329….
class=3 → 600,601,602,6...
Inverted Index
fsize=0 → 0,5,7,9,12,13,14,15,18,19,22….
fsize=1 → 6,10,11,16,17,26,27,36,49,50….
fsize=2 → 8,20,21,42,76,7...
Inverted Index
gender=f → 0,2,4,6,8,11,12,13,17,18,21….
gender=m → 1,3,5,7,9,10,14,15,16,19,20….
Inverted Index
survived=0 → 2,3,4,7,9,10,15,16,19,25….
survived=1 → 0,1,5,6,8,11,12,13,14,17….
Inverted Index Implementations
Lucene
Flamdex
Primary Lookup Tables
groups[doc]
Where in the tree each doc is
Initialized to all ones, all docs start in root
values[doc...
Primary Lookup Tables
values[doc]
Constructed from an inverted index of the
values
Invert the field of interest (e.g. surv...
Main Loop Overview
foreach field
foreach term
get group stats
evaluate splits
apply best splits
repeat n times or until no...
Main Loop - First Iteration
foreach field (class, fsize, gender)
Main Loop - First Iteration
foreach field (class, fsize, gender)
foreach term (class=1,class=2,class=3...)
Main Loop - First Iteration
foreach field (class, fsize, gender)
foreach term (class=1,class=2,class=3...)
get group stats
Get Group Stats
count[grp]
Count of how many documents within that
group contain current term, initialized to zeros
vsum[g...
Get Group Stats
for current field/term
Get Group Stats
for current field/term
foreach doc
Get Group Stats
for current field/term
foreach doc
grp = grps[doc]
Get Group Stats
for current field/term
foreach doc
grp = grps[doc]
if grp == 0 skip
Get Group Stats
for current field/term
foreach doc
grp = grps[doc]
if grp == 0 skip
count[grp]++
vsum[grp] += vals[doc]
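The pseudocode above translates almost line for line into Python. This is a sketch under assumptions: `get_group_stats` and its signature are hypothetical, and the tiny arrays below cover only the first six documents (their `vals` follow the survived=1 posting list 0, 1, 5, ... from the slides).

```python
def get_group_stats(postings, grps, vals, num_groups):
    """Accumulate per-group stats for one field/term.

    postings: doc ids that contain the current term
    grps:     grps[doc] is the tree node (group) the doc currently sits in
    vals:     vals[doc] is the value being predicted (1 = survived)
    """
    count = [0] * (num_groups + 1)
    vsum = [0] * (num_groups + 1)
    for doc in postings:
        grp = grps[doc]
        if grp == 0:
            continue  # per the slide, docs in group 0 are skipped
        count[grp] += 1
        vsum[grp] += vals[doc]
    return count, vsum

grps = [1, 1, 1, 1, 1, 1]   # first iteration: every doc is in the root
vals = [1, 1, 0, 0, 0, 1]   # survived flag for docs 0..5
count, vsum = get_group_stats([0, 1, 2, 3, 4, 5], grps, vals, num_groups=1)
print(count[1], vsum[1])
```

One pass over a term's posting list updates the stats for every group at once, which is why a whole layer of the tree can be evaluated in a single sweep of the index.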
Get Group Stats
for current field/term (class=1)
Get Group Stats
for current field/term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8...)
Get Group Stats
for current field/term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8...)
grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
Get Group Stats
for current field/term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8...)
grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
i...
class = 1

1309 passengers
500 survivors
38.2% survival
class = 1

Group 1

1309 passengers
500 survivors
38.2% survival
class = 1

Group 1

1309 passengers
500 survivors
38.2% survival
class = 1
323 passengers

count[1]

Group 1

1309 passengers
500 survivors
38.2% survival
class = 1
323 passengers
200 survivors

count[1]
vsum[1]

Group 1

1309 passengers
500 survivors
38.2% survival
Get Group Stats
for current field/term (class=2)
foreach doc (323,324,325,326,327,328,329...)
grp = grps[doc] (1,1,1,1,1,1...
Get Group Stats
for current field/term (class=3)
foreach doc (600,601,602,603,604,605,606...)
grp = grps[doc] (1,1,1,1,1,1...
Main Loop - First Iteration
foreach field (class, fsize, gender)
foreach term (class=1,class=2,class=3...)
get group stats...
Evaluate Splits
Consider current field/term as a potential split
for each group
1) check if split is admissible
balance ch...
Evaluate Splits
totalcount[group] / totalvalue[group]
Total number of documents and total values for
each group, i.e. # pa...
foreach field/term (class=1)
get group stats (count[1]=323,vsum[1]=200)
foreach group
if not admissible( … ) skip
score = ...
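A sketch of the evaluation step, under stated assumptions: the slides truncate the admissibility rules, so the minimum-size check below (`min_docs`) is a stand-in rather than the talk's actual rule, the score is conditional entropy with natural logarithms, and all function names are hypothetical. The non-matching side of each split is derived by subtracting the term's stats from the group totals.

```python
import math

def entropy(pos, n):
    """Binary entropy in nats; 0 for empty or pure sets."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def evaluate_split(term, count, vsum, totalcount, totalvalue, best, min_docs=1):
    """Update best[grp] = (score, term) if this term beats the current best."""
    for grp in range(1, len(count)):
        yes_n, yes_v = count[grp], vsum[grp]
        no_n, no_v = totalcount[grp] - yes_n, totalvalue[grp] - yes_v
        if yes_n < min_docs or no_n < min_docs:
            continue  # assumed admissibility check: both sides need documents
        score = (yes_n * entropy(yes_v, yes_n) + no_n * entropy(no_v, no_n)) / totalcount[grp]
        if grp not in best or score < best[grp][0]:
            best[grp] = (score, term)

# Group 1 is the root: 1309 passengers, 500 survivors (index 0 is unused).
totalcount, totalvalue = [0, 1309], [0, 500]
best = {}
evaluate_split("class=1", [0, 323], [0, 200], totalcount, totalvalue, best)
evaluate_split("class=3", [0, 709], [0, 181], totalcount, totalvalue, best)
evaluate_split("gender=f", [0, 466], [0, 339], totalcount, totalvalue, best)
print(best[1])
```

With these counts the gender split scores lowest, which agrees with the slides choosing gender as the first split.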
Main Loop - First Iteration
foreach field (class, fsize, gender)
foreach term (class=1,class=2,class=3...)
get group stats...
Apply Best Splits
Each split is a combination of a target group, a
condition, a positive destination group, and a
negative...
Apply Best Splits
Using inverted index, iterate over docs that
match split condition
If current document is in targeted gr...
Apply Best Splits
gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….
group[0] = 1
group[1] = 1
group[2] = 1
group[3] = 1
group[4...
Apply Best Splits
gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….
group[0] = 3
group[1] = 1
group[2] = 1
group[3] = 1
group[4...
Apply Best Splits
gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….
group[0] = 3
group[1] = 1
group[2] = 3
group[3] = 1
group[4...
Apply Best Splits
gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23….
group[0] = 3
group[1] = 2
group[2] = 3
group[3] = 2
group[4...
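The walkthrough above (target group 1, condition gender=f, matching docs moved to group 3, the rest to group 2) can be sketched as follows. `apply_split` is a hypothetical name, and handling the non-matching docs with a second pass over the group array is an assumption about how the negative destination gets assigned.

```python
def apply_split(postings, groups, target, positive, negative):
    """Move docs of `target` into `positive` if they match, else `negative`."""
    # Pass 1: walk the posting list of the split condition.
    for doc in postings:
        if groups[doc] == target:
            groups[doc] = positive
    # Pass 2: whatever is still in the target group did not match.
    for doc, grp in enumerate(groups):
        if grp == target:
            groups[doc] = negative

groups = [1] * 9                # docs 0..8 all start in the root
gender_f = [0, 2, 4, 6, 8]      # prefix of the gender=f posting list
apply_split(gender_f, groups, target=1, positive=3, negative=2)
print(groups)
```

The resulting array for docs 0 to 8 alternates 3, 2, 3, 2, ..., matching the group values shown at the start of the second iteration.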
Main Loop
foreach field
foreach term
get group stats
evaluate splits
apply best splits
repeat n times or until no more spl...
1
1
iter #1
1
iter #1
gender = female
1
iter #1

2
gender ≠ female

3
gender = female
1
iter #1

2
iter #2

3
Main Loop - Second Iteration
foreach field (class, fsize, gender)
foreach term (class=1,class=2,class=3...)
get group stat...
Get Group Stats
for current field/term (class=1)
foreach doc (0,1,2,3,4,5,6,7,8...)
grp = grps[doc] (3,2,3,2,3,2,3,2,3…)
i...
Get Group Stats
for current field/term (class=2)
foreach doc (323,324,325,326,327,328,329...)
grp = grps[doc] (2,3,2,2,2,2...
Get Group Stats
for current field/term (class=3)
foreach doc (600,601,602,603,604,605,606...)
grp = grps[doc] (2,2,2,3,3,2...
Get Group Stats
for current field/term (gender=female)
foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….)
grp = grps[doc] (3,3...
Get Group Stats
for current field/term (gender=male)
foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...)
grp = grps[doc] (2,2,...
What About
Inequality Splits?
e.g. class ≤ 2
Main Loop + Inequality Splits
foreach field
foreach term
get group stats
evaluate splits
apply best splits for each group
...
Main Loop + Inequality Splits
foreach field
reset inequality stats
foreach term
get group stats
evaluate splits
apply best...
Main Loop + Inequality Splits
foreach field
reset inequality stats
foreach term
get group stats
update inequality stats
ev...
Main Loop + Inequality Splits
foreach field
reset inequality stats
foreach term
get group stats
update inequality stats
ev...
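One way to read "reset / update inequality stats" is as running totals over a field's terms in sorted order, so that after processing term t the totals describe the split "field ≤ t". The sketch below makes that assumption explicit; it tracks a single group for brevity, and `inequality_stats` is a hypothetical name.

```python
def inequality_stats(term_stats):
    """Turn per-term (term, count, vsum) rows into cumulative '<= term' rows.

    term_stats must be sorted by term, as the inverted index stores them.
    """
    running_count = 0   # reset once per field
    running_vsum = 0
    cumulative = []
    for term, count, vsum in term_stats:
        running_count += count   # update after each term's group stats
        running_vsum += vsum
        cumulative.append((term, running_count, running_vsum))
    return cumulative

# Per-term stats for the class field in the root group, from the slides.
class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]
print(inequality_stats(class_stats))
```

The second row reproduces the class ≤ 2 split from the slides (600 passengers, 319 survivors) without a second scan of the index.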
Scalability
Performs quite well on a single machine
Worked well for a while, but started to hit limits
Ultimately needed t...
Multiple Machine
Implementation
Hadoop?
Hadoop
Experimented with using Hadoop
Each level took five sequential map reduce jobs
Much slower than single machine; rep...
Partition Data
Inverted Index
Inverted Index
Inverted Index
Inverted Index
Shard 1

Shard 2
Machine 1

Machine 2

Shard 1

Shard 2
Main Loop
foreach field
foreach term
get group stats
evaluate splits
apply best splits for each group
repeat n times or un...
Main Loop
foreach field
foreach term
get group stats
evaluate splits
apply best splits for each group
repeat n times or un...
FTGS: field, term, group, stats
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsi...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239

Sorted

fsize=1|1|23...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fsi...
FTGS Stream - Single Machine
class=1|1|323|200

class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
fs...
FTGS Stream - Single Machine
class=1|1|323|200

class=2|1|277|119

class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126
f...
FTGS Stream - Single Machine
class=1|1|323|200

class=2|1|277|119

class=3|1|709|181

fsize=0|1|790|239
fsize=1|1|235|126
...
FTGS Stream - Single Machine
class=1|1|323|200

class=2|1|277|119

class=3|1|709|181

fsize=0|1|790|239

fsize=1|1|235|126...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119

class=3|1|709|181

fsize=0|1|790|239

fsize=1|1|235|126
...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181

fsize=0|1|790|239

fsize=1|1|235|126

...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239

fsize=1|1|235|126

f...
FTGS Stream - Single Machine
class=1|1|323|200
class=2|1|277|119
class=3|1|709|181
fsize=0|1|790|239
fsize=1|1|235|126

fs...
FTGS Stream
How to distribute?
Machine 1

Machine 2

Shard 1

Shard 2
FTGS 1
Machine 2

Shard 1

Shard 2
FTGS 1

FTGS 2

Shard 1

Shard 2
FTGS 1

FTGS 2
Machine 3

Shard 1

Shard 2
FTGS 1

Merge

FTGS 2

Machine 3

Shard 1

Shard 2
FTGS Stream Merge
class=1|1|198|111
class=2|1|277|119
class=3|1|511|129
fsize=0|1|790|239
fsize=1|1|94|53
fsize=2|1|75|48
...
FTGS Stream Merge
class=1|1|125|89
class=3|1|198|52
fsize=1|1|141|73
fsize=2|1|84|42
fsize=3|1|22|13
fsize=4|1|19|5
fsize=...
FTGS Stream Merge
class=1|1|198|111
class=2|1|277|119
class=3|1|511|129
fsize=0|1|790|239

class=1|1|125|89
class=3|1|198|...
FTGS Stream Merge
class=1|1|198|111
class=2|1|277|119
class=3|1|511|129

class=1|1|125|89
+

fsize=0|1|790|239

fsize=3|1|...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=1|1|1...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=1|1|1...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|511|129

class=3|1|198|52

fsize=0|1|7...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129
fsize=0|1|79...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=0|1|7...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=0|1|7...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=1|1|1...
FTGS Stream Merge
class=1|1|198|111

class=1|1|125|89

class=2|1|277|119

class=3|1|198|52

class=3|1|511|129

fsize=1|1|1...
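Since every shard emits its stream in the same sorted order, the merge can be sketched as a k-way merge that sums the stats of equal keys. This is an illustration, not Imhotep's implementation: the `field=term|group|count|vsum` reading of each line is inferred from the slides, `parse` and `merge_ftgs` are hypothetical names, and the inputs are the class rows of the two shard streams shown above.

```python
import heapq
from itertools import groupby

def parse(line):
    """Split 'field=term|group|count|vsum' into a typed tuple."""
    term, grp, count, vsum = line.split("|")
    return term, int(grp), int(count), int(vsum)

def merge_ftgs(*streams):
    """Merge sorted FTGS streams, summing count and vsum for equal keys."""
    def key(row):
        return row[:2]  # (field=term, group)
    merged = heapq.merge(*(map(parse, s) for s in streams), key=key)
    for (term, grp), rows in groupby(merged, key=key):
        rows = list(rows)
        yield term, grp, sum(r[2] for r in rows), sum(r[3] for r in rows)

shard_1 = ["class=1|1|198|111", "class=2|1|277|119", "class=3|1|511|129"]
shard_2 = ["class=1|1|125|89", "class=3|1|198|52"]
merged = list(merge_ftgs(shard_1, shard_2))
print(merged)
```

The merged rows equal the single-machine stream (class=1: 323/200, class=3: 709/181), and because the output is itself a sorted stream, merges can be stacked in a tree across machines.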
Shard 1

Shard 2

Shard 3

Shard 4

Shard 5

Shard 6
FTGS 1

FTGS 2

FTGS 3

FTGS 4

FTGS 5

FTGS 6
k-way merge

FTGS 1

FTGS 2

FTGS 3

FTGS 4

FTGS 5

FTGS 6
FTGS 1-6

FTGS 1

FTGS 2

FTGS 3

FTGS 4

FTGS 5

FTGS 6
FTGS 1-6

FTGS 7-12

FTGS 13-18
FTGS 1-18

FTGS 1-6

FTGS 7-12

FTGS 13-18
FTGS 1-36

FTGS 1-18

FTGS 19-36
Main Loop
foreach field
foreach term
get group stats
evaluate splits
apply best splits for each group
repeat n times or un...
FTGS

FTGS 1-6

FTGS 7-12

FTGS 13-18
Regroup

Regroup 1-6

Regroup 7-12

Regroup 13-18
Imhotep
Imhotep
Distributed System that does efficient FTGS
and Regroup operations on inverted indexes
Imhotep
32 machines
2 cpu x 6 core xeon westmere E5649
128GB RAM
10x1TB 7200 RPM SATA
Total:
384 cores, 4TB RAM, 320TB dis...
Imhotep
Decision tree on 13 billion documents
Imhotep
Decision tree on 13 billion documents
330GB → ~25 bytes per doc
Imhotep
Decision tree on 13 billion documents
330GB → ~25 bytes per doc
First FTGS: 314 seconds
First Regroup: 9.6 seconds
Imhotep
Decision tree on 13 billion documents
330GB → ~25 bytes per doc
First FTGS: 314 seconds (36.3 million terms)
First...
Imhotep
Distributed System that does efficient FTGS
and Regroup operations
Powers our internal analytical tools
Imhotep
Distributed System that does efficient FTGS
and Regroup operations
Powers our internal analytical tools
… and more
Imhotep - Next @IndeedEng Talk
Sharding and shard management
Session / FTGS network protocol
Memory management
Inverted In...
Conclusion
Now scales to larger and larger data sets by
adding more machines
Increased freshness and frequency of builds
D...
Continuous Improvement

Sponsored Job Click-through Rate (CTR)
Thanks.
Q&A
More Questions?
Jason

David

James

Jeff
Next @IndeedEng Talk
Imhotep: Large Scale Analytics
and Machine Learning at Indeed
Jeff Plaisance, Engineering Manager
Mar...
[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

Video available at: http://www.youtube.com/watch?v=MFilAoiV5nE

Decision trees are a widely used machine learning technique for supervised classification. Indeed's data sets consist of tens of billions of documents with millions of distinct features. Since decision trees back some of our most important features, we built a custom distributed system to efficiently train them. Every day, we now build dozens of decision trees across this data. This same system now powers our internal analytical tools that enable quick data-driven decision-making at Indeed.

This presentation provides a brief introduction to decision trees followed by a detailed overview of our approach to building them. The talk will be presented by our CTO, Andrew Hudson.


[@IndeedEng] Machine Learning at Indeed: Scaling Decision Trees

  1. 1. go.indeed.com/IndeedEngTalks
  2. 2. Machine Learning at Indeed Scaling Decision Trees
  3. 3. Andrew Hudson CTO
  4. 4. I help people get jobs.
  5. 5. Indeed is a Search Engine for Jobs
  6. 6. Which jobs to show?
  7. 7. 18,749 jobs
  8. 8. Which jobs to show? Maximize job seeker’s chance to get the job
  9. 9. Which jobs to show? Maximize job seeker’s chance to get the job ● ● ● ● Will job seeker click on the job? Is the job still available? Will job seeker apply to the job? Is job seeker qualified for the job?
  10. 10. Which jobs to show? Maximize job seeker’s chance to get the job ● ● ● ● Will job seeker click on the job? Is the job still available? Will job seeker apply to the job? Is job seeker qualified for the job?
  11. 11. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior
  12. 12. How? Log job seeker behavior Analyze logs, what best explains why they clicked on some jobs and not on others? May help predict future behavior Supervised learning
  13. 13. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  14. 14. Supervised Learning Approaches Neural networks Bayesian methods Decision trees Genetic programming Logistic model tree Nearest neighbor Support Vector Machines Random forests Boosting Bagging Regression Ensemble methods
  15. 15. Supervised Learning Approaches Decision trees Genetic programming Logistic model tree Random forests Bagging Boosting Ensemble methods
  16. 16. Decision Trees
  17. 17. What is a Decision Tree? A tree like structure that presents a relevant sequence of questions which determine a path and ultimately some outcome or prediction
  18. 18. I’m Thinking About Buying a Laptop
  19. 19. I’m Thinking About Buying a Laptop Is quality important?
  20. 20. I’m Thinking About Buying a Laptop Is quality important? NO ASUS
  21. 21. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has
  22. 22. I’m Thinking About Buying a Laptop Is quality important? YES Want to run linux? NO ASUS or whatever woot has
  23. 23. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? NO MACBOOK
  24. 24. I’m Thinking About Buying a Laptop Is quality important? NO ASUS or whatever woot has YES Want to run linux? YES LENOVO NO MACBOOK
  25. 25. I’m Thinking About Buying a Laptop Is quality important? YES NO ASUS or whatever woot has IDGAF DELL Want to run linux? YES NO MACBOOK HELLYES SYSTEM76 LENOVO
  26. 26. Benefits of Decision Trees Algorithm relatively simple to understand and implement Model produced also human understandable
  27. 27. Decision Tree Learning Programmatic creation of decision trees
  28. 28. Decision Tree Learning Given a set of documents, split it into two or more subsets that optimize some criteria Repeat this process until a set can no longer be split
  29. 29. Titanic Example 1309 passengers 500 survivors 38.2% survival rate What best explains who survived?
  30. 30. What best explains who survived? class: class of ticket (first, second, or third); fsize: family size (number of family members onboard); gender: male or female.
  31. 31. 1309 passengers 500 survivors 38.2% survival
  32. 32. class = 1 1309 passengers 500 survivors 38.2% survival
  33. 33. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival
  34. 34. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival
  35. 35. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = ?
  36. 36. Score conditional entropy
  37. 37. Conditional Entropy as Score lower conditional entropy ↓ less uncertainty about prediction based on term
  38. 38. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267
  39. 39. class = 1 323 passengers 200 survivors 61.9% survival 1309 passengers 500 survivors 38.2% survival class ≠ 1 986 passengers 300 survivors 30.4% survival Score = 0.6267 Best Score: 0.6267, class = 1
  40. 40. class = 1 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  41. 41. class ≤ 2 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  42. 42. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6267, class = 1
  43. 43. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  44. 44. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6267, class = 1
  45. 45. class ≤ 2 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival Score = 0.6244 class > 2 709 passengers 181 survivors 25.5% survival Best Score: 0.6244, class ≤ 2
  46. 46. class ≠ 3 600 passengers 319 survivors 53.2% survival 1309 passengers 500 survivors 38.2% survival class = 3 709 passengers 181 survivors 25.5% survival Score = 0.6244 Best Score: 0.6244, class ≤ 2
  47. 47. gender = female 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  48. 48. gender = female 466 passengers 339 survivors 72.7% survival 1309 passengers 500 survivors 38.2% survival Best Score: 0.6244, class ≤ 2
  49. 49. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Best Score: 0.6244, class ≤ 2
  50. 50. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5225 Best Score: 0.6244, class ≤ 2
  51. 51. gender ≠ female 843 passengers 161 survivors 19.1% survival 1309 passengers 500 survivors 38.2% survival gender = female 466 passengers 339 survivors 72.7% survival Score = 0.5225 Best Score: 0.5225, gender=f
  52. 52. fsize ≠ 0 519 passengers 261 survivors 50.3% survival 1309 passengers 500 survivors 38.2% survival fsize = 0 790 passengers 239 survivors 30.3% survival Score = 0.6448 Best Score: 0.5225, gender=f
  53. 53. Best Score: 0.5225, gender=f
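The split scores in this walk-through can be recomputed from the passenger and survivor counts alone. The sketch below assumes the score is the conditional entropy of survival given the split, measured with natural logarithms; that base is an assumption, chosen because it reproduces the class-split scores on the slides to about three decimal places. The function names are illustrative, not taken from the talk.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a yes/no outcome that happens with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def conditional_entropy(count, vsum, total_count, total_vsum):
    """Entropy of the outcome on each side of a split, weighted by side size."""
    score = 0.0
    sides = ((count, vsum), (total_count - count, total_vsum - vsum))
    for n, positives in sides:
        if n > 0:
            score += (n / total_count) * binary_entropy(positives / n)
    return score


# (passengers matching the condition, survivors among them), from the slides
candidates = {
    "class = 1": (323, 200),
    "class <= 2": (600, 319),
    "gender = female": (466, 339),
    "fsize = 0": (790, 239),
}
scores = {
    name: conditional_entropy(count, vsum, 1309, 500)
    for name, (count, vsum) in candidates.items()
}
best = min(scores, key=scores.get)  # lower conditional entropy is better

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("best split:", best)
```

A split and its complement (for example fsize = 0 and fsize ≠ 0) get the same score, because the two sides are simply swapped.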
  54. 54. 19.1% survival 72.7% survival
  55. 55. gender=male 843 passengers 161 survivors 19.1% survival
  56. 56. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival
  57. 57. class = 1 179 passengers 61 survivors 34.1% survival gender=male 843 passengers 161 survivors 19.1% survival class ≠ 1 664 passengers 100 survivors 15.1% survival Score = 0.4700
  58. 58. class = 1 class ≠ 1
  60. 60. 34.1% survival 15.1% survival
  61. 61. 38.2%
  62. 62. 38.2% 19.1% 72.7% MALE FEMALE
  63. 63. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1
  64. 64. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% CLASS≠1 CLASS=1 13.1% 33.9% FSIZE≠2 FSIZE=2
  65. 65. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% FSIZE≠2 FSIZE=2
  66. 66. 38.2% 19.1% 72.7% MALE FEMALE 15.1% 34.1% 49.1% 93.2% CLASS≠1 CLASS=1 CLASS>2 CLASS<=2 13.1% 33.9% 24.4% 54.9% FSIZE≠2 FSIZE=2 FSIZE>2 FSIZE<=2
  67. 67. Predicting Click Probabilities. Passenger → Job Impression; Survived → Clicked on Job. For each candidate job, follow the path through the tree, then take the click-through rate of the terminal node.
  68. 68. Simplified Decision Tree for query="sales"
      sales?
        NO: account?
          YES: 3.8%
          NO: manager?
            YES: 2.1%
            NO: 1.9%
        YES: representative?
          NO: manager?
            YES: 2.9%
            NO: associate?
              YES: 1.8%
              NO: 2.6%
          YES: outside?
            YES: 5.1%
            NO: service?
              YES: 2.9%
              NO: inside?
                YES: 4.4%
                NO: 4.6%
  69. 69. job title = “sales representative”
  70. 70. job title = “account executive”
  71. 71. job title = “outside sales representative”
  72. 72. job title = “sales associate”
  73. 73. job title = “inside sales representative”
  74. 74. job title = “sales manager”
  75. 75. job title = “sales consultant”
  76. 76. job title = “store manager”
  77. 77. job title = “service sales representative”
  78. 78. job title = “customer service representative”
  79. 79. Final CTR Predictions: outside sales representative 5.1%; sales representative 4.6%; inside sales representative 4.4%; account executive 3.8%; sales manager 2.9%; service sales representative 2.9%; sales consultant 2.6%; store manager 2.1%; customer service representative 1.9%; sales associate 1.8%
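The per-title predictions can be reproduced by walking a small word-presence tree. The tree below is a hypothetical reconstruction: its shape is inferred from the flattened slide text and from the final CTR list, and the nested-tuple encoding and `predict_ctr` are illustrative names, not Indeed's implementation.

```python
# Hypothetical reconstruction of the simplified query="sales" tree. An internal
# node is (word, subtree_if_present, subtree_if_absent); a leaf is a CTR.
TREE = (
    "sales",
    ("representative",
        ("outside", 0.051,
            ("service", 0.029,
                ("inside", 0.044, 0.046))),
        ("manager", 0.029,
            ("associate", 0.018, 0.026))),
    ("account", 0.038,
        ("manager", 0.021, 0.019)),
)


def predict_ctr(tree, job_title):
    """Walk the tree by testing whether each node's word occurs in the title."""
    words = set(job_title.lower().split())
    node = tree
    while isinstance(node, tuple):
        word, if_present, if_absent = node
        node = if_present if word in words else if_absent
    return node


titles = [
    "sales associate",
    "store manager",
    "outside sales representative",
    "account executive",
    "sales representative",
]
# Rank candidate jobs by predicted click-through rate, highest first.
ranked = sorted(titles, key=lambda title: predict_ctr(TREE, title), reverse=True)
for title in ranked:
    print(f"{predict_ctr(TREE, title):.1%}  {title}")
```

Scoring a job costs one set lookup per tree level, which keeps ranking cheap even when many candidate jobs match a query.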
  80. 80. Single Machine Implementation
  81. 81. Overview
  82. 82. Tree Building Strategies One node at a time - depth first - breadth first
  83. 83. 1 Depth First
  84. 84. 1 2 3 Depth First
  85. 85. 1 2 4 3 5 Depth First
  86. 86. 1 2 5 4 6 3 7 Depth First
  90. 90. 1 2 5 4 6 3 8 7 Depth First 9
  91. 91. 1 Breadth First
  92. 92. 1 2 3 Breadth First
  93. 93. 1 2 4 3 5 Breadth First
  94. 94. 1 2 4 3 5 6 Breadth First 7
  95. 95. 1 2 5 4 8 3 6 9 Breadth First 7
  97. 97. 1 2 5 4 8 3 9 6 10 7 11 Breadth First
  98. 98. 1 2 5 4 8 3 9 6 10 7 11 Breadth First 12 13
  99. 99. Tree Building Strategies One node at a time - depth first - breadth first One layer at a time, all nodes simultaneous
  100. 100. 1
  101. 101. 1 iter #1
  102. 102. 1 iter #1 2 3
  103. 103. 1 iter #1 2 iter #2 3
  104. 104. 1 iter #1 2 3 iter #2 4 5 6 7
  105. 105. 1 iter #1 2 3 iter #2 4 iter #3 5 6 7
  106. 106. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  108. 108. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 iter #4 9 0 10 11 12 13
  109. 109. 1 iter #1 2 3 iter #2 5 4 6 7 iter #3 8 9 0 10 11 12 13
  110. 110. Data Format
      id  class  fsize  gender  survived
       0      1      0       f         1
       1      1      3       m         1
       2      1      3       f         0
       3      1      3       m         0
       4      1      3       f         0
       5      1      0       m         1
       6      1      1       f         1
       7      1      0       m         0
       8      1      2       f         1
       9      1      0       m         0
      10      1      1       m         0
      11      1      1       f         1
      12      1      0       f         1
      13      1      0       f         1
      14      1      0       m         1
      15      1      0       m         0
      16      1      1       m         0
      17      1      1       f         1
      18      1      0       f         1
      19      1      0       m         0
      ….
  111. 111. Data Format Create an inverted index Key to efficiently building one layer at a time
  112. 112. Inverted Index: maps terms to the list of documents that contain that term. Terms and docs are stored in sorted order.
  113. 113. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606….
  114. 114. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Field
  115. 115. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Term
  116. 116. Inverted Index class=1 → 0,1,2,3,4,5,6,7,8,9,10,11,12,13…. class=2 → 323,324,325,326,327,328,329…. class=3 → 600,601,602,603,604,605,606…. Docs
  119. 119. Inverted Index fsize=0 → 0,5,7,9,12,13,14,15,18,19,22…. fsize=1 → 6,10,11,16,17,26,27,36,49,50…. fsize=2 → 8,20,21,42,76,77,78,79,81,82…. fsize=3 → 1,2,3,4,54,55,56,57,90,339…. fsize=4 → 249,250,251,252,253,449,806…. ….
  120. 120. Inverted Index gender=f → 0,2,4,6,8,11,12,13,17,18,21…. gender=m → 1,3,5,7,9,10,14,15,16,19,20….
  121. 121. Inverted Index survived=0 → 2,3,4,7,9,10,15,16,19,25…. survived=1 → 0,1,5,6,8,11,12,13,14,17….
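A minimal sketch of building such an index from the row-oriented sample data follows. The rows are the first 20 documents of the slide's table; the nested-dict layout and the name `build_inverted_index` are assumptions for illustration and say nothing about how Lucene or Flamdex store their indexes.

```python
from collections import defaultdict

# First 20 rows of the slide's sample table: (id, class, fsize, gender, survived)
ROWS = [
    (0, 1, 0, "f", 1), (1, 1, 3, "m", 1), (2, 1, 3, "f", 0), (3, 1, 3, "m", 0),
    (4, 1, 3, "f", 0), (5, 1, 0, "m", 1), (6, 1, 1, "f", 1), (7, 1, 0, "m", 0),
    (8, 1, 2, "f", 1), (9, 1, 0, "m", 0), (10, 1, 1, "m", 0), (11, 1, 1, "f", 1),
    (12, 1, 0, "f", 1), (13, 1, 0, "f", 1), (14, 1, 0, "m", 1), (15, 1, 0, "m", 0),
    (16, 1, 1, "m", 0), (17, 1, 1, "f", 1), (18, 1, 0, "f", 1), (19, 1, 0, "m", 0),
]
FIELDS = ("class", "fsize", "gender", "survived")


def build_inverted_index(rows):
    """Map field -> term -> sorted list of ids of the docs containing that term."""
    index = {field: defaultdict(list) for field in FIELDS}
    # Visiting docs in id order keeps every posting list sorted.
    for doc_id, *values in sorted(rows):
        for field, term in zip(FIELDS, values):
            index[field][term].append(doc_id)
    # Sort the terms of each field as well, so iteration order is deterministic.
    return {field: dict(sorted(terms.items())) for field, terms in index.items()}


index = build_inverted_index(ROWS)
print(index["gender"]["f"])
print(index["survived"][1])
```

Because both terms and doc ids are sorted, a full pass over one field touches each document once per field, in a predictable order.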
  122. 122. Inverted Index Implementations Lucene Flamdex
  123. 123. Primary Lookup Tables. groups[doc]: where in the tree each doc is; initialized to all ones, since all docs start in the root. values[doc]: the value to be classified for each doc; in this case it’s 1 if survived, 0 otherwise.
  124. 124. Primary Lookup Tables values[doc] Constructed from an inverted index of the values Invert the field of interest (e.g. survived)
  125. 125. Main Loop Overview foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
  126. 126. Main Loop - First Iteration foreach field (class, fsize, gender)
  127. 127. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...)
  128. 128. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  129. 129. Get Group Stats. count[grp]: count of how many documents within that group contain the current term, initialized to zeros. vsum[grp]: sum of the value to be classified over the documents within that group that contain the current term, initialized to zeros.
  130. 130. Get Group Stats for current field/term
  131. 131. Get Group Stats for current field/term foreach doc
  132. 132. Get Group Stats for current field/term foreach doc grp = grps[doc]
  133. 133. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip
  134. 134. Get Group Stats for current field/term foreach doc grp = grps[doc] if grp == 0 skip count[grp]++ vsum[grp] += vals[doc]
  135. 135. Get Group Stats for current field/term (class=1)
  136. 136. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...)
  137. 137. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…)
  138. 138. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip
  139. 139. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  140. 140. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 0, vsum[1] = 0
  141. 141. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 1, vsum[1] = 1
  142. 142. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 2, vsum[1] = 2
  143. 143. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 3, vsum[1] = 2
  144. 144. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 4, vsum[1] = 2
  145. 145. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 5, vsum[1] = 2
  146. 146. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 6, vsum[1] = 3
  147. 147. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[1] = 323, vsum[1] = 200
  148. 148. class = 1 1309 passengers 500 survivors 38.2% survival
  149. 149. class = 1 Group 1 1309 passengers 500 survivors 38.2% survival
  151. 151. class = 1 323 passengers count[1] Group 1 1309 passengers 500 survivors 38.2% survival
  152. 152. class = 1 323 passengers 200 survivors count[1] vsum[1] Group 1 1309 passengers 500 survivors 38.2% survival
  153. 153. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[1] = 277, vsum[1] = 119
  154. 154. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (1,1,1,1,1,1,1,1,1…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[1] = 709, vsum[1] = 181
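The group-stats pass over one term's document list can be sketched as below. The array names follow the slides (`grps`, `vals`, `count`, `vsum`); the function signature and the `num_groups` parameter are assumptions, and the sample arrays cover only the first nine documents shown on the slides.

```python
def get_group_stats(docs, grps, vals, num_groups):
    """For one field/term, count docs and sum values per tree group.

    docs: ids of the docs that contain the term (its posting list)
    grps: grps[doc] is the tree node each doc currently sits in (0 = ignored)
    vals: vals[doc] is the value being predicted (1 if survived, else 0)
    """
    count = [0] * (num_groups + 1)
    vsum = [0] * (num_groups + 1)
    for doc in docs:
        grp = grps[doc]
        if grp == 0:
            continue  # doc is no longer part of any splittable group
        count[grp] += 1
        vsum[grp] += vals[doc]
    return count, vsum


# First nine docs of the class=1 posting list and their survived values.
vals = [1, 1, 0, 0, 0, 1, 1, 0, 1]
docs = range(9)

# First iteration: every doc is still in the root, group 1.
count, vsum = get_group_stats(docs, [1] * 9, vals, num_groups=1)
print(count[1], vsum[1])  # 9 5

# Second iteration: docs were moved to group 3 (female) or group 2 (male).
grps = [3, 2, 3, 2, 3, 2, 3, 2, 3]
count2, vsum2 = get_group_stats(docs, grps, vals, num_groups=3)
print(count2[2], vsum2[2], count2[3], vsum2[3])  # 4 2 5 3
```

One pass over a term's documents yields stats for every group at once, which is what makes building a whole layer per iteration practical.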
  155. 155. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits
  156. 156. Evaluate Splits. Consider the current field/term as a potential split for each group: 1) check if the split is admissible (balance check, significance check); 2) score the split (conditional entropy or some other heuristic); 3) keep the best-scoring split.
  157. 157. Evaluate Splits. totalcount[group] / totalvalue[group]: total number of documents and total values for each group, i.e. # passengers / # survivors. bestsplit[group] / bestscore[group]: current best split and score for each group, initially nulls.
  158. 158. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
  159. 159. foreach field/term (class=1) get group stats (count[1]=323,vsum[1]=200) foreach group if not admissible( … ) skip score = calcscore(cnt[grp], vsum[grp], totcnt[grp], totval[grp]) if score < bestscore[grp] bestscore[grp] = score bestsplit[grp] = field/term
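The evaluation step can be sketched as a loop over (field/term, group, count, vsum) records. This is an assumption-laden illustration: `calcscore` is taken to be conditional entropy in nats, and `admissible` is a hypothetical balance check with a made-up `min_docs` threshold, since the slides do not spell out the admissibility rules.

```python
import math


def side_entropy(n, positives, total):
    """Weighted entropy (nats) of the outcome on one side of a split."""
    h = 0.0
    for k in (positives, n - positives):
        if 0 < k < n:
            h -= (k / n) * math.log(k / n)
    return (n / total) * h if n else 0.0


def calcscore(cnt, vsum, totcnt, totval):
    """Conditional entropy of the outcome given 'doc has term' vs. not."""
    return side_entropy(cnt, vsum, totcnt) + side_entropy(totcnt - cnt, totval - vsum, totcnt)


def admissible(cnt, totcnt, min_docs=10):
    """Hypothetical balance check: both sides need at least min_docs docs."""
    return cnt >= min_docs and totcnt - cnt >= min_docs


def evaluate_splits(stats, totcnt, totval):
    """Keep the lowest-scoring admissible split for every group."""
    bestscore, bestsplit = {}, {}
    for field_term, grp, cnt, vsum in stats:
        if not admissible(cnt, totcnt[grp]):
            continue
        score = calcscore(cnt, vsum, totcnt[grp], totval[grp])
        if score < bestscore.get(grp, math.inf):
            bestscore[grp] = score
            bestsplit[grp] = field_term
    return bestsplit, bestscore


# First-iteration stats for the root group, taken from the slides.
stats = [
    ("class=1", 1, 323, 200), ("class=2", 1, 277, 119), ("class=3", 1, 709, 181),
    ("fsize=0", 1, 790, 239), ("fsize=7", 1, 8, 0),
    ("gender=f", 1, 466, 339), ("gender=m", 1, 843, 161),
]
bestsplit, bestscore = evaluate_splits(stats, totcnt={1: 1309}, totval={1: 500})
print(bestsplit[1])
```

Because the comparison is a strict less-than, the first of two equally good splits (here gender=f and its complement gender=m) is the one that is kept.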
  160. 160. Main Loop - First Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats evaluate splits apply best splits (bestsplit[1]=“gender=f”)
  161. 161. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group
  162. 162. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group. target group: 1
  163. 163. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group. target group: 1; condition: gender=female
  164. 164. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group. target group: 1; condition: gender=female; positive group: 3
  165. 165. Apply Best Splits Each split is a combination of a target group, a condition, a positive destination group, and a negative destination group. target group: 1; condition: gender=female; positive group: 3; negative group: 2
  167. 167. Apply Best Splits. Using the inverted index, iterate over the docs that match the split condition. If the current document is in the targeted group, move it to the positive group. At the end, move anything left in the target group to the negative group.
  168. 168. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 1 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  172. 172. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 1 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  173. 173. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 1 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  174. 174. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 1 group[7] = 1 group[8] = 1 group[9] = 1 group[10] = 1 group[11] = 1 group[12] = 1 group[13] = 1 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 1 group[18] = 1 group[19] = 1 group[20] = 1
  176. 176. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 1 group[2] = 3 group[3] = 1 group[4] = 3 group[5] = 1 group[6] = 3 group[7] = 1 group[8] = 3 group[9] = 1 group[10] = 1 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 1 group[15] = 1 group[16] = 1 group[17] = 3 group[18] = 3 group[19] = 1 group[20] = 1
  179. 179. Apply Best Splits gender=f -> 0,2,4,6,8,11,12,13,17,18,21,23…. group[0] = 3 group[1] = 2 group[2] = 3 group[3] = 2 group[4] = 3 group[5] = 2 group[6] = 3 group[7] = 2 group[8] = 3 group[9] = 2 group[10] = 2 group[11] = 3 group[12] = 3 group[13] = 3 group[14] = 2 group[15] = 2 group[16] = 2 group[17] = 3 group[18] = 3 group[19] = 2 group[20] = 2
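The two passes of the split application can be sketched as follows. The function name and signature are illustrative; the data is the first 21 documents from the slides, with the gender=f posting list cut off at doc 20.

```python
def apply_split(groups, matching_docs, target, positive, negative):
    """Regroup docs in place: target -> positive if they match, else negative.

    Assumes positive and negative are new group ids different from target.
    """
    # Pass 1: walk the posting list of the split condition.
    for doc in matching_docs:
        if groups[doc] == target:
            groups[doc] = positive
    # Pass 2: whatever is still in the target group did not match.
    for doc, grp in enumerate(groups):
        if grp == target:
            groups[doc] = negative


groups = [1] * 21  # docs 0..20 all start in the root, group 1
gender_f = [0, 2, 4, 6, 8, 11, 12, 13, 17, 18]  # posting list, docs below 21

apply_split(groups, gender_f, target=1, positive=3, negative=2)
print(groups)
```

Only the posting list of the winning term and the `groups` array are touched, so no document is ever physically moved or copied.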
  180. 180. Main Loop foreach field foreach term get group stats evaluate splits apply best splits repeat n times or until no more splits found
  181. 181. 1
  182. 182. 1 iter #1
  183. 183. 1 iter #1 gender = female
  184. 184. 1 iter #1 2 gender ≠ female 3 gender = female
  185. 185. 1 iter #1 2 iter #2 3
  186. 186. Main Loop - Second Iteration foreach field (class, fsize, gender) foreach term (class=1,class=2,class=3...) get group stats
  187. 187. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…)
  188. 188. Get Group Stats for current field/term (class=1) foreach doc (0,1,2,3,4,5,6,7,8...) grp = grps[doc] (3,2,3,2,3,2,3,2,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,1,0,0,0,1,1,0,1…) … count[2] = 179, vsum[2] = 61 count[3] = 144, vsum[3] = 139
  189. 189. Get Group Stats for current field/term (class=2) foreach doc (323,324,325,326,327,328,329...) grp = grps[doc] (2,3,2,2,2,2,3,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,1,0,0,0,0,1,0,1…) … count[2] = 171, vsum[2] = 25 count[3] = 106, vsum[3] = 94
  190. 190. Get Group Stats for current field/term (class=3) foreach doc (600,601,602,603,604,605,606...) grp = grps[doc] (2,2,2,3,3,2,2,3,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (0,0,0,1,1,1,1,1,0…) … count[2] = 493, vsum[2] = 75 count[3] = 216, vsum[3] = 106
  191. 191. Get Group Stats for current field/term (gender=female) foreach doc (0,2,4,6,8,11,12,13,17,18,21,23….) grp = grps[doc] (3,3,3,3,3,3,3,3,3,3,3,3…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,0,1,1,1,1,1,1…) … count[2] = 0, vsum[2] = 0 count[3] = 466, vsum[3] = 339
  192. 192. Get Group Stats for current field/term (gender=male) foreach doc (1,3,5,7,9,10,14,15,16,19,20,22...) grp = grps[doc] (2,2,2,2,2,2,2,2,2,2,2…) if grp == 0 skip count[grp]++ vsum[grp] += vals[doc] (1,0,1,0,0,0,1,0,0…) … count[2] = 843, vsum[2] = 161 count[3] = 0, vsum[3] = 0
  193. 193. What About Inequality Splits? e.g. class ≤ 2
  194. 194. Main Loop + Inequality Splits foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  195. 195. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  196. 196. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  197. 197. Main Loop + Inequality Splits foreach field reset inequality stats foreach term get group stats update inequality stats evaluate splits evaluate inequality splits apply best splits for each group repeat n times or until no more splits found
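One way to read "update inequality stats" is as a running total over a field's terms in sorted order, so that the stats for "field ≤ term" are available without another pass over the documents. The sketch below makes that assumption and, for brevity, handles a single group; a real implementation would presumably keep one running total per group.

```python
def inequality_stats(term_stats):
    """Turn per-term stats into stats for the split 'field <= term'.

    term_stats: (term, count, vsum) tuples for one field and one group,
    sorted by term. Yields (term, cumulative_count, cumulative_vsum).
    """
    run_count = 0  # "reset inequality stats" at the start of each field
    run_vsum = 0
    for term, count, vsum in term_stats:
        run_count += count  # "update inequality stats" after each term
        run_vsum += vsum
        yield term, run_count, run_vsum


# Per-term stats for the class field in the root group, from the slides.
class_stats = [(1, 323, 200), (2, 277, 119), (3, 709, 181)]
cumulative = list(inequality_stats(class_stats))
print(cumulative)  # [(1, 323, 200), (2, 600, 319), (3, 1309, 500)]
```

The entry for term 2 gives the 600 passengers and 319 survivors of the class ≤ 2 split, which can then be scored exactly like an equality split.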
  198. 198. Scalability: performs quite well on a single machine. Worked well for a while, but started to hit limits. Ultimately needed to distribute to multiple machines.
  199. 199. Multiple Machine Implementation
  200. 200. Hadoop?
  201. 201. Hadoop: experimented with using Hadoop. Each level took five sequential MapReduce jobs. Much slower than a single machine: it repeatedly writes intermediate data and does lots of shuffling.
  202. 202. Hadoop: experimented with using Hadoop. Each level took five sequential MapReduce jobs. Much slower than a single machine: it repeatedly writes intermediate data and does lots of shuffling. Hadoop is not great for iterative algorithms.
  203. 203. Partition Data
  204. 204. Inverted Index
  207. 207. Inverted Index Shard 1 Shard 2
  208. 208. Machine 1 Machine 2 Shard 1 Shard 2
  209. 209. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found
  211. 211. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS
  212. 212. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS (F = field)
  213. 213. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS (F = field, T = term)
  214. 214. Main Loop foreach field foreach term get group stats evaluate splits apply best splits for each group repeat n times or until no more splits found FTGS (F = field, T = term, GS = group stats)
  215. 215. FTGS Stream - Single Machine class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  219. 219. FTGS Stream - Single Machine (sorted) class=1|1|323|200 class=2|1|277|119 class=3|1|709|181 fsize=0|1|790|239 fsize=1|1|235|126 fsize=2|1|159|90 fsize=3|1|43|30 fsize=4|1|22|6 fsize=5|1|25|5 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|466|339 gender=m|1|843|161
  234. 234. FTGS Stream How to distribute?
  235. 235. Machine 1 Machine 2 Shard 1 Shard 2
  236. 236. FTGS 1 Machine 2 Shard 1 Shard 2
  237. 237. FTGS 1 FTGS 2 Shard 1 Shard 2
  238. 238. FTGS 1 FTGS 2 Machine 3 Shard 1 Shard 2
  239. 239. FTGS 1 Merge FTGS 2 Machine 3 Shard 1 Shard 2
  240. 240. FTGS Stream Merge class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122 Machine 1
  241. 241. FTGS Stream Merge class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39 Machine 2
  242-251. FTGS Stream Merge (animation frames): the sorted streams from Machine 1 and Machine 2 are merged term by term. A term present in both streams has its stats added; a term present in only one stream passes through unchanged. Merged output shown so far: class=1|1|198|111 + class=1|1|125|89 → class=1|1|323|200; class=2|1|277|119 (Machine 1 only); class=3|1|511|129 + class=3|1|198|52 → class=3|1|709|181; fsize=0|1|790|239 (Machine 1 only); fsize=1|1|94|53 + fsize=1|1|141|73 → fsize=1|1|235|126.
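
The merge of the two machine streams can be sketched in Python as follows. This is a minimal illustration, not Imhotep's implementation: it assumes each record reads `field=term|group|stat1|stat2`, that both streams are sorted by field, then term (integer terms numerically), then group, and that merging means adding the stats of matching records. The names `Rec`, `parse`, `fmt`, `sort_key` and `merge_two` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

# Streams copied from the Machine 1 and Machine 2 slides.
MACHINE_1 = ("class=1|1|198|111 class=2|1|277|119 class=3|1|511|129 "
             "fsize=0|1|790|239 fsize=1|1|94|53 fsize=2|1|75|48 fsize=3|1|21|17 "
             "fsize=4|1|3|1 fsize=5|1|3|1 gender=f|1|308|237 gender=m|1|678|122")
MACHINE_2 = ("class=1|1|125|89 class=3|1|198|52 fsize=1|1|141|73 fsize=2|1|84|42 "
             "fsize=3|1|22|13 fsize=4|1|19|5 fsize=5|1|22|4 fsize=6|1|16|4 "
             "fsize=7|1|8|0 fsize=10|1|11|0 gender=f|1|158|102 gender=m|1|165|39")


class Rec(NamedTuple):
    field: str
    term: str
    group: int
    stats: Tuple[int, ...]


def parse(text: str) -> Rec:
    """Parse one 'field=term|group|stat|stat' record (assumed layout)."""
    head, group, *stats = text.split("|")
    field, term = head.split("=", 1)
    return Rec(field, term, int(group), tuple(int(s) for s in stats))


def fmt(r: Rec) -> str:
    return f"{r.field}={r.term}|{r.group}|" + "|".join(str(s) for s in r.stats)


def sort_key(r: Rec):
    # Integer terms sort numerically (fsize=7 before fsize=10), others as strings.
    is_int = r.term.lstrip("-").isdigit()
    term_key = (0, int(r.term), "") if is_int else (1, 0, r.term)
    return (r.field, term_key, r.group)


def merge_two(a: Iterable[Rec], b: Iterable[Rec]) -> Iterator[Rec]:
    """Merge two sorted streams, adding stats when both hold the same key."""
    ia, ib = iter(a), iter(b)
    ra, rb = next(ia, None), next(ib, None)
    while ra is not None or rb is not None:
        if rb is None or (ra is not None and sort_key(ra) < sort_key(rb)):
            yield ra
            ra = next(ia, None)
        elif ra is None or sort_key(rb) < sort_key(ra):
            yield rb
            rb = next(ib, None)
        else:  # same field, term and group on both sides
            yield ra._replace(stats=tuple(x + y for x, y in zip(ra.stats, rb.stats)))
            ra, rb = next(ia, None), next(ib, None)


merged = merge_two(map(parse, MACHINE_1.split()), map(parse, MACHINE_2.split()))
for rec in merged:
    print(fmt(rec))
```

Because both inputs are consumed in order, the merge only ever holds one record per stream in memory, which is what makes streaming the result on to another machine practical.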
  252. 252. Shard 1 Shard 2 Shard 3 Shard 4 Shard 5 Shard 6
  253. 253. FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  254. 254. k-way merge FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  255. 255. FTGS 1-6 FTGS 1 FTGS 2 FTGS 3 FTGS 4 FTGS 5 FTGS 6
  256. 256. FTGS 1-6 FTGS 7-12 FTGS 13-18
  257. 257. FTGS 1-18 FTGS 1-6 FTGS 7-12 FTGS 13-18
  258. 258. FTGS 1-36 FTGS 1-18 FTGS 19-36
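
The hierarchical merge pictured above (shard streams merged per machine, then merged again across machines) can be sketched with the standard library's `heapq.merge`. This is an illustrative sketch under assumptions: records are modelled as `((field, term, group), stats)` tuples with integer terms, every input stream is already sorted by key, and the `fan_in` of 6 is only an example value. `k_way_merge` and `tree_merge` are hypothetical names.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Key = Tuple[str, int, int]            # (field, term, group); integer terms assumed
Record = Tuple[Key, Tuple[int, ...]]  # (key, stats)


def k_way_merge(streams: List[Iterable[Record]]) -> Iterator[Record]:
    """Merge k sorted streams and add up the stats of records sharing a key."""
    ordered = heapq.merge(*streams, key=lambda rec: rec[0])
    for key, run in groupby(ordered, key=lambda rec: rec[0]):
        # Sum the stats column by column across all records with this key.
        totals = tuple(sum(col) for col in zip(*(stats for _, stats in run)))
        yield key, totals


def tree_merge(streams: Iterable[Iterable[Record]], fan_in: int = 6) -> Iterator[Record]:
    """Merge streams level by level, at most `fan_in` inputs per merge node."""
    level = list(streams)
    if not level:
        return iter(())
    while len(level) > 1:
        level = [k_way_merge(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return iter(level[0])


# Three tiny shard streams, each sorted by key.
shard_a = [(("fsize", 0, 1), (790, 239)), (("fsize", 1, 1), (94, 53))]
shard_b = [(("fsize", 1, 1), (141, 73)), (("fsize", 6, 1), (16, 4))]
shard_c = [(("fsize", 10, 1), (11, 0))]
for key, stats in tree_merge([shard_a, shard_b, shard_c]):
    print(key, stats)
```

Every merge node is lazy, so records flow from the leaves to the root without any level materializing its full output.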
  259-262. Main Loop: foreach field, foreach term: get group stats (FTGS), evaluate splits; then apply best splits for each group (Regroup); repeat n times or until no more splits found.
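
The main loop can be sketched on a single machine as below. This is a simplified illustration, not the production code: it assumes binary labels, candidate splits of the form `field == term`, information gain as the split score (the slides here do not name the criterion), and heap-style group numbering where group `g` splits into `2g` and `2g + 1`. All function names are hypothetical.

```python
import math
from collections import defaultdict

MIN_GAIN = 1e-9  # assumed threshold: ignore splits that do not reduce entropy


def entropy(n, pos):
    """Binary entropy in bits of a group with n docs, pos of them positive."""
    if n == 0 or pos in (0, n):
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def ftgs(docs, labels, groups):
    """Yield (field, term, group, count, positives) sorted by field, term, group."""
    stats = defaultdict(lambda: [0, 0])
    for doc, y, g in zip(docs, labels, groups):
        for field, term in doc.items():
            s = stats[(field, term, g)]
            s[0] += 1
            s[1] += y
    for key in sorted(stats):  # terms of one field must share a type
        yield (*key, *stats[key])


def grow(docs, labels, max_rounds=3):
    """Return the final group id of every doc after up to max_rounds splits."""
    groups = [1] * len(docs)
    for _ in range(max_rounds):
        totals = defaultdict(lambda: [0, 0])
        for y, g in zip(labels, groups):
            totals[g][0] += 1
            totals[g][1] += y
        best = {}  # group -> (gain, field, term)
        for field, term, g, n, pos in ftgs(docs, labels, groups):
            total_n, total_pos = totals[g]
            rest_n, rest_pos = total_n - n, total_pos - pos
            if rest_n == 0:
                continue  # every doc in the group has this term: no split
            children = (n * entropy(n, pos) + rest_n * entropy(rest_n, rest_pos)) / total_n
            gain = entropy(total_n, total_pos) - children
            if gain > best.get(g, (MIN_GAIN,))[0]:
                best[g] = (gain, field, term)
        if not best:
            break  # no more splits found
        # Regroup: apply each group's best split to its docs.
        for i, doc in enumerate(docs):
            g = groups[i]
            if g in best:
                _, field, term = best[g]
                groups[i] = 2 * g if doc.get(field) == term else 2 * g + 1
    return groups


docs = [{"gender": "f", "class": 1}, {"gender": "f", "class": 3},
        {"gender": "m", "class": 1}, {"gender": "m", "class": 3}]
labels = [1, 1, 0, 0]
print(grow(docs, labels))
```

The split evaluation only reads the FTGS stream plus per-group totals, so it never needs the raw documents; only the regroup step touches them.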
  263. 263. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  264. 264. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
  265. 265. FTGS FTGS 1-6 FTGS 7-12 FTGS 13-18
  266. 266. Regroup Regroup 1-6 Regroup 7-12 Regroup 13-18
  267. 267. Imhotep
  268. 268. Imhotep: a distributed system that does efficient FTGS and Regroup operations on inverted indexes
  269. 269. Imhotep: 32 machines, each with 2 CPUs x 6-core Xeon Westmere E5649, 128GB RAM, 10x1TB 7200 RPM SATA. Total: 384 cores, 4TB RAM, 320TB disk
  275. 275. Imhotep: Decision tree on 13 billion documents; 330GB → ~25 bytes per doc. First FTGS: 314 seconds (36.3 million terms). First Regroup: 9.6 seconds (7 groups). Second FTGS: 57 seconds. Second Regroup: 23 seconds (217 groups)
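
The cluster totals and the bytes-per-document figure can be checked with a few lines of arithmetic. The inputs come from the slides; the throughput figures at the end are back-of-envelope derivations, not numbers stated in the talk, and decimal gigabytes are assumed for the index size.

```python
machines = 32

cores = machines * 2 * 6        # 2 CPUs x 6 cores per machine
ram_tb = machines * 128 / 1024  # 128GB per machine, 1024GB per TB
disk_tb = machines * 10 * 1     # 10 disks x 1TB per machine

docs = 13e9
index_bytes = 330e9             # 330GB, decimal gigabytes assumed
bytes_per_doc = index_bytes / docs

# Derived, not from the slides: rough throughput of the first FTGS pass.
first_ftgs_seconds = 314
docs_per_second = docs / first_ftgs_seconds
terms_per_second = 36.3e6 / first_ftgs_seconds

print(f"{cores} cores, {ram_tb:.0f}TB RAM, {disk_tb}TB disk")
print(f"{bytes_per_doc:.1f} bytes per doc")
print(f"{docs_per_second / 1e6:.1f}M docs/s, {terms_per_second / 1e3:.0f}K terms/s")
```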
  276. 276. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools
  277. 277. Imhotep Distributed System that does efficient FTGS and Regroup operations Powers our internal analytical tools … and more
  278. 278. Imhotep - Next @IndeedEng Talk: Sharding and shard management; Session / FTGS network protocol; Memory management; Inverted indexes; FTGS merge; Regroup operations; Fault tolerance
  279. 279. Conclusion: Now scales to larger and larger data sets by adding more machines; increased freshness and frequency of builds; decision trees have lots of tunable components, regularly get 1% wins via A/B test
  280. 280. Continuous Improvement Sponsored Job Click-through Rate (CTR)
  281. 281. Thanks.
  282. 282. Q&A
  283. 283. More Questions? Jason David James Jeff
  284. 284. Next @IndeedEng Talk: Imhotep: Large Scale Analytics and Machine Learning at Indeed. Jeff Plaisance, Engineering Manager. March 26, 2014. http://engineering.indeed.com/talks
