go.indeed.com/IndeedEngTalks
Logrepo
Enabling Data-Driven Decisions
Jeff Chien
Software Engineer
Indeed Apply Team
Scale

More job searches worldwide than any other employment website.
● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages
I help people get jobs.
Job seeker flow using Indeed Apply
1. Search
2. View job
3. Click “Apply Now”
4. Submit application
Knowing how users interact with our system helps us make better products
Likelihood of applying to a job
[Chart: job seekers who have to upload a resume vs. job seekers who have an Indeed Resume]
We Have Questions
● What percentage of applications use Indeed
resumes?
● How many searches for “java” in “Austin”?
● How often are resumes edited?
● How long does it take to aggregate jobs?
Complicated Questions
How many applications
… to jobs from CareerBuilder
… by job seekers who searched for “java” in “Austin”
… used an Indeed resume?

Is the percentage different on mobile compared to web?
How much did this change between 2011 and 2014?
More Information

Better Decisions
More information
Need to log events
● job searches
● clicks
● applies
What to log
Client information
- unique user identifier, user agent, ip address…

User behavior
- clicks, alert signups…

Performance
- backend request duration, memory usage...

A/B test groups
- control and test groups
Better decisions
Use empirical data to make decisions
Not on assumptions or the highest-paid person’s opinion!
Objective
Collect data on user actions and system
performance from many different applications
in multiple data centers
How we build systems
Simple
Fast
Resilient
Scalable
Simple
Easy interface
Reuse familiar technologies
Fast
No impact to runtime performance
Data available soon
Resilient
Does not lose data in spite of system or
network failures
Scalable
Can handle large quantities of data
Requirements
Powerful enough to express diverse data
Store all data forever
Events stored at least once
Easy to add new data to logs
Easy to access logs in bulk
Time range based access
Non-Goals
Random access to individual events
Real time access to events
Complex data types
Logrepo
A distributed event logging system

Est. 2006
Logrepo stores log entries
Everything is a string
Key/value pairs
URL-encoded
Organic click log entry
uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&href=http%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3D%26l%3DNewburgh%252C%2BNY%26start%3D20&agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%3B+rv%3A26.0%29+Gecko%2F20100101+Firefox%2F26.0&raddr=173.50.255.255&ckcnt=17&cksz=1033&ctk=18dtbc6960nk20vd&ctkRcv=1&&
URL-decoded organic click log entry
uid=18dtbolr20nk23qh&
type=orgClk&
v=0&
tk=18dtbnn3p0nk20g9&
jobId=500&
onclick=1&
avgCmpRtg=2.9&
url=http://www.indeed.com/rc/clk&
href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20&
agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0&
...
Advantages
Human-readable
Arbitrary keys
Low overhead to add new key/value pairs
Self-describing
Easy to parse in any language
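To make the "easy to parse" point concrete, here is a minimal sketch of a parser for this format. It is not Logrepo's own parser; the class and method names (`LogEntryParser`, `parse`) are made up for illustration. It assumes an entry is a single line of `&`-separated, URL-encoded `key=value` pairs, as in the organic click example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Logrepo: turns one raw entry into a map.
final class LogEntryParser {

    /** Splits a raw entry on '&', then URL-decodes each key and value. */
    static Map<String, String> parse(String rawEntry) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps uid first
        for (String pair : rawEntry.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate empty pairs such as a trailing "&&"
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(decode(key), decode(value));
        }
        return fields;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String raw = "uid=18dtbolr20nk23qh&type=orgClk&v=0&jobId=500"
                + "&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&ckcnt=17&&";
        parse(raw).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Because every value is a string, callers convert types themselves, for example `Integer.parseInt(fields.get("jobId"))`.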
Required log entry keys
Every log entry has uid and type
Type is an arbitrary string
uid=18dtbolr20nk23qh&type=orgClk&...
UID format
uid=18ducm8u50nk23qh&type=jobsearch&...
UID is always the first key
Unique
16 characters
Base 32 [0-9a-v]
UID breakdown
uid=18ducm8u50nk23qh
Date = 2014-01-10
Time = 09:35:24.357

Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921
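The breakdown above can be reproduced with a few lines of code. This is a sketch, not Indeed's UID library, and the bit layout is an assumption inferred from this one example: the first 9 characters hold a 45-bit millisecond timestamp (consistent with the ~17 minute and ~9.3 hour prefixes used later in the talk), and the remaining 35 bits are read as 16 bits of server id, 4 bits of app instance id, 2 bits of version and 13 bits of random value. That split yields exactly the values shown for this UID, but the real field widths may differ.

```java
import java.time.Instant;

// Hypothetical decoder; field widths below are inferred, not documented.
final class UidDecoder {

    /** First 9 base 32 characters: milliseconds since the Unix epoch. */
    static long timestampMillis(String uid) {
        return Long.parseLong(uid.substring(0, 9), 32); // radix 32 uses 0-9a-v
    }

    /** Last 7 base 32 characters: the 35 bits after the timestamp. */
    private static long tail(String uid) {
        return Long.parseLong(uid.substring(9), 32);
    }

    static int serverId(String uid)      { return (int) (tail(uid) >>> 19); }
    static int appInstanceId(String uid) { return (int) ((tail(uid) >>> 15) & 0xF); }
    static int version(String uid)       { return (int) ((tail(uid) >>> 13) & 0x3); }
    static int randomValue(String uid)   { return (int) (tail(uid) & 0x1FFF); }

    public static void main(String[] args) {
        String uid = "18ducm8u50nk23qh";
        System.out.println(Instant.ofEpochMilli(timestampMillis(uid)));
        System.out.println("server=" + serverId(uid)
                + " app=" + appInstanceId(uid)
                + " version=" + version(uid)
                + " random=" + randomValue(uid));
    }
}
```

The decoded instant is in UTC (15:35:24.357 on 2014-01-10); the 09:35:24.357 on the slide matches it if the slide shows US Central time.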
UID generation
Unique IDs are unique
Random value avoids UID collisions
Random value is between 0 and 8191
Up to 8000 events per application instance per
millisecond
UID format benefits
Contains useful metadata
Compact format reduces memory
requirements
Easy to compare or sort events by time
Job seeker events
1. Search for jobs
2. Click on job
3. Apply to job
All events are part of the same flow
Parent-child relationships
between events
Events can reference other events with
&tk=18ducm8u50nk23qh...
Children know their parents
Parents don’t know their children
Extremely powerful model
Parent-child relationships
between events
An organic click points to the search it occurred
on
uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...
uid=18dtbolr20nk23qh&type=orgClk&v=0
&tk=18dtbnn3p0nk20g9&...
More jobsearch child events
Sponsored job clicks
Javascript errors
Job alert signups
And many more...
Job seeker views a job
job view: 18en3o3ov16r25rp
load IndeedApply
user submission
post to employer

uid=18en3o3ov16r25rp&type=viewjob&...
Indeed Apply loads
job view: 18en3o3ov16r25rp
load IndeedApply: 18en3o3s216ph6d5
user submission
post to employer

uid=18en3o3s216ph6d5&type=loadJs&vjtk=18en3o3ov16r25rp&...
Prepare job application
job view: 18en3o3ov16r25rp
load IndeedApply: 18en3o3s216ph6d5
user submission: 18en3qe0u16pi5ct
post to employer

uid=18en3qe0u16pi5ct&type=appSubmit&loadJsTk=18en3o3s216ph6d5&...
Submit job application
job view: 18en3o3ov16r25rp
load IndeedApply: 18en3o3s216ph6d5
user submission: 18en3qe0u16pi5ct
post to employer: 18en3qe2r0nji3h6

uid=18en3qe2r0nji3h6&type=postApp&appSubmitTk=18en3qe0u16pi5ct&...

POST /apply HTTP/1.1
Host: employer.com
{
"applicant": {
"name": "John Doe",
"email": "jobseeker@gmail.com",
"phone": "555-555-5555"
},
"jobTitle": "Software Engineer"
...
Javascript latency ping
At start of page load, browser executes js to
ping Indeed
Server receives the ping and logs an event
Parent job search and child js
latency ping
uid=18dqpc3lm16pi2an&type=jobsearch&...
uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an
Subtracting UID timestamps yields duration
uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an
uid timestamp = Jan 9, 2014 00:00:05.253
tk timestamp = Jan 9, 2014 00:00:05.046
Latency = 1389247205253 - 1389247205046 = 207 ms
Approximates perceived latency to jobseeker
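The subtraction can be done directly on the two UIDs. A small sketch, assuming (as the worked numbers here imply) that the first 9 base 32 characters of a UID are its millisecond timestamp; `UidLatency` and its methods are illustrative names, not Logrepo API.

```java
// Hypothetical helper: latency between a child event and the parent it references.
final class UidLatency {

    /** Millisecond timestamp held in the first 9 base 32 characters of a UID. */
    static long timestampMillis(String uid) {
        return Long.parseLong(uid.substring(0, 9), 32);
    }

    /** Time between the parent event (the tk value) and the child event (the uid). */
    static long latencyMillis(String childUid, String parentTk) {
        return timestampMillis(childUid) - timestampMillis(parentTk);
    }

    public static void main(String[] args) {
        // uid and tk from the latency ping entry above
        System.out.println(latencyMillis("18dqpc3s516pi566", "18dqpc3lm16pi2an") + " ms");
    }
}
```

Since the timestamp sits in the UID itself, neither entry needs a separate time field for this calculation.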
West coast perceived latency in
California vs. Washington
Writing log entries from apps
LogEntry entry =
factory.createLogEntry("search");
entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...
entry.commit();
Creating a log entry
LogEntry entry =
factory.createLogEntry("search");

Creates a log entry with UID and type set
UID timestamp tied to createLogEntry() call
Populating a log entry
entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...
Lists
Separate values with commas
String groups = "foo,bar,baz";
logEntry.setProperty("grps", groups);
// uid=...&grps=foo%2Cbar%2Cbaz&...
Lists of Tuples
Encapsulate each tuple in parentheses
Comma-separate elements within tuple
// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";
logEntry.setProperty("jobs", jobs);
// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
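The two conventions are easy to wrap in helpers. The sketch below is illustrative only (`LogValueLists` is a made-up name) and assumes that individual elements never contain commas or parentheses, since the format shown has no escaping for them. The `urlEncode` step stands in for whatever encoding the logging library applies on commit.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers for list-valued and tuple-valued properties.
final class LogValueLists {

    /** "foo", "bar", "baz" -> "foo,bar,baz" */
    static String joinList(List<String> values) {
        return String.join(",", values);
    }

    /** ["123","1.0"], ["400","0.8"] -> "(123,1.0)(400,0.8)" */
    static String joinTuples(List<String[]> tuples) {
        StringBuilder sb = new StringBuilder();
        for (String[] tuple : tuples) {
            sb.append('(').append(String.join(",", tuple)).append(')');
        }
        return sb.toString();
    }

    /** Inverse of joinTuples for a URL-decoded value. */
    static List<String[]> parseTuples(String value) {
        List<String[]> tuples = new ArrayList<>();
        int open = value.indexOf('(');
        while (open >= 0) {
            int close = value.indexOf(')', open);
            if (close < 0) {
                throw new IllegalArgumentException("unbalanced tuple in: " + value);
            }
            tuples.add(value.substring(open + 1, close).split(",", -1));
            open = value.indexOf('(', close);
        }
        return tuples;
    }

    /** URL-encodes a value the way it appears in the stored entry. */
    static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(urlEncode(joinList(Arrays.asList("foo", "bar", "baz"))));
        String jobs = joinTuples(Arrays.asList(
                new String[] {"123", "1.0"}, new String[] {"400", "0.8"}));
        System.out.println(urlEncode(jobs));
    }
}
```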
Committing a log entry
After log entry is fully populated...
entry.commit();
Jason Koppe
System Administrator
I engineer
systems
that help
people get
jobs.
Before logrepo
log4j - Java logging framework
● Code - what
● Configuration - define what goes to
where
● Appender - where (file, smtp)

http://logging.apache.org/log4j/1.2/
Reusing log4j for logrepo
Redundancy from the start
Write to local disk (FileAppender)
Write to remote server #1 (? Appender)
Write to remote server #2 (? Appender)
Writing to a remote server
syslog
Protocol for transporting messages across
an IP network
Est. 1980s
http://tools.ietf.org/html/rfc5424
Using log4j with syslog
Out-of-the-box, log4j only supported UDP
syslog
UDP could result in data loss
Avoiding data loss
TCP acknowledges delivery and retransmits lost packets
Use TCP!
Creating a reliable Appender
SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport
Resilient for short network and syslog
server downtimes
Choosing a syslog daemon
syslog-ng
syslog daemon which supports TCP
Est. 1998
http://www.balabit.com/network-security/syslog-ng
Redundancy with log4j
Write to local disk (FileAppender)
Write to remote server #1 (SyslogTcpAppender)
Write to remote server #2 (SyslogTcpAppender)
Redundancy over TCP
Each syslog-ng server
receives unsorted log entries
immediately flushes entries to files on disk
called raw logs
Quick redundancy over TCP
Optimized for redundancy
raw logs are probably out-of-order

each app writes to syslog independently
Optimize for read access patterns
LogRepositoryBuilder (“Builder”)
● sort
● deduplicate
● compress
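The three Builder steps can be sketched in a few lines. This is not the LogRepositoryBuilder itself; `SegmentBuilderSketch` is an illustrative, in-memory stand-in. It assumes each raw entry is one line starting with `uid=`, so plain string order equals UID order, and that duplicates are byte-identical copies of the same entry (for example, one received from each syslog server).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical in-memory version of sort + deduplicate + compress.
final class SegmentBuilderSketch {

    /** Returns gzip-compressed bytes holding the entries sorted and de-duplicated. */
    static byte[] buildSegment(Collection<String> rawEntries) {
        TreeSet<String> sorted = new TreeSet<>(rawEntries); // sorts, drops exact duplicates
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            for (String entry : sorted) {
                gzip.write((entry + "\n").getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses a segment back into its lines, in stored order. */
    static List<String> readSegment(byte[] segment) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(segment)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                entries.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

A real Builder would bound its memory and flush to segment files on disk rather than return a byte array.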
Builder architecture
Builder creates segment files
uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...
Repeated strings compress well

uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...

compresses by 85%
Archive directory structure
/orgClk/15mt/0.log4181.seg.gz
● orgClk: logentry type
● 15mt: 4-char UID prefix, base 32 (~9.3 hour time period)
● 15mt/0: 5-char UID prefix, base 32 (~17 minute time period)
● log4181: unique number; supports more than 1 segment file per type per 5-char UID prefix
Multiple segment files
Keep Builder memory usage fixed

When Builder memory fills, it flushes to disk

Each flush creates files for 5-char UID prefix
Builder creates the archive
Redundancy
Ensure archive consistency
● Delayed Builder on second server
● Add new segment files for log entries missed by first Builder
● Causes multiple segment files for a 5-char UID prefix
Providing access to logrepo
LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP
client as quickly as possible
Reader request protocol
1. Start time
2. End time
3. Logrepo type
Reader request using netcat
Request fields, in order:
1. start time (ms since 1970-01-01, the start of Unix time)
2. end time (ms since 1970-01-01)
3. logrepo type

$ echo 1295905740000 1295913600000 orgClk

Send echo across a TCP session; results come back UID-sorted
$ echo 1295905740000 1295913600000 orgClk \
| nc 192.168.0.1 9999
uid=15mt00l710k3262q&type=orgClk&v=0&...
uid=15mt00l780k137d9&type=orgClk&v=0&...
...
uid=15mt7ggvj142h06k&type=orgClk&v=0&...
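The same request can be made from a program. The sketch below is not one of Indeed's client libraries; `ReaderClientSketch` is an illustrative name. It assumes what the netcat example suggests: the request is one newline-terminated line of three space-separated fields, the Reader answers with one entry per line, and it closes the connection when the range is exhausted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical minimal client for the Reader's request protocol.
final class ReaderClientSketch {

    /** Builds the request line: start time, end time, logrepo type. */
    static String requestLine(long startMillis, long endMillis, String type) {
        return startMillis + " " + endMillis + " " + type + "\n";
    }

    /** Sends one request and hands each returned entry to the callback. */
    static void stream(String host, int port, long startMillis, long endMillis,
                       String type, Consumer<String> onEntry) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(requestLine(startMillis, endMillis, type)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) { // assumed: server closes when done
                onEntry.accept(line);
            }
        }
    }
}
```

A call such as `ReaderClientSketch.stream("192.168.0.1", 9999, 1295905740000L, 1295913600000L, "orgClk", System.out::println)` would mirror the netcat command. Entries go to a callback rather than a list because a time range can hold far more entries than fit in memory.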
Reading entries from archive
1295905740000 1295913600000 orgClk

1. Isolate to the type directory
2. Convert request timestamps to UID prefix
uidPrefixFromTime(1295905740000) = 15mt0
uidPrefixFromTime(1295913600000) = 15mt7
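The slides name `uidPrefixFromTime` without showing its body. One plausible sketch follows, assuming the UID timestamp is a 45-bit millisecond value written as 9 base 32 characters, so the 5-char prefix is the timestamp with its low 20 bits (about 17 minutes) dropped. That assumption reproduces both values above. `prefixesBetween` and `segmentGlob` are extra illustrative helpers, not names from the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical implementation of the timestamp-to-prefix conversion.
final class UidPrefixSketch {

    /** 5-char base 32 prefix shared by all UIDs in the same ~17 minute bucket. */
    static String uidPrefixFromTime(long epochMillis) {
        StringBuilder prefix = new StringBuilder(Long.toString(epochMillis >>> 20, 32));
        while (prefix.length() < 5) {
            prefix.insert(0, '0'); // left-pad very small timestamps
        }
        return prefix.toString();
    }

    /** Every 5-char prefix a request range touches, in time order. */
    static List<String> prefixesBetween(long startMillis, long endMillis) {
        List<String> prefixes = new ArrayList<>();
        for (long bucket = startMillis >>> 20; bucket <= endMillis >>> 20; bucket++) {
            prefixes.add(uidPrefixFromTime(bucket << 20));
        }
        return prefixes;
    }

    /** Glob for the segment files of one type and prefix, e.g. orgClk/15mt/0* */
    static String segmentGlob(String type, String prefix) {
        return type + "/" + prefix.substring(0, 4) + "/" + prefix.substring(4) + "*";
    }

    public static void main(String[] args) {
        for (String p : prefixesBetween(1295905740000L, 1295913600000L)) {
            System.out.println(segmentGlob("orgClk", p));
        }
    }
}
```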
3. Find segments matching first UID prefix (15mt0)
$ ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz
4. Read sorted segments simultaneously, merge into a single sorted stream (numbers show merge order)
/orgClk/15mt/0.log3094.seg.gz:
1 uid=15mt000080g1i0j5&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
2 uid=15mt00l710k3262q&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
Merged stream:
1 uid=15mt000080g1i0j5&type=orgClk&...
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
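Merging several already-sorted segments is a standard k-way merge. The sketch below shows the idea with a priority queue holding one cursor per segment. It is not the Reader's actual code: `SegmentMerger` is an illustrative name, and the segments are plain in-memory lists rather than gzip files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge of sorted segments into one sorted stream.
final class SegmentMerger {

    /** Tracks the next unread entry of one segment. */
    private static final class Cursor {
        final Iterator<String> rest;
        String head;

        Cursor(Iterator<String> entries) {
            this.rest = entries;
            this.head = entries.next();
        }
    }

    static List<String> merge(List<? extends Iterable<String>> segments) {
        // The cursor with the smallest current entry is always polled first.
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (Iterable<String> segment : segments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                queue.add(new Cursor(it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                queue.add(smallest); // re-queue with its next entry
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> log3094 = Arrays.asList(
                "uid=15mt000080g1i0j5&type=orgClk", "uid=15mt00l780k137d9&type=orgClk");
        List<String> log4181 = Arrays.asList(
                "uid=15mt00l710k3262q&type=orgClk", "uid=15mt00l790k1i2rs&type=orgClk");
        merge(Arrays.asList(log3094, log4181)).forEach(System.out::println);
    }
}
```

Comparing whole lines works here because every entry starts with `uid=` followed by a fixed-width UID whose base 32 digits sort in ASCII order. Only one entry per segment is held at a time, so a streaming version needs memory proportional to the number of segments, not the number of entries.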
5. Only return log entries between timestamps
1 uid=15mt000080g1i0j5&type=orgClk&... (before start time, not returned)
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
6. Read segments for each UID prefix, one prefix at a time
15mt0, 15mt1, 15mt2, 15mt3, 15mt4, 15mt5, 15mt6, 15mt7
7. Stop reading files when entry crosses
request boundary
The first years (2007 & 2008)
● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader
Growth
job seekers
products
datacenters
log entries
Multi-datacenter rationale

Latency
Redundancy
Multi-datacenter rationale

Job seekers
Logrepo in multiple datacenters
● Single datacenter
● Consumers
● Reader
● Every datacenter
● Applications producing logentries
● 2 syslog servers
● Builders (minimize Internet traffic)
Single datacenter archival
/dc1/orgClk/15mt/0.log4181.seg.gz
● orgClk: event type (orgClk means organic search result click)
● 15mt/0: 25-bit timestamp prefix, base 32 (~17-minute time period)
● log4181: random number
Multiple datacenter archival
/dc1/orgClk/15mt/0.log4181.seg.gz
● dc1: datacenter
● orgClk: event type (orgClk means organic search result click)
● 15mt/0: 25-bit timestamp prefix, base 32 (~17-minute time period)
● log4181: random number
Datacenter dirs avoid collisions
~$ ls */orgClk/15mt/0*
dc1/orgClk/15mt/0.log1481.seg.gz
dc3/orgClk/15mt/0.log1481.seg.gz

Different datacenters, same segment filename
Independent Builders
UID breakdown
uid=18ducm8u50nk23qh
Date = 2014-01-10
Time = 09:35:24.357

Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921
Using server ID for uniqueness
Each datacenter gets 256 server IDs
1. DC #1 uses 0 - 255
2. DC #2 uses 256 - 511
3. DC #3 uses 512 - 767
...
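The allocation above is a fixed-size block per datacenter, so mapping between server IDs and datacenters is simple integer arithmetic. A small sketch, with illustrative names (`ServerIdRanges` is not part of Logrepo) and datacenters numbered from 1 as on the slide:

```java
// Hypothetical helper for the 256-IDs-per-datacenter allocation.
final class ServerIdRanges {
    static final int IDS_PER_DATACENTER = 256;

    /** Datacenter number (1-based) that owns a server ID. */
    static int datacenterNumber(int serverId) {
        return serverId / IDS_PER_DATACENTER + 1;
    }

    /** First server ID in a datacenter's block. */
    static int firstServerId(int datacenterNumber) {
        return (datacenterNumber - 1) * IDS_PER_DATACENTER;
    }

    /** Last server ID in a datacenter's block. */
    static int lastServerId(int datacenterNumber) {
        return datacenterNumber * IDS_PER_DATACENTER - 1;
    }

    public static void main(String[] args) {
        for (int dc = 1; dc <= 3; dc++) {
            System.out.println("DC #" + dc + " uses "
                    + firstServerId(dc) + " - " + lastServerId(dc));
        }
    }
}
```

Because the server ID is embedded in every UID, disjoint ranges let Builders in different datacenters run independently without producing the same UID.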
The next years (2009 - 2011)
● Multiple datacenters
● 2 logrepo servers
● syslog-ng
● Builder
● Consumer datacenter
● Reader
● Consumers
More logentries

More consumers
Diverse requests
Single server disk bottleneck
Scaling logrepo reads
Bottleneck: single active Reader server
Goal: spread logrepo accesses across a
cluster of servers
Read logrepo from HDFS

Hadoop Distributed File System
(HDFS)
“a distributed file-system that stores data on
commodity machines, providing very high
aggregate bandwidth across the cluster.”
http://hadoop.apache.org/docs/stable1/hdfs_design.html
Using HDFS for logrepo access
Resilient logrepo in HDFS
Store each logentry on 3 servers
Push to HDFS quickly
Mirror every segment file into HDFS
Push to HDFS quickly
/dc1/orgClk/15mt/0.log4181.seg.gz

5-char UID prefix, base 32
~17-minute time period

500,000+ files per day
HDFS optimized for fewer files
Reducing the number of logrepo files in HDFS keeps us efficient

HDFSArchiver
Archive yesterday in HDFS
/dc1/orgClk/15mt/0.log4181.seg.gz
● orgClk: type
● 15mt: 20-bit timestamp prefix (~9.3 hour period)

2,500 files per day
Scaling logrepo in HDFS

500,000+ files per day

2,500 files per day
Logrepo
A distributed event logging system
Created @IndeedEng
● Application
● SyslogTcpAppender
● Builder
● Reader
● HDFSPusher
● HDFSReader
● HDFSArchiver

Open source
● log4j
● syslog-ng
● gzip
● rsync+ssh
● Hadoop
All time logrepo = 150 TB compressed
jobsearch event set
abredistime
acmetime
addltime
adsc
adsdelay
adsi
badsc
badsi
boostojc
boostoji
bsjc
bsjcwia
bsji
bsjindapplies
bsjindappviews
bsjrev
bsjwia
ckcnt
cksz
counts
ctkage
ctkagedays
dayofweek
dcpingtime
domTotalTime
ds-mpo

dsmiss
dstime
featemp
fj
freekwac
freekwarev
freesjc
freesjrev
frmtime
galatdelay
iplat
iplong
jslatdelay
jsvdelay
kwac
kwacdelay
kwai
kwarev
kwcnt
lacinsize
lacsgsize
lmstime
mpotime
mprtime
navTotTime
ndxtime

ojc
ojclong
ojcshort
ojcwia
oji
ojindapplies
ojindappviews
ojwia
oocsc
page
prcvdlatency
primfollowcnt
prvwoji
prvwojlat
prvwojopentime
prvwojreq
radsc
radsi
recidlookupbudget
rectime
redirCount
redirTime
relfollowcnt
respTime
returnvisit
rojc

roji
rqcnt
rqlcnt
rqqcnt
rrsjc
rrsji
rrsjrev
rsavail
rsjc
rsji
rsused
rsviable
serpsize
sjc
sjcdelay
sjclong
sjcnt
sjcshort
sjcwia
sji
sjindapplies
sjindappviews
sjrev
sjwia
sllat
sllong

sqc
sqi
sugtime
svj
svjnostar
svjstar
tadsc
tadsi
time
timeofday
totcnt
totfollowcnt
totrev
tottime
tsjc
tsjcwia
tsji
tsjindapplies
tsjindappviews
tsjrev
tsjwia
unqcnt
vp
wacinsize
wacsgsize
acmepage
acmereviewmod
acmeservice
acmesession
adclick
adcrequest
adcrev
adschannel
adsclick
adsenseclick
adve
advt
agghttp
aggjira
aggjob
aggjob_waldorf
aggsherlock
aggsourcehealth
agstiming
api
apijsv
apisearch
archiveindex
archiveindex_shingled_test
bin
carclicks
click
clickanalytics
cobrand
dctmismatch
draw
dupepairs
dupepairs_mini
dupepairs_old
dupepairsall
dupepairsall_mini
ejchecker
emilyops

feedbridge
globalnav
googlebot_organic
homepage
impression
indeedapply
jhst
jobalert
jobalertorganic
jobalertsearch
jobalertsponsored
jobexpiration
jobexpiration2
jobexpiration3
jobprocessed
jobqueueblock
jobsearch
jssquery
keywordAd
locsvc
lucyindexermain
mechanicalturk
mindyops
mobhomepage
mobil
mobile
mobileorganic
mobilesponsored
mobrecjobs
mobsearch
mobviewjob
myindeed
myindfunnel
myindpage
myindrezcreate
myindsession
old
opsesjasx

organic
orgmodel
orgmodelsubset
orgmodelsubset90
passportaccount
passportpage
passportsignin
ramsaccess
recjobs
recommendservice
resumedata
resumesearch
rexcontacts
rexfunnel
reximpression
rexsearch
rezSrchSearch
rezalert
rezalertfunnel
rezfunnel
rezjserr
rezsrchrequest
rezview
searchablejobs
seo
session
sjmodel
sponsored
sysadappinfo
sysadapptiming
testndx
testndx1
testndx2
tmp
usrsvccache
usrsvcrequest
viewjob
webusersignin
Every day at Indeed
● Create 5 billion log entries
● App spends 0.03 ms to create each log entry
● Add 500 GB to the archive
● Add 1.5 TB to HDFS
● Consumers read from HDFS at 18.5 GB/s
● 100s of consumers request 1000 different
logrepo types
Four types of consumers
Ad-hoc command line
Standard Java programs
Hadoop map/reduce
Real-time monitoring
Command line access
$ echo 1388556000000 1388642400000 jobsearch \
| nc logrepo 9999
uid=18d6666o916r15g3&type=jobsearch&q=VP+IT
uid=18d6666ob0mp27aa&type=jobsearch&q=Lab+Tech
uid=18d6666ob0nl15ce&type=jobsearch&q=daycare
uid=18d6666og0nk24rb&type=jobsearch&q=Chef+Upscale
...
Slowest searches from log entries
Reuses standard unix tools and patterns
$ echo 1388556000000 1388642400000 jobsearch \
| nc logrepo 9999 \
| egrep -o '&searchTime=[^&]+' \
| egrep -o '[0-9]+' \
| sort -r -n \
| head
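For consumers that are programs rather than shell pipelines, the same "top 10 slowest" query is only a few lines. This sketch is illustrative (`SlowestSearches` is a made-up name) and, like the pipeline, assumes jobsearch entries carry a numeric `searchTime` value. It takes any stream of entry lines, for example lines read from the Reader's TCP response.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical Java equivalent of the egrep | sort | head pipeline.
final class SlowestSearches {
    private static final Pattern SEARCH_TIME = Pattern.compile("&searchTime=([0-9]+)");

    /** Returns the largest searchTime values, slowest first. */
    static List<Long> slowest(Stream<String> entries, int limit) {
        return entries
                .map(SEARCH_TIME::matcher)
                .filter(Matcher::find)                 // skip entries without searchTime
                .map(m -> Long.parseLong(m.group(1)))
                .sorted(Comparator.reverseOrder())
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample entries for illustration only.
        Stream<String> sample = Arrays.asList(
                "uid=a&type=jobsearch&searchTime=12&q=VP+IT",
                "uid=b&type=jobsearch&q=daycare",
                "uid=c&type=jobsearch&searchTime=305&q=Lab+Tech").stream();
        System.out.println(slowest(sample, 10));
    }
}
```

Like `sort`, this holds every extracted value before picking the top ones; a bounded heap would be the next step for very large ranges.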
Programmatic access is trivial
We have clients for
● java
● python
● php
● pig
A typical logrepo consumer
(single machine)
Reads one primary log event type
Reads a dozen child events per primary
Total size of each event set = 10KB
A typical logrepo consumer
(single machine)
Millions of events read per run
Thousands of consumers run each day
Tens of terabytes processed each day
Efficient Parsing
Important for single machine consumers
Log entry parsing too slow
Fast
Minimize memory usage
URL String Parsing
(now available on github)
4x faster than String.split(...), generates
50% less garbage
Parses 1 million log entries of size 0.5K
each in 3 seconds
https://github.com/indeedeng
http://go.indeed.com/urlparsing
Hadoop clients
Reliable, scalable, distributed computing
Most new consumers use Hadoop
Read log entries directly from HDFS
Divide and conquer to scale
Monitoring
Want to monitor
● Business metrics
● Operational metrics
“Available soon” isn’t good enough
Datadog
Third party monitoring service
Stream metrics to Datadog HQ
Real-time dashboards
Datadog
miniEPL
'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk",
'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)",
'mobile.mobsearch.oji': "SELECT tupleCount(orgRes), 'results' AS unit FROM mobsearch",
Getting logs into Datadog
Data redundancy
Replaying events
Click charging
Replaying events
1. Job alert email sign up broke for logged in users
2. Got alert parameters + jobsearch uid from access logs
3. Got account id from jobsearch log entries
4. Recreated job alert sign ups
Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
5. Profit!
What does logrepo enable?
Answering business and operational
questions
Data-driven decisions
Average cover letter length inside
US vs. outside US?
Mobile searches per hour in
JP vs. UK?
Resume creation by country?
Email alert opens by email domain?
Percent of app downloads from
iOS, Android, Windows?
How quickly does a datacenter take
on traffic after a failover?
Q&A
https://github.com/indeedeng
http://go.indeed.com/urlparsing
Next @IndeedEng Talk
Big Value from Big Data:
Building Decision Trees at Scale
Andrew Hudson, Indeed CTO
February 26, 2014
http://engineering.indeed.com/talks

Contenu connexe

En vedette

En vedette (14)

@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day
@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day
@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day
 
[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System
[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System
[@IndeedEng] From 1 To 1 Billion: Evolution of Indeed's Document Serving System
 
[@IndeedEng] Boxcar: A self-balancing distributed services protocol
[@IndeedEng] Boxcar: A self-balancing distributed services protocol [@IndeedEng] Boxcar: A self-balancing distributed services protocol
[@IndeedEng] Boxcar: A self-balancing distributed services protocol
 
Data-Driven off a Cliff: Anti-Patterns in Evidence-Based Decision Making
Data-Driven off a Cliff: Anti-Patterns in Evidence-Based Decision MakingData-Driven off a Cliff: Anti-Patterns in Evidence-Based Decision Making
Data-Driven off a Cliff: Anti-Patterns in Evidence-Based Decision Making
 
Part time job search presentation
Part time job search presentationPart time job search presentation
Part time job search presentation
 
Timesjobs.com Services
Timesjobs.com ServicesTimesjobs.com Services
Timesjobs.com Services
 
Part-time job
Part-time jobPart-time job
Part-time job
 
で、次は何がくるの? - 第2回 TIS Matsuri
で、次は何がくるの? - 第2回 TIS Matsuriで、次は何がくるの? - 第2回 TIS Matsuri
で、次は何がくるの? - 第2回 TIS Matsuri
 
Bridging Your 2016 Enrollment Gaps
Bridging Your 2016 Enrollment GapsBridging Your 2016 Enrollment Gaps
Bridging Your 2016 Enrollment Gaps
 
Naukri.com
Naukri.comNaukri.com
Naukri.com
 
Akka: Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Ac...
Akka: Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Ac...Akka: Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Ac...
Akka: Simpler Scalability, Fault-Tolerance, Concurrency & Remoting through Ac...
 
キメるClojure
キメるClojureキメるClojure
キメるClojure
 
Trialforce
Trialforce Trialforce
Trialforce
 
Naukri.com
Naukri.comNaukri.com
Naukri.com
 

Similaire à [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Evolving your Data Access with MongoDB Stitch
Evolving your Data Access with MongoDB StitchEvolving your Data Access with MongoDB Stitch
Evolving your Data Access with MongoDB Stitch
MongoDB
 
Splunk .conf2011: Search Language: Intermediate
Splunk .conf2011: Search Language: IntermediateSplunk .conf2011: Search Language: Intermediate
Splunk .conf2011: Search Language: Intermediate
Erin Sweeney
 

Similaire à [@IndeedEng] Logrepo: Enabling Data-Driven Decisions (20)

Scaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabScaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at Grab
 
Budapest Spark Meetup - Apache Spark @enbrite.ly
Budapest Spark Meetup - Apache Spark @enbrite.lyBudapest Spark Meetup - Apache Spark @enbrite.ly
Budapest Spark Meetup - Apache Spark @enbrite.ly
 
Evolving your Data Access with MongoDB Stitch
Evolving your Data Access with MongoDB StitchEvolving your Data Access with MongoDB Stitch
Evolving your Data Access with MongoDB Stitch
 
Elasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ SignalElasticsearch Performance Testing and Scaling @ Signal
Elasticsearch Performance Testing and Scaling @ Signal
 
Un-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew SkeltonUn-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew Skelton
 
Swift meetup22june2015
Swift meetup22june2015Swift meetup22june2015
Swift meetup22june2015
 
Managing Content Chaos
Managing Content ChaosManaging Content Chaos
Managing Content Chaos
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions