Video available at: http://youtu.be/y0WC1cxLsfo
At Indeed our applications generate billions of log events each month across our seven data centers worldwide. These events store user and test data that form the foundation for decision making at Indeed. We built a distributed event logging system, called Logrepo, to record, aggregate, and access these logs. In this talk, we'll examine the architecture of Logrepo and how it evolved to scale.
Jeff Chien joined Indeed as a software engineer in 2008. He has worked on the job search frontend and backend, advertiser, company data, and apply teams, and enjoys building scalable applications.
Jason Koppe is a Systems Administrator who has been with Indeed since late 2008. He has worked on infrastructure automation, monitoring, application resiliency, incident response, and capacity planning.
4. Scale
More job searches worldwide than any other employment website.
● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages
14. We Have Questions
● What percentage of applications use Indeed resumes?
● How many searches for “java” in “Austin”?
● How often are resumes edited?
● How long does it take to aggregate jobs?
15. Complicated Questions
How many applications
… to jobs from CareerBuilder
… by job seekers who searched for “java” in “Austin”
… used an Indeed resume?
Is the percentage different on mobile compared to web?
How much has this changed in 2011 compared to 2014?
18. What to log
Client information
- unique user identifier, user agent, ip address…
User behavior
- clicks, alert signups…
Performance
- backend request duration, memory usage...
A/B test groups
- control and test groups
31. Requirements
Powerful enough to express diverse data
Store all data forever
Events stored at least once
Easy to add new data to logs
Easy to access logs in bulk
Time range based access
46. UID generation
Unique IDs must actually be unique
A random value avoids UID collisions
The random value is between 0 and 8191
Up to 8000 events per application instance per millisecond
47. UID format benefits
Contains useful metadata
Compact format reduces memory requirements
Easy to compare or sort events by time
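As a hedged illustration of such a format: a millisecond timestamp followed by the random value, both base-32 encoded, yields compact IDs whose lexicographic order matches time order. The field widths, alphabet, and layout below are assumptions for the sketch, not Indeed's actual UID format.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a compact, time-sortable UID: a base-32 millisecond
// timestamp followed by a base-32 random value in [0, 8191]. Field widths
// and alphabet are illustrative assumptions, not Indeed's actual format.
public class UidSketch {
    // Encode value in base 32 (digits 0-9, then a-v), zero-padded to width.
    static String base32(long value, int width) {
        StringBuilder sb = new StringBuilder();
        do {
            int d = (int) (value & 31);
            sb.append((char) (d < 10 ? '0' + d : 'a' + (d - 10)));
            value >>>= 5;
        } while (value != 0);
        while (sb.length() < width) sb.append('0'); // pad, reversed below
        return sb.reverse().toString();
    }

    static String newUid(long timestampMillis) {
        int random = ThreadLocalRandom.current().nextInt(8192); // 0..8191
        return base32(timestampMillis, 9) + base32(random, 3);
    }

    public static void main(String[] args) {
        System.out.println(base32(1023, 3)); // 31,31 in base 32 -> "0vv"
        // Timestamp-first layout means lexicographic order == time order.
        String earlier = base32(1000, 9) + base32(5, 3);
        String later = base32(1001, 9) + base32(0, 3);
        System.out.println(earlier.compareTo(later) < 0);
    }
}
```

Putting the timestamp first is what makes events "easy to compare or sort by time": a plain string comparison suffices.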
48. Job seeker events
1. Search for jobs
2. Click on job
3. Apply to job
All events are part of the same flow
50. Parent-child relationships
between events
An organic click points to the search it occurred on
uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...
uid=18dtbolr20nk23qh&type=orgClk&v=0
&tk=18dtbnn3p0nk20g9&...
51. More jobsearch child events
Sponsored job clicks
Javascript errors
Job alert signups
And many more...
52. Job seeker views a job
Flow: job view (18en3o3ov16r25rp) → load IndeedApply → user submission → post to employer
uid=18en3o3ov16r25rp&type=viewjob&...
53. Indeed Apply loads
Flow: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission → post to employer
uid=18en3o3s216ph6d5&type=loadJs&vjtk=18en3o3ov16r25rp&...
54. Prepare job application
Flow: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer
uid=18en3qe0u16pi5ct&type=appSubmit&loadJsTk=18en3o3s216ph6d5&...
55. Submit job application
Flow: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer (18en3qe2r0nji3h6)
uid=18en3qe2r0nji3h6&type=postApp&appSubmitTk=18en3qe0u16pi5ct&...

POST /apply HTTP/1.1
Host: employer.com
{
  "applicant": {
    "name": "John Doe",
    "email": "jobseeker@gmail.com",
    "phone": "555-555-5555"
  },
  "jobTitle": "Software Engineer"
  ...
}
56. Javascript latency ping
At the start of page load, the browser executes JavaScript to ping Indeed
The server receives the ping and logs an event
61. Creating a log entry
LogEntry entry = factory.createLogEntry("search");
Creates a log entry with UID and type set
UID timestamp tied to createLogEntry() call
63. Lists
Separate values with commas
String groups = "foo,bar,baz";
logEntry.setProperty("grps", groups);
// uid=...&grps=foo%2Cbar%2Cbaz&...
64. Lists of Tuples
Encapsulate each tuple in parentheses
Comma-separate elements within tuple
// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";
logEntry.setProperty("jobs", jobs);
// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
65. Committing a log entry
After log entry is fully populated...
entry.commit();
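Putting slides 61–65 together: a committed entry conceptually serializes to a single URL-encoded line, with the UID and type followed by each property. A minimal sketch of that serialization (this is not Indeed's LogEntry implementation; the property ordering and encoding details are assumptions based on the examples above):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class LogEntrySketch {
    // Sketch of what a committed entry might serialize to:
    // uid and type first, then each property, values URL-encoded.
    static String serialize(String uid, String type, Map<String, String> props)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder("uid=" + uid + "&type=" + type);
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append('&').append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("grps", "foo,bar,baz");          // a list property
        props.put("jobs", "(123,1.0)(400,0.8)");   // a list of tuples
        System.out.println(serialize("18dtbnn3p0nk20g9", "jobsearch", props));
    }
}
```

Note how the commas and parentheses come out as `%2C`, `%28`, and `%29`, matching the encoded examples on slides 63 and 64.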
70. log4j - Java logging framework
● Code - what to log
● Configuration - defines what goes where
● Appender - where it goes (file, smtp)
http://logging.apache.org/log4j/1.2/
77. Creating a reliable Appender
SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport
Resilient to short network and syslog server downtimes
78. Choosing a syslog daemon
syslog-ng
syslog daemon which supports TCP
Est. 1998
http://www.balabit.com/network-security/syslog-ng
79. Redundancy with log4j
Write to local disk (FileAppender)
Write to remote server #1 (SyslogTcpAppender)
Write to remote server #2 (SyslogTcpAppender)
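The three-way redundancy above might look like the following log4j 1.2 configuration sketch. The appender package name, hostnames, file path, and option names here are illustrative assumptions; SyslogTcpAppender is Indeed's own class, and its real options may differ.

```properties
# Illustrative log4j.properties sketch -- class package and options assumed
log4j.logger.com.indeed.logrepo=INFO, localFile, syslog1, syslog2

# 1. Write to local disk
log4j.appender.localFile=org.apache.log4j.FileAppender
log4j.appender.localFile.File=/var/log/logrepo/events.log

# 2. Write to remote server #1
log4j.appender.syslog1=com.indeed.logging.SyslogTcpAppender
log4j.appender.syslog1.SyslogHost=logrepo1.example.com

# 3. Write to remote server #2
log4j.appender.syslog2=com.indeed.logging.SyslogTcpAppender
log4j.appender.syslog2.SyslogHost=logrepo2.example.com
```

Writing every event to three appenders means a single disk, network, or syslog server failure cannot lose the event stream.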
98. Multiple segment files
Keep Builder memory usage fixed
When Builder memory fills, it flushes to disk
Each flush creates files for 5-char UID prefix
104. Ensure archive consistency
● Delayed Builder on second server
● Adds new segment files for log entries missed by first Builder
● Causes multiple segment files for a 5-char UID prefix
105. Providing access to logrepo
LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP client as quickly as possible
114. Reading entries from archive
1295905740000 1295913600000 orgClk
15mt0
3. Find segments matching first UID prefix
ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz
115. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
/orgClk/15mt/0.log3094.seg.gz:
uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...
116. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
/orgClk/15mt/0.log3094.seg.gz:
1 uid=15mt000080g1i0j5&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
2 uid=15mt00l710k3262q&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
117. Reading entries from archive
1295905740000 1295913600000 orgClk
4. Read sorted segments simultaneously,
merge into a single sorted stream
1 uid=15mt000080g1i0j5&type=orgClk&...
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
118. Reading entries from archive
1295905740000 1295913600000 orgClk
5. Only return log entries between timestamps
1 uid=15mt000080g1i0j5&type=orgClk&...
2 uid=15mt00l710k3262q&type=orgClk&...
3 uid=15mt00l780k137d9&type=orgClk&...
4 uid=15mt00l790k1i2rs&type=orgClk&...
119. Reading entries from archive
1295905740000 1295913600000 orgClk
UID prefixes: 15mt0 15mt1 15mt2 15mt3 15mt4 15mt5 15mt6 15mt7
6. Read segments for each UID prefix, one prefix at a time
120. Reading entries from archive
1295905740000 1295913600000 orgClk
7. Stop reading files when entry crosses request boundary
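Steps 4–6 amount to a k-way merge of already-sorted segment files. A minimal sketch over in-memory lists (the real Reader streams from gzipped segment files, but the merge logic is the same idea), using the UIDs from slides 115–117:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SegmentMerge {
    // Merge already-sorted lists of UIDs into one sorted stream, the way the
    // Reader merges sorted segment files (a sketch, not Indeed's code).
    // Heap entries are {segment index, position within segment}.
    static List<String> merge(List<List<String>> segments) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> segments.get(a[0]).get(a[1])
                .compareTo(segments.get(b[0]).get(b[1])));
        for (int i = 0; i < segments.size(); i++)
            if (!segments.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(segments.get(top[0]).get(top[1]));
            if (top[1] + 1 < segments.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> seg1 = List.of("15mt000080g1i0j5", "15mt00l780k137d9");
        List<String> seg2 = List.of("15mt00l710k3262q", "15mt00l790k1i2rs");
        System.out.println(merge(List.of(seg1, seg2)));
    }
}
```

Because UIDs sort by time, the merged stream comes out in time order, matching the 1-2-3-4 ordering shown on slide 117.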
123. The first years (2007 & 2008)
● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader
144. Read logrepo from HDFS
Hadoop Distributed File System
(HDFS)
“a distributed file-system that stores data on
commodity machines, providing very high
aggregate bandwidth across the cluster.”
http://hadoop.apache.org/docs/stable1/hdfs_design.html
169. Every day at Indeed
● Create 5 billion log entries
● App spends 0.03 ms to create each log entry
● Add 500 GB to the archive
● Add 1.5 TB to HDFS
● Consumers read from HDFS at 18.5 GB/s
● 100s of consumers request 1000 different logrepo types
170. Four types of consumers
Ad-hoc command line
Standard Java programs
Hadoop map/reduce
Real-time monitoring
174. A typical logrepo consumer
(single machine)
Reads one primary log event type
Reads a dozen child events per primary
Total size of each event set = 10KB
175. A typical logrepo consumer
(single machine)
Millions of events read per run
Thousands of consumers run each day
Tens of terabytes processed each day
177. URL String Parsing
(now available on github)
4x faster than String.split(...), generates 50% less garbage
Parses 1 million log entries of size 0.5K each in 3 seconds
https://github.com/indeedeng
http://go.indeed.com/urlparsing
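For reference, consuming a logrepo entry is just splitting its URL-encoded key=value pairs. A plain-JDK sketch of that parsing is below; Indeed's open-source parser linked above does the same job faster and with less garbage, and its actual API is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class EntryParse {
    // Split a logrepo entry's URL-encoded key=value pairs into a map
    // using only standard JDK classes.
    static Map<String, String> parse(String entry)
            throws UnsupportedEncodingException {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : entry.split("&")) {
            int eq = pair.indexOf('=');
            String key = pair.substring(0, eq);
            String value = URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> f =
            parse("uid=18dtbnn3p0nk20g9&type=jobsearch&grps=foo%2Cbar%2Cbaz");
        System.out.println(f.get("type") + " " + f.get("grps"));
    }
}
```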
181. Hadoop clients
Reliable, scalable, distributed computing
Most new consumers use Hadoop
Read log entries directly from HDFS
Divide and conquer to scale
185. miniEPL
'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk",
'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)",
'mobile.mobsearch.oji': "SELECT tupleCount(orgRes), 'results' AS unit FROM mobsearch",
196. Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
5. Profit!
197. What does logrepo enable?
Answering business and operational questions
Data-driven decisions
205. Next @IndeedEng Talk
Big Value from Big Data:
Building Decision Trees at Scale
Andrew Hudson, Indeed CTO
February 26, 2014
http://engineering.indeed.com/talks