[@IndeedEng] Large scale interactive analytics with Imhotep

Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg

In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data - allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.

  1. go.indeed.com/IndeedEngTalks
  2. Large Scale Interactive Analytics with Imhotep
  3. Tom Bergman Product Manager
  4. Zak Cocos Manager Marketing Science
  5. We help people get jobs.
  6. What is Imhotep? Imhotep is a highly scalable analytics architecture for querying faceted datasets
  7. Open sourcing Imhotep Imhotep will be an OPEN SOURCE highly scalable analytics architecture for querying faceted datasets
  8. People Tools System Data
  9. People Tools Data System
  10. People Data Tools System
  11. People Data Tools System
  12. A Brief History of Analytics @Indeed
  13. What's best for the job seeker?
  14. Test & Measure EVERYTHING
  15. Query
  16. Query Location
  17. Query Location Impression
  18. Title: Front End Software Engineer Position: 1 Clicked: 0 Country: US Query: indeed software engineer Location: austin Timestamp: 2014-04-30T20:00:00 Organic Impression Log Entry
  19. Analytics on Raw Logs
  20. Ramses
  21. ● Search logs ● Extract metrics from matches ● Graph aggregated metrics Ramses
  22. ● Search logs ● Extract metrics from matches ● Graph aggregated metrics Input -> Query and Metric Output -> Aggregated metrics by bucket Ramses
  23. How many organic clicks did we have in Australia?
  24. QUERY country:au METRIC organic_clicks How many organic clicks did we have in Australia?
  25. How many organic clicks did we have in Australia?
  26. Does test group A or B have more revenue?
  27. QUERY testgroup:A, testgroup:B METRIC revenue Does test group A or B have more revenue?
  28. Does test group A or B have more revenue?
  29. How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
  30. QUERY from:yahoo AND country:(gb, de, jp) METRIC visits How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
  31. How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
  32. ● How many unique queries in the US? ● What are the top 50 queries in the US? ● How many clicks did each of those queries receive? Questions Ramses can’t answer
  33. Imhotep
  34. Began as a distributed iteration and group-by engine for building click prediction models. Imhotep Origins
  35. We use an iterative algorithm to build decision trees level-by-level. Decision Tree Builder
  36. Began as a distributed iteration and group-by engine for building click prediction models. Leveraged ability to do massive group-bys and aggregates to make real-time analytics engine. Imhotep Origins
  37. How many Android App users with accounts older than 30 days saved at least 1 job in the past week?
  38. What titles have the highest click-through rate for the query “Architecture” in the US? What about the lowest click-through rate?
  39. For job seekers who click on Google jobs in Ireland, which other companies’ jobs do they click on?
  40. Zak Cocos Manager Marketing Science
  41. I also help people get jobs.
  42. Marketing Sciences Research, analysis, and automation team supporting marketing initiatives
  43. Imhotep Imhotep is a highly scalable, [soon to be] open source, analytics architecture for querying faceted datasets
  44. Imhotep@Indeed Ad hoc exploration
  45. Imhotep@Indeed Ad hoc exploration Specific analysis
  46. Imhotep@Indeed Ad hoc exploration Specific analysis Extensible infrastructure
  47. Ad hoc exploration Public Crunchbase Dataset Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  48. Ad hoc exploration Public Crunchbase Dataset Document Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  49. Ad hoc exploration Public Crunchbase Dataset Fields Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  50. Ad hoc exploration Public Crunchbase Dataset Metric Source: CrunchBase CrunchBase 2013 Snapshot © 2013
  51. Interactive tool for exploring Imhotep data Imhotep Data Explorer
  52. Interactive tool for exploring Imhotep data Also: a badass hyperlinked pivot table Imhotep Data Explorer
  53. Imhotep is Large Scale Total size of all indexes: 125TB Jobsearch index (largest): 30TB ● Over 48 billion documents
  54. Query
  55. Query Location
  56. Query Location Organic Impression
  57. Organic Impression A job that was displayed as the result of a search
  58. Title
  59. Company Information
  60. Description
  61. Job Age
  62. abredistime acmetime addltime adsc adsdelay adsi badsc badsi boostojc boostoji bsjc bsjcwia bsji bsjindapplies bsjindappviews bsjrev bsjwia ckcnt cksz counts ctkage ctkagedays dayofweek dcpingtime domTotalTime ds-mpo dsmiss dstime featemp fj freekwac freekwarev freesjc freesjrev frmtime galatdelay iplat iplong jslatdelay jsvdelay kwac kwacdelay kwai kwarev kwcnt lacinsize lacsgsize lmstime mpotime mprtime navTotTime ndxtime ojc ojclong ojcshort ojcwia oji ojindapplies ojindappviews ojwia oocsc page prcvdlatency primfollowcnt prvwoji prvwojlat prvwojopentime prvwojreq radsc radsi recidlookupbudget rectime redirCount redirTime relfollowcnt respTime returnvisit rojc roji rqcnt rqlcnt rqqcnt rrsjc rrsji rrsjrev rsavail rsjc rsji rsused rsviable serpsize sjc sjcdelay sjclong sjcnt sjcshort sjcwia sji sjindapplies sjindappviews sjrev sjwia sllat sllong sqc sqi sugtime svj svjnostar svjstar tadsc tadsi time timeofday totcnt totfollowcnt totrev tottime tsjc tsjcwia tsji tsjindapplies tsjindappviews tsjrev tsjwia unqcnt vp wacinsize wacsgsize
  63. Organic Impression Document Title: Front End Software Engineer Position: 1 Clicked: 0 Country: US Query: indeed software engineer Location: austin Timestamp: 2014-04-30T20:00:00
  64. Organic Impression Index Title: Front End Software Engineer Position: 1 Clicked: 0 Country: US Query: indeed software engineer Location: austin Timestamp: 2014-04-30T20:00:00
  65. Imhotep Data Explorer can’t... Combine results from multiple datasets
  66. Combine results from multiple datasets Be easily automated Imhotep Data Explorer can’t...
  67. Imhotep Query Language (IQL)
  68. IQL - Imhotep Query Language Can combine results from multiple datasets Allows for automation of data tools
  69. IQL queries - requirements Index Date range Metrics
  70. IQL queries - optional Index Date range Metrics Filters Group by
  71. IQL - Metrics select count() from organic ‘2013-12-05’ ‘2013-12-10’ where country=ie and clicked=1 group by companyid Metrics
  72. select count() from organic ‘2013-12-05’ ‘2013-12-10’ where country=ie and clicked=1 group by companyid IQL - Indexes Index
  73. select count() from organic ‘2013-12-05’ ‘2013-12-10’ where country=ie and clicked=1 group by companyid IQL - Date Range Date Range
  74. select count() from organic ‘2013-12-05’ ‘2013-12-10’ where country=ie and clicked=1 group by companyid IQL - Filters Filters
  75. select count() from organic ‘2013-12-05’ ‘2013-12-10’ where country=ie and clicked=1 group by companyid IQL - Filters Groups
  76. IQL Question Do companies in Austin that have raised more than $10 million get more clicks on average than those that have raised less than $10 million?
  77. Methodology 1) organic index: select companies in the US which received organic clicks
  78. Methodology 1) organic index: select companies in the US which received organic clicks 2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin
  79. Methodology 1) organic index: select companies in the US which received organic clicks 2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin 3) Join, segment, and do the math!
  80. Tom Bergman Product Manager
  81. I still help people get jobs.
  82. Large Scale Interactive Analytics Platform ● 123 Unique Indexes ● Largest Index 30TB ● Total size ~125TB
  83. Large Scale Interactive Analytics Platform IQL -> Largely Programmatic access ● approx 76k queries/day ● Avg time to execute 0.67 seconds Ramses -> Largely Human ● approx 3,400 queries/day ● Avg time to execute 4.4 seconds
  84. Large Scale Interactive Analytics Platform Users ● 198 unique users in past month ● 25,622 unique queries in past month ● Avg 53 queries/user per day
  85. Large Scale Interactive Analytics Platform 40+ internal clients ● 6 Analytics Webapps ● 5 dashboards ● 10 programming/scripting shells ● 6 monitoring apps ● … and more
  86. Large Scale Interactive Analytics Platform One Tool-set for all data ● Website usage ● Operational Monitoring ● Financial Reporting ● Google Analytics ● Internal Webapp Usage ● External Reports
  87. Solving a real problem
  88. Providing the Best Results Show the jobs that are most interesting to our users
  89. Providing the Best Results Clicks are a very good indicator of interest
  90. Providing the Best Results Clicks are a very good indicator of interest More clicks -> More Relevant Fewer clicks -> Less Relevant
  91. Architecture Very hard query to serve correctly
  92. Architecture Very hard query to serve correctly Architecture terminology has been co-opted by technology
  93. Terminology Common to both Software and Architecture Blueprint Design Framework Infrastructure Engineer Project manager Development Technical architect Software Modeling Computation Code reviews
  94. Architecture vs Software Titles Architect CAD Designer Project Manager vs Software Architect UI Designer Project Manager
  95. Query Management Indeed uses Imhotep to improve matching
  96. Query Management Indeed uses Imhotep to improve matching Automatically detect results that should be added or removed from queries
  97. Query Management Indeed uses Imhotep to improve matching Automatically detect results that should be added or removed from queries 26,790 rules across all countries
  98. Imhotep Open Source Imhotep Open Source ETA: August 1, 2014
  99. Imhotep Open Source Follow along at our blog engineering.indeed.com Sign up for mailing list to get latest updates go.indeed.com/imhotep-announce
  100. Q & A
  101. Next @IndeedEng Talk Launching Indeed Around the World Davide Novelli, International Director David Tulig, Tech Lead May 28, 2014 http://engineering.indeed.com/talks
  102. More Questions? Jason David James Jeff
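
Slides 69 to 75 break an IQL query into required parts (index, date range, metrics) and optional parts (filters, group by). As a rough illustration of how those parts fit together, here is a minimal Python sketch that assembles the example query from the slides. The `IqlQuery` class and its `render` method are hypothetical helpers invented for this note, not part of Imhotep, and the clause order simply copies the slide text; it has not been checked against the IQL syntax that was eventually released. The slides use typographic quotes around the dates; the sketch uses plain single quotes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IqlQuery:
    """Parts of an IQL query, named after the slide labels (hypothetical helper)."""

    index: str          # required: which index to query, e.g. "organic"
    start: str          # required: start of the date range
    end: str            # required: end of the date range
    metrics: List[str]  # required: at least one metric, e.g. "count()"
    filters: List[str] = field(default_factory=list)   # optional
    group_by: List[str] = field(default_factory=list)  # optional

    def render(self) -> str:
        """Join the parts in the order shown on the slides."""
        if not self.metrics:
            raise ValueError("at least one metric is required")
        parts = [
            "select " + ", ".join(self.metrics),
            f"from {self.index} '{self.start}' '{self.end}'",
        ]
        if self.filters:
            parts.append("where " + " and ".join(self.filters))
        if self.group_by:
            parts.append("group by " + ", ".join(self.group_by))
        return " ".join(parts)


# The example from slides 71 to 75: clicked organic impressions in Ireland,
# grouped by company, over a five-day window.
query = IqlQuery(
    index="organic",
    start="2013-12-05",
    end="2013-12-10",
    metrics=["count()"],
    filters=["country=ie", "clicked=1"],
    group_by=["companyid"],
)
print(query.render())
```

Keeping the optional parts as empty lists by default mirrors the split between slides 69 and 70: a query with only an index, a date range, and a metric still renders, and the `where` and `group by` clauses appear only when they are supplied.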
