2. Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007
Publishing Platform
● 163 million blogs
● 72 billion posts
Social Network
● Follow, Mention
● Like, Reblog
3. About search@tumblr
● Most important way to discover great content
○ 50M searches a day
● Limited search for a long time (2007-2012)
○ Tagged page
■ MySQL lookup of a single tag ID
■ sorted in reverse chronological order
○ Finding blogs
■ navigate through curated directories
4. About search@tumblr
● Search Team
○ July 2012: Jak joined as the first search engineer!
○ Pictured: Jak, Yufei, Bennett, Beitao, Patrick, Adam
● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
5. Whole New Search
Post search
● full text search
● top and recent
● post type filtering
Blog search
● name & title
● top tags in posts
● blog highlights
Related search
● term co-occurrence
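A minimal sketch of how related search could be derived from term co-occurrence, assuming tags on the same post count as co-occurring terms. The function names and sample data are hypothetical, not Tumblr's implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts_tags):
    """Count how often each pair of tags appears on the same post."""
    cooc = defaultdict(Counter)
    for tags in posts_tags:
        unique = sorted({t.lower() for t in tags})
        for a, b in combinations(unique, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def related_terms(cooc, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    counts = cooc.get(term.lower(), Counter())
    # Sort by descending count, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical sample data: one tag list per post.
posts = [
    ["cats", "gif", "funny"],
    ["cats", "funny"],
    ["cats", "photography"],
    ["dogs", "funny"],
]
cooc = build_cooccurrence(posts)
print(related_terms(cooc, "cats"))
```

A production version would likely normalize the counts (for example by term frequency) so that very common tags do not dominate every result.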
10. Search Architecture
(Layered diagram; boxes listed as recovered from the slide)
● Online
○ Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend
○ Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts
○ Search Online Framework
● Data
○ Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index
○ Follower Counts, Post Notecount, Post Model, Personalized Blog Index
○ Trending Blogs, Trending Posts, Trending Tags, Related Tag Index
○ Blog Global Rank, Blog Model, User Model, Typeahead Indices
○ Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree Like Root
○ Blog Feedback, In-Blog Tag Index, Global Tag Index
● Offline
○ Search Offline Framework
● Rediscover, Solr, MySQL, Activity Streams (Fire Geyser), Scribe logs and Sqoop tables (HDFS), Nginx, Linux
13. Search Batch Processing
(Layered diagram; boxes listed as recovered from the slide)
● Search Data (Redis)
● Search Workflow Engine
○ Workflow Composition, Dependency Resolution, Automatic Versioning
○ Data Verification, Execution Logging, Failure Detection/Alert
● Search Task Base
○ Hive Jobs, Pig Jobs, Streaming Jobs, Scalding Jobs
○ Term Generators, Top-K Indexer, Delta Propagator, Lucille2 Classes
● Scribe Logs, Sqoop Tables (HDFS)
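To illustrate the dependency resolution, execution logging, and failure detection responsibilities named on this slide, here is a rough sketch using Python's standard `graphlib`. The job names and the `run_workflow` helper are hypothetical; the actual workflow engine is not described in the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs mapped to the jobs each one depends on.
WORKFLOW = {
    "extract_terms": set(),
    "count_cooccurrence": {"extract_terms"},
    "build_topk_index": {"count_cooccurrence"},
    "verify_data": {"build_topk_index"},
    "publish_to_redis": {"verify_data"},
}


def execution_order(deps):
    """Resolve dependencies: every job appears after all jobs it depends on."""
    return list(TopologicalSorter(deps).static_order())


def run_workflow(deps, tasks):
    """Run jobs in dependency order, log each outcome, stop on first failure."""
    log = []
    for name in execution_order(deps):
        try:
            tasks[name]()
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break
        log.append((name, "ok"))
    return log


def failing_job():
    # Stand-in for a data verification step that detects a bad batch.
    raise RuntimeError("row count mismatch")


tasks = {name: (lambda: None) for name in WORKFLOW}
tasks["verify_data"] = failing_job
log = run_workflow(WORKFLOW, tasks)
for entry in log:
    print(entry)
```

Because the run stops at the failed verification step, `publish_to_redis` never executes, so unverified data would not reach the serving store in this sketch. `static_order()` raises `graphlib.CycleError` if the workflow contains a dependency cycle.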
14. Indexing
● 3-Tier indices
○ Index all posts
■ 600+ machines
○ Recent (6 weeks) + Popular (4 years) + existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serves up to 4K QPS (non-cached)
● Lean index
○ Separate signals from index
■ Eliminate high volume re-indexing
■ Signal engineering independent of indexing
○ Separate document text from index
■ Reduces the memory footprint
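The lean index idea can be sketched as follows: the index stores only term-to-post-id postings, while fast-changing signals live in a separate store that can be updated without re-indexing. The class names `LeanIndex` and `SignalStore` are hypothetical, and plain dictionaries stand in for Solr and the signal store.

```python
from collections import defaultdict


class LeanIndex:
    """Inverted index holding only term -> post ids (no text, no signals)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, post_id, text):
        for term in text.lower().split():
            self.postings[term].add(post_id)

    def lookup(self, term):
        return self.postings.get(term.lower(), set())


class SignalStore:
    """Separate store for fast-changing signals such as note counts."""

    def __init__(self):
        self.notes = {}

    def increment(self, post_id, delta=1):
        self.notes[post_id] = self.notes.get(post_id, 0) + delta


def search(index, signals, term):
    """Match against the index, then order hits using the signal store."""
    hits = index.lookup(term)
    return sorted(hits, key=lambda pid: (-signals.notes.get(pid, 0), pid))


index = LeanIndex()
index.add(1, "cute cats")
index.add(2, "grumpy cats")

signals = SignalStore()
signals.increment(2, 5)
before = search(index, signals, "cats")

# A burst of likes changes the ranking without touching the index.
signals.increment(1, 10)
after = search(index, signals, "cats")
print(before, after)
```

The point of the split is that a popular post gaining notes only touches `SignalStore`, so the index itself is written once per post.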
15. Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
● blog search: aggregated likes on query term
● blog recommendation: follow counts among friends
○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
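One simple way to combine the four signal families above is a weighted sum, sketched below. The slides do not say how the signals are combined, so the weights, the log damping, the recency half-life, and the `Candidate` fields are all assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: int
    notes: int         # global popularity: likes + reblogs
    local_likes: int   # local popularity for this <user, query>
    text_match: float  # textual relevancy in [0, 1]
    age_days: float    # used for recency


# Hypothetical weights; real values would be tuned or learned.
WEIGHTS = {"global": 1.0, "local": 1.0, "text": 3.0, "recency": 2.0}


def score(c, half_life_days=7.0):
    """Weighted sum of damped popularity, text relevancy, and recency decay."""
    global_pop = math.log1p(c.notes)
    local_pop = math.log1p(c.local_likes)
    recency = 0.5 ** (c.age_days / half_life_days)
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * c.text_match
        + WEIGHTS["recency"] * recency
    )


def rank(candidates):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


old_viral = Candidate(1, notes=1000, local_likes=0, text_match=0.2, age_days=30)
fresh_exact = Candidate(2, notes=10, local_likes=5, text_match=1.0, age_days=1)
ranked = rank([old_viral, fresh_exact])
print([c.post_id for c in ranked])
```

With these assumed weights, a fresh exact match with some local popularity outranks an older, weakly matching post with far more notes.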
16. Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list
● Search-time DE
○ Media DE
■ posts with the same media hashes
○ Near DE
■ posts with number of tags > N2
■ mark as near duplicate if the tag lists differ by <= N3 tags
■ older posts selected as original
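The tag-based rules above can be sketched as below. The threshold values for N1, N2, and N3 are placeholders, the normalization (trim, lowercase, sort, dedupe) is an assumption, and "diff" is interpreted here as the size of the symmetric difference between the two tag sets.

```python
import hashlib

# Placeholder thresholds; the slides do not give actual values.
N1 = 3  # index-time: only sign posts with more than N1 tags
N2 = 3  # search-time: only compare posts with more than N2 tags
N3 = 1  # search-time: near duplicate if tag sets differ by <= N3 tags


def normalize_tags(tags):
    """Assumed normalization: trim, lowercase, dedupe, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags):
    """Index-time DE: md5 of the normalized tag list, or None if too few tags."""
    norm = normalize_tags(tags)
    if len(norm) <= N1:
        return None
    return hashlib.md5("\x1f".join(norm).encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b):
    """Search-time near DE: both posts tag-rich and tag sets nearly equal."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= N2 or len(b) <= N2:
        return False
    return len(a ^ b) <= N3


def dedupe(posts):
    """Keep the oldest post of each near-duplicate group.

    `posts` is a list of (post_id, created_at, tags) tuples.
    """
    kept = []
    for post in sorted(posts, key=lambda p: p[1]):
        if not any(is_near_duplicate(post[2], k[2]) for k in kept):
            kept.append(post)
    return [p[0] for p in kept]


posts = [
    (1, 100, ["a", "b", "c", "d"]),
    (2, 50, ["a", "b", "c", "d", "e"]),
    (3, 70, ["x", "y", "z", "w"]),
]
print(dedupe(posts))
```

Posts 1 and 2 differ by a single tag, so only the older one (post 2) survives, matching the rule that older posts are selected as the original.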
17. Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud for reliability reasons
○ Ended up with Solr + Customized Clustering
● Our take
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic
18. Offline Precomputation
● Benefits
○ Minimize the search online latency
○ More sophisticated/expensive computation
● Limitations
○ Loss of freshness
○ Expensive for long-tail queries and results
● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
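As an illustration of the trade-off above, typeahead suggestions can be precomputed offline into a prefix table so that the online path is a single lookup. This is a sketch only: the query counts are made up, and `build_typeahead` and `suggest` are hypothetical names.

```python
from collections import defaultdict


def build_typeahead(query_counts, k=3, max_prefix_len=10):
    """Offline step: map each prefix to its top-k queries by count."""
    buckets = defaultdict(list)
    for query, count in query_counts.items():
        q = query.lower()
        for i in range(1, min(len(q), max_prefix_len) + 1):
            buckets[q[:i]].append((count, q))
    table = {}
    for prefix, items in buckets.items():
        # Highest count first; ties broken alphabetically.
        ranked = sorted(items, key=lambda it: (-it[0], it[1]))
        table[prefix] = [q for _, q in ranked[:k]]
    return table


def suggest(table, prefix):
    """Online step: one dictionary lookup, no computation at query time."""
    return table.get(prefix.lower(), [])


# Hypothetical query log counts.
query_counts = {"cats": 50, "cat gifs": 30, "cars": 20, "castle": 5}
table = build_typeahead(query_counts)
print(suggest(table, "ca"))
print(suggest(table, "cat"))
```

The table is only as fresh as the last offline run, and it grows with the number of distinct prefixes, which reflects the freshness and long-tail costs listed in the limitations.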
19. What’s Next
● In-blog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes
● Ranking
○ more effective and spam-resilient signals
○ learning to rank
● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation
● Content discovery
○ trending content in various categories
20. Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs
More questions please, :-)