Search at Tumblr
Yufei Pan
Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
○ 50M searches a day

● Limited search for a long time (2007-2012)
○ Tagged page
■ MySQL lookup of a single tag id
■ sorted by reverse chronological order
○ Finding blogs
■ navigate through curated directories
About search@tumblr
● Search Team
○ July 2012: Jak joined as the first search engineer!
○ Team: Jak, Yufei, Bennett, Beitao, Patrick, Adam

● Features launched in 2013
○ Post search, Blog search, Theme search
○ Typeaheads, Recommendation, Trends
Whole New Search
Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
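
The slide only says that related search is based on term co-occurrence. A minimal sketch of that idea, assuming the terms are post tags and that "related" means "most often seen on the same post" (the function names and the tie-breaking rule are hypothetical, not taken from the talk):

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(posts):
    """Count how often each unordered pair of tags appears on the same post."""
    pair_counts = Counter()
    for tags in posts:
        # Normalize and de-duplicate tags within a single post.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def related_terms(term, pair_counts, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Highest count first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


posts = [["cat", "gif"], ["cat", "gif", "funny"], ["cat", "funny"], ["dog", "gif"]]
counts = cooccurrence_counts(posts)
print(related_terms("gif", counts))
```

In a production setting the raw counts would probably be normalized (for example by term frequency) so that very common tags do not dominate every result; the talk does not say which normalization, if any, was used.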
Typeahead Autocompletes
Search Autocomplete

Mention Autocomplete

● Interactive guide of tumblr content
● High volume of traffic
● Low latency

Tag Suggest
Recommendations
Personalized Recommendation

Weekly Dashboard Digest
Trends
Trending Tags

Trending Blogs
Theme Search
Search Architecture
[Architecture diagram: labels grouped by layer]

● Online (Search Online Framework)
○ Post Search, Blog Search, Typeahead, Related Tags, Blog Recommend
○ Blog Highlights, Blog Top Tags, Trending Tags, Trending Blogs, Trending Posts

● Data
○ Recent Post Index, Blog Full Index, Theme Index, Blog Top-K Index
○ Follower Counts, Post Notecount, Post Model
○ Personalized, Blog Index, Trending Blogs, Trending Posts, Trending Tags
○ Related Tag Index, Blog Global Rank, Blog Model, User Model, Typeahead Indices
○ Top Post Index, Blog Top Posts, Blog Top Tags, Two Degree, Like Root
○ Blog Feedback, In-Blog Tag Index, Global Tag Index

● Offline (Search Offline Framework)

● Platform and source labels
○ Rediscover, Solr, MySQL
○ Activity Streams (Fire Geyser)
○ Scribe logs, Sqoop tables (HDFS)
○ Nginx, Linux
Software Stack
● Search Online
○ HAProxy, Nginx, PHP
○ Memcache
○ Icinga, Scribe, OpenTSDB

● Search Data
○ Solr, Redis, MySQL

● Search Offline
○ Sqoop, Hadoop
○ Java, Hive, Pig, Scalding, Python
Search Online Framework
● Search Services

● SearchBase
○ Search Flow Execution
○ Multi-level Caching
○ Search Logging
○ Async Execution
○ Search Editorial

● Interfaces and implementations
○ QueryIF
■ SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
○ RetrieverIF
■ SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
○ SignalFetcherIF
■ NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
○ RankerIF
■ TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
○ DocFetcherIF
■ PostFetcher, TumblelogFetcher, TagFetcher
○ FilterIF
■ PostFilter, TumblelogFilter, TagFilter
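
The interface names on this slide (QueryIF, RetrieverIF, SignalFetcherIF, RankerIF, DocFetcherIF, FilterIF) suggest a staged search flow with pluggable implementations per stage. The sketch below is one way such a flow could be wired together. The stage order, the callable signatures, and the in-memory stand-ins for Solr, Redis, and MySQL are all assumptions for illustration; the production framework was PHP, and the slide does not show its method signatures.

```python
from typing import Callable, Dict, List

# Hypothetical stage signatures; the slide names the interfaces only.
Retriever = Callable[[str], List[int]]                     # query -> candidate ids
SignalFetcher = Callable[[List[int]], Dict[int, float]]    # ids -> ranking signal
Ranker = Callable[[List[int], Dict[int, float]], List[int]]
DocFetcher = Callable[[List[int]], List[dict]]             # ids -> full documents
DocFilter = Callable[[List[dict]], List[dict]]


def run_search(query: str, retrieve: Retriever, fetch_signals: SignalFetcher,
               rank: Ranker, fetch_docs: DocFetcher, keep: DocFilter) -> List[dict]:
    """Run one query through retrieve -> signals -> rank -> docs -> filter."""
    ids = retrieve(query)
    signals = fetch_signals(ids)
    ranked = rank(ids, signals)
    docs = fetch_docs(ranked)
    return keep(docs)


# Toy in-memory stand-ins for the index, the signal store, and the post store.
INDEX = {"cats": [1, 2, 3]}
NOTECOUNTS = {1: 5, 2: 40, 3: 12}
POSTS = {
    1: {"id": 1, "hidden": False},
    2: {"id": 2, "hidden": True},
    3: {"id": 3, "hidden": False},
}

results = run_search(
    "cats",
    retrieve=lambda q: INDEX.get(q, []),
    fetch_signals=lambda ids: {i: NOTECOUNTS.get(i, 0) for i in ids},
    rank=lambda ids, sig: sorted(ids, key=lambda i: sig[i], reverse=True),
    fetch_docs=lambda ids: [POSTS[i] for i in ids],
    keep=lambda docs: [d for d in docs if not d["hidden"]],
)
print([d["id"] for d in results])
```

Keeping each stage behind its own interface is what would let a post search and a blog search share the same flow while swapping only the retriever, ranker, or filter.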
Search Batch Processing
● Search Data (Redis)

● Search Workflow Engine
○ Workflow Composition
○ Dependency Resolution
○ Automatic Versioning
○ Data Verification
○ Execution Logging
○ Failure Detection/Alert

● Search Task Base
○ Hive Jobs, Pig Jobs, Streaming Jobs, Scalding Jobs
○ Term Generators, Top-K Indexer, Delta Propagator
○ Lucille2 Classes

● Scribe Logs, Sqoop Tables (HDFS)
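
Dependency resolution is listed as a workflow engine feature, with no detail on how it was done. A common approach is to order tasks topologically so that each job runs only after its inputs exist. The sketch below uses Python's standard `graphlib` for that; the task names are hypothetical examples loosely based on the labels on this slide, not the real job graph.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return a run order in which every task comes after its dependencies.

    `dependencies` maps a task name to the set of tasks it depends on.
    Raises graphlib.CycleError if the workflow contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: import tables, generate terms, build the index,
# then propagate deltas to the serving store.
workflow = {
    "sqoop_import": set(),
    "term_generator": {"sqoop_import"},
    "top_k_indexer": {"term_generator"},
    "delta_propagator": {"top_k_indexer"},
    "data_verification": {"top_k_indexer"},
}

order = execution_order(workflow)
print(order)
```

Several valid orders can exist when tasks are independent of each other (here `delta_propagator` and `data_verification`), which is also where a real engine could run jobs in parallel.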
Indexing
● 3-Tier indices
○ Index all posts
■ 600+ machines
○ Recent (6W) + Popular (4Y) + Existing tag table
■ Down to 40 machines
■ Minor loss in coverage
■ Serve up to 4K qps (non-cached)

● Lean index
○ Separate signals from index
■ Eliminate high volume re-indexing
■ Signal engineering independent of indexing
○ Separate document text from index
■ Reduces the memory footprint
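
To illustrate why separating signals from the index eliminates high-volume re-indexing, here is a minimal sketch. It assumes the index holds only term-to-post-id postings while fast-changing signals such as note counts live in a separate key-value store (Redis in the talk's stack; a plain dict here). The class and field names are hypothetical.

```python
class LeanIndex:
    """Toy lean index: postings and ranking signals are stored separately."""

    def __init__(self):
        self.postings = {}     # term -> set of post ids (the "index")
        self.signals = {}      # post id -> note count (separate signal store)
        self.index_writes = 0  # counts how often the index itself is touched

    def index_post(self, post_id, terms):
        """Add a post to the index; needed once per post, not per like."""
        for term in terms:
            self.postings.setdefault(term, set()).add(post_id)
        self.index_writes += 1

    def update_signal(self, post_id, notecount):
        """Update a ranking signal without writing to the index."""
        self.signals[post_id] = notecount

    def search(self, term):
        """Retrieve ids from the index, then rank by signals fetched at query time."""
        ids = self.postings.get(term, set())
        return sorted(ids, key=lambda i: (-self.signals.get(i, 0), i))


idx = LeanIndex()
idx.index_post(1, ["cat"])
idx.index_post(2, ["cat"])
idx.update_signal(1, 5)
idx.update_signal(2, 50)
print(idx.search("cat"))
idx.update_signal(1, 500)  # a burst of likes changes ranking, not the index
print(idx.search("cat"))
```

The trade-off is an extra lookup at query time, which is why a low-latency signal store matters in this design.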
Ranking
● Quickly evolving!
● Major ranking signals in production
○ Global popularity
■ likes, reblogs, follows
○ Local popularity
■ popularity projected on <user, query>
■ blog search: aggregated likes on query term
■ blog recommendation: follow counts among friends

○ Textual relevancy
■ how: exact match, query proximity
■ where: name, title, tag, mention, body, etc
○ Recency
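
The slide lists the signal families but not how they are combined. One simple possibility is a weighted sum, sketched below. The weights, the log damping, the recency decay, and every field name are assumptions made for illustration; the talk does not describe the actual formula.

```python
import math

# Hypothetical weights; the talk does not give any.
WEIGHTS = {
    "global_popularity": 1.0,
    "local_popularity": 2.0,
    "text_relevancy": 3.0,
    "recency": 0.5,
}

SECONDS_PER_DAY = 86400.0


def score(doc, now, weights=WEIGHTS):
    """Combine the four signal families from the slide into a single score."""
    # Global popularity: likes, reblogs, follows (log-damped).
    global_pop = math.log1p(doc["likes"] + doc["reblogs"] + doc["follows"])
    # Local popularity: likes aggregated on the query term.
    local_pop = math.log1p(doc["query_term_likes"])
    # Textual relevancy: reduced here to an exact-match flag.
    text = 1.0 if doc["exact_match"] else 0.0
    # Recency: decays with age in days.
    age_days = max(0.0, (now - doc["created"]) / SECONDS_PER_DAY)
    recency = 1.0 / (1.0 + age_days)
    return (weights["global_popularity"] * global_pop
            + weights["local_popularity"] * local_pop
            + weights["text_relevancy"] * text
            + weights["recency"] * recency)


now = 1_000_000.0
base = {"likes": 10, "reblogs": 5, "follows": 2, "query_term_likes": 3,
        "exact_match": False, "created": now - SECONDS_PER_DAY}
print(score(base, now) < score({**base, "exact_match": True}, now))
```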
Duplicate Elimination (DE)
● Index-time DE
○ post signature
■ number of tags > N1
■ md5 hash of normalized tag list

● Search-time DE
○ Media DE
■ posts with same media hashes.
○ Near DE
■ posts with tags > N2
■ mark as near duplicate if diff <= N3 tags
■ older posts selected as original
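
A minimal sketch of the tag-based parts of duplicate elimination described above. The thresholds N1, N2, and N3 are not given on the slide, so the values below are placeholders. The normalization (trim, lowercase, de-duplicate, sort) and the reading of "diff" as the size of the symmetric difference between tag sets are also assumptions.

```python
import hashlib

# Placeholder thresholds; the slide names N1, N2, N3 without values.
N1 = 3
N2 = 3
N3 = 1


def normalize_tags(tags):
    """Trim, lowercase, de-duplicate, and sort a tag list."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags, n1=N1):
    """Index-time DE: md5 of the normalized tag list, only if it has > n1 tags."""
    norm = normalize_tags(tags)
    if len(norm) <= n1:
        return None
    return hashlib.md5(",".join(norm).encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b, n2=N2, n3=N3):
    """Search-time near DE: both posts have > n2 tags and differ by <= n3 tags."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= n2 or len(b) <= n2:
        return False
    return len(a ^ b) <= n3


def pick_original(post_a, post_b):
    """The older post is treated as the original."""
    return post_a if post_a["created"] <= post_b["created"] else post_b


same = post_signature(["Cat", "gif", "funny", "lol"]) == post_signature(["lol", "funny", "GIF", " cat "])
print(same)
```

Requiring a minimum number of tags before computing a signature avoids collapsing unrelated posts that merely share one or two popular tags.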
Search Platform
● A curvy road
○ Started with ElasticSearch
○ Switched to SolrCloud due to reliability
○ Ended up with Solr + Customized Clustering

● Our takes
○ ElasticSearch and SolrCloud have great functionality
■ distributed indexing and search
■ easy cluster management
○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
○ Minimize the search online latency
○ More sophisticated/expensive computation

● Limitations
○ Loss of freshness
○ Expensive for long-tail queries and results

● Precomputed
○ Typeaheads
○ Related search
○ Blog recommendation
○ Top posts of Blog / User
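
As an illustration of offline precomputation for typeaheads, the sketch below builds a prefix-to-suggestions table in a batch step so that the online path is a single key lookup. The table is a plain dict standing in for a key-value store such as Redis; the function names, the popularity counts, and the prefix length cap are hypothetical.

```python
from collections import defaultdict


def build_typeahead(tag_counts, k=3, max_prefix_len=10):
    """Offline step: map every prefix to its top-k tags by popularity."""
    by_prefix = defaultdict(list)
    for tag, count in tag_counts.items():
        for n in range(1, min(len(tag), max_prefix_len) + 1):
            by_prefix[tag[:n]].append((count, tag))
    table = {}
    for prefix, candidates in by_prefix.items():
        # Most popular first; alphabetical order breaks ties.
        ranked = sorted(candidates, key=lambda ct: (-ct[0], ct[1]))
        table[prefix] = [tag for _, tag in ranked[:k]]
    return table


def suggest(table, prefix):
    """Online step: one lookup, no ranking work at query time."""
    return table.get(prefix.lower(), [])


tag_counts = {"cat": 50, "cats": 30, "car": 40, "dog": 10}
table = build_typeahead(tag_counts, k=2)
print(suggest(table, "ca"))
```

This shows both sides of the trade-off listed above: lookups are cheap, but a newly popular tag does not appear until the next batch run, and storing every prefix of rarely used tags is costly.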
What’s Next
● Inblog search
○ full text search on all posts in a blog
○ original posts, reblogs, likes

● Ranking
○ more effective and spam-resilient signals
○ learning to rank

● Topical interest modeling
○ supervised and unsupervised
○ blog content and user activities
○ interest based blog recommendation

● Content discovery
○ trending content in various categories
Q&A
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)
