SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
O C T O B E R 1 3 - 1 6 , 2 0 1 6 • A U S T I N , T X
Building the News Search Engine
Ramkumar Aiyengar
Team Leader, R&D News Search, Bloomberg L.P.
andyetitmoves@apache.org
Copyright © 2015 Bloomberg L.P.
•  A technology company
•  Our strength and focus is data
•  “The Terminal”, vertical portals
•  Customers: Primarily finance
•  Also government, lawyers etc.
3	
  
•  Started 8 years back
•  Now team lead for R&D News
Search
•  Search, alerting, ingest
infrastructure
•  Started with Solr/Lucene 3 years ago
•  Now a committer with the project
The News Search Ecosystem
•  Suggest	
  queries	
  as	
  the	
  user	
  is	
  typing	
  
•  Understand	
  a	
  query	
  to	
  figure	
  out	
  what’s	
  being	
  requested	
  
•  NLP,	
  en)ty	
  recogni)on/disambigua)on,	
  spellcheck	
  
•  Search	
  for	
  keywords	
  and	
  metadata	
  of	
  documents	
  
•  Sort	
  the	
  documents	
  as	
  the	
  usage	
  demands	
  
•  If	
  sor)ng	
  by	
  relevance,	
  what’s	
  actually	
  relevant	
  for	
  the	
  user?	
  
•  Should	
  some	
  results	
  be	
  promoted	
  ahead	
  of	
  others?	
  
•  Alert	
  users	
  when	
  new	
  stories	
  match	
  the	
  ac=ve	
  search	
  
•  Expose	
  facets	
  for	
  refining	
  and	
  discovery	
  
•  Recommend	
  searches	
  and	
  search	
  results	
  
4	
  
5	
  
What we are in the middle of…
•  Existing system based on a proprietary system
•  Proprietary product, past its end of life
•  Inflexible, no scalable relevance sorting
•  Enter Solr/Lucene!
•  Rich in features, extensible and actively maintained
•  Free software, we are involved and contribute back!
•  From-scratch alerting backend based on Lucene
•  Architectural revamp of the News Search backend
•  Scalable with load and data: Just add machines!
•  Maintainable: Easy to add metadata, re-index all of the data
6	
  
What goes in…
•  Document
•  News stories, research documents, tweets…
•  Story body, headline, time of arrival, source…
•  Tags (companies, topics, people etc.) associated with the story
•  Query
KEYWORDS:(“Barack Obama” N/5 “War*” IN STORYTOP/75) AND (REGION:IRAQ OR REGION:AFGHAN)
AND NOT TOPIC:ODD AND (WIRE:BLOOMBERG OR WIRE:TWT)
•  Multiple fields (keywords, topics, regions, sources)
•  Boolean (and/or/not), proximity (n/5), zoning (storytop/75) operators
•  Phrase search (“Barack Obama”), wildcard searches (“War*”)
•  Range queries (Time between A and B)
•  Overall search filters (Relevance, Language)
	
  
7	
  
The madness we deal with…
•  Arbitrarily complex Boolean queries
•  for both search and alerting
•  users create queries as large as 20K characters (or more!)
•  Lists of metadata can be specified in short hand
•  A “ticker list” could have 1000s of companies interesting to the user
•  Stories from 125K+ sources
•  users privileged for a subset of these sources
•  can be turned on/off per user
•  ACLs can have a few, many or all of the users
•  Searches and stories in 40 languages
•  any user can have a subset of these selected
8	
  
The News Search cloud
•  Lots of Linux machines on a SolrCloud housing:
•  Hundreds of shards, and thousands of Solr cores
•  Multiple tiers: ‘recent’ collection to optimize chronological results
•  Cross data-centre redundancy
•  Stories available for search in 125ms, minimal caching
•  Custom components for:
•  Parsing: XML query parser, with additions
•  Indexing: For handling tags, more on that later…
•  Searching: Custom Lucene queries for some cases
•  Post Filtering: For privileging of news stories
9	
  
Parsing search queries
•  We need to..
•  Understand In-house search syntax
•  Validate/Privilege tags based on DB
•  Present part of search query in UI
•  Understand query to suggest sources
•  Parsed outside Solr to XML queries (SOLR-839)
•  Future: Compact transfer: JSON, Binary? (SOLR-4351)
•  Will state* NP/10 ((“tax*” N/5 “incentive”) OR “sales and use”)
work?
•  Making spans interoperate with any query
•  Initially at flaxsearch/lucene-solr-intervals, now possible with 5.3.
10	
  
Searching for news tags
•  Each story has multiple tags associated
•  Topics, companies, regions, people…
•  Each tag has a ‘relevance’ provided by a classifier
•  Up to a few hundred tags per story, millions overall
•  Tag relevance to be considered for scoring and filtering
•  How do you normalize relevance with keywords?
•  One solution: repurpose TF for tags
•  Use TF/IDF for tags like with keywords
•  Modify searches to be filtered by ranges of TF values
11	
  
Optimizing searches
•  Running “ticker list” searches fast is hard
•  Boolean OR of thousands of terms with ”relevance” filters
•  Naïve: BooleanQuery(Filter(TermQuery, FRange)…)
•  Better: BooleanQuery(TermFreqQuery…)
•  Even more: TermsFreqQuery
•  Optimizing searches for sorting by time (SOLR-5730)
•  Pluggable merge policy factory in Solr
•  Solr support (use the schema) for EarlyTerminatingSortingCollector
•  How aggressive is your merge policy?
•  aka how much can you squeeze out of your SSDs?
12	
  
Optimizing searches
•  You really need that ShardHandlerFactory? (and other tales
of GC)
•  Even small inefficiencies multiply at scale (e.g. SOLR-6603)
•  How much can be moved off the Java heap?
•  Routing smartly to reduce the probability of GC (SOLR-6730)
•  Looking out for what the kernel is doing
•  “swappiness”, I/O scheduler fit for SSDs, huge pages
•  Watch where the time’s spent (may not be where you
expect…)
•  No point with fast searches if max connections is too low
•  There may be that odd hardcoded number (e.g. SOLR-6605)
•  Even Jetty could have bugs which cause a request to stall and timeout
13	
  
There will be storms…
•  Thousands of cores in a cloud is a lot of fun J
•  Started with Lucene/Solr 4.3.1, lots of stability concerns
•  A lot better now obviously, but the odd issue remains…
•  If there’s a race condition, we will hit it!
•  Is it safe to bring down multiple replicas of a shard?
•  What happens when you shutdown in the middle of a merge?
•  If you have to screw up, be controlled about it!
•  Zombie checks should be light (SOLR-5718)
•  Will the cloud always heal after a network partition?
•  As with real life, sometimes democracy can be annoying…
14	
  
Scaling Solr Cloud
•  Distributed coordination (good ol’ Overseer!)
•  Hundreds of cores restarted at a time during weekends
•  Scaling cluster state (SOLR-5381, SOLR-5872)
•  Leadership mechanisms have to scale
•  Transitions have to happen quickly (SOLR-6261)
•  Leaders shouldn’t gang up on some machines (SOLR-6491)
•  Running queries across hundreds of shards
•  You could end up with situations involving 60K threads!
•  Time to revisit the I/O model across shards in Solr?
15	
  
Alerting: Prospective search
•  Searching turned upside down
•  Find which of millions of searches match one document
•  Alert users who are interested in these searches
•  Use tailored searches to tag documents with topics
•  No out of the box support for Solr
•  Baleene: A standalone application for prospective search
•  Based on and improving Luwak and a fork of Lucene
•  But fork no longer needed with Lucene 5.3!
•  Indexes queries, and “pre-searches” by making queries of documents
•  “Turning Search Upside Down” – Alan’s presentation at Dublin!
•  Planning to release application as open source
16	
  
The road ahead…
•  A lot more possible with relevance
•  Better scoring, user models, connecting data
•  Learning to Rank presentation yesterday from Michael and Diego!
•  Alerting for relevance sorted views
•  Alert when the top N stories for my search change
•  Update relevance sorted views in real-time
•  Making queries more expressive
•  Language detection, translation, synonyms…
•  Entity search within text (‘richter:5.5-12 NEAR someplace’)
•  Readership sorted views for any search
•  Tens of millions of story hits per day, how best to index? (SOLR-5944?)
17	
  
Committers: 2, Patches: 100+, Challenges: Countless!
18	
  
19
Q&

Contenu connexe

Tendances

Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Lucidworks
 

Tendances (20)

Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
 
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
 
Webinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and GraphWebinar: Solr 6 Deep Dive - SQL and Graph
Webinar: Solr 6 Deep Dive - SQL and Graph
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
 
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...
 
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)Elasticsearch quick Intro (English)
Elasticsearch quick Intro (English)
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Managed Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty ImagesManaged Search: Presented by Jacob Graves, Getty Images
Managed Search: Presented by Jacob Graves, Getty Images
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionWebinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks Fusion
 

En vedette

Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Lucidworks
 
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
Lucidworks
 
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Lucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Lucidworks
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Lucidworks
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Lucidworks
 
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBMUnderstanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Lucidworks
 
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
Lucidworks
 

En vedette (20)

Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Scaling Saved Searches at eBay Kleinanzeigen
Scaling Saved Searches at eBay KleinanzeigenScaling Saved Searches at eBay Kleinanzeigen
Scaling Saved Searches at eBay Kleinanzeigen
 
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
Deep Data at Macy's - Searching Hierarchichal Documents for eCommerce Merchan...
 
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
Understand the Breadth and Depth of Solr via the Admin UI: Presented by Upaya...
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
A near real time search and alert engine powered by SolR Lucene
A near real time search and alert engine powered by SolR LuceneA near real time search and alert engine powered by SolR Lucene
A near real time search and alert engine powered by SolR Lucene
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
 
Multi-language Content Discovery Through Entity Driven Search: Presented by A...
Multi-language Content Discovery Through Entity Driven Search: Presented by A...Multi-language Content Discovery Through Entity Driven Search: Presented by A...
Multi-language Content Discovery Through Entity Driven Search: Presented by A...
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
 
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
 
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBMUnderstanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
 
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
Nice Docs Finish First - Designing Search Ranking for Fairness at Etsy: Prese...
 
Secure Search - Using Apache Sentry to Add Authentication and Authorization S...
Secure Search - Using Apache Sentry to Add Authentication and Authorization S...Secure Search - Using Apache Sentry to Add Authentication and Authorization S...
Secure Search - Using Apache Sentry to Add Authentication and Authorization S...
 
Lucene/Solr Revolution 2015 Opening Keynote with Lucidworks CEO Will Hayes
Lucene/Solr Revolution 2015 Opening Keynote with Lucidworks CEO Will HayesLucene/Solr Revolution 2015 Opening Keynote with Lucidworks CEO Will Hayes
Lucene/Solr Revolution 2015 Opening Keynote with Lucidworks CEO Will Hayes
 
Build a Great Application in Minutes!: Presented by Stefan Olafsson, Twigkit
Build a Great Application in Minutes!: Presented by Stefan Olafsson, TwigkitBuild a Great Application in Minutes!: Presented by Stefan Olafsson, Twigkit
Build a Great Application in Minutes!: Presented by Stefan Olafsson, Twigkit
 

Similaire à Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloomberg LP

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
Jake Mannix
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 

Similaire à Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloomberg LP (20)

Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
 
2013 11-07 lsr-dublin_m_hausenblas_when solr is best
2013 11-07 lsr-dublin_m_hausenblas_when solr is best2013 11-07 lsr-dublin_m_hausenblas_when solr is best
2013 11-07 lsr-dublin_m_hausenblas_when solr is best
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
Ubiquitous Solr - A Database's not-so-evil Twin
Ubiquitous Solr - A Database's not-so-evil TwinUbiquitous Solr - A Database's not-so-evil Twin
Ubiquitous Solr - A Database's not-so-evil Twin
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Solr5
Solr5Solr5
Solr5
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big Data
 

Plus de Lucidworks

Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Lucidworks
 

Plus de Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloomberg LP

  • 1. O C T O B E R 1 3 - 1 6 , 2 0 1 6 • A U S T I N , T X
  • 2. Building the News Search Engine Ramkumar Aiyengar Team Leader, R&D News Search, Bloomberg L.P. andyetitmoves@apache.org Copyright © 2015 Bloomberg L.P.
  • 3. •  A technology company •  Our strength and focus is data •  “The Terminal”, vertical portals •  Customers: Primarily finance •  Also government, lawyers etc. 3   •  Started 8 years back •  Now team lead for R&D News Search •  Search, alerting, ingest infrastructure •  Started with Solr/Lucene 3 years ago •  Now a committer with the project
  • 4. The News Search Ecosystem •  Suggest  queries  as  the  user  is  typing   •  Understand  a  query  to  figure  out  what’s  being  requested   •  NLP,  en)ty  recogni)on/disambigua)on,  spellcheck   •  Search  for  keywords  and  metadata  of  documents   •  Sort  the  documents  as  the  usage  demands   •  If  sor)ng  by  relevance,  what’s  actually  relevant  for  the  user?   •  Should  some  results  be  promoted  ahead  of  others?   •  Alert  users  when  new  stories  match  the  ac=ve  search   •  Expose  facets  for  refining  and  discovery   •  Recommend  searches  and  search  results   4  
  • 6. What we are in the middle of… •  Existing system based on a proprietary system •  Proprietary product, past its end of life •  Inflexible, no scalable relevance sorting •  Enter Solr/Lucene! •  Rich in features, extensible and actively maintained •  Free software, we are involved and contribute back! •  From-scratch alerting backend based on Lucene •  Architectural revamp of the News Search backend •  Scalable with load and data: Just add machines! •  Maintainable: Easy to add metadata, re-index all of the data 6  
  • 7. What goes in… •  Document •  News stories, research documents, tweets… •  Story body, headline, time of arrival, source… •  Tags (companies, topics, people etc.) associated with the story •  Query KEYWORDS:(“Barack Obama” N/5 “War*” IN STORYTOP/75) AND (REGION:IRAQ OR REGION:AFGHAN) AND NOT TOPIC:ODD AND (WIRE:BLOOMBERG OR WIRE:TWT) •  Multiple fields (keywords, topics, regions, sources) •  Boolean (and/or/not), proximity (n/5), zoning (storytop/75) operators •  Phrase search (“Barack Obama”), wildcard searches (“War*”) •  Range queries (Time between A and B) •  Overall search filters (Relevance, Language)   7  
  • 8. The madness we deal with… •  Arbitrarily complex Boolean queries •  for both search and alerting •  users create queries as large as 20K characters (or more!) •  Lists of metadata can be specified in short hand •  A “ticker list” could have 1000s of companies interesting to the user •  Stories from 125K+ sources •  users privileged for a subset of these sources •  can be turned on/off per user •  ACLs can have a few, many or all of the users •  Searches and stories in 40 languages •  any user can have a subset of these selected 8  
  • 9. The News Search cloud •  Lots of Linux machines on a SolrCloud housing: •  Hundreds of shards, and thousands of Solr cores •  Multiple tiers: ‘recent’ collection to optimize chronological results •  Cross data-centre redundancy •  Stories available for search in 125ms, minimal caching •  Custom components for: •  Parsing: XML query parser, with additions •  Indexing: For handling tags, more on that later… •  Searching: Custom Lucene queries for some cases •  Post Filtering: For privileging of news stories 9  
  • 10. Parsing search queries •  We need to.. •  Understand In-house search syntax •  Validate/Privilege tags based on DB •  Present part of search query in UI •  Understand query to suggest sources •  Parsed outside Solr to XML queries (SOLR-839) •  Future: Compact transfer: JSON, Binary? (SOLR-4351) •  Will state* NP/10 ((“tax*” N/5 “incentive”) OR “sales and use”) work? •  Making spans interoperate with any query •  Initially at flaxsearch/lucene-solr-intervals, now possible with 5.3. 10  
  • 11. Searching for news tags •  Each story has multiple tags associated •  Topics, companies, regions, people… •  Each tag has a ‘relevance’ provided by a classifier •  Up to a few hundred tags per story, millions overall •  Tag relevance to be considered for scoring and filtering •  How do you normalize relevance with keywords? •  One solution: repurpose TF for tags •  Use TF/IDF for tags like with keywords •  Modify searches to be filtered by ranges of TF values 11  
  • 12. Optimizing searches •  Running “ticker list” searches fast is hard •  Boolean OR of thousands of terms with ”relevance” filters •  Naïve: BooleanQuery(Filter(TermQuery, FRange)…) •  Better: BooleanQuery(TermFreqQuery…) •  Even more: TermsFreqQuery •  Optimizing searches for sorting by time (SOLR-5730) •  Pluggable merge policy factory in Solr •  Solr support (use the schema) for EarlyTerminatingSortingCollector •  How aggressive is your merge policy? •  aka how much can you squeeze out of your SSDs? 12  
  • 13. Optimizing searches •  You really need that ShardHandlerFactory? (and other tales of GC) •  Even small inefficiencies multiply at scale (e.g. SOLR-6603) •  How much can be moved off the Java heap? •  Routing smartly to reduce the probability of GC (SOLR-6730) •  Looking out for what the kernel is doing •  “swappiness”, I/O scheduler fit for SSDs, huge pages •  Watch where the time’s spent (may not be where you expect…) •  No point with fast searches if max connections is too low •  There may be that odd hardcoded number (e.g. SOLR-6605) •  Even Jetty could have bugs which cause a request to stall and timeout 13  
  • 14. There will be storms… •  Thousands of cores in a cloud is a lot of fun J •  Started with Lucene/Solr 4.3.1, lots of stability concerns •  A lot better now obviously, but the odd issue remains… •  If there’s a race condition, we will hit it! •  Is it safe to bring down multiple replicas of a shard? •  What happens when you shutdown in the middle of a merge? •  If you have to screw up, be controlled about it! •  Zombie checks should be light (SOLR-5718) •  Will the cloud always heal after a network partition? •  As with real life, sometimes democracy can be annoying… 14  
  • 15. Scaling Solr Cloud •  Distributed coordination (good ol’ Overseer!) •  Hundreds of cores restarted at a time during weekends •  Scaling cluster state (SOLR-5381, SOLR-5872) •  Leadership mechanisms have to scale •  Transitions have to happen quickly (SOLR-6261) •  Leaders shouldn’t gang up on some machines (SOLR-6491) •  Running queries across hundreds of shards •  You could end up with situations involving 60K threads! •  Time to revisit the I/O model across shards in Solr? 15  
  • 16. Alerting: Prospective search •  Searching turned upside down •  Find which of millions of searches match one document •  Alert users who are interested in these searches •  Use tailored searches to tag documents with topics •  No out of the box support for Solr •  Baleene: A standalone application for prospective search •  Based on and improving Luwak and a fork of Lucene •  But fork no longer needed with Lucene 5.3! •  Indexes queries, and “pre-searches” by making queries of documents •  “Turning Search Upside Down” – Alan’s presentation at Dublin! •  Planning to release application as open source 16  
  • 17. The road ahead… •  A lot more possible with relevance •  Better scoring, user models, connecting data •  Learning to Rank presentation yesterday from Michael and Diego! •  Alerting for relevance sorted views •  Alert when the top N stories for my search change •  Update relevance sorted views in real-time •  Making queries more expressive •  Language detection, translation, synonyms… •  Entity search within text (‘richter:5.5-12 NEAR someplace’) •  Readership sorted views for any search •  Tens of millions of story hits per day, how best to index? (SOLR-5944?) 17  
  • 18. Committers: 2, Patches: 100+, Challenges: Countless! 18  
  • 19. 19 Q&