Search Team
Engineering Achievements
Agenda

   Challenges
   Why a Platform?
   Information Extraction
       Need, Impact
       Research / Evaluations
       Approach / Implementation
   Information Retrieval
       Need, Impact
       Research / Evaluations
       Approach / Implementation
Challenges
   Job Alerts
        Over 13 Million searches, 3 times a week
        Complex Matching: Multiple Filters, Boosts, Sorts
   Resdex
        130K active users daily
        470K searches daily
        Over 220 million resumes and growing.
   Job Search
        High QPS: 112; 760K searches a day
        Near Real-time Indexing
                            Jobs Refreshed 92 times daily
   Product Demands
        > 99.99% uptime, Stability, Scalability → User Experience
        Varied Functional Requirements (Complexity)
                            NIRM, FN Suggestors, etc.
        Turnaround Time
                            Over 17 applications and growing
                            About a week to deploy / configure a new one
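
A back-of-the-envelope sketch of what the Job Search and uptime figures above imply. The helper names are illustrative, and reading 112 QPS as a peak rather than an average is an assumption; the slide does not say which it is.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_qps(searches_per_day: float) -> float:
    """Average queries per second if the load were spread evenly over a day."""
    return searches_per_day / SECONDS_PER_DAY

def allowed_downtime_minutes(uptime_percent: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given uptime target."""
    return (1 - uptime_percent / 100) * days * 24 * 60

print(round(average_qps(760_000), 1))             # 8.8
print(round(allowed_downtime_minutes(99.99), 1))  # 52.6
```

Under these assumptions, 760K searches a day averages to roughly 9 QPS, so a figure of 112 would be about 13 times the daily average, and a 99.99% target leaves under an hour of downtime per year.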
Why a Platform?
   Technical Challenges
        Code / Bug Duplication, Reusability
        Agility
                          Product Requirements Drive Platform-Wide Features
                          SOA, Integration, Business Logic Separation
                          Comprehensive Documentation
        Scalability
        Development and QA Time/Cost Reduction
   Product Challenges
        Turnaround Time
        Business Logic Implementation = Configuration
   Miscellaneous
        Maintenance Cost Reduction
        Resource Optimization/Integration (...Cloud)
        Standardized Reporting / Health Monitoring
Information Extraction
   Data/Information Acquisition
   Structure Raw Information
       Training-based Models for Class Inference
                 Functional Area Detection
       Rule-based Extraction
               Nested Funnels/Filter Layers
               Regular Expressions
   Feedback Loop
       Wisdom of Crowd/Collective Intelligence
                   SAP/SimCV: Capture User Response for Recommendations
       Continuous Quality Improvement
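
A minimal sketch of the rule-based idea (nested funnel layers narrowing the text, then a regular expression picking out the entity). The section layout, the `section`/`regex`/`funnel` helpers and the sample resume are all hypothetical, not the team's actual rules.

```python
import re
from typing import Callable, List

Filter = Callable[[str], List[str]]

def section(title: str) -> Filter:
    """Funnel layer: keep only the text under a heading line such as 'Contact:'."""
    pattern = re.compile(
        rf"^{re.escape(title)}:[ \t]*\n(.*?)(?=^\w[\w ]*:[ \t]*$|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    return lambda text: [m.group(1) for m in pattern.finditer(text)]

def regex(pattern: str) -> Filter:
    """Funnel layer: keep every substring matching the pattern."""
    compiled = re.compile(pattern)
    return lambda text: compiled.findall(text)

def funnel(text: str, layers: List[Filter]) -> List[str]:
    """Apply each layer to the output of the previous one."""
    candidates = [text]
    for layer in layers:
        candidates = [out for chunk in candidates for out in layer(chunk)]
    return candidates

RESUME = """Summary:
Search engineer, previously reachable at old@example.com
Contact:
Email: jane@example.com
Phone: 555-0100
Education:
B.Tech, 2005
"""

EMAIL = r"[\w.+-]+@[\w-]+\.[\w.-]+"
emails = funnel(RESUME, [section("Contact"), regex(EMAIL)])
print(emails)  # ['jane@example.com']
```

Because the section layer runs first, the address mentioned in the summary never reaches the regex layer.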
IE: Use Cases/Impact

   Resume Parsing
                Resman (unreg apply flow)
                         Accuracy ~75%
                         Dropouts reduced ~44%
   Job Acquisition/Aggregation
                Naukri India:
                       JobMail: 23 Sites, 8.8K Jobs
                       JobPosting: 16 Competitor Sites, 472K Jobs


                Naukri Gulf: JobAlerts, JobSearch
                         21 Sites, 6.5K Jobs
   Taxonomy Acquisition (Entity Extraction)
                FN Institute Names
                Contact Information
IE: Research / Evaluation

   Nutch
   Selenium
   Celerity
   UIMA*
   Heritrix
   HTTP Unit
   HTML Unit*
   Open NLP*
   Net::LWP*
IE: Architecture
Crawler Framework

   Crawler Propagation Capabilities
       JavaScript/Ajax/Event Support
                  Follow JavaScript Links
       URL Detection (Final URL Presentation)
                  URLs obtained via JavaScript Execution
                  Recursively Redirected URLs
                  Handle DOM Events
                         button, link, check-box, click, mouseOver, doubleClick.
                  URLs obtained after form-submission
Crawler Framework (contd.)

   Browser Emulation
               Built over Headless Browser
               Human-Like Propagation Strategies
               Handles Cookies
               Handles POST/GET methods
   Compliance (Obeys Robots.txt)
   Configurable Stateful/Delta Crawling
   Nested Multi-threaded Execution
                Pause/Resume/Restart Capabilities at Site/Seed URL levels
   Controllable Depth
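
Two of the capabilities above, robots.txt compliance and controllable depth, can be sketched with the standard library alone. The link graph, robots.txt content and `sketch-bot` agent name are made up for illustration; a real crawler would fetch pages through the headless browser rather than a dictionary.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and site structure, held in memory for the sketch.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

LINKS = {
    "http://example.com/": ["http://example.com/jobs",
                            "http://example.com/private/admin"],
    "http://example.com/jobs": ["http://example.com/jobs/1"],
    "http://example.com/jobs/1": ["http://example.com/jobs/1/apply"],
}

def crawl(seed, max_depth, agent="sketch-bot"):
    """Breadth-first crawl that skips disallowed URLs and stops at max_depth."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    seen, visited = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # obey robots.txt
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("http://example.com/", max_depth=2))
```

With a depth of 2 the crawl reaches `/jobs/1` but not `/jobs/1/apply`, and `/private/admin` is never visited.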
Crawler Framework (contd.)

   Real-time Crawler Statistics
       State Information
       MISes
   Crawl-Payload persistence strategies
       Multiple, Combinable Persistence Modules
       Multiple Output Format Support
                   Flat File, XML
                   JDBC-connectable data stores
                   Search Engine Index Formats (e.g. Lucene, Sphinx)
                   Archive Formats (bz, gz, rar, zip, ...)
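
One way to read "multiple, combinable persistence modules" is a composite that fans each crawled record out to several sinks. The class names and record fields below are assumptions made for this sketch; the in-memory buffers stand in for real files or data stores.

```python
import gzip
import io
import json

class FlatFileSink:
    """Writes one tab-separated line per crawled page."""
    def __init__(self):
        self.stream = io.StringIO()

    def save(self, record):
        self.stream.write(f"{record['url']}\t{record['title']}\n")

class GzipJsonSink:
    """Appends each page as a gzip-compressed JSON line."""
    def __init__(self):
        self.buffer = io.BytesIO()

    def save(self, record):
        line = json.dumps(record, sort_keys=True) + "\n"
        self.buffer.write(gzip.compress(line.encode("utf-8")))

class CompositeSink:
    """Combines any number of sinks behind the same save() call."""
    def __init__(self, *sinks):
        self.sinks = sinks

    def save(self, record):
        for sink in self.sinks:
            sink.save(record)

flat, archive = FlatFileSink(), GzipJsonSink()
store = CompositeSink(flat, archive)
store.save({"url": "http://example.com/jobs/1", "title": "Search Engineer"})
store.save({"url": "http://example.com/jobs/2", "title": "QA Lead"})

restored = gzip.decompress(archive.buffer.getvalue()).decode("utf-8").splitlines()
print(flat.stream.getvalue(), end="")
print(len(restored))  # 2
```

The crawler only ever sees `store.save(...)`, so adding an output format means adding a sink rather than changing crawl code.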
Extraction Strategies

   Analysis Plugins
       Entity Extraction
                   Composable Funnelling Filters
                          Sections, Subsections, ..., Entity
                   Regex-based Subpart matching
                   Corpus, NLP-based matching
                          UIMA, OpenNLP

       Machine-Learning Approaches
                   Classification / Tagging (Bayesian, SVM)
                   Clustering
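
As an illustration of the Bayesian classification approach, here is a tiny multinomial Naive Bayes with add-one smoothing, applied to functional area detection. The training snippets and labels are invented, and this is a sketch of the general technique rather than the model the team trained.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes()
model.train("IT Software", "java developer lucene search")
model.train("IT Software", "python backend engineer")
model.train("Sales", "regional sales manager targets")
model.train("Sales", "business development sales executive")

print(model.classify("senior java search engineer"))  # IT Software
print(model.classify("sales manager"))                # Sales
```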
Information Retrieval

   Custom/Controllable Relevance/Matching
   Scalability of Search
       Large Volumes
       High Churn
       QPS
   Extraction/Acquisition Pipeline Pluggability
   Results Post Processing
IR: Research / Evaluations

   MySQL Full-Text
   FastESP
   Solr
   Sphinx
   Lucene *
   OpenNLP *
   LingPipe
   Mozilla Rhino (JavaScript) *
IR: Use Cases/Impact

   NSE on Resdex India
       Relevance
IR: Use Cases/Impact (contd.)
       Error Count the week Before: 91, week After: 1
       Availability (Before: 97.71% - 99.44%, After: 99.99%)
       Performance
                       Slow Queries (> 10 secs): < 0.2%
                       Average Search Time: 0.55 secs
                       QA Quote
                        "There is an overall decrease in the page download time for
                          Resdex Search Results page. In case the cache is cleared the
                          page download time has decreased by 34% to 35%, while the
                          page download time has drastically decreased, more than 73%,
                          when checked without clearing cache."
   NSE on Resdex FirstNaukri
       PM Quote
        "Hardly any bugs considering the complexity of project. Search results are also
        coming @ speed of thought."
IR: Use Cases/Impact (contd.)

   Improved Concurrency
IR: Architecture
IR: Platform Features

   High Availability, Stability, Performance
        Caching
                       Adaptive Caching of Hit Attributes
                       Caching of Expression Evaluations
                       Pre-configurable Caching Query Filters
        Distributed Search
                       Search over Sharded Indexes
                       Auto Failover
                       Auto Healing
        Search/Sort/Group Millions of results
                       Complex expressions.
        Miscellaneous
                       Status Reports, Performance Analytics
                       Suggestive Garbage Collection
                       Preload Indexes into RAM
                       Ease of Deployment
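
Search over sharded indexes with failover can be sketched as: query one replica per shard, fall back to the next replica on failure, then merge the per-shard hits into a global top-k. The replica functions, term-frequency scoring and sample documents are stand-ins chosen for this sketch, not the platform's API.

```python
import heapq
from itertools import chain

def make_replica(docs, healthy=True):
    """Return a search function over one copy of a shard (hypothetical)."""
    def search(term):
        if not healthy:
            raise ConnectionError("replica unavailable")
        hits = []
        for doc_id, text in docs.items():
            tf = text.lower().split().count(term)
            if tf:
                hits.append((tf, doc_id))  # (score, document id)
        return hits
    return search

def search_shard(replicas, term):
    """Try replicas in order; the first one that answers wins (failover)."""
    for replica in replicas:
        try:
            return replica(term)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available for shard")

def distributed_search(shards, term, k):
    """Merge hits from every shard and keep the k best scores."""
    all_hits = chain.from_iterable(search_shard(r, term) for r in shards)
    return heapq.nlargest(k, all_hits)

SHARD_A = {"a1": "java search engineer", "a2": "java java developer"}
SHARD_B = {"b1": "sales manager", "b2": "java java java architect"}
shards = [
    [make_replica(SHARD_A, healthy=False), make_replica(SHARD_A)],  # first copy is down
    [make_replica(SHARD_B)],
]

top = distributed_search(shards, "java", k=2)
print(top)  # [(3, 'b2'), (2, 'a2')]
```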
IR: Platform Features (contd.)

   Text Transformations
        Tokenization/Transformation/Tagging
       Controlled, Combinable Stemming
                      Plural, Tenses, Noun-Forms, etc. [Relevance]
                      Inversion of Stem-roots
                               Highlighting/Did You Mean/Query Expansion
       Phonetic Token Mapping/Augmentation
       Custom Word Mapping/Synonyms (iMatch)
       Linguistic Tagging
                      PoS, Entity Extraction
                      Match/Boost on Tags
       Sentence Detection
       Apply different analytics to different fields
       Context Sensitive Spelling Correction
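
A small sketch of combinable text transformations applied per field: tokenization, a plural-only stemmer and a custom word mapping are chained for one field and skipped for another. The field names, synonym table and stemming rules are simplified assumptions for illustration.

```python
import re

SYNONYMS = {"sr": "senior", "dev": "developer"}  # hypothetical custom word mapping

def tokenize(text):
    return re.findall(r"[a-z0-9+#]+", text.lower())

def strip_plural(token):
    """Very small plural stemmer; real stemmers handle many more cases."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def map_synonym(token):
    return SYNONYMS.get(token, token)

# Different analysis chains for different fields.
FIELD_ANALYZERS = {
    "title": [strip_plural, map_synonym],
    "company": [],  # company names are left unstemmed
}

def analyze(field, text):
    tokens = tokenize(text)
    for step in FIELD_ANALYZERS[field]:
        tokens = [step(t) for t in tokens]
    return tokens

print(analyze("title", "Sr Java Devs"))            # ['senior', 'java', 'developer']
print(analyze("company", "Infosys Technologies"))  # ['infosys', 'technologies']
```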
IR: Platform Features (contd.)

   Indexing
       Dynamic Rule Based Sharding, Distributed Search
       Multiple Data Source Type Support
       (Near-)Real Time Indexing, Search
       Generic Auxiliary Index Format
                  Fast Updates/Retrieval
                  Realtime Per-User Filtering/Sorting
   Matching/Filtering
        Lucene Query Functionality
                  Phrase, Proximity, Fuzzy, Wildcard
                  FirstNaukri Suggestor Implementation
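
Rule-based sharding can be pictured as an ordered list of predicates that route a document to a named shard, with a hash-based fallback when no rule matches. The rules, field names and shard names here are invented for the sketch.

```python
import zlib

# Ordered (predicate, shard) rules; the first match wins. All hypothetical.
RULES = [
    (lambda doc: doc.get("country") == "AE", "gulf"),
    (lambda doc: doc.get("experience_years", 0) >= 10, "senior"),
]

def route(doc, fallback_shards=("s0", "s1")):
    """Pick the shard a document should be indexed into."""
    for predicate, shard in RULES:
        if predicate(doc):
            return shard
    # No rule matched: spread documents evenly by a stable hash of the id.
    index = zlib.crc32(doc["id"].encode("utf-8")) % len(fallback_shards)
    return fallback_shards[index]

print(route({"id": "r1", "country": "AE"}))       # gulf
print(route({"id": "r2", "experience_years": 12}))  # senior
print(route({"id": "r3"}) in ("s0", "s1"))        # True
```

Keeping the rules in data rather than code is what would let them change without redeploying the indexer.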
IR: Platform Features (contd.)

   Result Grouping/Clustering
   Expressions
       Embedded JavaScript Support
       Aggregate Functions (superset of SQL)
                   Sort/Group/Filter during indexing, search
   Sorting
       Dynamic/Stateful Sorting, e.g. for Ad Rotation
       Quota-Based Resorting
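
The slide does not define quota-based resorting. One plausible reading, assumed here, is that each group (for example an employer) may occupy only a limited number of top positions, with the surplus pushed below the rest while keeping relevance order.

```python
from collections import Counter

def quota_resort(hits, key, quota):
    """Keep at most `quota` hits per group at the top; demote the rest.

    `hits` must already be sorted by relevance; relative order is preserved
    within both the kept and the demoted part.
    """
    kept, overflow, used = [], [], Counter()
    for hit in hits:
        group = key(hit)
        if used[group] < quota:
            used[group] += 1
            kept.append(hit)
        else:
            overflow.append(hit)
    return kept + overflow

# Hypothetical (job id, employer) hits, best match first.
hits = [("j1", "acme"), ("j2", "acme"), ("j3", "acme"), ("j4", "globex")]
print(quota_resort(hits, key=lambda h: h[1], quota=2))
# [('j1', 'acme'), ('j2', 'acme'), ('j4', 'globex'), ('j3', 'acme')]
```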
IR: Platform Features (contd.)

   Scoring
       Fully Controlled, Customizable Relevance Scores
       More controllable/testable than Solr/Default Lucene Scoring
       Named Query Parts usable in Expressions
       Custom Scorer Variables
                 Vector Space, Query Boost, LCS, Numwords
   Configurability, API
       SQL-like client wrapper
                 Engine-App interactions look like SQL
       XML configurability
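
A sketch of how scorer variables might feed a configurable relevance expression. It covers three of the variables named above and assumes LCS means longest common subsequence over tokens (the slide does not expand the acronym). A Python lambda stands in for the embedded JavaScript expression, and all names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def scorer_variables(query, field_text, query_boost=1.0):
    """Compute the per-hit variables an expression is allowed to use."""
    q, d = query.lower().split(), field_text.lower().split()
    return {"lcs": lcs_length(q, d), "numwords": len(d), "query_boost": query_boost}

# The scoring rule is configuration, not engine code.
expression = lambda v: v["query_boost"] * v["lcs"] / v["numwords"]

variables = scorer_variables(
    "java search engineer",
    "senior java search platform engineer",
    query_boost=2.0,
)
print(variables)              # {'lcs': 3, 'numwords': 5, 'query_boost': 2.0}
print(expression(variables))  # 1.2
```

Because every variable is an explicit input, a score can be reproduced and unit-tested for a given query and document, which is the sense in which such scoring is easier to control and test.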
Road Ahead


If you don't know where you are going,
   any road will get you there.

                        - The Cheshire Cat,
                        Alice in Wonderland.
Road Ahead (contd.)

   Cloud

   Semantic Relevance (Search is Dead!)
       Contextual


   Information Extraction
       NLP
       Ontology Extraction
Thanks!

Naukri Search Team achievements, 2009-2010

  • 2. Agenda  Challenges  Why a Platform?  Information Extraction  Need, Impact  Research / Evaluations  Approach / Implementation  Information Retrieval  Need, Impact  Research / Evaluations  Approach / Implementation
  • 3. Challenges  Job Alerts  Over 13 Million searches, 3 times a week  Complex Matching: Multiple Filters, Boosts, Sorts  Resdex  130K active users daily  470K searches daily  Over 220 million resumes and growing.  Job Search  High QPS 112, 760K searches a day  Near Real-time Indexing  Jobs Refreshed 92 times daily  Product Demands  > 99.99% uptime, Stability, Scalability → User Experience  Varied Functional Requirements (Complexity)  NIRM, FN Suggestors, etc.  Turnaround Time  Over 17 applications and growing  About a week to deploy / configure a new one
  • 4. Why a Platform?  Technical Challenges  Code / Bug Duplication, Reusability  Agility  Product Requirements Drive Platform-Wide Features  SOA, Integration, Business Logic Separation  Comprehensive Documentation  Scalability  Development and QA Time/Cost Reduction  Product Challenges  Turnaround Time  Business Logic Implementation = Configuration  Miscellaneous  Maintenance Cost Reduction  Resource Optimization/Integration (...Cloud)  Standardized Reporting / Health Monitoring
  • 5. Information Extraction  Data/Information Acquisition  Structurize Raw Information  Training based Models for Class Inference Functional Area Detection  Rule based Extraction Nested Funnels/Filter Layers Regular Expressions  Feedback Loop  Wisdom of Crowd/Collective Intelligence SAP/SimCV: Capture User Response for Recommendations  Continuous Quality Improvement
  • 6. IE:Use Cases/Impact  Resume Parsing  Resman (unreg apply flow)  Accuracy ~75%  Dropouts reduced ~44%  Job Acquisition/Aggregation  Naukri India:  JobMail: 23 Sites, 8.8K Jobs  JobPosting: 16 Competition Sites, 472K Jobs  Naukri Gulf: JobAlerts, JobSearch  21 Sites, 6.5K Jobs  Taxonomy Acquisition (Entity Extraction)  FN Institute Names  Contact Information
  • 7. IE: Research / Evaluation  Nutch  Selenium  Celerity  UIMA*  Heritrix  HttpUnit  HtmlUnit*  OpenNLP*  Net::LWP*
  • 9. Crawler Framework  Crawler Propagation Capabilities  JavaScript/Ajax/Event Support  Follow JavaScript Links  URL Detection (Final URL Presentation)  URLs obtained via JavaScript Execution  Recursively Redirected URLs  Handle DOM Events: button, link, check-box, click, mouseOver, doubleClick  URLs obtained after form submission
  • 10. Crawler Framework (contd.)  Browser Emulation  Built over Headless Browser  Human-Like Propagation Strategies  Handles Cookies  Handles POST/GET methods  Compliance (Obeys Robots.txt)  Configurable Stateful/Delta Crawling  Nested Multi-threaded Execution  Pause/Resume/Restart Capabilities at Site/Seed URL levels  Controllable Depth
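
The robots.txt compliance and controllable depth listed on this slide can be illustrated with a minimal sketch. It is not the team's crawler: the `crawl` function, the in-memory `links` graph (standing in for real page fetches), and the `example-bot` agent name are all assumptions made for illustration. Only `urllib.robotparser` from the Python standard library is used.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(seed, links, robots_txt, max_depth, agent="example-bot"):
    """Breadth-first walk over `links` (url -> list of urls).

    `links` is a hypothetical stand-in for fetching and parsing pages.
    URLs disallowed by robots.txt are skipped, and pages at `max_depth`
    are visited but not expanded further.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    seen = {seed}
    visited = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # compliance: never visit a disallowed URL
        visited.append(url)
        if depth == max_depth:
            continue  # depth control: do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

Persisting `seen` between runs would be one way to approximate the stateful/delta crawling the slide mentions, though the deck does not say how that was done.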
  • 11. Crawler Framework (contd.)  Real-time Crawler Statistics  State Information  MISes  Crawl-Payload persistence strategies  Multiple, Combinable Persistence Modules  Multiple Output Format Support  Flat File, XML  JDBC-connectable data stores  Search Engine Index Formats (e.g. Lucene, Sphinx)  Archive Formats (bz, gz, rar, zip, ...)
  • 12. Extraction Strategies  Analysis Plugins  Entity Extraction  Composable Funnelling Filters Sections, Subsections, ..., Entity  Regex-based Subpart matching  Corpus, NLP-based matching UIMA, OpenNLP  Machine-Learning Approaches  Classification / Tagging (Bayesian, SVM)  Clustering
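
The "composable funnelling filters" idea (narrow to a section, then match an entity with a regex) might look roughly like the sketch below. The helper names `section`, `entity`, and `funnel`, the section layout (a heading line followed by text up to the next blank line), and the simplified `EMAIL` pattern are assumptions for illustration, not the deck's implementation.

```python
import re


def section(name):
    """Filter that narrows text to the body under a heading line `name`."""
    pattern = re.compile(
        rf"^{re.escape(name)}:?[ \t]*\n(.*?)(?:\n\s*\n|\Z)",
        re.S | re.M | re.I,
    )

    def apply(text):
        match = pattern.search(text)
        return match.group(1) if match else ""

    return apply


def entity(regex):
    """Final filter: return every match of `regex` in the narrowed text."""
    compiled = re.compile(regex)

    def apply(text):
        return compiled.findall(text)

    return apply


def funnel(*stages):
    """Compose filters so each stage only sees the previous stage's output."""

    def run(text):
        for stage in stages:
            text = stage(text)
        return text

    return run


# Deliberately simplified e-mail pattern, for illustration only.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
extract_contact_emails = funnel(section("Contact"), entity(EMAIL))
```

Because the e-mail regex only runs on the "Contact" section, an address mentioned elsewhere in a resume is not picked up, which is the point of funnelling before matching.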
  • 13. Information Retrieval  Custom/Controllable Relevance/Matching  Scalability of Search  Large Volumes  High Churn  QPS  Extraction/Acquisition Pipeline Pluggability  Results Post Processing
  • 14. IR: Research / Evaluations  MySQL Full-Text  FastESP  Solr  Sphinx  Lucene *  OpenNLP *  LingPipe  Mozilla Rhino (JavaScript) *
  • 15. IR: Use Cases/Impact  NSE on Resdex India  Relevance
  • 16. IR: Use Cases/Impact  Error Count (week Before: 91, week After: 1)  Availability (Before: 97.71% - 99.44%, After: 99.99%)  Performance  Slow Queries (> 10 secs): < 0.2%  Average Search Time: 0.55 secs  QA Quote "There is an overall decrease in the page download time for Resdex Search Results page. In case the cache is cleared the page download time has decreased by 34% to 35%, while the page download time has drastically decreased, more than 73%, when checked without clearing cache."  NSE on Resdex FirstNaukri  PM Quote "Hardly any bugs considering the complexity of project. Search results are also coming @ speed of thought."
  • 17. IR: Use Cases/Impact (contd.)  Improved Concurrency
  • 19. IR: Platform Features  High Availability, Stability, Performance  Caching  Adaptive Caching of Hit Attributes  Caching of Expression Evaluations  Pre-configurable Caching Query Filters  Distributed Search  Search over Sharded Indexes  Auto Failover  Auto Healing  Search/Sort/Group Millions of results  Complex expressions.  Miscellaneous  Status Reports, Performance Analytics  Suggestive Garbage Collection  Preload Indexes into RAM  Ease of Deployment
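
Search over sharded indexes with automatic failover can be sketched as below. This is a toy model under stated assumptions: each shard is a list of replica callables, a failed replica raises `ConnectionError`, and a hit is a `(score, doc_id)` tuple. None of these names or conventions come from the deck.

```python
import heapq


def search_shards(shards, query, k):
    """Query each shard once and merge the per-shard hits into a global top-k.

    `shards` is a list of shards; each shard is a list of replica callables
    taking (query, k) and returning up to k (score, doc_id) tuples.
    """
    partials = []
    for replicas in shards:
        for replica in replicas:
            try:
                partials.append(replica(query, k))
                break  # failover: the first healthy replica answers
            except ConnectionError:
                continue  # try the next replica of the same shard
    # Asking each shard for its own top k is enough to build the global top k.
    all_hits = (hit for hits in partials for hit in hits)
    return heapq.nlargest(k, all_hits)
```

If every replica of a shard is down, this sketch silently returns results from the remaining shards; a real system would likely report the degraded state.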
  • 20. IR: Platform Features  Text Transformations Tokenization/Transformation/Tagging  Controlled, Combinable Stemming  Plural, Tenses, Noun-Forms, etc. [Relevance ]  Inversion of Stem-roots Highlighting/Did You Mean/Query Expansion  Phonetic Token Mapping/Augmentation  Custom Word Mapping/Synonyms (iMatch)  Linguistic Tagging  PoS, Entity Extraction  Match/Boost on Tags  Sentence Detection  Apply different analytics to different fields  Context Sensitive Spelling Correction
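
"Apply different analytics to different fields" with combinable transformations could be modelled as a per-field chain of token steps. The sketch below is an assumption-laden illustration: the plural stripper is deliberately naive, the synonym table is hypothetical, and the field names `title` and `skills` are invented for the example.

```python
import re


def tokenize(text):
    """Lowercase and split into tokens, keeping '+' and '#' (e.g. c++, c#)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def strip_plural(tokens):
    """Naive plural stemming: drop a trailing 's' from longer tokens."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def map_synonyms(mapping):
    """Build a step that replaces tokens using a custom word mapping."""

    def apply(tokens):
        return [mapping.get(t, t) for t in tokens]

    return apply


def analyzer(*steps):
    """Combine tokenization with any number of token transformation steps."""

    def run(text):
        tokens = tokenize(text)
        for step in steps:
            tokens = step(tokens)
        return tokens

    return run


# Hypothetical per-field configuration.
ANALYZERS = {
    "title": analyzer(strip_plural, map_synonyms({"dev": "developer"})),
    "skills": analyzer(),  # exact tokens only, no stemming
}
```

Running the same chain at index time and at query time keeps both sides in the same token space.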
  • 21. IR: Platform Features  Indexing  Dynamic Rule Based Sharding, Distributed Search  Multiple Data Source Type Support  (Near-)Real Time Indexing, Search  Generic Auxiliary Index Format  Fast Update/Retrieval  Realtime Per-User Filtering/Sorting  Matching/Filtering Lucene Query Functionality  Phrase, Proximity, Fuzzy, Wildcard  FirstNaukri Suggestor Implementation
  • 22. IR: Platform Features  Result Grouping/Clustering  Expressions  Embedded JavaScript Support  Aggregate Functions (superset of SQL)  Sort/Group/Filter during indexing, search  Sorting  Dynamic/Stateful Sorting, e.g. for Ad Rotation  Quota-Based Resorting
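
The slide does not define "Quota-Based Resorting". One plausible reading, assumed here purely for illustration, is that each group (for example an employer) may hold at most `quota` of the leading positions, with surplus hits demoted behind the rest while relative order is preserved. The `quota_resort` function and the tuple layout are hypothetical.

```python
from collections import Counter


def quota_resort(hits, key, quota):
    """Stable resort: hits beyond a group's quota move after all other hits."""
    kept, overflow = [], []
    used = Counter()
    for hit in hits:
        group = key(hit)
        if used[group] < quota:
            used[group] += 1
            kept.append(hit)
        else:
            overflow.append(hit)
    return kept + overflow
```

For a relevance-ordered list where one employer dominates the top, this keeps that employer's best `quota` jobs in place and lets other employers surface earlier.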
  • 23. IR: Platform Features  Scoring  Fully Controlled, Customizable Relevance Scores  More controllable/testable than Solr/Default Lucene Scoring  Named Query Parts usable in Expressions  Custom Scorer Variables Vector Space, Query Boost, LCS, Numwords  Configurability, API  SQL-like client wrapper Engine-App interactions look like SQL  XML configurability
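
A relevance score built from named scorer variables can be sketched as follows. Several assumptions are made: "LCS" is read as the longest common subsequence over tokens, "Numwords" as the field's token count, and the final formula (`boost * lcs / numwords`) is invented for illustration. The deck says expressions were written in embedded JavaScript; plain Python is used here instead.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(query, field, boost=1.0):
    """Combine scorer variables into one number (hypothetical formula)."""
    q = query.lower().split()
    f = field.lower().split()
    variables = {
        "lcs": lcs_length(q, f),  # in-order token overlap
        "numwords": len(f),       # field length in tokens
        "boost": boost,           # query-level boost
    }
    return variables["boost"] * variables["lcs"] / max(variables["numwords"], 1)
```

Exposing the variables by name is what makes such a score testable in isolation: each input can be asserted on before the combining expression is applied.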
  • 24. Road Ahead  "If you don't know where you are going, any road will get you there." - The Cheshire Cat, Alice in Wonderland.
  • 25. Road Ahead  Cloud  Semantic Relevance (Search is Dead!)  Contextual  Information Extraction  NLP  Ontology Extraction