Search Engine
                             How To Make it




Wednesday, December 12, 2012
                      Search Quality Measurement

                      [Figure: Venn diagram inside the set of all documents, showing the relevant documents (REL), the retrieved documents (RET), and their overlap RET ∩ REL]




             • Database search: low recall, high precision
             • Web search: high recall, low precision
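
The two measures can be made concrete with a small sketch. It assumes the usual definitions, precision = |RET ∩ REL| / |RET| and recall = |RET ∩ REL| / |REL|, which the slide implies but does not spell out. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    ret, rel = set(retrieved), set(relevant)
    hits = ret & rel  # RET ∩ REL
    precision = len(hits) / len(ret) if ret else 0.0
    recall = len(hits) / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
narrow = {"d1", "d2"}                                      # few results, all relevant
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}   # many results, half relevant

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow result set behaves like the database search on the slide (high precision, low recall), and the broad one like the web search.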
                      [Figure: overall architecture. Sources (file system, 3rd party apps, database) -> crawlers (File System Crawler, Crawler API, Database Crawler) -> raw content (PDF, text, HTML, document, image, ...) -> parsers (Text Parser, HTML Parser, PDF Parser) -> documents (title, summary, author, datetime) -> Document Enhancing -> documents (categorized, taxonomized) -> Indexer (with Language Analyzer and Stop Analyzer) -> Index -> Index Searcher -> Web Client / Mobile Client -> Document Landing Page]





                   • Processes in a Search Engine: the core steps
                        • Crawling
                        • Parsing
                        • Indexing
                        • Searching


                   • Processes in a Search Engine: the full pipeline
                        • Crawling
                        • Parsing
                        • Duplicate Content Detection
                        • Document Enhancement
                        • Indexing
                        • Searching
                        • Document Serving

                   • Crawling
                        • Collects the data
                        • Input: the data sources to be searched
                        • Output: raw content in its original format
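
As a minimal sketch of the file system crawler, assuming the data source is a local directory tree: the function below walks the tree and hands back each file as raw bytes, untouched. The name `crawl_file_system` is hypothetical; a crawler for a database or a 3rd party API would follow the same input/output contract.

```python
import os


def crawl_file_system(root):
    """Yield (path, raw_bytes) for every readable file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # Binary mode keeps the content in its original format;
                # decoding is left to the parsing step.
                with open(path, "rb") as fh:
                    yield path, fh.read()
            except OSError:
                continue  # unreadable file: skipped in this sketch
```

Because it is a generator, the crawler can feed the next stage one file at a time instead of loading the whole collection into memory.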



                   [Figure: Crawling. File system -> File System Crawler; 3rd party apps -> Crawler API; database -> Database Crawler. Output: raw PDF, text, HTML, document, image, ... files]




                   • Parsing
                        • Extracts elements from the crawled documents
                        • Input: raw content
                        • Output: structured text documents
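
A minimal sketch of an HTML parser for this step, built on Python's standard `html.parser` module. It turns raw bytes into a structured document with `title`, `author` and `text` fields. The class name, the field names and the choice of the `<meta name="author">` tag are assumptions made for illustration; the slides do not say how each field is extracted.

```python
from html.parser import HTMLParser


class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title>, the author <meta> tag and the visible text."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "author": "", "text": ""}
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self.fields["author"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data.strip()
        elif not self._skip_depth and data.strip():
            self._text.append(data.strip())


def parse_html(raw_bytes, encoding="utf-8"):
    """Turn raw HTML bytes into a dict of structured text fields."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    parser.fields["text"] = " ".join(parser._text)
    return parser.fields
```

A text or PDF parser would return the same kind of dictionary, so the later stages do not need to know the original format.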


                   [Figure: Parsing. Raw PDF, text, HTML, document, image, ... files -> Text Parser / HTML Parser / PDF Parser -> documents (title, summary, author, datetime)]





                   • Duplicate Content Detection
                        • The more data, the more duplicated content
                        • The search engine implements similar-document
                          detection



                   • Document Representation
                             Model: Term Frequency (Tf)
                             Example:
                              Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
                              Document 2 (d2) = "andi also likes to watch soccer game."
                              Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}

                              Document representation in the Tf model (one count per dictionary term):
                              d1 = {1, 2, 1, 1, 1, 1, 0}
                              d2 = {1, 1, 1, 0, 0, 0, 1}

                   • Document Similarity
                             Similarity between documents d1 and d2: S(d1, d2)

                             S(d1, d2) = |d1 - d2|, the sum of the absolute differences between the term counts
                             Example:
                             d1 = {1, 2, 1, 1, 1, 1, 0}
                             d2 = {1, 1, 1, 0, 0, 0, 1}

                             S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|
                             S(d1, d2) = 5

                             With this definition, the smaller the value, the more similar the two documents are.
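
The representation and the measure S can be sketched in a few lines. The dictionary and the two sentences come from the slides; the tokenizer (lowercase, letters only) and the function names are assumptions made for illustration. Words that are not in the dictionary are simply ignored.

```python
import re
from collections import Counter

# Dictionary from the slides, in term-id order 1..7.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def tf_vector(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in dictionary]


def distance(u, v):
    """S(u, v): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


d1 = tf_vector("andi likes to watch movie. His wife likes it too")
d2 = tf_vector("andi also likes to watch soccer game.")
print(d1)
print(d2)
print(distance(d1, d2))
```

S is a distance rather than a similarity score: identical vectors give 0, and the value grows as the documents diverge.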

                   • Algorithm
                             1. Compute the Tf vector of every document

                             2. For a document d, find the smallest value of
                             S(d, di) over all other documents di in the
                             collection; that di is the most similar to d
                             3. If S(d, di) < threshold, treat d and di as
                             duplicates: compare their creation dates and
                             erase the older document
                             4. Repeat steps 2 and 3 until no value of S is
                             less than the threshold
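
One possible reading of these steps, as a rough sketch rather than the author's implementation: instead of scanning document by document, it repeatedly picks the closest pair in the whole collection, which reaches the same stopping condition. The data layout (id mapped to a Tf vector and a creation date), the function names and the sample values are all hypothetical. Every round compares all pairs, so this only suits small collections.

```python
from datetime import date
from itertools import combinations


def l1_distance(u, v):
    """S(u, v): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def remove_near_duplicates(docs, threshold):
    """docs maps id -> (tf_vector, created). Return the sorted ids that survive."""
    kept = dict(docs)
    while True:
        # Step 2: distances between all remaining pairs of documents.
        pairs = [
            (l1_distance(kept[a][0], kept[b][0]), a, b)
            for a, b in combinations(sorted(kept), 2)
        ]
        if not pairs:
            break
        dist, a, b = min(pairs)
        if dist >= threshold:
            break  # Step 4: no pair is below the threshold any more.
        # Step 3: the pair is a duplicate; erase the older document.
        older = a if kept[a][1] <= kept[b][1] else b
        del kept[older]
    return sorted(kept)


# Hypothetical collection: "a" and "b" differ in a single term count.
docs = {
    "a": ([1, 2, 1, 1, 1, 1, 0], date(2012, 1, 1)),
    "b": ([1, 2, 1, 1, 1, 0, 0], date(2012, 6, 1)),
    "c": ([1, 1, 1, 0, 0, 0, 1], date(2012, 3, 1)),
}
print(remove_near_duplicates(docs, threshold=2))  # ['b', 'c']
```

The threshold is the tuning knob: too low and near-duplicates survive, too high and distinct documents are erased.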




                   • Document Enhancement
                        • Tags each document based on a taxonomy
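
The slides do not say how the tags are assigned. As one simple possibility, the sketch below matches keywords from a small hand-written taxonomy against the title and summary of a parsed document. The taxonomy, the field names and the function name are all hypothetical.

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "entertainment": {"movie", "film", "cinema"},
    "sports": {"soccer", "game", "match"},
}


def enhance(document, taxonomy=TAXONOMY):
    """Return a copy of document with a sorted 'categories' field added."""
    text = f"{document.get('title', '')} {document.get('summary', '')}"
    words = set(re.findall(r"[a-z]+", text.lower()))
    categories = sorted(
        category for category, keywords in taxonomy.items() if words & keywords
    )
    return {**document, "categories": categories}


doc = {"title": "Andi likes to watch soccer game", "summary": ""}
print(enhance(doc)["categories"])  # ['sports']
```

The original document is left unchanged; the enhanced copy is what would be passed on to the indexer.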




                   [Figure: Document Enhancement. Documents (title, summary, author, datetime) -> Document Enhancing -> documents (categorized, taxonomized)]




                   • Indexing
                        • Indexes all the information that has been
                          gathered for each document
                             • Makes searching faster
                             • Allows searching on specific fields
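
Both benefits can be illustrated with a toy inverted index keyed by (field, term). This is a sketch under simple assumptions: a lookup replaces a scan over every document, and the field in the key allows searching on one field only. The stop-word list, the tokenizer and the function names are made up; a production engine such as Lucene is far more elaborate.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real analyzer ships a much longer one.
STOP_WORDS = {"a", "an", "the", "to", "it", "his", "also"}


def tokenize(text):
    """Lowercase, split into words and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(documents):
    """documents maps doc_id -> {field: text}. Return (field, term) -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[(field, term)].add(doc_id)
    return index


def search(index, field, query):
    """Return the ids of documents whose field contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get((field, term), set()) for term in terms]
    return set.intersection(*postings)


documents = {
    1: {"title": "Movie night", "author": "andi"},
    2: {"title": "Soccer game", "author": "andi"},
}
index = build_index(documents)
print(search(index, "title", "soccer"))  # {2}
print(search(index, "author", "andi"))   # {1, 2}
```

Searching for "andi" in the `title` field returns nothing, because the term only occurs in the `author` field.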


                   [Figure: Indexing. Documents (categorized, taxonomized) -> Indexer (using the Language Analyzer and the Stop Analyzer) -> Index]
                   • Searching
                   [Figure: Searching. Index -> Index Searcher -> Web Client / Mobile Client]





                   • Document Serving
                        • The search engine is also responsible for
                          displaying the results






                   [Figure: Document Serving. Boxes shown: Index, Index Searcher, Web Client / Mobile Client, Index Searcher, Document Landing Page]




                   • Recommended Open Source Technology
                             • Search Engine: Lucene, Nutch
                             • Programming Library: Hadoop, Scala Actor
                             • Database: MongoDB, PostgreSQL
                             • Programming Language: Java, Scala, PHP




Thank You


