SlideShare une entreprise Scribd logo
1  sur  4
Ms. T.Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
Open Source Search Engine Framework
When deciding to install a search engine in a website, there exists the possibility to use a commercial
search engine or an open source one. For most of the websites, using a commercial search engine is
not a feasible alternative because of the fees that are required and because they focus on large scale
sites. On the other hand, open source search engines may give the same functionalities (some are
capable of managing large amount of data) as a commercial one, with the benefits of the open source
philosophy: no cost, software maintained actively, possibility to customize the code in order to satisfy
personal needs.
Nowadays, there are many open source alternatives that can be used, and each of them has different
characteristics that must be taken into consideration in order to determine which one to install in the
website. These search engines can be classified according to the programming language in which it is
implemented, how it stores the index (inverted file, database, other file structure), its searching
capabilities (Boolean operators, fuzzy search, use of stemming, etc), way of ranking, type of files
capable of indexing (HTML, PDF, plain text, etc), possibility of on-line indexing and/or making
incremental indexes.
Example:
There are several open source search engines available.
Nutch, Lucene, ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE,
ManagingGigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Omega,
OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/ freeWAIS,
WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.
Nutch: (A Flexible and Scalable Open-Source Web Search Engine)
Nutch is an open-source Web search engine that can be used at global, local, and even personal scale.
Its initial design goal was to enable a transparent alternative for global Web search in the public
interest — one of its signature features is the ability to “explain” its result rankings. Recent work has
emphasized how it can also be used for intranets; by local communities with richer data models, such
as the Creative Commons metadata enabled search for licensed content; on a personal scale to index a
user's files, email, and web-surfing history
Architecture:
Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML
analysis, data retrieval protocols, and queries. The core has four major components.
a) Searcher:
Given a query, it must quickly find a small relevant subset of a corpus of documents, and then present
them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking
within that set to produce the most relevant documents, which then must be summarized for display.
b) Indexer:
Creates the inverted index from which the searcher extracts results. It uses Lucene storing indexes. c)
Database:
Stores the document contents for indexing and later summarization by the searcher, along with
information such as the link structure of the document space and the each document was last fetched.
d) Fetcher:
Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written
entirely from scratch.
Crawling:
An intranet or niche search engine might only take a single machine a few hours to crawl, while a
whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists
of generating a fetch list from the webdb, fetching those pages, parsing those for links, then updating
the webdb. It also uses a uniform refresh policy; all pages are refetched at the same interval (30 days,
by default).
Indexing Text:
Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of
Lucene’s multi-field case-folding keyword and phrase search in URLs, anchor text, and document
text.
Indexing Hypertext:
Lucene provides an inverted-file full-text index, which suffices for indexing text but not the
additional tasks required by a web search engine. In addition to this, Nutch implements a link
database to provide efficient access to the Web's link graph, and a page database that stores crawled
pages for indexing, summarizing, and serving to users, as well as supporting other functions such as
crawling and link analysis.
Removing Duplicates
The nutch dedup command eliminates duplicate documents from a set of Lucene indices for Nutch
segments, so it inherently requires access to all the segments at once. It's a batch-mode process that
has to be run before running searches to prevent the search from returning duplicate documents.
Link Analysis:
Nutch includes a link analysis algorithm similar to PageRank. Distributed link analysis is a bulk
synchronous parallel process. At the beginning of each phase, the list of URLs whose scores must be
updated is divided up into many chunks; in the middle, many processes produce score-edit files by
finding all the links into pages in their particular chunk. At the end, an updating phase reads the score-
edit files one at a time, merging their results into new scores for the pages in the web database.
Searching:
Nutch's search user interface runs as a Java Server Page (JSP) that parses the user's textual query and
invokes the search method of a NutchBean. If Nutch is running on a single server, this translates the
user's query into a Lucene query and gets a list of hits from Lucene, which the JSP then renders into
HTML. If Nutch is instead distributed across several servers, the NutchBean's search method instead
remotely invokes the search methods of other NutchBeans on other machines, which can be
configured either to perform the search locally as described above or farm pieces of the work out to
yet other servers.
Summarizing:
Summaries on a results page are designed to avoid click throughs. By providing as much relevant
information as possible in a small amount of text, they help users improve precision. Nutch's
summarizer works by retokenizing a text string containing the entire original document, extracting a
minimal set of excerpts containing five words of context on each side of each hit in the document,
then deciding which excerpts to include in the final summary. It orders the excerpts with the best ones
first, preferring longer excerpts over shorter ones, and excerpts with more hits above excerpts with
fewer, and then it truncates the total summary to a maximum of twenty words.

Contenu connexe

Tendances

Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalSudarsun Santhiappan
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrievalBasma Gamal
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notesBAIRAVI T
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?Yu-Chang Ho
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalA. LE
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data miningZHAO Sam
 

Tendances (20)

Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Web mining
Web miningWeb mining
Web mining
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Inverted index
Inverted indexInverted index
Inverted index
 
web mining
web miningweb mining
web mining
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
CS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdfCS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdf
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
 
On demand provisioning
On demand provisioningOn demand provisioning
On demand provisioning
 

Similaire à Open source search engine

Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerIOSR Journals
 
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document RetrievalAn Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document Retrievaliosrjce
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...ijwscjournal
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]Mustafa Elkhiat
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Introduction to internet.
Introduction to internet.Introduction to internet.
Introduction to internet.Anish Thomas
 
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...IJCSIS Research Publications
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...ijwscjournal
 
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...iosrjce
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...ijsrd.com
 
Google indexing
Google indexingGoogle indexing
Google indexingtahoor71
 
TEXT ANALYZER
TEXT ANALYZER TEXT ANALYZER
TEXT ANALYZER ijcseit
 
Evaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery ServicesEvaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery ServicesNikesh Narayanan
 

Similaire à Open source search engine (20)

Faster and resourceful multi core web crawling
Faster and resourceful multi core web crawlingFaster and resourceful multi core web crawling
Faster and resourceful multi core web crawling
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
G017254554
G017254554G017254554
G017254554
 
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document RetrievalAn Intelligent Meta Search Engine for Efficient Web Document Retrieval
An Intelligent Meta Search Engine for Efficient Web Document Retrieval
 
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
SPEEDING UP THE WEB CRAWLING PROCESS ON A MULTI-CORE PROCESSOR USING VIRTUALI...
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Introduction to internet.
Introduction to internet.Introduction to internet.
Introduction to internet.
 
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
Proactive Approach to Estimate the Re-crawl Period for Resource Minimization ...
 
How Google Works
How Google WorksHow Google Works
How Google Works
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtuali...
 
R01765113122
R01765113122R01765113122
R01765113122
 
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
 
Presentation on SEO, .htaccess, Open-source, Ontology, Semantic web, etc.
Presentation on SEO, .htaccess, Open-source, Ontology, Semantic web, etc.Presentation on SEO, .htaccess, Open-source, Ontology, Semantic web, etc.
Presentation on SEO, .htaccess, Open-source, Ontology, Semantic web, etc.
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
 
Google indexing
Google indexingGoogle indexing
Google indexing
 
TEXT ANALYZER
TEXT ANALYZER TEXT ANALYZER
TEXT ANALYZER
 
Evaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery ServicesEvaluation of Web Scale Discovery Services
Evaluation of Web Scale Discovery Services
 
Web crawler
Web crawlerWeb crawler
Web crawler
 

Dernier

Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 

Dernier (20)

Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 

Open source search engine

  • 1. Ms. T.Primya Assistant Professor Department of Computer Science and Engineering Dr. N. G. P. Institute of Technology Coimbatore Open Source Search Engine Framework When deciding to install a search engine in a website, there exists the possibility to use a commercial search engine or an open source one. For most of the websites, using a commercial search engine is not a feasible alternative because of the fees that are required and because they focus on large scale sites. On the other hand, open source search engines may give the same functionalities (some are capable of managing large amount of data) as a commercial one, with the benefits of the open source philosophy: no cost, software maintained actively, possibility to customize the code in order to satisfy personal needs. Nowadays, there are many open source alternatives that can be used, and each of them has different characteristics that must be taken into consideration in order to determine which one to install in the website. These search engines can be classified according to the programming language in which it is implemented, how it stores the index (inverted file, database, other file structure), its searching capabilities (Boolean operators, fuzzy search, use of stemming, etc), way of ranking, type of files capable of indexing (HTML, PDF, plain text, etc), possibility of on-line indexing and/or making incremental indexes. Example: There are several open source search engines available. Nutch, Lucene, ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, ManagingGigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/ freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair. Nutch: (A Flexible and Scalable Open-Source Web Search Engine) Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Its initial design goal was to enable a transparent alternative for global Web search in the public interest — one of its signature features is the ability to “explain” its result rankings. Recent work has emphasized how it can also be used for intranets; by local communities with richer data models, such
  • 2. as the Creative Commons metadata enabled search for licensed content; on a personal scale to index a user's files, email, and web-surfing history Architecture: Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML analysis, data retrieval protocols, and queries. The core has four major components. a) Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, and then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set to produce the most relevant documents, which then must be summarized for display. b) Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene storing indexes. c) Database: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the each document was last fetched. d) Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch. Crawling:
  • 3. An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetch list from the webdb, fetching those pages, parsing those for links, then updating the webdb. It also uses a uniform refresh policy; all pages are refetched at the same interval (30 days, by default). Indexing Text: Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of Lucene’s multi-field case-folding keyword and phrase search in URLs, anchor text, and document text. Indexing Hypertext: Lucene provides an inverted-file full-text index, which suffices for indexing text but not the additional tasks required by a web search engine. In addition to this, Nutch implements a link database to provide efficient access to the Web's link graph, and a page database that stores crawled pages for indexing, summarizing, and serving to users, as well as supporting other functions such as crawling and link analysis. Removing Duplicates The nutch dedup command eliminates duplicate documents from a set of Lucene indices for Nutch segments, so it inherently requires access to all the segments at once. It's a batch-mode process that has to be run before running searches to prevent the search from returning duplicate documents. Link Analysis: Nutch includes a link analysis algorithm similar to PageRank. Distributed link analysis is a bulk synchronous parallel process. At the beginning of each phase, the list of URLs whose scores must be updated is divided up into many chunks; in the middle, many processes produce score-edit files by finding all the links into pages in their particular chunk. At the end, an updating phase reads the score- edit files one at a time, merging their results into new scores for the pages in the web database. Searching: Nutch's search user interface runs as a Java Server Page (JSP) that parses the user's textual query and invokes the search method of a NutchBean. If Nutch is running on a single server, this translates the user's query into a Lucene query and gets a list of hits from Lucene, which the JSP then renders into HTML. If Nutch is instead distributed across several servers, the NutchBean's search method instead remotely invokes the search methods of other NutchBeans on other machines, which can be
  • 4. configured either to perform the search locally as described above or farm pieces of the work out to yet other servers. Summarizing: Summaries on a results page are designed to avoid click throughs. By providing as much relevant information as possible in a small amount of text, they help users improve precision. Nutch's summarizer works by retokenizing a text string containing the entire original document, extracting a minimal set of excerpts containing five words of context on each side of each hit in the document, then deciding which excerpts to include in the final summary. It orders the excerpts with the best ones first, preferring longer excerpts over shorter ones, and excerpts with more hits above excerpts with fewer, and then it truncates the total summary to a maximum of twenty words.