This document provides an overview of the basic system design for the Geliyoo search engine. It describes the key components, including Apache Nutch for crawling, Elasticsearch for indexing, and Apache Cassandra for data storage. It outlines the basic workflow: crawling websites to extract links and content, indexing the content using Elasticsearch, and serving searches over the indexed data. The system distributes tasks across Hadoop, Elasticsearch, and Cassandra clusters for scalability.
Basic System Flow:
Figure 2.0
Three procedures take place in the overall system:
1. Crawling
2. Indexing
3. Searching
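The three procedures above can be sketched end to end. This is a minimal in-memory illustration, not the actual Geliyoo implementation: in the real system Nutch performs the crawl, Elasticsearch builds the index, and Cassandra stores the data. The `WEB` dict and all URLs here are hypothetical seed data standing in for live websites.

```python
import re

# Hypothetical "web": url -> (page text, outgoing links)
WEB = {
    "http://a.example": ("Search engines crawl the web", ["http://b.example"]),
    "http://b.example": ("Crawled content is indexed for search", []),
}

def crawl(seed):
    """1. Crawling: follow links breadth-first, collecting page text."""
    seen, frontier, pages = set(), [seed], {}
    while frontier:
        url = frontier.pop(0)
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        text, links = WEB[url]
        pages[url] = text
        frontier.extend(links)
    return pages

def index(pages):
    """2. Indexing: build an inverted index mapping term -> set of URLs."""
    inv = {}
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            inv.setdefault(term, set()).add(url)
    return inv

def search(inv, query):
    """3. Searching: return URLs that contain every query term."""
    sets = [inv.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

idx = index(crawl("http://a.example"))
print(search(idx, "crawled indexed"))  # -> {'http://b.example'}
```

Each stage consumes the previous stage's output, which is what lets the real system split the three procedures across separate clusters.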
The overall system is divided into three clusters:
1. Hadoop Cluster
A small Hadoop cluster includes a single master and multiple worker nodes. The master node runs a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
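In classic Hadoop 1.x deployments (the generation that used the JobTracker/TaskTracker terminology above), role assignment of this kind is driven by plain-text host lists under `conf/`. The hostnames below are hypothetical:

```
# conf/masters - host(s) where the secondary NameNode runs
master01

# conf/slaves - hosts that start DataNode and TaskTracker daemons
worker01
worker02
worker03
```

Note that despite the name, `conf/masters` lists the secondary NameNode host(s); the NameNode and JobTracker run wherever the start scripts are invoked.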
In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the file-system index, plus a secondary NameNode that generates snapshots of the NameNode's memory structures, preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS is replaced by the file-system-specific equivalent.
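The NameNode/secondary NameNode arrangement described above is configured in `hdfs-site.xml`. The property names below are standard HDFS settings; the directory paths and interval are hypothetical example values, not Geliyoo's actual configuration:

```xml
<configuration>
  <!-- Where the NameNode persists the file-system index -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hdfs/name</value>
  </property>
  <!-- Where the secondary NameNode writes its checkpoint snapshots -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/data/hdfs/secondary</value>
  </property>
  <!-- How often (in seconds) the secondary NameNode takes a checkpoint -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
</configuration>
```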