2. Indexing
• Indexing collects, parses, and stores data to facilitate fast and
accurate information retrieval.
• The purpose of storing an index is to optimize speed and performance in
finding documents.
• Without an index, the search engine would have to scan every document
for each query.
• The additional computer storage required to store the index, as well as the
considerable increase in the time required for an update to take place, are
traded off for the time saved during information retrieval.
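
To make the trade-off concrete, here is a minimal sketch in Python of an inverted index next to a full scan. It is an illustration only, not how Solr or Lucene store their indexes; the function names and the three sample documents are made up for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it.

    This is the extra storage and update cost paid up front.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_scan(docs, term):
    """No index: every document is inspected on every query."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items()
            if term in text.lower().split()}

def search_index(index, term):
    """With an index: one dictionary lookup, independent of corpus size."""
    return set(index.get(term.lower(), set()))

# Hypothetical sample documents
docs = {
    1: "Hadoop stores large data sets",
    2: "Solr indexes documents for search",
    3: "Hadoop and Solr work together",
}
index = build_index(docs)
print(sorted(search_index(index, "hadoop")))  # [1, 3]
```

Both search functions return the same ids; the difference is that `search_scan` touches every document per query, while `search_index` pays that cost once, at indexing time.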
3. Why Hadoop + Solr?
• A data set can outgrow the storage capacity of a single physical machine.
• Distributed filesystems are more complex than regular disk filesystems.
• One of the biggest challenges is making the filesystem tolerate node
failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS.
• HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
• Hadoop doesn’t require expensive, highly reliable hardware to run on.
4. Why Hadoop + Solr? (continued)
• A program written in other frameworks may require large amounts of
refactoring when scaling from ten to one hundred or one thousand
machines.
• This may mean rewriting the program several times.
• Hadoop is specifically designed to have a very flat scalability curve.
• In Hadoop, very little--if any--work is required for that same program to
run on a much larger amount of hardware.
• The Hadoop platform manages the data and hardware resources and
provides dependable performance growth proportionate to the number of
machines available.
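
The idea that the same program runs unchanged on more machines can be sketched with a word count in the map/reduce style. The sketch below is plain Python running in a single process, not Hadoop's API; `num_workers` is a hypothetical stand-in for the number of machines, and the input lines are invented.

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    """Count words within one partition of the input."""
    return Counter(word for line in partition for word in line.lower().split())

def reduce_phase(partials):
    """Merge the per-partition counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())

def word_count(lines, num_workers):
    # Split the input round-robin into num_workers partitions.
    # Only this number changes when "scaling"; the map and reduce
    # functions stay exactly the same.
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    return reduce_phase(map_phase(p) for p in partitions)

# Hypothetical input
lines = ["hadoop scales out", "solr indexes data", "hadoop stores data"]
assert word_count(lines, 1) == word_count(lines, 3)
print(word_count(lines, 3)["hadoop"])  # 2
```

In a real cluster the framework, not the program, decides how the input is split and where each map task runs, which is why the program itself needs little or no change.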
5. Why Hadoop + Solr? (continued)
• HDFS is highly fault-tolerant.
• It is suitable for applications with large data sets.
• A web browser can be used to browse the files of an HDFS instance.
• Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
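
As a rough illustration of automatic recovery, the toy simulation below keeps three copies of every block (three is the default HDFS replication factor) and restores the missing copies when a node fails. It is a simplified sketch, not the actual HDFS placement or recovery logic; the node names and function names are hypothetical.

```python
import random

REPLICATION = 3  # default HDFS replication factor

def place_blocks(block_ids, nodes, rng):
    """Assign each block to REPLICATION distinct nodes (random placement)."""
    return {b: set(rng.sample(nodes, REPLICATION)) for b in block_ids}

def handle_failure(placement, live_nodes, failed):
    """Remove the failed node and re-replicate under-replicated blocks."""
    live = [n for n in live_nodes if n != failed]
    for holders in placement.values():
        holders.discard(failed)
        # Nodes that are alive and do not yet hold this block
        candidates = [n for n in live if n not in holders]
        while len(holders) < REPLICATION and candidates:
            holders.add(candidates.pop())
    return live

rng = random.Random(0)  # fixed seed so the run is repeatable
nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
placement = place_blocks(range(8), nodes, rng)
live = handle_failure(placement, nodes, "node2")

# Every block is back to full replication and no copy is on the failed node.
assert all(len(h) == REPLICATION and "node2" not in h
           for h in placement.values())
```

In HDFS itself, the NameNode notices a failed DataNode through missing heartbeats and schedules the re-replication; the simulation collapses that into a single function call.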
6. Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
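
The HTTP and JSON interfaces can be sketched as follows. The example builds a query URL for Solr's `/select` handler and reads document ids from a JSON response. The host, port, core name (`collection1`), field name, and the sample response body are assumptions for illustration; the response is hand-written in the general shape Solr returns, not real server output, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

def select_url(base_url, core, query, rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/{core}/select?{urlencode(params)}"

def doc_ids(body):
    """Extract the 'id' field of each document in a JSON response body."""
    return [doc["id"] for doc in json.loads(body)["response"]["docs"]]

# Assumed local Solr instance and core name
url = select_url("http://localhost:8983/solr", "collection1", "title:hadoop")
print(url)
# http://localhost:8983/solr/collection1/select?q=title%3Ahadoop&wt=json&rows=10

# Hand-written sample response, for illustration only
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "doc1"}]}}'
print(doc_ids(sample))  # ['doc1']
```

Because the interface is plain HTTP, the same URL can be opened with any HTTP client, for example `urllib.request.urlopen(url)`, once a Solr server is running at that address.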
7. Solr cloud
• New in Solr 4.0
• Easier scaling
• Centralized config
• Fault tolerant indexing and querying
• Uses Apache ZooKeeper as its registry