4. What didn’t work for us: Lucene/Solr. It required vertical scaling with manual sharding and subsequent resharding as data grew.
5. What we did: We built a new search for the Hadoop platform (HDFS and HBase).
6. In the next few slides you will hear what I learned from designing, developing and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase.
7. Key Learning: (1) Using SSD is a design decision. (2) Methods to reduce HBase table storage size. (3) Serving a request without accessing ALL region servers. (4) Methods to move processing near the data. (5) Byte block caching to lower network and I/O trips to HBase. (6) Configuration to balance network vs. CPU vs. I/O vs. memory.
8. (1) Using SSD is a design decision. SSD improved HSearch response time by 66% over SATA. However, SSD is costlier. In the HBase table schema design we considered “Data Access Frequency”, “Data Size” and “Desired Response Time” for selective SSD deployment.
9. .. Our SSD-friendly schema design. Keyword: reads all for a query. Document: reads 10 docs per query. Keyword + Document in 1 table: SSD deployment is all or none. Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.
10. (2) Methods to reduce HBase table storage size. Storing a 4 byte cell requires >27 bytes in HBase. Cell layout: Key length (4 bytes), Value length (4 bytes), Row length (2 bytes), Row bytes (variable), Family length (1 byte), Family bytes (variable), Qualifier bytes (variable), Timestamp (8 bytes), Key type (1 byte), Value bytes (4 bytes here).
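The arithmetic behind this slide can be sketched in a few lines of Java. This is a minimal illustration that adds up the field widths listed on the slide; the class and method names (`KeyValueSize`, `cellSize`) are made up for this example and are not part of HBase or HSearch.

```java
// Sketch: size of one stored cell, adding up the field widths from the slide.
final class KeyValueSize {
    // Fixed-width fields: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Total bytes for one cell, given the variable-length parts.
    static int cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Smallest case: 1-byte row key, family and qualifier, 4-byte value.
        System.out.println(cellSize(1, 1, 1, 4)); // prints 27
    }
}
```

Any realistic row key, family or qualifier is longer than one byte, which is why the stored size of a 4 byte value ends up above 27 bytes.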
11. .. to 1/3rd: Stored large cell values by merging cells. Reduced the family name to 1 character. Reduced the qualifier name to 1 character.
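To illustrate why merging cells helps, the sketch below packs many small values into one cell value and compares the stored size with one cell per value. It assumes 20 bytes of fixed per-cell overhead (the fixed-width fields of an HBase cell) and 4-byte integer values; the class `MergedCells` and its methods are hypothetical, and the numbers it produces are not meant to reproduce the 1/3rd figure from the slide.

```java
import java.nio.ByteBuffer;

// Sketch: one merged cell versus many small cells.
final class MergedCells {
    static final int FIXED_OVERHEAD = 20; // fixed-width fields paid once per cell

    // Pack many small values into a single cell value.
    static byte[] merge(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Unpack a merged cell value back into the individual values.
    static int[] split(byte[] merged) {
        ByteBuffer buf = ByteBuffer.wrap(merged);
        int[] out = new int[merged.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    // Bytes needed when every value is stored in its own cell.
    static long separateCellsSize(int n, int rowLen, int famLen, int qualLen) {
        return (long) n * (FIXED_OVERHEAD + rowLen + famLen + qualLen + Integer.BYTES);
    }

    // Bytes needed when all n values share one cell.
    static long mergedCellSize(int n, int rowLen, int famLen, int qualLen) {
        return FIXED_OVERHEAD + rowLen + famLen + qualLen + (long) n * Integer.BYTES;
    }
}
```

With 100 values, an 8-byte row key and 1-character family and qualifier names, the separate layout needs 3,400 bytes and the merged layout 430 bytes. The trade-off is that a reader must fetch and decode the whole merged value even when it needs only one entry.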
12. (3) Serving a request without accessing ALL region servers. Consider a 100-node HBase cluster where a single search request needs to access all of the nodes. Bad design.. Clogged network.. No scaling.
13. (3) And our solution… The index table was divided on column family into separate tables. Before: one table holding Family A and Family B, with row ranges 0-1 M, 1-2 M, 2-3 M, 3-4 M and 4-5 M on Machines 1 to 5, so a scan of “Family A” hit 5 machines. After: Table A and Table B are separate tables, and a scan of Table A hit 3 machines.
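The effect can be sketched with a small model of how row ranges map to machines. This is a simplified illustration and not HBase's actual region lookup: the class `RegionRouting` is hypothetical, and the row-range boundaries used for the split table in the usage note are assumed for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeMap;

// Sketch: which machines a scan over a row range has to contact.
final class RegionRouting {
    // Maps the first row of each row range to the machine serving it.
    private final TreeMap<Long, String> rangeStartToMachine = new TreeMap<>();

    void addRange(long startRow, String machine) {
        rangeStartToMachine.put(startRow, machine);
    }

    // Machines serving any row in [startRow, endRow).
    Set<String> machinesHit(long startRow, long endRow) {
        Set<String> hit = new LinkedHashSet<>();
        if (startRow >= endRow) {
            return hit;
        }
        // The range containing startRow begins at the greatest start <= startRow.
        Long first = rangeStartToMachine.floorKey(startRow);
        long from = (first != null) ? first : startRow;
        hit.addAll(rangeStartToMachine.subMap(from, true, endRow, false).values());
        return hit;
    }
}
```

For a combined table with five ranges of one million rows each on Machines 1 to 5, `machinesHit(0, 5_000_000)` returns five machines. If the same rows for one family live in a smaller table that needs only three ranges, the same scan returns three.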
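15. (4) Moving processing near the data: sent only the relevant section of a field over the network. E.g. computing a best-match section from within a document for a given query.
    public class DocFilter implements Filter {
        // Runs on the region server: replaces the row's cells with the one piece the client needs.
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            // qual is assumed here; the qualifier argument is missing on the slide.
            kvL.add(new KeyValue(row, fam, qual, val));
        }
        ….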
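As an illustration of what a piece-extraction step could do, here is a self-contained sketch that picks the window of consecutive words containing the most query terms. It is not the HSearch implementation: the class `BestMatchSection`, the fixed word window and the simple term-count scoring are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: choose the section of a document that best matches a query.
final class BestMatchSection {
    // Returns the first window of windowSize consecutive words that contains
    // the most query terms. Words are split on whitespace; punctuation is not stripped.
    static String extract(String document, String query, int windowSize) {
        String[] words = document.trim().split("\\s+");
        Set<String> terms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        int size = Math.max(0, Math.min(windowSize, words.length));
        int bestStart = 0;
        int bestScore = -1;
        for (int start = 0; start + size <= words.length; start++) {
            int score = 0;
            for (int i = start; i < start + size; i++) {
                if (terms.contains(words[i].toLowerCase(Locale.ROOT))) {
                    score++;
                }
            }
            if (score > bestScore) { // strict comparison keeps the earliest best window
                bestScore = score;
                bestStart = start;
            }
        }
        return String.join(" ", Arrays.copyOfRange(words, bestStart, bestStart + size));
    }
}
```

Running this kind of logic next to the data means only the short window travels over the network instead of the whole field.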
16. (5) Byte block caching to lower network and I/O trips to HBase. Object caching: with a growing number of objects we encountered ‘Out of Memory’ exceptions. HBase commit: frequent flushing to HBase introduced network and I/O latencies. Converting objects to intermediate byte blocks increased the number of records processed in 1 batch by 20x.
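The idea of buffering records as bytes instead of objects can be sketched as follows. This is a minimal illustration under assumptions, not the HSearch code: the record layout (document id, term length, term bytes), the class `ByteBlockBuffer` and the in-memory list that stands in for a write to HBase are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate records in one byte block and write it out in a single trip.
final class ByteBlockBuffer {
    private final ByteBuffer block;
    // Stand-in for the store: each element represents one write to HBase.
    final List<byte[]> flushedBlocks = new ArrayList<>();

    ByteBlockBuffer(int blockSizeBytes) {
        this.block = ByteBuffer.allocate(blockSizeBytes);
    }

    // Append one record as raw bytes; no per-record object is retained.
    // Terms are assumed to be shorter than 32 KB so the length fits in a short.
    void add(int docId, String term) {
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        int recordSize = Integer.BYTES + Short.BYTES + termBytes.length;
        if (recordSize > block.capacity()) {
            throw new IllegalArgumentException("record larger than block");
        }
        if (recordSize > block.remaining()) {
            flush(); // block is full: write it out before appending
        }
        block.putInt(docId).putShort((short) termBytes.length).put(termBytes);
    }

    // Hand the filled part of the block to the store and start a new block.
    void flush() {
        if (block.position() == 0) {
            return;
        }
        byte[] out = new byte[block.position()];
        block.flip();
        block.get(out);
        block.clear();
        flushedBlocks.add(out);
    }
}
```

A fixed-size block bounds memory use regardless of how many records pass through, and the store sees one write per block instead of one per record.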
17. (6) Configuration to balance network vs. CPU vs. I/O vs. memory in a single machine. Disk I/O: compression. Memory: block caching. CPU: aggressive GC. Network: IPC caching, compression.
18. … and its settings. Network: increased IPC cache limits (hbase.client.scanner.caching). CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap"). I/O: LZO index compression (“Inbuilt oberhumer LZO” or “Intel IPP native LZO”). Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.
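To show why the scanner caching setting matters for the network, the sketch below estimates how many client-to-server calls a scan needs. hbase.client.scanner.caching controls how many rows a scanner fetches per call; the helper `roundTrips` is hypothetical, the row counts are illustrative, and the estimate ignores the extra calls a real scan makes to open and close the scanner.

```java
// Sketch: approximate scanner round trips for a given caching value.
final class ScannerRoundTrips {
    // Calls needed to stream `rows` rows when each call returns up to `caching` rows.
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(1_000_000L, 1));    // prints 1000000
        System.out.println(roundTrips(1_000_000L, 1000)); // prints 1000
    }
}
```

Larger values cut round trips but make each response bigger and hold more rows in client memory, which is the network versus memory balance this slide describes.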
22. HSearch Benchmarks on AWS. Hardware: Amazon Large instances (7.5 GB memory) * 11 machines, each with a single 7.5K SATA drive. Data: 100 million Wikipedia pages, 270 GB in total, completely indexed (common stopwords included); 10 million pages repeated 10 times (total indexing time is 5 hours). Search query response time for a regular word is 1.5 sec. A common word such as “hill” found 1.6 million matches, which were sorted in 7 seconds.
23. www.sourceforge.net/bizosyshsearch Apache-licensed (2 versions released) Distributed, real-time search Supports XML documents with rich search syntax and various filtration criteria such as document type, field type.
24. References. Initial performance reports: Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories (July 2010). HSearch is currently in use at http://www.10screens.com More on HSearch: http://www.bizosys.com/blog/ http://bizosyshsearch.sourceforge.net/ More on SSD product technical specifications: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf