Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana

Searching Information Inside Hadoop Platform
Abinasha Karana, Director-Technology
Bizosys Technologies Pvt Ltd.
abinash@bizosys.com
www.bizosys.com

To search a large dataset inside HDFS and HBase, at Bizosys we started with Map-Reduce and Lucene/Solr.

Map-Reduce: what didn't work for us. Results were not available at a mouse click.

Lucene/Solr: what didn't work for us. It required vertical scaling with manual sharding, and subsequent resharding as data grew.

What we did: we built a new search for the Hadoop platform (HDFS and HBase).

In the next few slides you will hear what I learned from designing, developing and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase.

Key Learning
1. Using SSD is a design decision.
2. Methods to reduce HBase table storage size.
3. Serving a request without accessing ALL region servers.
4. Methods to move processing near the data.
5. Byte block caching to lower network and I/O trips to HBase.
6. Configuration to balance network vs. CPU vs. I/O vs. memory.

1. Using SSD is a design decision
SSD improved HSearch response time by 66% over SATA. However, SSD is costlier. In the HBase table schema design we considered "Data Access Frequency", "Data Size" and "Desired Response Time" to deploy SSD selectively.

Our SSD-friendly schema design
Keyword: a query reads all of it. Document: a query reads 10 docs.
Keyword + Document in 1 table: SSD deployment is all or none.
Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.

2. Methods to reduce HBase table storage size
Storing a 4 byte cell requires >27 bytes in HBase.
Key length: 4 bytes
Value length: 4 bytes
Row length: 2 bytes
Row bytes: variable
Family length: 1 byte
Family bytes: variable
Qualifier bytes: variable
Timestamp: 8 bytes
Key type: 1 byte
Value bytes: 4 bytes (the cell value)
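The arithmetic behind that figure can be sketched as follows. This is a minimal illustration that only adds up the field widths listed on the slide; the class and method names are hypothetical and are not part of HBase or HSearch.

```java
public final class KeyValueSize {
    // Fixed-width fields of one stored cell, as listed on the slide.
    static final int KEY_LENGTH_FIELD = 4;
    static final int VALUE_LENGTH_FIELD = 4;
    static final int ROW_LENGTH_FIELD = 2;
    static final int FAMILY_LENGTH_FIELD = 1;
    static final int TIMESTAMP_FIELD = 8;
    static final int KEY_TYPE_FIELD = 1;

    // 4 + 4 + 2 + 1 + 8 + 1 = 20 bytes before any row, family, qualifier or value bytes.
    static final int FIXED_OVERHEAD = KEY_LENGTH_FIELD + VALUE_LENGTH_FIELD
            + ROW_LENGTH_FIELD + FAMILY_LENGTH_FIELD + TIMESTAMP_FIELD + KEY_TYPE_FIELD;

    /** Bytes needed for one cell, given the lengths of its variable parts. */
    static long sizeOf(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Smallest practical case: 1-byte row, family and qualifier, 4-byte value.
        System.out.println(sizeOf(1, 1, 1, 4));   // prints 27
        // Longer names (lengths here are made up for illustration) grow every cell.
        System.out.println(sizeOf(16, 7, 9, 4));  // prints 56
    }
}
```

Any row key longer than one byte pushes the cell past 27 bytes, which is why the names repeated in every cell matter so much for small values.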

How we reduced the table storage size to 1/3rd:
Stored large cell values by merging cells.
Reduced the family name to 1 character.
Reduced the qualifier name to 1 character.
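As a rough sketch of why merging cells helps, the example below packs many 4-byte values into one cell value and compares the stored size with keeping one cell per value. The 20-byte fixed overhead is the sum of the fixed-width cell fields (4 + 4 + 2 + 1 + 8 + 1); the class name, the packing layout and the sample lengths are assumptions for illustration, not the HSearch format or its measured savings.

```java
import java.nio.ByteBuffer;

public final class MergedCells {
    // Fixed-width cell fields: key length, value length, row length,
    // family length, timestamp and key type.
    static final int FIXED_OVERHEAD = 20;

    /** Packs int values back to back into a single cell value. */
    static byte[] merge(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Reads the packed values back out of a merged cell value. */
    static int[] split(byte[] merged) {
        ByteBuffer buf = ByteBuffer.wrap(merged);
        int[] values = new int[merged.length / Integer.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }

    /** Stored size of one cell with the given variable-part lengths. */
    static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        int[] values = new int[100];
        byte[] merged = merge(values);
        // 8-byte row key, 1-character family and qualifier (illustrative lengths).
        long separate = values.length * cellSize(8, 1, 1, Integer.BYTES);
        long together = cellSize(8, 1, 1, merged.length);
        System.out.println(separate + " bytes as separate cells, " + together + " bytes merged");
    }
}
```

The per-cell overhead is paid once instead of once per value; the trade-off is that a reader has to fetch and unpack the whole merged value even when it needs a single entry.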

3. Serving a request without accessing ALL region servers
Consider a 100-node HBase cluster in which a single search request needs to access every node. Bad design: clogged network, no scaling.

And our solution: the index table was divided on column family into separate tables.
[Figure: row ranges 0-1 M through 4-5 M spread over Machines 1 to 5. With Family A and Family B in one table, a scan of "Family A" hit 5 machines. With Table A and Table B as separate tables, a scan of Table A hit 3 machines.]

4. Methods to move processing near the data
E.g. matched rows for a keyword:
public class TermFilter implements Filter {
    // Runs on the region server for each KeyValue the scan visits.
    public ReturnCode filterKeyValue(KeyValue kv) {
        boolean isMatched = isFound(kv);
        // Keep a matching KeyValue; otherwise skip ahead to the next row.
        if (isMatched) return ReturnCode.INCLUDE;
        return ReturnCode.NEXT_ROW;
    }
    …

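Send only the relevant section of a field over the network.
E.g. computing a best-match section from within a document for a given query:
public class DocFilter implements Filter {
    // Replaces the row's KeyValues with one KeyValue holding only the needed piece.
    public void filterRow(List<KeyValue> kvL) {
        byte[] val = extractNeededPiece(kvL);
        kvL.clear();
        // The qualifier argument is missing on the slide; "qual" is an assumed name.
        kvL.add(new KeyValue(row, fam, qual, val));
    }
    ….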
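The slide does not show how the needed piece is extracted. As a purely hypothetical stand-in, the sketch below returns a window of text around the first occurrence of a query term, which is the kind of trimming that lets a filter ship a small section instead of the whole field. The class name, the method and the radius parameter are assumptions for illustration, not HSearch code.

```java
public final class SnippetExtractor {
    /**
     * Returns the text within `radius` characters of the first exact
     * (case-sensitive) match of `term`, or the head of the document
     * when the term does not occur.
     */
    static String extract(String doc, String term, int radius) {
        int hit = doc.indexOf(term);
        if (hit < 0) {
            // No match: fall back to the first 2 * radius characters.
            return doc.substring(0, Math.min(doc.length(), 2 * radius));
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(doc.length(), hit + term.length() + radius);
        return doc.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(extract("the quick brown fox", "brown", 3)); // prints "ck brown fo"
    }
}
```

A real best-match computation would score several candidate windows against all query terms; the point here is only that the trimming happens where the data lives, so fewer bytes cross the network.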

5. Byte block caching to lower network and I/O trips to HBase
Object caching: with a growing number of objects we encountered "Out of Memory" exceptions.
HBase commit: frequent flushing to HBase introduced network and I/O latencies.
Converting objects to intermediate byte blocks increased record processing by 20x in 1 batch.
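One way to picture the byte-block idea is the sketch below: records are appended to a fixed-size byte buffer instead of being held as individual objects, and a full buffer is handed off as one block. The record layout (an int document id plus a short weight), the class name and the in-memory list standing in for an HBase write are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public final class ByteBlockBuffer {
    // Hypothetical record layout: 4-byte document id + 2-byte weight.
    static final int RECORD_BYTES = Integer.BYTES + Short.BYTES;

    private final ByteBuffer block;
    // Stands in for the store; a real system would issue one write per block.
    private final List<byte[]> flushed = new ArrayList<>();

    ByteBlockBuffer(int recordsPerBlock) {
        this.block = ByteBuffer.allocate(recordsPerBlock * RECORD_BYTES);
    }

    /** Appends one record, flushing first if the block has no room left. */
    void add(int docId, short weight) {
        if (block.remaining() < RECORD_BYTES) {
            flush();
        }
        block.putInt(docId).putShort(weight);
    }

    /** Hands off the bytes written so far as one block and resets the buffer. */
    void flush() {
        if (block.position() == 0) {
            return; // nothing buffered
        }
        byte[] out = new byte[block.position()];
        block.flip();
        block.get(out);
        block.clear();
        flushed.add(out);
    }

    List<byte[]> flushedBlocks() {
        return flushed;
    }

    public static void main(String[] args) {
        ByteBlockBuffer buffer = new ByteBlockBuffer(2);
        for (int i = 0; i < 5; i++) {
            buffer.add(i, (short) 1);
        }
        buffer.flush();
        System.out.println(buffer.flushedBlocks().size() + " blocks written for 5 records");
    }
}
```

Memory use stays bounded by the block size regardless of how many records pass through, and the number of trips to the store drops from one per record to one per block.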

6. Configuration to balance network vs. CPU vs. I/O vs. memory
In a single machine:
[Figure: disk I/O, memory, CPU and network, annotated with block caching, compression, aggressive GC and IPC caching.]

... and its settings
Network: increased IPC cache limits (hbase.client.scanner.caching).
CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap").
I/O: LZO index compression ("Inbuilt oberhumer LZO" or "Intel IPP native LZO").
Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.

... and parallelized to multiple machines

FUTURE: coprocessors (HBase 0.92 release).
Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count.
