Searching Information Inside Hadoop Platform
Abinasha Karana, Director-Technology, Bizosys Technologies Pvt Ltd.
abinash@bizosys.com
www.bizosys.com
To search a large dataset inside HDFS and HBase, we at Bizosys started with Map-Reduce and Lucene/Solr.
Map-Reduce: what didn't work for us. Results did not arrive at the click of a mouse.
Lucene/Solr: what didn't work for us. It required vertical scaling with manual sharding, and subsequent resharding as data grew.
What we did: we built a new search for the Hadoop platform (HDFS and HBase).
In the next few slides you will hear what I learned from designing, developing and benchmarking a distributed, real-time search engine whose index is stored in and served out of HBase.
Key Learning
1. Using SSD is a design decision.
2. Methods to reduce HBase table storage size.
3. Serving a request without accessing ALL region servers.
4. Methods to move processing near the data.
5. Byte block caching to lower network and I/O trips to HBase.
6. Configuration to balance network vs. CPU vs. I/O vs. memory.
1. Using SSD is a design decision
SSD improved HSearch response time by 66% over SATA. However, SSD is costlier. In the HBase table schema design we considered data access frequency, data size and desired response time, so that SSD could be deployed selectively.
.. Our SSD-friendly schema design
Keyword: reads all for a query.
Document: reads 10 docs per query.
Keyword + Document in 1 table: SSD deployment is all or none.
Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.
2. Methods to reduce HBase table storage size
Storing a 4-byte cell requires more than 27 bytes in HBase. Each stored cell carries these fields:
Key length: 4 bytes
Value length: 4 bytes
Row length: 2 bytes
Row: variable
Family length: 1 byte
Family: variable
Qualifier: variable
Timestamp: 8 bytes
Key type: 1 byte
Value: variable
.. to 1/3rd
We reduced table storage to one third by:
Storing large cell values by merging cells.
Reducing the family name to 1 character.
Reducing the qualifier name to 1 character.
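The per-cell overhead can be worked out with a little arithmetic. The sketch below assumes the HBase KeyValue field widths (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type). The row key length, the family and qualifier names, and the number of values per row are hypothetical, chosen only for illustration; they are not HSearch's actual schema.

```java
public class KeyValueSizeSketch {
    // Fixed per-cell overhead: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    public static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Serialized size of one cell, given the lengths of its variable parts. */
    public static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        int rowLen = 8;   // hypothetical row key length
        int values = 100; // hypothetical number of 4-byte values in one row

        // One cell per value, with long (hypothetical) family and qualifier names.
        long separate = values * cellSize(rowLen, "keywords".length(), "position".length(), 4);

        // All values merged into a single cell, with 1-character names.
        long merged = cellSize(rowLen, 1, 1, 4 * values);

        System.out.println("separate cells: " + separate + " bytes");
        System.out.println("merged cell:    " + merged + " bytes");
    }
}
```

With 1-byte row, family and qualifier, a 4-byte value already costs 27 bytes, which is why merging many small values into one cell and shortening the names pays off.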
3. Serving a request without accessing ALL region servers
Consider a 100-node HBase cluster in which a single search request needs to access every node. Bad design, clogged network, no scaling.
And our solution: the index table was divided on column family into separate tables.
Diagram, before: one table holding Family A and Family B, with row ranges 0-1 M, 1-2 M, 2-3 M, 3-4 M and 4-5 M on Machines 1 to 5. Scanning "Family A" hits 5 machines.
Diagram, after: Table A on Machines 1 to 3 and Table B on Machines 3 to 5. Scanning Table A hits 3 machines.
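The effect of the split can be illustrated with a toy model that counts how many distinct machines a full scan has to contact. This is only a sketch: the class, the table names and the region-to-machine assignment are hypothetical stand-ins for the layout in the diagram, not HBase API calls.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RegionFanOutSketch {
    /** One region: the table it belongs to and the machine that serves it. */
    public static final class Region {
        final String table;
        final String machine;
        Region(String table, String machine) {
            this.table = table;
            this.machine = machine;
        }
    }

    /** Number of distinct machines a full scan of the given table must contact. */
    public static int machinesHit(List<Region> regions, String table) {
        Set<String> machines = new HashSet<>();
        for (Region r : regions) {
            if (r.table.equals(table)) machines.add(r.machine);
        }
        return machines.size();
    }

    /** One index table with two families: five row ranges spread over five machines. */
    public static List<Region> singleTableLayout() {
        List<Region> regions = new ArrayList<>();
        for (int m = 1; m <= 5; m++) regions.add(new Region("Index", "Machine " + m));
        return regions;
    }

    /** The same data split by family into Table A (machines 1-3) and Table B (machines 3-5). */
    public static List<Region> splitLayout() {
        List<Region> regions = new ArrayList<>();
        for (int m = 1; m <= 3; m++) regions.add(new Region("A", "Machine " + m));
        for (int m = 3; m <= 5; m++) regions.add(new Region("B", "Machine " + m));
        return regions;
    }

    public static void main(String[] args) {
        System.out.println("single table: " + machinesHit(singleTableLayout(), "Index"));
        System.out.println("split tables: " + machinesHit(splitLayout(), "A"));
    }
}
```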
4. Methods to move processing near the data
E.g. matched rows for a keyword:
    public class TermFilter implements Filter {
        public ReturnCode filterKeyValue(KeyValue kv) {
            boolean isMatched = isFound(kv);
            if (isMatched) return ReturnCode.INCLUDE;
            return ReturnCode.NEXT_ROW;
        }
        …
    }
Sent only the relevant section of a field over the network.
E.g. computing the best-match section from within a document for a given query:
    public class DocFilter implements Filter {
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            kvL.add(new KeyValue(row, fam, qual, val));
        }
        ….
    }
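The slide does not show how the needed piece is extracted. As a minimal sketch of the idea, the helper below returns a window of text around the first occurrence of a query term, so that only that window, rather than the whole field, would travel over the network. The method name, the windowing rule and the use of the first match as "best" are assumptions for illustration, not HSearch's actual logic.

```java
import java.nio.charset.StandardCharsets;

public class BestSectionSketch {
    /**
     * Returns up to `radius` characters on each side of the first
     * case-sensitive occurrence of `term`, or an empty array if the
     * term does not occur in the field value.
     */
    public static byte[] extractSection(byte[] fieldValue, String term, int radius) {
        String text = new String(fieldValue, StandardCharsets.UTF_8);
        int hit = text.indexOf(term);
        if (hit < 0) return new byte[0];
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + term.length() + radius);
        return text.substring(start, end).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "The quick brown fox jumps".getBytes(StandardCharsets.UTF_8);
        byte[] piece = extractSection(doc, "brown", 4);
        System.out.println(new String(piece, StandardCharsets.UTF_8));
    }
}
```

Run inside a server-side filter, a helper like this shrinks each returned value from the full field to a few dozen bytes.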
5. Byte block caching to lower network and I/O trips to HBase
Object caching: with a growing number of objects we encountered "Out of Memory" exceptions.
HBase commit: frequent flushing to HBase introduced network and I/O latencies.
Converting objects to intermediate byte blocks increased record processing by 20x in 1 batch.
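A minimal sketch of the byte-block idea, assuming each record is a pair of ints (a hypothetical document id and term position): instead of holding one Java object per record, a whole batch is packed into a single byte array that can be cached or written in one trip. The record shape and the class are illustrative assumptions, not the HSearch format.

```java
import java.nio.ByteBuffer;

public class ByteBlockSketch {
    static final int RECORD_BYTES = 8; // two 4-byte ints per record

    /** Packs (docId, position) records into one contiguous byte block. */
    public static byte[] pack(int[][] records) {
        ByteBuffer buf = ByteBuffer.allocate(records.length * RECORD_BYTES);
        for (int[] r : records) {
            buf.putInt(r[0]); // docId
            buf.putInt(r[1]); // position
        }
        return buf.array();
    }

    /** Reads the records back out of a block produced by pack(). */
    public static int[][] unpack(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        int[][] records = new int[block.length / RECORD_BYTES][2];
        for (int[] r : records) {
            r[0] = buf.getInt();
            r[1] = buf.getInt();
        }
        return records;
    }

    public static void main(String[] args) {
        int[][] batch = { {101, 7}, {102, 19}, {205, 3} };
        byte[] block = pack(batch);
        System.out.println("records: " + batch.length + ", block bytes: " + block.length);
        System.out.println("first docId after unpack: " + unpack(block)[0][0]);
    }
}
```

One array per batch keeps heap pressure low compared with one object per record, and the block can be written as a single cell value.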
6. Configuration to balance network vs. CPU vs. I/O vs. memory
In a single machine (diagram): the resources Disk I/O, Memory, CPU and Network, with the tuning levers block caching, compression, aggressive GC and IPC caching.
… and its settings
Network: increased IPC cache limits (hbase.client.scanner.caching).
CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap").
I/O: LZO index compression ("Inbuilt oberhumer LZO" or "Intel IPP native LZO").
Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for the data node and region server.
.. and parallelized across multiple machines
ParallelHTable (scans).
Future: coprocessors (HBase 0.92 release).
Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count.
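The slides do not show how ParallelHTable is built. As a rough sketch of the general idea of fanning a scan out over row ranges and merging the results, the example below uses a plain thread pool. The scanRange method, its match condition and the row ranges are hypothetical placeholders for a per-region HBase scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanSketch {
    /** Stand-in for scanning one region: returns matching rows in [start, end). */
    public static List<Integer> scanRange(int start, int end) {
        List<Integer> hits = new ArrayList<>();
        for (int row = start; row < end; row++) {
            if (row % 1000 == 0) hits.add(row); // hypothetical match condition
        }
        return hits;
    }

    /** Scans all ranges concurrently and concatenates the results in range order. */
    public static List<Integer> parallelScan(int[][] ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int[] r : ranges) {
                final int start = r[0];
                final int end = r[1];
                Callable<List<Integer>> task = () -> scanRange(start, end);
                futures.add(pool.submit(task));
            }
            List<Integer> all = new ArrayList<>();
            for (Future<List<Integer>> f : futures) {
                all.addAll(f.get()); // blocks until that range has been scanned
            }
            return all;
        } catch (Exception e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] ranges = { {0, 2000}, {2000, 5000} };
        System.out.println(parallelScan(ranges, 2));
    }
}
```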


Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana

HSearch Benchmarks on AWS
Amazon Large instances with 7.5 GB memory: 11 machines, each with a single 7.5K SATA drive.
100 million Wikipedia pages, 270 GB in total, completely indexed (including common stopwords): 10 million pages repeated 10 times. Total indexing time was 5 hours.
Search query response time for a regular word is 1.5 seconds. A common word such as "hill" found 1.6 million matches, which were sorted in 7 seconds.

www.sourceforge.net/bizosyshsearch
Apache-licensed (2 versions released).
Distributed, real-time search.
Supports XML documents, with rich search syntax and various filtration criteria such as document type and field type.

References
Initial performance reports: Bizosys HSearch, a NoSQL search engine, featured in Intel Cloud Builders Success Stories (July 2010).
HSearch is currently in use at http://www.10screens.com
More on HSearch: http://www.bizosys.com/blog/ and http://bizosyshsearch.sourceforge.net/
More on SSD product technical specifications: http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf