Sept 17 2013 - THUG - HBase a Technical Introduction

  1. HBase Technical Deep Dive. Sept 17 2013 – Toronto Hadoop User Group. Adam Muise, amuise@hortonworks.com. Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  2. Deep Dive Agenda • Background – (how did we get here?) • High-level Architecture – (where are we?) • Anatomy of a RegionServer – (how does this thing work?) • Using HBase – (where do we go from here?)
  3. Background
  4. So what is a BigTable anyway? • BigTable paper from Google, 2006, Chang et al. – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” – http://research.google.com/archive/bigtable.html • Key Features: – Distributed storage across cluster of machines – Random, online read and write data access – Schemaless data model (“NoSQL”) – Self-managed data partitions
  5. Modern Datasets Break Traditional Databases > 10x more always-connected mobile devices than seen in PC era. > Sensor, video and other machine generated data easily exceeds 100TB / day. > Traditional databases can’t serve modern application needs.
  6. Apache HBase: The Database For Big Data. More data is the key to richer application experiences and deeper insights. With HBase you can: ✓ Ingest and retain more data, to petabyte scale and beyond. ✓ Store and access huge data volumes with low latency. ✓ Store data of any structure. ✓ Use the entire Hadoop ecosystem to gain deep insight on your data.
  7. HBase At A Glance. (diagram: client layer, HBase layer, HDFS layer) 1. Clients automatically load balanced across the cluster. 2. Scales linearly to handle any load. 3. Data stored in HDFS allows automated failover. 4. Analyze data with any Hadoop tool.
  8. HBase: Real-Time Data on Hadoop > Read, Write, Process and Query data in real time using Hadoop infrastructure.
  9. HBase: High Availability > Data safely protected in HDFS. > Failed nodes are automatically recovered. > No single point of failure, no manual intervention. (diagram: two HBase nodes replicating data over HDFS)
  10. HBase: Multi-Datacenter Replication > Replicate data to 2 or more datacenters. > Load balancing or disaster recovery.
  11. HBase: Seamless Hadoop Integration > HBase makes deep analytics simple using any Hadoop tool. > Query with Hive, process with Pig, classify with Mahout.
  12. Apache Hadoop in Review • Apache Hadoop Distributed Filesystem (HDFS) – Distributed, fault-tolerant, throughput-optimized data storage – Uses a filesystem analogy, not structured tables – The Google File System, 2003, Ghemawat et al. – http://research.google.com/archive/gfs.html • Apache Hadoop MapReduce (MR) – Distributed, fault-tolerant, batch-oriented data processing – Line- or record-oriented processing of the entire dataset – “[Application] schema on read” – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  13. High-level Architecture
  14. Logical Architecture • [Big]Tables consist of billions of rows, millions of columns • Records ordered by rowkey – Inserts require sort, write-side overhead – Applications can take advantage of the sort • Continuous sequences of rows partitioned into Regions – Regions partitioned at row boundary, according to size (bytes) • Regions automatically split when they grow too large • Regions automatically distributed around the cluster – “Hands-free” partition management (mostly)
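Because regions split at row boundaries by size, a table expecting heavy initial load can be pre-split at creation time instead of waiting for automatic splits. A minimal sketch against the 0.94-era Java admin API; the table name, column family, and split points are hypothetical, not from the deck:

    // Sketch: pre-splitting a table at creation so load spreads across
    // RegionServers immediately. Names and split points are hypothetical.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitSketch {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("Table_A");
        desc.addFamily(new HColumnDescriptor("cf1"));
        // Split points are row boundaries: this yields 4 initial regions.
        byte[][] splits = { Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("u") };
        admin.createTable(desc, splits);
        admin.close();
      }
    }

The design decision is choosing split points that match the expected rowkey distribution; a poor choice just recreates hot regions.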
  15. Logical Architecture: distributed, persistent partitions of a BigTable. (diagram: Table A split into Regions 1-4, assigned across Region Servers 7, 86, and 367) Legend: - A single table is partitioned into Regions of roughly equal size. - Regions are assigned to Region Servers across the cluster. - Region Servers host roughly the same number of regions.
  16. Physical Architecture • RegionServers collocate with DataNode – Tight MapReduce integration – Opportunity for data-local online processing via coprocessors (experimental) • HBase Master process manages Region assignment • ZooKeeper acts as configuration glue • Clients communicate directly with RegionServers (data path) – Horizontally scale client load – Significantly harder for a single ignorant process to DoS the cluster • For DDL operations, clients communicate with the HBase Master • No persistent state in Master or ZooKeeper – Recover from HDFS snapshot – See also: AWS Elastic MapReduce's HBase restore path
  17. Physical Architecture: distribution and data path. (diagram: ZooKeeper quorum; HBase clients via Java app, HBase shell, and REST/Thrift gateway; RegionServers collocated with DataNodes; HMaster with NameNode) Legend: - An HBase RegionServer is collocated with an HDFS DataNode. - HBase clients communicate directly with Region Servers for sending and receiving data. - HMaster manages Region assignment and handles DDL operations. - Online configuration state is maintained in ZooKeeper. - HMaster and ZooKeeper are NOT involved in data path.
  18. Logical Data Model • Table as a sorted map of maps {rowkey => {family => {qualifier => {version => value}}}} – Think: nested OrderedDictionary (C#), TreeMap (Java) • Basic data operations: GET, PUT, DELETE • SCAN over range of key-values – benefit of the sorted rowkey business – this is how you implement any kind of “complex query” • GET, SCAN support Filters – Push application logic to RegionServers • INCREMENT, CheckAnd{Put,Delete} – Server-side, atomic data operations – Require read lock, can be contentious • No: secondary indices, joins, multi-row transactions
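A minimal sketch of these basic operations with the 0.94-era Java client; the table name, family, qualifiers, and values are hypothetical:

    // Sketch: GET/PUT/SCAN/DELETE with the 0.94-era client.
    // Table, family, and qualifier names are hypothetical examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DataModelSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Table_A");

        // PUT: write a value addressed by {rowkey, family, qualifier}
        Put put = new Put(Bytes.toBytes("a"));
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("foo"), Bytes.toBytes("hello"));
        table.put(put);

        // GET: point read of a single row
        Result row = table.get(new Get(Bytes.toBytes("a")));
        byte[] value = row.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("foo"));

        // SCAN: range query over the sorted rowkeys -- the building
        // block for any "complex query"
        Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("n"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          // rows arrive in rowkey order
        }
        scanner.close();

        // DELETE: remove a row (or narrower: a family, qualifier, version)
        table.delete(new Delete(Bytes.toBytes("a")));
        table.close();
      }
    }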
  19. Logical Data Model: a sparse, multi-dimensional, sorted map. (diagram: example Table A showing rowkeys a and b, column families cf1 and cf2, qualifiers, timestamps, and values such as strings, numbers, and image bytes) Legend: - Rows are sorted by rowkey. - Within a row, values are located by column family and qualifier. - Values also carry a timestamp; there can be multiple versions of a value. - Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
  20. Anatomy of a RegionServer
  21. Storage Machinery • RegionServers host N Regions, as assigned by Master – Common case, Region data is local to the RegionServer/DataNode • Each column family stored in isolation of others – “column-family oriented” storage – NOT the same as column-oriented storage • Key-values managed by “HStore” – combined view over data on disk + in-memory edits – region manages one HStore for each column family • On disk: key-values stored sorted in “StoreFiles” – StoreFiles composed of ordered sequence of “Blocks” – also carries BloomFilter to minimize Block access • In memory: “MemStore” maintains heap of recent edits – not to be confused with “BlockCache” – this structure is essentially a log-structured merge tree (LSM-tree)* with MemStore C0 and StoreFiles C1 * http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
  22. Storage Machinery: implementing the data model. (diagram: RegionServer containing one HLog (WAL), one BlockCache, and multiple HRegions; each HRegion contains HStores, each with a MemStore and StoreFiles/HFiles on HDFS) Legend: - A RegionServer contains a single WAL, single BlockCache, and multiple Regions. - A Region contains multiple Stores, one for each Column Family. - A Store consists of multiple StoreFiles and a MemStore. - A StoreFile corresponds to a single HFile. - HFiles and WAL are persisted on HDFS.
  23. Write Path (Storage Machinery cont.) • Write summary: 1. Log edit to HLog (WAL) 2. Record in MemStore 3. ACK write • Data events recorded to a WAL on HDFS, for durability – After a failure, edits in WAL are replayed during recovery – WAL appends are immediate, in critical write-path • Data collected in “MemStore”, until a “flush” writes new HFiles – Flush is automatic, based on configuration (size, or staleness interval) – Flush clears WAL entries corresponding to MemStore entries – Flush is deferred, not in critical write-path • HFiles are merge-sorted during “Compaction” – Small files compacted into larger files – old records discarded (major compaction only) – Lots of disk and network IO
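Because the WAL append sits in the critical write path, the 0.94-era client exposed a per-Put escape hatch that trades durability for throughput. A sketch, reusing the table handle and imports from the earlier data-model example; row and column names are hypothetical:

    // Sketch: skipping step 1 (the WAL append). The edit then lives only
    // in the MemStore until flush, so it is lost if the RegionServer
    // crashes. Names are hypothetical.
    Put put = new Put(Bytes.toBytes("sensor-42"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("reading"), Bytes.toBytes("13.6"));
    put.setWriteToWAL(false); // faster, but unsafe for data you can't replay
    table.put(put);           // still recorded in MemStore (step 2) and ACKed (step 3)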
  24. Write Path: storing a KeyValue. (diagram: numbered flow through the RegionServer) 1. A MutateRequest is received by the RegionServer. 2. A WALEdit is appended to the HLog. 3. The new KeyValues are written to the MemStore. 4. The RegionServer acknowledges the edit with a MutateResponse.
  25. Read Path (Storage Machinery, cont.) • Read summary: 1. Evaluate query predicate 2. Materialize results from Stores 3. Batch results to client • Scanners opened over all relevant StoreFiles + MemStore – “BlockCache” maintains recently accessed Blocks in memory – BloomFilter used to skip irrelevant Blocks – Predicate matches accumulate and are sorted to return ordered rows • Same Scanner APIs used for GET and SCAN – Different access patterns, different optimization strategies – SCAN: HDFS optimized for throughput of long sequential reads; consider larger Block size for more data per seek – GET: BlockCache maintains hot Blocks for point access; consider more granular BloomFilter
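A sketch of tuning the two access patterns from the client side, reusing the earlier setup; setCaching and setCacheBlocks are the era's Scan knobs, and the row range is hypothetical:

    // Sketch: SCAN vs. GET tuning (hypothetical table and rows).
    // SCAN: long sequential read -- fetch many rows per RPC and keep
    // one-pass blocks out of the BlockCache.
    Scan scan = new Scan(Bytes.toBytes("a"), Bytes.toBytes("n"));
    scan.setCaching(500);        // rows returned per RPC round-trip
    scan.setCacheBlocks(false);  // don't evict hot blocks for a one-off scan
    ResultScanner scanner = table.getScanner(scan);

    // GET: point read -- BlockCache and BloomFilters do the heavy lifting.
    Result r = table.get(new Get(Bytes.toBytes("row-key")));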
  26. Read Path: serving a single read request. (diagram: numbered flow through the RegionServer) 1. A GetRequest is received by the RegionServer. 2. StoreScanners are opened over appropriate StoreFiles and the MemStore. 3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache. 4. KeyValues are merged into the final set of Results. 5. A GetResponse containing the Results is returned to the client.
  27. Using HBase
  28. For what kinds of workloads is it well suited? • It depends on how you tune it, but… • HBase is good for: – Large datasets – Sparse datasets – Loosely coupled (denormalized) records – Lots of concurrent clients • Try to avoid: – Small datasets (unless you have *lots* of them) – Highly relational records – Schema designs requiring transactions
  29. HBase Use Cases: Flexible Schema, Huge Data Volume, High Read Rate, High Write Rate. Examples: Machine-Generated Data, Distributed Messaging, Real-Time Analytics, Object Store, User Profile Management.
  30. HBase Example Use Case: Major Hard Drive Manufacturer • Goal: detect defective drives before they leave the factory. • Solution: – Stream sensor data to HBase as it is generated by their test battery. – Perform real-time analysis as data is added and deep analytics offline. • HBase a perfect fit: – Scalable enough to accommodate all 250+ TB of data needed. – Seamless integration with Hadoop analytics tools. • Result: – Went from processing only 5% of drive test data to 100%.
  31. Other Example HBase Use Cases • Facebook messaging and counts • Time series data • Exposing Machine Learning models (like risk sets) • Large message set store and forward, especially in social media • Geospatial indexing • Indexing the Internet
  32. How does it integrate with my infrastructure? • Horizontally scale application data – Highly concurrent, read/write access – Consistent, persisted shared state – Distributed online data processing via Coprocessors (experimental) • Gateway between online services and offline storage/analysis – Staging area to receive new data – Serve online “views” on datasets in HDFS – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems (see the MapReduce sketch below)
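One way that glue looks in code: scanning an HBase table as MapReduce input via TableMapReduceUtil. A sketch using the 0.94-era MapReduce integration; the table name and the counter-based row-count job are hypothetical:

    // Sketch: an HBase table as MapReduce input (0.94-era APIs).
    // Table name and the row-counting logic are hypothetical.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class RowCountSketch {
      static class CountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowkey, Result columns, Context ctx)
            throws IOException, InterruptedException {
          ctx.getCounter("sketch", "rows").increment(1); // tally each row scanned
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(RowCountSketch.class);
        Scan scan = new Scan();
        scan.setCaching(500);       // batch rows per RPC for throughput
        scan.setCacheBlocks(false); // batch jobs shouldn't churn the BlockCache
        TableMapReduceUtil.initTableMapperJob(
            "Table_A", scan, CountMapper.class,
            NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }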
  33. What data semantics does it provide? • GET, PUT, DELETE key-value operations • SCAN for queries • INCREMENT, CAS server-side atomic operations • Row-level write atomicity • MapReduce integration
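A sketch of the server-side atomic operations, reusing the table handle from the earlier data-model example; rows and columns are hypothetical:

    // INCREMENT: atomic counter, no client-side read-modify-write race.
    long views = table.incrementColumnValue(
        Bytes.toBytes("page-1"), Bytes.toBytes("cf1"), Bytes.toBytes("views"), 1L);

    // CAS (checkAndPut): apply the Put only if the current value matches.
    Put activate = new Put(Bytes.toBytes("user-7"));
    activate.add(Bytes.toBytes("cf1"), Bytes.toBytes("state"), Bytes.toBytes("active"));
    boolean applied = table.checkAndPut(
        Bytes.toBytes("user-7"), Bytes.toBytes("cf1"), Bytes.toBytes("state"),
        Bytes.toBytes("pending"), activate); // expected current value, then the Put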
  34. Creating a table in HBase:

    #!/bin/sh
    # Small script to setup the hbase table used by OpenTSDB.

    test -n "$HBASE_HOME" || {  #A
      echo >&2 'The environment variable HBASE_HOME must be set'
      exit 1
    }
    test -d "$HBASE_HOME" || {
      echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
      exit 1
    }

    TSDB_TABLE=${TSDB_TABLE-'tsdb'}
    UID_TABLE=${UID_TABLE-'tsdb-uid'}
    COMPRESSION=${COMPRESSION-'LZO'}

    exec "$HBASE_HOME/bin/hbase" shell <<EOF
    create '$UID_TABLE',                                #B
      {NAME => 'id', COMPRESSION => '$COMPRESSION'},    #B
      {NAME => 'name', COMPRESSION => '$COMPRESSION'}   #B
    create '$TSDB_TABLE',                               #C
      {NAME => 't', COMPRESSION => '$COMPRESSION'}      #C
    EOF

    #A From environment, not parameter
    #B Make the tsdb-uid table with column families id and name
    #C Make the tsdb table with the t column family

    # Script taken from HBase in Action - Chapter 7
  35. Coprocessors in a nutshell • Two types of coprocessors: Observers and Endpoints • Coprocessors are Java code executed in each region server • Observer – Similar to a database trigger – Available Observer types: RegionObserver, WALObserver, MasterObserver – Mainly used to extend pre/post logic within region server events, WAL events, or DDL events (see the sketch after this list) • Endpoint – Sort of like a UDF – Extends the HBase client API to expose custom functions to users – Still executed on the RegionServer – Often used for sums/aggregations (HBase packs in an aggregate example) • BE VERY CAREFUL WITH COPROCESSORS – They run in your region servers and buggy code can take down your cluster – See HOYA details to help mitigate risk
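A minimal Observer sketch against the 0.94-era coprocessor API; the class and its validation rule are hypothetical illustrations, and real deployments load such classes onto RegionServers via table attributes or configuration:

    // Sketch: a RegionObserver acting like a BEFORE trigger (0.94-era API).
    // Class name and rejection rule are hypothetical.
    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    public class RejectEmptyRowObserver extends BaseRegionObserver {
      // Runs inside the RegionServer before every Put on hosted regions.
      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Put put, WALEdit edit, boolean writeToWAL)
          throws IOException {
        if (put.getRow() == null || put.getRow().length == 0) {
          throw new IOException("empty rowkey rejected by coprocessor");
        }
      }
    }

The Endpoint flavor instead extends the client-visible RPC surface; the aggregate example the slide mentions ships as AggregationClient/AggregateImplementation in that era's codebase.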
  36. What about operational concerns? • Balance memory and IO for reads – Contention between random and sequential access – Configure Block size, BlockCache based on access patterns (see the sketch below) – Additional resources – “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners – “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html • Balance IO for writes – Provision hardware with more spindles/TB – Configure L1 (compactions, region size, &c.) based on write pattern – Balance contention between maintaining L1 and serving reads – Additional resources – “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/ – “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
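A sketch of where those read-side knobs live in the 0.94-era admin API; the table and family names are hypothetical, while block size, BloomFilter type, and block caching are real per-column-family attributes:

    // Sketch: per-column-family read tuning (0.94-era admin API).
    // Table and family names are hypothetical.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class TuningSketch {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = new HTableDescriptor("events");
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setBlocksize(128 * 1024);                    // larger blocks favor sequential scans
        cf.setBloomFilterType(StoreFile.BloomType.ROW); // skip Blocks that can't match a GET
        cf.setBlockCacheEnabled(true);                  // keep hot Blocks in memory
        desc.addFamily(cf);
        admin.createTable(desc);
        admin.close();
      }
    }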
  37. Operational Tidbits • Decommissioning a node will result in a downed server; use “graceful_stop.sh” to offload the workload from the region server first • Use “zk_dump” to find all of your region servers and see how your ZooKeeper instances are faring • Use “status ‘summary’” or “status ‘detailed’” for a count of live/dead servers, average load, and file counts • Use “balancer” to automatically balance regions if HBase is set to auto-balance • When using “hbase hbck” to diagnose and fix issues, RTFM!
  38. SQL and HBase: Hive and Phoenix over HBase
  39. Phoenix over HBase • Phoenix is a SQL shim over HBase • https://github.com/forcedotcom/phoenix • HBase has fast write capabilities, so Phoenix allows for fast simple queries (no joins) and fast upserts • Phoenix implements its own JDBC driver so you can use your favorite tools
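A sketch of that JDBC path from Java, using the driver class name from the 2013 forcedotcom-era releases; the ZooKeeper host, table, and columns are hypothetical, and the table is assumed to already exist:

    // Sketch: querying HBase through Phoenix's JDBC driver.
    // Host, table, and columns are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PhoenixSketch {
      public static void main(String[] args) throws Exception {
        // Driver class from the forcedotcom-era releases; the URL format
        // is jdbc:phoenix:<zookeeper quorum>.
        Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
        Connection conn = DriverManager.getConnection("jdbc:phoenix:node1.hadoop");

        // UPSERT is Phoenix's insert-or-update statement.
        PreparedStatement up = conn.prepareStatement(
            "UPSERT INTO metrics (host, ts, value) VALUES (?, ?, ?)");
        up.setString(1, "web01");
        up.setLong(2, System.currentTimeMillis());
        up.setDouble(3, 13.6);
        up.executeUpdate();
        conn.commit(); // Phoenix batches mutations until commit

        ResultSet rs = conn.createStatement().executeQuery(
            "SELECT host, value FROM metrics WHERE host = 'web01'");
        while (rs.next()) {
          System.out.println(rs.getString(1) + " = " + rs.getDouble(2));
        }
        conn.close();
      }
    }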
  40. Phoenix over HBase (architecture diagram)
  41. Hive over HBase • Hive can be used directly with HBase • Hive queries the table through the “HBaseStorageHandler”, which supplies the MapReduce input/output formats • Storage Handler has hooks for – Getting input / output formats – Metadata operations: CREATE TABLE, DROP TABLE, etc. • Storage Handler is a table-level concept – Does not support Hive partitions or buckets • Hive does not need to include all columns from the HBase table
  42. Hive over HBase (architecture diagram)
  43. Hive over HBase (architecture diagram, continued)
  44. Hive and Phoenix over HBase:

    > hive
    add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
    add jar /usr/lib/hbase/lib/zookeeper.jar;
    add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
    add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;

    set hbase.zookeeper.quorum=node1.hadoop;

    CREATE EXTERNAL TABLE phoenix_mobilelograw(
      key string, ip string, ts string, code string,
      d1 string, d2 string, d3 string, d4 string, properties string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
    TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

    set hive.hbase.wal.enabled=false;
    INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
    set hive.hbase.wal.enabled=true;
  45. HBase Roadmap
  46. Hortonworks Focus Areas for HBase • Simplified Operations: – Intelligent Compaction – Automated Rebalancing – Ambari Management: Snapshot / Revert, Multimaster HA, Cross-site Replication, Backup / Restore – Ambari Monitoring: Latency metrics, Throughput metrics, Heatmaps, Region visualizations • Database Functionality: – First-Class Datatypes – SQL Interface Support – Indexes – Security: Encryption, More Granular Permissions – Performance: Stripe Compactions, Short Circuit Read for Hadoop 2, Row and Entity Groups – Deeper Hive/Pig Interop
  47. HBase Roadmap Details: Operations • Snapshots: – Protect data or restore to a point in time. • Intelligent Compaction: – Compact when the system is lightly utilized. – Avoid “compaction storms” that can break SLAs. • Ambari Operational Improvements: – Configure multi-master HA. – Simple setup/configuration for replication. – Manage and schedule snapshots. – More visualizations, more health checks.
  48. HBase Roadmap Details: Data Management • Datatypes: – First-class datatypes offer performance benefits and better interoperability with tools and other databases. • SQL Interface (Preview): – SQL interface for simplified analysis of data within HBase. – JDBC driver allows embedding in existing applications. • Security: – Granular permissions on data within HBase.
  49. HOYA: HBase On YARN
  50. HOYA? • The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid; why not let HBase take advantage of this capability? • https://github.com/hortonworks/hoya/ • HOYA is a YARN application that provisions RegionServers based on an HBase cluster configuration • HOYA helps bring HBase into YARN resource management and paves the way for advanced resource management with HBase • HOYA can be used to spin up transient HBase clusters during MapReduce or other jobs
  51. A quick YARN refresher…
  52. The 1st Generation of Hadoop: Batch. HADOOP 1.0 was built for web-scale batch apps: a single batch app per HDFS silo. • All other usage patterns (interactive, online) must leverage that same infrastructure • Forces the creation of silos for managing mixed workloads
  53. A Transition From Hadoop 1 to 2. HADOOP 1.0: HDFS (redundant, reliable storage) beneath MapReduce (cluster resource management & data processing).
  54. A Transition From Hadoop 1 to 2 (cont.). HADOOP 1.0: HDFS (redundant, reliable storage) beneath MapReduce (cluster resource management & data processing). HADOOP 2.0: HDFS (redundant, reliable storage) beneath YARN (cluster resource management), with MapReduce and others (data processing) on top.
  55. The Enterprise Requirement: Beyond Batch. To become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service. (diagram: HDFS (redundant, reliable storage) serving BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads)
  56. YARN: Taking Hadoop Beyond Batch • Created to manage resource needs across all uses • Ensures predictable performance & QoS for all apps • Enables apps to run “IN” Hadoop rather than “ON” – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc. (diagram: applications run natively IN Hadoop on HDFS2 and YARN: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …))
  57. HOYA Architecture (diagram)
  58. Key HOYA Design Goals 1. Create on-demand HBase clusters 2. Maintain multiple HBase cluster configurations and implement them as required (e.g. high-load scenarios) 3. Isolation – sandbox clusters running different versions of HBase or with different coprocessors 4. Create transient HBase clusters for MapReduce or other processing 5. Elasticity of clusters for analytics, data-ingest, project-based work 6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant
  59. Time to call it an evening. We all have important work to do…
  60. Thank you…. For more information, check out HBase: The Definitive Guide or HBase in Action (hbaseinaction.com).