HBaseConEast2016: HBase and Spark, State of the Art

HBaseCon East 2016
HBase and Spark, state of the art

About Jean-Marc Spaggiari
• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
  • Development
  • Team/Project manager
  • Architect
• O'Reilly author of Architecting HBase Applications
• International
  • Worked from Paris to Los Angeles
  • More than 100 flights per year
• HBase (and others) contributor

Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming Use case

Source of Demand
• Demand started in the field
• Use Cases
• API access: Gets, Puts, Scans
• MapReduce Mass Scans
• MapReduce Bulk Load
• MapReduce Smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
• Populating Aggregates
• Entity-centric time series data store, e.g. OpenTSDB
• Lookups for joins or mutations

How it Started
• Started on GitHub
• Andrew Purtell started the effort to bring it into HBase
• Big call out to Sean B, Jon H, Ted Y, Ted M and Matteo B
• Components
  • Normal Spark
  • Spark Streaming
  • Bulk Load
  • SparkSQL

Under the covers
[Diagram: a Driver (Configs) and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, and which runs the Tasks.]
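
The layout in the diagram, where an executor keeps a connection in static space, can be pictured in plain Java as a lazily created, JVM-wide singleton. This is only an illustration of the pattern, not the hbase-spark implementation: FakeConnection and ConnectionHolder are hypothetical names, and the real connection type is replaced by a stand-in so the sketch runs without a cluster.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HBase connection (expensive to create).
final class FakeConnection {
    // Counts how many connections were ever created in this JVM.
    static final AtomicInteger CREATED = new AtomicInteger();

    FakeConnection() {
        CREATED.incrementAndGet();
    }
}

// Hypothetical holder: one connection per JVM (executor), kept in a static field.
final class ConnectionHolder {
    private static FakeConnection connection;

    private ConnectionHolder() {}

    // Every task calls this; only the first call actually creates the connection.
    static synchronized FakeConnection get() {
        if (connection == null) {
            connection = new FakeConnection();
        }
        return connection;
    }
}
```

Tasks running in the same executor would each call ConnectionHolder.get() and reuse one connection instead of opening one per task.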

Key Addition: HBaseContext
Create an HBaseContext:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)

Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most of them in both Java and Scala

Foreach
Read data in parallel for each partition and compute (Scala):
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // conn is the HBase Connection supplied by the HBaseContext
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  // Push any buffered mutations before releasing the mutator
  bufferedMutator.flush()
  bufferedMutator.close()
})

Foreach
Read data in parallel for each partition and compute (Java):
hbaseContext.foreachPartition(keyValuesPuts,
    new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
  @Override
  public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
    // t._1() is the partition iterator, t._2() the HBase Connection
    BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
    while (t._1().hasNext()) {
      ... // HBase API put/incr/append/cas calls
    }
    mutator.flush();
    mutator.close();
  }
});

Map
Take a dataset and map it in parallel for each partition to produce a new RDD or process it:
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map(r => {
    ... // HBase API Scan Results
  })
})

BulkLoad
Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only):
rdd.hbaseBulkLoad(hbaseContext, tableName,
  t => {
    // Turn each record into a (row key, family, qualifier) -> value cell
    val rowKey = t._1
    val fam: Array[Byte] = t._2._1
    val qual = t._2._2
    val value = t._2._3
    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
    Seq((keyFamilyQualifier, value)).iterator
  }, stagingFolder)
// Load the HFiles written to the staging folder into the table
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))


BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3
      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)

BulkPut
Parallelized HBase multi-put (Scala):
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd, tableName,
  (putRecord) => {
    // putRecord is (row key, Array of (family, qualifier, value))
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })

BulkPut
Parallelized HBase multi-put (Java):
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
  @Override
  public Put call(String v1) throws Exception {
    // split() takes a regular expression, so the pipe must be escaped
    String[] tokens = v1.split("\\|");
    Put put = new Put(Bytes.toBytes(tokens[0]));
    put.addColumn(Bytes.toBytes("segment"),
        Bytes.toBytes(tokens[1]),
        Bytes.toBytes(tokens[2]));
    return put;
  }
});
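
Java's String.split treats its argument as a regular expression, in which | means alternation, so a pipe-delimited line needs the delimiter escaped or quoted. A small sketch of the difference follows; the PipeSplit class and the sample line are made up for illustration.

```java
import java.util.regex.Pattern;

final class PipeSplit {
    private PipeSplit() {}

    // Splits a line on the literal '|' character.
    static String[] parse(String line) {
        // Pattern.quote turns "|" into a literal; "\\|" would work as well.
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String line = "row1|qualifier|value";
        // Number of tokens with a literal delimiter
        System.out.println(parse(line).length);
        // Unescaped: the empty alternation matches between characters
        System.out.println(line.split("|").length);
    }
}
```

With the quoted delimiter the sample line yields the row key, the qualifier and the value as three tokens. The unescaped form breaks the line apart character by character.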

BulkDelete
Parallelized HBase multi-deletes:
// Through the HBaseContext
hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size
// Or through the implicit function on the RDD
rdd.hbaseBulkDelete(hbaseContext, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size
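
The last argument is the number of deletes grouped into one call to HBase. The grouping itself can be pictured with the following plain-Java sketch. Batcher is a hypothetical helper written for this illustration and is not part of hbase-spark, which does the batching internally.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class Batcher {
    private Batcher() {}

    // Groups the items of an iterator into lists of at most batchSize elements.
    static <T> List<List<T>> batches(Iterator<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        List<T> current = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            current.add(items.next());
            if (current.size() == batchSize) {
                // A full batch: this is where one call to HBase would be made
                result.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            // Remaining items form a last, smaller batch
            result.add(current);
        }
        return result;
    }
}
```

With a batch size of 4, ten row keys would be sent as batches of 4, 4 and 2.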

Table RDD
How to materialize a table as a Spark RDD.

What Improvements Have We Made?
▪ Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
▪ High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Tasks Run Only on the RegionServers Holding the Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full-Fledged DataFrame Support
• Spark-SQL
• Language Integrated Query
▪ Run on Top of Existing HBase Table
• Native Support for Java Primitive Types
▪ Still some work and improvements to be done
• HBASE-16638 Reduce the number of Connections
• HBASE-14217 Add Java access to Spark bulk load functionality
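
Partition pruning can be pictured as follows: given the start keys of a table's regions and the row-key range of a query, only the regions overlapping that range need a task. The sketch below is a toy model in plain Java, not the connector's code. RegionPruner, the region boundaries, and the use of String keys (HBase compares raw bytes) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RegionPruner {
    private RegionPruner() {}

    // regionStartKeys: sorted start keys; the first region starts at "" and
    // region i ends where region i + 1 starts (the last region is unbounded).
    // Returns the indexes of the regions overlapping [startRow, stopRow).
    static List<Integer> regionsFor(List<String> regionStartKeys, String startRow, String stopRow) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String regionStart = regionStartKeys.get(i);
            String regionEnd = i + 1 < regionStartKeys.size() ? regionStartKeys.get(i + 1) : null;
            // The region must begin before the scan stops ...
            boolean startsBeforeScanEnds = regionStart.compareTo(stopRow) < 0;
            // ... and end after the scan starts (region end keys are exclusive).
            boolean endsAfterScanStarts = regionEnd == null || regionEnd.compareTo(startRow) > 0;
            if (startsBeforeScanEnds && endsAfterScanStarts) {
                selected.add(i);
            }
        }
        return selected;
    }
}
```

For regions starting at "", "row020", "row040" and "row060", a scan over [row041, row051) selects only the third region (index 2), so a single task is enough.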

Data frame + HBase
WIP... 2.0?

Usage - Define the Catalog
HBase table to DataFrame table catalog mapping:
def catalog = s"""{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
    |"col2":{"cf":"cf1", "col":"col2", "type":"string"}
  |}
|}""".stripMargin

Usage - Write to HBase
Export an RDD into a new HBase table through a DataFrame:
sc.parallelize(data)
  .toDF
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
    HBaseTableCatalog.newTable -> "5")) // create the new table with 5 regions
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

Usage - Construct DataFrame
Load an HBase table as a DataFrame:
def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}

Usage - Language Integrated Query/SQL
val df = withCatalog(catalog)
val s = df.filter((($"col0" <= "row050" && $"col0" > "row040") ||
    $"col0" === "row005" ||
    $"col0" === "row020" ||
    $"col0" === "r20" ||
    $"col0" <= "row005") &&
    ($"col2" === 1 ||
    $"col2" === 42))
  .select("col0", "col1", "col2")
s.show

Usage - Language Integrated Query/SQL
// Load the DataFrame
val df = withCatalog(catalog)
// SQL example; the temp table is named table1 to avoid the SQL keyword "table"
df.registerTempTable("table1")
sqlContext.sql("select count(col1) from table1").show

Spark Streaming Example
[Diagram: Kafka producer, Spark Streaming, HBase, Solr]
