2. About Jean-Marc Spaggiari
• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
  • Development
  • Team/Project manager
  • Architect
• O'Reilly author of Architecting HBase Applications
• International
  • Worked from Paris to Los Angeles
  • More than 100 flights per year
• HBase (and others) contributor
HBaseCon East 2016
3. Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming Use case
4. Source of Demand
• Demand started in the field
• Use Cases
• API access: Gets, Puts, Scans
• MapReduce: mass scans
• MapReduce: bulk load
• MapReduce: smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
• Populating Aggregates
• Entity-centric time series data store, e.g. OpenTSDB
• Lookups for joins or mutations
5. How it Started
• Started on GitHub
• Andrew Purtell started the effort to bring it into HBase
• Big call out to Sean B, Jon H, Ted Y, Ted M and Matteo B
• Components
• Normal Spark
• Spark Streaming
• Bulk Load
• SparkSQL
6. Under the covers
[Diagram: a Driver with its Configs, and two Worker Nodes. Each Worker Node runs an Executor whose Static Space holds the Configs and an HConnection, alongside its Tasks.]
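The layout in the diagram can be approximated in plain Java. This is a minimal sketch, not the hbase-spark implementation: `FakeConnection`, `ConnectionHolder` and the `zk-host:2181` address are hypothetical stand-ins, used only to show one connection living in a JVM's static space and being reused by every task that runs there.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HBase connection; counts how many were opened.
class FakeConnection {
    static final AtomicInteger OPENED = new AtomicInteger();
    final String quorum;

    FakeConnection(String quorum) {
        this.quorum = quorum;
        OPENED.incrementAndGet();
    }
}

// Hypothetical holder: the "static space" shared by all tasks in one JVM.
class ConnectionHolder {
    private static FakeConnection shared;

    // Lazily create the connection once; later callers reuse it.
    static synchronized FakeConnection get(String quorum) {
        if (shared == null) {
            shared = new FakeConnection(quorum);
        }
        return shared;
    }
}

class SharedConnectionSketch {
    public static void main(String[] args) throws InterruptedException {
        // Four threads play the role of tasks running in the same executor.
        Thread[] tasks = new Thread[4];
        for (int i = 0; i < tasks.length; i++) {
            tasks[i] = new Thread(() -> ConnectionHolder.get("zk-host:2181"));
            tasks[i].start();
        }
        for (Thread t : tasks) {
            t.join();
        }
        System.out.println("connections opened: " + FakeConnection.OPENED.get());
    }
}
```

The point of the design is that connection setup is paid once per executor rather than once per task or per record.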
7. Key Addition: HBaseContext
Create an HBaseContext:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
8. Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most of them in both Java and Scala
9. Foreach
Read data in parallel for each partition and compute:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._

rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // One BufferedMutator per partition, created from the shared Connection
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
10. Foreach
Read data in parallel for each partition and compute (Java):
hbaseContext.foreachPartition(keyValuesPuts,
    new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
  @Override
  public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
    // t._1() is the partition's iterator, t._2() the shared Connection
    BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
    while (t._1().hasNext()) {
      ... // HBase API put/incr/append/cas calls
    }
    mutator.flush();
    mutator.close();
  }
});
11. Map
Take a dataset and map it in parallel for each partition to produce a new RDD or process it:
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map( r => {
    ... // HBase API Scan Results
  })
})
12. BulkLoad
Bulk load a data set into HBase (works for all cases, generally used for wide tables; Scala only):
// Write HFiles to the staging folder, one (row key, family, qualifier) -> value cell per record
rdd.hbaseBulkLoad(hbaseContext, tableName,
  t => {
    val rowKey = t._1
    val fam:Array[Byte] = t._2._1
    val qual = t._2._2
    val value = t._2._3
    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
    Seq((keyFamilyQualifier, value)).iterator
  }, stagingFolder)
// Hand the generated HFiles over to the table's regions
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
15. BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k columns):
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family:Array[Byte] = f._1
      val qualifier = f._2
      val value:Array[Byte] = f._3
      familyQualifiersValues +=(family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)
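The input shape used above, one record per row carrying all of that row's (family, qualifier, value) triples, can be sketched in plain Java. This is only an illustration of the grouping step, with Strings in place of byte arrays and a hypothetical `FlatCell` class; it does not use or reproduce the hbase-spark API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical flat cell: one (row, family, qualifier, value) per record.
class FlatCell {
    final String row;
    final String family;
    final String qualifier;
    final String value;

    FlatCell(String row, String family, String qualifier, String value) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.value = value;
    }
}

class ThinRowsSketch {
    // Group flat cells into one entry per row key; TreeMap keeps row keys sorted.
    static Map<String, List<FlatCell>> groupByRow(List<FlatCell> cells) {
        Map<String, List<FlatCell>> rows = new TreeMap<>();
        for (FlatCell c : cells) {
            rows.computeIfAbsent(c.row, k -> new ArrayList<>()).add(c);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data, deliberately out of order.
        List<FlatCell> cells = new ArrayList<>();
        cells.add(new FlatCell("row2", "cf", "a", "v3"));
        cells.add(new FlatCell("row1", "cf", "b", "v2"));
        cells.add(new FlatCell("row1", "cf", "a", "v1"));
        for (Map.Entry<String, List<FlatCell>> e : groupByRow(cells).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " cell(s)");
        }
    }
}
```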
17. BulkPut
Parallelized HBase multi-put:
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, tableName, (putRecord) => {
    // putRecord is (row key, Array of (family, qualifier, value))
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
18. BulkPut
Parallelized HBase multi-put (Java):
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
  @Override
  public Put call(String v1) throws Exception {
    // split() takes a regex, so the pipe delimiter must be escaped
    String[] tokens = v1.split("\\|");
    Put put = new Put(Bytes.toBytes(tokens[0]));
    put.addColumn(Bytes.toBytes("segment"),
        Bytes.toBytes(tokens[1]),
        Bytes.toBytes(tokens[2]));
    return put;
  }
});
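Because `String.split` takes a regular expression, a pipe-delimited record needs an escaped or quoted delimiter. A minimal sketch in plain Java, with no HBase types; `PipeRecord` and the sample line are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper that splits "rowKey|qualifier|value" records.
class PipeRecord {
    // Pattern.quote turns the literal "|" into a regex that matches only a pipe.
    private static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    static String[] parse(String line) {
        return PIPE.split(line);
    }

    public static void main(String[] args) {
        String line = "row1|q1|v1"; // made-up sample record
        String[] quoted = parse(line);
        // An unquoted "|" is an alternation of two empty patterns,
        // so it matches between every pair of characters.
        String[] unquoted = line.split("|");
        System.out.println(quoted.length + " tokens with a quoted delimiter");
        System.out.println(unquoted.length + " tokens with an unquoted delimiter");
    }
}
```

Precompiling the pattern also avoids re-parsing the delimiter for every record, which can matter when the function runs once per line of a large input.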
20. Table RDD
How to materialize a table as a Spark RDD.
24. What Improvements Have We Made?
▪ Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
▪ High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Tasks Run Only on the RegionServers Holding the Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full Fledged DataFrame Support
• Spark-SQL
• Language-Integrated Query
▪ Run on Top of Existing HBase Table
• Native Support for Java Primitive Types
▪ Still some work and improvements to be done
• HBASE-16638 Reduce the number of Connections
• HBASE-14217 Add Java access to Spark bulk load functionality
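Partition pruning can be pictured with a small, self-contained sketch. It is an illustration only, not the connector's code: regions are modelled as hypothetical half-open `[start, end)` row-key ranges over Strings, the region boundaries and query range are made up, and only the regions that overlap the requested range are kept, so no task is scheduled for the rest.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a region as a half-open row-key range [start, end).
// An empty string stands for an unbounded side (first or last region).
class RegionRange {
    final String start;
    final String end;

    RegionRange(String start, String end) {
        this.start = start;
        this.end = end;
    }

    // True if this region can contain keys in the requested range [lo, hi).
    boolean overlaps(String lo, String hi) {
        boolean startsBeforeHi = hi.isEmpty() || start.compareTo(hi) < 0;
        boolean endsAfterLo = end.isEmpty() || end.compareTo(lo) > 0;
        return startsBeforeHi && endsAfterLo;
    }
}

class PartitionPruningSketch {
    // Keep only the regions that overlap the requested key range.
    static List<RegionRange> prune(List<RegionRange> regions, String lo, String hi) {
        List<RegionRange> kept = new ArrayList<>();
        for (RegionRange r : regions) {
            if (r.overlaps(lo, hi)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RegionRange> regions = new ArrayList<>();
        regions.add(new RegionRange("", "row020"));
        regions.add(new RegionRange("row020", "row040"));
        regions.add(new RegionRange("row040", "row060"));
        regions.add(new RegionRange("row060", ""));
        // Made-up query: row keys in [row041, row050)
        for (RegionRange r : prune(regions, "row041", "row050")) {
            System.out.println("scan region [" + r.start + ", " + r.end + ")");
        }
    }
}
```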
25. Data frame + HBase
WIP... 2.0?
27. Usage – Write to HBase
Export an RDD into a new HBase table through a DataFrame:
// newTable -> "5" asks for the table to be created with 5 regions
sc.parallelize(data)
  .toDF
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
28. Usage – Construct DataFrame
Load an HBase table as a DataFrame:
def withCatalog(cat: String): DataFrame = {
  sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
29. Usage - Language-Integrated Query/SQL
Filter and project the DataFrame with the language-integrated query API:
val df = withCatalog(catalog)
val s = df.filter((($"col0" <= "row050" && $"col0" > "row040") ||
    $"col0" === "row005" ||
    $"col0" === "row020" ||
    $"col0" === "r20" ||
    $"col0" <= "row005") &&
    ($"col2" === 1 ||
    $"col2" === 42))
  .select("col0", "col1", "col2")
s.show
30. Usage - Language-Integrated Query/SQL
Query the DataFrame with SQL:
// Load the DataFrame
val df = withCatalog(catalog)
// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show