Apache Spark vs rest of the world
- Problems and Solutions
Arkadiusz Jachnik
#BigDataSpain 2017
About Arkadiusz
• Senior Data Scientist at AGORA SA
  - user profiling & content personalization
  - recommendation system

• PhD Student at Poznan University of Technology
  - multi-class & multi-label classification
  - multi-output prediction
  - recommendation algorithms
Agora’s BigData Team
my boss Luiza :)  it's me!  we are all here at #BDS!
I invite you to the talk by these guys :)
Arek, Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel
Polish Media Company: Internet, Press, Magazines, Radio, Cinemas, Advertising, TV, Books
Spark in Agora's BigData Platform

• DATA COLLECTING AND INTEGRATION
• USER PROFILING SYSTEM
• RECOMMENDATION SYSTEM
• DATA ANALYTICS
• DATA ENRICHMENT AND CONTENT STRUCTURISATION

HADOOP CLUSTER: own build, Spark v2.2
Spark streaming, structured streaming, Spark SQL, MLlib
over 3 years of experience
Problems discussed today

1. Processing parts of data and loading them from Spark to a relational database in parallel

2. Bulk loading to an HBase database

3. From a relational database to a Spark DataFrame (with user-defined functions)

4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)

5. Spark Streaming with Kafka: how to implement your own offset manager
I will show some code…

• I will show real technical problems we have encountered during Spark deployments

• We have used Spark at Agora for over 3 years, so we have solid hands-on experience

• I will present practical solutions, showing some code in Scala

• Scala is the natural language for Spark
1. Processing and writing parts of data in parallel
Problem description:

• We have a huge, already-processed DataFrame of computed recommendations for users

• There are 4 defined types of recommendations

• For each type we want to take the top-K recommendations for each user

• Recommendations of each type should be loaded into a different PostgreSQL table
User        Recommendation type   Article     Score
Grzegorz    TYPE_3                Article F   1.0
Bożena      TYPE_4                Article B   0.2
Grażyna     TYPE_2                Article B   0.2
Grzegorz    TYPE_3                Article D   0.9
Krzysztof   TYPE_3                Article D   0.4
Grażyna     TYPE_2                Article C   0.9
Grażyna     TYPE_1                Article D   0.3
Bożena      TYPE_2                Article E   0.9
Grzegorz    TYPE_1                Article E   1.0
Grzegorz    TYPE_1                Article A   0.7
Code intro: input & output

TYPE1 (5 recos per user, save to table_1):
Grzegorz, Article A, 1.0
Grzegorz, Article F, 0.9
Grzegorz, Article C, 0.9
Grzegorz, Article D, 0.8
Grzegorz, Article B, 0.75
Bożena, ... ...

TYPE2 (4 recos per user, save to table_2):
Krzysztof, Article F, 1.0
Krzysztof, Article D, 1.0
Krzysztof, Article C, 0.8
Krzysztof, Article B, 0.85
Grażyna, Article C, 1.0
Grażyna, ... ...

TYPE3 (3 recos per user, save to table_3):
Grzegorz, Article E, 1.0
Grzegorz, Article B, 0.75
Grzegorz, Article A, 0.8
Bożena, Article E, 0.9
Bożena, Article A, 0.75
Bożena, Article C, 0.75

TYPE4 (2 recos per user, save to table_4):
Grażyna, Article A, 1.0
Grażyna, Article F, 0.9
Bożena, Article B, 0.9
Bożena, Article D, 0.9
Grzegorz, Article B, 1.0
Grzegorz, Article E, 0.95
Standard approach
recoTypes.foreach(recoType => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshots: no parallelism across types; parallelism within a job, but most of the tasks skipped]
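For reference, the snippets in this section assume the usual Spark SQL imports (the session value name is an assumption, not from the slides):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, expr, row_number, udf}
import sparkSession.implicits._  // enables the $"type" column syntax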
Maybe we can add .par?
recoTypes.par.foreach(recoType => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
[Spark UI screenshot: parallelism, but too many tasks :(]
Our trick
parallelizeProcessing(recoTypes, (recoType: RecoType) => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
f(recoTypes.head)
if(recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))
}
execute the Spark action for the first type… then parallelize the rest (presumably the shared DataFrame is cached, so the first action materializes it and the remaining, parallel jobs reuse it)
2. Fast bulk-loading to HBase
Problems with the standard HBase client (inserts with the Put class):

• Difficult integration with Spark

• Complicated parallelization

• For non pre-split tables, problems with *Region*Exceptions

• Slow for millions of rows
[Diagram: a Spark DataFrame / RDD written with .foreachPartition, each partition issuing its own hTable.put(…) calls]
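For context, a minimal sketch of the Put-based write the diagram depicts (df, the table name, column family and row layout are illustrative, not from the slides):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

df.foreachPartition { rows =>
  // one connection and table handle per partition (HBase 1.x client API)
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("hbase_table_name"))
  try {
    rows.foreach { row =>
      val put = new Put(Bytes.toBytes(row.getAs[String]("key")))
      put.addColumn(Bytes.toBytes("col-fam"), Bytes.toBytes("col"),
        Bytes.toBytes(row.getAs[String]("value")))
      table.put(put)  // one round trip per row unless client-side buffering is used
    }
  } finally {
    table.close()
    connection.close()
  }
}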
Idea
Our approach is based on:
https://github.com/zeyuanxy/spark-hbase-bulk-loading

Input RDD:

data: RDD[(        // pair RDD
  Array[Byte],     // HBase row key
  Map[             // data:
    String,        // column family
    Array[(
      String,      // column name
      (String,     // cell value
       Long)       // timestamp
    )]
  ]
)]
General idea:

We have to save our RDD data as HFiles (the format in which HBase stores its data on disk) and load them into the given pre-existing table.

General steps:

1. Implement a Spark Partitioner that defines how the data in a key-value pair RDD should be partitioned with respect to the HBase row keys (a sketch follows this list)

2. Repartition and sort the RDD within partitions, per column family and according to the start row keys of every HBase region

3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method

4. Load the files into the table with LoadIncrementalHFiles (HBase API)
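A hedged sketch of such a partitioner (step 1), modeled on the spark-hbase-bulk-loading project linked above; the exact key shape matches the RDD built on the next slide:

import java.util.Arrays

import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner

// Rows are routed to the region that owns their row key, and each region is
// additionally split into `fraction` HFiles by hashing the row key.
class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int)
    extends Partitioner {

  override def numPartitions: Int = startKeys.length * fraction

  override def getPartition(key: Any): Int = {
    // keys of the bulk-load RDD have the shape ((rowKey, column), timestamp)
    val rowKey = key match {
      case ((k: Array[Byte], _), _) => k
      case k: Array[Byte]           => k
    }
    // the last region whose start key is <= rowKey owns the row
    // (the first start key is empty, so the index is always >= 0)
    val region = startKeys.lastIndexWhere(sk => Bytes.compareTo(rowKey, sk) >= 0)
    region * fraction + (Arrays.hashCode(rowKey) & Int.MaxValue) % fraction
  }
}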
Implementation
// Prepare hConnection, tableName, hTable ...
val regionLocator = hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor
  .getFamiliesKeys.map(Bytes.toString(_))
val partitioner =
  new HFilePartitioner(regionLocator.getStartKeys, fraction)

// prepare partitioned RDD
val rdds = for {
  family <- columnFamilies
  rdd = data
    .collect { case (key, dataMap) if dataMap.contains(family) =>
      (key, dataMap(family)) }
    .flatMap { case (key, familyDataMap) =>
      familyDataMap.map {
        case (column: String, valueTs: (String, Long)) =>
          (((key, Bytes.toBytes(column)), valueTs._2),
            Bytes.toBytes(valueTs._1))
      }
    }
} yield getPartitionedRdd(rdd, family, partitioner)

val rddToSave = rdds.reduce(_ ++ _)

// prepare map-reduce job for bulk-load
HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

// prepare path for HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)

try {
  rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)

  // prepare HFiles for incremental load by setting
  // folders permissions read/write/exec for all...
  setRecursivePermission(hFilePath)

  val loader = new LoadIncrementalHFiles(hbaseConfig)
  loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
} // finally close resources, ...
The steps in the code above:

1. Prepare the HBase connection, table and region locator
2. Prepare the Spark partitioner for the HBase regions
3. Repartition and sort the data within partitions using the partitioner
4. Save the HFiles to HDFS via saveAsNewAPIHadoopFile
5. Load the HFiles into the HBase table
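The getPartitionedRdd helper is not shown on the slide; a hedged sketch of what it presumably does, following the same project: repartition with the HFilePartitioner, sort cells within each partition the way HFiles require (by row, then column, then descending timestamp), and emit the (key, KeyValue) pairs that HFileOutputFormat2 expects.

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def getPartitionedRdd(
    rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
    family: String,
    partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {

  // HFiles must be sorted by row, then column, then newest timestamp first
  implicit val cellOrdering: Ordering[((Array[Byte], Array[Byte]), Long)] =
    new Ordering[((Array[Byte], Array[Byte]), Long)] {
      def compare(a: ((Array[Byte], Array[Byte]), Long),
                  b: ((Array[Byte], Array[Byte]), Long)): Int = {
        val rowCmp = Bytes.compareTo(a._1._1, b._1._1)
        if (rowCmp != 0) rowCmp
        else {
          val colCmp = Bytes.compareTo(a._1._2, b._1._2)
          if (colCmp != 0) colCmp
          else java.lang.Long.compare(b._2, a._2) // descending timestamp
        }
      }
    }

  rdd
    .repartitionAndSortWithinPartitions(partitioner)
    .map { case (((row, col), ts), value) =>
      (new ImmutableBytesWritable(row),
        new KeyValue(row, Bytes.toBytes(family), col, ts, value))
    }
}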
Keep in mind

• Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32) to a suitable value

• For large data, too small a value of this parameter may cause:
IllegalArgumentException: Size exceeds Integer.MAX_VALUE

• Create HBase tables with splits adapted to the expected row keys
  - example: for row keys of HEX IDs, create the table with splits like:
    create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
  - for subsequent single puts this minimizes *Region*Exceptions
3. Loading data from Postgres to Spark

This is possible for data from Hive:

val toUpperCase: String => String = _.toUpperCase
// register the UDF so it can be used inside SQL
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)
But this is not possible for data from JDBC (for example PostgreSQL):

val toUpperCase: String => String = _.toUpperCase
val toUpperCaseUdf = udf(toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData",
    connectionConf)

The inner query is executed by Postgres (not Spark), so the Spark UDF is unknown there; here you can specify just a Postgres table name or a plain-SQL subquery. And how do we parallelize the data loading?
Our solution

Try to load the 'raw' data without UDFs, and then use .withColumn with the UDF as an expression:

val toUpperCase: String => String = _.toUpperCase
// the UDF must be registered so that expr(...) can resolve it
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT code " +
    "FROM codes) as codesData",
    connectionConf)
  .withColumn("upperCode",
    expr("toUpperCaseUdf(code)"))

.jdbc produces a DataFrame, but it's one partition!

We will split the table read across executors on a selected column:

val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
            "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
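For reference (an illustration, not from the slides): Spark translates columnName, lowerBound, upperBound and numPartitions into one range predicate per partition on the given column, roughly:

-- partition 1:  type_id < 11 OR type_id IS NULL
-- partition 2:  type_id >= 11 AND type_id < 21
-- ...
-- partition 10: type_id >= 91

The bounds do not filter any rows; they only decide how the reads are split.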
Is it working?
spark.read.jdbc(
url = "jdbc:mysql://localhost:3306/test",
table = "users",
properties = connectionProperties)
.cache()
spark.read.jdbc(
url = "jdbc:mysql://localhost:3306/test",
table = "users",
columnName = "type",
lowerBound = 1L,
upperBound = 100L,
numPartitions = 4,
connectionProperties = connectionProperties)
.cache()
[Screenshots on test data: the plain read yields 1 partition; the partitioned read yields 4 partitions]
4. From HBase to Spark via Hive

There is a commonly used method for loading data from HBase into Spark via a Hive external table:

CREATE EXTERNAL TABLE hive_view_on_hbase (
  key int,
  value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:val"
)
TBLPROPERTIES (
  "hbase.table.name" = "xyz"
);
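Once such a table exists, reading it from Spark is a plain Hive query (a minimal sketch; assumes a Hive-enabled SparkSession):

val df: DataFrame = sparkSession.sql(
  "SELECT key, value FROM hive_view_on_hbase"
)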
Example: HBase rows with a "cities" column family (visit counts per city):

row key        Poznan   Warsaw   Cracow   Gdansk
72A9DBA74524   40       5        1        3
58383B36275A            120      60       5
009D22419988   75       1

After mapping through the HiveHBaseHandler:

user_id        cities_map                                          last_city
72A9DBA74524   map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3)    ?
58383B36275A   map(Warsaw->120, Cracow->60, Gdansk->5)             ?
009D22419988   map(Poznan->75, Warsaw->1)                          ?

But how to get the last (most recent) values? Where are the timestamps?
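A hedged sketch of how such a cities_map column can be obtained: mapping the whole column family to a Hive MAP (the table name and the int value type are assumptions following the example):

CREATE EXTERNAL TABLE user_cities (
  user_id string,
  cities_map map<string,int>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cities:"
)
TBLPROPERTIES (
  "hbase.table.name" = "users"
);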
Our case

• We use the HDP distribution of Hadoop, with HBase 1.1.x

• It is possible to expose in the Hive view on an HBase table the latest timestamp of each row's modification:

CREATE EXTERNAL TABLE hive_view_on_hbase (
  key int,
  value string,
  ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,cf1:val,:timestamp'
)
TBLPROPERTIES (
  'hbase.table.name' = 'xyz'
);
• How to extract the timestamp of each cell?

• Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …

• Do not download the Hive source code from the main Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
There is a patch in the Hive repo… …but it is still not reviewed and merged :(
There is a lot of code…

…but we have some tips on how to change the Hive-HBase-Handler:

• The parsing of the hbase.columns.mapping columns is located in HBaseSerDe.java, which returns a ColumnMappings object

• The LazyHBaseRow class stores the data of an HBase row

• Timestamps of the processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class

• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
5. Spark + Kafka: own offset manager
Problem description:

• Spark output operations are at-least-once

• For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output

• Options:

1. Checkpoints
   + easy to enable by Spark checkpointing
   - the output operation must be idempotent
   - cannot recover from a checkpoint if the application code has changed

2. Own data store
   + works regardless of changes to your application code
   + you can use data stores that support transactions
   + exactly-once semantics
[Diagram: a single Spark batch processes and saves the data, then saves the offsets]

Image source: Spark Streaming documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html
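For reference, option 1 amounts to the standard checkpointing calls (the directory path and the factory function are illustrative):

import org.apache.spark.streaming.StreamingContext

// enable checkpointing for the streaming context
ssc.checkpoint("hdfs:///user/spark/streaming-checkpoints")

// on restart, the context must be re-created from the checkpoint
val recoveredSsc = StreamingContext.getOrCreate(
  "hdfs:///user/spark/streaming-checkpoints",
  () => createStreamingContext()  // hypothetical factory building a fresh context
)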
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...
stream.foreachRDD(rdd => {
val toSave: Seq[String] = rdd.collect().map(_.value())
saveData(toSave)
offsetsStore.saveOffsets(rdd, ...)
})
Some code with Spark Streaming (continued)
val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)
stream.foreachRDD(rdd => {
val toSave: Seq[String] = rdd.collect().map(_.value())
saveData(toSave)
offsetsStore.saveOffsets(rdd, zkPath)
})
def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext, offsetsStore: MyOffsetsStore,
kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
offsetsStore.readOffsets(topic, zkPath) match {
case Some(offsetsMap) =>
KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
case None =>
KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams)
)
}
}
Code of offset store
class MyOffsetsStore(zkHosts: String) {

  val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

  // save the ending offsets of the RDD's offset ranges under the given ZooKeeper path
  def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsRanges.groupBy(_.topic).foreach {
      case (topic, offsetsRangesPerTopic) =>
        // serialize as "partition1:offset1,partition2:offset2,..."
        val offsetsRangesStr = offsetsRangesPerTopic
          .map(offRang => s"${offRang.partition}:${offRang.untilOffset}")
          .mkString(",")
        zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
    }
  }

  // read the offsets back; None means no offsets have been stored yet
  def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
    val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
    offsetsRangesStrOpt match {
      case Some(offsetsRangesStr) =>
        Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
          case Array(partitionStr, offsetStr) =>
            new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
        }.toMap)
      case None => None
    }
  }
}
Thank you!
Questions?
arkadiusz.jachnik@agora.pl
www.linkedin.com/in/arkadiusz-jachnik
Contenu connexe

Tendances

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark Summit
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkDatabricks
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLDatabricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexesDaniel Lemire
 

Tendances (20)

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 

Similaire à Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017

Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB
 
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB
 
Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017techmaddy
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Azure database as a service options
Azure database as a service optionsAzure database as a service options
Azure database as a service optionsMarcelo Adade
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsBenjamin Darfler
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...Facultad de Informática UCM
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedis Labs
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Michael Rys
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generatorPayal Jain
 
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...George McGeachie
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideDanairat Thanabodithammachari
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsGleicon Moraes
 
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopStuart Ainsworth
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseHenk van der Valk
 

Similaire à Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017 (20)

Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
 
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
 
Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Azure database as a service options
Azure database as a service optionsAzure database as a service options
Azure database as a service options
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory Optimization
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
 
Json to hive_schema_generator
Json to hive_schema_generatorJson to hive_schema_generator
Json to hive_schema_generator
 
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...
SAP PowerDesigner Masterclass for the UK SAP Database & Technology User Group...
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
Data Science
Data ScienceData Science
Data Science
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
 
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to Hadoop
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 

Plus de Big Data Spain

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017Big Data Spain
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
 
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Big Data Spain
 

Plus de Big Data Spain (20)

Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017
 
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...
 
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017AI: The next frontier by Amparo Alonso at Big Data Spain 2017
AI: The next frontier by Amparo Alonso at Big Data Spain 2017
 
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017
 
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...
 
State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...State of the art time-series analysis with deep learning by Javier Ordóñez at...
State of the art time-series analysis with deep learning by Javier Ordóñez at...
 
Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...Trading at market speed with the latest Kafka features by Iñigo González at B...
Trading at market speed with the latest Kafka features by Iñigo González at B...
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...
 
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...
 
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...
 
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...
 
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...
 
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017
 
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...
 

Dernier

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017

})

Result (slide annotations): with a plain foreach there is no parallelism across the types; within a single type Spark does run tasks in parallel, but most of them are skipped.
#BigDataSpain 2017
Maybe we can add .par?

recoTypes.par.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})

Result: parallelism, but too many tasks :(
#BigDataSpain 2017
Our trick

parallelizeProcessing(recoTypes, (recoType: RecoType) => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
  f(recoTypes.head)                        // execute the Spark action for the first type...
  if (recoTypes.tail.nonEmpty)
    recoTypes.tail.par.foreach(f(_))       // ...then parallelize the rest
}
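Why does running the first type sequentially help? Our reading (an assumption, not stated on the slide) is that the first action materializes the shared DataFrame, so the parallel runs reuse it instead of recomputing the whole lineage concurrently. A minimal sketch with explicit caching; processAndSave stands for the lambda above:

// Assumption: processedData is reused by every recommendation type, so caching it
// before the first action makes the trick pay off. The first, sequential call fills
// the cache; the remaining, parallel calls only read from it.
processedData.cache()
parallelizeProcessing(recoTypes, processAndSave)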
2. Fast bulk-loading to HBase

Problems with the standard HBase client (inserts with the Put class):
• Difficult integration with Spark
• Complicated parallelization
• For non pre-split tables, problems with *Region*Exceptions
• Slow for millions of rows

[Diagram: Spark DataFrame / RDD -> .foreachPartition -> many parallel hTable.put(...) calls]
#BigDataSpain 2017
Idea

Our approach is based on: https://github.com/zeyuanxy/spark-hbase-bulk-loading

Input RDD:

data: RDD[(          // pair RDD
  Array[Byte],       // HBase row key
  Map[               // data:
    String,          // column family
    Array[(
      String,        // column name
      (String,       // cell value
       Long)         // timestamp
    )]
  ]
)]

General idea: save the RDD data as HFiles (the files HBase stores its data in) and load them into a given, pre-existing table.

General steps (a partitioner sketch follows this list):
1. Implement a Spark Partitioner that defines how the key-value pair RDD should be partitioned by HBase row key
2. Repartition and sort the RDD within column families and the starting row keys of every HBase region
3. Save the RDD to HDFS as HFiles with the rdd.saveAsNewAPIHadoopFile method
4. Load the files into the table with LoadIncrementalHFiles (HBase API)
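The talk does not show the body of HFilePartitioner; a minimal sketch of step 1, assuming the constructor receives the region start keys from regionLocator.getStartKeys and that fraction splits each region into several sub-partitions to cap HFile size:

import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner

// Sketch (the actual class is not shown in the talk): route each record to the
// HBase region its row key belongs to, spreading each region over `fraction`
// sub-partitions.
class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int) extends Partitioner {
  override def numPartitions: Int = startKeys.length * fraction

  override def getPartition(key: Any): Int = {
    // Key layout matches the RDD built on the next slide: ((rowKey, column), timestamp)
    val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
    // Find the last region whose start key is <= our row key.
    val region = startKeys.lastIndexWhere(sk => Bytes.compareTo(rowKey, sk) >= 0) max 0
    region * fraction + (math.abs(Bytes.hashCode(rowKey)) % fraction)
  }
}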
#BigDataSpain 2017
Implementation

// Prepare hConnection, tableName, hTable and the region locator...
val regionLocator = hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor
  .getFamiliesKeys.map(Bytes.toString(_))

// Prepare the Spark partitioner for the HBase regions
val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

// Prepare the partitioned RDD: repartition and sort the data
// within partitions using the partitioner
val rdds = for {
  family <- columnFamilies
  rdd = data
    .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
    .flatMap { case (key, familyDataMap) =>
      familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
        (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
      }
    }
} yield getPartitionedRdd(rdd, family, partitioner)

val rddToSave = rdds.reduce(_ ++ _)

// Prepare the map-reduce job for the bulk load
HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

// Prepare a path for the HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)

try {
  // Save the HFiles to HDFS by saveAsNewAPIHadoopFile
  rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)

  // Prepare the HFiles for the incremental load by setting
  // folder permissions to read/write/exec for all...
  setRecursivePermission(hFilePath)

  // Load the HFiles into the HBase table
  val loader = new LoadIncrementalHFiles(hbaseConfig)
  loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
}
// finally close resources, ...
#BigDataSpain 2017
Keep in mind

• Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily optimally (default: 32); see the sketch after this list
• For large data, a too-small value of this parameter may cause: IllegalArgumentException: Size exceeds Integer.MAX_VALUE
• Create HBase tables with splits adapted to the expected row keys. Example: for row keys of HEX IDs, create the table with splits like:
  create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
  For later single puts this minimizes *Region*Exceptions.
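Setting the first bullet's parameter from code is a one-liner; a sketch, assuming you build the Configuration yourself before the bulk load:

import org.apache.hadoop.hbase.HBaseConfiguration

val hbaseConfig = HBaseConfiguration.create()
// Allow more HFiles per region and column family than the default 32,
// so large bulk loads do not fail; tune the value to your data volume.
hbaseConfig.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128)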
#BigDataSpain 2017
3. Loading data from Postgres to Spark

This is possible for data from Hive:

val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)  // register the UDF so SQL can call it by name
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)

But this is not possible for data from JDBC (for example PostgreSQL), because the subquery is executed by Postgres, not by Spark; alternatively you can specify just a Postgres table name:

val toUpperCase: String => String = _.toUpperCase
val toUpperCaseUdf = udf(toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl, "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData", connectionConf)  // fails: Postgres knows nothing about Spark UDFs

And how to parallelize the data loading?
#BigDataSpain 2017
Our solution

Try to load the 'raw' data without UDFs first, and then use .withColumn with the UDF as an expression (note: .jdbc produces a DataFrame, but it is a single partition!):

val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)  // needed so expr() can resolve the UDF by name
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl, "(SELECT code " +
    "FROM codes) as codesData", connectionConf)
  .withColumn("upperCode", expr("toUpperCaseUdf(code)"))

To parallelize the loading, split the table read across executors on a selected numeric column:

val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
      "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
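One caveat: lowerBound and upperBound should cover the real range of the partition column, or the first and last partitions end up skewed. A sketch (not from the talk) that derives the bounds from the database first, assuming type_id is a BIGINT:

// Hypothetical helper: derive the partitioning bounds from the data itself,
// so the partitions cover the real type_id range evenly.
val bounds = sparkSession.read
  .jdbc(jdbcUrl, "(SELECT min(type_id) AS lo, max(type_id) AS hi FROM codes) as b",
    connectionConf)
  .first()
val data = sparkSession.read.jdbc(
  url = jdbcUrl,
  table = "codes",
  columnName = "type_id",
  lowerBound = bounds.getAs[Long]("lo"),
  upperBound = bounds.getAs[Long]("hi") + 1,
  numPartitions = 10,
  connectionProperties = connectionConf)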
#BigDataSpain 2017
Is it working?

Test on sample data; the first read produces 1 partition, the second 4 partitions:

spark.read.jdbc(
  url = "jdbc:mysql://localhost:3306/test",
  table = "users",
  properties = connectionProperties)
  .cache()

spark.read.jdbc(
  url = "jdbc:mysql://localhost:3306/test",
  table = "users",
  columnName = "type",
  lowerBound = 1L,
  upperBound = 100L,
  numPartitions = 4,
  connectionProperties = connectionProperties)
  .cache()
#BigDataSpain 2017
4. From HBase to Spark by Hive

A commonly used method for loading data from HBase into Spark is a Hive external table:

CREATE TABLE hive_view_on_hbase (
  key int,
  value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key, cf1:val"
)
TBLPROPERTIES (
  "hbase.table.name" = "xyz"
);

Example: an HBase table with a 'cities' column family (per-city visit counters such as Poznan=40, Warsaw=5, Cracow=1, Gdansk=3) is exposed through the Hive-HBase handler as:

user_id      | cities_map                                       | last_city
72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5)          | ?
009D22419988 | map(Poznan->75, Warsaw->1)                       | ?

But how to get the last (most recent) values? Where are the timestamps?
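Before tackling the timestamp question: reading such a view from Spark is plain SQL. A sketch, assuming the SparkSession was built with enableHiveSupport():

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-via-hive")
  .enableHiveSupport()   // lets Spark resolve tables from the Hive metastore
  .getOrCreate()

// The Hive external table on HBase is queried like any other Hive table.
val df = spark.sql("SELECT key, value FROM hive_view_on_hbase")
df.show()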
#BigDataSpain 2017
Our case

• We use the HDP distribution of the Hadoop cluster with HBase 1.1.x
• It is possible to add the latest timestamp of a row modification to the Hive view on an HBase table:

CREATE TABLE hive_view_on_hbase (
  key int,
  value string,
  ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
)
TBLPROPERTIES (
  'hbase.table.name' = 'xyz'
);

• But how to extract the timestamp of each cell?
• Answer: rewrite the Hive-HBase handler that is responsible for creating the Hive views on HBase tables :) ...but first...
• Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
#BigDataSpain 2017
There is a patch on the Hive repo... but it is still not reviewed and merged :(
#BigDataSpain 2017
There is a lot of code…

…but we have some tips on how to change the Hive-HBase handler:
• The function that parses the columns of hbase.columns.mapping is located in HBaseSerDe.java, which returns a ColumnMappings object
• The LazyHBaseRow class stores the data of an HBase row
• Timestamps of the processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class
• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
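For comparison, and as a quick way to verify what the patched handler should surface: cell timestamps are always available through the raw HBase client API. A sketch against the example table above:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("xyz"))
val result = table.get(new Get(Bytes.toBytes("72A9DBA74524")))
// Every cell carries its own timestamp, even though the plain Hive view drops it.
result.rawCells().foreach { cell =>
  println(s"${Bytes.toString(CellUtil.cloneQualifier(cell))} -> " +
    s"${Bytes.toString(CellUtil.cloneValue(cell))} @ ${cell.getTimestamp}")
}
table.close(); connection.close()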
#BigDataSpain 2017
5. Spark + Kafka: own offset manager

Problem description:
• Spark output operations are at-least-once
• For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
• Options:
  1. Checkpoints
     + easy to enable by Spark checkpointing
     - the output operation must be idempotent
     - you cannot recover from a checkpoint if the application code has changed
  2. Own data store
     + works regardless of changes to your application code
     + you can use data stores that support transactions
     + exactly-once semantics

[Diagram: a single Spark batch processes and saves the data, then saves the offsets]
Image source: Spark Streaming documentation https://spark.apache.org/docs/latest/streaming-programming-guide.html
#BigDataSpain 2017
Some code with Spark Streaming

val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...

stream.foreachRDD(rdd => {
  // single Spark batch: process and save the data...
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  // ...then save the offsets
  offsetsStore.saveOffsets(rdd, ...)
})
#BigDataSpain 2017
Some code with Spark Streaming

val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
  kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)

stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, zkPath)
})

def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext,
                offsetsStore: MyOffsetsStore, kafkaParams: Map[String, Object])
    : DStream[ConsumerRecord[String, String]] = {
  // resume from stored offsets if we have any, otherwise subscribe from scratch
  offsetsStore.readOffsets(topic, zkPath) match {
    case Some(offsetsMap) =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
    case None =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
  }
}
#BigDataSpain 2017
Code of the offset store

class MyOffsetsStore(zkHosts: String) {

  val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

  // save the until-offset of every partition of every topic in the RDD
  def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsRanges.groupBy(_.topic).foreach { case (topic, offsetsRangesPerTopic) =>
      val offsetsRangesStr = offsetsRangesPerTopic
        .map(offRang => s"${offRang.partition}:${offRang.untilOffset}").mkString(",")
      zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
    }
  }

  // read the offsets back as a map of TopicPartition -> offset, if they were ever stored
  def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
    val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
    offsetsRangesStrOpt match {
      case Some(offsetsRangesStr) =>
        Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
          case Array(partitionStr, offsetStr) =>
            new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
        }.toMap)
      case None => None
    }
  }
}
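To close the loop, a minimal driver wiring the pieces above together; the topic name, Zookeeper path, batch interval and kafkaParams values are illustrative, not from the talk:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("offset-manager-demo")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val offsetsStore = new MyOffsetsStore("zk-host:2181")
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-host:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "reco-loader",
  "enable.auto.commit" -> (false: java.lang.Boolean))  // offsets are ours to manage

val stream = kafkaStream("recommendations", "/offsets/recommendations",
  ssc, offsetsStore, kafkaParams)

stream.foreachRDD { rdd =>
  saveData(rdd.collect().map(_.value()))                      // idempotent output first (saveData: from the slides)
  offsetsStore.saveOffsets(rdd, "/offsets/recommendations")   // then persist the offsets
}
ssc.start()
ssc.awaitTermination()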