8/14/2015
Jack Gudenkauf
VP Big Data
scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x=>print(x)) }.iterator).collect; println()
https://twitter.com/_JG
AGENDA
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
AGENDA
1. Background
PLAYTIKA
• Founded in 2010
• Social Casino global category leader
• 10 games
• 13 platforms
• 1000+ employees
A trifecta requires you to select the first three finishers in order
and can lead to big pay-offs.
Boxing lets your selections come in any order and you still win.
Kafka - Spark - Vertica
Placing Your Trifecta Box Bet on Kafka, Spark, and HP Vertica
https://www.linkedin.com/in/jackglinkedin
MY BACKGROUND
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL), Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex, a 4GL RDBMS [E.F. Codd]
Self-employed Consultant
Intercepted Dataflex DB calls to store and retrieve data to/from Btrieve and an IBM DB2 mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework, Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
A QUEST
With attributes of
Operational Robustness
High Availability
Stronger durability guarantees
Idempotent (an operation that is safe to repeat)
Productivity
Analytics
Streaming, Machine Learning, BI, BA, Data Science
Rich development environment
Strongly typed, OO, functional, with support for set-based logic and aggregations (SQL)
Performance
Scalable in every tier
MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
[Architecture diagram: Playtika Santa Monica original ETL (Extract Transform Load)]
Game applications (Bingo Blitz, Slotomania, WSOP), each with its own UserId/SessionId types: INT with UUID(36), Int with UUID(32), varchar(32) with varchar(255)
1 REST API → 2 Apache Flume™ → 3 Java™ ETL parser & loader → 4 COPY → 5 HP Vertica™ MPP columnar DW cluster (UserId <-> UserGId)
Local data warehouses: Single Sources of Truth to a Global SOT, with a unified JSON schema
✓ Analytics of Relational Data (structured relational and aggregated data)
AGENDA
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
[Architecture diagram: New PSTL Architecture. PSTL is the new ETL]
Game applications (Bingo Blitz, Slotomania, WSOP, same per-app UserId/SessionId types as above)
1 REST API (or local Kafka) → 2 Apache Kafka™ real-time messaging cluster → 3 Apache Spark™/Hadoop™/Parquet™ cluster (Resilient Distributed Datasets, ETL) → 4 Parallelized Streaming Transformation Loader → 5 HP Vertica™ MPP columnar DW cluster
Analytics of [semi]structured [non]relational data stores: ✓ real-time streaming, ✓ machine learning, ✓ semi-structured raw JSON data, ✖ structured (non)relational Parquet data, ✓ structured relational and aggregated data
Local data warehouses with a unified JSON schema; MPP in every tier
AGENDA
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
Apache Kafka™ is a distributed, partitioned, replicated commit log service
Producer Producer Producer
Kafka Cluster
(Broker)
Consumer Consumer Consumer
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.
The messages in the partitions are each assigned a sequential id number, called the offset, that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.
Kafka is not a message queue (push/pop).
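Because every message carries a (topic, partition, offset) coordinate, a consumer can read an exact offset range and safely re-read it, which is what makes the PSTL's idempotent batches possible. A minimal sketch of such a bounded read, assuming the spark-streaming-kafka 0.8 API used later in this deck (the broker host, topic, and offsets are illustrative):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object OffsetRangeRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kafka-offset-read").setMaster("local[2]")) // local master for the sketch
    val kafkaParams = Map("metadata.broker.list" -> "broker01:9092") // illustrative broker

    // One OffsetRange per Kafka partition: (topic, partition, fromOffset, untilOffset).
    // Each range becomes exactly one Spark partition, so read parallelism = Kafka parallelism.
    val ranges = Array(
      OffsetRange("appId_1", partition = 0, fromOffset = 0L, untilOffset = 1000L),
      OffsetRange("appId_1", partition = 1, fromOffset = 0L, untilOffset = 1000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, ranges)
    println(s"read ${rdd.count()} messages in ${rdd.partitions.length} partitions")
    sc.stop()
  }
}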
SPARK RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
[Diagram: RDDs partitioned across cluster nodes (Node 1, Node 2, Node 3, Node ...)]
RDD 1: Partitions 1, 2, 3, one per node
RDD 2: Partitions 1-64, 65-128, 129-192, 193-256, one block of 64 per node
RDD 3: Partitions 1, 2, 3, spread across the nodes in a different order
Vertica Hashing & Partitioning
An Initiator Node shuffles data to storage nodes
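Vertica assigns each row to a node by hashing the segmentation key into the unsigned 32-bit space and giving every node one contiguous slice of that space (the get_projection_segments output in the backup slides shows these ranges). A sketch of the node lookup, assuming eight nodes with evenly split ranges:

object SegmentLookup {
  val NodeCount = 8
  val HashSpace = 4294967296L // 2^32

  // Each node owns a contiguous slice of the 32-bit hash space,
  // e.g. node 0 = [0, 536870911], node 1 = [536870912, 1073741823], ...
  def nodeFor(hash: Long): Int = {
    require(hash >= 0 && hash < HashSpace, "hash must be an unsigned 32-bit value")
    ((hash * NodeCount) / HashSpace).toInt
  }

  def main(args: Array[String]): Unit = {
    println(nodeFor(0L))          // 0 (first node's range starts at 0)
    println(nodeFor(536870912L))  // 1 (second node's range starts here)
    println(nodeFor(4294967295L)) // 7 (last node owns the top of the space)
  }
}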
AGENDA
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
{"appId": 3, "sessionId": ”7”, "userId": ”42” }
{"appId": 3, "sessionId": ”6”, "userId": ”42” }
Node 1 Node 2 Node 3 Node 4
3 Import recent Sessions
Apache Kafka Cluster
Topic: “appId_1” Topic: “appId_2” Topic: “appId_3”
old new
Kafka Table
appId,
TopicOffsetRange,
Batch_Id
SessionMax Table
sessionGIdMax Int
UserMax Table
userGIdMax Int
appSessionMap_RDD
appId: Int
sessionId: String
sessionGId: Int
appUserMap_RDD
appId: Int
userId: String
userGId: Int
appSession
appId: Int
sessionId: varchar(255)
sessionGId: Int
appUser
appId: Int
userId: varchar(255)
userGId: Int
1 Start a Spark Driver per APP
Node 1 Node 2 Node 3
4 Spark Kafka [non]Streaming job per APP (read partition/offset range)
5 select for update;
update max GId
5 Assign userGIds To userId
sessionGIds To sessionId
6 Hash(userGId) to
RDD partitions with affinity
To Vertica Node (Parallelism)
7 userGIdRDD.foreachPartition
{…stream.writeTo(socket)...}
8 Idempotent: Write Raw
Unparsed JSON to hdfs
9 Idempotent: Write Parsed
JSON to .parquet hdfs
10 Update MySQL
Kafka Offsets
{"appId": 2, "sessionId": ”4”, "userId": ”KA” }
{"appId": 2, "sessionId": ”3”, "userId": ”KY” }
{"appId": 1, "sessionId": ”2”, "userId": ”CB” }
{"appId": 1, "sessionId": "1”, "userId": ”JG” }
4 appId {Game events, Users, Sessions,…} Partition 1..n RDDs
5 appId Users & Sessions Partition 1..n RDDs
5 appId appUserMap_RDD.union(assignedID_RDD) RDDs
6 appId Users & Sessions Partition 1..n RDDs
7 copy jackg.DIM_USER
with source SPARK(port='12345’,
nodes=‘node0001:4, node0002:4, node0003:4’)
direct;
2 Import Users
Apache Hadoop™ Spark™ Cluster
HP Vertica™ Cluster
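Steps 6 and 7 are the heart of the parallel load: rows are pre-hashed in Spark so each RDD partition maps to exactly one Vertica node, and each partition then streams itself over a socket to that node's listening SPARK copy source. The following is only an illustrative sketch of the pattern; the partitioner, host names, modulo hash, and pipe-delimited line format are assumptions, not the production PSTL code:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.{Partitioner, SparkConf, SparkContext, TaskContext}

// Route each userGId to the partition whose index matches its target Vertica node.
class VerticaAffinityPartitioner(nodes: Int) extends Partitioner {
  override def numPartitions: Int = nodes
  override def getPartition(key: Any): Int = {
    val userGId = key.asInstanceOf[Long]
    (userGId % nodes).toInt // stand-in for Vertica's hash() segmentation
  }
}

object HashAffinityLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pstl-load-sketch").setMaster("local[3]")) // local master for the sketch
    val verticaNodes = Array("node0001", "node0002", "node0003") // illustrative hosts

    val users = sc.parallelize(Seq((1L, "JG"), (2L, "CB"), (3L, "KY")))
      .partitionBy(new VerticaAffinityPartitioner(verticaNodes.length))

    // Step 7: each partition opens a socket to "its" node, where
    // copy ... with source SPARK(port='12345', ...) is listening.
    users.foreachPartition { iter =>
      val node = verticaNodes(TaskContext.get.partitionId)
      val socket = new Socket(node, 12345)
      val out = new PrintWriter(socket.getOutputStream)
      iter.foreach { case (userGId, userId) => out.println(s"$userGId|$userId") }
      out.close()
      socket.close()
    }
    sc.stop()
  }
}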
AGENDA
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7min 35sec on an 8 Node Cluster
Key Takeaways
Parallel Kafka reads into Spark RDDs (in memory) with parallel writes to Vertica via a TCP server: ROCKS!
COPY 36 TB/hour with an 81-node cluster
No ephemeral nodes needed for ingest
Kafka read parallelism maps to Spark RDD partitions
A priori hash() in Spark RDD partitions (in memory)
TCP server as a Vertica User-Defined Copy Source
A single COPY does not preallocate memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 Nodes ( 45 Ingest Nodes + 215 Data Nodes [225 ?] )
THANK YOU!
Q & A
Jack Gudenkauf
VP Big Data
https://twitter.com/_JG
https://www.linkedin.com/in/jackglinkedin
Backup Slides
PARALLEL COPY BENCHMARK
8 Node Cluster with Parallelism of 4
①copy jackg.CORE_SESSION_START_0 with source SPARK(port='12345', nodes='node0001:4,node0002:4,node0003:4,node0004:4,node0005:4,node0006:4,node0007:4,node0008:4') direct;
②copy jackg.CORE_SESSION_START_1 with source SPARK…
③copy jackg.CORE_SESSION_START_2 with source SPARK…
④copy jackg.CORE_SESSION_START_3 with source SPARK…
Netcat the pipe-delimited text files to Vertica hosts 10.91.101.19x on port 12345
nc 10.91.101.194 12345 < xad &
`split` file(s) for Reads
-rw-r--r-- 1 jgudenkauf __USERS__ 3925357079 Jul 2 20:16 xad
Vertica Parallel Performance

Total size (bytes) of all delimited text files | Record count | Duration | Method tested
451,358,287,648 | 2,420,989,007 | 16m 26s | ParallelExporter (Marketplace): read from Vertica, write to local node files
451,358,287,648 | 2,420,989,007 | 20m 49s * | COPY command using all nodes local: pre-hashed files on Vertica local disks for read, write to Vertica
451,358,287,648 | 2,420,989,007 | 24m 16s ** | Parallel INSERT DIRECT SELECT where hash() = local node: parallel reads and writes within the Vertica cluster (no flat files)
451,358,287,648 | 2,420,989,007 | too slow, and pipes broke | cat file into COPY ... FROM stdin
Spark-Streaming-Kafka

package com.playtika.data.ulib

import com.playtika.data.ulib.vertica._
import com.playtika.data.ulib.spark.RddExtensions._
import com.playtika.data.ulib.spark.streaming._
import com.playtika.data.ulib.etl._

import kafka.serializer.DefaultDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object SparkStreamingExample extends Logging {
  type JavaMap[T, U] = java.util.Map[T, U]

  private val deserializer = Deserializer.json[JavaMap[String, Any]]

  def main(args: Array[String]): Unit = {
    // Recover the context from the checkpoint dir if one exists, else build it fresh.
    val streamingContext = StreamingContext.getOrCreate("/tmp/ulib-kafka-streaming", createStreamingContext)
    streamingContext.start()
    streamingContext.awaitTermination()
  }

  def createStreamingContext(): StreamingContext = {
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .configureEtlExtensions()
      .configureVerticaExtensions()
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    val config = Map[String, String](
      "metadata.broker.list" -> "kafka-br01-dw-dev.smo.internal:9092,kafka-br02-dw-dev.smo.internal:9092,kafka-br03-dw-dev.smo.internal:9092"
    )
    val topics = Set[String]("bingoblitz")
    // Direct stream: one Spark partition per Kafka partition, offsets tracked explicitly.
    KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, config, topics)
      .foreachRDD(rdd => {
        rdd.eventProcessorLoop()
      })
    ssc.checkpoint("/tmp/ulib-etl/checkpoints")
    ssc
  }
}
package com.playtika.data.ulib.spark

import org.apache.spark.SparkConf

case class VerticaSparkConf(conf: SparkConf) extends SparkConfWrapper {
  def configureVerticaExtensions(): SparkConf = {
    import com.playtika.data.ulib.vertica._
    conf.registerKryoClasses(Array(
      classOf[ClusterConfig],
      classOf[ClusterContext],
      classOf[ProjectionContext],
      classOf[TableContext]
    ))
  }
}
JACKG GITHUB
https://github.com/jackghm/Vertica/wiki/Optimize-Tables-Not-Queries
Optimize Tables, Not Queries
A MODEL I DEVELOPED AT TWITTER
Custom hash()
CREATE PROJECTION jackg.CORE_SESSION_START_3_P1 /*+createtype(L)*/
(
CORE_SESSION_START_GID,
UUID,
EVENT ENCODING RLE,
EVENTTIMESTAMP,
DIM_DATE_GID ENCODING RLE,
DIM_APP_GID ENCODING RLE,
SESSIONID,
DIM_EVENT_CATEGORY_GID ENCODING RLE,
DIM_EVENT_TYPE_GID ENCODING RLE,
DIM_EVENT_SUBTYPE_GID ENCODING RLE,
DIM_USER_GID,
DIM_PLATFORM_GID ENCODING RLE,
DIM_APP_VERSION_GID ENCODING RLE,
INTERNALSESSIONID,
LOCATION_IPADDRESS,
LOCATION_LOCALE ENCODING RLE,
TRIGGER_EPOCH ENCODING RLE
)
AS
SELECT CORE_SESSION_START_GID,
UUID,
EVENT,
EVENTTIMESTAMP,
DIM_DATE_GID,
DIM_APP_GID,
SESSIONID,
DIM_EVENT_CATEGORY_GID,
DIM_EVENT_TYPE_GID,
DIM_EVENT_SUBTYPE_GID,
DIM_USER_GID,
DIM_PLATFORM_GID,
DIM_APP_VERSION_GID,
INTERNALSESSIONID,
LOCATION_IPADDRESS,
LOCATION_LOCALE,
TRIGGER_EPOCH
FROM jackg.CORE_SESSION_START_3
ORDER BY DIM_EVENT_SUBTYPE_GID,
DIM_APP_GID,
DIM_EVENT_CATEGORY_GID,
DIM_EVENT_TYPE_GID,
EVENT,
LOCATION_LOCALE,
DIM_PLATFORM_GID,
DIM_APP_VERSION_GID,
DIM_DATE_GID,
TRIGGER_EPOCH,
DIM_USER_GID,
CORE_SESSION_START_GID
SEGMENTED BY hash(DIM_USER_GID) ALL NODES KSAFE 1;
CREATE TABLE cbowden.chash_test
(
dim_date_gid int NOT NULL,
dim_user_gid int NOT NULL,
uuid char(36) NOT NULL,
chash_dim_user_gid int DEFAULT CHASH(dim_user_gid::varchar)
)
PARTITION BY ((chash_test.dim_date_gid / 100)::int);
CREATE PROJECTION cbowden.chash_test_p1
(
uuid,
dim_user_gid,
chash_dim_user_gid,
dim_date_gid ENCODING RLE
)
AS
SELECT chash_test.uuid,
chash_test.chash_dim_user_gid,
chash_test.dim_user_gid,
chash_test.dim_date_gid
FROM cbowden.chash_test
ORDER BY chash_test.dim_date_gid,
chash_test.chash_dim_user_gid,
chash_test.dim_user_gid,
chash_test.uuid
SEGMENTED BY chash_test.chash_dim_user_gid ALL NODES KSAFE 1;
MISC
C++ SDK Vertica::UDSource implemented as
CREATE SOURCE SPARK AS LANGUAGE 'C++' NAME 'TcpServerSourceFactory' LIBRARY ULIB;
Makefile
install:
	$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE LIBRARY ULIB AS '$(PWD)/bin/ulib.so' LANGUAGE 'C++';"
	$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE SOURCE SPARK AS LANGUAGE 'C++' NAME 'TcpServerSourceFactory' LIBRARY ULIB;"
	$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE FUNCTION HASH_SEGMENTATION AS LANGUAGE 'C++' NAME 'HashSegmentationFactory' LIBRARY ULIB;"
SELECT get_projection_segments('jackg.CORE_SESSION_START_APPDATA_P1_b0'); -- high to low segment range by node
v_calamari_node0001|v_calamari_node0002|v_calamari_node0003|v_calamari_node0004|v_calamari_node0005|v_calamari_node0006|v_calamari_node0007|v_calamari_node0008
536870911 | 1073741823 | 1610612735 | 2147483647 | 2684354559 | 3221225471 | 3758096383 | 4294967295
0 | 536870912 | 1073741824 | 1610612736 | 2147483648 | 2684354560 | 3221225472 | 3758096384
-- hash the UserID then get the segmentation then get the node the record is stored on
select a.dim_app_gid, a.dim_user_gid, a.core_session_start_gid, a.sessionId
, SEGMENTATION_NODE(HASH_SEGMENTATION(hash(a.dim_user_gid))) as a_node
, SEGMENTATION_NODE(HASH_SEGMENTATION(hash(b.dim_user_gid))) as b_node
from blitz.core_session_start a
join blitz.dim_user b
using (dim_user_gid)
where a.dim_date_gid = 20150701 and sessionid is not null limit 3