My new industry acronym: PSTL
the Parallelized Streaming Transformation Loader (Pron. PiSToL) is an architecture for highly scalable and reliable, data ingestion pipelines
While there is guidance on using; Apache Kafka™ for Streaming (or non-Streaming), and Apache Spark™ for Transformations, and Loading data (e.g., COPY) into an HP-Vertica™ columnar Data Warehouse, there is very little prescriptive guidance on how to truly parallelize a unified data pipeline - until now.
4. 4
PLAYTIKA
Founded in 2010
Social Casino global category leader
10 games
13 platforms
1000+ employees
5. A trifecta requires you to select the first three finishers in order
and can lead to big pay-offs.
Boxing lets your selections come in any order and you still win.
Kafka - Spark - Vertica
Placing Your Trifecta Box Bet on Kafka, Spark, and HP Vertica
6. https://www.linkedin.com/in/jackglinkedin
6
MY BACKGROUND
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL), Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
Dataflex a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercept Dataflex db calls to store and retrieve data to/from Btrieve and IBM DB2 Mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework, Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
7. 7
A QUEST
With attributes of
Operational Robustness
High Availabilty
Stronger durability guarantees
Idempotent (an operation that is safe to repeat)
Productivity
Analytics
Streaming, Machine Learning, BI, BA, Data Science
Rich Development env.
Strongly typed, OO, Functional, with support for set based logic and aggregations (SQL)
Performance
Scalable in every tier
MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
8. REST API
Flume
Apache Flume™
ETL
JAVA ™
Parser & Loader
MPP Columnar DW
HP Vertica™
Cluster
UserId <-> UserGId
Analytics of Relational Data
Structured Relational and Aggregated Data
Application
Application
Game
Applications
UserId: INT
SessionId: UUId (36)
Bingo Blitz
UserId: Int
SessionId: UUId (32)
Slotomania
UserId: varchar(32)
SessionId: varchar(255)
WSOP
C O P Y
Playtika Santa Monica
original ETL Architecture
Extract Transform Load
Single Source of Truths to Global SOT
Unified Schema
JSON
Local Data Warehouses
Original Architecture (ETL)
1
2 3 4
5
10. Real-Time
Messaging
Apache Kafka™
Cluster
Analytics of [semi]Structured [non]Relational Data Stores
Real-Time Streaming ✓Machine Learning
✓ Semi-Structured Raw JSON Data
✖Structured (non)relational Parquet Data
Structured Relational and Aggregated Data
ETL
Resilient Distributed Datasets
Apache Spark™ Hadoop™ Parquet™
Cluster
✓ ✓ ✖
REST API
Or Local Kafka
Application
Application
Game
Applications
UserId: INT
SessionId: UUId (36)
Bingo Blitz
UserId: Int
SessionId: UUId (32)
Slotomania
UserId: varchar(32)
SessionId: varchar(255)
WSOP
Unified Schema
JSON
Local Data Warehouses
PSTL is the new ETL
MPP Columnar DW
HP Vertica™
Cluster
MPP
1 2 3
P a r a l l e l i z e d S t r e a m i n g T r a n s f o r m a t i o n L o a d e r4
5
New PSTL ArchitectureNew PSTL Architecture
12. Apache Kafka ™
is a distributed, partitioned, replicated commit log service
Producer Producer Producer
Kafka Cluster
(Broker)
Consumer Consumer Consumer
13. A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log that looks like
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time.
Kafka is Not a message Queue (Push/Pop)
Apache Kafka ™
14. SPARK RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Node 1 Node 2 Node 3 Node …
RDD 1
RDD 1
Partition 1
RDD 1
Partition 2
RDD 1
Partition 3
RDD 3
RDD 3
Partition 2
RDD 3
Partition 3
RDD 3
Partition 1
RDD 2
RDD 2
Partition 1 to 64
RDD 2
Partition 65 to 128
RDD 2
Partition
193 to 256
RDD 2
Partition
129 to 192
19. Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7min 35sec on an 8 Node Cluster
Key Takeaways
Parallel Kafka Reads to Spark RDD (in memory) with
Parallel writes to a Vertica via tcp server – ROCKS!
COPY 36 TB/Hour with 81 Node cluster
No ephemeral nodes needed for ingest
Kafka read parallelism to Spark RDD partitions
A priori hash() in Spark RDD Partitions (in Memory)
TCP Server as a Vertica User Define Copy Source
Single COPY does not preallocate Memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 Nodes ( 45 Ingest Nodes + 215 Data Nodes [225 ?] )
20. THANK YOU!
Q & A
Jack Gudenkauf
VP Big Data
https://twitter.com/_JG
https://www.linkedin.com/in/jackglinkedin
22. PARALLEL COPY BENCHMARK
8 Node Cluster with Parallelism of 4 22
①copy jackg.CORE_SESSION_START_0 with source SPARK(port='12345', nodes='node0001:4,node0002:4,node0003:4,node0004:4,node0005:4,node0006:4,node0007:4,node0008:4') direct;
②copy jackg.CORE_SESSION_START_1 with source SPARK…
③copy jackg.CORE_SESSION_START_2 with source SPARK…
④copy jackg.CORE_SESSION_START_3 with source SPARK…
Netcat the pipe delimited text files to vertica hosts 10.91.101.19x on port 12345
nc 10.91.101.194 12345 < xad &
`split` file(s) for Reads
-rw-r--r-- 1 jgudenkauf __USERS__ 3925357079 Jul 2 20:16 xad
23. 23
Vertica Parallel Performance
Total size in bytes of
all delimited text files
Record Count Duration Method Tested
451,358,287,648 2,420,989,007 16m26sec ParallelExporter (Market Place).
Read Vertica, Write to local node files
451,358,287,648 2,420,989,007 20m49sec * COPY command using all nodes local. Used
Pre-Hashed files on Vertica local files for read,
Write to Vertica
451,358,287,648 2,420,989,007 24min16sec ** Parallel INSERT DIRECT SELECT where hash()
= Local Node. Parallel reads & Writes In Vertica
Cluster (no flat files)
451,358,287,648 2,420,989,007 Toooo Slow and Pipes
Broke
cat file COPY… stdin
24. Spark-Streaming-Kafka 24
package com.playtika.data.ulib
import com.playtika.data.ulib.vertica._
import com.playtika.data.ulib.spark.RddExtensions._
import com.playtika.data.ulib.spark.streaming._
import com.playtika.data.ulib.etl._
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
object SparkStreamingExample extends Logging {
type JavaMap[T, U] = java.util.Map[T, U]
private val deserializer = Deserializer.json[JavaMap[String, Any]]
def main(args: Array[String]): Unit = {
val streamingContext = StreamingContext.getOrCreate("/tmp/ulib-kafka-streaming", createStreamingContext)
streamingContext.start()
streamingContext.awaitTermination()
}
def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KyroSerializer")
.configureEtlExtensions()
.configureVerticaExtensions()
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val config = Map[String, String](
"metadata.broker.list" -> "kafka-br01-dw-dev.smo.internal:9092,kafka-br02-dw-dev.smo.internal:9092,kafka-br03-dw-
dev.smo.internal:9092"
)
val topics = Set[String]("bingoblitz")
KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, config, topics)
.foreachRDD(rdd => {
rdd.eventProcessorLoop()
})
ssc.checkpoint("/tmp/ulib-etl/checkpoints")
ssc
}
}
27. Custom hash() 27
CREATE PROJECTION jackg.CORE_SESSION_START_3_P1 /*+createtype(L)*/
(
CORE_SESSION_START_GID,
UUID,
EVENT ENCODING RLE,
EVENTTIMESTAMP,
DIM_DATE_GID ENCODING RLE,
DIM_APP_GID ENCODING RLE,
SESSIONID,
DIM_EVENT_CATEGORY_GID ENCODING RLE,
DIM_EVENT_TYPE_GID ENCODING RLE,
DIM_EVENT_SUBTYPE_GID ENCODING RLE,
DIM_USER_GID,
DIM_PLATFORM_GID ENCODING RLE,
DIM_APP_VERSION_GID ENCODING RLE,
INTERNALSESSIONID,
LOCATION_IPADDRESS,
LOCATION_LOCALE ENCODING RLE,
TRIGGER_EPOCH ENCODING RLE
)
AS
SELECT CORE_SESSION_START_GID,
UUID,
EVENT,
EVENTTIMESTAMP,
DIM_DATE_GID,
DIM_APP_GID,
SESSIONID,
DIM_EVENT_CATEGORY_GID,
DIM_EVENT_TYPE_GID,
DIM_EVENT_SUBTYPE_GID,
DIM_USER_GID,
DIM_PLATFORM_GID,
DIM_APP_VERSION_GID,
INTERNALSESSIONID,
LOCATION_IPADDRESS,
LOCATION_LOCALE,
TRIGGER_EPOCH
FROM jackg.CORE_SESSION_START_3
ORDER BY DIM_EVENT_SUBTYPE_GID,
DIM_APP_GID,
DIM_EVENT_CATEGORY_GID,
DIM_EVENT_TYPE_GID,
EVENT,
LOCATION_LOCALE,
DIM_PLATFORM_GID,
DIM_APP_VERSION_GID,
DIM_DATE_GID,
TRIGGER_EPOCH,
DIM_USER_GID,
CORE_SESSION_START_GID
SEGMENTED BY hash(DIM_USER_GID) ALL NODES KSAFE 1;
CREATE TABLE cbowden.chash_test
(
dim_date_gid int NOT NULL,
dim_user_gid int NOT NULL,
uuid char(36) NOT NULL,
chash_dim_user_gid int DEFAULT CHASH(dim_user_gid::varchar)
)
PARTITION BY ((chash_test.dim_date_gid / 100)::int);
CREATE PROJECTION cbowden.chash_test_p1
(
uuid,
dim_user_gid,
chash_dim_user_gid,
dim_date_gid ENCODING RLE
)
AS
SELECT chash_test.uuid,
chash_test.chash_dim_user_gid,
chash_test.dim_user_gid,
chash_test.dim_date_gid
FROM cbowden.chash_test
ORDER BY chash_test.dim_date_gid,
chash_test.chash_dim_user_gid,
chash_test.dim_user_gid,
chash_test.uuid
SEGMENTED BY chash_test.chash_dim_user_gid ALL NODES KSAFE 1;
28. FOOTER 28
MISC
C++ SDK Vertica::UDSource implemented as
CREATE SOURCE SPARK AS LANGUAGE 'C++' NAME 'TcpServerSourceFactory' LIBRARY ULIB;
Makefile
install:
$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE LIBRARY ULIB AS '$(PWD)/bin/ulib.so' LANGUAGE 'C++';"
$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE SOURCE SPARK AS LANGUAGE 'C++' NAME 'TcpServerSourceFactory' LIBRARY ULIB;"
$(VSQL) -U $(VSQL_USER) -w $(VSQL_PASS) -c "CREATE FUNCTION HASH_SEGMENTATION AS LANGUAGE 'C++' NAME 'HashSegmentationFactory'
LIBRARY ULIB;"
SELECT get_projection_segments('jackg.CORE_SESSION_START_APPDATA_P1_b0'); -- high to low segment range by node
v_calamari_node0001|v_calamari_node0002|v_calamari_node0003|v_calamari_node0004|v_calamari_node0005|v_calamari_node0006|v_calamari_node0007|v_calamari_node0008
536870911 | 1073741823|1610612735|2147483647|2684354559|3221225471|3758096383|4294967295
0 | 536870912 |1073741824|1610612736|2147483648|2684354560|3221225472|3758096384
-- hash the UserID then get the segmentation then get the node the record is stored on
select a.dim_app_gid, a.dim_user_gid, a.core_session_start_gid, a.sessionId
, SEGMENTATION_NODE(HASH_SEGMENTATION(hash(a.dim_user_gid))) as a_node
, SEGMENTATION_NODE(HASH_SEGMENTATION(hash(b.dim_user_gid))) as b_node
from blitz.core_session_start a
join blitz.dim_user b
using (dim_user_gid)
where a.dim_date_gid = 20150701 and sessionid is not null limit 3