At the end of the day, data scientists want one thing: tabular data for their analysis.
They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data
being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists
can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds).
Oh... and there are a bunch more data sources to ingest, and the current providers of data keep changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices
and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using
Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple. Add structure (schema) to the data as soon as possible.
Allow the providers of the data to dictate the structure. And automatically create event-based and state-based tables (DDL)
for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
2. OUR SPEAKERS
Hao Zou
Software Engineer
Data Science & Engineering
GoPro
David Winters
Big Data Architect
Data Science & Engineering
GoPro
3. TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
8. DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
15. PROS AND CONS OF OLD SYSTEM
• Pros
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Cons
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
16. NEW DYNAMIC DDL ARCHITECTURE
[Architecture diagram: data arrives via streaming, a batch induction framework, and downloads (REST API, FTP downloads, S3 sync); a real-time cluster and an ephemeral ETL cluster write Events + State, Aggregates, and Parquet + DDL (Dynamic DDL!) to an Amazon S3 bucket and the Hive Metastore; ephemeral Data Mart clusters #1..#N serve Notebooks, Tableau, Plotly, Python, and R.]
Improvements:
• Single copy of data
• Separate storage from compute
• Elastic clusters
• Single long-running cluster to maintain
18. NEW DYNAMIC DDL ARCHITECTURE
[Diagram: an ELB/HTTP endpoint feeds the streaming cluster, a pipeline for processing of streaming logs, which writes to S3; the transition introduces a centralized Hive Metastore.]
For each topic, dynamically add the table structure and create the table, or insert data into the table if it already exists.
19. DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema) to the data on the fly whenever the providers of the data change their structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard coded and has to be manually updated based on changes in the incoming data.
• All of the aggregation SQL would have to be manually updated due to the schema change.
• Faster turnaround for data ingestion: data can be ingested and made available within minutes (sometimes seconds).
• How did we do it?
• Using Spark SQL/DataFrames
• See the example below
20. DYNAMIC DDL
• Example input event:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
• Flatten the data first into a fixed schema of key/value rows:
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
• Then pivot the rows into a dynamically generated schema with the SQL below (a Spark sketch of generating this pivot follows the SQL):
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE NULL END) AS data_record_state,
       MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE NULL END) AS data_record_last_name,
       MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE NULL END) AS data_record_first_name,
       MAX(CASE WHEN record_key = 'city' THEN record_value ELSE NULL END) AS data_record_city,
       id AS data_record_id, log_ts AS data_log_ts
FROM test
GROUP BY id, log_ts
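A minimal Spark (Scala) sketch of how this pivot could be generated dynamically rather than hard coded; the DataFrame, input path, and column names here are assumptions for illustration, not the exact GoPro code:

// Hypothetical sketch: derive the pivot columns from the keys actually present in the data.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dynamic-ddl-pivot").enableHiveSupport().getOrCreate()

// Assumed input: flattened rows with columns record_key, record_value, id, log_ts.
val flattened = spark.read.json("s3a://my-bucket/landing/flattened/")   // placeholder path
flattened.createOrReplaceTempView("test")

// One MAX(CASE ...) projection per observed key, mirroring the SQL above.
val keys = flattened.select("record_key").distinct().collect().map(_.getString(0))
val pivotExprs = keys.map { k =>
  s"MAX(CASE WHEN record_key = '$k' THEN record_value ELSE NULL END) AS data_record_$k"
}

val pivoted = spark.sql(
  s"""SELECT ${pivotExprs.mkString(", ")},
     |       id AS data_record_id, log_ts AS data_log_ts
     |FROM test
     |GROUP BY id, log_ts""".stripMargin)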
21. DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
• Add the partition columns
• Manually create the table due to a bug in Spark (SPARK-9761); a rough sketch of this step follows
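The snippet itself is an image in the original deck; a rough Scala sketch of the idea (the table name, partition column, and helper names are hypothetical) might look like:

// Hypothetical sketch: create the destination table manually from the DataFrame's schema
// (see SPARK-9761), with the partition column declared explicitly.
import org.apache.spark.sql.{DataFrame, SparkSession}

def createTableIfMissing(spark: SparkSession, df: DataFrame, table: String): Unit = {
  if (!spark.catalog.tableExists(table)) {
    // All data columns come from the incoming DataFrame; "partition_date" is the assumed partition column.
    val dataCols = df.schema.fields
      .filterNot(_.name == "partition_date")
      .map(f => s"`${f.name}` ${f.dataType.simpleString}")
      .mkString(", ")
    spark.sql(
      s"""CREATE TABLE $table ($dataCols)
         |PARTITIONED BY (`partition_date` STRING)
         |STORED AS PARQUET""".stripMargin)
  }
}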
22. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new columns that exist in the incoming data frame but do not exist yet in the destination table (a rough sketch follows)
This syntax no longer works after upgrading to Spark 2.x
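A minimal sketch of that step (schema diff plus ALTER TABLE), with hypothetical names; as noted above, this plain ALTER TABLE path broke for us on Spark 2.x:

// Hypothetical sketch: add columns that are present in the incoming DataFrame
// but missing from the destination table.
import org.apache.spark.sql.{DataFrame, SparkSession}

def addMissingColumns(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val existing = spark.table(table).schema.fieldNames.map(_.toLowerCase).toSet
  val missing  = incoming.schema.fields.filterNot(f => existing.contains(f.name.toLowerCase))
  if (missing.nonEmpty) {
    val colsDdl = missing.map(f => s"`${f.name}` ${f.dataType.simpleString}").mkString(", ")
    // This is the statement that stopped working after the Spark 2.x upgrade;
    // see the workarounds on the next slide.
    spark.sql(s"ALTER TABLE $table ADD COLUMNS ($colsDdl)")
  }
}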
23. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary ways to solve the problem in Spark 2.x (a sketch of the first option follows this list):
• Launch a HiveServer2 service, then use JDBC to call Hive to alter the table
• Use Spark to connect directly to the Hive Metastore, then update the metadata
• Update the Spark source code to support the ALTER TABLE syntax and repackage it
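A minimal sketch of the first workaround (the connection details are hypothetical, and the Hive JDBC driver must be on the classpath):

// Hypothetical sketch: issue the ALTER TABLE through HiveServer2 over JDBC,
// bypassing Spark's DDL parser entirely.
import java.sql.DriverManager

def alterTableViaHive(ddl: String): Unit = {
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2-host:10000/default", "etl_user", "")
  try {
    val stmt = conn.createStatement()
    try stmt.execute(ddl) finally stmt.close()
  } finally conn.close()
}

// Example usage:
// alterTableViaHive("ALTER TABLE events ADD COLUMNS (`data_record_city` STRING)")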
24. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Project all columns from the table
Append the data into the destination table (a rough sketch of both steps follows)
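A minimal sketch of those two steps, again with hypothetical names:

// Hypothetical sketch: project the incoming DataFrame onto the destination table's
// column order (filling missing columns with nulls), then append it.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

def appendToTable(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val target       = spark.table(table).schema
  val incomingCols = incoming.columns.map(_.toLowerCase).toSet
  val projected = incoming.select(target.fields.map { f =>
    if (incomingCols.contains(f.name.toLowerCase)) col(f.name)
    else lit(null).cast(f.dataType).as(f.name)
  }: _*)
  // insertInto matches columns by position, hence the projection in table order above.
  projected.write.mode("append").insertInto(table)
}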
25. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL table with a new partition key (tuning tips; a sketch follows):
• Choose the partition key wisely
• Use coalesce if there are too many partitions
• Use coalesce to control the number of job tasks
• Use filter if the data is still too large
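A minimal sketch of reprocessing under a new partition key with those tips applied (the table, column, and path names are placeholders):

// Hypothetical sketch: bound the input with a filter, coalesce to control the number
// of tasks/output files, and rewrite the data under the new partition key.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reprocess-partition-key").enableHiveSupport().getOrCreate()

spark.table("events")                          // placeholder table
  .filter("data_log_ts >= '2016-07-01'")       // reprocess a bounded slice if the full table is too large
  .coalesce(200)                               // avoid an explosion of small files/tasks
  .write
  .mode("overwrite")
  .partitionBy("partition_date")               // the new partition key (assumed column)
  .parquet("s3a://my-bucket/events_by_date/")  // placeholder output path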
27. USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
28. USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but
more throughput through parallel operations)
• Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism (see the sketch after this list).
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
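A minimal sketch of the parallel-write pattern described above (paths and partition counts are placeholders):

// Hypothetical sketch: accumulate data, then write it to S3 with many parallel tasks.
// Each task pays the per-request S3 latency, but the uploads run concurrently.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-parallel-write").getOrCreate()

val events = spark.read.json("hdfs:///landing/events/")   // placeholder input

events.repartition(64)                    // more partitions => more concurrent S3 uploads
  .write
  .mode("append")
  .parquet("s3a://my-bucket/events/")     // placeholder bucket/path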
29. USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload); a configuration sketch follows this list.
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics
when streaming files.
• For S3, moves/renames are copy and delete operations which can be
very slow especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
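A sketch of setting a few S3A options from Spark; the property names follow the hadoop-aws documentation linked above, so treat them as assumptions and check them against your Hadoop version:

// Hypothetical sketch: enable the S3A fast-upload path and larger multipart buffers.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-tuning").getOrCreate()
val hconf = spark.sparkContext.hadoopConfiguration

hconf.set("fs.s3a.fast.upload", "true")          // incremental, parallel multipart uploads
hconf.set("fs.s3a.fast.upload.buffer", "disk")   // or "array" (on-heap) / "bytebuffer" (off-heap)
hconf.set("fs.s3a.multipart.size", "134217728")  // 128 MB parts, matching large HDFS blocks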
GoPro is the ultimate accessory for active people with smartphones.
Products and services include:
Core products -- Cameras
Advanced Solutions – Karma, stabilization and VR
Accessories and Mounts
Software suite that ties it all together
GoPro has become the ultimate, end-to-end storytelling solution.
This is what we wanted to build.
These were the challenges to solve.
High Level Architecture of Data Platform
Isolation of workloads: 3 clusters (ingest, ETL, delivery)
Lambda architecture
Input and output data formats
Cadence of clusters
A word about Data Sources:
IoT data
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some Raw and Gzip, Some Binary and JSON
Some streaming and some batch
Batch data
Web marketing, campaigns
Social media
ERP
CRM
Lambda architecture
Both batch and stream processing
Basic needs/workloads in a Data Platform
High throughput ingestion
Transformations: joins, aggregations, etc.
Fast queries
Today, we have 3 clusters to isolate these workloads
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS
Custom servlet pushes logs into Kafka topics by environment
A series of Spark streaming jobs process the logs from Kafka
Landing place in ingestion cluster is HDFS with JSON flat files
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering policy of messages
Distributed
Fault-tolerant through replication
Support synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks); see the sketch after the next list
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
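A minimal sketch of that Kafka + Spark Streaming pairing (brokers, group id, topic names, and batch interval are placeholders; assumes the spark-streaming-kafka-0-10 integration):

// Hypothetical sketch: a direct Kafka stream, where each Kafka topic partition
// maps to one Spark partition/task per micro-batch.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("log-stream"), Seconds(60))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-processors",
      "auto.offset.reset"  -> "latest")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("device-logs"), kafkaParams))
    // Count records per micro-batch; real jobs would enrich, hash PII, and batch-write to HDFS.
    stream.map(_.value()).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}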
Keyword: Impedance mismatch
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic with each Kafka topic having a Spark Streaming job that consumes the log, processes the log, and writes the log to another topic
Tree like structure of jobs with more generic logic towards the root of the tree and more specialized logic moving towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP to Geo lookups, and batched writing to HDFS
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
A few things…
Two flows of data: streaming and batch
Join data sources
Aggregate data sources
Convert to compressed columnar format (gzipped Parquet files)
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely all Hive Map Reduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate Standard
Active Directory stores the groups. Corporate Standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
Store data in one place: Data (S3) + Structure (Hive Metastore)
Separate compute nodes from storage nodes
Elasticity: size of clusters and number of clusters
Lower operational overhead of maintaining HDFS storage nodes
Redirect batch ingest into stream ingest (pump batch data into Kafka). RESULT: one codebase for both stream and batch ingestion (see the sketch below)
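A minimal sketch of what pumping batch data into Kafka could look like (the broker, topic, and file names are placeholders):

// Hypothetical sketch: replay batch records through the same Kafka topics that
// the streaming pipeline consumes, so one codebase handles both paths.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object BatchToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      // One JSON record per line in the batch extract (placeholder file).
      Source.fromFile("crm_extract.json").getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("batch-crm", line))
      }
    } finally producer.close()
  }
}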
Even though the fixed schema resolves the problem of data providers changing their data structure frequently, it becomes really difficult for analysts and data scientists to analyze the data:
1. How would they know that these two rows come from one event?
2. They have to use the id to associate the two rows.
/*
 * The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata
 * gets out of sync with Hive's metadata if we let Spark automatically create the table and we later alter
 * the table and add columns to it. See below for more details:
 * https://issues.apache.org/jira/browse/SPARK-9761
 */