GoPro’s cameras, drones, and mobile devices, as well as its web and desktop applications, generate billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing decisions need to be delivered quickly and efficiently, and we need to visualize the metrics to find trends and anomalies.
While building a feature store for machine learning, we also need to visualize the features. Google Facets is an excellent project for feature visualization, but can it handle larger feature datasets?
These are issues we encountered at GoPro as part of our data platform evolution. In this talk, we will discuss some of the progress we have made: how we use Slack + Plot.ly to deliver analytics metrics and visualizations, and our work on visualizing large feature sets with Google Facets and Apache Spark.
2. ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
6. AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization with Google Facets and Spark
9. EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Feature adoption, user engagement, user segmentation, churn analysis, funnel analysis, conversion rate, etc.
• Web/E-Commerce Analytics
• Camera Analytics
• Scene change detection, feature usage, etc.
• Mobile Analytics
• Camera connections, story sharing etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classifications, recommendations, storage analysis.
15. BATCH JOBS
[Diagram: A Job Gateway submits scheduled production jobs (each with its own job.conf) to a new Spark cluster per job; dev machines submit dev jobs (with their own job.conf) to a new or existing Spark cluster.]
18. AIRFLOW SETUP
[Diagram: Two Airflow web servers (A and B) sit behind a load balancer; the scheduler and multiple workers communicate through a message queue and share the Airflow metastore. DAGs are pushed to S3 and synced down to the workers.]
19. TAKEAWAYS
• Key Changes
  • Centralized Hive metastore
  • Separated compute and storage needs
  • Leverage S3 as storage
  • Horizontal scaling with cluster elasticity
  • Less time spent managing infrastructure
• Key Benefits
  • Cost
    • Reduced redundant storage and compute costs
    • Can use smaller instance types
    • 60% AWS cost savings compared to 1 year ago
  • Operations
    • Reduced complexity of DevOps support
  • Analytics tools
    • SQL only => notebooks with SQL, Python, and Scala
22. BATCH INGESTION
[Diagram: GoPro product data and third-party data arrive via REST APIs, sftp, and s3 sync as batch downloads (input file formats: CSV, JSON) and are processed by a new Spark cluster per job.]
23. TABLE WRITER JOBS
• Jobs are identified by JobType, JobName, and JobConfig
• The majority of the Spark ETL jobs are table writers:
• Load data into a DataFrame
• DataFrame-to-DataFrame transformation
• Output the DataFrame to a Hive table
• The majority of table writer jobs can be decomposed into one of the following sub-jobs
24. TABLE WRITER JOBS
• Class hierarchy (with a CoreTableWriter mixin):
  • SparkJob
    • HiveTableWriter
      • JDBCToHiveTableWriter
      • FileToHiveTableWriter
        • AbstractCSVHiveTableWriter → CSVTableWriter → customized CSV jobs
        • AbstractJSONHiveTableWriter → JSONTableWriter → customized JSON jobs
      • HBaseToHiveTableWriter → HBaseSnapshotJob
      • TableToHiveTableWriter → TableSnapshotJob, aggregate jobs
• All jobs share the same configuration loading, job state, and error reporting
• All table writers have Dynamic DDL capabilities; once the data becomes a DataFrame, it behaves the same
• CSV and JSON have different loaders
• A different loader is needed to load HBase records into a DataFrame
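The decomposition above can be sketched as a small trait hierarchy. This is a simplified, hypothetical reconstruction: DataFrame is stubbed as a plain Scala type so the shape compiles without a Spark dependency, and the real classes additionally carry Spark contexts, configs, and the Dynamic DDL logic.

```scala
// Hypothetical sketch of the table-writer decomposition, with DataFrame
// stubbed so the hierarchy's shape is visible without a Spark dependency.
type DataFrame = Seq[Map[String, Any]] // stand-in for Spark's DataFrame

trait SparkJob {
  def run(jobType: String, jobName: String): Unit
}

// Every table writer follows the same load -> transform -> write pipeline.
trait HiveTableWriter extends SparkJob {
  def load(): DataFrame                        // source-specific: CSV, JSON, JDBC, HBase...
  def transform(df: DataFrame): DataFrame = df // optional DataFrame-to-DataFrame step
  def write(df: DataFrame): Unit               // shared: output the DataFrame to a Hive table
  override def run(jobType: String, jobName: String): Unit =
    write(transform(load()))
}

// A concrete writer only needs to supply its source-specific loader.
class CsvTableWriter(rows: Seq[Map[String, Any]]) extends HiveTableWriter {
  var written: DataFrame = Seq.empty           // stub for the Hive-table output
  override def load(): DataFrame = rows
  override def write(df: DataFrame): Unit = { written = df }
}
```

This is why a CSV writer differs from a JSON writer only in its loader, and why all writers can share the same configuration loading, job state, and error reporting.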
31. DATA TRANSFORMATION
• Hive SQL over JDBC via beeline
• Suitable for non-Java/Scala/Python programmers
• Spark jobs
• Require Spark and Scala knowledge; need to set up the job, configurations, etc.
• Dynamic Scala scripts
• Scala as a scripting language: compile Scala at runtime, mixed with Spark SQL
32. SCALA SCRIPTS
• Define a special SparkJob: the Spark job code runner
• Loads Scala script files from a location specified in the config
• Dynamically compiles the Scala code into classes
• For each compiled class, runs the Spark job defined in the script
• Uses Twitter's Eval util, which dynamically evaluates Scala strings and files:
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>util-eval_2.11</artifactId>
    <version>6.24.0</version>
  </dependency>
33. SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
  private val LOG = LoggerFactory.getLogger(getClass)

  override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val jobFileNames: List[String] = //...
    jobFileNames.foreach { fileName =>
      // Compile the script file; if it evaluates to a SparkJob, run it
      val compiled: Option[Any] = evalFromFileName[Any](fileName)
      compiled.foreach {
        case job: SparkJob => job.run(sc, jobType, jobName, config)
        case other         => LOG.info(s"$fileName did not evaluate to a SparkJob: $other")
      }
    }
  }
}
35. SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define the DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    sqlContext.sql("""set hive.exec.dynamic.partition.mode=nonstrict
      set hive.enforce.bucketing=false
      set hive.auto.convert.join=false
      set hive.merge.mapredfiles=true""")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts), …""")
    // rest of code
  }
}
new CameraAggCaptureMainJob
38. DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack: make metrics more accessible to a broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization
39. BEDROCK: DATA VISUALIZATION & MANAGEMENT (WORK IN PROGRESS)
[Diagram: Metrics batch ingestion runs on a new or existing Spark cluster, and streaming ingestion on a long-running Spark cluster; both output metrics to BedRock, the data visualization & management application.]
42. SLACK METRICS DELIVERY
• Why Slack?
• Push vs. pull: easy access
• Avoids another login to view metrics
• When connected to Slack, you are already logged in
• Moves key metrics away from Tableau dashboards and puts metrics generation into the software engineering process
• SQL code is under source control
• The publishing job is scheduled and its performance is monitored
• Discussions, questions, and comments on specific metrics can happen directly in the channel with the people involved
43. SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private channels: Mobile/Cloud/Subscription/Web, etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
45. SLACK KPI DELIVERY ARCHITECTURE
[Diagram: A Slack Spark job saves metrics to a Hive table and generates metrics JSON; the Plot.ly JSON is sent via HTTP POST to a REST API server, which generates the graph and saves/returns the image. The job gets the image URL, builds the Slack message JSON, and posts it via webhooks.]
46. CONFIGURATION-DRIVEN
slack-plus-push-weekly { // job name
  persist-metrics = "true"
  channels {
    dse-metrics {
      post-urls {
        plus-metrics = "https://hooks.slack.com/services/XXXX"
        dse-metrics-test = "https://hooks.slack.com/services/XXXX"
      }
      plus-metrics { // metrics group
        // metrics in the same group are delivered together in one message
        // metrics in different groups are delivered as separate messages
        // override the above template with a specific name
      }
    }
  }
} // slack-plus-push-weekly
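Since the framework is configuration driven, a job can read a block like this with the Typesafe Config library, which the HOCON syntax above suggests. The parsing code below is an illustrative sketch, not the framework's actual parser; the key names mirror the example above.

```scala
// Illustrative sketch: reading the job configuration above with Typesafe Config.
import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.parseString("""
  slack-plus-push-weekly {
    persist-metrics = "true"
    channels {
      dse-metrics {
        post-urls {
          plus-metrics = "https://hooks.slack.com/services/XXXX"
        }
      }
    }
  }""")

val job = conf.getConfig("slack-plus-push-weekly")
val persistMetrics = job.getString("persist-metrics").toBoolean
// Hyphenated HOCON keys work directly in the path expression.
val webhookUrl = job.getString("channels.dse-metrics.post-urls.plus-metrics")
```

Adding a new metric then only requires a new SQL statement plus a new entry under its channel's metrics group, with no code change to the delivery job.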
48. SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering managers, executives, business owners, and product managers
• 100+ members have subscribed to the different channels since we launched the service
• Cons:
• Limited by Slack UI real estate: we can only display key metrics in a two-column format, which is only suitable for high-level summary metrics
51. FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview?
52. FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview" takes input feature data from any number of datasets, analyzes them feature by feature, and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
54. FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of:
• A feature statistics protocol buffer definition
• Feature statistics generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed by TypeScript code
• They can be embedded into Jupyter notebooks or webpages
• Feature statistics generation
• There are two implementations of stats generation: Python and JavaScript
• Python: uses numpy and pandas to generate stats
• JavaScript: generates the stats in JavaScript
• Both implementations run stats generation on a single node
56. FACETS OVERVIEW WITH SPARK
• Initial exploration attempt:
• Is it possible to generate stats for larger datasets while keeping the stats size small?
• Can we generate stats leveraging the distributed computing capability of Spark, instead of just using one node?
• Can we generate the stats in Spark and then consume them from Python and/or JavaScript?
58. PREPARE SPARK DATA FRAME
case class NamedDataFrame(name: String, data: DataFrame)

val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
                      NamedDataFrame(name = "test", test))
59. SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
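persistProto is not shown on the slide. Since the findings below mention that the Facets web component loads a base64-encoded protobuf string, one plausible implementation (a hypothetical sketch, not the actual code) serializes the proto's bytes and writes them out base64-encoded:

```scala
// Hypothetical sketch of persisting the stats proto. The Facets Overview
// component loads a base64-encoded serialized DatasetFeatureStatisticsList,
// so we encode the proto's bytes and write them to a text file.
import java.nio.file.{Files, Paths}
import java.util.Base64

def persistProtoBytes(protoBytes: Array[Byte], path: String): Unit = {
  val base64 = Base64.getEncoder.encodeToString(protoBytes)
  Files.write(Paths.get(path), base64.getBytes("UTF-8"))
}
```

With ScalaPB-style generated protobuf classes, protoBytes would come from something like proto.toByteArray; Python or the Facets HTML template can then read the file and pass the string to the visualization.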
63. INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We go through each feature in multiple passes; as the number of features grows, performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate the stats also determines the size of the generated protobuf file
• I haven't dived deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file, which won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript
64. WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what is the proper data size that generates a stats size that can be loaded into a browser or notebook?
• For example, one experiment: 300 features produced a 200 MB stats file
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular basis?
• How do we integrate this into production?
66. FINAL THOUGHTS
• We are still in the early stages of our data platform evolution
• We will continue to share our experience with you along the way
• Questions?
Thank You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro