Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro

Analytics metrics delivery and
ML Feature visualization
Evolution of Data Platform at GoPro

ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup

SF BIG
ANALYTICS
Upcoming Talks
SF BIG
ANALYTICS

AGENDA
• Business Use Cases
• Evalution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization Features with Google Facets and Spark

GROWING DATA NEED FROM GOPRO ECOSYSTEM

DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce Social Media/OTT
3rd party data
Product Insight
User segmentation
CRM/Marketing
/Personalization

EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• features adoptions, user engagements, User segmentation, churn analysis, funnel analysis,
conversion rate etc.
• Web/E-Commercial Analytics
• Camera Analytics
• Scene change detection, feature usages etc.
• Mobile Analytics
• Camera connections, story sharing etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classifications, recommendations, storage analysis.

DATA PLATFORM ARCHITECTURE TRANSFORMATION
Batch Ingestion
Framework
•Batch Ingestion
•Pre-processing
Streaming ingestion
Batch Ingestion
Cloud-Based Elastic Clusters
PLOT.LY SERVER
TABLEAU SERVER
EXTERNAL SERFVICE
Notebook
Rest API,
FTP
S3 sync,etc
Dynamic DDL
State Sync
Parquet

STREAMING PIPELINES
Spark Cluster
Long Running Cluster

BATCH JOBS
Job Gateway
Spark ClusterScheduled Jobs
New cluster per Job
Dev
Machines
Spark ClusterDev Jobs
New or existing cluster
Production
Job.conf
Dev
Job.conf

INTERACTIVE/NOTEBOOKS
Spark Cluster
Long Running Clusters
Notebooks Scripts
(SQL, Python, Scala)
Scheduled Notebook Jobs
auto-scale
mixed on-demand &
spot Instances

AIRFLOW SETUP
Web Server LB
Scheduler
Airflow Metastore
WorkersWorkers
Workers
Workers
Workers
Web Server B
Web Server LB
Web Server A
Message Queue
Airflow
DAGs
sync
Push DAGs to S3

TAKEAWAYS
• Key Changes
• Centralize hive meta store
• Separate compute and
storage needs
• Leverage S3 as storage
• Horizontal scale with cluster
elasticity
• Less time in managing
infrastructure
• Key Benefits
• Cost
• Reduce redundant storage, compute cost.
• Use the smaller instance types
• 60% AWS cost saving comparing to 1 year
ago
• Operation
• Reduce the complexity of DevOp Support
• Analytics tools
• SQL only => Notebook with (SQL, Python,
Scala)

CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
HIVE SQL  Spark

BATCH INGESTION
GoPro Product data
3rd Parties Data
3rd Parties Data
3rd Parties Data
Rest APIs
sftp
s3 sync
s3 sync
Batch Data Downloads Input File Formats: CSV, JSON
Spark Cluster
New cluster per Job

TABLE WRITER JOBS
• Job are identified by JobType, JobName, JobConfig
• Majority of the Spark ETL Jobs are Table Writers
• Load data into DataFrame
• DataFrame to DataFrame transformation
• Output DataFrame to Hive Table
• Majority of table writer Jobs can be de-composed as one of the
following sub jobs

TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized Json JobCustomized CSV Job
mixin
All jobs has the same way of configuration loading,
Job State and error reports
All table writers will have the Dynamic DDL
capabilities, as long as they becomes DataFrames,
they will be behave the same
CSV and JSON have
different loader
Need different Loader to
load HBase Record to
DataFrame
Aggregate Jobs

HIVE TABLE WRITER JOB
trait HiveTableWriter extends CoreHiveTableWriter with SparkJob {
def run(sc: SparkContext, jobType: String, jobName: String, config: Config)
def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)]
def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config)
def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo]
def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config)
def postProcess(….)
def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config) : Seq[InputOutputInfo]
def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]

ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf override JobType Conf
Job specifics includes
JobType
JobName
Input & output specification

ETL JOB CONFIGURATION
xyz {
process {}
input {
delimiter = ","
inputDirPattern = "s3a://teambucket/xyz/raw/production"
file.ext = "csv"
file.format = "csv"
date.format = "yyyy-MM-dd hh:mm:ss"
table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
}
output {
database = “mobile",
file.format = "parquet"
date.format = "yyyy-MM-dd hh:mm:ss"
partitions = 2
file.compression.codec.key = "spark.sql.parquet.compression.codec"
file.compression.codec.value = "gzip”
save.mode = ”append"
transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
}
post.process {
deleteSource = true
}
}
Save Mode
JobName
Input specification
output specification

Files needs to goto proper tables
TABLE NAME GENERATION
• Table Name Extractor
• From File Name
• From Directory Name
• Custom Plugin

EXTRACT TABLE NAMES
• From Table Name
• /databucket/3rdparty/ABC/campaign-20180212.csv
• From Directory Name
• /databucket/3rdparty/ABC/campaign/file-20180212.csv
• From ID Mapping
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz
• /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz
• /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
Configuration
• b2a932aeddbf0f11bae9573  mobile_ios
• ae6905b068c7beb08d681a  mobile_android
Table Extraction
• (mobile_ios, 2017-01-11)
• (mobile_android, 2018-02-17)
• (mobile_android, 2018-02-18)

Data Transformation
ETL With SQL & Scala

DATA TRANSFORMATION
• HSQL over JDBC via beeline
• Suitable for non-java/scala/python-programmers
• Spark Job
• Requires Spark and Scala knowledge, need to setup job, configurations etc.
• Dynamic Scala Scripts
• Scala as script, compile Scala at Runtime, mixed with Spark SQL

SCALA SCRIPTS
• Define a special SparkJob : Spark Job Code Runner
• Load Scala script files from specified location (defined by config)
• Dynamically compiles the scala code into classes
• For the compiled classes : run spark jobs defined in the scripts
• Twitter EVAL Util: Dynamically evaluates Scala strings and files.
• <groupId>com.twitter</groupId>
<artifactId>util-eval_2.11</artifactId>
<version>6.24.0</version>

SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
private val LOG = LoggerFactory.getLogger(getClass)
import collection.JavaConverters._
override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
val jobFileNames: List[String] = //...
jobFileNames.foreach{ x =>
val clazzes : Option[Any] = evalFromFileName[Any](x)
clazzes.foreach{c =>
c match {
case job: SparkJob => job.run(sc, jobType, jobName, config)
case _ => LOG.info("not match")
}
}
}
}
}

SCALA SCRIPTS
import com.twitter.util.Eval
def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = {
val fs = //get Hadoop File System …
eval(IOUtils.toString(fs.open(path), "UTF-8"))(header)
}
def eval[T](code: String)(implicit header: String = ""): Option[T] =
Try(Eval[T](header + "n" + code)).toOption

SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
val cameraCleanDataSchema = … //define DataFrame Schema
val = sqlContext.read.schema(ccameraCleanDataStageDFameraCleanDataSchema)
.json("s3a://databucket/camera/work/production/clean-events/final/*")
cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
sqlContext.sql( ""” set hive.exec.dynamic.partition.mode=nonstrict
set hive.enforce.bucketing=false
set hive.auto.convert.join=false
set hive.merge.mapredfiles=true""")
sqlContext.sql( """insert overwrite table work.camera_setting_shutter_dse_on
select row_number() over (partition by metadata_file_name order by log_ts) , …. “”” )
//rest of code
}
new CameraAggCaptureMainJob

Data Democratization,
Visualization and Data
Management

DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack : make metrics more accessible to broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization

Spark Cluster
New or existing cluster
Spark Cluster
Long Running Cluster
Metrics Batch Ingestion
Streaming Ingestion
Output Metrics
BedRock
DATA
VISUALIZATION
&
MANAGEMENT
Working in Progress

SLACK METRICS DELIVERY
xxxxxx
xxxxxxx
xxxxx xxxxxxxxxx
xxxxx
xxxxxxx xxxxxx xxxxxx
xxxxx
xxxxx
xxxx
xxxxxxxxxxxxxxxx

SLACK METRICS DELIVERY
• Why Slack ?
• Push vs. Pull -- Easy Access
• Avoid another Login when view metrics
• When Slack Connected, you are already login
• Move key metrics move away from Tableau Dashboard and put
metrics generation into software engineering process
• SQL code is under software control
• publishing job is scheduled and performance is monitored
• Discussion/Question/Comments on the specific metrics can be
done directly at the channel with people involved.

SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations

SLACK METRICS CONCEPTS
• Slack Job 
• Channels (private channels) 
• Metrics Groups 
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Options)
• Persistence (optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other Cache stores (Http Post)

BLACK KPI DELIVERY ARCHITECTURE
Slack message json
HTTP POST Rest API Server
Rest API Server
generate graphMetrics Json
Return Image
HTTP POST
Save/Get Image
Plot.ly json
Save Metrics to Hive Table
Slack Spark Job
Get Image URL
Webhooks

CONFIGURATION-DRIVEN
slack-plus-push-weekly { //job name
persist-metrics="true"
channels {
dse-metrics {
post-urls {
plus-metrics = "https://hooks.slack.com/services/XXXX"
dse-metrics-test = "https://hooks.slack.com/services/XXXX"
}
plus-metrics { //metrics group
//metrics in the same group will delivered as together in one message
//metrics in different groups will be delivered as separate messages
//overwrite above template with specific name
}
}
}
} //slack-plus-push-weekly

SLACK METRICS CONFIGURATION
slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { //Job, Channel, KPI Group
//…
weekly-capture-users-by-platform { //metrics name
slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
metric-period = "weekly”
slack-display.chartstyle { … }
query = ""” … """
compare.query = ""” … """
chart query = ""”… ""”
}
//rest of configuration
}

SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Can quickly deliver to engineering manager, executives, business owner and product
manager
• 100+ members subscribed different channels, since we launch the service
• Cons
• Limited by Slack UI Real-States, can only display key metrics in two-column formats,
only suitable for hive-level summary metrics

Machine Learning Feature
Visualization with Facets + Spark

FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview ?

FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview "takes input feature data from any number of datasets, analyzes them feature by
feature and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.

FACETS OVERVIEW IMPLEMENTATIONS
• The Facets-overview implementation is consists of
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed
by Typescript code
• It can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations for the stats generation: Python and Javascripts
• Python : using numpy, pandas to generate stats
• JavaScripts: using javascripts to generate stats
• Both implementations are running stats generation in brower

FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate larger datasets with small stats size ?
• can we generate stats leveraging distributed computing capability
of spark instead just using one node ?
• Can we generate the stats in Spark, and then used by Python
and/or Javascripts ?

FACETS OVERVIEW + SPARK
ScalaPB

PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile(”./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))

SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)

SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
features : Set[String] = Set.empty[String],
histgmCatLevelsCount:Option[Int]=None): DatasetFeatureStatisticsList

INITIAL FINDINGS
• Implementation
• 1st Pass implementation is not efficient
• We have to go through each feature multiple paths, with increase number of features, the
performance suffers, this limits number of features to be used
• The size of dataset used for generate stats also determines the size of the generated protobuf file
• I haven’t dive deeper into this as to what’s contributing the change of the size
• The combination of data size and feature size can produce a large file, which won’t fit in browser
• With Spark DataFrame, we can’t support Tensorflow Records
• The Base64-encoded protobuf String can be loaded by Python or Javascripts
• Protobuf binary file can also be loaded by Python
• But it somehow not be able to loaded by Javascripts.

WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what’s the proper size that
generate proper stats size that can be loaded into browser or notebook ?
• For example, One experiments: 300 Features  200MB size
• How do we efficiently partition the features so that can be viewable ?
• Data is changing : how can we incremental update the stats on the regular
basis ?
• How to integrate this into production?

PG #
RC Playbook: Your guide to
success at GoPro
FINAL THOUGHTS

FINAL THOUGHTS
• We are still in the earlier stage of Data Platform Evolution,
• We will continue to share we experience with you along the way.
• Questions ?
Thanks You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro

Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro

Similaire à Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro (20)

Plus de Chester Chen

Plus de Chester Chen (20)

Dernier

Dernier (20)

Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro