Analytics metrics delivery and
ML Feature visualization
Evolution of Data Platform at GoPro
ABOUT SPEAKER: CHESTER CHEN
• Head of Data Science & Engineering (DSE) at GoPro
• Prev. Director of Engineering, Alpine Data Labs
• Founder and Organizer of SF Big Analytics meetup
SF BIG ANALYTICS
Recent Talks
Upcoming Talks
AGENDA
• Business Use Cases
• Evolution of GoPro Data Platform
• Analytics Metrics Delivery via Slack
• ML Feature Visualization with Google Facets and Spark
GROWING DATA NEED FROM GOPRO ECOSYSTEM
DATA
Analytics
Platform
Consumer Devices GoPro Apps
E-Commerce Social Media/OTT
3rd party data
Product Insight
User segmentation
CRM/Marketing
/Personalization
EXAMPLES OF ANALYTICS USE CASES
• Product Analytics
• Feature adoption, user engagement, user segmentation, churn analysis, funnel analysis, conversion rate, etc.
• Web/E-Commerce Analytics
• Camera Analytics
• Scene change detection, feature usages etc.
• Mobile Analytics
• Camera connections, story sharing etc.
• GoPro Plus Analytics
• CRM Analytics
• Digital Marketing Analytics
• Social Media Analytics
• Cloud Media Analysis
• Media classifications, recommendations, storage analysis.
Evolution of Data Platform
EVOLUTION OF DATA PLATFORM
DATA PLATFORM ARCHITECTURE TRANSFORMATION
Batch Ingestion
Framework
• Batch Ingestion
• Pre-processing
Streaming ingestion
Batch Ingestion
Cloud-Based Elastic Clusters
PLOT.LY SERVER
TABLEAU SERVER
EXTERNAL SERVICES
Notebook
REST API, FTP, S3 sync, etc.
Dynamic DDL
State Sync
Parquet
STREAMING PIPELINES
Spark Cluster
Long Running Cluster
BATCH JOBS
Job Gateway
Spark Cluster (Scheduled Jobs)
New cluster per Job
Dev
Machines
Spark Cluster (Dev Jobs)
New or existing cluster
Production
Job.conf
Dev
Job.conf
INTERACTIVE/NOTEBOOKS
Spark Cluster
Long Running Clusters
Notebooks Scripts
(SQL, Python, Scala)
Scheduled Notebook Jobs
auto-scale
mixed on-demand &
spot Instances
AIRFLOW SETUP
Web Server LB
Scheduler
Airflow Metastore
Workers (×N)
Web Server A
Web Server B
Message Queue
Airflow
DAGs
sync
Push DAGs to S3
TAKEAWAYS
• Key Changes
• Centralized Hive metastore
• Separated compute and storage
• Leverage S3 as storage
• Horizontal scaling with cluster elasticity
• Less time spent managing infrastructure
• Key Benefits
• Cost
• Reduced redundant storage and compute costs
• Use of smaller instance types
• 60% AWS cost savings compared to one year ago
• Operation
• Reduced complexity of DevOps support
• Analytics tools
• SQL only => Notebook with (SQL, Python,
Scala)
CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK
HIVE SQL → Spark
EVOLUTION OF DATA PLATFORM
BATCH INGESTION
GoPro Product data
3rd Parties Data
3rd Parties Data
3rd Parties Data
Rest APIs
sftp
s3 sync
s3 sync
Batch Data Downloads Input File Formats: CSV, JSON
Spark Cluster
New cluster per Job
TABLE WRITER JOBS
• Jobs are identified by JobType, JobName, JobConfig
• Majority of the Spark ETL Jobs are Table Writers
• Load data into DataFrame
• DataFrame to DataFrame transformation
• Output DataFrame to Hive Table
• Most table writer jobs can be decomposed into one of the following sub-jobs
TABLE WRITER JOBS
SparkJob
HiveTableWriter
JDBCToHiveTableWriter
AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter
CSVTableWriter JSONTableWriter
FileToHiveTableWriter
HBaseToHiveTableWriter TableToHiveTableWriter
HBaseSnapshotJob
TableSnapshotJob
CoreTableWriter
Customized CSV Job / Customized JSON Job
mixin
All jobs share the same configuration loading, job state, and error reporting.
All table writers have Dynamic DDL capabilities; once the data becomes a DataFrame, they all behave the same.
CSV and JSON have different loaders.
A different loader is needed to load HBase records into a DataFrame.
Aggregate Jobs
HIVE TABLE WRITER JOB
trait HiveTableWriter extends CoreHiveTableWriter with SparkJob {
def run(sc: SparkContext, jobType: String, jobName: String, config: Config)
def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)]
def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config)
def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo]
def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config)
def postProcess(….)
def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config) : Seq[InputOutputInfo]
def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]
}
ETL JOB CONFIGURATION
gopro.dse.config.etl {
mobile-job {
conf {}
process {}
input {}
output {}
post.process {}
}
}
include classpath("conf/production/etl_mobile_quik.conf")
include classpath("conf/production/etl_mobile_capture.conf")
include classpath("conf/production/etl_mobile_product_events.conf")
Job-level conf overrides JobType conf.
Job specifics include:
JobType
JobName
Input & output specification
ETL JOB CONFIGURATION
xyz {
process {}
input {
delimiter = ","
inputDirPattern = "s3a://teambucket/xyz/raw/production"
file.ext = "csv"
file.format = "csv"
date.format = "yyyy-MM-dd hh:mm:ss"
table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName"
}
output {
database = "mobile",
file.format = "parquet"
date.format = "yyyy-MM-dd hh:mm:ss"
partitions = 2
file.compression.codec.key = "spark.sql.parquet.compression.codec"
file.compression.codec.value = "gzip"
save.mode = "append"
transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer]
}
post.process {
deleteSource = true
}
}
Save Mode
JobName
Input specification
output specification
Files need to go to the proper tables
TABLE NAME GENERATION
• Table Name Extractor
• From File Name
• From Directory Name
• Custom Plugin
EXTRACT TABLE NAMES
• From File Name
• /databucket/3rdparty/ABC/campaign-20180212.csv
• /databucket/3rdparty/ABC/campaign-20180213.csv
• /databucket/3rdparty/ABC/campaign-20180214.csv
• From Directory Name
• /databucket/3rdparty/ABC/campaign/file-20180212.csv
• /databucket/3rdparty/ABC/campaign/file-20180213.csv
• /databucket/3rdparty/ABC/campaign/file-20180214.csv
• From ID Mapping
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz
• /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz
• /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz
• /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• Table Name, File Date
• (campaign, 2018-02-12)
• (campaign, 2018-02-13)
• (campaign, 2018-02-14)
• Table Name, File Date
Configuration
• b2a932aeddbf0f11bae9573 → mobile_ios
• ae6905b068c7beb08d681a → mobile_android
Table Extraction
• (mobile_ios, 2017-01-11)
• (mobile_android, 2018-02-17)
• (mobile_android, 2018-02-18)
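The three extraction strategies above can be sketched as plain path-parsing functions. This is an illustrative sketch only; the object name and regexes below are assumptions, not the actual framework plugin classes.

```scala
import scala.util.matching.Regex

// Hypothetical sketch of the three table-name extraction strategies.
object TableNameExtractors {
  // .../campaign-20180212.csv -> (campaign, 2018-02-12)
  private val FromFile: Regex = """.*/([A-Za-z_]+)-(\d{4})(\d{2})(\d{2})\.csv""".r
  // .../campaign/file-20180212.csv -> (campaign, 2018-02-12)
  private val FromDir: Regex = """.*/([A-Za-z_]+)/[\w-]+-(\d{4})(\d{2})(\d{2})\.csv""".r
  // .../2017/01/11/<app-id>/10.log.gz -> look up <app-id> in a config map
  private val FromId: Regex = """.*/(\d{4})/(\d{2})/(\d{2})/([0-9a-f]+)/.*""".r

  def fromFileName(path: String): Option[(String, String)] = path match {
    case FromFile(table, y, m, d) => Some((table, s"$y-$m-$d"))
    case _                        => None
  }

  def fromDirName(path: String): Option[(String, String)] = path match {
    case FromDir(table, y, m, d) => Some((table, s"$y-$m-$d"))
    case _                       => None
  }

  def fromIdMapping(path: String, idToTable: Map[String, String]): Option[(String, String)] =
    path match {
      case FromId(y, m, d, id) => idToTable.get(id).map(t => (t, s"$y-$m-$d"))
      case _                   => None
    }
}
```

The custom-plugin strategy in the list above would simply be another function of the same shape, selected by the `table.name.extractor.method.name` configuration key.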
Data Transformation
ETL With SQL & Scala
DATA TRANSFORMATION
• Hive SQL over JDBC via Beeline
• Suitable for non-Java/Scala/Python programmers
• Spark Job
• Requires Spark and Scala knowledge; need to set up jobs, configurations, etc.
• Dynamic Scala Scripts
• Scala as a script: compile Scala at runtime, mixed with Spark SQL
SCALA SCRIPTS
• Define a special SparkJob: the Spark Job Code Runner
• Load Scala script files from a specified location (defined by config)
• Dynamically compile the Scala code into classes
• For each compiled class, run the Spark jobs defined in the script
• Twitter Eval util: dynamically evaluates Scala strings and files.
• <groupId>com.twitter</groupId>
<artifactId>util-eval_2.11</artifactId>
<version>6.24.0</version>
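For sbt-based builds, the equivalent of the Maven coordinates above would be a single line (coordinates taken from the snippet above):

```scala
// sbt equivalent of the Maven dependency above
libraryDependencies += "com.twitter" %% "util-eval" % "6.24.0"
```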
SCALA SCRIPTS
object SparkJobCodeRunner extends SparkJob {
private val LOG = LoggerFactory.getLogger(getClass)
import collection.JavaConverters._
override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
val jobFileNames: List[String] = //...
  jobFileNames.foreach { fileName =>
    val clazz: Option[Any] = evalFromFileName[Any](fileName)
    clazz.foreach {
      case job: SparkJob => job.run(sc, jobType, jobName, config)
      case _             => LOG.info("not a SparkJob")
    }
  }
}
}
SCALA SCRIPTS
import com.twitter.util.Eval
def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = {
val fs = //get Hadoop File System …
eval(IOUtils.toString(fs.open(path), "UTF-8"))(header)
}
def eval[T](code: String)(implicit header: String = ""): Option[T] =
  Try(Eval[T](header + "\n" + code)).toOption
SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE
class CameraAggCaptureMainJob extends SparkJob {
  def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = {
    val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc)
    val cameraCleanDataSchema = … // define DataFrame schema
    val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema)
      .json("s3a://databucket/camera/work/production/clean-events/final/*")
    cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data")
    sqlContext.sql("""set hive.exec.dynamic.partition.mode=nonstrict
      set hive.enforce.bucketing=false
      set hive.auto.convert.join=false
      set hive.merge.mapredfiles=true""")
    sqlContext.sql("""insert overwrite table work.camera_setting_shutter_dse_on
      select row_number() over (partition by metadata_file_name order by log_ts), …""")
    // rest of code
  }
}
new CameraAggCaptureMainJob
Data Democratization,
Visualization and Data
Management
EVOLUTION OF DATA PLATFORM
DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS
• Data Metrics Delivery
• Delivery to Slack : make metrics more accessible to broader audience
• Data Slice & Dice
• Leverage Real-Time OLAP database (Druid) (ongoing project)
• Analytics Visualization (ongoing project)
• Leverage Superset and Data Management Application
• BedRock: Self-Service & Data Management (ongoing project)
• Pipeline Monitoring
• Product Analytics Visualization
• Self-service Ingestion
• ML Feature Visualization
Spark Cluster
New or existing cluster
Spark Cluster
Long Running Cluster
Metrics Batch Ingestion
Streaming Ingestion
Output Metrics
BedRock
DATA
VISUALIZATION
&
MANAGEMENT
Work in Progress
Delivering Metrics via Slack
SLACK METRICS DELIVERY
• Why Slack?
• Push vs. pull: easy access
• Avoid another login when viewing metrics
• When connected to Slack, you are already logged in
• Move key metrics away from Tableau dashboards and put metrics generation into the software engineering process
• SQL code is under source control
• Publishing jobs are scheduled and their performance is monitored
• Discussions, questions, and comments on a specific metric can happen directly in the channel with the people involved.
SLACK DELIVERY FRAMEWORK
• Slack Metrics Delivery Framework
• Configuration Driven
• Multiple private Channels : Mobile/Cloud/Subscription/Web etc.
• Daily/Weekly/Monthly Delivery and comparison
• New metrics can be added easily with new SQL and configurations
SLACK METRICS CONCEPTS
• Slack Job
• Channels (private channels)
• Metrics Groups
• Metrics1
• …
• MetricsN
• Main Query
• Compare Query (Optional)
• Chart Query (Optional)
• Persistence (optional)
• Hive + S3
• Additional deliveries (Optional)
• Kafka
• Other Cache stores (Http Post)
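These concepts nest naturally as a small data model. The following is a hedged sketch with illustrative names, not the framework's actual classes:

```scala
// Hypothetical model of the Slack job -> channel -> metrics-group hierarchy.
case class Metric(
    name: String,
    mainQuery: String,
    compareQuery: Option[String] = None, // optional period-over-period comparison
    chartQuery: Option[String] = None)   // optional query feeding chart generation

// Metrics in one group are delivered together as a single Slack message.
case class MetricsGroup(name: String, metrics: Seq[Metric])

case class Channel(name: String, webhookUrl: String, groups: Seq[MetricsGroup])

case class SlackJob(name: String, persistMetrics: Boolean, channels: Seq[Channel])
```

The optional fields mirror the concept list above: a metric always has a main query, while comparison, charting, persistence, and additional deliveries are opt-in.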
SLACK KPI DELIVERY ARCHITECTURE
Slack message json
HTTP POST Rest API Server
Rest API Server
Metrics JSON
generate graph
Return Image
HTTP POST
Save/Get Image
Plot.ly json
Save Metrics to Hive Table
Slack Spark Job
Get Image URL
Webhooks
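The final delivery step, posting the message JSON to a Slack incoming webhook, can be sketched roughly as follows. The payload shape here is a minimal assumption (real messages carry more fields) and the object name is illustrative:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Rough sketch of the webhook delivery step; not the actual job code.
object SlackPoster {

  // Build a minimal Slack attachment payload.
  def payload(title: String, text: String): String =
    s"""{"attachments":[{"title":"$title","text":"$text"}]}"""

  // POST the JSON to an incoming-webhook URL; Slack answers 200 on success.
  def post(webhookUrl: String, json: String): Int = {
    val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode
  }
}
```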
CONFIGURATION-DRIVEN
slack-plus-push-weekly { //job name
persist-metrics="true"
channels {
dse-metrics {
post-urls {
plus-metrics = "https://hooks.slack.com/services/XXXX"
dse-metrics-test = "https://hooks.slack.com/services/XXXX"
}
plus-metrics { //metrics group
//metrics in the same group will be delivered together in one message
//metrics in different groups will be delivered as separate messages
//overwrite the above template with a specific name
}
}
}
} //slack-plus-push-weekly
SLACK METRICS CONFIGURATION
slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { //Job, Channel, KPI Group
//…
weekly-capture-users-by-platform { //metrics name
slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
metric-period = "weekly"
slack-display.chartstyle { … }
query = """ … """
compare.query = """ … """
chart.query = """ … """
}
//rest of configuration
}
SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Quick delivery to engineering managers, executives, business owners, and product managers
• 100+ members have subscribed to the different channels since we launched the service
• Cons
• Limited by Slack UI real estate: we can only display key metrics in a two-column format, so it is only suitable for high-level summary metrics
Machine Learning Feature
Visualization with Facets + Spark
EVOLUTION OF DATA PLATFORM
FEATURE VISUALIZATION
• Explore Feature Visualization via Google Facets
• Part 1 : Overview
• Part 2: Dive
• What is Facets Overview?
FACETS OVERVIEW INTRODUCTION
• From Facets Home Page
• https://pair-code.github.io/facets/
• "Facets Overview" takes input feature data from any number of datasets, analyzes them feature by feature, and visualizes the analysis.
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple
datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a
feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest such as the number of missing values or the skew
between the different datasets.
FACETS OVERVIEW SAMPLE
FACETS OVERVIEW IMPLEMENTATIONS
• The Facets Overview implementation consists of:
• Feature Statistics Protocol Buffer definition
• Feature Statistics Generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed by TypeScript code
• They can be embedded into Jupyter notebooks or webpages.
• Feature Statistics Generation
• There are two implementations for stats generation: Python and JavaScript
• Python: uses NumPy and pandas to generate stats
• JavaScript: generates stats directly in JavaScript
• Both implementations run stats generation in the browser
FACETS OVERVIEW
FEATURE OVERVIEW SPARK
• Initial exploration attempt
• Is it possible to generate stats for large datasets with a small stats size?
• Can we generate stats leveraging Spark's distributed computing capability instead of just using one node?
• Can we generate the stats in Spark, and then use them from Python and/or JavaScript?
FACETS OVERVIEW + SPARK
ScalaPB
PREPARE SPARK DATA FRAME
case class NamedDataFrame(name:String, data: DataFrame)
val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(NamedDataFrame(name = "train", train),
NamedDataFrame(name = "test", test))
SPARK FACETS STATS GENERATOR
val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
SPARK FACETS STATS GENERATOR
def protoFromDataFrames(dataFrames: List[NamedDataFrame],
features : Set[String] = Set.empty[String],
histgmCatLevelsCount:Option[Int]=None): DatasetFeatureStatisticsList
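Per feature, the generator derives summary statistics before encoding them into the protobuf. The stand-alone sketch below (no Spark) illustrates the kind of per-feature numeric summary involved; it is an assumption about the shape, not the actual FeatureStatsGenerator internals:

```scala
// Illustrative per-feature numeric statistics, similar in spirit to what
// Facets Overview encodes for each numeric feature.
case class NumericStats(count: Long, numMissing: Long, mean: Double, min: Double, max: Double)

def numericStats(values: Seq[Option[Double]]): NumericStats = {
  val present = values.flatten // drop missing values before aggregating
  NumericStats(
    count = values.size.toLong,
    numMissing = values.count(_.isEmpty).toLong,
    mean = if (present.isEmpty) 0.0 else present.sum / present.size,
    min = if (present.isEmpty) 0.0 else present.min,
    max = if (present.isEmpty) 0.0 else present.max)
}
```

In the Spark version, each of these aggregates would be computed with DataFrame aggregations rather than an in-memory Seq, which is what allows the stats to scale beyond one node.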
FACET OVERVIEW SPARK
DEMO
INITIAL FINDINGS
• Implementation
• The 1st-pass implementation is not efficient
• We have to go through each feature in multiple passes; as the number of features increases, performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate stats also determines the size of the generated protobuf file
• I haven't dived deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file that won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by Python or JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript
WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what is the right input size to generate a stats file that can be loaded into a browser or notebook?
• For example, one experiment: 300 features → 200 MB stats size
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular basis?
• How do we integrate this into production?
FINAL THOUGHTS
• We are still in the early stages of the Data Platform evolution
• We will continue to share our experience with you along the way
• Questions?
Thank You
Chester Chen, Ph.D.
Data Science & Engineering
GoPro
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Generating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data ModelerGenerating Code with Oracle SQL Developer Data Modeler
Generating Code with Oracle SQL Developer Data Modeler
 

Analytics Metrics delivery and ML Feature visualization: Evolution of Data Platform at GoPro

  • 1. Analytics metrics delivery and ML Feature visualization Evolution of Data Platform at GoPro
  • 2. ABOUT SPEAKER: CHESTER CHEN • Head of Data Science & Engineering (DSE) at GoPro • Prev. Director of Engineering, Alpine Data Labs • Founder and Organizer of SF Big Analytics meetup
  • 3.
  • 6. AGENDA • Business Use Cases • Evolution of GoPro Data Platform • Analytics Metrics Delivery via Slack • ML Feature Visualization with Google Facets and Spark
  • 7. GROWING DATA NEED FROM GOPRO ECOSYSTEM
  • 8. DATA Analytics Platform Consumer Devices GoPro Apps E-Commerce Social Media/OTT 3rd party data Product Insight User segmentation CRM/Marketing /Personalization
  • 9. EXAMPLES OF ANALYTICS USE CASES • Product Analytics • feature adoption, user engagement, user segmentation, churn analysis, funnel analysis, conversion rate, etc. • Web/E-Commerce Analytics • Camera Analytics • scene change detection, feature usage, etc. • Mobile Analytics • camera connections, story sharing, etc. • GoPro Plus Analytics • CRM Analytics • Digital Marketing Analytics • Social Media Analytics • Cloud Media Analysis • media classification, recommendations, storage analysis.
  • 10. Evolution of Data Platform
  • 11. EVOLUTION OF DATA PLATFORM
  • 12. EVOLUTION OF DATA PLATFORM
  • 13. DATA PLATFORM ARCHITECTURE TRANSFORMATION Batch Ingestion Framework • Batch Ingestion • Pre-processing Streaming Ingestion Batch Ingestion Cloud-Based Elastic Clusters PLOT.LY SERVER TABLEAU SERVER EXTERNAL SERVICE Notebook Rest API, FTP, S3 sync, etc. Dynamic DDL State Sync Parquet
  • 15. BATCH JOBS Job Gateway • Scheduled Jobs → Spark Cluster (new cluster per job) with Production Job.conf • Dev Jobs from Dev Machines → Spark Cluster (new or existing cluster) with Dev Job.conf
  • 16. INTERACTIVE/NOTEBOOKS Notebooks and Scripts (SQL, Python, Scala) plus Scheduled Notebook Jobs run on long-running Spark clusters that auto-scale with mixed on-demand & spot instances
  • 18. AIRFLOW SETUP Web Server LB Scheduler Airflow Metastore WorkersWorkers Workers Workers Workers Web Server B Web Server LB Web Server A Message Queue Airflow DAGs sync Push DAGs to S3
  • 19. TAKEAWAYS • Key Changes • Centralized Hive metastore • Separate compute and storage needs • Leverage S3 as storage • Horizontal scaling with cluster elasticity • Less time managing infrastructure • Key Benefits • Cost • Reduced redundant storage and compute costs • Use smaller instance types • 60% AWS cost saving compared to 1 year ago • Operation • Reduced complexity of DevOps support • Analytics tools • SQL only => Notebooks with (SQL, Python, Scala)
  • 20. CONFIGURABLE SPARK BATCH INGESTION FRAMEWORK HIVE SQL → Spark
  • 21. EVOLUTION OF DATA PLATFORM
  • 22. BATCH INGESTION GoPro Product data 3rd Parties Data 3rd Parties Data 3rd Parties Data Rest APIs sftp s3 sync s3 sync Batch Data Downloads Input File Formats: CSV, JSON Spark Cluster New cluster per Job
  • 23. TABLE WRITER JOBS • Jobs are identified by JobType, JobName, JobConfig • The majority of the Spark ETL jobs are table writers • Load data into a DataFrame • DataFrame-to-DataFrame transformation • Output the DataFrame to a Hive table • The majority of table writer jobs can be decomposed into one of the following sub-jobs
  • 24. TABLE WRITER JOBS SparkJob HiveTableWriter JDBCToHiveTableWriter AbstractCSVHiveTableWriter AbstractJSONHiveTableWriter CSVTableWriter JSONTableWriter FileToHiveTableWriter HBaseToHiveTableWriter TableToHiveTableWriter HBaseSnapshotJob TableSnapshotJob CoreTableWriter Customized JSON Job, Customized CSV Job (mixin) • All jobs share the same configuration loading, job state and error reporting • All table writers have the Dynamic DDL capabilities; as long as inputs become DataFrames, they behave the same • CSV and JSON have different loaders • A different loader is needed to load HBase records into a DataFrame • Aggregate Jobs
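The load / transform / write decomposition described above can be sketched in miniature. The class and method names below are illustrative stand-ins for the Scala hierarchy on this slide, not GoPro's actual API:

```python
# Minimal sketch of a table-writer job: load -> transform -> write.
# A customized job overrides only the step it needs, mirroring the hierarchy above.

class TableWriterJob:
    def load(self, source):
        """Load raw records into an in-memory 'frame' (a list of dicts)."""
        return list(source)

    def transform(self, frame):
        """DataFrame-to-DataFrame style transformation; identity by default."""
        return frame

    def write(self, frame, table):
        """Append the transformed frame to a target table and report the row count."""
        table.extend(frame)
        return len(frame)

    def run(self, source, table):
        return self.write(self.transform(self.load(source)), table)


class UpperCaseNameWriter(TableWriterJob):
    """Customized job: only the transform step differs from the base class."""
    def transform(self, frame):
        return [{**row, "name": row["name"].upper()} for row in frame]
```

Because every job funnels through the same `run`, configuration loading, job state and error reporting can live once in the base class, which is the point the slide makes.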
  • 25. HIVE TABLE WRITER JOB trait HiveTableWriter extends CoreHiveTableWriter with SparkJob { def run(sc: SparkContext, jobType: String, jobName: String, config: Config) def load(sqlContext: SQLContext, ioInfos: Seq[(String, Seq[InputOutputInfo])]): Seq[(InputOutputInfo, DataFrame)] def initProcess(sqlContext: SQLContext, jobTypeConfig: Config, jobConfig: Config) def preProcess(hadoopConf: Configuration, ioInfos: Seq[InputOutputInfo]): Seq[InputOutputInfo] def process(jobName: String, sqlContext: SQLContext, ioInfos: Seq[InputOutputInfo], jobTypeConfig: Config, jobConfig: Config) def postProcess(….) def getInputOutputInfos(sc: SparkContext, jobName: String, jobTypeConfig: Config, jobConfig: Config) : Seq[InputOutputInfo] def groupIoInfos(ioInfos: Seq[InputOutputInfo]): Seq[(String, Seq[InputOutputInfo])]
  • 26. ETL JOB CONFIGURATION gopro.dse.config.etl { mobile-job { conf {} process {} input {} output {} post.process {} } } include classpath("conf/production/etl_mobile_quik.conf") include classpath("conf/production/etl_mobile_capture.conf") include classpath("conf/production/etl_mobile_product_events.conf") Job-level conf overrides JobType conf; job-specific includes; JobType, JobName, input & output specification
  • 27. ETL JOB CONFIGURATION xyz { process {} input { delimiter = "," inputDirPattern = "s3a://teambucket/xyz/raw/production" file.ext = "csv" file.format = "csv" date.format = "yyyy-MM-dd hh:mm:ss" table.name.extractor.method.name = "com.gopro.dse.batch.spark.job.FromFileName" } output { database = "mobile", file.format = "parquet" date.format = "yyyy-MM-dd hh:mm:ss" partitions = 2 file.compression.codec.key = "spark.sql.parquet.compression.codec" file.compression.codec.value = "gzip" save.mode = "append" transformers = [com.gopro.dse.batch.spark.transformer.csv.xyz.XYZColumnTransformer] } post.process { deleteSource = true } } Save Mode JobName Input specification output specification
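The override behavior shown in these two slides (job-level conf layered on job-type defaults) can be approximated with a recursive dictionary merge. This is a sketch of HOCON-style resolution, not the actual Typesafe Config library:

```python
def merge_conf(defaults, overrides):
    """Merge job-level overrides onto job-type defaults, recursing into
    nested sections so that only the overridden leaves are replaced."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_conf(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, a job that only overrides `output.partitions` still inherits `output.file.format` from its job-type defaults.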
  • 28. Files need to go to the proper tables TABLE NAME GENERATION • Table Name Extractor • From File Name • From Directory Name • Custom Plugin
  • 29. EXTRACT TABLE NAMES • From Table Name • /databucket/3rdparty/ABC/campaign-20180212.csv • /databucket/3rdparty/ABC/campaign-20180213.csv • /databucket/3rdparty/ABC/campaign-20180214.csv • From Directory Name • /databucket/3rdparty/ABC/campaign/file-20180212.csv • /databucket/3rdparty/ABC/campaign/file-20180213.csv • /databucket/3rdparty/ABC/campaign/file-20180214.csv • From ID Mapping • /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/10.log.gz • /databucket/ABC/2017/01/11/b2a932aeddbf0f11bae9573/11.log.gz • /databucket/ABC/2018/02/17/ae6905b068c7beb08d681a5/12.log.gz • /databucket/ABC/2018/02/18/ae6905b068c7beb08d681a5/13.log.gz • Table Name, File Date • (campaign, 2018-02-12) • (campaign, 2018-02-13) • (campaign, 2018-02-14) • Table Name, File Date • (campaign, 2018-02-12) • (campaign, 2018-02-13) • (campaign, 2018-02-14) • Table Name, File Date Configuration • b2a932aeddbf0f11bae9573  mobile_ios • ae6905b068c7beb08d681a  mobile_android Table Extraction • (mobile_ios, 2017-01-11) • (mobile_android, 2018-02-17) • (mobile_android, 2018-02-18)
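The three extraction strategies above can be sketched with regular expressions. The patterns and helper names are illustrative, assuming the path shapes shown on the slide:

```python
import re


def table_from_file_name(path):
    """Extract (table, date) from paths like .../campaign-20180212.csv."""
    m = re.search(r"/([A-Za-z_]+)-(\d{4})(\d{2})(\d{2})\.csv$", path)
    if not m:
        return None
    table, y, mo, d = m.groups()
    return table, f"{y}-{mo}-{d}"


def table_from_dir_name(path):
    """Extract (table, date) from paths like .../campaign/file-20180212.csv,
    taking the table name from the parent directory."""
    m = re.search(r"/([A-Za-z_]+)/[^/]*-(\d{4})(\d{2})(\d{2})\.csv$", path)
    if not m:
        return None
    table, y, mo, d = m.groups()
    return table, f"{y}-{mo}-{d}"


def table_from_id_mapping(path, id_to_table):
    """Map an opaque directory id to a table name via configuration,
    for paths like .../2017/01/11/<hex-id>/10.log.gz."""
    m = re.search(r"/(\d{4})/(\d{2})/(\d{2})/([0-9a-f]+)/", path)
    if not m:
        return None
    y, mo, d, dir_id = m.groups()
    return id_to_table.get(dir_id), f"{y}-{mo}-{d}"
```

The config-driven `table.name.extractor.method.name` key from the earlier slide would select among strategies like these.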
  • 31. DATA TRANSFORMATION • HSQL over JDBC via Beeline • Suitable for non-Java/Scala/Python programmers • Spark Job • Requires Spark and Scala knowledge; needs job setup, configurations, etc. • Dynamic Scala Scripts • Scala as script: compile Scala at runtime, mixed with Spark SQL
  • 32. SCALA SCRIPTS • Define a special SparkJob : Spark Job Code Runner • Load Scala script files from specified location (defined by config) • Dynamically compiles the scala code into classes • For the compiled classes : run spark jobs defined in the scripts • Twitter EVAL Util: Dynamically evaluates Scala strings and files. • <groupId>com.twitter</groupId> <artifactId>util-eval_2.11</artifactId> <version>6.24.0</version>
  • 33. SCALA SCRIPTS object SparkJobCodeRunner extends SparkJob { private val LOG = LoggerFactory.getLogger(getClass) import collection.JavaConverters._ override def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = { val jobFileNames: List[String] = //... jobFileNames.foreach{ x => val clazzes : Option[Any] = evalFromFileName[Any](x) clazzes.foreach{c => c match { case job: SparkJob => job.run(sc, jobType, jobName, config) case _ => LOG.info("not match") } } } } }
  • 34. SCALA SCRIPTS import com.twitter.util.Eval def evalFromFile[T](path: Path)(implicit header: String = ""): Option[T] = { val fs = //get Hadoop File System … eval(IOUtils.toString(fs.open(path), "UTF-8"))(header) } def eval[T](code: String)(implicit header: String = ""): Option[T] = Try(Eval[T](header + "\n" + code)).toOption
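The same compile-and-run pattern can be illustrated in Python, with `compile`/`exec` standing in for Twitter's Eval. The `job` binding and the `run` signature below are hypothetical stand-ins for the `SparkJob` trait, not part of any real framework:

```python
# A Python analogue of the dynamic-evaluation pattern above: compile job code
# at runtime, then run whatever object the script binds to the name 'job'.

SCRIPT = '''
class MyJob:
    def run(self, job_type, job_name, config):
        return f"{job_type}/{job_name}: {config['value']}"

job = MyJob()
'''


def eval_job(code):
    """Dynamically evaluate a script string and return the object bound to 'job'."""
    namespace = {}
    exec(compile(code, "<job-script>", "exec"), namespace)
    return namespace.get("job")
```

As in the Scala runner, the caller only needs the compiled object to satisfy the expected job interface before invoking `run`.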
  • 35. SCALA SCRIPTS EXAMPLES -- ONE SCALA SCRIPT FILE class CameraAggCaptureMainJob extends SparkJob { def run(sc: SparkContext, jobType: String, jobName: String, config: Config): Unit = { val sqlContext: SQLContext = HiveContextFactory.getOrCreate(sc) val cameraCleanDataSchema = … //define DataFrame Schema val cameraCleanDataStageDF = sqlContext.read.schema(cameraCleanDataSchema) .json("s3a://databucket/camera/work/production/clean-events/final/*") cameraCleanDataStageDF.createOrReplaceTempView("camera_clean_data") sqlContext.sql( """ set hive.exec.dynamic.partition.mode=nonstrict set hive.enforce.bucketing=false set hive.auto.convert.join=false set hive.merge.mapredfiles=true""") sqlContext.sql( """insert overwrite table work.camera_setting_shutter_dse_on select row_number() over (partition by metadata_file_name order by log_ts) , …. """ ) //rest of code } } new CameraAggCaptureMainJob
  • 37. EVOLUTION OF DATA PLATFORM
  • 38. DATA DEMOCRATIZATION & MANAGEMENT FOCUS AREAS • Data Metrics Delivery • Delivery to Slack : make metrics more accessible to broader audience • Data Slice & Dice • Leverage Real-Time OLAP database (Druid) (ongoing project) • Analytics Visualization (ongoing project) • Leverage Superset and Data Management Application • BedRock: Self-Service & Data Management (ongoing project) • Pipeline Monitoring • Product Analytics Visualization • Self-service Ingestion • ML Feature Visualization
  • 39. DATA VISUALIZATION & MANAGEMENT (work in progress): Batch Ingestion and Streaming Ingestion jobs on Spark clusters (new or existing; long-running) output metrics to BedRock
  • 41. SLACK METRICS DELIVERY (redacted metrics screenshot)
  • 42. SLACK METRICS DELIVERY • Why Slack? • Push vs. pull -- easy access • Avoid another login when viewing metrics • When connected to Slack, you are already logged in • Move key metrics away from Tableau dashboards and put metrics generation into the software engineering process • SQL code is under source control • Publishing jobs are scheduled and their performance is monitored • Discussion/questions/comments on specific metrics can happen directly in the channel with the people involved.
  • 43. SLACK DELIVERY FRAMEWORK • Slack Metrics Delivery Framework • Configuration Driven • Multiple private Channels : Mobile/Cloud/Subscription/Web etc. • Daily/Weekly/Monthly Delivery and comparison • New metrics can be added easily with new SQL and configurations
  • 44. SLACK METRICS CONCEPTS • Slack Job → Channels (private channels) → Metrics Groups → Metrics1 • … • MetricsN • Main Query • Compare Query (Optional) • Chart Query (Optional) • Persistence (optional) • Hive + S3 • Additional deliveries (Optional) • Kafka • Other cache stores (HTTP POST)
  • 45. SLACK KPI DELIVERY ARCHITECTURE Slack message json HTTP POST Rest API Server generate graph Metrics Json Return Image HTTP POST Save/Get Image Plot.ly json Save Metrics to Hive Table Slack Spark Job Get Image URL Webhooks
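A minimal sketch of the delivery step: format a metrics group as a Slack attachment with two-column (`short`) fields and POST it to an incoming webhook. The webhook URL and metric names below are placeholders; only the standard Slack incoming-webhook JSON shape is assumed:

```python
import json
from urllib import request


def metrics_payload(title, metrics):
    """Format a metrics group as a Slack message. 'short': True renders
    fields two per row, the two-column layout mentioned on a later slide."""
    fields = [{"title": name, "value": str(value), "short": True}
              for name, value in metrics.items()]
    return {"attachments": [{"title": title, "fields": fields}]}


def post_to_slack(webhook_url, payload):
    """POST the message to a Slack incoming webhook (hooks.slack.com/services/...)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

In the architecture above, the Spark job would build a payload like this, fetch the Plot.ly chart image URL from the REST API server, and attach it before posting.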
• 46. CONFIGURATION-DRIVEN

slack-plus-push-weekly { // job name
  persist-metrics = "true"
  channels {
    dse-metrics {
      post-urls {
        plus-metrics = "https://hooks.slack.com/services/XXXX"
        dse-metrics-test = "https://hooks.slack.com/services/XXXX"
      }
      plus-metrics { // metrics group
        // metrics in the same group will be delivered together in one message
        // metrics in different groups will be delivered as separate messages
        // overwrite above template with specific name
      }
    }
  }
} // slack-plus-push-weekly
• 47. SLACK METRICS CONFIGURATION

slack-mobile-push-weekly.channels.mobile-metrics.capture-metrics { // Job, Channel, KPI Group
  // …
  weekly-capture-users-by-platform { // metrics name
    slack-display.attachment.title = "GoPro Mobile App -- Users by Platform"
    metric-period = "weekly"
    slack-display.chartstyle { … }
    query = """ … """
    compare.query = """ … """
    chart.query = """ … """
  }
  // rest of configuration
}
• 48. SLACK DELIVERY BENEFITS
• Pros:
• Quick and easy access via Slack
• Metrics can be delivered directly to engineering managers, executives, business owners, and product managers
• 100+ members have subscribed to the various channels since we launched the service
• Cons:
• Limited by Slack UI real estate: we can only display key metrics in a two-column format, so it is only suitable for high-level summary metrics
  • 50. EVOLUTION OF DATA PLATFORM
• 51. FEATURE VISUALIZATION
• Explore feature visualization via Google Facets
• Part 1: Overview
• Part 2: Dive
• What is Facets Overview?
• 52. FACETS OVERVIEW INTRODUCTION
• From the Facets home page (https://pair-code.github.io/facets/):
• "Facets Overview takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis."
• Overview can help uncover issues with datasets, including the following:
• Unexpected feature values
• Missing feature values for a large number of examples
• Training/serving skew
• Training/test/validation set skew
• Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets.
• Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red.
• Features can be sorted by values of interest, such as the number of missing values or the skew between the different datasets.
• 54. FACETS OVERVIEW IMPLEMENTATION
• The Facets Overview implementation consists of:
• A Feature Statistics protocol buffer definition
• Feature statistics generation
• Visualization
• Visualization
• The visualizations are implemented as Polymer web components, backed by TypeScript code
• They can be embedded into Jupyter notebooks or web pages
• Feature Statistics Generation
• There are two implementations of the stats generation: Python and JavaScript
• Python: uses numpy and pandas to generate the stats
• JavaScript: generates the stats in pure JavaScript
• Both implementations run the stats generation on a single node (the JavaScript one in the browser)
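To give a flavor of the kind of per-feature statistics these generators compute (counts, missing values, numeric ranges, categorical frequencies), here is a minimal standard-library sketch — an illustration in the spirit of Facets Overview, not the actual Facets code:

```python
from collections import Counter

def feature_stats(name: str, values: list) -> dict:
    """Compute simple per-feature statistics: counts, missing values,
    and either a numeric or a categorical summary."""
    present = [v for v in values if v is not None]
    stats = {
        "name": name,
        "count": len(values),
        "missing": len(values) - len(present),  # e.g. flags heavy missingness
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # numeric feature: range and mean help spot unexpected values
        stats.update({
            "min": min(present),
            "max": max(present),
            "mean": sum(present) / len(present),
        })
    else:
        # categorical feature: cardinality and top frequencies
        freqs = Counter(present)
        stats["unique"] = len(freqs)
        stats["top"] = freqs.most_common(3)
    return stats

# Feature names borrowed from the UCI Adult dataset used later in the talk
ages = feature_stats("Age", [39, 50, 38, None, 28])
work = feature_stats("Workclass", ["Private", "Private", "State-gov", None])
```

Facets additionally builds histograms/quantiles and compares these stats across datasets (e.g. train vs. test) to surface skew.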
• 56. FACETS OVERVIEW + SPARK
• Initial exploration; the questions we asked:
• Is it possible to generate compact stats from larger datasets?
• Can we generate the stats leveraging the distributed computing capability of Spark, instead of using just one node?
• Can we generate the stats in Spark and then consume them from Python and/or JavaScript?
• 57. FACETS OVERVIEW + SPARK (diagram; protobuf Scala classes generated via ScalaPB)
• 58. PREPARE SPARK DATA FRAME

case class NamedDataFrame(name: String, data: DataFrame)

val features = Array("Age", "Workclass", ….)
val trainData: DataFrame = loadCSVFile("./adult.data.csv")
val testData = loadCSVFile("./adult.test.txt")
val train = trainData.toDF(features: _*)
val test = testData.toDF(features: _*)
val dataframes = List(
  NamedDataFrame(name = "train", train),
  NamedDataFrame(name = "test", test))
• 59. SPARK FACETS STATS GENERATOR

val generator = new FeatureStatsGenerator(DatasetFeatureStatisticsList())
val proto = generator.protoFromDataFrames(dataframes)
persistProto(proto)
• 60. SPARK FACETS STATS GENERATOR

def protoFromDataFrames(
    dataFrames: List[NamedDataFrame],
    features: Set[String] = Set.empty[String],
    histgmCatLevelsCount: Option[Int] = None): DatasetFeatureStatisticsList
• 63. INITIAL FINDINGS
• Implementation
• The first-pass implementation is not efficient
• We make multiple passes over each feature; as the number of features increases, performance suffers, which limits the number of features that can be used
• The size of the dataset used to generate the stats also affects the size of the generated protobuf file
• We haven't dived deeper into what contributes to the change in size
• The combination of data size and feature count can produce a large file, which won't fit in the browser
• With Spark DataFrames, we can't support TensorFlow Records
• The Base64-encoded protobuf string can be loaded by both Python and JavaScript
• The protobuf binary file can also be loaded by Python
• But somehow it cannot be loaded by JavaScript
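The Base64 hand-off mentioned above — serializing the stats protobuf and encoding it so the visualization can consume it — can be sketched as follows. The bytes and the `facets-overview` embedding are illustrative placeholders; in the real pipeline the bytes would come from serializing the `DatasetFeatureStatisticsList` proto:

```python
import base64

# Placeholder for DatasetFeatureStatisticsList.SerializeToString(),
# so this sketch stays self-contained.
serialized_proto = b"\x0a\x05train\x12\x04test"

# Base64-encode so the stats can be inlined as a string into the
# Facets Overview HTML and parsed by the JavaScript visualization.
proto_str = base64.b64encode(serialized_proto).decode("utf-8")

# Hypothetical embedding into the web component's input
html = f'<facets-overview proto-input="{proto_str}"></facets-overview>'

# The consumer (Python or JavaScript) decodes the same string back to bytes
assert base64.b64decode(proto_str) == serialized_proto
```

This string round trip is why the Base64 form works from both Python and JavaScript, while the raw binary file proved loadable only from Python.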
• 64. WHAT’S NEXT?
• Improve implementation performance
• When we have a lot of data and features, what input size yields a stats file that can still be loaded into a browser or notebook?
• For example, in one experiment: 300 features → a 200 MB stats file
• How do we efficiently partition the features so that they remain viewable?
• Data is changing: how can we incrementally update the stats on a regular basis?
• How do we integrate this into production?
• 65. FINAL THOUGHTS
• 66. FINAL THOUGHTS
• We are still in the early stages of the Data Platform evolution
• We will continue to share our experience with you along the way
• Questions? Thank you
Chester Chen, Ph.D. Data Science & Engineering, GoPro