Use of Spark MLlib for Predicting the Offlining of Digital Media (Christopher Burdorf, NBCUniversal)


This document discusses using Spark MLlib to predict which digital media files should be offlined from storage to free up space. It describes using k-means clustering, naive Bayes classification, and support vector machines (SVM) on features like file size, age, and airing schedule. SVM performed best and allowed building a predictive system in just over an hour. The system is run twice daily on a Spark cluster to select files for purging from a large storage system based on predictions. Some initial issues were addressed and the system is now running robustly in production.


USE OF SPARK MLLIB
FOR PREDICTING
OFFLINING OF DIGITAL
MEDIA
Christopher Burdorf
NBCUniversal

OVERVIEW
λ Problem Definition
λ Cluster Configurations
λ Parameters
λ MLlib libraries utilized
λ Results
λ Conclusions

Problem Definition
λ Digital media files distributed internationally 24/7 over the internet to
cable TV channels in APAC, Europe, and Asia.
λ TV Shows, Commercials
λ Fast Isilon storage fills up frequently, thus offlining is necessary to
create space for new files
λ Multiple parameters are used to pick candidates
λ Previous system works well, but is slow and has large overhead

Cluster Configurations
λ Development
- 1 Linux System with 16 cores and 16 Gigs RAM
- 3 Mac OS Systems each with 8 cores and 16 Gigs RAM
- Mesos cluster manager
λ Production
- 3 Linux systems with 32 cores and 64 Gigs of RAM
- Mesos master and slaves running in Docker containers

Features/Parameters
λ File size
λ File age
λ Days since last airing
λ Days until next airing
λ Immediately remove files that have not been scheduled for
more than 3 weeks (was 2 weeks to start with).
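
The parameters above can be sketched as a numeric feature vector plus a hard rule for immediate removal. This is only an illustration in plain Python: the `MediaFile` type, the field names, the placeholder value for unscheduled files, and the reading of "not scheduled for more than 3 weeks" as "no upcoming airing and last aired more than 21 days ago" are all assumptions, not details from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold from the slide: 3 weeks (it was 2 weeks to start with).
UNSCHEDULED_LIMIT_DAYS = 21


@dataclass
class MediaFile:
    """Hypothetical record holding the four parameters listed above."""
    name: str
    size_gb: float
    age_days: float
    days_since_last_airing: float
    days_until_next_airing: Optional[float]  # None means no airing is scheduled


def to_features(f: MediaFile, unscheduled_value: float = 9999.0) -> List[float]:
    """Flatten a file into a feature vector; unscheduled files get a large placeholder."""
    next_airing = (
        unscheduled_value
        if f.days_until_next_airing is None
        else f.days_until_next_airing
    )
    return [f.size_gb, f.age_days, f.days_since_last_airing, next_airing]


def remove_immediately(f: MediaFile) -> bool:
    """Assumed reading of the rule: nothing scheduled and idle for more than 3 weeks."""
    return (
        f.days_until_next_airing is None
        and f.days_since_last_airing > UNSCHEDULED_LIMIT_DAYS
    )
```

Files that do not trigger the hard rule would then be passed to the classifier as feature vectors.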

K-Means Clustering
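$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$
where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the
distance of the n data points from their respective cluster centres.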
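
As a rough illustration of this objective, the sketch below assigns each point to its nearest centre and sums the squared Euclidean distances. It is plain Python rather than Spark MLlib, and the function names and sample points are hypothetical.

```python
from typing import Sequence


def squared_distance(x: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance between a data point and a cluster centre."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest_centre(x: Sequence[float], centres: Sequence[Sequence[float]]) -> int:
    """Index of the centre closest to x."""
    return min(range(len(centres)), key=lambda j: squared_distance(x, centres[j]))


def kmeans_objective(
    points: Sequence[Sequence[float]], centres: Sequence[Sequence[float]]
) -> float:
    """Sum of squared distances from each point to its nearest centre."""
    return sum(
        squared_distance(x, centres[nearest_centre(x, centres)]) for x in points
    )


# Hypothetical two-feature points and two candidate centres.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centres = [(0.0, 1.0), (10.0, 0.0)]
objective = kmeans_objective(points, centres)
```

A low objective only indicates tight clusters; it does not by itself show that the clusters separate files to offline from files to keep.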

K-Means clustering results
λ Attempted with multiple centroid points
λ No meaningful clusters for any sets of data attempted
λ Results appeared to be tied to the nature of the data
λ Multiple parameters created difficulties

Naive Bayes classifier
λ From Bayes theorem: p(C_k|x)=(p(C_k)p(x|C_k))/p(x)
λ posterior = (prior × likelihood) / evidence
λ Thus, the attempt was to classify media files for removal or
retention on the file system based on the different parameter
values and their expected results
λ Had difficulties training for many factors
- Adding new factors to training set altered results for
previously trained classifier and required retraining the entire
classification system
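
To make the posterior formula concrete, here is a small plain-Python sketch that applies it to two classes ("remove" and "keep") under the naive assumption that features are independent given the class. The priors and likelihood values are invented for illustration and do not come from the talk.

```python
from typing import Dict, List


def posteriors(
    priors: Dict[str, float], likelihoods: Dict[str, List[float]]
) -> Dict[str, float]:
    """Return p(C_k | x) for each class C_k.

    likelihoods[c] holds p(x_i | c) for each observed feature value x_i;
    naive Bayes multiplies them as if the features were independent.
    """
    joint = {}
    for c, prior in priors.items():
        p = prior
        for l in likelihoods[c]:
            p *= l
        joint[c] = p  # prior x likelihood
    evidence = sum(joint.values())  # p(x)
    return {c: p / evidence for c, p in joint.items()}


# Hypothetical numbers for a single file with two observed features.
priors = {"remove": 0.3, "keep": 0.7}
likelihoods = {"remove": [0.8, 0.6], "keep": [0.2, 0.5]}
result = posteriors(priors, likelihoods)
```

Every likelihood table has to be estimated from the training set, which is consistent with the slide's observation that adding a new factor meant retraining the whole classifier.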

SVM Results
λ Found to be most robust method
λ Compensated well for new added features
λ Built training set from production data and auto-generated data
that fit the criteria
λ Spark MLlib optimization allows us to build a predictive system in
just over an hour
- Run twice daily due to constant changes in schedules and
online media.
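
The slides do not show code or say which MLlib SVM API was used. As a minimal sketch of the underlying technique, the plain-Python example below trains a linear SVM by subgradient descent on the hinge loss with L2 regularization. The two features, the training rows, the labels (+1 = offline, -1 = keep), and the hyperparameters are all hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(
    data: List[Example], epochs: int = 200, step: float = 0.1, reg: float = 0.01
) -> Tuple[List[float], float]:
    """Subgradient descent on hinge loss plus an L2 penalty on the weights."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (dot(w, x) + b)
            # Gradient of the L2 penalty shrinks the weights on every step.
            w = [wi - step * reg * wi for wi in w]
            if margin < 1:
                # Hinge loss is active: move the boundary towards classifying x correctly.
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
                b += step * y
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical features: [months since last airing, months until next airing].
training = [
    ([2.0, 3.0], 1), ([1.5, 2.5], 1), ([3.0, 2.0], 1),      # offline
    ([0.1, 0.1], -1), ([0.2, 0.3], -1), ([0.5, 0.2], -1),   # keep
]
w, b = train_linear_svm(training)
```

With a trained model, `predict(w, b, features)` gives an offline-or-keep decision per file, which is the kind of output the production system stores for later use.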

Production environment
λ Spark MLlib SVM is trained and predictions are generated for
each media file online twice daily. Predictions are stored in
HBase
λ Media manager daemon regularly scans the file system and if
available space requires purging, it will select files based on
predictions stored in HBase
λ Backup system checks to see if schedule has changed for a
given media file selected to be purged, because Spark SVM
predictions are only generated twice daily
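
The purge flow described above can be sketched as follows. This is an assumption-heavy outline in plain Python: a dictionary stands in for the prediction store, and `schedule_changed` is a hypothetical callback representing the backup check against the current schedule.

```python
from typing import Callable, Dict, List


def select_for_purge(
    predictions: Dict[str, bool],
    sizes_gb: Dict[str, float],
    space_needed_gb: float,
    schedule_changed: Callable[[str], bool],
) -> List[str]:
    """Pick files predicted for offlining until enough space would be freed.

    predictions maps file name -> True if the classifier marked it for offlining.
    A file is skipped when its schedule changed after the prediction was made.
    """
    selected: List[str] = []
    freed = 0.0
    for name in sorted(predictions):
        if freed >= space_needed_gb:
            break
        if not predictions[name]:
            continue
        if schedule_changed(name):
            # The stored prediction may be up to half a day old, so re-check first.
            continue
        selected.append(name)
        freed += sizes_gb[name]
    return selected


# Hypothetical stored predictions and file sizes.
stored = {"a": True, "b": False, "c": True, "d": True}
sizes = {"a": 10.0, "b": 5.0, "c": 20.0, "d": 30.0}
to_purge = select_for_purge(stored, sizes, 25.0, lambda name: name == "a")
```

Here file "a" is skipped because its schedule changed, so "c" and "d" are chosen to reach the required 25 GB.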

Production issues
λ System was run in test mode for 6 weeks without a problem
λ Once switched to production, issues arose
- 2 weeks was too short a time to immediately offline files
λ Switched to 3 weeks
- Docker containers filled up with Spark temp files and
crashed over the weekend (ouch!)
λ Solved with cron job periodically removing them.
- Web service access to HBase locked up.
λ Solved by adding a thread timeout to the web service call.
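
The temp-file fix above amounts to periodically deleting stale files. The slides do not show the actual cron job, so the sketch below is only one way it might look in Python; the function name and the age threshold are assumptions.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(
    directory: str, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete regular files in directory whose mtime is older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Run from a scheduler at a fixed interval, a job like this keeps the container's disk usage bounded between Spark runs.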

Conclusions
λ System has been running in production mode for some time
now.
- Fine tuning appears to be complete
λ Spark SVM has performed well
- Fast and robust
λ Good application for machine learning, though the critical nature of
the system increased the complexity of the task.
