Video Transcoding on Hadoop
Presented by Shital Mehta and Kishore Angani | June 3, 2014
2014 Hadoop Summit, San Jose, California
Outline
 Video Transcoding at Yahoo
 Current Architecture: (Hadoop 0.23.x)
 New Requirements
 Generic YARN (master / worker)
Video Transcoding at Yahoo
Video Transcoding
4 Yahoo Confidential & Proprietary
 Convert source videos to standard output formats
› input support
• > 10 container formats
• > 40 video codecs
• > 60 audio codecs
› output support (at various resolutions and bitrates)
• mp4/h264/AAC
• webm/vp8/vorbis
[Figure: input containers (AVI, MP4, MOV, 3GP, FLV, WebM, …) transcoded to output containers (MP4, WebM)]
Related Jobs
 Post Transcode enrichments
› watermarking
› previews
› thumbnails
› visual seek
 Machine learning
Extremely Compute and I/O intensive
 SLA is measured in multiples of source video length
 FFmpeg takes between 0.5x and 5x the video duration
› depending on hardware / resources available
› tool configuration, etc
 Computation requirements are dependent on:
› source and destination parameters
 Job parallelism
› some jobs can work on fragmented videos
› many require the whole video file for optimal results
The Processing Job (DAG)
[Figure: processing DAG for jobs job1 … jobn.
 Timeline: t0 start; t1 job split (DAG planning based on source video / requester); t2 … tn partial callbacks, intermediate uploads; td done.
 Each job: Download Input Video, then its tasks, then Merge, Cleanup.
 Tasks shown: (E) Previews, (E) Thumbnails, (E) enrichments, (T) mp4/h264/AAC at 1080p, 720p and 360p, (T) webm/vp8/vorbis at 1080p and 720p.]
Job Characteristics
 Tens of thousands of input videos / day
 Source duration ranges from 10 seconds to 2 hours
 Video sizes vary from a few MBs to a few GBs
 Variable source / output fan-out
› 5 to 15 output jobs per source video
› hundreds of thousands of processing tasks per day
 Job split and planning at ‘t1’
› dependent on source video parameters
 Static Job plan (DAGs) based approaches lead to:
› high resource wastage with reduced concurrency if the DAG is over-provisioned
› high resource contention with SLA misses when the DAG plan is too strict
 SLA and predictability are very important
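A minimal sketch of deciding the job split at t1 from source parameters instead of from a static DAG. The rule used here (never plan a rendition taller than the source), the class name `JobPlanner`, the method `planRenditions`, and the ladder values are assumptions made for illustration; the deck does not describe its actual planner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: decides the output tasks once the source height is known.
class JobPlanner {
    // Assumed rendition ladder; the DAG slide mentions 360p, 720p and 1080p outputs.
    static final int[] LADDER = {360, 720, 1080};

    // Returns one transcode task name per rendition that does not upscale the source.
    static List<String> planRenditions(String format, int sourceHeight) {
        List<String> tasks = new ArrayList<>();
        for (int height : LADDER) {
            if (height <= sourceHeight) {
                tasks.add("(T) " + format + "/" + height + "p");
            }
        }
        return tasks;
    }
}
```

With a rule like this, a 720p source yields two tasks for a given format rather than three, so no container is reserved for a 1080p rendition that would never run.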
Current Architecture: (Hadoop 0.23.x)
Cascaded Map – Reduce Jobs
[Figure: cascaded MapReduce jobs. Components shown: API, OOZIE, Video Store, HDFS, an MR job with (M) Download + Split Generation, MR jobs with (M) Transcode, and (R) Cleanup, Notify.]
Why Hadoop 1/2
 Extremely reliable as a framework
 Good Resource Management
› custom container asks based on source video parameters
› multiple 2G to 6G MR jobs spawned on demand
› minimal resource wastage (job plan decided by the parent MR job)
 Distributed File System (HDFS)
› used to share video files between various transcode jobs
 Elasticity
› scaling achieved by increasing queue capacity
 Fault Tolerance
› OOZIE provides job level fault tolerance
› MR framework provides task level fault tolerance
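The "custom container asks based on source video parameters" point can be sketched as a sizing function. Only the 2G and 6G bounds come from the slide; the linear heuristic, the class name `ContainerAsk`, and the method `memoryMb` are hypothetical and shown only to make the idea concrete.

```java
// Hypothetical sizing rule: larger sources get larger container asks.
class ContainerAsk {
    static final int MIN_MB = 2048; // 2G lower bound from the slide
    static final int MAX_MB = 6144; // 6G upper bound from the slide

    // Assumed heuristic: base ask plus 2 MB of memory per MB of source,
    // clamped to the [2G, 6G] range.
    static int memoryMb(long sourceSizeMb) {
        long ask = MIN_MB + 2 * sourceSizeMb;
        return (int) Math.max(MIN_MB, Math.min(MAX_MB, ask));
    }
}
```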
Why Hadoop 2/2
 Log analysis and reporting
› run as MR jobs alongside transcode jobs in the same queue
 All functions well contained within the Hadoop MR ecosystem
 Very low maintenance
› over and above Grid maintenance
 Lets us focus on the business logic and functions
 Excellent SLA for big jobs
New Requirements
(UGC and near real-time processing)
UGC and the current architecture (shortcomings)
 Very high variance in User Generated Content
› duration, size, bitrates, etc.
 Users want immediate feedback
› SLA very important here
 Large number of short length videos (< 30 seconds)
 SLA multiple on small videos is very high
› latency in MR containers’ allocation and preparation
› some latency added by OOZIE scheduling
 OOZIE / MR designed for batch jobs
The Latency
 Total Δt1 ~ 50 seconds to a minute, Δt2 ~ a few seconds
 Job split decision point important
› leads to efficient resource utilization
 Map Reduce framework very good for batch jobs
› but not suitable for near real-time processing
 Well known and documented
 Alternate low latency frameworks available
[Figure: latency timeline. OOZIE launches MR1 through MR4, with the job split at t1 (DAG planning based on source video / requester).
 Δt1: Job Queuing / Scheduling, Container Allocation, Container Localization.
 Δt2: Container warming (ML Models, etc).]
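A small worked calculation shows why the fixed latency matters for short clips when the SLA is a multiple of source duration. The class `SlaMultiple` and the sample values (55 s for Δt1, 5 s for Δt2, a transcode factor of 1.0) are assumptions picked from within the ranges quoted in the deck, not measurements.

```java
// Illustrative arithmetic: completion time as a multiple of the source duration.
class SlaMultiple {
    // schedulingSec stands for Δt1, warmupSec for Δt2, and transcodeFactor for
    // FFmpeg processing time divided by source duration.
    static double multiple(double sourceSec, double schedulingSec,
                           double warmupSec, double transcodeFactor) {
        double totalSec = schedulingSec + warmupSec + transcodeFactor * sourceSec;
        return totalSec / sourceSec;
    }
}
```

Under these assumptions a 20-second clip finishes in (55 + 5 + 20) / 20 = 4x its duration, while a 2-hour video (7200 s) finishes in about 1.01x, so the same overhead is negligible for long videos and dominant for short ones.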
New Requirements and options explored
 Need
› near real-time scheduling (Δt1)
› long running re-usable containers (Δt2)
 Options explored
› Tez
› Storm / Spark
› Slider
Issues with options explored
 Most (if not all) frameworks optimized for captive data flow
› (in our case) only job metadata flows through the framework
› while video blobs are consumed from outer subsystems (HDFS / local storage)
› metadata is not a clear indicator of job characteristics
 Video vs Text Processing
› cannot process line by line
› no key / value decomposition
› many jobs require the whole video file to be present locally
The Comparison Sheet
Requirement                   | Current   | Tez       | Storm / Spark | Slider
------------------------------|-----------|-----------|---------------|----------
Elasticity                    | High      | High      | High          | High
Latency                       | High      | Low       | Low           | Low
Resource Efficiency (usage %) | High      | Low*      | High          | High
Dynamic DAG                   | Yes       | No        | No            | No DAG
Fault Tolerance               | Framework | Framework | Framework     | Framework
Resource Management           | Fine      | Fine      | Coarse / None | Fine
Job / Task Abstraction        | Yes       | Yes       | Yes           | No
Container Release             | Yes       | Yes       | No            | No
Container Isolation           | Yes       | Yes       | No            | Yes
Container PreWarm             | Per Job   | Once      | Once          | Once
* Containers remain idle as the DAG cannot be changed after the first step
New Architecture:
Generic YARN (master / worker)
Generic YARN Master / Worker
[Figure: a Master distributes jobs over RPC to workers w1 … wn, of types 1 … k]
 Extremely simple framework
 Master manages a pool of workers
 Master reads jobs and distributes to workers over Hadoop RPC
 Framework has pluggable master and worker tasks
 Pluggable scheduling strategy to manage workers
 Heterogeneous worker tasks in same pool
 Custom resource allocation per worker type
 Worker resources setup once at bootstrap
 State management is done by Master using HDFS
 Security and token management by framework harness
…
Master, Worker Interfaces
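// Pluggable master task: hands out job inputs and collects job outputs.
public interface Master {
    // Called on behalf of the named worker to fetch its next job.
    Job getJobInput(String workerName);
    // Called with the result of a finished job.
    void setJobOutput(Job jobOutput);
}
// Pluggable worker task: processes one job and returns its output.
public interface Worker {
    Job execute(Job jobInput);
}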
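The two interfaces can be exercised with a small in-process sketch. Everything beyond the `Master` and `Worker` method signatures is an assumption: the `Job` fields, the `QueueMaster` and `WorkerLoop` classes, and the use of an in-memory queue in place of Hadoop RPC and HDFS-backed state are hypothetical stand-ins, not the framework's real classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical job payload; the slides do not show the Job class.
class Job {
    final String id;
    final String payload;
    Job(String id, String payload) { this.id = id; this.payload = payload; }
}

interface Master {
    Job getJobInput(String workerName);
    void setJobOutput(Job jobOutput);
}

interface Worker {
    Job execute(Job jobInput);
}

// In-memory master: a queue stands in for the real job source and state store.
class QueueMaster implements Master {
    private final Queue<Job> pending = new ArrayDeque<>();
    final List<Job> completed = new ArrayList<>();

    synchronized void submit(Job job) { pending.add(job); }

    @Override
    public synchronized Job getJobInput(String workerName) {
        return pending.poll(); // null when no work is available
    }

    @Override
    public synchronized void setJobOutput(Job jobOutput) {
        completed.add(jobOutput);
    }
}

class WorkerLoop {
    // Pulls jobs until the master has none left; returns how many were processed.
    static int drain(Master master, Worker worker, String workerName) {
        int processed = 0;
        Job input;
        while ((input = master.getJobInput(workerName)) != null) {
            master.setJobOutput(worker.execute(input));
            processed++;
        }
        return processed;
    }
}
```

Because `Worker` has a single method, a worker task can be supplied as a lambda, for example `WorkerLoop.drain(master, job -> new Job(job.id, job.payload.toUpperCase()), "w1")`. A long-running worker would keep polling instead of stopping on the first empty response.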
New Architecture for Transcoding
[Figure: new transcoding architecture. Components shown: API, Job Queue, Client, Pool (Master plus workers Worker 1 (w1 … wm) through Worker k (w1 … wn)), HDFS, State Information, Video Storage.]
Characteristics of the New Framework
 Long running workers in YARN containers
› configurable TTL and timeouts
 Each pool consists of 1 Master and multiple workers
 Multiple pools are managed by the client
 Multiple clients across clusters
 Adaptive container allocation and release
› scheduling strategy (low – high watermark based)
 Significant improvements in latency
› job scheduling and distribution in milliseconds
 YARN and the Client provide Master fault tolerance
 Master takes care of fault tolerance for workers
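The low/high watermark strategy for adaptive container allocation and release could look like the sketch below. The load metric (pending jobs per worker), the thresholds, and the names `WatermarkScaler` and `decide` are assumptions; the deck states only that the strategy is watermark based.

```java
// Hypothetical low/high watermark policy for sizing a worker pool.
class WatermarkScaler {
    final int lowWatermark;  // release a worker when pending jobs per worker fall below this
    final int highWatermark; // request a worker when pending jobs per worker exceed this
    final int minWorkers;
    final int maxWorkers;

    WatermarkScaler(int lowWatermark, int highWatermark, int minWorkers, int maxWorkers) {
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
    }

    // Returns +1 to request a container, -1 to release one, 0 to hold steady.
    int decide(int pendingJobs, int workers) {
        if (workers < minWorkers) {
            return 1;
        }
        double load = (double) pendingJobs / Math.max(1, workers);
        if (load > highWatermark && workers < maxWorkers) {
            return 1;
        }
        if (load < lowWatermark && workers > minWorkers) {
            return -1;
        }
        return 0;
    }
}
```

Keeping a gap between the two watermarks avoids requesting and releasing containers on every small change in queue depth.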
What Next …
 Hope to release to the community soon
 Similar in principle to Google containers
› with a low latency Job abstraction
 YARN (nice to have):
› Multi dimensional scheduling
› Node Labels
Thank You
@kishore_angani
@smcal75
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Video Transcoding on Hadoop

  • 1. Video Transcoding on Hadoop
      Presented by Shital Mehta and Kishore Angani | June 3, 2014
      2014 Hadoop Summit, San Jose, California
  • 2. Outline
      ▪ Video Transcoding at Yahoo
      ▪ Current Architecture: (Hadoop 0.23.x)
      ▪ New Requirements
      ▪ Generic YARN (master / worker)
  • 4. Video Transcoding
      ▪ Convert source videos to standard output formats
        › input support
          • > 10 container formats
          • > 40 video codecs
          • > 60 audio codecs
        › output support (at various resolutions and bitrates)
          • mp4/h264/AAC
          • webm/vp8/vorbis
      [Diagram: input containers (AVI, MP4, Mov, 3GP, FLV, WebM, …) converted to output containers (MP4, WebM)]
  • 5. Related Jobs
      ▪ Post Transcode enrichments
        › watermarking
        › previews
        › thumbnails
        › visual seek
      ▪ Machine learning
  • 6. Extremely Compute and I/O intensive
      ▪ SLA is measured in multiples of source video length
      ▪ FFmpeg takes between 0.5x and 5x the video duration
        › depending on hardware / resources available
        › tool configuration, etc.
      ▪ Computation requirements are dependent on:
        › source and destination parameters
      ▪ Job parallelism
        › some jobs can work on fragmented videos
        › many require the whole video file for optimal results
  • 7. The Processing Job (DAG)
      [Diagram: processing timeline with points t0 (start), t1 (job split: DAG planning based on source video / requester), t2 … tn (partial callbacks, intermediate uploads), and td (done). Jobs job1 … jobn each begin with "Download Input Video" and end with "Merge, Cleanup".]
      Task labels shown for the first job:
        › (E) Previews
        › (E) Thumbnails
        › (T) mp4/h264/AAC/720p
        › (T) webm/vp8/vorbis/1080p
        › (T) webm/vp8/vorbis/720p
      Task labels shown for the second job:
        › (E) enrichments
        › (T) mp4/h264/AAC/1080p
        › (T) mp4/h264/AAC/720p
        › (T) mp4/h264/AAC/360p
  • 8. Job Characteristics
      ▪ Tens of thousands of input videos / day
      ▪ Source duration ranges from 10 seconds to 2 hours
      ▪ Video sizes vary from a few MBs to a few GBs
      ▪ Variable source / output fan-out
        › 5 to 15 output jobs per source video
        › hundreds of thousands of processing tasks per day
      ▪ Job split and planning at ‘t1’
        › dependent on source video parameters
      ▪ Static job plan (DAG) based approaches lead to:
        › high resource wastage with reduced concurrency if the DAG is over-provisioned
        › high resource contention with SLA misses when the DAG plan is too strict
      ▪ SLA and predictability are very important
  • 10. Cascaded Map – Reduce Jobs
      [Diagram: OOZIE coordinates a cascade of MR jobs, with API boxes at either end. Labeled stages: "(M) Download + Split Generation", several parallel "(M) Transcode" jobs, and "(R) Cleanup, Notify". Storage boxes: Video Store and HDFS.]
  • 11. Why Hadoop 1/2
      ▪ Extremely reliable as a framework
      ▪ Good Resource Management
        › custom container asks based on source video parameters
        › multiple 2G to 6G MR jobs spawned on demand
        › minimal resource wastage (job plan decided by the parent MR job)
      ▪ Distributed File System (HDFS)
        › used to share video files between various transcode jobs
      ▪ Elasticity
        › scaling achieved by increasing queue capacity
      ▪ Fault Tolerance
      ▪ OOZIE provides job level fault tolerance
      ▪ MR framework provides task level fault tolerance
  • 12. Why Hadoop 2/2
      ▪ Log analysis and reporting
        › run as MR jobs alongside transcode jobs in the same queue
      ▪ All functions well contained within the Hadoop MR ecosystem
      ▪ Very low maintenance
        › over and above Grid maintenance
      ▪ Lets us focus on the business logic and functions
      ▪ Excellent SLA for big jobs
  • 13. New Requirements (UGC and near real-time processing)
  • 14. UGC and the current architecture (shortcomings)
      ▪ Very high variance in User Generated Content
        › duration, size, bitrates, etc.
      ▪ Users want immediate feedback
        › SLA very important here
      ▪ Large number of short videos (< 30 seconds)
      ▪ SLAs on small videos are very high
        › latency in MR containers’ allocation and preparation
        › some latency added by OOZIE scheduling
      ▪ OOZIE / MR designed for batch jobs
  • 15. The Latency
      ▪ Total Δt1 ~ 50 seconds to a minute, Δt2 ~ few seconds
      ▪ Job split decision point important
        › leads to efficient resource utilization
      ▪ Map Reduce framework very good for batch jobs
        › but not suitable for near real-time processing
      ▪ Well known and documented
      ▪ Alternate low latency frameworks available
      [Diagram: OOZIE launching MR1 … MR4 around the t1 job split (DAG planning based on source video / requester). Each MR job incurs Δt1 (Job Queuing / Scheduling, Container Allocation, Container Localization) and Δt2 (Container warming: ML Models, etc).]
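To make the latency figures on slide 15 concrete, here is a rough back-of-the-envelope sketch, not the presenters' model. It assumes two cascaded MR stages on the critical path (download/split, then transcode, as on slide 10), Δt1 = 50 s, Δt2 = 3 s for "a few seconds", and an FFmpeg factor of 1.0x from the 0.5x to 5x range on slide 6. The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope model; all names and default values are assumptions.
final class LatencyOverhead {

    /** End-to-end seconds for a video that crosses `stages` cascaded MR stages. */
    static double endToEndSeconds(double videoSeconds, double transcodeFactor,
                                  int stages, double dt1Seconds, double dt2Seconds) {
        double work = videoSeconds * transcodeFactor;          // actual transcode time
        double overhead = stages * (dt1Seconds + dt2Seconds);  // scheduling + warm-up per stage
        return work + overhead;
    }

    /** Fraction of the end-to-end time that is scheduling and warm-up, not useful work. */
    static double overheadFraction(double videoSeconds, double transcodeFactor,
                                   int stages, double dt1Seconds, double dt2Seconds) {
        double total = endToEndSeconds(videoSeconds, transcodeFactor,
                                       stages, dt1Seconds, dt2Seconds);
        double work = videoSeconds * transcodeFactor;
        return (total - work) / total;
    }

    public static void main(String[] args) {
        // A 30-second UGC clip versus a 2-hour source video, same assumed overheads.
        System.out.printf("30 s clip: %.0f s end to end, %.0f%% overhead%n",
                endToEndSeconds(30, 1.0, 2, 50, 3),
                100 * overheadFraction(30, 1.0, 2, 50, 3));
        System.out.printf("2 h video: %.0f s end to end, %.1f%% overhead%n",
                endToEndSeconds(7200, 1.0, 2, 50, 3),
                100 * overheadFraction(7200, 1.0, 2, 50, 3));
    }
}
```

Under these assumed numbers the fixed overhead dominates the short clip and is negligible for the long video, which is consistent with the slides' point that the batch pipeline gives excellent SLAs for big jobs but not for short UGC videos.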
  • 16. New Requirements and options explored
      ▪ Need
        › near real-time scheduling (Δt1)
        › long running re-usable containers (Δt2)
      ▪ Options explored
        › Tez
        › Storm / Spark
        › Slider
  • 17. Issues with options explored
      ▪ Most (if not all) frameworks optimized for captive data flow
        › (in our case) only job metadata flows through the framework
        › while video blobs are consumed from outer subsystems (HDFS / local storage)
        › metadata is not a clear indicator of job characteristics
      ▪ Video vs Text Processing
        › cannot process line by line
        › no key / value decomposition
        › many jobs require the whole video file to be present locally
  • 18. The Comparison Sheet

      Requirement                     | Current   | Tez       | Storm / Spark | Slider
      --------------------------------+-----------+-----------+---------------+----------
      Elasticity                      | High      | High      | High          | High
      Latency                         | High      | Low       | Low           | Low
      Resource Efficiency (usage %)   | High      | Low*      | High          | High
      Dynamic DAG                     | Yes       | No        | No            | No
      DAG Fault Tolerance             | Framework | Framework | Framework     | Framework
      Resource Management             | Fine      | Fine      | Coarse / None | Fine
      Job / Task Abstraction          | Yes       | Yes       | Yes           | No
      Container Release               | Yes       | Yes       | No            | No
      Container Isolation             | Yes       | Yes       | No            | Yes
      Container PreWarm               | Per Job   | Once      | Once          | Once

      * Containers remain idle as DAG cannot be changed post first step
  • 19. New Architecture: Generic YARN (master / worker)
  • 20. Generic YARN Master / Worker
      [Diagram: Jobs flow into a Master, which talks over RPC to Workers w1 … wn (Type 1…k).]
      ▪ Extremely simple framework
      ▪ Master manages a pool of workers
      ▪ Master reads jobs and distributes to workers over Hadoop RPC
      ▪ Framework has pluggable master and worker tasks
      ▪ Pluggable scheduling strategy to manage workers
      ▪ Heterogeneous worker tasks in same pool
      ▪ Custom resource allocation per worker type
      ▪ Worker resources setup once at bootstrap
      ▪ State management is done by Master using HDFS
      ▪ Security and token management by framework harness
      ▪ …
  • 21. Master, Worker Interfaces

      public interface Master {
          Job getJobInput(String workerName);
          void setJobOutput(Job jobOutput);
      }

      public interface Worker {
          public Job execute(Job jobInput);
      }
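The two interfaces on slide 21 can be exercised with a minimal in-process sketch. This is an illustration only: the slides do not define `Job`, so the `Job` class below is hypothetical, as are `InMemoryMaster`, `TaggingWorker`, and `WorkerLoop`. The real framework distributes jobs over Hadoop RPC and keeps state in HDFS; here the worker simply calls the master directly, and the interfaces are declared package-private so the example fits in one file.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical job payload; the slides do not show what a Job contains.
final class Job {
    final String id;
    final String payload;
    Job(String id, String payload) { this.id = id; this.payload = payload; }
}

// Interfaces as shown on slide 21.
interface Master {
    Job getJobInput(String workerName);
    void setJobOutput(Job jobOutput);
}

interface Worker {
    Job execute(Job jobInput);
}

// Assumption: an in-memory queue stands in for the job source and HDFS-backed state.
final class InMemoryMaster implements Master {
    private final Queue<Job> pending = new ArrayDeque<>();
    private final List<Job> completed = new ArrayList<>();

    synchronized void submit(Job job) { pending.add(job); }

    // Returns the next pending job, or null when the queue is empty.
    @Override public synchronized Job getJobInput(String workerName) { return pending.poll(); }

    @Override public synchronized void setJobOutput(Job jobOutput) { completed.add(jobOutput); }

    synchronized List<Job> completedJobs() { return new ArrayList<>(completed); }
}

// Toy worker: a real one would run a transcode; this one only tags the payload.
final class TaggingWorker implements Worker {
    @Override public Job execute(Job jobInput) {
        return new Job(jobInput.id, "done:" + jobInput.payload);
    }
}

final class WorkerLoop {
    /** Pulls jobs until the master has none left; returns how many were processed. */
    static int drain(Master master, Worker worker, String workerName) {
        int processed = 0;
        for (Job job = master.getJobInput(workerName); job != null;
             job = master.getJobInput(workerName)) {
            master.setJobOutput(worker.execute(job));
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        InMemoryMaster master = new InMemoryMaster();
        master.submit(new Job("1", "a.mov"));
        master.submit(new Job("2", "b.avi"));
        int n = drain(master, new TaggingWorker(), "w1");
        System.out.println(n + " jobs processed");
    }
}
```

The pull-style loop mirrors the shape of the interface: a worker asks for input by name and hands back output, so a long-running worker can be reused across many jobs without being rescheduled.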
  • 22. New Architecture for Transcoding
      [Diagram: a Client with an API and a Job Queue feeds a Pool Master, which manages worker groups Worker1 (w1 … wm) through Workerk (w1 … wn). HDFS holds State Information; a separate API fronts Video Storage.]
  • 23. Characteristics of the New Framework
      ▪ Long running workers in YARN containers
        › configurable TTL and timeouts
      ▪ Each pool consists of 1 Master and multiple workers
      ▪ Multiple pools are managed by the client
      ▪ Multiple clients across clusters
      ▪ Adaptive container allocation and release
        › scheduling strategy (low – high watermark based)
      ▪ Significant improvements in latency
        › job scheduling and distribution in milliseconds
      ▪ YARN and the Client provide Master fault tolerance
      ▪ Master takes care of fault tolerance for workers
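Slide 23 mentions a low – high watermark based scheduling strategy without giving details. One plausible reading, sketched below as an assumption rather than the presenters' implementation, is to keep the number of spare idle workers between a low and a high watermark: request containers when the spare count drops below the low mark, release them when it rises above the high mark. The class name, parameters, and the definition of "spare" are all hypothetical.

```java
// Hypothetical watermark strategy; the slides name the idea but not the algorithm.
final class WatermarkStrategy {
    private final int lowWatermark;   // minimum spare idle workers to keep warm
    private final int highWatermark;  // maximum spare idle workers tolerated
    private final int maxWorkers;     // hard cap on pool size

    WatermarkStrategy(int lowWatermark, int highWatermark, int maxWorkers) {
        if (lowWatermark < 0 || highWatermark < lowWatermark || maxWorkers < 1) {
            throw new IllegalArgumentException("invalid watermarks or pool cap");
        }
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
        this.maxWorkers = maxWorkers;
    }

    /**
     * Returns how many containers to request (positive), release (negative),
     * or 0 to hold the pool steady.
     */
    int delta(int totalWorkers, int idleWorkers, int queuedJobs) {
        // Idle workers that would remain after the current queue is handed out.
        int spare = idleWorkers - queuedJobs;
        if (spare < lowWatermark) {
            int wanted = lowWatermark - spare;
            int headroom = Math.max(0, maxWorkers - totalWorkers);
            return Math.min(wanted, headroom);  // grow, but never past the cap
        }
        if (spare > highWatermark) {
            return -(spare - highWatermark);    // shrink back down to the high mark
        }
        return 0;
    }

    public static void main(String[] args) {
        WatermarkStrategy strategy = new WatermarkStrategy(2, 5, 20);
        System.out.println(strategy.delta(10, 1, 4));  // backlog: request containers
        System.out.println(strategy.delta(10, 9, 0));  // mostly idle: release containers
        System.out.println(strategy.delta(10, 3, 0));  // within the band: hold
    }
}
```

Keeping a band between the two marks rather than a single target avoids requesting and releasing containers on every small fluctuation, which matters when each new container pays the allocation and warm-up cost described on slide 15.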
  • 24. What Next …
      ▪ Hope to release to the community soon
      ▪ In-principle similar to Google containers
        › with a low latency Job abstraction
      ▪ YARN (nice to have):
        › Multi dimensional scheduling
        › Node Labels
  • 25. Thank You @kishore_angani @smcal75 We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.