SlideShare une entreprise Scribd logo
1  sur  46
Télécharger pour lire hors ligne
Presented by: Huy Nguyen @huyng
Computer vision at scale with
Hadoop and Storm
Copyright © 2014 Yahoo, Inc.
Outline
▪ Computer vision at 10,000 feet
▪ Hadoop training system
▪ Storm continuous stream processing
▪ Hadoop batch processing
▪ Q&A
How can we help Flickr users better
organize, search, and browse
their photo collection?
photo credit: Quinn Dombrowski
What is AutoTagging?
Any photo you upload to Flickr is now automatically
classified using computer vision software
Class Probability
Flowers 0.98
Outdoors 0.95
Cat 0.001
Grass 0.6
Classifiers
Classifiers
Classifiers
Demo
Auto Tags Feeding Search
▪
▪
▪
▪
▪
▪
Our Goals
▪ Train thousands of classifiers
▪ Classify millions of photos per day
▪ Compute auto tags on backlog of billions of photos
▪ Enable reprocessing bi-weekly and every quarter
▪ Do it all within 90 days of being acquired!
Our system for training classifiers
▪ High level overview of image classification
▪ Hadoop workflow
Classification in a nutshell
How do we define a classifier?
A classifier is simply the line “z = wx -
b” which divides positive and negative
examples with the largest margin.
Entirely defined by w and b
Training is simply adjusting w and b to
find the line with the largest margin
dividing negative and positive example
data
Classification is done by computing z,
given some data point x
Image classification in a nutshell
To classify images we need to convert
raw pixels into a “feature” to plug into x
compute
feature
How are image classifiers made?
flower Class Name
Negative
Examples
(approx. 100K)
Positive Examples
(approx. 10K)
Repeat a few thousand times
for each class
Compute
Features
Train
Classifier
w,b
Hadoop Classifier Training
Hadoop Classifier Training
HDFS
Pixel Data
+
Label Data
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
dog 1 TRUE
dog 2 FALSE
dog 3 TRUE
cat 1 FALSE
cat 2 TRUE
cat 3 FALSE
.
.
Pixel Data Label Data
photo_id, pixel_data class, photo_id, label
Hadoop Classifier Training - JOIN
map
map
map
reduce
Pixels
Labels
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
.
.
▪ mappers emit (photo_id, pixels) OR (photo_id, class, label)
▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
Hadoop Classifier Training - TRAIN
1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat,TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
.
.
map
map
map
reduce
compute
feature
training
classifier
dog
clf.
cat
clf.
▪ mappers compute features, emits a (class, feat64, label) for each class
image associated with image
▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
Hadoop Classifier Training
▪ Workflow coordinated with OOZIE
Hadoop Classifier Training
▪ 10,000 mappers
▪ 1,000 reducers
▪ A few thousand classifiers in less than 6 hours
▪ Replaced OpenMPI based system
▪ Easier deployment
▪ Access to more hardware
▪ Much simpler code
Area of Improvements
▪ Reducers have all training data in memory
▪ A potential bottleneck if we want to expand training set for any given
set of classifiers
▪ Known path to avoiding this bottleneck
▪ Incrementally update models as new data arrives
Our system for tagging the flickr corpus
▪ Batch processing with Hadoop
▪ Stream processing with Storm
Batch processing in Hadoop
www HDFS Autotags
Mapper
Search
Engine
Indexer
HDFS
Indexing
Reducer
Autotags
MapperAutotags
Mapper
Hadoop Auto tagging
HDFS
Pixel Data
Autotag Results
1 <base64>
2 <base64>
3 <base64>
4 <base64>
.
.
n <base64>
1 { “cat”: .98, “dog”: 0.3 }
2 { “cat”: .97, “dog”: 0.2 }
3 { “cat”: .2, “dog”: 0.3 }
4 { “cat”: .5, “dog”: 0.9 }
.
.
n { “cat”: .90, “dog”: 0.23 }
Pixel Data Autotag Results
photo_id, pixel_data photo_id, results
Hadoop classification performance
▪ 10,000 Mappers
▪ Processed in about 1 week entire corpus
Main challenge: packaging our code
▪ Heterogenous Python & C++ codebase
▪ How do we get all of the shared library dependencies
into a heterogenous system distributed across many
machines?
Main challenge: packaging our code
▪ tried: pyinstaller, didn’t work
▪ rolled our own solution using strace
▪ /autotags_package
▪ /lib
▪ /lib/python2.7/site-packages
▪ /bin
▪ /app
Stream processing (our first attempt)
www Redis
PHP
WorkersPHP
WorkersPHP
Workers
fork autotags a.jpg
search
db
Why we started looking into Storm
Why we started looking into Storm
Feature
Computation
Classification
300 ms 10 ms
Classifier
Load
300 ms
Classifier loading accounted for
49% of total processing time
Solutions we evaluated at the time
▪ Mem-mapping model
▪ Batching up operations
▪ Daemonize our autotagging code
Enter
Storm
Storm computational framework
▪ Spouts
▪ The source of your data
stream. Spouts “emit”
tuples
▪ Tuples
▪ An atomic unit of work
▪ Bolts
▪ Small programs that
processes your tuple
stream
▪ Topology
▪ A graph of bolts and
spouts
Defining a topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("images", new ImageSpout(), 10);
builder.setBolt("autotag", new AutoTagBolt(), 3)
.shuffleGrouping("images")
builder.setBolt("dbwriter", new DBWriterBolt(), 2)
.shuffleGrouping("autotag")
Defining a bolt
class Autotag(storm.BasicBolt):
def __init__(self):
self.classifier.load()
def process(self, tuple):
<process tuple here>
Our final solution
www Redis Autotags
Spout
Autotags
Bolt
Search
Engine
Indexer
Bolt
DB
Writer
Bolt
Topology
Storm architecture
Supervisors
(10 nodes)
Nimbus
(1 node)
Zookeepers
(3 nodes)
Spouts and bolts live here
Fault tolerant against node failures
Nimbus server dies
▪ no new jobs can be submitted
▪ current jobs stay running
▪ simply restart
Supervisor node dies
▪ All unprocessed jobs for that node get reassigned
▪ Supervisor restarted
Zookeeper node dies
▪ Restart node, no data loss as long as not too many die at once
storm commands
storm jar topology-jar-path class
storm commands
storm list
storm commands
storm kill topology-name
storm commands
storm deactivate topology-jar-path class
storm commands
storm activate topology-name
Is Storm just another queuing system?
Yes, and more
▪ A framework for separating application design from
scaling concerns
▪ Has a great deployment story
▪ atomic start, kill, deactivate for free
▪ no more rsync
▪ No single point of failure
▪ Wide industry adoption
▪ Is programming language agnostic
Try storm out!
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/nathanmarz/storm-deploy
Thank you
Presentation copyright © 2014 Yahoo, Inc.

Contenu connexe

Tendances

A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningMakoto Yui
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to HivemallMakoto Yui
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskVíctor Zabalza
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Databricks
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
Get Competitive with Driverless AI
Get Competitive with Driverless AIGet Competitive with Driverless AI
Get Competitive with Driverless AISri Ambati
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sjHolden Karau
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Databricks
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users GroupNitay Joffe
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 

Tendances (20)

A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
20160908 hivemall meetup
20160908 hivemall meetup20160908 hivemall meetup
20160908 hivemall meetup
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
Get Competitive with Driverless AI
Get Competitive with Driverless AIGet Competitive with Driverless AI
Get Competitive with Driverless AI
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
R meetup talk
R meetup talkR meetup talk
R meetup talk
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Beyond shuffling global big data tech conference 2015 sj
Beyond shuffling   global big data tech conference 2015 sjBeyond shuffling   global big data tech conference 2015 sj
Beyond shuffling global big data tech conference 2015 sj
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 

Similaire à Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyGanesan Narayanasamy
 
Information from pixels
Information from pixelsInformation from pixels
Information from pixelsDave Snowdon
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disquszeeg
 
Puppet and CloudStack
Puppet and CloudStackPuppet and CloudStack
Puppet and CloudStackke4qqq
 
Python: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingPython: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingDamian T. Gordon
 
Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Lviv Startup Club
 
Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Roland Tritsch
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
NUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppet
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Puppet and Apache CloudStack
Puppet and Apache CloudStackPuppet and Apache CloudStack
Puppet and Apache CloudStackPuppet
 
Infrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackInfrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackke4qqq
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentMike Brittain
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Python在豆瓣的应用
Python在豆瓣的应用Python在豆瓣的应用
Python在豆瓣的应用Qiangning Hong
 
Machine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesMachine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesLars Gregori
 
Usecase examples of Packer
Usecase examples of Packer Usecase examples of Packer
Usecase examples of Packer Hiroshi SHIBATA
 

Similaire à Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen) (20)

OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon Valley
 
Information from pixels
Information from pixelsInformation from pixels
Information from pixels
 
DjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling DisqusDjangoCon 2010 Scaling Disqus
DjangoCon 2010 Scaling Disqus
 
Puppet and CloudStack
Puppet and CloudStackPuppet and CloudStack
Puppet and CloudStack
 
Python: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented ProgrammingPython: Migrating from Procedural to Object-Oriented Programming
Python: Migrating from Procedural to Object-Oriented Programming
 
Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"Viktor Tsykunov "Microsoft AI platform for every Developer"
Viktor Tsykunov "Microsoft AI platform for every Developer"
 
Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012Puppet Camp Dublin - 06/2012
Puppet Camp Dublin - 06/2012
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
NUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloudNUS-ISS Learning Day 2019-Pandas in the cloud
NUS-ISS Learning Day 2019-Pandas in the cloud
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Puppet and Apache CloudStack
Puppet and Apache CloudStackPuppet and Apache CloudStack
Puppet and Apache CloudStack
 
Infrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStackInfrastructure as code with Puppet and Apache CloudStack
Infrastructure as code with Puppet and Apache CloudStack
 
Advanced Topics in Continuous Deployment
Advanced Topics in Continuous DeploymentAdvanced Topics in Continuous Deployment
Advanced Topics in Continuous Deployment
 
Devf (Shoe Lovers)
Devf (Shoe Lovers)Devf (Shoe Lovers)
Devf (Shoe Lovers)
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Python在豆瓣的应用
Python在豆瓣的应用Python在豆瓣的应用
Python在豆瓣的应用
 
Machine Learning Models on Mobile Devices
Machine Learning Models on Mobile DevicesMachine Learning Models on Mobile Devices
Machine Learning Models on Mobile Devices
 
Usecase examples of Packer
Usecase examples of Packer Usecase examples of Packer
Usecase examples of Packer
 

Plus de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Plus de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Dernier

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 

Dernier (20)

Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 

Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

  • 1. Presented by: Huy Nguyen @huyng Computer vision at scale with Hadoop and Storm Copyright © 2014 Yahoo, Inc.
  • 2. Outline ▪ Computer vision at 10,000 feet ▪ Hadoop training system ▪ Storm continuous stream processing ▪ Hadoop batch processing ▪ Q&A
  • 3.
  • 4. How can we help Flickr users better organize, search, and browse their photo collection? photo credit: Quinn Dombrowski
  • 5. What is AutoTagging? Any photo you upload to Flickr is now automatically classified using computer vision software Class Probability Flowers 0.98 Outdoors 0.95 Cat 0.001 Grass 0.6 Classifiers Classifiers Classifiers
  • 7. Auto Tags Feeding Search ▪ ▪ ▪ ▪ ▪ ▪
  • 8. Our Goals ▪ Train thousands of classifiers ▪ Classify millions of photos per day ▪ Compute auto tags on backlog of billions of photos ▪ Enable reprocessing bi-weekly and every quarter ▪ Do it all within 90 days of being acquired!
  • 9. Our system for training classifiers ▪ High level overview of image classification ▪ Hadoop workflow
  • 10. Classification in a nutshell How do we define a classifier? A classifier is simply the line “z = wx - b” which divides positive and negative examples with the largest margin. Entirely defined by w and b Training is simply adjusting w and b to find the line with the largest margin dividing negative and positive example data Classification is done by computing z, given some data point x
  • 11. Image classification in a nutshell To classify images we need to convert raw pixels into a “feature” to plug into x compute feature
  • 12. How are image classifiers made? flower Class Name Negative Examples (approx. 100K) Positive Examples (approx. 10K) Repeat a few thousand times for each class Compute Features Train Classifier w,b
  • 14. Hadoop Classifier Training HDFS Pixel Data + Label Data 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> dog 1 TRUE dog 2 FALSE dog 3 TRUE cat 1 FALSE cat 2 TRUE cat 3 FALSE . . Pixel Data Label Data photo_id, pixel_data class, photo_id, label
  • 15. Hadoop Classifier Training - JOIN map map map reduce Pixels Labels 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . ▪ mappers emit (photo_id, pixels) OR (photo_id, class, label) ▪ reducers, takes advantage of sorting to emit (photo_id, pixels, json(label data))
  • 16. Hadoop Classifier Training - TRAIN 1 <pixelsb64> [(cat, FALSE), (dog, TRUE)] 2 <pixelsb64> [(cat,TRUE), (dog, FALSE)] 3 <pixelsb64> [(cat, FALSE), (dog, TRUE)] . . map map map reduce compute feature training classifier dog clf. cat clf. ▪ mappers compute features, emits a (class, feat64, label) for each class image associated with image ▪ reducers, takes advantage of sorting to train. Emits (class, w, b)
  • 17. Hadoop Classifier Training ▪ Workflow coordinated with OOZIE
  • 18. Hadoop Classifier Training ▪ 10,000 mappers ▪ 1,000 reducers ▪ A few thousand classifiers in less than 6 hours ▪ Replaced OpenMPI based system ▪ Easier deployment ▪ Access to more hardware ▪ Much simpler code
  • 19. Area of Improvements ▪ Reducers have all training data in memory ▪ A potential bottleneck if we want to expand training set for any given set of classifiers ▪ Known path to avoiding this bottleneck ▪ Incrementally update models as new data arrives
  • 20. Our system for tagging the flickr corpus ▪ Batch processing with Hadoop ▪ Stream processing with Storm
  • 21. Batch processing in Hadoop www HDFS Autotags Mapper Search Engine Indexer HDFS Indexing Reducer Autotags MapperAutotags Mapper
  • 22. Hadoop Auto tagging HDFS Pixel Data Autotag Results 1 <base64> 2 <base64> 3 <base64> 4 <base64> . . n <base64> 1 { “cat”: .98, “dog”: 0.3 } 2 { “cat”: .97, “dog”: 0.2 } 3 { “cat”: .2, “dog”: 0.3 } 4 { “cat”: .5, “dog”: 0.9 } . . n { “cat”: .90, “dog”: 0.23 } Pixel Data Autotag Results photo_id, pixel_data photo_id, results
  • 23. Hadoop classification performance ▪ 10,000 Mappers ▪ Processed in about 1 week entire corpus
  • 24. Main challenge: packaging our code ▪ Heterogenous Python & C++ codebase ▪ How do we get all of the shared library dependencies into a heterogenous system distributed across many machines?
  • 25. Main challenge: packaging our code ▪ tried: pyinstaller, didn’t work ▪ rolled our own solution using strace ▪ /autotags_package ▪ /lib ▪ /lib/python2.7/site-packages ▪ /bin ▪ /app
  • 26. Stream processing (our first attempt) www Redis PHP WorkersPHP WorkersPHP Workers fork autotags a.jpg search db
  • 27. Why we started looking into Storm
  • 28. Why we started looking into Storm Feature Computation Classification 300 ms 10 ms Classifier Load 300 ms Classifier loading accounted for 49% of total processing time
  • 29. Solutions we evaluated at the time ▪ Mem-mapping model ▪ Batching up operations ▪ Daemonize our autotagging code
  • 31.
  • 32. Storm computational framework ▪ Spouts ▪ The source of your data stream. Spouts “emit” tuples ▪ Tuples ▪ An atomic unit of work ▪ Bolts ▪ Small programs that processes your tuple stream ▪ Topology ▪ A graph of bolts and spouts
  • 33. Defining a topology TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("images", new ImageSpout(), 10); builder.setBolt("autotag", new AutoTagBolt(), 3) .shuffleGrouping("images") builder.setBolt("dbwriter", new DBWriterBolt(), 2) .shuffleGrouping("autotag")
  • 34. Defining a bolt class Autotag(storm.BasicBolt): def __init__(self): self.classifier.load() def process(self, tuple): <process tuple here>
  • 35. Our final solution www Redis Autotags Spout Autotags Bolt Search Engine Indexer Bolt DB Writer Bolt Topology
  • 36. Storm architecture Supervisors (10 nodes) Nimbus (1 node) Zookeepers (3 nodes) Spouts and bolts live here
  • 37. Fault tolerant against node failures Nimbus server dies ▪ no new jobs can be submitted ▪ current jobs stay running ▪ simply restart Supervisor node dies ▪ All unprocessed jobs for that node get reassigned ▪ Supervisor restarted Zookeeper node dies ▪ Restart node, no data loss as long as not too many die at once
  • 38. storm commands storm jar topology-jar-path class
  • 40. storm commands storm kill topology-name
  • 41. storm commands storm deactivate topology-jar-path class
  • 43. Is Storm just another queuing system?
  • 44. Yes, and more ▪ A framework for separating application design from scaling concerns ▪ Has a great deployment story ▪ atomic start, kill, deactivate for free ▪ no more rsync ▪ No single point of failure ▪ Wide industry adoption ▪ Is programming language agnostic
  • 46. Thank you Presentation copyright © 2014 Yahoo, Inc.