SlideShare une entreprise Scribd logo
1  sur  47
Télécharger pour lire hors ligne
Jupyter Notebooks
Workflow Building
Pipelines
Tools
Serving
Metadata
Kale
Fairing
TFX
KF Pipelines
HP Tuning
Tensorboard
KFServing
Seldon Core
TFServing, + Training Operators
Pytorch
XGBoost, +
Tensorflow
Prometheus
Kubeflow - Distributed Training and HPO
Andrew Butler, Qianyang Yu, Tommy Li, Animesh Singh
MPI
MXNet
Distributed Model Training and HPO (TFJob, PyTorch Job, MPI
Job, Katib, …)
•  Addresses	One	of	the	key	goals	for	model	builder	persona:		
	
Distributed	Model	Training	and	Hyper	parameter	
optimization	for	Tensorflow,	PyTorch	etc.	
	
Common	problems	in	HP	optimization	
•  Overfitting	
•  Wrong	metrics	
•  Too	few	hyperparameters	
Katib:	a	fully	open	source,	Kubernetes-native	hyperparameter	
tuning	service	
•  Inspired	by	Google	Vizier	
•  Framework	agnostic	
•  Extensible	algorithms	
•  Simple	integration	with	other	Kubeflow	components	
Kubeflow	also	supports	distributed	MPI	based	training	using	
Horovod
Distributed Training Operators
3
Distributed
Training Operators
4
Traditional Model Training
Machine	1	
Source:	https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9
Need for Distributed Training
•  Models	that	are	too	large	for	a	single	device	
	
•  Improved	parallelization	
	
Source:	http://www.juyang.co/distributed-model-training-ii-parameter-server-and-allreduce/
Distributed Model Training
Parameter Servers
•  Most	simple	form	of	distributed		
training	
•  One	centralized	parameter	server		
does	the	aggregation	job	of	collecting	
and	redistributing	results	of	each		
worker	node
Parameter Servers
AllReduce
•  Most	parallelized	form	of	distributed	
training	
•  There	are	many	different	styles	
of	AllReduce	with	each	having	
different	benefits	and	costs
AllReduce
Advantages of allreduce-style training
•  Each	worker	stores	a	complete	set	of	model	parameters,	so	adding	more	workers	is	easy	
•  Failures	among	workers	can	be	recovered	easily	by	just	restarting	the	failed	worker	and	
loading	the	model	from	an	existing	worker	
•  Models	can	be	updated	more	efficiently	by	leveraging	network	structure	
•  Scaling	up	and	down	workers	only	requires	reconstructing	the	underlying	allreduce	
communicator	and	re-assigning	the	ranks	among	the	workers
Distributed Training in Kubeflow
Kubernetes	 Kubeflow	
tf-operator	
pytorch-operator	
mpi-operator
MPI Operator
•  The	MPI	Operator	allows	for	running	allreduce-style	distributed	training	on	Kubernetes	
•  Provides	common	Custom	Resource	Definition	(CRD)	for	defining	training	jobs	
•  Unlike	other	operators,	such	as	the	TF	Operator	and	the	Pytorch	Operator,	the	MPI	
Operator	is	decoupled	from	one	machine	learning	framework.	This	allows	the	MPI	
Operator	to	work	with	many	machine	learning	frameworks	such	as	Tensorflow,	Pytorch,	
and	many	more
Design
•  When	a	new	MPIJob	is	created	the	MPIJob	Controller	goes	through	a	set	of	steps	
•  1.	Create	a	ConfigMap	
•  2.	Create	the	RBAC	resources	(Role,	Service	Account,	Role	Binding)	to	allow	remote	
execution	(pods/exec)	
•  3.	Create	the	worker	StatefulSet	
•  4.	Wait	for	worker	pods	to	be	ready	
•  5.	Create	the	Job	which	is	run	under	the	Service	
Account	(from	Step	2)
Example API Spec
MPI Operator Roadmap
MPI Operator Roadmap
TF Operator
•  TFJobs	are	Kubernetes	custom	resource	definitions	for	running	distributed	and		
non-distributed	Tensorflow	jobs	on	Kubernetes	
•  The	tf-operator	is	the	Kubeflow	implementation	of	TFJobs	
•  A	TFJob	is	a	collection	of	TfReplicas	where	each	TfReplica	corresponds	to	a	set	of	
Tensorflow	processes	performing	a	role	in	the	job
Design
•  A	distributed	Tensorflow	Job	is	collection	of	the	following	processes	
•  Chief	–	The	chief	is	responsible	for	orchestrating	training	and	performing	tasks	like	checkpointing	the	
model	
•  Ps	–	The	ps	are	parameters	servers;	the	servers	provide	a	distributed	data	store	for	the	model	
parameters	to	access	
•  Worker	–	The	workers	do	the	actual	work	of	training	the	model.	In	some	cases,	worker	0	might	also	
act	as	the	chief	
•  Evaluator	-		The	evaluators	can	be	used	to	compute	evaluation	metrics	as	the	model	is	trained
TFJob vs. MPIJob
TF Operator Roadmap
Pytorch Operator
•  Similar	to	TFJobs	and	MPIJobs,	PytorchJobs	are	Kubernetes	custom	resource	definitions	
for	running	distributed	and	non-distributed	PytorchJobs	on	Kubernetes	
•  The	pytorch-operator	is	the	Kubeflow	implementation	of	PytorchJobs		
•  There	are	a	number	of	metrics	that	can	be	monitored	for	each	component	container	of	
the	pytorch-operator	by	using	Prometheus	Montoring
Pytorch Operator monitoring
•  Prometheus	monitoring	for	pytorch	operator	makes	the	many	available	
metrics	easy	to	monitor		
•  There	are	metrics	for	each	component	container	for	the	pytorch	operator,	
such	as	CPU	usage,	GPU	usage,	Keep-Alive	check,	and	more	
•  There	are	also	metrics	for	reporting	PytorchJob	information	such	as	job	
creation,	successful	completions,	failed	jobs,	etc.
Katib
	
Introduction	to	Katib
Kubeflow-Katib
•  Motivation:	Automated	tuning	machine	learning	model’s	hyperparameters	and	neural	
architecture	search.		
•  Major	components:		
•  katib-db-manager:	GRPC	API	server	of	Katib	which	is	the	DB	Interface.	
•  katib-mysql:	Data	storage	backend	of	Katib	using	mysql.	
•  katib-ui:	User	interface	of	Katib.	
•  katib-controller:	Controller	for	Katib	CRDs	in	Kubernetes.	
•  Katib:	Kubernetes	Native	System	for	Hyperparameter	Turning	and	Neural	Architecture	Search
•  Github	Repository:		https://github.com/kubeflow/katib
AutoML workflows
	
	
Click to add text
Katib	is	a	scalable	Kubernetes-native	general	
AutoML	platform.		
Katib	integrate	hyper-parameter	turning	and	
NAS	into	one	flexible	frame-work.
Design of Katib
Note:	StudyJob	is	now	called	Experiment
Install Katib
•  To	install	Katib	as	part	of	Kubeflow	
	
•  To	install	Katib	separately	from	Kubeflow	
	
Details:	
https://www.kubeflow.org/docs/components/hyperparameter-tuning/
hyperparameter/#katib-setup	
	
In	this	example,	we	installed	Katib	as	part	of	Kubeflow	
https://www.kubeflow.org/docs/ibm/
Accessing the katib UI
•  Under	the	Kubeflow	web	UI,	click	the	Katib	on	the	left	side	bar.
First Example of Katib
•  We	are	using	random-example	from	Hyper-parameter	Turning
Deploy the random-example
•  Click	the	Deploy
Check experiment status
•  Click	the	Katib	tab,	then	choose	Monitor	under	HP	on	the	left	side
View Experiment
•  Click	the	experiment	name,	it	will	show	the	experiment	also	the	status	of	
trial
Check status from command line
•  At	your	K8S	cluster	command	line:
Check Experiment's CR
•  Get	the	experiment	CR	from	command	line
Conti. Experiment CR
	
	
•  Algorithm:	Katib	supports	random,	grid,	hyperband,	bayesian	optimization	and	tpe	algorithms.	
•  MaxFailedTrialCount:	specify	the	max	the	tuning	with	failed	status	
•  MaxTrialCount:	specify	the	limit	for	the	hyper-parameters	sets	can	be	generated.	
•  Objective:	Set	objetiveMetricName	and	additionalMetricNames.	
•  ParalleTrialCount:	how	many	set	of	hyper-parameter	to	be	tested	in	parallel.
Fields in Experiment's spec
•  TrialTemplate:	Your	model	should	be	packaged	by	image,	model's	hyper-parameter	
must	be	configurable	by	argument	or	environment	variable.	
•  Parameter:	defines	the	range	of	the	hyper-parameters	you	want	to	tune	your	model.	
•  MetricsCollectorSpec:	The	metric	collectors	for	stdout,	file	or	tfevent.	Metric	collecting	
will	run	as	a	sidecar	if	enabled.
Trial
•  Katib	internally	generate	a	Trial	CR,	it	is	for	internal	logic	control.
Suggestion
•  Katib	internally	create	a	suggestion	CR	for	each	experiment	CR.	It	includes	
hyper-parameter	algorithm	name	and	how	many	sets	of	hyper-parameter	
katib	is	asking	to	be	generated	by	requests	field.
Conti. Suggestion
Katib controller flow(step1 to 3)
	
	
1.			A	experiment	CR	is	submitted	to	K8S	API	server;	Katib	experiment	mutating	and	validating	webhook	will	
be	called	to	set	default	value	for	the	Experiment	CR	and	validate	the	CR.	
2.		Experiment	controller	create	a	suggested	CR	
3.		Suggestion	controller	create	the	algorithm	deployment	and	service	based	on	the	new	suggestion	CR
Katib controller flow(Step4 to 6)
	
	
4.				Suggestion	controller	verifies	the	algorithm	service	is	ready;	
generates	spec.request	-	len(status.suggestions)	and	append	them	into	status.suggestions	
5.			Experiment	controller	detects	the	suggestion	CR	has	been	updated,	generate	each	Trial	for	each	new	hyper-
parameters	set	
6.			Trial	controller	generates	job	based	on	runSpec	manifest	with	the	new	hyper-parameter	set.
Katib controller flow(Step7 to 9)
7.	Related	job	controller	(k8s	batch	job,	kubeflow	pytorchJob	or	Kubeflow	TFJob)	generated	Pods.	
8.	Katib	Pod	mutating	webhook	to	inject	metrics	collector	sidecar	container	to	the	candidate	Pod.	
9.	Metrics	collector	container	tries	to	collect	metrics	from	it	and	persists	them	into	Katib	DB	backend.
Katib controller flow(Step10 to 11)
	
	
10.	When	the	ML	model	job	ends,	Trial	controller	will	update	corresponding	Trial	CR's	status.	
11.	When	a	Trial	CR	goes	to	end,	Experiment	controller	will	increase	request	field	of	corresponding	
suggestion	CR,	then	go	to	step	4	again.	If	it	ends,	it	will	record	the	best	set	of	hyper-parameters	
in	.status.currentOptimalTrial	field.
Katib
46	
Think	2020	/	DOC	ID	/	Month	XX,	2020	/	©	2020	IBM	
Corporation
Thank you!
Further	Resources	
	
•  Distributed	Training:	
•  https://github.com/kubeflow/tf-operator	
•  https://github.com/kubeflow/pytorch-operator	
•  https://github.com/kubeflow/mpi-operator	
•  Katib	
•  https://github.com/kubeflow/katib

Contenu connexe

Tendances

Introduction to Helm
Introduction to HelmIntroduction to Helm
Introduction to HelmHarshal Shah
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overviewGabriel Carro
 
KFServing and Kubeflow Pipelines
KFServing and Kubeflow PipelinesKFServing and Kubeflow Pipelines
KFServing and Kubeflow PipelinesAnimesh Singh
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowDatabricks
 
Evolution of containers to kubernetes
Evolution of containers to kubernetesEvolution of containers to kubernetes
Evolution of containers to kubernetesKrishna-Kumar
 
K8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingK8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingPiotr Perzyna
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes IntroductionPeng Xiao
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetesGabriel Carro
 
Getting Started with Kubernetes
Getting Started with Kubernetes Getting Started with Kubernetes
Getting Started with Kubernetes VMware Tanzu
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingTheofilos Papapanagiotou
 
(Draft) Kubernetes - A Comprehensive Overview
(Draft) Kubernetes - A Comprehensive Overview(Draft) Kubernetes - A Comprehensive Overview
(Draft) Kubernetes - A Comprehensive OverviewBob Killen
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingDatabricks
 

Tendances (20)

Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
DevOps with Kubernetes
DevOps with KubernetesDevOps with Kubernetes
DevOps with Kubernetes
 
Introduction to Helm
Introduction to HelmIntroduction to Helm
Introduction to Helm
 
Kubernetes a comprehensive overview
Kubernetes   a comprehensive overviewKubernetes   a comprehensive overview
Kubernetes a comprehensive overview
 
KFServing and Kubeflow Pipelines
KFServing and Kubeflow PipelinesKFServing and Kubeflow Pipelines
KFServing and Kubeflow Pipelines
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflow
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Evolution of containers to kubernetes
Evolution of containers to kubernetesEvolution of containers to kubernetes
Evolution of containers to kubernetes
 
What Is Helm
 What Is Helm What Is Helm
What Is Helm
 
K8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingK8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals Training
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Getting Started with Kubernetes
Getting Started with Kubernetes Getting Started with Kubernetes
Getting Started with Kubernetes
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
 
(Draft) Kubernetes - A Comprehensive Overview
(Draft) Kubernetes - A Comprehensive Overview(Draft) Kubernetes - A Comprehensive Overview
(Draft) Kubernetes - A Comprehensive Overview
 
AKS
AKSAKS
AKS
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model Serving
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 

Similaire à Distributed Training and HPO with Kubeflow Operators

MLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performanceMLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performancejemin lee
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Akash Tandon
 
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondHadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondErik Krogen
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learningAntje Barth
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaLINE Corporation
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018Chun-Yu Tseng
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDocker, Inc.
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...Bill Liu
 
When HPC meet ML/DL: Manage HPC Data Center with Kubernetes
When HPC meet ML/DL: Manage HPC Data Center with KubernetesWhen HPC meet ML/DL: Manage HPC Data Center with Kubernetes
When HPC meet ML/DL: Manage HPC Data Center with KubernetesYong Feng
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning DevelopmentMatei Zaharia
 
running Tensorflow in Production
running Tensorflow in Productionrunning Tensorflow in Production
running Tensorflow in ProductionMatthias Feys
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019Travis Oliphant
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computingThe BioTeam Inc.
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Intel IT Center
 
Hydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowHydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowRustem Zakiev
 

Similaire à Distributed Training and HPO with Kubeflow Operators (20)

MLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performanceMLPerf an industry standard benchmark suite for machine learning performance
MLPerf an industry standard benchmark suite for machine learning performance
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
 
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondHadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learning
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE Fukuoka
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018
 
Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetes
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
When HPC meet ML/DL: Manage HPC Data Center with Kubernetes
When HPC meet ML/DL: Manage HPC Data Center with KubernetesWhen HPC meet ML/DL: Manage HPC Data Center with Kubernetes
When HPC meet ML/DL: Manage HPC Data Center with Kubernetes
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning Development
 
running Tensorflow in Production
running Tensorflow in Productionrunning Tensorflow in Production
running Tensorflow in Production
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computing
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
Deploying and Managing HPC Clusters with IBM Platform and Intel Xeon Phi Copr...
 
Kubeflow.pptx
Kubeflow.pptxKubeflow.pptx
Kubeflow.pptx
 
Hydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on KubeflowHydrosphere.io for ODSC: Webinar on Kubeflow
Hydrosphere.io for ODSC: Webinar on Kubeflow
 

Plus de Animesh Singh

Machine Learning Exchange (MLX)
Machine Learning Exchange (MLX)Machine Learning Exchange (MLX)
Machine Learning Exchange (MLX)Animesh Singh
 
KFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AIKFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AIAnimesh Singh
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingAnimesh Singh
 
Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Animesh Singh
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAnimesh Singh
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Animesh Singh
 
Trusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open SourceTrusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open SourceAnimesh Singh
 
AIF360 - Trusted and Fair AI
AIF360 - Trusted and Fair AIAIF360 - Trusted and Fair AI
AIF360 - Trusted and Fair AIAnimesh Singh
 
AI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAnimesh Singh
 
Fabric for Deep Learning
Fabric for Deep LearningFabric for Deep Learning
Fabric for Deep LearningAnimesh Singh
 
Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!Animesh Singh
 
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...How to build a Distributed Serverless Polyglot Microservices IoT Platform us...
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...Animesh Singh
 
How to build an event-driven, polyglot serverless microservices framework on ...
How to build an event-driven, polyglot serverless microservices framework on ...How to build an event-driven, polyglot serverless microservices framework on ...
How to build an event-driven, polyglot serverless microservices framework on ...Animesh Singh
 
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAs a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAnimesh Singh
 
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...Animesh Singh
 
Finding and-organizing Great Cloud Foundry User Groups
Finding and-organizing Great Cloud Foundry User GroupsFinding and-organizing Great Cloud Foundry User Groups
Finding and-organizing Great Cloud Foundry User GroupsAnimesh Singh
 
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...Animesh Singh
 
Building a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackBuilding a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackAnimesh Singh
 
Cloud foundry Docker Openstack - Leading Open Source Triumvirate
Cloud foundry Docker Openstack - Leading Open Source TriumvirateCloud foundry Docker Openstack - Leading Open Source Triumvirate
Cloud foundry Docker Openstack - Leading Open Source TriumvirateAnimesh Singh
 

Plus de Animesh Singh (20)

Machine Learning Exchange (MLX)
Machine Learning Exchange (MLX)Machine Learning Exchange (MLX)
Machine Learning Exchange (MLX)
 
KFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AIKFServing Payload Logging for Trusted AI
KFServing Payload Logging for Trusted AI
 
KFServing and Feast
KFServing and FeastKFServing and Feast
KFServing and Feast
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model Inferencing
 
Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox Defend against adversarial AI using Adversarial Robustness Toolbox
Defend against adversarial AI using Adversarial Robustness Toolbox
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
 
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX]
 
Trusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open SourceTrusted, Transparent and Fair AI using Open Source
Trusted, Transparent and Fair AI using Open Source
 
AIF360 - Trusted and Fair AI
AIF360 - Trusted and Fair AIAIF360 - Trusted and Fair AI
AIF360 - Trusted and Fair AI
 
AI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with Knative
 
Fabric for Deep Learning
Fabric for Deep LearningFabric for Deep Learning
Fabric for Deep Learning
 
Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!Microservices, Kubernetes and Istio - A Great Fit!
Microservices, Kubernetes and Istio - A Great Fit!
 
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...How to build a Distributed Serverless Polyglot Microservices IoT Platform us...
How to build a Distributed Serverless Polyglot Microservices IoT Platform us...
 
How to build an event-driven, polyglot serverless microservices framework on ...
How to build an event-driven, polyglot serverless microservices framework on ...How to build an event-driven, polyglot serverless microservices framework on ...
How to build an event-driven, polyglot serverless microservices framework on ...
 
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons LearntAs a Service: Cloud Foundry on OpenStack - Lessons Learnt
As a Service: Cloud Foundry on OpenStack - Lessons Learnt
 
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...
Introducing Cloud Native, Event Driven, Serverless, Micrsoservices Framework ...
 
Finding and-organizing Great Cloud Foundry User Groups
Finding and-organizing Great Cloud Foundry User GroupsFinding and-organizing Great Cloud Foundry User Groups
Finding and-organizing Great Cloud Foundry User Groups
 
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ...
 
Building a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStackBuilding a PaaS Platform like Bluemix on OpenStack
Building a PaaS Platform like Bluemix on OpenStack
 
Cloud foundry Docker Openstack - Leading Open Source Triumvirate
Cloud foundry Docker Openstack - Leading Open Source TriumvirateCloud foundry Docker Openstack - Leading Open Source Triumvirate
Cloud foundry Docker Openstack - Leading Open Source Triumvirate
 

Dernier

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Dernier (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Distributed Training and HPO with Kubeflow Operators