Managing Cloud networking costs
for data-intensive applications
by provisioning
dedicated network links
Igor Sfiligoi, Frank Würthwein, Thomas Hutton
University of California San Diego
Michael Hare, David Schultz, Benedikt Riedel, Steve Barnet, Vladimir Brik
University of Wisconsin–Madison
Premise
• Commercial Clouds can provide large amounts of compute capacity
• And Cloud compute costs are acceptable when using spot instances
• Network-intensive applications may however experience
large networking bills
• While ingress is free, egress is a metered commodity
• Dedicated links can provide significant cost savings on egress
• Between 50% and 75%
We present our experience
egress-ing 130 TB in half a day
Doing real science
• Doing CS experiments is fun
• But demonstrating
infrastructure capabilities while
doing real science is even better
• We produced simulation data
for the IceCube experiment
• Using Cloud compute instances
• Storage was located
at UW Madison and UC San Diego
Optical Properties
• Combining all the possible information
• These features are included in simulation
• We are always developing them
Nature never gives us a perfect answer, but we obtained
satisfactory agreement with data!
The need for calibration
• Natural medium
• Hard to calibrate properly
• Dropped a detector into a grey box
• The ice is very clear, but…
• Is it uniform?
• How has construction changed the ice?
• Drastic changes
in reconstructed position
with different ice models
• Too complex for a
parametrized approach
• Using ray-tracing on GPUs
Total:
9M hours
(Jul 2020 – Jul 2021)
https://gracc.opensciencegrid.org/d/000000118/gpu-payload-jobs-summary?orgId=1&var-ReportableVOName=icecube&var-interval=7d
On-prem, but globally distributed.
(weekly)
IceCube is used to distributed computing
Credit: https://pixy.org/1338869/
gridFTP
UW Madison
(mostly)
Adding Cloud resources was
thus relatively trivial
(we presented another Cloud run at PEARC20)
The networking cost issue
• IceCube needed data-heavy simulation
• About 500 MB produced per fp32 TFLOP-hour of compute
• Egress costs start to be comparable to compute costs!
• If one used "standard networking"
• Dedicated network links promised significant reduction in cost
All prices were valid as of December 2020. Average per-job values.
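As a rough cross-check of the data rate quoted above, the following minimal sketch estimates the output volume from the amount of compute used. It assumes decimal units (1 TB = 10^6 MB) and takes the 225 fp32 PFLOP-hours reported for the main run as input; the function name is illustrative only.

```python
# Figures quoted in the slides.
MB_PER_TFLOP_HOUR = 500   # "about 500 MB produced per fp32 TFLOP-hour"
PFLOP_HOURS_USED = 225    # fp32 PFLOP-hours reported for the main run


def estimated_output_tb(pflop_hours, mb_per_tflop_hour=MB_PER_TFLOP_HOUR):
    """Estimate the produced data volume in TB (assumes 1 TB = 1e6 MB)."""
    tflop_hours = pflop_hours * 1000
    return tflop_hours * mb_per_tflop_hour / 1e6


print(estimated_output_tb(PFLOP_HOURS_USED))  # 112.5
```

The estimate of about 112 TB is in the same range as the 130 TB actually produced, which is consistent with "about 500 MB" being an approximate per-job average.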
The need for many dedicated links
• IceCube storage can sink 100 Gbps
• Over 80 Gbps at UW Madison
• Plus over 20 Gbps at UC San Diego
• Internet2 had mostly 2x 10 Gbps links with Cloud providers
• The only bright exception was the California link to Azure at 2x 100 Gbps
• The links are shared, so one can never get the whole link for oneself
• 5 Gbps limit in AWS and GCP
• 10 Gbps limit in Azure
• The link speeds are rigidly defined
• 1, 2, 5, 10 Gbps
• To fill an (almost) empty 10 Gbps link, one needs three links: 5 + 2 + 2
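The link-size arithmetic above can be sketched as a simple greedy selection. This is only an illustration: it assumes that an "(almost) empty" 10 Gbps link has about 9 Gbps of free capacity, and the function name and parameters are hypothetical.

```python
# Rigidly defined dedicated link speeds, largest first (from the slide).
ALLOWED_SPEEDS_GBPS = (10, 5, 2, 1)


def plan_links(available_gbps, per_link_cap_gbps):
    """Greedily pick dedicated link sizes that fit in the available capacity."""
    links = []
    remaining = available_gbps
    for speed in ALLOWED_SPEEDS_GBPS:
        if speed > per_link_cap_gbps:
            continue  # the provider does not allow links this large
        while remaining >= speed:
            links.append(speed)
            remaining -= speed
    return links


# Assumed: about 9 Gbps free on a shared 10 Gbps link, 5 Gbps per-link cap.
print(plan_links(9, 5))  # [5, 2, 2]
```

With the 5 Gbps cap of AWS and GCP, the sketch reproduces the 5 + 2 + 2 combination mentioned on the slide.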
The need for many dedicated links
• IceCube storage can sink 100 Gbps
• Over 80 Gbps at UW Madison
• Plus over 20 Gbps at UC San Diego
• No dedicated links
over 10 Gbps available
• With AWS and GCP limited to 5 Gbps
• We ended up with over 20 links
A multi-team effort
• Final users cannot provision a dedicated link on their own.
The process involves:
• The Cloud (final) user,
• The intermediate network provider, and
• The local on-prem networking team.
• Human interaction plays a significant role
in establishing and tearing down dedicated links.
• Full automation virtually impossible in most circumstances.
On-site preparations
• Dedicated IPs are needed for the Cloud compute resources
• Not always trivial for the local networking team to get an allocation
• We used private IPs, which are easier to come by
(IPv6 may also be an easy(er) option, but we did not consider it)
• Dedicated paths must be established to the Internet2 peering points
• UW chose to provision a set of BGP-based Layer 3 virtual private networks
(L3VPNs) to Internet2 via their regional aggregator, BTAA OmniPop.
• UCSD first provisioned a Layer 2 virtual private network (L2VPN) over their
regional provider, CENIC,
and then layered on top a BGP-based L3VPN with Internet2.
Very different provisioning in the 3 Clouds
• AWS the most complex
• And requires initiation by
on-prem network engineer
• Many steps after initial request
• Create VPC and subnets
• Accept connection request
• Create VPG
• Associate VPG with VPC
• Create DCG
• Create VIF
• Relay back to on-prem the BGP key
• Establish VPC -> VPG routing
• Associate DCG -> VPG
• And don’t forget the Internet routers
• GCP the simplest
• Create VPC and subnets
• Create Cloud Router
• Create Interconnect
• Provide key to on-prem
• Azure not much harder
• Create VN and subnets
• Make sure the VN has Gateway subnet
• Create ExpressRoute (ER)
• Provide key to on-prem
• Create VNG
• Create connection between ER and VNG
• Note: Azure comes with many more options
to choose from
The above steps must be performed by the final user.
Not meant for frequent use
• In summary, provisioning (and tearing down) of
dedicated network connections is hard
• Involves many parties and many steps
• Once established, it works fine
• But the provisioning overhead is definitely non-negligible!
• Only pays off for major endeavours
At 130 TB one looks at a
$5.5k saving, which
makes it worthwhile
IceCube spiky traffic
• IceCube network traffic
very spiky
• All egress happens
right after compute completes
• Having many jobs allows for
smoothing of traffic
• But not if they all start
at the same time!
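The effect of job start times on egress peaks can be illustrated with a toy model. All numbers below (100 jobs, 60 minutes of compute, 5 minutes of upload, one job start per minute) are hypothetical and chosen only to show the principle, not taken from the actual run.

```python
def peak_concurrent_uploads(start_times, compute_minutes, upload_minutes):
    """Return the maximum number of jobs uploading during any one minute."""
    load = {}
    for start in start_times:
        first = start + compute_minutes  # upload begins when compute ends
        for minute in range(first, first + upload_minutes):
            load[minute] = load.get(minute, 0) + 1
    return max(load.values())


N_JOBS = 100
simultaneous = [0] * N_JOBS        # every job starts at minute 0
staggered = list(range(N_JOBS))    # one job starts each minute

print(peak_concurrent_uploads(simultaneous, 60, 5))  # 100
print(peak_concurrent_uploads(staggered, 60, 5))     # 5
```

In this model, spreading the start times reduces the peak number of concurrent uploads from 100 to 5, even though the total data volume is unchanged.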
Test provisioning run – no job startup control
Network congestion
Overprovisioning
(low utilization)
Slow resource provisioning in main run
• Slow resource provisioning spreads out job startup times
• The strategy worked for the main Cloud run
• Ramping up to 80 PFLOPS for about 2 hours
Each color represents resources tied to one network link.
One of the network links.
Good, but not perfect execution
• Using only spot Cloud instances
• Reached region capacity
at different times in
different places
• With 20+ resource groups,
steering was challenging
• Over-provisioned a couple of regions
• Resulting in saturated network links
Each color represents one network link.
Overall a success
• Produced 130 TB of simulation data (54k files)
• Used 225 fp32 PFLOP-hours of GPU compute
• Note: Almost saturated the UW research network link!
Each color represents one network link.
UW Madison
monitoring
The incurred cost
• Cost of the main run:
• $31k total, all included
• Of that, $5.5k was spent on networking
• Network/data transfer cost analysis
• $5.5k / 130TB = $42/TB
• Without dedicated links we would have paid: $83/TB * 130TB = $11k
50% saving
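The cost arithmetic above can be reproduced with a few lines of Python, using only the figures quoted on the slide. The variable names are illustrative.

```python
# Figures quoted on the slide.
NETWORK_COST_USD = 5500     # networking spend of the main run
DATA_TB = 130               # data egressed during the main run
STANDARD_USD_PER_TB = 83    # rate without dedicated links

dedicated_usd_per_tb = NETWORK_COST_USD / DATA_TB
standard_cost_usd = STANDARD_USD_PER_TB * DATA_TB
saving_fraction = 1 - NETWORK_COST_USD / standard_cost_usd

print(round(dedicated_usd_per_tb))    # 42
print(standard_cost_usd)              # 10790
print(round(saving_fraction * 100))   # 49
```

The unrounded result is a saving of about 49%, which the slide reports as 50% after rounding the standard-networking cost to $11k.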
Summary and conclusion
• Cloud egress costs can be substantial for data-intensive applications
• Dedicated links can provide savings between 50% and 75%
• We showed that one can provision 100 Gbps in aggregate bandwidth
through Internet2
• But capacity planning and workload steering can be challenging
• High network provisioning overhead
• Especially hard for spiky workloads, like the IceCube one
• All in all, a success
• And we produced valuable science
Acknowledgements
• This work was partially funded by
the US National Science Foundation (NSF) through grants
OAC-1941481, MPS-1148698, OAC-1841530,
OAC-1826967 and OPP-1600823.

Contenu connexe

Tendances

SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemJayjeetChakraborty
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerAndrew Yongjoon Kong
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsIgor Sfiligoi
 
Characterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the CloudsCharacterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the Cloudsinside-BigData.com
 
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Toby Bloom
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAndrew Yongjoon Kong
 
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Amazon Web Services
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackAndrew Yongjoon Kong
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3Tim Bell
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERNArne Wiebalck
 
Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11ShapeBlue
 
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...OpenNebula Project
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudArne Wiebalck
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionArne Wiebalck
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Frank Wuerthwein
 
Azure Functions - Get rid of your servers, use functions!
Azure Functions - Get rid of your servers, use functions!Azure Functions - Get rid of your servers, use functions!
Azure Functions - Get rid of your servers, use functions!QAware GmbH
 
Paul Angus - CloudStack Container Service
Paul  Angus - CloudStack Container ServicePaul  Angus - CloudStack Container Service
Paul Angus - CloudStack Container ServiceShapeBlue
 
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSSave 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSMayaData Inc
 

Tendances (20)

SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
openstack, devops and people
openstack, devops and peopleopenstack, devops and people
openstack, devops and people
 
Embracing clouds
Embracing cloudsEmbracing clouds
Embracing clouds
 
Characterizing network paths in and out of the Clouds
Characterizing network paths in and out of the CloudsCharacterizing network paths in and out of the Clouds
Characterizing network paths in and out of the Clouds
 
Characterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the CloudsCharacterizing Network Paths in and out of the Clouds
Characterizing Network Paths in and out of the Clouds
 
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
 
Automating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestratorAutomating auto-scaled load balancer based on linux and vm orchestrator
Automating auto-scaled load balancer based on linux and vm orchestrator
 
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
Finding New Sub-Atomic Particles on the AWS Cloud (BDT402) | AWS re:Invent 2013
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERN
 
Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11
 
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...
OpenNebulaConf2015 1.03 Private, Public, Hybrid: The Real Economics of Open S...
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private Cloud
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in Production
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
Azure Functions - Get rid of your servers, use functions!
Azure Functions - Get rid of your servers, use functions!Azure Functions - Get rid of your servers, use functions!
Azure Functions - Get rid of your servers, use functions!
 
Paul Angus - CloudStack Container Service
Paul  Angus - CloudStack Container ServicePaul  Angus - CloudStack Container Service
Paul Angus - CloudStack Container Service
 
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBSSave 60% of Kubernetes storage costs on AWS & others with OpenEBS
Save 60% of Kubernetes storage costs on AWS & others with OpenEBS
 

Similaire à Managing Cloud networking costs for data-intensive applications by provisioning dedicated network links

Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsIgor Sfiligoi
 
Highly Available Docker Networking With BGP
Highly Available Docker Networking With BGPHighly Available Docker Networking With BGP
Highly Available Docker Networking With BGPOpenDNS
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenParticular Software
 
IDERA Slides: Managing the Transition to Hybrid Cloud
IDERA Slides: Managing the Transition to Hybrid CloudIDERA Slides: Managing the Transition to Hybrid Cloud
IDERA Slides: Managing the Transition to Hybrid CloudDATAVERSITY
 
Flood modelling on the Cloud
Flood modelling on the CloudFlood modelling on the Cloud
Flood modelling on the Cloudasm100
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondStuart (Pid) Williams
 
ITN3052_04_Switched_Networks.pdf
ITN3052_04_Switched_Networks.pdfITN3052_04_Switched_Networks.pdf
ITN3052_04_Switched_Networks.pdfssuser2d7235
 
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...Amazon Web Services
 
Monitoring a virtual network infrastructure - An IaaS perspective
Monitoring a virtual network infrastructure - An IaaS perspectiveMonitoring a virtual network infrastructure - An IaaS perspective
Monitoring a virtual network infrastructure - An IaaS perspectiveAugusto Ciuffoletti
 
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps_Fest
 
Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Kieran Kunhya
 
Multi cloud network leveraging sd-wan reference architecture
Multi cloud network leveraging sd-wan reference architectureMulti cloud network leveraging sd-wan reference architecture
Multi cloud network leveraging sd-wan reference architectureMatsuo Sawahashi
 
Ntc 362 effective communication uopstudy.com
Ntc 362 effective communication   uopstudy.comNtc 362 effective communication   uopstudy.com
Ntc 362 effective communication uopstudy.comULLPTT
 
Ntc 362 forecasting and strategic planning -uopstudy.com
Ntc 362 forecasting and strategic planning -uopstudy.comNtc 362 forecasting and strategic planning -uopstudy.com
Ntc 362 forecasting and strategic planning -uopstudy.comULLPTT
 
eHarmony in the Cloud
eHarmony in the CloudeHarmony in the Cloud
eHarmony in the CloudCraig Dickson
 
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloud
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloudLAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloud
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloudJisc
 
Cloud and Grid Integration OW2 Conference Nov10
Cloud and Grid Integration OW2 Conference Nov10Cloud and Grid Integration OW2 Conference Nov10
Cloud and Grid Integration OW2 Conference Nov10OW2
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Igor Sfiligoi
 
Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.Ankita Mahajan
 

Similaire à Managing Cloud networking costs for data-intensive applications by provisioning dedicated network links (20)

Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 
Highly Available Docker Networking With BGP
Highly Available Docker Networking With BGPHighly Available Docker Networking With BGP
Highly Available Docker Networking With BGP
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
IDERA Slides: Managing the Transition to Hybrid Cloud
IDERA Slides: Managing the Transition to Hybrid CloudIDERA Slides: Managing the Transition to Hybrid Cloud
IDERA Slides: Managing the Transition to Hybrid Cloud
 
Flood modelling on the Cloud
Flood modelling on the CloudFlood modelling on the Cloud
Flood modelling on the Cloud
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
 
ITN3052_04_Switched_Networks.pdf
ITN3052_04_Switched_Networks.pdfITN3052_04_Switched_Networks.pdf
ITN3052_04_Switched_Networks.pdf
 
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...
AWS re:Invent 2016: Advanced Tips for Amazon EC2 Networking and High Availabi...
 
Monitoring a virtual network infrastructure - An IaaS perspective
Monitoring a virtual network infrastructure - An IaaS perspectiveMonitoring a virtual network infrastructure - An IaaS perspective
Monitoring a virtual network infrastructure - An IaaS perspective
 
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
DevOps Fest 2019. Stanislav Kolenkin. Сonnecting pool Kubernetes clusters: Fe...
 
Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...Moving to software-based production workflows and containerisation of media a...
Moving to software-based production workflows and containerisation of media a...
 
Multi cloud network leveraging sd-wan reference architecture
Multi cloud network leveraging sd-wan reference architectureMulti cloud network leveraging sd-wan reference architecture
Multi cloud network leveraging sd-wan reference architecture
 
Ntc 362 effective communication uopstudy.com
Ntc 362 effective communication   uopstudy.comNtc 362 effective communication   uopstudy.com
Ntc 362 effective communication uopstudy.com
 
Ntc 362 forecasting and strategic planning -uopstudy.com
Ntc 362 forecasting and strategic planning -uopstudy.comNtc 362 forecasting and strategic planning -uopstudy.com
Ntc 362 forecasting and strategic planning -uopstudy.com
 
eHarmony in the Cloud
eHarmony in the CloudeHarmony in the Cloud
eHarmony in the Cloud
 
Network
NetworkNetwork
Network
 
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloud
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloudLAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloud
LAN, WAN, SAN upgrades: hyperconverged vs traditional vs cloud
 
Cloud and Grid Integration OW2 Conference Nov10
Cloud and Grid Integration OW2 Conference Nov10Cloud and Grid Integration OW2 Conference Nov10
Cloud and Grid Integration OW2 Conference Nov10
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...
 
Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.Virtualization in 4-4 1-4 Data Center Network.
Virtualization in 4-4 1-4 Data Center Network.
 

Plus de Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 
Serving HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondorServing HTC Users in Kubernetes by Leveraging HTCondor
Serving HTC Users in Kubernetes by Leveraging HTCondorIgor Sfiligoi
 
GRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOGRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOIgor Sfiligoi
 
The Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMAThe Open Science Grid and how it relates to PRAGMA
The Open Science Grid and how it relates to PRAGMAIgor Sfiligoi
 
Using CVMFS on a distributed Kubernetes cluster - The PRP Experience
Using CVMFS on a distributed Kubernetes cluster - The PRP ExperienceUsing CVMFS on a distributed Kubernetes cluster - The PRP Experience
Using CVMFS on a distributed Kubernetes cluster - The PRP ExperienceIgor Sfiligoi
 
Kubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesKubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesIgor Sfiligoi
 


Managing Cloud networking costs for data-intensive applications by provisioning dedicated network links

  • 1. Managing Cloud networking costs for data-intensive applications by provisioning dedicated network links Igor Sfiligoi, Frank Würthwein, Thomas Hutton University of California San Diego Michael Hare, David Schultz, Benedikt Riedel, Steve Barnet, Vladimir Brik University of Wisconsin–Madison
  • 2. Premise • Commercial Clouds can provide large amounts of compute capacity • And Cloud compute costs are acceptable when using spot instances • Network-intensive applications may, however, incur large networking bills • While ingress is free, egress is a metered commodity • Dedicated links can provide significant cost savings on egress • Between 50% and 75% We present our experience egressing 130 TB in half a day
  • 3. Doing real science • Doing CS experiments is fun • But demonstrating infrastructure capabilities while doing real science is even better • We produced simulation data for the IceCube experiment • Using Cloud compute instances • Storage was located at UW Madison and UC San Diego
  • 4. Optical Properties • Combining all the possible information • These features are included in simulation • We are always developing them Nature never gives us a perfect answer, but we obtained a satisfactory agreement with data! The need for calibration • Natural medium • Hard to calibrate properly • Dropped a detector into a grey box • The ice is very clear, but… • Is it uniform? • How has construction changed the ice? • Drastic changes in reconstructed position with different ice models • Too complex for a parametrized approach • Using ray-tracing on GPUs
  • 5. IceCube is used to distributed computing • On-prem, but globally distributed • Total: 9M hours (Jul 2020 – Jul 2021), plotted weekly: https://gracc.opensciencegrid.org/d/000000118/gpu-payload-jobs-summary?orgId=1&var-ReportableVOName=icecube&var-interval=7d • gridFTP, UW Madison (mostly) • Credit: https://pixy.org/1338869/
  • 6. IceCube is used to distributed computing • On-prem, but globally distributed • Total: 9M hours (Jul 2020 – Jul 2021), plotted weekly: https://gracc.opensciencegrid.org/d/000000118/gpu-payload-jobs-summary?orgId=1&var-ReportableVOName=icecube&var-interval=7d • gridFTP, UW Madison (mostly) • Credit: https://pixy.org/1338869/ • Adding Cloud resources is thus relatively trivial (we presented another Cloud run at PEARC20)
  • 7. The networking cost issue • IceCube needed data-heavy simulation • About 500 MB produced per fp32 TFLOP-hour of compute • Egress costs start to be comparable to compute costs! • If one used "standard networking" • Dedicated network links promised a significant reduction in cost All prices were valid as of December 2020. Average per-job values.
  • 8. The need for many dedicated links • IceCube storage can sink 100 Gbps • Over 80 Gbps at UW Madison • Plus over 20 Gbps at UC San Diego • Internet2 had mostly 2x 10 Gbps links with Cloud providers • The only bright exception was the California link to Azure at 2x 100 Gbps • The links are shared, so one can never get a whole link to oneself • 5 Gbps limit in AWS and GCP • 10 Gbps limit in Azure • The link speeds are rigidly defined • 1, 2, 5, 10 Gbps • To fill an (almost) empty 10 Gbps link, one needs three links: 5 + 2 + 2
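The "5 + 2 + 2" arithmetic above can be sketched as a greedy split over the fixed link speeds. This is only an illustration: the function name `split_into_links` and the 9 Gbps target (standing in for an "almost empty" 10 Gbps link) are assumptions, not something the talk specifies.

```python
def split_into_links(target_gbps, per_link_cap_gbps, sizes=(10, 5, 2, 1)):
    """Greedily pick link speeds, largest first, without exceeding the target.

    sizes are the rigid link speeds listed on the slide; per_link_cap_gbps is
    the per-link limit of the Cloud provider (5 for AWS and GCP, 10 for Azure).
    """
    allowed = sorted((s for s in sizes if s <= per_link_cap_gbps), reverse=True)
    links = []
    remaining = target_gbps
    for size in allowed:
        while remaining >= size:
            links.append(size)
            remaining -= size
    return links


# Hypothetical target: 9 Gbps of headroom on a shared 10 Gbps link,
# with the 5 Gbps per-link limit of AWS and GCP.
print(split_into_links(9, per_link_cap_gbps=5))
```

With a 5 Gbps cap, every additional 5 Gbps of target bandwidth needs at least one more link, which is why the link count grows quickly toward an aggregate of 100 Gbps.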
  • 9. The need for many dedicated links • IceCube storage can sink 100 Gbps • Over 80 Gbps at UW Madison • Plus over 20 Gbps at UC San Diego • No dedicated links over 10 Gbps available • With AWS and GCP limited to 5 Gbps • We ended up with over 20 links
  • 10. A multi-team effort • Final users cannot provision a dedicated link on their own. The process involves: • The Cloud (final) user, • The intermediate network provider, and • The local on-prem networking team. • Human interaction plays a significant role in establishing and tearing down dedicated links. • Full automation is virtually impossible in most circumstances.
  • 11. On-site preparations • Dedicated IPs are needed for the Cloud compute resources • Not always trivial to get an allocation from the local networking team • We used private IPs, which are easier to come by (IPv6 may also be an easy(er) option, but we did not consider it) • Dedicated paths must be established to the Internet2 peering points • UW chose to provision a set of BGP-based Layer 3 virtual private networks (L3VPNs) to Internet2 via their regional aggregator, BTAA OmniPop. • UCSD first provisioned a Layer 2 virtual private network (L2VPN) over their regional provider, CENIC, and then layered a BGP-based L3VPN with Internet2 on top.
  • 12. Very different provisioning in the 3 Clouds • AWS the most complex • And requires initiation by an on-prem network engineer • Many steps after the initial request • Create VPC and subnets • Accept connection request • Create VPG • Associate VPG with VPC • Create DCG • Create VIF • Relay the BGP key back to on-prem • Establish VPC -> VPG routing • Associate DCG -> VPG • And don’t forget the Internet routers • GCP the simplest • Create VPC and subnets • Create Cloud Router • Create Interconnect • Provide key to on-prem • Azure not much harder • Create VN and subnets • Make sure the VN has a Gateway subnet • Create ExpressRoute (ER) • Provide key to on-prem • Create VNG • Create connection between ER and VNG • Note: Azure comes with many more options to choose from The above steps must be performed by the final user.
  • 13. Not meant for frequent use • In summary, provisioning (and tearing down) of dedicated network connections is hard • Involves many parties and many steps • Once established, it works fine • But the provisioning overhead is definitely non-negligible! • Only pays off for major endeavours At 130 TB one looks at a $5.5k saving, which makes it worthwhile
  • 14. IceCube spiky traffic • IceCube network traffic is very spiky • All egress happens right after compute completes • Having many jobs allows for smoothing of traffic • But not if they all start at the same time! Figure: test provisioning run with no job startup control, showing network congestion and overprovisioning (low utilization).
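A toy model can make the startup-time effect concrete. The sketch below is not the authors' tooling: it assumes every job computes for the same time and then uploads for the same time, and all numbers (100 jobs, 1 hour of compute, 60 s of upload, 30 s stagger) are made up for illustration.

```python
def peak_concurrent_transfers(start_times, compute_s, transfer_s):
    """Return the largest number of jobs uploading at the same instant.

    Each job starts at start_times[i], computes for compute_s seconds and
    then egresses its output for transfer_s seconds.
    """
    events = []
    for start in start_times:
        upload_begin = start + compute_s
        events.append((upload_begin, 1))                 # upload starts
        events.append((upload_begin + transfer_s, -1))   # upload ends
    # Tuples sort by time first; at equal times an end (-1) precedes a start (+1).
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


n_jobs = 100
simultaneous = [0] * n_jobs                  # no job startup control
staggered = [30 * i for i in range(n_jobs)]  # one job every 30 s (hypothetical)

print(peak_concurrent_transfers(simultaneous, compute_s=3600, transfer_s=60))
print(peak_concurrent_transfers(staggered, compute_s=3600, transfer_s=60))
```

In this model the simultaneous start puts every upload on the link at once, while the staggered start keeps only a couple of uploads in flight at any moment.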
  • 15. Slow resource provisioning in main run • Slow resource provisioning spreads out job startup times • The strategy worked for the main Cloud run • Ramping up to 80 PFLOPS for about 2 hours Each color represents resources tied to one network link. One of the network links.
  • 16. Good, but not perfect execution • Using only spot Cloud instances • Reached region capacity at different times in different places • With 20+ resource groups, steering was challenging • Over-provisioned a couple of regions • Resulting in saturated network links Each color represents one network link.
  • 17. Overall a success • Produced 130 TB of simulation data (54k files) • Used 225 fp32 PFLOP-hours of GPU compute • Note: Almost saturated the UW research network link! Each color represents one network link. UW Madison monitoring
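As a rough consistency check, the totals above can be compared with the "about 500 MB per fp32 TFLOP-hour" estimate from the networking cost slide. The sketch assumes decimal units (1 TB = 10^6 MB, 1 PFLOP-hour = 1000 TFLOP-hours) and reads "54k files" as 54,000; the variable names are illustrative.

```python
# Totals quoted on the slide.
data_tb = 130
compute_pflop_hours = 225
n_files = 54_000

# Assumption: decimal units, 1 TB = 1e6 MB = 1e3 GB.
mb_per_tflop_hour = (data_tb * 1e6) / (compute_pflop_hours * 1e3)
gb_per_file = (data_tb * 1e3) / n_files

print(f"{mb_per_tflop_hour:.0f} MB per fp32 TFLOP-hour")
print(f"{gb_per_file:.1f} GB per file on average")
```

The result lands in the same range as the "about 500 MB" per-job average, so the totals and the planning estimate are broadly consistent.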
  • 18. The incurred cost • Cost of the main run: • $31k total, all included • Of that, $5.5k was spent on networking • Network/data transfer cost analysis • $5.5k / 130 TB = $42/TB • Without dedicated links we would have paid: $83/TB * 130 TB = $11k • A 50% saving
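The cost arithmetic on this slide can be reproduced in a few lines. The input figures are the ones quoted above; the variable names are illustrative, and the slide rounds the results to $42/TB, $11k and 50%.

```python
# Figures quoted on the cost slide.
networking_cost_usd = 5_500
data_egressed_tb = 130
standard_price_usd_per_tb = 83

# Effective price paid per TB over the dedicated links.
dedicated_usd_per_tb = networking_cost_usd / data_egressed_tb
# What the same egress would have cost at the standard networking price.
standard_cost_usd = standard_price_usd_per_tb * data_egressed_tb
saving_fraction = 1 - networking_cost_usd / standard_cost_usd

print(f"dedicated links: ${dedicated_usd_per_tb:.0f}/TB")
print(f"standard networking: ${standard_cost_usd:,}")
print(f"saving: {saving_fraction:.0%}")
```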
  • 19. Summary and conclusion • Cloud egress costs can be substantial for data-intensive applications • Dedicated links can provide savings between 50% and 75% • We showed that one can provision 100 Gbps in aggregate bandwidth through Internet2 • But capacity planning and workload steering can be challenging • High network provisioning overhead • Especially hard for spiky workloads, like the IceCube one • All in all, a success • And we produced valuable science
  • 20. Acknowledgements • This work was partially funded by the US National Science Foundation (NSF) through grants OAC-1941481, MPS-1148698, OAC-1841530, OAC-1826967 and OPP-1600823.