To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters
Teresa Kaltz, PhD
Research Computing
December 3, 2009


Interconnect Types on Top 500




On the latest TOP500 list, there is exactly one 10 GigE deployment,
compared to 181 InfiniBand-connected systems.
Michael Feldman, HPCwire Editor


Top 500 Interconnects 2002-2009

[Chart: TOP500 systems by interconnect type (Ethernet, Infiniband, Other) for each year from 2002 to 2009; vertical axis runs from 0 to 500]
What is Infiniband Anyway?

•  Open, standard interconnect architecture



  –  http://www.infinibandta.org/index.php
  –  Complete specification available for download
•  Complete "ecosystem"
  –  Both hardware and software
•  High bandwidth, low latency, switch-based
•  Allows remote direct memory access (RDMA)
Why Remote DMA?

•  TCP offload engines reduce overhead by
   offloading protocol processing such as checksums
•  2 copies on receive: NIC → kernel → user
•  Solution is Remote DMA (RDMA)
        Source of overhead            Percent overhead
        Per byte
          User-system copy              16.5 %
          TCP checksum                  15.2 %
          Network-memory copy           31.8 %
        Per packet
          Driver                         8.2 %
          TCP+IP+ARP protocols           8.2 %
          OS overhead                   19.8 %
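
The table above can be totalled to see how much of the overhead comes from the two receive-side copies. The following is a minimal sketch using only the percentages listed on the slide; the dictionary and function names are illustrative and not part of the original material.

```python
# Overhead shares copied from the table above (percent of total overhead).
PER_BYTE = {
    "user-system copy": 16.5,
    "TCP checksum": 15.2,
    "network-memory copy": 31.8,
}
PER_PACKET = {
    "driver": 8.2,
    "TCP+IP+ARP protocols": 8.2,
    "OS overhead": 19.8,
}


def total(shares):
    """Sum one group of overhead shares, rounded to one decimal place."""
    return round(sum(shares.values()), 1)


def copy_share():
    """Share attributable to the two data copies on the receive path."""
    copies = PER_BYTE["user-system copy"] + PER_BYTE["network-memory copy"]
    return round(copies, 1)


if __name__ == "__main__":
    print("per-byte total   (%):", total(PER_BYTE))
    print("per-packet total (%):", total(PER_PACKET))
    print("two copies alone (%):", copy_share())
```

On these figures the two copies account for close to half of the listed overhead, which is the part a checksum offload engine leaves untouched.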


What is RDMA?




Infiniband Signalling Rate

•  Each link is a point-to-point serial connection
•  Links are usually aggregated into groups of four (4X)
•  Unidirectional effective bandwidth
   –  SDR 4X: 1 GB/s
   –  DDR 4X: 2 GB/s
   –  QDR 4X: 4 GB/s
•  Bidirectional bandwidth twice unidirectional
•  Many factors impact measured performance!
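
The effective bandwidth figures above follow from the per-lane signalling rate and the line coding. The sketch below assumes the standard per-lane rates (2.5, 5 and 10 Gb/s for SDR, DDR and QDR) and 8b/10b coding; those values are not stated on the slide, and the function name is illustrative.

```python
# Per-lane signalling rates in Gb/s (standard InfiniBand values, assumed here).
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def effective_gbytes_per_s(rate, lanes=4):
    """Unidirectional data bandwidth in GB/s for a link of `lanes` lanes.

    8b/10b coding carries 8 data bits in every 10 signalled bits, and
    8 bits make one byte.
    """
    signal_gbps = LANE_GBPS[rate] * lanes
    data_gbps = signal_gbps * 8 / 10
    return data_gbps / 8


if __name__ == "__main__":
    for rate in ("SDR", "DDR", "QDR"):
        print(rate, "4X:", effective_gbytes_per_s(rate), "GB/s")
```

These are link-level ceilings; as the last bullet notes, measured performance depends on many other factors.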


Infiniband Roadmap from IBTA




DDR 4X Unidirectional Bandwidth


•  Achieved bandwidth limited by PCIe 8x Gen 1
•  Current platforms mostly ship with PCIe Gen 2
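
As a rough check on the host-side limit, the sketch below computes raw one-direction PCIe bandwidth for an 8-lane slot. The per-lane rates (2.5 GT/s for Gen 1, 5 GT/s for Gen 2) and the 8b/10b coding are standard PCIe figures rather than values from the slide, and the function name is illustrative. PCIe packet overhead is ignored, so achievable throughput is lower than these numbers.

```python
# Per-lane transfer rates in GT/s for PCIe Gen 1 and Gen 2 (assumed standard
# values; both generations use 8b/10b coding).
PCIE_GT_PER_S = {1: 2.5, 2: 5.0}


def pcie_raw_gbytes_per_s(gen, lanes=8):
    """Raw one-direction PCIe bandwidth in GB/s, before packet overhead."""
    signal_gbps = PCIE_GT_PER_S[gen] * lanes
    data_gbps = signal_gbps * 8 / 10  # 8b/10b line coding
    return data_gbps / 8              # bits to bytes


if __name__ == "__main__":
    for gen in (1, 2):
        print("PCIe Gen", gen, "x8:", pcie_raw_gbytes_per_s(gen), "GB/s raw")
```

Gen 1 with 8 lanes comes out at 2 GB/s raw, the same as the DDR 4X link rate, so any PCIe packet overhead pushes achieved bandwidth below what the link could carry.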




QDR 4X Unidirectional Bandwidth



•  Still seem to have bottleneck at host if using QDR

http://mvapich.cse.ohio-state.edu/performance/interNode.shtml
Latency Measurements: IB vs GbE




Infiniband Latency Measurements




Infiniband Silicon Vendors




•  Both switch and HCA parts
  –  Mellanox: InfiniScale, InfiniHost
  –  QLogic: TrueScale, InfiniPath
•  Many OEMs use their silicon
•  Large switches
  –  Parts arranged in fat tree topology

Infiniband Switch Hardware

•  24 port silicon product line at right
•  Scales to thousands of ports
•  Host-based and hardware-based subnet management
•  Current generation (QDR) based on 36 port silicon
•  Up to 864 ports in single switch!!

[Image: switch product line with 24, 48, 96, 144 and 288 ports]
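
The 288-port figure is consistent with a two-level fat tree built from 24-port parts. The sketch below assumes a non-blocking design in which every leaf chip splits its ports evenly between hosts and spine chips; the function and field names are illustrative. Real chassis designs may differ, and this model does not by itself account for the 864-port figure.

```python
def two_level_fat_tree(radix):
    """Size a non-blocking two-level fat tree built from `radix`-port chips.

    Each leaf chip uses half its ports for hosts and half for uplinks.
    Each spine chip connects once to every leaf, so there can be `radix`
    leaves and `radix // 2` spines.
    """
    if radix % 2:
        raise ValueError("radix must be even")
    down_ports = radix // 2
    leaf_chips = radix
    spine_chips = radix // 2
    return {
        "host_ports": leaf_chips * down_ports,
        "leaf_chips": leaf_chips,
        "spine_chips": spine_chips,
    }


if __name__ == "__main__":
    for radix in (24, 36):
        print(radix, "port silicon:", two_level_fat_tree(radix))
```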
Infiniband Topology

•  Infiniband uses credit-based flow control
   –  Need to avoid loops in topology that may produce
      deadlock

•  Common topology for small and medium size
   networks is tree (Clos)
•  Mesh/torus more cost effective
   for large clusters (>2500 hosts)
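
To make the deadlock concern concrete, here is a deliberately simplified model of credit-based flow control. A sender may transmit only while it holds a credit for a free receive buffer, so nothing is dropped, but a loop of full buffers can stall forever. The class and function names are hypothetical and the model ignores virtual lanes and everything else a real fabric does.

```python
from collections import deque


class CreditLink:
    """One direction of a link with a fixed number of receive buffers."""

    def __init__(self, buffers):
        self.credits = buffers      # sender's count of free receive buffers
        self.rx_queue = deque()     # packets waiting at the receiver

    def try_send(self, packet):
        """Send only if a credit is available; otherwise the sender waits."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def drain_one(self):
        """Receiver forwards one packet, freeing a buffer and a credit."""
        if not self.rx_queue:
            return None
        packet = self.rx_queue.popleft()
        self.credits += 1
        return packet


def ring_can_progress(links):
    """In a ring where every queued packet must move to the next link,
    report whether any packet can move."""
    n = len(links)
    return any(
        link.rx_queue and links[(i + 1) % n].credits > 0
        for i, link in enumerate(links)
    )


if __name__ == "__main__":
    ring = [CreditLink(buffers=2) for _ in range(3)]
    for link in ring:
        while link.try_send("pkt"):
            pass                    # fill every buffer in the loop
    print("loop full, can progress:", ring_can_progress(ring))
    ring[0].drain_one()             # one packet leaves the loop
    print("after one exit, can progress:", ring_can_progress(ring))
```

In a loop-free tree the packet at the head of a queue can always eventually leave toward its destination, which is why the topology is kept free of such cycles.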

Infiniband Routing

•  Infiniband is statically routed
•  Subnet management software discovers fabric
   and generates set of routing tables
  –  Most subnet managers support multiple routing
     algorithms
•  Tables updated with changes in topology only
•  Often cannot achieve theoretical bisection
   bandwidth with static routing
•  QDR silicon introduces adaptive routing
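
A small illustration of why static routing can fall short of theoretical bisection bandwidth. The sketch assumes a leaf switch with four uplinks and a hypothetical static rule that picks the uplink from the destination number modulo the uplink count. Real subnet managers offer other algorithms; the rule and the names here are assumptions chosen only to show the effect.

```python
from collections import Counter


def uplink_for(dest, n_uplinks):
    """Hypothetical static routing rule: spread destinations by modulo."""
    return dest % n_uplinks


def worst_uplink_load(dests, n_uplinks):
    """Largest number of simultaneous flows sharing a single uplink."""
    load = Counter(uplink_for(d, n_uplinks) for d in dests)
    return max(load.values())


if __name__ == "__main__":
    n_uplinks = 4
    friendly = [4, 5, 6, 7]       # one flow per uplink
    unlucky = [0, 4, 8, 12]       # every flow lands on uplink 0
    print("friendly pattern, worst load:", worst_uplink_load(friendly, n_uplinks))
    print("unlucky pattern, worst load: ", worst_uplink_load(unlucky, n_uplinks))
```

Because the table is fixed until the topology changes, the unlucky pattern keeps four flows on one uplink while three uplinks sit idle. Adaptive routing is aimed at exactly this case.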

HPCC Random Ring Benchmark

[Chart: average bandwidth (MB/s, axis from 0 to 1600) versus number of enclosures for four routing configurations, "Routing 1" through "Routing 4"]
Infiniband Specification for Software

•  IB specification does not define API
•  Actions are known as "verbs"
   –  Services provided to upper layer protocols
   –  Send verb, receive verb, etc.
•  Community has standardized around open
   source distribution called OFED to provide verbs
•  Some Infiniband software is also available from
   vendors
   –  Subnet management

Application Support of Infiniband

•  All MPI implementations support native IB
   –  Open MPI, MVAPICH, Intel MPI
•  Existing socket applications
   –  IP over IB
   –  Sockets direct protocol (SDP)
      •  Does NOT require re-link of application
•  Oracle uses RDS (reliable datagram sockets)
   –  First available in Oracle 10g R2
•  Developer can program to "verbs" layer

Infiniband Software Layers




OFED Software

•  OpenFabrics Enterprise Distribution (OFED) software
   from the OpenFabrics Alliance
   –  http://www.openfabrics.org/
•  Contains everything needed to run Infiniband
   –  HCA drivers
   –  verbs implementation
   –  subnet management
   –  diagnostic tools
•  Versions qualified together

OpenFabrics Software Components




"High Performance" Ethernet

•  1 GbE cheap and ubiquitous
  –  hardware acceleration
  –  multiple multiport NICs
  –  supported in kernel
•  10 GbE still used primarily as uplinks from edge
   switches and as backbone
•  Some vendors providing 10 GbE to server
  –  low cost NIC on motherboard
  –  HCAs with performance proportional to cost

RDMA over Ethernet

•  NIC capable of RDMA is called RNIC
•  RDMA is primary method of reducing latency on
   host side
•  Multiple vendors have RNICs
  –  Mainstream: Broadcom, Intel, etc.
  –  Boutique: Chelsio, Mellanox, etc.
•  New Ethernet standards
  –  "Data Center Bridging"; "Converged Enhanced
     Ethernet"; "Data Center Ethernet"; etc.

What is iWarp?

•  RDMA Consortium (RDMAC) standardized some
   protocols which are now part of the IETF Remote
   Direct Data Placement (RDDP) working group
•  http://www.rdmaconsortium.org/home
•  Also defined SRP, iSER in addition to verbs
•  iWARP supported in OFED
•  Most specification work complete in ~2003



RDMA over Ethernet?

The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet),
is a working name.

You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or
any other of a host of equally obscure names.


Tom Talpey, Microsoft Corporation
Paul Grun, System Fabric Works
August 2009




The Future: InfiniFibreNet

•  Vendors moving towards "converged fabrics"
•  Using same "fabric" for both networking and
   storage
•  Storage protocols and IB over Ethernet
•  Storage protocols over Infiniband
  –  NFS over RDMA, Lustre
•  Gateway switches and converged adapters
  –  Various combinations of Ethernet, IB and FC


Any Questions?




      THANK YOU!

(And no mention of The Cloud)





08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

To Infiniband and Beyond

  • 1. To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters
      Teresa Kaltz, PhD, Research Computing, December 3, 2009
  • 2. Interconnect Types on Top 500
      "On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems."
      Michael Feldman, HPCwire Editor
  • 3. Top 500 Interconnects 2002-2009
      [Chart: Top 500 systems by interconnect type (Ethernet, Infiniband, Other), 2002 to 2009; vertical axis 0 to 500]
  • 4. What is Infiniband Anyway?
      •  Open, standard interconnect architecture
          –  http://www.infinibandta.org/index.php
          –  Complete specification available for download
      •  Complete "ecosystem"
          –  Both hardware and software
      •  High bandwidth, low latency, switch-based
      •  Allows remote direct memory access (RDMA)
  • 5. Why Remote DMA?
      •  TCP offload engines reduce overhead via offloading protocol processing like checksum
      •  2 copies on receive: NIC → kernel → user
      •  Solution is Remote DMA (RDMA)

      Per Byte                  Percent Overhead
      User-system copy          16.5 %
      TCP Checksum              15.2 %
      Network-memory copy       31.8 %
      Per Packet
      Driver                    8.2 %
      TCP+IP+ARP protocols      8.2 %
      OS overhead               19.8 %
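The overhead table on the "Why Remote DMA?" slide can be totalled to see how much of the cost comes from data copies. The short Python sketch below only adds up the percentages quoted on the slide; the variable names are illustrative, and how much of each row RDMA actually removes depends on the implementation.

```python
# Percent-of-overhead figures copied from the slide's table.
per_byte = {
    "User-system copy": 16.5,
    "TCP Checksum": 15.2,
    "Network-memory copy": 31.8,
}
per_packet = {
    "Driver": 8.2,
    "TCP+IP+ARP protocols": 8.2,
    "OS overhead": 19.8,
}

per_byte_total = sum(per_byte.values())
per_packet_total = sum(per_packet.values())
# The two data copies are the rows that direct placement into user memory targets.
copy_total = per_byte["User-system copy"] + per_byte["Network-memory copy"]

print(f"per-byte costs:   {per_byte_total:.1f} %")
print(f"per-packet costs: {per_packet_total:.1f} %")
print(f"data copies only: {copy_total:.1f} %")
```

On these figures the two copies alone account for just under half of the measured overhead, which is the motivation the slide gives for RDMA.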
  • 7. Infiniband Signalling Rate
      •  Each link is a point to point serial connection
      •  Usually aggregated into groups of four
      •  Unidirectional effective bandwidth
          –  SDR 4X: 1 GB/s
          –  DDR 4X: 2 GB/s
          –  QDR 4X: 4 GB/s
      •  Bidirectional bandwidth twice unidirectional
      •  Many factors impact measured performance!
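The effective bandwidths on the signalling-rate slide follow from the per-lane signalling rate and the link encoding. The sketch below assumes the standard per-lane rates (2.5, 5 and 10 Gb/s for SDR, DDR and QDR) and 8b/10b encoding, neither of which is stated on the slide; the function name is illustrative.

```python
# Per-lane signalling rates in Gb/s (assumed, not taken from the slide).
LANE_SIGNAL_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}
# 8b/10b encoding carries 8 data bits in every 10 signalled bits.
ENCODING_EFFICIENCY = 8 / 10


def effective_gbytes_per_s(rate, lanes=4):
    """Unidirectional data bandwidth in GB/s for a link of `lanes` lanes."""
    signal_gbps = LANE_SIGNAL_GBPS[rate] * lanes
    data_gbps = signal_gbps * ENCODING_EFFICIENCY
    return data_gbps / 8  # bits to bytes


for rate in ("SDR", "DDR", "QDR"):
    one_way = effective_gbytes_per_s(rate)
    print(f"{rate} 4X: {one_way:.0f} GB/s one way, {2 * one_way:.0f} GB/s bidirectional")
```

These are link-level ceilings; as the slide notes, measured performance depends on many other factors.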
  • 9. DDR 4X Unidirectional Bandwidth
      •  Achieved bandwidth limited by PCIe 8x Gen 1
      •  Current platforms mostly ship with PCIe Gen 2
  • 10. QDR 4X Unidirectional Bandwidth
      •  Still seem to have bottleneck at host if using QDR
      •  http://mvapich.cse.ohio-state.edu/performance/interNode.shtml
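One plausible way to read the two bandwidth slides is to compare the host bus ceiling with the link rate. The sketch below assumes PCIe Gen 1 and Gen 2 run at 2.5 and 5.0 GT/s per lane with 8b/10b encoding, and takes the 2 GB/s (DDR 4X) and 4 GB/s (QDR 4X) figures from the signalling-rate slide. It ignores PCIe packet overhead, so real headroom is smaller than shown, and the slides themselves do not say what causes the QDR bottleneck.

```python
# Per-lane transfer rates in GT/s for PCIe generations 1 and 2 (assumed).
PCIE_GT_PER_S = {1: 2.5, 2: 5.0}
# Unidirectional 4X link rates in GB/s, as quoted on the signalling-rate slide.
IB_4X_GBYTES_PER_S = {"DDR": 2.0, "QDR": 4.0}


def pcie_raw_gbytes_per_s(gen, lanes=8):
    """Raw one-way PCIe bandwidth in GB/s, before packet overhead."""
    return PCIE_GT_PER_S[gen] * lanes * (8 / 10) / 8


def headroom(ib_rate, pcie_gen, lanes=8):
    """Raw PCIe bandwidth minus the Infiniband link rate, in GB/s."""
    return pcie_raw_gbytes_per_s(pcie_gen, lanes) - IB_4X_GBYTES_PER_S[ib_rate]


for ib_rate, gen in (("DDR", 1), ("DDR", 2), ("QDR", 2)):
    print(f"{ib_rate} 4X on PCIe x8 Gen {gen}: headroom {headroom(ib_rate, gen):.1f} GB/s")
```

With no raw headroom for DDR on Gen 1 or for QDR on Gen 2, any bus protocol overhead would push achieved bandwidth below the link rate.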
  • 13. Infiniband Silicon Vendors
      •  Both switch and HCA parts
          –  Mellanox: Infiniscale, Infinihost
          –  Qlogic: Truescale, Infinipath
      •  Many OEMs use their silicon
      •  Large switches
          –  Parts arranged in fat tree topology
  • 14. Infiniband Switch Hardware
      •  24 port silicon product line at right
      •  Scales to thousands of ports
      •  Host-based and hardware-based subnet management
      •  Current generation (QDR) based on 36 port silicon
      •  Up to 864 ports in single switch!!
      [Image labels: 288 Ports, 144 Ports, 96 Ports, 48 Ports, 24 Ports]
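The port counts of large switches can be related to the radix of the silicon they are built from. The sketch below uses the usual formulas for a non-blocking fat tree built from k-port parts (k²/2 external ports with two tiers, k³/4 with three); the function name is illustrative and the internal layout of any specific product is an assumption, not something the slides describe.

```python
def fat_tree_ports(radix, tiers):
    """External ports of a non-blocking fat tree built from `radix`-port switch parts."""
    if tiers == 1:
        return radix
    if tiers == 2:
        return radix * radix // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only 1 to 3 tiers are modelled here")


for radix in (24, 36):
    for tiers in (1, 2, 3):
        print(f"{radix}-port silicon, {tiers} tier(s): {fat_tree_ports(radix, tiers)} ports")
```

Two tiers of 24-port parts give 288 ports, matching the largest switch in the 24-port product line, and three tiers reach into the thousands. The 864-port figure for 36-port silicon is not produced by the two-tier formula (which gives 648), so that chassis presumably uses a different internal arrangement.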
  • 15. Infiniband Topology
      •  Infiniband uses credit-based flow control
          –  Need to avoid loops in topology that may produce deadlock
      •  Common topology for small and medium size networks is tree (CLOS)
      •  Mesh/torus more cost effective for large clusters (>2500 hosts)
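Credit-based flow control, as named on the topology slide, can be illustrated with a toy model: a sender may transmit only while it holds credits, and each credit stands for one free receive buffer. The `CreditLink` class below is a hypothetical simplification for illustration and does not reproduce Infiniband's actual per-virtual-lane credit accounting.

```python
from collections import deque


class CreditLink:
    """Toy one-way link: one credit corresponds to one free receive buffer."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers
        self.rx_queue = deque()

    def try_send(self, packet):
        """Send only if a credit is available; nothing is ever dropped."""
        if self.credits == 0:
            return False  # sender must wait for a credit
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def drain_one(self):
        """Receiver consumes one packet and returns a credit to the sender."""
        if not self.rx_queue:
            return None
        packet = self.rx_queue.popleft()
        self.credits += 1
        return packet


link = CreditLink(receiver_buffers=2)
print(link.try_send("a"), link.try_send("b"), link.try_send("c"))
print(link.drain_one(), link.try_send("c"))
```

Because a blocked sender simply waits, a ring of such links in which every receive buffer is full and every sender is waiting on the next hop can stall indefinitely, which is why the slide warns about loops.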
  • 16. Infiniband Routing
      •  Infiniband is statically routed
      •  Subnet management software discovers fabric and generates set of routing tables
          –  Most subnet managers support multiple routing algorithms
      •  Tables updated with changes in topology only
      •  Often cannot achieve theoretical bisection bandwidth with static routing
      •  QDR silicon introduces adaptive routing
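The idea of a subnet manager generating static routing tables can be sketched with a minimum-hop breadth-first search. This is an illustrative simplification, not the algorithm of any particular subnet manager; the fabric layout and all names (`build_forwarding_tables`, `leaf1`, `spine1`, and so on) are hypothetical.

```python
from collections import deque


def build_forwarding_tables(links, hosts):
    """For every switch, map each destination host to one fixed next hop."""
    tables = {node: {} for node in links if node not in hosts}
    for dest in hosts:
        next_hop = {dest: None}
        queue = deque([dest])
        while queue:
            node = queue.popleft()
            if node != dest and node in hosts:
                continue  # hosts do not forward traffic
            for neighbor in links[node]:
                if neighbor not in next_hop:
                    next_hop[neighbor] = node  # first (shortest) path found wins
                    queue.append(neighbor)
        for switch in tables:
            if switch in next_hop:
                tables[switch][dest] = next_hop[switch]
    return tables


# Hypothetical fabric: two leaf switches, two spine switches, three hosts.
links = {
    "h1": ["leaf1"], "h2": ["leaf1"], "h3": ["leaf2"],
    "leaf1": ["h1", "h2", "spine1", "spine2"],
    "leaf2": ["h3", "spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
}
tables = build_forwarding_tables(links, {"h1", "h2", "h3"})
print(tables["leaf1"])
```

In this sketch every flow from `leaf1` to `h3` is pinned to `spine1` while `spine2` carries none of it, a small instance of why static routing often falls short of theoretical bisection bandwidth.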
  • 17. HPCC Random Ring Benchmark
      [Chart: Avg Bandwidth (MB/s), 0 to 1600, versus Number of Enclosures, for "Routing 1", "Routing 2", "Routing 3" and "Routing 4"]
  • 18. Infiniband Specification for Software
      •  IB specification does not define API
      •  Actions are known as "verbs"
          –  Services provided to upper layer protocols
          –  Send verb, receive verb, etc
      •  Community has standardized around open source distribution called OFED to provide verbs
      •  Some Infiniband software is also available from vendors
          –  Subnet management
  • 19. Application Support of Infiniband
      •  All MPI implementations support native IB
          –  OpenMPI, MVAPICH, Intel MPI
      •  Existing socket applications
          –  IP over IB
          –  Sockets direct protocol (SDP)
      •  Does NOT require re-link of application
      •  Oracle uses RDS (reliable datagram sockets)
          –  First available in Oracle 10g R2
      •  Developer can program to "verbs" layer
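The point about existing socket applications can be illustrated with ordinary socket code: IP over IB presents the Infiniband port as a regular IP interface, so a program like the sketch below would run unchanged when given the address of that interface. The loopback address is used here only so the example is self-contained, and the function names are illustrative.

```python
import socket
import threading


def echo_once(server_sock):
    """Accept one connection and send back whatever arrives first."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(4096))


def round_trip(bind_addr, payload):
    """Echo `payload` through a TCP connection bound to `bind_addr`."""
    with socket.create_server((bind_addr, 0)) as server:
        port = server.getsockname()[1]  # port chosen by the OS
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection((bind_addr, port)) as client:
            client.sendall(payload)
            reply = client.recv(4096)
        worker.join()
    return reply


# Replace 127.0.0.1 with the IP address of an IPoIB interface to use the fabric.
print(round_trip("127.0.0.1", b"hello over plain sockets"))
```

Nothing in the code refers to Infiniband; only the address decides which interface carries the traffic.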
  • 21. OFED Software
      •  Openfabrics Enterprise Distribution software from Openfabrics Alliance
          –  http://www.openfabrics.org/
      •  Contains everything needed to run Infiniband
          –  HCA drivers
          –  verbs implementation
          –  subnet management
          –  diagnostic tools
      •  Versions qualified together
  • 23. "High Performance" Ethernet
      •  1 GbE cheap and ubiquitous
          –  hardware acceleration
          –  multiple multiport NICs
          –  supported in kernel
      •  10 GbE still used primarily as uplinks from edge switches and as backbone
      •  Some vendors providing 10 GbE to server
          –  low cost NIC on motherboard
          –  HCAs with performance proportional to cost
  • 24. RDMA over Ethernet
      •  NIC capable of RDMA is called RNIC
      •  RDMA is primary method of reducing latency on host side
      •  Multiple vendors have RNICs
          –  Mainstream: Broadcom, Intel, etc.
          –  Boutique: Chelsio, Mellanox, etc.
      •  New Ethernet standards
          –  "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc
  • 25. What is iWarp?
      •  RDMA consortium (RDMAC) standardized some protocols which are now part of the IETF Remote Direct Data Placement (RDDP) working group
      •  http://www.rdmaconsortium.org/home
      •  Also defined SRP, iSER in addition to verbs
      •  iWARP supported in OFED
      •  Most specification work complete in ~2003
  • 26. RDMA over Ethernet?
      "The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet), is a working name. You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names."
      Tom Talpey, Microsoft Corporation; Paul Grun, System Fabric Works; August 2009
  • 27. The Future: InfiniFibreNet
      •  Vendors moving towards "converged fabrics"
      •  Using same "fabric" for both networking and storage
      •  Storage protocols and IB over Ethernet
      •  Storage protocols over Infiniband
          –  NFS over RDMA, Lustre
      •  Gateway switches and converged adapters
          –  Various combinations of Ethernet, IB and FC
  • 28. Any Questions? THANK YOU! (And no mention of The Cloud)