Memory Aggregation For KVM
Hecatonchire Project

Benoit Hudzia; Sr. Researcher; SAP Research Belfast
With the contribution of Aidan Shribman, Roei Tell, Steve Walsh, Peter Izsak

November 2012
Agenda

•   Memory as a Utility
•   Raw Performance
•   First Use Case: Post Copy
•   Second Use Case: Memory Aggregation
•   Lego Cloud
•   Summary


© 2012 SAP AG. All rights reserved.
Memory as a Utility
How we Liquefied Memory Resources
The Idea: Turning memory into a distributed memory service




•   Breaks memory from the bounds of the physical box
•   Transparent deployment with performance at scale and reliability
High Level Principle

[Figure: a memory demanding process on the Memory Demander node uses a virtual memory address space that is backed, over the network, by Memory Sponsor A and Memory Sponsor B]
How does it work
(Simplified Version)

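[Figure: page fault path between Physical Node A (requesting node) and Physical Node B (node holding the page)]

Physical Node A
•   A virtual address lookup misses in the MMU (+ TLB)
•   The page table entry is a remote PTE (custom swap entry)
•   The coherency engine issues a page request through the RDMA engine over the network fabric

Physical Node B
•   The RDMA engine hands the request to the coherency engine, which extracts the page
•   Extraction invalidates the PTE and the MMU entry and prepares the page for RDMA transfer
•   The page response is sent back over the network fabric

Physical Node A
•   PTE write and MMU update, after which the virtual address resolves to a physical address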
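The page fault path described on this slide can be sketched as a small user-space simulation. This is an illustration only, not the kernel implementation: the class names `SponsorNode` and `DemanderNode`, the `REMOTE` marker standing in for the custom swap entry, and the direct method call standing in for the RDMA transfer are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages


class SponsorNode:
    """Hypothetical stand-in for the node holding the page (Physical Node B)."""

    def __init__(self):
        self.frames = {}  # page number -> page contents

    def extract_page(self, page_no):
        # Extraction removes the page locally, mirroring the PTE/MMU
        # invalidation step before the page is handed to the transfer engine.
        return self.frames.pop(page_no)


class DemanderNode:
    """Hypothetical stand-in for the requesting node (Physical Node A)."""

    REMOTE = object()  # plays the role of the remote PTE (custom swap entry)

    def __init__(self, sponsor, remote_pages):
        self.sponsor = sponsor
        self.page_table = {p: self.REMOTE for p in remote_pages}
        self.frames = {}
        self.faults = 0

    def read(self, vaddr):
        page_no, offset = divmod(vaddr, PAGE_SIZE)
        if self.page_table.get(page_no) is self.REMOTE:
            # Page fault on a remote entry: request the page, then map it.
            self.faults += 1
            self.frames[page_no] = self.sponsor.extract_page(page_no)
            self.page_table[page_no] = page_no  # "PTE write": page is now local
        return self.frames[page_no][offset]


sponsor = SponsorNode()
sponsor.frames[3] = bytes([7]) * PAGE_SIZE
demander = DemanderNode(sponsor, remote_pages=[3])
first = demander.read(3 * PAGE_SIZE + 5)   # faults and fetches page 3
second = demander.read(3 * PAGE_SIZE)      # served locally, no new fault
```

Only the first access to a page pays for the remote fetch; later accesses hit the locally mapped copy.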
Reducing Effects of Network Bound Page Faults


Full Linux MMU integration (reducing the system-wide effects/cost of page faults)
•   Page faults are resolved transparently (only the requesting thread is paused)

Low latency RDMA engine and page transfer protocol (reducing the latency/cost of page faults)
•   Implemented fully in kernel mode with OFED verbs
•   Can use the fastest RDMA hardware available (IB, iWARP, RoCE)
•   Tested with software RDMA solutions (SoftiWARP and SoftRoCE); no special hardware required

Demand pre-paging (pre-fetching) mechanism (reducing the number of page faults)
•   Currently only a simple fetch of the pages surrounding the page on which the fault occurred
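The pre-paging policy above (fetch the pages surrounding the faulting page) can be illustrated with a short sketch. The function name `prefetch_window`, the `radius` parameter, and the `resident` set are hypothetical; the slide does not state the window size or how already-local pages are skipped.

```python
def prefetch_window(fault_page, radius, first_page, last_page, resident=frozenset()):
    """Return the neighbouring pages to request along with a faulting page.

    Assumes a symmetric window of `radius` pages on each side, clipped to the
    shared region [first_page, last_page], and skips pages already local.
    """
    lo = max(first_page, fault_page - radius)
    hi = min(last_page, fault_page + radius)
    return [p for p in range(lo, hi + 1)
            if p != fault_page and p not in resident]


middle = prefetch_window(10, 2, 0, 100)
start = prefetch_window(0, 2, 0, 100)
skipping = prefetch_window(10, 2, 0, 100, resident={9, 11})
```

A larger radius trades extra network traffic for fewer faults on sequential access patterns, and is less likely to help on random walks.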
Transparent Solution
Minimal modification of the kernel (simple and minimal intrusion)
•   4 hooks in the static kernel, with virtually no overhead for normal operation when enabled

Paging and memory cgroup support (transparent tiered memory)
•   Pages are pushed back to their sponsor when paging occurs; pages that are local can be swapped out normally

KVM-specific support (virtualization friendly)
•   Shadow page table (EPT / NPT)
•   KVM asynchronous page fault
Transparent Solution (cont.)
Scalable Active – Active Mode (Distributed Shared Memory)
   • Shared Nothing with distributed index
   • Write invalidate with distributed index (end of this year)


Library LibHeca (Ease of integration)
   • Simple API bootstrapping and synching all participating nodes


We also support:
   •   KSM
   •   Huge pages
   •   Discontinuous shared memory regions
   •   Multiple DSM / VM groups on the same physical node
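The slide lists a write-invalidate mode with a distributed index but does not describe the protocol. As a rough illustration of the general technique only, the sketch below tracks which nodes hold a copy of each page and invalidates the other copies when one node acquires write access. `WriteInvalidateDirectory`, `home_node`, and the modulo placement of index entries are assumptions made for the example, not the Hecatonchire design.

```python
def home_node(page_no, nodes):
    """Pick the node that keeps the index entry for a page.

    Assumption: a simple modulo placement spreads the index across nodes.
    """
    return nodes[page_no % len(nodes)]


class WriteInvalidateDirectory:
    """Index entries for the pages homed on one node (illustrative only)."""

    def __init__(self):
        self.holders = {}  # page -> set of nodes holding a valid copy
        self.writer = {}   # page -> node with exclusive write access, if any

    def acquire_read(self, node, page):
        # A new reader ends exclusive access; the old writer keeps a shared copy.
        self.writer.pop(page, None)
        self.holders.setdefault(page, set()).add(node)

    def acquire_write(self, node, page):
        """Grant exclusive access and return the nodes that must invalidate."""
        to_invalidate = self.holders.get(page, set()) - {node}
        self.holders[page] = {node}
        self.writer[page] = node
        return to_invalidate


directory = WriteInvalidateDirectory()
directory.acquire_read("a", 1)
directory.acquire_read("b", 1)
invalidated = directory.acquire_write("a", 1)  # "b" must drop its copy
directory.acquire_read("c", 1)                 # "a" is downgraded to a reader
```

Spreading the index entries over the participating nodes avoids a single directory becoming the bottleneck for every fault.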
Raw Performance
How fast can we move memory around?
Raw Bandwidth usage
HW: 4-core i5-2500 CPU @ 3.30GHz; SoftiWARP 10GbE; iWARP Chelsio T422 10GbE; IB ConnectX2 QDR 40 Gbps
[Figure: total bandwidth in Gbit/s (axis 0 to 25) for a sequential walk, a bin split walk, and a random walk over 1GB of shared RAM, with 1 to 8 threads, on SoftiWARP (SIW), iWARP (IW), and InfiniBand (IB)]
Annotations on the chart:
•   Maxing out bandwidth
•   Not enough cores to saturate (?)
•   No degradation under high load
•   Software RDMA has significant overhead
Hard Page Fault Resolution Performance

                        Resolution time    Time spent over the wire,    Resolution time
                        average (μs)       one way, average (μs)        best (μs)
SoftiWARP (10 GbE)      355                150+                         74
iWARP (10 GbE)          48                 4-6                          28
InfiniBand (40 Gbps)    29                 2-4                          16
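One way to read the table is to estimate how much of the average resolution time is spent on the wire. The sketch below assumes that each fault costs one request and one response, each taking the one-way time from the table; that assumption, and the function name `wire_share`, are not from the slides.

```python
def wire_share(avg_resolution_us, one_way_us):
    """Fraction of the average resolution time spent on the wire.

    Assumes one request plus one response, each costing `one_way_us`.
    """
    return 2 * one_way_us / avg_resolution_us


# (average resolution time, lowest and highest one-way time) in microseconds
iwarp = (wire_share(48, 4), wire_share(48, 6))
infiniband = (wire_share(29, 2), wire_share(29, 4))
# SoftiWARP is listed as "150+" one way, so only a lower bound is available
softiwarp_lower_bound = wire_share(355, 150)
```

Under this assumption the hardware RDMA fabrics spend well under a third of the resolution time on the wire, while for SoftiWARP the wire time dominates.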
Average Compounded Page Fault Resolution Time
(With Prefetch)

[Figure: average compounded page fault resolution time in microseconds (axis 1 to 6) against thread count (1 to 8 threads), for iWARP 10GbE and InfiniBand 40 Gbps under sequential, binary split, and random walks]
Post-Copy Live Migration
Technology first Use Case
Post Copy – Pre Copy – Hybrid Comparison
[Figure: downtime in seconds (axis 0 to 4) against VM RAM size (1 GB, 4 GB, 10 GB, 14 GB) for pre-copy (forced after 60s), post-copy, hybrid (3 seconds), and hybrid (5 seconds)]
Host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
Network: 10 Gb Ethernet, Chelsio T422-CR iWARP
Workload: App Mem Bench (~80% of the VM RAM), dirtying rate 1GB/s (256k pages dirtied per second)
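The dirtying rate quoted above can be checked with simple arithmetic, and a textbook model of iterative pre-copy shows why such a rate is hard on pre-copy migration. The sketch assumes 4 KiB pages and a raw 10 Gb/s link (1.25e9 bytes per second), and the round model is a simplification for illustration, not the migration code used in the experiments.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages
GIB = 1 << 30


def pages_dirtied_per_second(dirty_bytes_per_s, page_size=PAGE_SIZE):
    return dirty_bytes_per_s // page_size


def precopy_rounds(ram_bytes, dirty_bytes_per_s, link_bytes_per_s,
                   stop_threshold_bytes, max_rounds=100):
    """Simplified pre-copy model: each round resends what was dirtied
    while the previous round was on the wire.

    Returns (rounds, remaining_bytes); rounds is None if the remaining
    dirty set never drops below the stop-and-copy threshold.
    """
    remaining = ram_bytes
    for round_no in range(1, max_rounds + 1):
        round_time_s = remaining / link_bytes_per_s
        # The dirty set cannot grow beyond the VM's RAM.
        remaining = min(ram_bytes, dirty_bytes_per_s * round_time_s)
        if remaining <= stop_threshold_bytes:
            return round_no, remaining
    return None, remaining


pages = pages_dirtied_per_second(GIB)  # 1 GiB/s over 4 KiB pages
slow = precopy_rounds(GIB, GIB, 1.25e9, 64 * 2**20)
never = precopy_rounds(GIB, 5 * GIB, 1.25e9, 64 * 2**20)
```

In this model a dirtying rate close to the link bandwidth converges only slowly, and a rate above it never converges, which is consistent with the pre-copy runs on the slide being forced after 60s. Post-copy downtime does not depend on this ratio.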
Post Copy vs Pre copy under load


[Figure: performance degradation (%) over time (seconds 1 to 95) for post-copy and pre-copy migration at dirtying rates of 1GB/s, 5GB/s, 25GB/s, 50GB/s, and 100GB/s]

Virtual Machine:
•   1 GB RAM, 1 vCPU
•   Workload: App Mem Bench
Hardware:
•   Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
•   Network: 10 Gb Ethernet switch; NIC: Chelsio T422-CR (iWARP)
Post Copy Migration of HANA DB

                                     Baseline    Pre-Copy            Post-Copy
Downtime                             N/A         7.47 s              675 ms
Benchmark performance degradation    0%          Benchmark failed    5%

Virtual Machine:
•   10 GB RAM, 4 vCPU
•   Application: HANA (in-memory database)
•   Workload: SAP-H (TPC-H variant)
Hardware:
•   Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
•   Fabric: 10 Gb Ethernet switch
•   NIC: Chelsio iWARP T422-CR
Memory Aggregation
Second use case: Scaling out Memory
Scaling Out Virtual Machine Memory
Business Problem
   • Heavy swap usage slows execution time for data intensive applications
Hecatonchire / RRAIM Solution
   • Applications use memory mobility for a high performance swap resource
         • Completely transparent
         • No integration required
         • Act on results sooner
         • High reliability built in
   • Enables iteration or additional data to improve results
[Figure: Solution. The VM swaps application RAM to a memory cloud that offers compression, deduplication, N-tier storage, and HR-HA]
Redundant Array of Inexpensive RAM: RRAIM




  1.     Memory region backed by two remote nodes. Remote page faults and
         swap outs are initiated simultaneously to all relevant nodes.

  2.     No immediate effect on the computation node upon failure of a node.

  3.     When a new remote node enters the cluster, it synchronizes with the
         computation node and the mirror node.
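The three RRAIM properties above can be sketched as a mirrored page store. This is a user-space illustration under assumptions: `RemoteNode`, `MirroredRegion`, and `NodeDown` are hypothetical names, the mirrors are contacted in a loop rather than truly simultaneously, and a joining node copies its state from a surviving mirror only.

```python
class NodeDown(Exception):
    """Raised when a remote memory node cannot be reached."""


class RemoteNode:
    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.alive = True

    def store(self, page_no, data):
        if not self.alive:
            raise NodeDown(self.name)
        self.pages[page_no] = data

    def load(self, page_no):
        if not self.alive:
            raise NodeDown(self.name)
        return self.pages[page_no]


class MirroredRegion:
    """Memory region backed by several remote nodes (two on the slide)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def swap_out(self, page_no, data):
        # Send the page to every mirror; one live copy is enough to proceed.
        stored = 0
        for node in self.nodes:
            try:
                node.store(page_no, data)
                stored += 1
            except NodeDown:
                pass
        if stored == 0:
            raise RuntimeError("no live mirror left")

    def page_fault(self, page_no):
        # Use the first mirror that answers; a failed node is simply skipped.
        for node in self.nodes:
            try:
                return node.load(page_no)
            except NodeDown:
                continue
        raise RuntimeError("no live mirror left")

    def add_node(self, new_node):
        # A joining node synchronizes from a surviving mirror and replaces
        # any failed ones.
        live = [n for n in self.nodes if n.alive]
        if live:
            new_node.pages = dict(live[0].pages)
        self.nodes = live + [new_node]


a, b, c = RemoteNode("a"), RemoteNode("b"), RemoteNode("c")
region = MirroredRegion([a, b])
region.swap_out(7, b"x")
a.alive = False                      # first mirror fails
after_failure = region.page_fault(7)
region.add_node(c)                   # replacement joins and synchronizes
b.alive = False                      # second original mirror fails
after_rejoin = region.page_fault(7)
```

Because every swap out reaches both mirrors, a single node failure is invisible to the computation node apart from the lost redundancy, which the joining node restores.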
Quicksort Benchmark with Memory Constraint
[Figure: DSM overhead and RRAIM overhead (axis 0% to 10%) by memory ratio, with panels for the 512 MB, 1GB, and 2GB datasets]

Memory ratio                    DSM overhead    RRAIM overhead
(constrained using cgroup)
3:4                             2.08%           5.21%
1:2                             2.62%           6.15%
1:3                             3.35%           9.21%
1:4                             4.15%           8.68%
1:5                             4.71%           9.28%
Scaling out HANA

Memory ratio    DSM overhead    RRAIM overhead
1:2             1%              0.887%
1:3             1.6%            1.548%
2:1:1           0.1%            -
1:1:1           1.5%            -

[Figure: DSM overhead and RRAIM overhead (axis 0% to 1.6%) for memory ratios 1:2, 1:3, 2:1:1, and 1:1:1]

Virtual Machine:
•   18 GB RAM, 4 vCPU
•   Application: HANA (in-memory database)
•   Workload: SAP-H (TPC-H variant)
Hardware:
•   Memory Host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
•   Compute Host: Intel(R) Xeon(R) CPU X5650 @ 2.56GHz, 8 cores, 96GB RAM
•   Fabric: Infiniband QDR 40Gbps switch + Mellanox ConnectX2
Transitioning to a Memory Cloud
(Ongoing work)
[Figure: many physical nodes hosting a variety of VMs: memory VMs (memory sponsors), compute VMs (memory demanders), and combination VMs (memory sponsors and demanders). Together they form a memory cloud with RRAIM, managed by Memory Cloud Management Services (OpenStack)]
Lego Cloud
Going beyond Memory
Virtual Distributed Shared Memory System
(Compute Cloud)
Compute aggregation
•   Idea: virtual machine compute and memory span multiple physical nodes

Challenges
•   Coherency protocol
•   Granularity (false sharing)

Hecatonchire Value Proposition
•   Optimal price / performance by using commodity hardware
•   Operational flexibility: node downtime without downing the cluster
•   Seamless deployment within existing cloud

[Figure: guest VMs (App, OS, H/W) spanning Server #1, Server #2, through Server #n, each with CPUs, memory, and I/O, connected by fast RDMA communication]
Disaggregation of datacentre ( and cloud ) resources
(Our Aim)

Breaking out the functions of memory, compute, and I/O, and optimizing the delivery of each.

Disaggregation provides three primary benefits:
• Better performance:
    • Each function is isolated => limiting the scope of what each box must do
    • We can leverage dedicated hardware and software => increased performance
• Superior scalability:
    • Functions are isolated from each other => one function can be altered without impacting the others
• Improved economics:
    • Cost-effective deployment of resources => improved provisioning and consolidation of disparate equipment
Summary
Hecatonchire Project

 • Features:
  • Distributed shared memory
  • Memory extension via memory servers
  • HA features
  • Future: distributed workload execution
 • Use standard cloud interfaces
 • Optimise cloud infrastructure
 • Support COTS HW
Key takeaways
•       The Hecatonchire project aims at disaggregating
        datacentre resources

•       The Hecatonchire project currently delivers memory
        cloud capabilities

•       Enhancements to be released as open source under
        GPLv2 and LGPL licenses by the end of November
        2012

•       Hosted on GitHub, check: www.hecatonchire.com

•       Developed by SAP Research Technology
        Infrastructure (TI) Programme
Thank you

Benoit Hudzia; Sr. Researcher;
SAP Research Belfast
benoit.hudzia@sap.com
Backup Slide
Instant Flash Cloning On-Demand

Business Problem
•   Burst load / service usage that cannot be satisfied in time

Existing solutions
•   Vendors: Amazon / VMware / RightScale
•   Start up the VM from a disk image
•   Requires the full VM OS startup sequence

Hecatonchire Solution
•   Go live after cloning the VM state (MBs) and hot memory (<5%)
•   Use the post-copy live-migration scheme in the background
•   Complete the background transfer and disconnect from the source

Hecatonchire Value Proposition
•   Just in time (sub-second) provisioning
DRAM Latency Has Remained Constant


CPU clock speed and memory bandwidth increased steadily (at least until 2000)

But memory latency remained constant, so local memory has gotten slower from the CPU perspective

Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
CPUs Stopped Getting Faster


Moore’s law prevailed until 2005, when core speed hit a practical limit of about 3.4 GHz

Since 2005 you do get more cores, but the “single threaded free lunch” is over

Source: http://www.intel.com/pressroom/kits/quickrefyr.htm

Effectively, arbitrary sequential algorithms have not gotten faster since

Source: “The Free Lunch Is Over..” by Herb Sutter
While … Interconnect Link Speed has Kept Growing




Source: Panda et al. Supercomputing 2009
Result: Remote Nodes Have Gotten Closer


Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM

And remote DRAM has far better performance than paging in from an SSD or HDD device

Fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche

Source: HANA Performance Analysis, Chaim Bendelac, 2011
Post-Copy Live Migration (phases)

[Figure sequence: five slides showing the guest VM moving from Host A to Host B]
•   Pre-migration: guest VM on Host A only
•   Reservation: guest VM shown on both Host A and Host B
•   Stop and copy: guest VM shown on both Host A and Host B
•   Post-copy: guest VM shown on both hosts, with page fault and page push traffic between them
•   Commit: guest VM on Host B only

Timeline shown on every slide: Pre-migrate, Reservation, Stop and Copy, Page Pushing (1 round), Commit
VM state along the timeline: Live on A, Downtime, Degraded on B, Live on B, with the Total Migration Time marked beneath

Hecatonchire kvm forum_2012_benoit_hudzia

  • 1. Memory Aggregation For KVM: Hecatonchire Project. Benoit Hudzia, Sr. Researcher, SAP Research Belfast. With contributions from Aidan Shribman, Roei Tell, Steve Walsh and Peter Izsak. November 2012.
  • 2. Agenda
      • Memory as a Utility
      • Raw Performance
      • First Use Case: Post Copy
      • Second Use Case: Memory Aggregation
      • Lego Cloud
      • Summary
  • 3. Memory as a Utility: How we Liquefied Memory Resources
  • 4. The Idea: Turning memory into a distributed memory service
      • Breaks memory from the bounds of the physical box
      • Transparent deployment with performance at scale and reliability
  • 5. High Level Principle
      [Diagram: Memory Sponsor A and Memory Sponsor B are connected over the network to the Memory Demander. The memory demanding process on the demander sees a single virtual memory address space.]
  • 6. How does it work (Simplified Version)
      [Diagram: two physical nodes connected by a network fabric.]
      Physical Node A: Virtual Address, MMU (+ TLB) miss, Page Table Entry holding a remote PTE (custom swap entry), Coherency Engine, RDMA Engine. A page request goes out over the fabric; on the page response the PTE is written and the MMU updated, giving the Physical Address.
      Physical Node B: RDMA Engine, Coherency Engine, Page Table Entry, MMU (+ TLB), Physical Address. The PTE and the MMU entry are invalidated, the page is extracted and prepared for RDMA transfer, and the page response is sent back.
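The request and response path on slide 6 can be sketched as a toy simulation. This is only an illustration of the flow named in the diagram, not the project's kernel code: the `Node` class, the `REMOTE` marker, and the `extract_page` and `access` helpers are hypothetical names, and the RDMA transfer is replaced by a plain function call.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical marker standing in for the "custom swap entry" that flags a remote page.
REMOTE = object()


@dataclass
class Node:
    """A toy physical node: maps page number -> page contents, or REMOTE."""
    name: str
    page_table: Dict[int, object] = field(default_factory=dict)


def extract_page(sponsor: Node, page_no: int) -> bytes:
    """Sponsor side: invalidate the local entry and hand the page contents over."""
    return sponsor.page_table.pop(page_no)


def access(demander: Node, sponsor: Node, page_no: int) -> bytes:
    """Demander side: resolve a remote entry by pulling the page from the sponsor."""
    entry = demander.page_table[page_no]
    if entry is REMOTE:
        data = extract_page(sponsor, page_no)   # page request and page response
        demander.page_table[page_no] = data     # PTE write: the page is now local
        return data
    return entry


node_a = Node("A", {7: REMOTE})
node_b = Node("B", {7: b"payload"})
first = access(node_a, node_b, 7)    # remote fault, served by node B
second = access(node_a, node_b, 7)   # plain local hit on node A
```

After the first access the page lives only on node A, which mirrors the invalidate-and-extract step shown for node B.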
  • 7. Reducing Effects of Network Bound Page Faults
      Full Linux MMU integration (reducing the system-wide effects/cost of page faults)
        • Page faults are handled transparently (only the requesting thread is paused)
      Low latency RDMA engine and page transfer protocol (reducing latency/cost of page faults)
        • Implemented fully in kernel mode with OFED verbs
        • Can use the fastest RDMA hardware available (IB, iWARP, RoCE)
        • Tested with software RDMA solutions (Soft iWARP and SoftRoCE): no special hardware required
      Demand pre-paging (pre-fetching) mechanism (reducing the number of page faults)
        • Currently only a simple fetch of the pages surrounding the page on which the fault occurred
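The "pages surrounding the faulting page" policy from slide 7 might look like the following minimal sketch. The slide does not say how many neighbouring pages are fetched, so the `radius` parameter and the function name `prefetch_window` are assumptions made for illustration.

```python
from typing import List


def prefetch_window(fault_page: int, radius: int, num_pages: int) -> List[int]:
    """Return the page numbers around fault_page that a simple prefetcher would request.

    radius is a hypothetical tuning knob: how many neighbours to fetch on each side.
    The window is clipped to the valid range [0, num_pages - 1].
    """
    low = max(0, fault_page - radius)
    high = min(num_pages - 1, fault_page + radius)
    return [page for page in range(low, high + 1) if page != fault_page]


middle = prefetch_window(fault_page=10, radius=2, num_pages=100)
start = prefetch_window(fault_page=0, radius=2, num_pages=100)
end = prefetch_window(fault_page=99, radius=2, num_pages=100)
```

For a fault on page 10 with a radius of 2, the sketch requests pages 8, 9, 11 and 12; at either end of the region the window is simply truncated.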
  • 8. Transparent Solution
      Minimal modification of the kernel (simple and minimal intrusion)
        • 4 hooks in the static kernel, virtually no overhead when enabled for normal operation
      Paging and memory cgroup support (transparent tiered memory)
        • Pages are pushed back to their sponsor when paging occurs; if they are local, they can be swapped out normally
      KVM-specific support (virtualization friendly)
        • Shadow page table (EPT / NPT)
        • KVM asynchronous page fault
  • 9. Transparent Solution (cont.)
      Scalable active-active mode (distributed shared memory)
        • Shared nothing with distributed index
        • Write invalidate with distributed index (end of this year)
      Library LibHeca (ease of integration)
        • Simple API for bootstrapping and synchronizing all participating nodes
      We also support:
        • KSM
        • Huge pages
        • Discontinuous shared memory regions
        • Multiple DSM / VM groups on the same physical node
  • 10. Raw Performance: How fast can we move memory around?
  • 11. Raw Bandwidth Usage
      Hardware: 4-core i5-2500 CPU @ 3.30GHz; SoftIwarp over 10GbE; Iwarp on Chelsio T422 10GbE; IB on ConnectX2 QDR 40 Gbps
      [Bar chart: total Gbit/sec (axis 0 to 25 Gb/s) for 1 to 8 threads, grouped by transport (SIW, IW, IB) and by access pattern over 1GB of shared RAM (sequential walk, bin split walk, random walk).]
      Chart annotations: "Not enough cores to saturate (?)"; "No degradation under high load"; "Maxing out bandwidth"; "Software RDMA has significant overhead".
  • 12. Hard Page Fault Resolution Performance

      Transport              Resolution time,   Time spent over the wire,   Resolution time,
                             average (μs)       one way, average (μs)       best (μs)
      SoftIwarp (10 GbE)     355                150+                        74
      Iwarp (10 GbE)         48                 4-6                         28
      Infiniband (40 Gbps)   29                 2-4                         16
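As a back-of-envelope reading of the slide 12 figures, the sketch below converts the average resolution times into the bandwidth one thread could reach if it resolved hard faults strictly one after another, with no prefetch. The 4 KiB page size is an assumption (it is consistent with the "1GB/s = 256k pages per second" figure on slide 15), and `serial_fault_gbps` is a hypothetical helper name.

```python
PAGE_BYTES = 4096  # assumption: 4 KiB pages

# Average hard page fault resolution times in microseconds, taken from slide 12.
AVG_RESOLUTION_US = {
    "SoftIwarp (10 GbE)": 355,
    "Iwarp (10 GbE)": 48,
    "Infiniband (40 Gbps)": 29,
}


def serial_fault_gbps(resolution_us: float, page_bytes: int = PAGE_BYTES) -> float:
    """Gbit/s moved by one thread that resolves one fault at a time, back to back."""
    faults_per_second = 1_000_000 / resolution_us
    return faults_per_second * page_bytes * 8 / 1e9


per_transport = {name: serial_fault_gbps(us) for name, us in AVG_RESOLUTION_US.items()}
```

Under these assumptions a single serial faulting thread stays near 1.13 Gbit/s on Infiniband, 0.68 Gbit/s on Iwarp and 0.09 Gbit/s on SoftIwarp, which suggests why the deck pairs the transfer engine with prefetching and multiple threads.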
  • 13. Average Compounded Page Fault Resolution Time (With Prefetch)
      [Line chart: average page fault resolution time in microseconds (axis 1 to 6) against thread count (1 to 8 threads). Series: IW 10GE Sequential, IB 40 Gbps Sequential, IW 10GE Binary split, IB 40Gbps Binary split, IW 10GE Random Walk, IB Random Walk.]
  • 15. Post Copy – Pre Copy – Hybrid Comparison
      [Chart: downtime in seconds (axis 0 to 4) against VM RAM size (1 GB, 4 GB, 10 GB, 14 GB). Series: Pre-copy (forced after 60s), Post-Copy, Hybrid - 3 seconds, Hybrid - 5 seconds.]
      Host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
      Network: 10 Gb Ethernet, Chelsio T422-CR IWARP
      Workload: App Mem Bench (~80% of the VM RAM)
      Dirtying rate: 1GB/s (256k pages dirtied per second)
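The setup on slide 15 hints at why pre-copy has to be forced: the guest dirties memory almost as fast as the link can carry it. A rough, idealised model of iterative pre-copy is sketched below. It is a textbook-style approximation, not the benchmark's method: it assumes the link runs at its full 10 Gbit/s line rate (1.25 GB/s), and the stop threshold of 0.1 GB and the name `precopy_rounds` are hypothetical.

```python
from typing import Tuple

# 1 GiB/s of dirtying at 4 KiB per page matches the slide's "256k pages per second".
PAGES_DIRTIED_PER_SECOND = (1 << 30) // 4096


def precopy_rounds(ram_gb: float, dirty_gb_s: float, link_gb_s: float,
                   stop_threshold_gb: float, max_rounds: int = 100) -> Tuple[int, float, float]:
    """Idealised iterative pre-copy: return (rounds, elapsed seconds, GB still dirty)."""
    remaining = ram_gb
    elapsed = 0.0
    rounds = 0
    while remaining > stop_threshold_gb and rounds < max_rounds:
        round_time = remaining / link_gb_s            # time to send the current dirty set
        elapsed += round_time
        # Memory dirtied while sending becomes the next round's work (capped at VM size).
        remaining = min(ram_gb, dirty_gb_s * round_time)
        rounds += 1
    return rounds, elapsed, remaining


slow_dirtier = precopy_rounds(ram_gb=10, dirty_gb_s=1.0, link_gb_s=1.25, stop_threshold_gb=0.1)
fast_dirtier = precopy_rounds(ram_gb=10, dirty_gb_s=5.0, link_gb_s=1.25,
                              stop_threshold_gb=0.1, max_rounds=10)
```

In this model a 10 GB VM dirtied at 1 GB/s needs 21 rounds and just under 40 seconds before the dirty set drops below 0.1 GB, and once the dirtying rate exceeds the link rate the loop never converges. Post-copy avoids that dependence on the dirtying rate because each page is transferred only once.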
  • 16. Post Copy vs Pre Copy Under Load
      [Line chart: degradation in percent (axis 0 to 100) over time in seconds (axis 1 to 95). Series: Post Copy and Pre Copy, each at dirtying rates of 1GB/s, 5GB/s, 25GB/s, 50GB/s and 100GB/s.]
      Virtual machine: 1 GB RAM, 1 vCPU; workload: App Mem Bench
      Hardware: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM; network: 10 Gb Ethernet switch; NIC: Chelsio T422-CR (IWARP)
  • 17. Post Copy Migration of HANA DB

                                          Baseline   Pre-Copy           Post-Copy
      Downtime                            N/A        7.47 s             675 ms
      Benchmark performance degradation   0%         Benchmark failed   5%

      Virtual machine: 10 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant)
      Hardware: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM; fabric: 10 Gb Ethernet switch; NIC: Chelsio IWARP T422-CR
  • 18. Memory Aggregation. Second use case: Scaling out Memory
  • 19. Scaling Out Virtual Machine Memory
      Business problem
        • Heavy swap usage slows execution time for data intensive applications
      Hecatonchire / RRAIM solution
        • Applications use memory mobility for a high performance swap resource
        • Completely transparent
        • No integration required
        • Act on results sooner
        • High reliability built in
        • Enables iteration or additional data to improve results
      [Diagram: in the application cloud, the VM swaps its RAM to the memory cloud, which offers compression, deduplication, N-tiers storage and HR-HA.]
  • 20. Redundant Array of Inexpensive RAM: RRAIM
      1. A memory region is backed by two remote nodes. Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
      2. The failure of a remote node has no immediate effect on the computation node.
      3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
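The three RRAIM properties listed on slide 20 can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not the Hecatonchire implementation: `RemoteNode` and `MirroredRegion` are hypothetical names, faults read the first live copy instead of querying all nodes at once, and a joining node synchronizes only from a surviving mirror (the computation node's local pages are not modelled).

```python
class RemoteNode:
    """A toy remote memory node holding page copies."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.pages = {}


class MirroredRegion:
    """Toy RRAIM region: every swapped-out page is kept on all live remote nodes."""

    def __init__(self, nodes) -> None:
        self.nodes = list(nodes)

    def _live(self):
        return [node for node in self.nodes if node.alive]

    def swap_out(self, page_no: int, data: bytes) -> None:
        live = self._live()
        if not live:
            raise RuntimeError("no live remote node left")
        for node in live:                      # write to every mirror
            node.pages[page_no] = data

    def fault_in(self, page_no: int) -> bytes:
        for node in self._live():              # simplification: first live copy wins
            if page_no in node.pages:
                return node.pages[page_no]
        raise KeyError(page_no)

    def add_node(self, new_node: RemoteNode) -> None:
        live = self._live()
        if live:                               # synchronize from a surviving mirror
            new_node.pages = dict(live[0].pages)
        self.nodes.append(new_node)


node_a, node_b = RemoteNode("A"), RemoteNode("B")
region = MirroredRegion([node_a, node_b])
region.swap_out(1, b"x")
node_a.alive = False                           # one mirror fails
survived = region.fault_in(1)                  # still served by node B
node_c = RemoteNode("C")
region.add_node(node_c)                        # replacement mirror catches up
region.swap_out(2, b"y")
```

Losing node A does not interrupt the fault path, and node C holds a full copy as soon as it joins, so the region is back to two mirrors.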
  • 21. Quicksort Benchmark with Memory Constraint
      Panels: Quicksort Benchmark 512 MB Dataset, 1GB Dataset, 2GB Dataset

      Memory ratio                DSM overhead   RRAIM overhead
      (constraint using cgroup)
      3:4                         2.08%          5.21%
      1:2                         2.62%          6.15%
      1:3                         3.35%          9.21%
      1:4                         4.15%          8.68%
      1:5                         4.71%          9.28%

      [Bar chart: DSM overhead and RRAIM overhead (axis 0% to 10%) for memory ratios 3:4, 1:2, 1:3, 1:4 and 1:5.]
  • 22. Scaling out HANA Memory ratio / DSM overhead / RRAIM overhead: 1:2 / 1% / 0.887%; 1:3 / 1.6% / 1.548%; 2:1:1 / 0.1% / -; 1:1:1 / 1.5% / - [Figure: DSM and RRAIM overhead (0% to 1.6%) per memory ratio.] Virtual machine: 18 GB RAM, 4 vCPU • Application: HANA (in-memory database) • Workload: SAP-H (TPC-H variant) • Memory host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16 GB RAM • Compute host: Intel(R) Xeon(R) CPU X5650 @ 2.56GHz, 8 cores, 96 GB RAM • Fabric: InfiniBand QDR 40 Gbps switch + Mellanox ConnectX2
  • 23. Transitioning to a Memory Cloud (Ongoing work) [Diagram: many physical nodes hosting a variety of VMs: memory VMs (memory sponsors), compute VMs (memory demanders) and combination VMs (memory sponsor and demander), joined into an RRAIM memory cloud under Memory Cloud Management Services (OpenStack).]
  • 25. Virtual Distributed Shared Memory System (Compute Cloud) Compute aggregation • Idea: virtual machine compute and memory span multiple physical nodes • Challenges: coherency protocol; granularity (false sharing) • Hecatonchire Value Proposition: optimal price / performance by using commodity hardware; operational flexibility: node downtime without downing the cluster; seamless deployment within existing cloud [Diagram: guest VMs (App, OS, H/W) spanning Server #1 to Server #n, each with CPUs, memory and I/O, linked by fast RDMA communication.]
  • 26. Disaggregation of datacentre (and cloud) resources (Our Aim) Breaking out the functions of memory, compute and I/O, and optimizing the delivery of each. Disaggregation provides three primary benefits: • Better performance: each function is isolated, which limits the scope of what each box must do; dedicated hardware and software can be leveraged, which increases performance. • Superior scalability: functions are isolated from each other, so one function can be altered without impacting the others. • Improved economics: cost-effective deployment of resources, with improved provisioning and consolidation of disparate equipment.
  • 28. Hecatonchire Project • Features: • Distributed Shared Memory • Memory extension via Memory Servers • HA features • Future: distributed workload execution • Use standard Cloud interface • Optimise Cloud infrastructure • Support COTS HW
  • 29. Key takeaways • The Hecatonchire project aims at disaggregating datacentre resources • The Hecatonchire project currently delivers memory cloud capabilities • Enhancements to be released as open source under GPLv2 and LGPL licenses by the end of November 2012 • Hosted on GitHub, check: www.hecatonchire.com • Developed by SAP Research Technology Infrastructure (TI) Programme
  • 30. Thank you Benoit Hudzia; Sr. Researcher; SAP Research Belfast benoit.hudzia@sap.com
  • 32. Instant Flash Cloning On-Demand Business Problem • Burst load / service usage that cannot be satisfied in time. Existing solutions • Vendors: Amazon / VMware / RightScale • Start up VM from disk image • Requires full VM OS startup sequence. Hecatonchire Solution • Go live after VM-state (MBs) and hot memory (<5%) cloning • Use post-copy live-migration schema in background • Complete background transfer and disconnect from source. Hecatonchire Value Proposition • Just in time (sub-second) provisioning
  • 33. DRAM Latency Has Remained Constant. CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained constant, so local memory has gotten slower from the CPU perspective. Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
  • 34. CPUs Stopped Getting Faster. Moore's law prevailed until 2005, when core speed hit a practical limit of about 3.4 GHz. Since 2005 you do get more cores, but the "single threaded free lunch" is over. Source: http://www.intel.com/pressroom/kits/quickrefyr.htm Effectively, arbitrary sequential algorithms have not gotten faster since. Source: "The Free Lunch Is Over.." by Herb Sutter
  • 35. While … Interconnect Link Speed has Kept Growing. Source: Panda et al., Supercomputing 2009
  • 36. Result: Remote Nodes Have Gotten Closer. Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM, and remote DRAM has far better performance than paging in from an SSD or HDD device. Fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche. Source: HANA Performance Analysis, Chaim Bendelac, 2011
  • 37. Post-Copy Live Migration (pre-migration) [Diagram: one guest VM shown across Host A and Host B. Timeline phases: Pre-migrate, Reservation, Stop and Copy, Page Pushing (1 round), Commit. VM state along the timeline: live on A, downtime, degraded on B, live on B; the whole span is the total migration time.]
  • 38. Post-Copy Live Migration (reservation) [Diagram: same timeline; a guest VM shown on both Host A and Host B.]
  • 39. Post-Copy Live Migration (stop and copy) [Diagram: same timeline; a guest VM shown on both Host A and Host B.]
  • 40. Post-Copy Live Migration (post-copy) [Diagram: same timeline; a guest VM shown on both hosts, with page fault and page push traffic between Host A and Host B.]
  • 41. Post-Copy Live Migration (commit) [Diagram: same timeline; one guest VM shown.]
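
The post-copy phase shown on slides 37 to 41 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Hecatonchire implementation: the names `post_copy_migrate` and `push_next` are hypothetical, pages are plain integers, and the pacing of one background push per guest access is an arbitrary choice made for illustration. The guest is assumed to be already running on host B; a page it touches that is still on host A counts as a remote page fault, while a background pusher sends the remaining pages in order.

```python
from collections import deque


def push_next(push_queue, on_source):
    """Push the next page still missing on host B; return True if one was sent."""
    while push_queue:
        candidate = push_queue.popleft()
        if candidate in on_source:
            on_source.remove(candidate)
            return True
    return False


def post_copy_migrate(num_pages, access_trace):
    """Simulate the post-copy phase; return (remote faults, background pushes)."""
    on_source = set(range(num_pages))      # pages that still live only on host A
    push_queue = deque(range(num_pages))   # background push order (assumed sequential)
    faults = 0
    pushes = 0
    for page in access_trace:
        # The guest, running degraded on host B, touches a page.
        if page in on_source:
            on_source.remove(page)         # fetched on demand from host A
            faults += 1
        # Assumed pacing: one background push per guest access.
        if push_next(push_queue, on_source):
            pushes += 1
    # Drain what is left so the migration can commit.
    while push_next(push_queue, on_source):
        pushes += 1
    return faults, pushes
```

Every page is transferred exactly once, either by a fault or by a push, so the two counters always add up to `num_pages`.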

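The RRAIM scheme on slide 20 (a memory region mirrored on two remote nodes) can be sketched in the same spirit. Everything here is an assumption made for illustration: the classes `RemoteNode` and `MirroredRegion` and their methods are hypothetical, the remote nodes are in-process dictionaries rather than RDMA peers, requests are issued one after another instead of simultaneously, and a new node is synchronized from a surviving mirror only.

```python
class RemoteNode:
    """Stand-in for a remote memory sponsor (hypothetical, in-process only)."""

    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.alive = True


class MirroredRegion:
    """Memory region backed by several remote nodes, RRAIM style."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def swap_out(self, page_no, data):
        # Write the page to every live mirror.
        live = [node for node in self.nodes if node.alive]
        if not live:
            raise RuntimeError("no live remote node left")
        for node in live:
            node.pages[page_no] = data

    def page_fault(self, page_no):
        # Take the first answer from a live mirror (sequential here;
        # the slide describes simultaneous requests).
        for node in self.nodes:
            if node.alive and page_no in node.pages:
                return node.pages[page_no]
        raise KeyError(page_no)

    def add_node(self, new_node):
        # Drop failed mirrors and synchronize the newcomer from a survivor.
        survivors = [node for node in self.nodes if node.alive]
        if not survivors:
            raise RuntimeError("no live remote node to synchronize from")
        new_node.pages = dict(survivors[0].pages)
        self.nodes = survivors + [new_node]
```

With two mirrors, marking one node as failed leaves `page_fault` working from the other, which matches the "no immediate effect on the computation node" point on the slide.
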
Editor's notes

  1. Sequential walk: each thread starts reading from (256k / number of threads) * thread id and ends when it reaches the start of the following thread's range. Binary split: we split the memory into as many regions as there are threads; each thread then does a binary split walk within its region. Random walk: each thread reads a page randomly chosen within the overall memory region (no duplicates). We are maxing out the 10GbE bandwidth with iWARP. We suspect that we do not have enough cores to saturate the QDR link. We see almost no noticeable degradation when threads > cores. SoftiWARP has a significant overhead (CPU, latency, memory use).
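
The three access patterns in note 1 could be generated roughly as follows. This is a sketch with several assumptions: the function names are hypothetical, the notes do not define the "binary split walk", so it is taken here to mean visiting the midpoint of a region and then the midpoints of its halves, breadth first, and the random walk is assumed to deal one shared shuffled permutation out to the threads so that no page is read twice.

```python
import random
from collections import deque


def sequential_walk(total_pages, nb_threads, thread_id):
    """Pages from this thread's start offset up to the next thread's start."""
    chunk = total_pages // nb_threads      # any remainder pages are ignored
    start = chunk * thread_id
    return list(range(start, start + chunk))


def binary_split_walk(total_pages, nb_threads, thread_id):
    """Visit midpoints of ever smaller halves of this thread's region."""
    chunk = total_pages // nb_threads
    start = chunk * thread_id
    order = []
    pending = deque([(start, start + chunk)])   # half-open intervals [lo, hi)
    while pending:
        lo, hi = pending.popleft()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        pending.append((lo, mid))
        pending.append((mid + 1, hi))
    return order


def random_walk(total_pages, nb_threads, thread_id, seed=0):
    """This thread's share of one shuffled permutation of all pages."""
    pages = list(range(total_pages))
    random.Random(seed).shuffle(pages)     # same seed gives every thread the same permutation
    return pages[thread_id::nb_threads]
```

For example, with 16 pages and 4 threads, thread 1 walks pages 4 to 7 sequentially, and thread 0's binary split order is 2, 1, 3, 0.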