RSS++

  1. RSS++: load and state-aware receive side scaling. Tom Barbette, Georgios P. Katsikas, Gerald Q. Maguire Jr., and Dejan Kostić. [Title figure: a 100G NIC dispatching traffic across several cores]
  2. Networking today: how to dispatch dozens of millions of packets per second to many cores? [Two plots: Ethernet standard speed (Gbps) vs. year and number of CPU cores vs. year, 1980-2020, both rising sharply; 100G highlighted] Data from Karl Rupp / Creative Commons Attribution 4.0 International Public License
  3. Hello SOTA! Sharding: key-value stores (Minos [NSDI'19], Herd [SIGCOMM'14], MICA [NSDI'14], Chronos [SoCC'12], CPHASH [PPoPP'12]), packet processing / NFV (Metron [NSDI'18], NetBricks [OSDI'16], SNF [PeerJ'16], FastClick [ANCS'15], Megapipe [OSDI'12]), and network stacks (ClickNF [ATC'18], StackMap [ATC'16], mTCP [NSDI'14], F-Stack [Tencent Cloud 13], Affinity-Accept [EuroSys'12]). How to dispatch dozens of millions of packets per second to many cores?
  4. A sharded testbed: Ubuntu 18.04, 18 cores, 100G NIC. RSS dispatches packets to 18 queues, one iPerf 2 instance pinned per core; an iPerf 2 client (-c) requests 100 TCP flows.
  5. Sharding's problem: high imbalance → underutilization and high tail latency
  6. RSS++: rebalance groups of flows from time to time • Much better load spreading • Much lower latency → Latency reduced by 30% → Tail latency reduced by 5X
  7. RSS++: rebalance groups of flows from time to time • Much better load spreading • Much lower latency • Opportunity to release 6 cores for other applications → 1/3 of the resources freed
  8. Receive Side Scaling (RSS) [Figure: the packet hash indexes an indirection table (1, 2, 1, 2, 1, …) whose entries select the serving core, e.g. Core 1 or Core 2]
  9. Receive Side Scaling (RSS): hashing (≠ uniform spreading) over mice and elephants → high load imbalance. RSS is flow-aware but does no load balancing. [Figure: hash → indirection table → Core 1 / Core 2] (A minimal lookup sketch follows below.)
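To make the indirection-table lookup concrete, here is a minimal sketch in C. It is illustrative only: real NICs use a keyed Toeplitz hash, and the table size (`RETA_SIZE`) and the hash below are assumptions, not the hardware's values.

```c
#include <stdint.h>

#define RETA_SIZE 128   /* indirection table size; hardware dependent */

/* Indirection table: each bucket stores the core/queue that serves it. */
static uint16_t reta[RETA_SIZE];

/* Illustrative 5-tuple hash; real RSS uses a keyed Toeplitz hash. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

/* RSS dispatch: the hash picks a bucket, the bucket picks a core.
 * All packets of a flow hash identically, so the flow sticks to one core
 * until the bucket it falls into is remapped. */
uint16_t rss_select_core(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport)
{
    uint32_t bucket = flow_hash(saddr, daddr, sport, dport) % RETA_SIZE;
    return reta[bucket];
}
```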
  10. An opposite approach: packet-based load-balancing. Fine-grained load balancing, but no flow-awareness.
  11. [Trade-off axes: flow-awareness vs. fine-grained load balancing]
  12. [Trade-off axes: flow-awareness vs. fine-grained load balancing, animation step]
  13. RSS++'s challenge: strike the right balance between perfect load spreading (fine-grained load balancing) and sharding (flow-awareness).
  14. RSS++: rebalance some RSS buckets from time to time [Figure: hash → indirection table → Core 1 / Core 2, with one bucket entry being remapped]
  15. RSS++: rebalance some RSS buckets from time to time. RSS++ strikes the right balance between perfect load spreading and sharding by migrating the RSS indirection buckets based upon the output of an optimization algorithm that evens the load.
  16. RSS++: rebalance some RSS buckets from time to time. RSS++ strikes the right balance between perfect load spreading and sharding by migrating the RSS indirection buckets based upon the output of an optimization algorithm that evens the load, and handles stateful use-cases with a new per-bucket flow-table algorithm that migrates the state with the buckets.
  17. RSS++
  18. RSS++ overview [Figure: hash → indirection table (2, 2, 1, 2, 1, …) → Core 1 / Core 2]
  19. RSS++ overview: per-bucket packet counter tables feed a balancing timer (10 Hz to 1 Hz). Counters come from an XDP BPF program [CoNEXT'18] in Linux or an in-app function call in DPDK; CPU load is measured as useful cycles / application cycles (kernel CPU load in Linux). A greedy iterative solver remaps indirection-table entries, which are written back via the ethtool API (Linux) or the DPDK APIs. In 85% of the cases, a single run is enough to reach a 0.5% squared-imbalance margin, in 25 µs. (A sketch of this loop follows below.)
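A minimal sketch of the control loop this slide describes. Everything here is illustrative: `bucket_packets`, `core_load`, `nic_update_reta()` and `rsspp_greedy_solve()` are placeholder names, and the real implementations feed counters from an XDP BPF program (Linux) or in-app calls (DPDK) and reprogram the table through ethtool or the DPDK RETA APIs.

```c
#include <stdint.h>
#include <stdbool.h>

#define NB_BUCKETS 128
#define NB_CORES   18

/* Packets counted per indirection-table bucket since the last round
 * (filled by the XDP program in Linux or by the DPDK datapath). */
extern uint64_t bucket_packets[NB_BUCKETS];
/* Measured per-core load: useful cycles / application cycles. */
extern double   core_load[NB_CORES];
/* Current bucket -> core assignment (mirrors the NIC indirection table). */
extern uint16_t reta[NB_BUCKETS];

/* Hypothetical helper: push the new table to the NIC
 * (ethtool -X on Linux, the RETA update calls in DPDK). */
void nic_update_reta(const uint16_t *table, int size);

/* Greedy solver (sketched with the backup slides): returns true if any
 * bucket changed core, updating reta[] in place. */
bool rsspp_greedy_solve(const double *bucket_frac_load,
                        const double *core_load,
                        uint16_t *reta, int nb_buckets, int nb_cores);

/* One balancing round, run from a 10 Hz .. 1 Hz timer. */
void rsspp_balance_round(void)
{
    double   frac_load[NB_BUCKETS];          /* per-bucket share of CPU load */
    uint64_t core_pkts[NB_CORES] = {0};

    for (int b = 0; b < NB_BUCKETS; b++)
        core_pkts[reta[b]] += bucket_packets[b];

    /* A bucket's fractional load is its share of its core's packets,
     * scaled by that core's measured CPU load. */
    for (int b = 0; b < NB_BUCKETS; b++) {
        uint16_t c = reta[b];
        frac_load[b] = core_pkts[c]
            ? (double)bucket_packets[b] / core_pkts[c] * core_load[c]
            : 0.0;
    }

    if (rsspp_greedy_solve(frac_load, core_load, reta, NB_BUCKETS, NB_CORES))
        nic_update_reta(reta, NB_BUCKETS);   /* reprogram only on change */

    for (int b = 0; b < NB_BUCKETS; b++)
        bucket_packets[b] = 0;               /* start a fresh counting window */
}
```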
  20. Stateful use-cases: state migration • RSS++ migrates some RSS buckets → packets from migrated flows need to find their state [Figure: per-core flow tables #1 and #2; a migrated flow's packets arrive at a core whose table has no entry for them]
  21. Stateful use-cases: state migration • RSS++ migrates some RSS buckets → packets from migrated flows need to find their state • Possible approach: a shared flow table
  22. Stateful use-cases: state migration • RSS++ migrates some RSS buckets → packets from migrated flows need to find their state • Possible approach: a shared flow table • RSS++ (DPDK implementation only): one hash-table per indirection bucket, reached through a flow-pointer table indexed by the packet hash; when a bucket moves (e.g. to Core 3), the new core queues its packets (nearly never needed) until the previous core has finished handling all packets of that bucket (sketched below).
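A minimal sketch of the per-bucket flow-table idea described above. The structure and helper names (`enqueue_until_drained`, `process_with_state`) are assumptions for illustration, not the actual DPDK implementation; the point is that state is partitioned per RSS bucket, so migrating a bucket hands over one small table and the fast path stays lock-free.

```c
#include <stdint.h>
#include <stdbool.h>

#define NB_BUCKETS 128

struct flow_table;                    /* opaque per-bucket hash table */
struct pkt;                           /* opaque packet                */

struct bucket_state {
    struct flow_table *flows;         /* flow state for this bucket only   */
    uint16_t owner_core;              /* core currently serving the bucket */
    bool migrating;                   /* set while the old owner drains    */
};

static struct bucket_state buckets[NB_BUCKETS];

/* Hypothetical helpers standing in for the real queueing and NF logic. */
void enqueue_until_drained(struct bucket_state *b, struct pkt *p);
void process_with_state(struct flow_table *flows, struct pkt *p);

/* Called by the receiving core for each packet, with
 * bucket = rss_hash % NB_BUCKETS (the same index the NIC used). */
void handle_packet(struct pkt *p, uint32_t bucket, uint16_t this_core)
{
    struct bucket_state *b = &buckets[bucket];

    if (b->migrating) {
        /* The previous owner is still draining packets of this bucket:
         * queue instead of touching the table concurrently (nearly never). */
        enqueue_until_drained(b, p);
        return;
    }
    b->owner_core = this_core;
    process_with_state(b->flows, p);  /* look up / update state, forward */
}
```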
  23. Evaluation
  24. Evaluation: load imbalance, defined as (N_most_loaded − N_least_loaded) / N_least_loaded. A 15 Gbps trace (~80K active flows/s) is replayed towards the DUT. (See the small sketch below.)
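A small sketch of the imbalance metric as written on this slide, computed from per-core counts (packets, cycles, or load samples); the array name is illustrative.

```c
#include <stdint.h>

/* Load imbalance = (N_most_loaded - N_least_loaded) / N_least_loaded. */
double load_imbalance(const uint64_t *per_core, int nb_cores)
{
    uint64_t lo = per_core[0], hi = per_core[0];
    for (int i = 1; i < nb_cores; i++) {
        if (per_core[i] < lo) lo = per_core[i];
        if (per_core[i] > hi) hi = per_core[i];
    }
    return lo ? (double)(hi - lo) / (double)lo : 0.0;
}
```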
  25. Evaluation: load imbalance of packet-based methods. Packet-based methods have a very good balance! (15 Gbps trace, ~80K active flows/s, replayed towards the DUT)
  26. Evaluation: load imbalance of RSS (15 Gbps trace, ~80K active flows/s, replayed towards the DUT)
  27. Evaluation: load imbalance of stateful methods. Without migration, the other approaches cannot really do much good! (15 Gbps trace, ~80K active flows/s, replayed towards the DUT)
  28. Evaluation: load imbalance of RSS++, up to 12x (avg. ~5x) lower imbalance. (15 Gbps trace, ~80K active flows/s, replayed towards the DUT)
  29. Service chain at 100G: FW+NAT (the 15 Gbps trace, ~80K active flows/s, accelerated up to 100 Gbps; 39K-rule FW). RSS is not able to fully utilize new cores.
  30. Service chain at 100G: FW+NAT (the 15 Gbps trace accelerated up to 100 Gbps; 39K-rule FW). RSS++ shows linear improvement with the number of cores; RSS is not able to fully utilize new cores.
  31. Service chain at 100G: FW+NAT (the 15 Gbps trace accelerated up to 100 Gbps; 39K-rule FW). Sharing state between cores leads to poor performance; RSS++ shows linear improvement with the number of cores; RSS is not able to fully utilize new cores.
  32. Conclusion • State-aware, NIC-assisted scheduling to solve a problem that will only get worse – No dispatching cores – Sharded approach (no OS scheduling) • A new state migration technique – Minimal state “transfer” – No lock in the datapath • Up to 14x lower 95th-percentile latency, no drops, and 25-37% fewer cores • Linux (via kernel API + small patch) and DPDK implementations, fully available, with all experiment scripts
  33. Thanks! github.com/rsspp/ In the paper: – How the solver works – More evaluations > Particularly tail-latency studies > Comparison with Metron’s traffic-class dispatching – More state of the art – Future work – Discussions about use in other contexts: > KVS load-balancing > Dispatching using multiple cores in a pipeline > NUMA – Trace analysis. This work is supported by SSF and ERC.
  34. [Blank slide]
  35. Backup slides: SOTA
  36. Solutions for RSS’s imbalance • Sprayer [HotNets’18] / RPCValet [SOSP’19] – Forget about flows, do per-packet dispatching → stateful use-cases are dead → even stateless ones are sometimes inefficient • Metron [NSDI’18] – Compute traffic classes, and split/merge classes among cores → misses load-awareness; traffic classes may not hash uniformly • Affinity-Accept [EuroSys’12] – Redirect connections in software to other cores, and re-program some RSS entries when they contain mostly redirected connections → load imbalance at best as good as “SW Stateful Load” → we need migration → software dispatching to some extent
  37. SOTA: intra-server LB • Dispatcher cores (Shinjuku*, Shenango): still need RSS++ to dispatch to the many dispatching cores needed for 100G; inefficient • Shuffling layer (ZygOS*, Affinity-Accept, Linux): why pay for cache misses when the NIC can do it? They do not support migration → high imbalance. *BUT we miss the mixing of multiple applications on a single core
  38. Our contributions • We solve the packet dispatching problem by migrating the RSS indirection buckets between shards based upon the output of an optimization algorithm – Without the need for dispatching cores • Dynamically scale the number of cores → Avoids the typical 25% over-provisioning → Order-of-magnitude lower tail latency • Compensate for occasional state migration with a new stateful per-bucket flow table algorithm: – Prevents packet reordering during migration – 20% more efficient than a shared flow table → Stateful, near-perfect intra-server load-balancing, even at the speed of 100 Gbps links
  39. Backup slides: RSS++ algorithm
  40. RSS++ algorithm [Figure: per-bucket counting tables for CPU 1 and CPU 2 (bucket counts 3112, 2421, 2622, 1231, 502) and measured CPU loads of 90% and 40%]
  41. RSS++ algorithm: compute each bucket’s fractional load, e.g. bucket #1’s share of its CPU’s packets is 1231 / (1231 + 2622) = 31%, and 31% × 40% CPU load = 12% fractional load. The per-bucket fractional loads here are 12%, 27%, 8%, 46%, 36%. The average CPU load is 65% (CPUs at 90% and 40%, i.e. +25% / -25%); the RSS++ problem solver remaps indirection-table entries to bring the loads to 82% and 48% (+17% / -17%).
  42. RSS++ algorithm: continuing, the solver keeps migrating buckets towards the 65% average, here reaching 76% and 54% (+11% / -11%). In 85% of the cases, a single run is enough to be within a 0.5% squared-imbalance margin, in 25 µs.
  43. Solver: if you like math, go to the paper. We use a greedy, non-optimal approach because: → we don’t care about the optimal solution → the state of the art showed too-slow resolution times for multi-way number partitioning
  44. Greedy approach (see the sketch below): 1. Sort buckets by descending fractional load 2. Sort underloaded cores by ascending load 3. Dispatch the most loaded buckets to underloaded cores, allowing over-moves by a threshold 4. Restart up to 10 times using different thresholds to find an inflection point. In 85% of the cases, a single run is enough to be within a 0.5% squared-imbalance margin, in 25 µs.
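A simplified sketch of those four steps. This is an illustrative reimplementation rather than the RSS++ solver itself: the restart loop over thresholds (step 4) is left to the caller, and the data layout is reduced to the bare minimum.

```c
#include <stdlib.h>

struct bucket { double frac_load; int core; };
struct core   { int id; double load; };

static int by_desc_load(const void *a, const void *b)
{
    double d = ((const struct bucket *)b)->frac_load
             - ((const struct bucket *)a)->frac_load;
    return (d > 0) - (d < 0);
}

static int by_asc_load(const void *a, const void *b)
{
    double d = ((const struct core *)a)->load - ((const struct core *)b)->load;
    return (d > 0) - (d < 0);
}

/* One greedy pass: move the heaviest buckets of overloaded cores to the
 * least loaded cores, tolerating an overshoot of `threshold` above average. */
void greedy_pass(struct bucket *buckets, int nb, struct core *cores, int nc,
                 double threshold)
{
    double avg = 0;
    for (int i = 0; i < nc; i++) avg += cores[i].load;
    avg /= nc;

    qsort(buckets, nb, sizeof(*buckets), by_desc_load);   /* step 1 */
    qsort(cores,  nc, sizeof(*cores),  by_asc_load);      /* step 2 */

    for (int b = 0; b < nb; b++) {                        /* step 3 */
        struct core *src = NULL, *dst = &cores[0];
        for (int c = 0; c < nc; c++)
            if (cores[c].id == buckets[b].core) src = &cores[c];
        if (!src || src->load <= avg)    /* only unload overloaded cores */
            continue;
        /* pick the currently least loaded core (loads change as we move) */
        for (int c = 1; c < nc; c++)
            if (cores[c].load < dst->load) dst = &cores[c];
        if (dst->load + buckets[b].frac_load <= avg + threshold) {
            src->load -= buckets[b].frac_load;
            dst->load += buckets[b].frac_load;
            buckets[b].core = dst->id;                    /* reassign bucket */
        }
    }
    /* Step 4 (not shown): the caller repeats with different thresholds,
     * keeping the assignment with the lowest squared imbalance. */
}
```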
  45. Stateful use-cases: state migration • RSS++ migrates some RSS buckets → packets from migrated flows need to find their state • Possible approach: a shared, as-efficient-as-possible hash-table [Figure: CPU 1 and CPU 2 both hitting a single shared hash-table → contention]
  46. RSS++: rebalance some RSS buckets from time to time → 30% lower average latency, 4-5x lower standard deviation and tail latency
  47. Backup slides: RSS++ implementation
  48. LibNICScheduler
  49. Backup slides: Evaluation
  50. [Untitled backup slide]
  51. [Untitled backup slide]
  52. [Untitled backup slide]
  53. [Untitled backup slide]
  54. Evaluation: load imbalance of Metron [Graph: load imbalance of RSS, RSS++, RR, Sprayer, and the stateful methods]
  55. CPU frequency fixed at 1 GHz, doing some fixed artificial per-packet workload
  56. Evaluation: state migration. Forwarding 1500-byte UDP packets from 1024 concurrent flows of 1000 packets each, classified in either a single thread-safe Cuckoo hash-table or a per-bucket hash-table.
  57. Evaluation: firewall only (trace accelerated up to 100 Gbps) • RSS cannot always fully utilize more cores due to load imbalance • Even in a stateless case, a packet-based approach is harmful to the cache
  58. Evaluation: 39K-rule firewall at 100G (trace accelerated up to 100 Gbps) • Even in a stateless case, a packet-based approach is harmful to the cache • We need hardware dispatching
  59. Stateful evaluation at 100G: FW+NAT+DPI (Hyperscan [Wang 2019])
  60. NFV evaluation
  61. Backup slides: RSS video
  62. Why does RSS++ work? [Figure: hash → indirection table (2, 2, 1, 2, 1, …) → CPU 1 / CPU 2]
  63. Why does RSS++ work? [Animation: the indirection table]
  64. Why does RSS++ work? [Animation: the indirection table]
  65. Why does RSS++ work? [Animation: the indirection table]
  66. Why does RSS++ work? [Graph: number of packets per bucket index]
  67. Watch RSS live! The internet is not random: • buckets have up to 1000x imbalance • there is stickiness over time → the solution at t0 is mostly valid at t1
  68. Backup slides: Discussion
  69. Why sharding in Linux? • Unpublished result of a 3-second sampling • Still much to do to take real advantage of sharding
  70. Multiple applications • To keep all the advantages of sharding, one should slightly modify our implementation to use a set of RSS queues per application, and exchange cores through a common pool of available cores • Another idea would be to combine slow applications on one core, and reduce the problem of polling
  71. Multiple NICs • One would have to determine how much of the actual load is due to each input
  72. Background noise • A small background noise will make the load go higher and therefore buckets will get evicted • A high background noise would require modifying the algorithm to subtract it from the capacity of a CPU, noting for example that a CPU is at 60% out of 70% of usable capacity; otherwise the “bucket fractional load” will be disproportionate to the load of other cores (see the sketch below).
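A tiny sketch of the capacity adjustment hinted at here (variable names are illustrative, not from the implementation): treat a core with, say, 30% background load as having only 70% usable capacity and normalize the NIC's share against that.

```c
/* Fraction of a core's NIC-usable capacity currently consumed when a known
 * background load shares the core (illustrative, not RSS++ code). */
double effective_nic_load(double measured_load, double background_load)
{
    double usable_capacity = 1.0 - background_load;   /* e.g. 1.0 - 0.30 = 0.70 */
    double nic_load        = measured_load - background_load;
    return usable_capacity > 0.0 ? nic_load / usable_capacity : 1.0;
}
/* Example matching the slide: NIC work of 60% out of 70% usable capacity is
 * effectively ~86% full, i.e. effective_nic_load(0.90, 0.30) ~= 0.857. */
```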
  73. Oscillation • We don’t care

Editor's notes

  • (Do not say the names if the chair introduces me)
    (Otherwise, after mentioning the joint work, do not say my own name)
  • Hundred gigabit NICs are becoming a commodity in datacenters.
    Those NICs have to dispatch dozens of millions of packets to many-core CPUs.
    CLICK
    And both of those numbers, the Ethernet speed and the number of cores, are increasing dramatically.
    So the question that I’ll address in this talk, [how to …], which is already a problem today, will be even more of a problem tomorrow.
  • If we look at the recent SOTA in high-speed software networking, a lot of recent work on key-value stores CLICK and on packet processing and network function virtualization advocates the use of sharding, as do all recent network stacks, which are sharded. CLICK
  • So what is this sharding about? To answer that, I’ll show you our sharded testbed. We have a computer with 18 cores and a hundred-gigabit NIC. We configure the NIC so it dispatches packets to 18 queues, one per core.
    On each core, we run an instance of the application, in our case iperf 2. The application is pinned to the core, and that’s the idea of sharding. The computer is divided into independent shards; one can almost consider each core as a different server. The advantage of this is that we avoid any shared data structure, any contention between CPU cores.
    If there were no problem with sharding, we would not have a paper today. CLICK
    So to showcase the problem, we run an iperf client that will request 100 TCP flows. CLICK
    One important point: the NIC dispatches packets to the cores using RSS, basically hashing packets so that packets of the same flow go to the same core.
  • Sprayer [Hugo Sadok 2018], HotNets.
  • Sprayer [Hugo Sadok 2018], HotNets.
  • We see again, now that the load is higher, that RSS is still not able to fully utilize new cores, even with 6 more cores than RSS++.




  • 20% more efficient, an order of magnitude lower latency with a high number of cores
  • With this, I will thank you for listening and will be happy to take any questions you may have

  • 25: no joke
  • Do this in an animation
  • One library for NIC-driven scheduling:
    With multiple scheduling strategies, one of them being RSS++
    Two “integrations”:
    Linux, reading packets using an XDP BPF program, and writing the indirection table using the ethtool API
    DPDK, counting packets through function calls and programming the NIC with DPDK’s API
  • 20% more efficient
    Order of magnitude better latency
  • “Controlling Parallelism in a Multicore Software Router”

    TODO: limit at 100G
  • “Controlling Parallelism in a Multicore Software Router”

    TODO: limit at 100G
  • TODO: make in multiple graphs
    TODO: numbers
  • If we look at the number of packets received by each bucket, and map it as per a default indirection table, we can see that the number of packets received by each core is very disproportionate.
    Moreover, we see that the load of each bucket is not completely random; some buckets tend to be highly loaded, or stay loaded for some time.
    So what we propose in RSS++ is to migrate a few of those overloaded buckets from time to time, to even the load between all CPUs.
