Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Is OpenStack Neutron production ready for large scale deployments?

669 vues

Publié le

OpenStack Neutron with ML2 OVS has always been a challenging component in terms of performance and scalability. However, in recent releases, several enhancements and bug-fixes have resulted in significant improvements in overall reliability, performance and scalability of Neutron. In this presentation, we will share the results of our testing (both control-plane and data-plane) at large scale and provide a detailed data-driven analysis that explores the true scale limits and bottlenecks of Neutron.

Publié dans : Données & analyses
  • Soyez le premier à commenter

Is OpenStack Neutron production ready for large scale deployments?

  1. 1. Copyright © 2016 Mirantis, Inc. All rights reserved www.mirantis.com Is OpenStack Neutron production ready for large scale deployments? Oleg Bondarev, Senior Software Engineer, Mirantis Elena Ezhova, Software Engineer, Mirantis
  2. 2. Copyright © 2016 Mirantis, Inc. All rights reserved Why are we here? “We've learned from experience that the truth will come out.” Richard Feynman
  3. 3. Copyright © 2016 Mirantis, Inc. All rights reserved Key highlights (Spoilers!) Mitaka-based OpenStack deployed by Fuel 2 hardware labs were used for testing 378 nodes was the size of the largest lab Line-rate throughput was achieved Over 24500 VMs were launched on a 200-node lab ...and yes, Neutron works at scale!
  4. 4. Copyright © 2016 Mirantis, Inc. All rights reserved Agenda Labs overview & tools Testing methodology Results and analysis Issues Outcomes
  5. 5. Copyright © 2016 Mirantis, Inc. All rights reserved Deployment description Mirantis OpenStack with Mitaka-based Neutron ML2 OVS VxLAN/L2 POP DVR rootwrap-daemon ON ovsdb native interface OFF ofctl native interface OFF
  6. 6. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 200 node lab 3 controllers, 196 computes, 1 node for Grafana/Prometheus CPU 2x CPU Intel Xeon E5-2650v3,Socket 2011, 2.3 GHz, 25MB Cache, 10 core, 105 W RAM 8x 16GB Samsung M393A2G40DB0-CPB DDR-IV PC4-2133P ECC Reg. CL13 Networ k 2x Intel Corporation I350 Gigabit Network Connection (public network) 2x Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) Controllers Computes CPU 1x INTEL XEON Ivy Bridge 6C E5-2620 V2 2.1G 15M 7.2GT/s QPI 80w SOCKET 2011R 1600 RAM 4x Samsung DDRIII 8GB DDR3-1866 1Rx4 ECC REG RoHS M393B1G70QH0-CMA Network 1x AOC-STGN-i2S - 2-port 10 Gigabit Ethernet SFP+
  7. 7. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 378 node lab 3 controllers, 375 computes Model Dell PowerEdge R63 CPU 2x Intel, E5-2680 v3, 2.5 GHz, 12 core RAM 256 GB RAM, Samsung, M393A2G40DB0- CPB Networ k 2x Intel X710 Dual Port, 10-Gigabit Storage 3.6 TB, SSD, raid1 - Dell, PERC H730P Mini, 2 disks Intel S3610 Model Lenovo RD550-1U CPU 2x E5-2680v3, 12-core CPUs RAM 256GB RAM Network 2x Intel X710 Dual Port, 10-Gigabit Storage 2x Intel S3610 800GB SSD 2x DP and 3Yr Standard Support 23 176 RD650-2
  8. 8. Copyright © 2016 Mirantis, Inc. All rights reserved Tools Control plane testing Rally Data plane testing Shaker Density testing Heat Custom (ancillary) scripts System resource monitoring Grafana/Prometheus Additionally
  9. 9. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test Control group of resources that must stay persistent no matter what other operations are performed on the cluster. 2 server groups of 10 instances 2 subnets connected by router Connectivity checks by floating IPs and fixed IPs Checks are run between other tests to ensure dataplane operability
  10. 10. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test ● From fixed IP to fixed IP in the same subnet ● From fixed IP to fixed IP in different subnets
  11. 11. Copyright © 2016 Mirantis, Inc. All rights reserved Integrity test ● From floating IP to floating IP ● From fixed IP to floating IP
  12. 12. Copyright © 2016 Mirantis, Inc. All rights reserved Rally control plane tests Basic Neutron test suite Tests with increased number of iterations and concurrency Neutron scale test with many servers/networks
  13. 13. Copyright © 2016 Mirantis, Inc. All rights reserved Rally basic Neutron test suite create_and_update_ create_and_list_ create_and_delete_ ● floating_ips ● networks ● subnets ● security_groups ● routers ● ports Verify that cloud is healthy, Neutron services up and running
  14. 14. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Concurrency 50-100 Iterations 2000-5000 API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete
  15. 15. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency All test runs were successful, no errors. Results on Lab 378 slightly better than on Lab 200. API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete Scenario Iterations/ Concurrency Time Lab 200 Lab 378 create-and-list-routers 2000/50 avg 15.59 max 29.00 avg 12.942 max 19.398 create-and-list-subnets 2000/50 avg 25.973 max 64.553 avg 17.415 max 50.41
  16. 16. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency First run on Lab 200: ● 7.75% failures, concurrency 100 ● 1.75% failures, concurrency 15 Fixes applied on Lab 378: ● 0% failures, concurrency 100 ● 0% failures, concurrency 50 API tests create-and-list-networks create-and-list-ports create-and-list-routers create-and-list-security-groups create-and-list-subnets Boot VMs tests boot-and-list-server boot-and-delete-server-with-secgroups boot-runcommand-delete
  17. 17. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - slow linear growth ● list - linear growth
  18. 18. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_networks trends create network list networks
  19. 19. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations)
  20. 20. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_routers trends create router list routers
  21. 21. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - slow linear growth ● list - linear growth (20 times in 2000 iterations)
  22. 22. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_subnets trends create subnet list subnets
  23. 23. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - low linear growth ● list - linear growth (20 times in 2000 iterations) create_and_list_ports
  24. 24. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_ports trends average load
  25. 25. Copyright © 2016 Mirantis, Inc. All rights reserved Rally high load tests, increased iterations/concurrency Trends create_and_list_networks ● create - stable ● list - linear growth create_and_list_routers ● create - stable ● list - linear growth (6.5 times in 2000 iterations) create_and_list_subnets ● create - low linear growth ● list - linear growth (20 times in 2000 iterations) create_and_list_ports ● gradual growth create_and_list_secgroups ● create 10 sec groups - stable, with peaks ● list - rapid growth rate by 17.2 times
  26. 26. Copyright © 2016 Mirantis, Inc. All rights reserved create_and_list_secgroups trends create 10 security groups list security groups
  27. 27. Copyright © 2016 Mirantis, Inc. All rights reserved Rally scale with many networks 100 networks per iteration 1 VM per network Iterations 20, concurrency 3
  28. 28. Copyright © 2016 Mirantis, Inc. All rights reserved Rally scale with many VMs 1 network per iteration 100 VMs per network Iterations 20, concurrency 3
  29. 29. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Architecture Shaker is a distributed data- plane testing tool for OpenStack.
  30. 30. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L2 scenario Tests the bandwidth between pairs of instances on different nodes in the same virtual network.
  31. 31. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L3 East-West scenario Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks plugged into the same router.
  32. 32. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: L3 North-South scenario Tests the bandwidth between pairs of instances on different nodes deployed in different virtual networks.
  33. 33. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 200, MTU 1500 Standard configuration Bi-directional L3 East-West scenario: ● 561 Mbits/sec upload, 528 Mbits/sec download Intel 82599ES 10-Gigabit
  34. 34. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 200, MTU 9000 Enabled jumbo frames Bi-directional L3 East-West scenario: ● 3615 Mbits/sec upload, 3844 Mbits/sec download x7 increase in throughput Intel 82599ES 10-Gigabit
  35. 35. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Bi-directional test HW offloads-capable NIC Hardware offloads boost with small MTU (1500): ● x3.5 throughput increase in bi-directional test Increasing MTU from 1500 to 9000 also gives a significant boost: ● 75% throughput increase in bi-directional test (offloads on) Intel X710 Dual Port 10-Gigabit
  36. 36. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test HW offloads-capable NIC Hardware offloads boost with small MTU (1500): ● x2.5 throughput increase in download Increasing MTU from 1500 to 9000 also gives a significant boost: ● 41% throughput increase in download test (offloads on) Intel X710 Dual Port 10-Gigabit
  37. 37. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test Near line-rate results in L2 and L3 east-west Shaker tests even with concurrency >50: ● 9800 Mbits/sec in download/upload tests ● 6100 Mbits/sec each direction in bi-directional tests Intel X710 Dual Port 10-Gigabit
  38. 38. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, Full L2 Download test Intel X710 Dual Port 10-Gigabit
  39. 39. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-West Download test Intel X710 Dual Port 10-Gigabit
  40. 40. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, Full L3 North-South Download test Intel X710 Dual Port 10-Gigabit
  41. 41. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-west Bi-directional test Intel X710 Dual Port 10-Gigabit
  42. 42. Copyright © 2016 Mirantis, Inc. All rights reserved Shaker: Lab 378, L3 East-west Bi-directional test Intel X710 Dual Port 10-Gigabit
  43. 43. Copyright © 2016 Mirantis, Inc. All rights reserved Dataplane testing outcomes Neutron DVR+VxLAN+L2pop installations are capable of almost line- rate performance. Main bottlenecks: hardware configuration and MTU settings. Solution: 1. Use HW offloads-capable NICs 2. Enable jumbo frames North-South scenario needs improvement
  44. 44. Copyright © 2016 Mirantis, Inc. All rights reserved Density test Aim: Boot the maximum number of VMs the cloud can manage. Make sure VMs are properly wired and have access to the external network. Verify that data-plane is not affected by high load on the cloud.
  45. 45. Copyright © 2016 Mirantis, Inc. All rights reserved Environment description. 200 node lab 3 controllers, 196 computes, 1 node for Grafana/Prometheus CPU 20 core RAM 128 GB Networ k 2x Intel Corporation I350 Gigabit Network Connection (public network) 2x Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) Controllers Computes CPU 6 core RAM 32 GB Network 1x AOC-STGN-i2S - 2-port 10 Gigabit Ethernet SFP+
  46. 46. Copyright © 2016 Mirantis, Inc. All rights reserved Density test process Heat used for creating 1 network with a subnet, 1 DVR router, and 1 cirros VM per compute node. 1 Heat stack == 196 VMs Upon spawn VMs get their IPs from metadata and send them to the external HTTP server Iteration 1
  47. 47. Copyright © 2016 Mirantis, Inc. All rights reserved Density test process Heat stacks were created in batches of 1 to 5 (5 most of the times) 1 iteration == 196*5 VMs Integrity test was ran periodically Constant monitoring of lab status using Grafana dashboard Iteration k
  48. 48. Copyright © 2016 Mirantis, Inc. All rights reserved Density test results 125 Heat stacks were created Total 24500 VMs on a cluster Number of bugs filed and fixed: 8 Days spent: 3 People involved: 12 Data-plane connectivity lost: 0 times
  49. 49. Copyright © 2016 Mirantis, Inc. All rights reserved Grafana dashboard during density test
  50. 50. Copyright © 2016 Mirantis, Inc. All rights reserved Density test load analysis
  51. 51. Copyright © 2016 Mirantis, Inc. All rights reserved Issues faced ● Ceph failure! ● Bugs ● LP #1614452 Port create time grows at scale due to dvr arp update ● LP #1610303 l2pop mech fails to update_port_postcommit on a loaded cluster ● LP #1528895 Timeouts in update_device_list (too slow with large # of VIFs) ● LP #1606827 Agents might be reported as down for 10 minutes after all controllers restart ● LP #1606844 L3 agent constantly resyncing deleted router ● LP #1549311 Unexpected SNAT behavior between instances with DVR+floating ip ● LP #1609741 oslo.messaging does not redeclare exchange if it is missing ● LP #1606825 nova-compute hangs while executing a blocking call to librbd ● Limits ● ARP table size on nodes ● cpu_allocation_ratio
  52. 52. Copyright © 2016 Mirantis, Inc. All rights reserved Outcomes ● No major issues in Neutron ● No threatening trends in control-plane tests ● Data-plane tests showed stable performance on all hardware ● Data-plane does not suffer from control-plane failures ● 24K+ VMs on 200 nodes without serious performance degradation ● Neutron is ready for large-scale production deployments on 350+ nodes
  53. 53. Copyright © 2016 Mirantis, Inc. All rights reserved Links http://docs.openstack.org/developer/performance- docs/test_plans/neutron_features/vm_density/plan.html http://docs.openstack.org/developer/performance- docs/test_results/neutron_features/vm_density/results.h tml
  54. 54. Copyright © 2016 Mirantis, Inc. All rights reserved Thank you for your time

×