This document discusses Cloudwatt's experience deploying and upgrading OpenContrail. The deployment started with Contrail 1.06 in June 2014, running on a Cisco Nexus fabric with Juniper MX routers terminating L2VPN tunnels. Release 1.06 had issues with operations, Neutron integration, and analytics. The platform was upgraded to 1.10 in two steps: the control plane in one night, and the compute nodes over a few days. Bugs were encountered during and after the upgrade. Ongoing work includes improving Neutron integration, upgrading to the 2.x branch, adding continuous integration, and integrating L3VPNs with OpenStack.
2. About me
● Network engineer since 2006
● Working on OpenStack since its beginning in
2010
● Working on OpenContrail for a year as a
developer and integrator
3. Cloudwatt IaaS
● French public cloud provider
● 3 years of experience with OpenStack
● 1 year of experience with OpenContrail
○ 1 data center
■ 200 compute nodes
■ 3 PB of raw Swift storage
○ OpenStack Icehouse release
4. Contrail in Cloudwatt
● Started with Contrail release 1.06 in June
2014
● Runs on a Cisco Nexus FabricPath fabric
● L2VPN tunnels terminate on two Juniper MX
routers
7. Contrail in Cloudwatt
● 2 Neutron API nodes: neutron-server with the
Contrail plugin
● 2 config nodes: discovery, API, SVC
monitor, schema, IF-MAP server
● 2 control nodes
● 2 analytics nodes
● 2 webUI nodes
8. Contrail in Cloudwatt
[Architecture diagram: pairs of Config, Neutron API, Analytics, Control and WebUI nodes and three vrouters, with links labelled IF-MAP and XMPP]
9. Contrail in Cloudwatt
● Load balancers in front of the APIs and WebUI
● 2 Cassandra clusters of 3 nodes each
● RabbitMQ cluster of 2 nodes
● ZooKeeper cluster of 3 nodes
10. Contrail in Cloudwatt
[Architecture diagram: pairs of Config, Neutron API, Analytics, Control and WebUI nodes and three vrouters, with links labelled IF-MAP and XMPP, plus the Cassandra clusters and AMQP + ZK (ZooKeeper)]
11. Issues on 1.06
● Difficult to operate, upgrade and maintain
without downtime
● Stability and compatibility of the Neutron to
Contrail API translator
● Analytics does not work
● Some memory leaks on the compute nodes
12. Upgrade to 1.10
● After nine months on 1.06
● New version to fix issues and bring new
features (SNAT/LBaaS)
● To keep following upstream
14. Upgrade to 1.10
We decided to do it in 2 steps:
1. Control plane (in one night)
○ Config (slave schema first)
○ Control
○ Analytics
○ WebUI
○ Neutron API
15. Upgrade to 1.10
2. Data plane (over a few days)
○ upgrade/bootstrap spare compute nodes to 1.10 and
add them to the available compute pool
○ remove all running 1.06 compute nodes from the
available pool
○ give customers on those 1.06 nodes a time slot to move
their VMs before upgrading each node to 1.10 (no live
migration)
○ then open champagne bottles!
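
The data plane steps above can be sketched as a small planning function. This is only an illustration under stated assumptions: `ComputeNode`, `plan_data_plane_upgrade` and the `schedulable` flag are hypothetical names, not part of OpenContrail or OpenStack, and the real procedure relied on customers moving their own VMs.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    """Hypothetical model of a compute node, for illustration only."""
    name: str
    version: str              # vrouter release running on the node
    vms: int = 0              # customer VMs still hosted on the node
    schedulable: bool = True  # may the scheduler place new VMs here?


def plan_data_plane_upgrade(nodes, spares, target="1.10"):
    """Return the ordered list of actions for the data plane upgrade."""
    actions = []
    # Step 1: bootstrap the spare nodes on the target release and add them to the pool.
    for spare in spares:
        spare.version = target
        spare.schedulable = True
        actions.append(f"add {spare.name} ({target}) to pool")
    # Step 2: take every old node out of the pool so no new VM lands on it.
    old_nodes = [n for n in nodes if n.version != target]
    for node in old_nodes:
        node.schedulable = False
        actions.append(f"remove {node.name} ({node.version}) from pool")
    # Step 3: upgrade only the nodes that customers have emptied (no live migration).
    for node in old_nodes:
        if node.vms == 0:
            node.version = target
            node.schedulable = True
            actions.append(f"upgrade {node.name} to {target} and re-add to pool")
        else:
            actions.append(f"wait: {node.name} still hosts {node.vms} VMs")
    return actions
```

Running the function again once more VMs have moved yields the next batch of nodes to upgrade, which matches the "over a few days" pace of the procedure.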
16. Bugs met during the upgrade
● vrouter 1.06 cannot coexist with 1.10 using MPLSoUDP
encapsulation => switch to MPLSoGRE while both
versions run side by side
● SNAT/LBaaS code does not take the vrouter version
into account
● The whole Contrail API slowed down because the Neutron
Contrail plugin code moved from neutron-server to the
Contrail API
● ZooKeeper timeouts
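
The MPLSoUDP workaround above amounts to a simple rule: while 1.06 and 1.10 vrouters coexist, only offer MPLSoGRE. A minimal sketch of that rule follows. The function name is hypothetical, and the order returned for a homogeneous fleet (MPLSoUDP first) is an assumption, since the slides do not give the normal priority list.

```python
def encapsulation_priorities(vrouter_versions, legacy="1.06"):
    """Pick the tunnel encapsulation order for the current fleet (illustrative)."""
    versions = set(vrouter_versions)
    # The fleet is "mixed" when the legacy release runs next to any other release.
    mixed = legacy in versions and len(versions) > 1
    if mixed:
        # MPLSoUDP did not interoperate between 1.06 and 1.10: leave it out.
        return ["MPLSoGRE"]
    # Assumed default order once every vrouter runs the same release.
    return ["MPLSoUDP", "MPLSoGRE"]
```

Once the last 1.06 node is upgraded, the same call returns the assumed default order again, so the workaround is temporary by construction.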
17. Bugs met after the upgrade
● Memory leak in the kernel module data path
● Hold flow count leak in the kernel module data
path (workaround: restart the vrouter agent)
● 13 Cloudwatt patches added to the 1.10
upstream release:
https://review.opencontrail.org/#/q/status:open+branch:R1.10,n,z
18. Bugs still present in 1.10
● Schema slave->master switchover takes ~20 mins
● Logging configuration
● Some 5xx errors still appear on the Contrail
API
● No live upgrade of a compute node without
downtime (do we need it?)
19. My wishlist to Santa SDN
● That people make more use of
https://blueprints.launchpad.net/opencontrail
● A stable master before a new branch is created
● Use http://semver.org to number releases
● A more community-oriented Contrail team
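
To illustrate the semver wish: under Semantic Versioning a release number has the form MAJOR.MINOR.PATCH with no leading zeros, so "1.06" is not a valid version. The sketch below is a simplification that handles only the core MAJOR.MINOR.PATCH part (no pre-release or build metadata); `parse_core_version` is a hypothetical helper name.

```python
import re

# Core semver pattern: three dot-separated integers without leading zeros.
_CORE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_core_version(text):
    """Parse MAJOR.MINOR.PATCH; return a tuple of ints, or None if invalid."""
    match = _CORE.fullmatch(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

Because the result is a tuple of integers, ordinary tuple comparison gives the expected precedence, for example `parse_core_version("1.6.0") < parse_core_version("1.10.0")`.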
20. H2 2015 to-do
● Improve Neutron Contrail plugin code
https://review.opencontrail.org/10123
● Upgrade to 2.x branch
● Build a CI/CD on master
○ build and deploy daily
○ run opencontrail sanity
○ run functional non-regression tests
○ run performance non-regression tests
● OpenStack L3VPN integration
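
The daily CI/CD loop in the list above could be driven by a runner that executes the stages in order and stops at the first failure. This is only a sketch: `run_pipeline` and the stage callables are hypothetical placeholders, not an existing OpenContrail tool, and each real stage would call the actual build, deploy or test job.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order; stop at the first failing step."""
    results = []
    for name, step in stages:
        ok = bool(step())
        results.append((name, ok))
        if not ok:
            # Later stages are pointless on a broken build or deployment.
            break
    return results


# Placeholder stages mirroring the to-do list; each lambda stands in for a real job.
DAILY_STAGES = [
    ("build", lambda: True),
    ("deploy", lambda: True),
    ("opencontrail sanity", lambda: True),
    ("functional non-regression", lambda: True),
    ("performance non-regression", lambda: True),
]
```

Stopping early keeps the daily report short: the last entry in the result list is always the stage to investigate when a run fails.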