Provisioning VPNs on a heterogeneous network with OpenDaylight and NETCONF (b<>com)
1. Gwenael LAMBROUIN
Santa Clara, CA, USA
2015-07-28
/ Provisioning VPNs on a
heterogeneous network
infrastructure with
OpenDaylight and NETCONF /
2. 7/28/2015 / 2Diffusion : confidentiel7/28/2015 / 2Diffusion : confidentiel7/28/2015 / 2Diffusion : confidentiel / 2Diffusion : public
Contents
› Part 1
• The Problem to Solve
• Key Technical Issues
• Why OpenDaylight (ODL)?
› Part 2
• Our implementation
› Part 3
• Ongoing work with ODL
• Feedback on ODL
3.
›faster service delivery
& improved self-service
›through automation
›on an example: L2VPN
The Problem to Solve
4.
The Problem to Solve
5.
The Problem to Solve
6.
The Problem to Solve
8.
›Simplifies service logic
›Makes it possible to
change the hardware
›What is the right level?
Abstraction
9.
›Configurations can fail
›Errors must be detected
›A recovery plan is needed
Error Management
24.
›Error recovery strategies
•rollback on error
•continue on error
›Which strategy is better?
•It depends!
L2VPN Service Orchestrator
25.
Contents
› Part 1
• The Problem to Solve
• Key Technical Issues
• Why OpenDaylight (ODL)?
› Part 2
• Our implementation
› Part 3
• Ongoing work with ODL
• Feedback on ODL
26.
›Move orchestration to ODL
›Experiment with data
persistence
›Develop a CLI driver
Ongoing Work with ODL
27.
›Cons
•Hard to learn, hard to master
›Pros
•Stable
•Bug & patch processes work
•Friendly & responsive community
Feedback on ODL
28.
›ODL can manage network
devices
›We are ready to share our
code
›… if there is interest!
Conclusion
Gwenael Lambrouin is an R&D engineer at b<>com.
b<>com is a private research institute located in Brittany, France (http://b-com.com/).
This talk explains how we used OpenDaylight (ODL) to automate the deployment of a network service (VPN) over a heterogeneous infrastructure of routers,
using the management plane and the standard Network Configuration Protocol (NETCONF).
We did this on a proof-of-concept platform.
Today, from order to delivery, it can take days for a network operator to deploy a network service for a client.
Automation is an enabler:
to speed up service delivery: from days to seconds or minutes
to make self-service more responsive: the client can order its service online, and the service can be deployed quickly
Our starting point: a shared network infrastructure spanning several geographical sites.
It can be managed by the company's IT services or by a network operator.
Heterogeneous infrastructure:
multi-vendor
with both virtual & physical devices (on the way to network virtualization)
On the drawing:
sites are interconnected across an IP network (the cloud in the middle)
the routers: devices from carrier-grade manufacturers, with VPN and NETCONF support
multi-vendor (“light blue” is Cisco, “dark blue” is Juniper)
physical & virtual (v in a circle = virtual) => vendor supplied VM running the router firmware over commodity hardware (x86 servers)
on each router: there are some ports available to create VPNs (the management ports & interconnection ports are not shown here)
On top of the network: the users can build layer 2 VPNs that can be seen as logical switches.
Here: user orange creates a VPN between site 1 & site 2, with 1 port on site 1 and 1 port on site 2
Machines plugged into the ports of the logical switch belong to the same broadcast domain (the same LAN): users can do what they want in that domain: manage their IP address plan, use IPv6, deploy DHCP servers, test new protocols, …
Under the hood: the layer 2 VPNs are implemented using VPLS, the Virtual Private LAN Service
User purple creates a VPN between the 3 sites, with 2 ports on site 1, 1 port on site 2 and 1 port on site 3.
This user gets a private virtual network of their own: its traffic is isolated from the orange user’s traffic.
Two of the main technical issues we faced: abstraction and error management.
There are others, not described in this presentation but occasionally mentioned: resource allocation/management and orchestration.
We want to make the service logic
(in our case: the application that will automate the management of the VPNs)
independent of the hardware and of the technical details of the management plane
Why:
[1] to simplify the development & maintenance of service management applications
so that the service app developers can focus on service management: resource allocation, service lifecycle
[2] to decouple the app logic and the network infra.
Then it becomes possible to change the network infra (vendors, management plane protocol, …) without changing the app
What is the right level of abstraction? A compromise must be found:
too high a level of abstraction: some useful features will not be exposed
too low a level of abstraction: loss of hardware independence because of device-specific features; too many details for the programmer to handle
(in any case: device-specific features must be handled with capabilities, aka optional features)
The right level of abstraction depends on the service: you need just enough details to be able to configure your service on all your devices
A device configuration can fail during an automated process:
error reported by the device (bad request, device in a wrong state, device locked, …)
device unreachable (device crash, physical network problem, device misconfiguration, access rights changed, bad firewall configuration change, …)
A service automation application must be ready to deal with this, which can be complex and add a lot of code.
What is needed:
error detection (error reported by device, connection timeout)
recovery strategy, eg:
undo all
keep as is + manual fix + resume
ODL is open source
useful for proofs of concept: can be tweaked
open source is part of the research strategy at b<>com (leverage effect)
backed by industry
carrier-grade device vendors support ODL
reassures b<>com management => seen as a guarantee of available development resources & software robustness
NETCONF
at the beginning: we looked for an open source SDN controller providing some way to configure legacy network devices, e.g. via NETCONF or CLI
ODL provides NETCONF
when we started the project, ODL was the only open source controller to support NETCONF
a CLI driver would have been acceptable, but we did not find one anywhere
MD-SAL
The MD-SAL (Model-Driven Service Abstraction Layer) is at the heart of ODL
It provides abstraction thanks to YANG
YANG is a modeling language for structured configuration data & remote calls (API)
ODL interfaces are specified with YANG
Evaluate ODL
we had ODL on our radar
beyond the buzz, we wanted to test ODL on a concrete use case
(it’s good to have a use case to evaluate a tool:
without one, you wander, get blocked at some point and move on to something else)
We have been using ODL for about one year (we started during the Helium development cycle)
This is an overview of our implementation
Self-service user interface: used by clients to order their L2VPN: eg “I want a logical switch with 1 port on site 1 and 1 port on site 2”
(web-based, uses open source software: Java, Spring, Maven, AngularJS, JQuery, JQueryUI, D3JS)
L2VPN service orchestrator:
implements the service logic to automate L2VPN management (creation/deletion): resource allocation, network operation planning, network operation execution, error recovery.
exposes a service API used by the self-service UI
stores the states of the services in a database
Java web application, uses open source software
Generic device driver:
provides abstraction to make the service logic independent of the network devices & management plane
uses the NETCONF driver in ODL
a set of ODL modules.
In the following slides, I will:
focus on the generic device driver
give some elements about the service orchestrator
give some elements about NETCONF
Generic network device driver:
a set of modules in ODL
the piece of software that provides abstraction to service applications
provides abstraction to simplify the development of upper level applications
makes it possible to change the hardware without changing those upper level applications.
exposes a model of a generic network device (basically: an abstract router with ports)
provides network operations to configure one network device
composed of two layers + uses ODL NETCONF driver
To illustrate the role of each layer, we will take an example of a network operation: add a physical port (ethernet interface) of a given router to a given VPN.
We assume that the VPN is already created.
At the higher level: as the developer of an application that automates the deployment of the VPN, I want to “add port 3 of the “light blue” router (on site 1) to the orange user VPN”.
Next slides: we will see how this translates at each layer, from bottom to top, starting with the NETCONF driver
Before we go into the presentation of the driver: a parenthesis with two comments about NETCONF
NETCONF is better than CLI (command line interface) for network automation, but not enough to provide abstraction
structured data => easier to process in a program
we know clearly when it fails on the device (no need to interpret an error string)
the limits: some really useful features are optional in the standard, and cannot be found on all network devices:
example: commit/discard-changes: guarantees that we have transactional sets of operations
very useful for error management
impossible to emulate at a higher level, because the device to configure can become unreachable in the middle of a configuration
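The transactional sequence enabled by commit/discard-changes can be sketched as follows. This is only an illustration in Python (the project itself used the Java NETCONF library in ODL): it builds the RPC messages defined by the NETCONF standard (RFC 6241). The `send` callback, the `apply_transactionally` helper and the vendor payloads are hypothetical placeholders, and the sketch assumes the device advertises the optional candidate capability.

```python
import xml.etree.ElementTree as ET

NC = "urn:ietf:params:xml:ns:netconf:base:1.0"  # NETCONF base namespace (RFC 6241)


def rpc(message_id, operation):
    """Wrap an operation element in a NETCONF <rpc> envelope."""
    root = ET.Element(f"{{{NC}}}rpc", {"message-id": str(message_id)})
    root.append(operation)
    return ET.tostring(root, encoding="unicode")


def edit_candidate(vendor_config):
    """Build an <edit-config> that targets the candidate datastore."""
    op = ET.Element(f"{{{NC}}}edit-config")
    target = ET.SubElement(op, f"{{{NC}}}target")
    ET.SubElement(target, f"{{{NC}}}candidate")
    config = ET.SubElement(op, f"{{{NC}}}config")
    config.append(vendor_config)  # vendor-specific elements, opaque to this sketch
    return op


def simple(name):
    """Build a parameterless operation such as <commit/> or <discard-changes/>."""
    return ET.Element(f"{{{NC}}}{name}")


def apply_transactionally(send, vendor_configs):
    """Stage every change in the candidate datastore, then commit all or discard all.

    `send` is a hypothetical callback: it sends one RPC string and returns True
    when the device answers <ok/>, False when it answers <rpc-error>.
    """
    msg_id = 1
    for cfg in vendor_configs:
        if not send(rpc(msg_id, edit_candidate(cfg))):
            # One staged change was rejected: drop everything staged so far.
            send(rpc(msg_id + 1, simple("discard-changes")))
            return False
        msg_id += 1
    return send(rpc(msg_id, simple("commit")))
```

Nothing reaches the running configuration before the final commit, which is what makes the set of operations behave as one transaction. On a device without the candidate capability, this sequence is not available.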
[Backup comment: over the management plane, the virtualized routers that we used can be configured as physical routers]
NETCONF driver: Java library provided by ODL (not a real MD-SAL driver)
connects to a given device and authenticates
exposes the NETCONF toolbox (lock, commit/discard-changes, …)
the user must provide vendor-dependent configuration elements that will be transported over the NETCONF protocol
What do those configuration elements look like? See next slide…
Example 1: configuration elements to add a port to a L2VPN on a Cisco device (Cisco-specific)
At first view: cryptic if you don’t know the configuration language
I won't go into the explanations.
We can see the interface name and the VPN number (in bold)
To deal with those configuration elements, you need:
1. to understand the theory behind L2VPNs
2. to understand how the device vendor chose to expose it
3. to deal with XML
Consequently: not something you want to deal with when you write a service management application
Example 2: the same thing: configuration elements to pass to the NETCONF driver to add a port to a L2VPN, but on a Juniper device (Juniper-specific)
(rem: incomplete for the sake of readability)
another vendor, another vendor-specific configuration language transported over NETCONF
we can see the vpn name and the interface name (in bold)
we don’t want to have specific code for each kind of device at the service level.
(rem: No generic YANG models for configuration data supported by the devices we used.)
Conclusion about the NETCONF driver:
not easy to deal with network device configuration elements
specific code needed for each vendor
To overcome those issues, we created the “generic network device driver”
Device independent driver: network device abstraction layer:
Java library developed in ODL at bcom
exposes the generic network operations that can be done on a device
e.g. “create a virtual switch on a given router”, “add a port to a given virtual switch”
converts those network operations into device-specific configuration elements
(rem: network-service specific API: one API per service (l2vpn, l3vpn, hostname, …))
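A minimal sketch of how such a device-independent layer could map one generic operation onto vendor-specific renderers. It is written in Python for brevity (the actual driver is a Java library in ODL), all class and function names are hypothetical, and the payloads are deliberately fake placeholders, not real Cisco or Juniper configuration syntax.

```python
from abc import ABC, abstractmethod


class L2vpnRenderer(ABC):
    """Turns generic L2VPN operations into vendor-specific configuration elements."""

    @abstractmethod
    def add_port(self, vpn_name: str, port: str) -> str:
        ...


class VendorARenderer(L2vpnRenderer):
    def add_port(self, vpn_name, port):
        # Placeholder payload, NOT real vendor syntax.
        return f"<vendor-a><interface>{port}</interface><vpn>{vpn_name}</vpn></vendor-a>"


class VendorBRenderer(L2vpnRenderer):
    def add_port(self, vpn_name, port):
        # Placeholder payload, NOT real vendor syntax.
        return f"<vendor-b><instance name='{vpn_name}'><port>{port}</port></instance></vendor-b>"


# One renderer per supported vendor; supporting a new vendor means adding an entry here.
RENDERERS = {"vendor-a": VendorARenderer(), "vendor-b": VendorBRenderer()}


def add_port_to_vpn(vendor, vpn_name, port):
    """Generic operation: same call for every vendor, different payload underneath."""
    try:
        renderer = RENDERERS[vendor]
    except KeyError:
        raise ValueError(f"no renderer for vendor {vendor!r}") from None
    return renderer.add_port(vpn_name, port)
```

The caller still has to name the vendor at this layer; only the configuration language is hidden.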
Example: usage of the device independent driver
it’s better from a developer’s point of view!
* it’s more expressive
* it’s not vendor-specific
Still: we need to know some management-related technical details: what is the vendor? How can I connect to the device (IP address or hostname, port number)? How should I authenticate myself?
We would like the service developer not to have to deal with these details => this is the purpose of the management plane independent driver.
Management plane independent driver:
MD-SAL interface (YANG models)
abstracts the technical details of the management plane (protocol, address & port number, admin credentials)
Note: relies on a management database holding the details: the user only needs to know the id of the network devices
device_id ⇔ (vendor, protocol, ip_address, port, login, password)
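The mapping above can be pictured as a simple lookup table. The sketch below is an assumption about its shape, not the actual database schema: the class names, the device id and the credentials are hypothetical, the address comes from the documentation range, and 830 is the standard NETCONF-over-SSH port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementEntry:
    """Management-plane details hidden from the service developer."""
    vendor: str
    protocol: str
    ip_address: str
    port: int
    login: str
    password: str


class ManagementDatabase:
    """Resolves a device id into the details needed to reach the device."""

    def __init__(self):
        self._entries = {}

    def register(self, device_id, entry):
        self._entries[device_id] = entry

    def resolve(self, device_id):
        try:
            return self._entries[device_id]
        except KeyError:
            raise LookupError(f"unknown device id: {device_id!r}") from None


db = ManagementDatabase()
db.register(
    "site1-router1",  # hypothetical device id
    ManagementEntry("vendor-a", "netconf", "192.0.2.10", 830, "admin", "secret"),
)
```

With this in place, a caller only passes `"site1-router1"`; changing the protocol or the authentication scheme means updating the entry, not the service application.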
Example: usage of the “generic network device driver” API (RESTCONF)
simplifies the work of the user
in the end: we have two layers of abstraction which make it possible:
to change the hardware
to change the management plane (eg: new protocol, new auth scheme eg public/private key instead of login/password)
… without changing service applications (because: no impact on this interface)
(rem: vpn-name vs vpn-number: needed for persistency across controller restart)
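A hedged sketch of what a RESTCONF call to such an interface could look like from a client. The module name (`generic-l2vpn`), the RPC name (`add-port`), the leaf names and the controller address are all hypothetical; the `/restconf/operations/` path and the `input` wrapper follow the RESTCONF convention for invoking RPCs and should be checked against the controller version in use. The request is only built here, not sent.

```python
import json
import urllib.request


def build_add_port_request(base_url, device_id, vpn_name, port):
    """Build (without sending) a POST that invokes a hypothetical 'add-port' RPC."""
    # Hypothetical YANG module and RPC names.
    url = f"{base_url}/restconf/operations/generic-l2vpn:add-port"
    # Only the device id is needed: no vendor, address or credentials.
    body = {"input": {"device-id": device_id, "vpn-name": vpn_name, "port": port}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_add_port_request(
    "http://controller.example:8181",  # hypothetical controller address
    "site1-router1",
    "orange",
    "port-3",
)
```

The payload carries a VPN name and a device id only, which is what keeps the service application unaffected by hardware or management plane changes.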
(Pending issue: does it make sense to merge the device independent driver and the management plane independent driver?
Merger pros: easier update and maintenance of the driver
Merger cons: need to fill a database to begin to work with the driver. Less flexible, adds complexity (tests, simple prototypes, …)
)
The generic network device driver operates on a single device.
On top of that, the L2VPN service orchestrator can deploy a VPN on a set of devices
(another layer of abstraction)
L2VPN service orchestrator:
developed in Java using open source technology
not developed in ODL (BPMN approach)
About the orchestrator:
knows a logical view of the network: the network resources, abstracted/high level as exposed by the generic network driver: routers, vfi, ports, …
exposes a service API that can operate on a set of network devices
e.g. “create a L2VPN between router A and router B with two ports on router A and one port on router B”
On a service request, eg “create a L2VPN”
check resource availability + allocate resources
prepare operations: ordered list of “generic driver” operations
do operations
handle errors automatically
2 strategies to handle errors:
rollback on error
continue on error
strategy = decision by the user of the orchestrator: depends on whether the service can work in degraded mode (reduced functionality mode) (and whether this makes sense)
what we did (eg in case a device is unreachable):
vpn creation: rollback on error
vpn destruction: continue on error + fix manually + resume
[
assumption: the generic network device driver exposes transactional operations:
either an operation succeeds, the device is correctly configured, and this is reported
or an operation fails and it is guaranteed that the device is not configured at all
this is tricky, and not completely solved.
]
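The two strategies can be sketched as follows, under the same assumption of transactional operations. This is an illustration in Python, not the orchestrator's actual code (which is a Java application built with a BPMN approach); `Strategy`, `DeviceError` and `run_plan` are hypothetical names.

```python
from enum import Enum


class Strategy(Enum):
    ROLLBACK_ON_ERROR = "rollback"
    CONTINUE_ON_ERROR = "continue"


class DeviceError(Exception):
    """Raised by a driver operation when the device rejects it or is unreachable."""


def run_plan(plan, strategy):
    """Execute an ordered list of (do, undo) callables.

    Returns (done, failed): indexes of the operations left applied and of the
    operations that failed. Assumes each `do` is transactional: it either fully
    applies or leaves the device untouched.
    """
    done, failed = [], []
    for index, (do, undo) in enumerate(plan):
        try:
            do()
            done.append(index)
        except DeviceError:
            failed.append(index)
            if strategy is Strategy.ROLLBACK_ON_ERROR:
                # Undo the already-applied operations in reverse order, then stop.
                for applied in reversed(done):
                    plan[applied][1]()
                return [], failed
            # CONTINUE_ON_ERROR: keep going; failed steps await a manual fix + resume.
    return done, failed
```

In this sketch an undo that itself fails is not handled, which is part of why the problem is described above as tricky.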
Move orchestration to ODL:
to simplify the deployment of the system
((to make it easier to work around bugs in the devices that need a fix at the orchestration layer: eg configure router type 1 before router type 2
(this is a problem we encountered, and that was eventually fixed in the device: interop issue)
))
Data persistence in ODL
what: management plane database & L2VPN service orchestrator database
we expected data persistence in Lithium, but it is not there yet
=> we will try with clustering + MD-SAL datastore
CLI driver
NETCONF is not available everywhere
need a CLI driver to extend the set of network devices that we can manage
we have a working prototype: cli driver integrated in our generic network device driver
What we, the b<>com team that worked on ODL, learned from experience.
We have been using ODL for about one year (we started during the Helium development cycle)
Cons:
hard to learn, hard to master
a complex beast
lots of Java tools & libs => a good understanding of Java tools definitely helps
lots of abstractions that are not easily understood
lack of exhaustive & up-to-date documentation
(dev guide too short, even in the Lithium release)
obsolete documentation sometimes in the wiki
code examples: in the code
=> steep learning curve
be ready to spend a lot of time each time we want to do something new, which is a problem because ODL is supposed to be a toolbox
(also: working on the master branch was frustrating:
builds that worked suddenly failed because of snapshot updates)
Pros:
ODL is stable for our use case (Helium & Lithium)
(disclaimer: we only did functional tests)
in our experience, ODL core components (MD-SAL, NETCONF, YANG generated code) are stable at runtime:
no crash, no freeze, can stay up for days, no need to restart it before each demo
when doing the demo: no fear that it would fail because of ODL
we never experienced a failure because of ODL.
bug/patch submission/review process: well described, works OK
Our first contribution to ODL:
Interop issues between the NETCONF driver in ODL and some routers:
(low level details in grey areas of the specs: unclear whether the device or odl did it wrong)
we identified and fixed them
we submitted bug reports and patches to ODL
we worked on the patches with the community
eventually: those patches were accepted in the controller code base (lithium, master)
community: friendly & responsive
tolerant of newcomers’ process mistakes
In the end: we will go on with ODL for the foreseeable future
This work proves that ODL can be used to manage network devices (over the management plane).
This is a useful piece on the way to network virtualization.
Confirms that ODL is a general purpose network controller, not just about OpenFlow.
Our code is not open source yet, but we are ready to open it if there is interest.
This is a prototype, with capabilities limited to our use cases: L2VPN (shown), L3VPN (done)
If you are interested, you can get in touch with me during the conf or email me.