Network administration overhead is currently one of the major obstacles preventing customers from moving OpenStack into production for wider adoption and efficient use by applications. Cloud operators often lack visibility into the day-to-day operation of the underlying workers, and have no coherent representation of physical and virtual network elements and their interconnections. They find it hard to estimate the impact of micro-failures in their infrastructure and to react quickly when failures occur. Some work around the complexity of operating, discovering and monitoring their cloud with manual processes and/or complex batch operations. I'm offering a journey through the troubleshooting and discovery cycles of a typical cloud that we run today, and suggest elegant ways to reduce these overheads. Substantially simplifying networking operations, troubleshooting and monitoring is possible through a unified Operations API and an operations agent; these concepts will be presented, accompanied by practical demos.
2. OpenStack Discovery and Assurance
Koren Lev
DC Operator, IT Developer, Entrepreneur, Dev/Ops manager etc…
• I’ve been using OpenStack since Diablo (~ 6 years)
• I’ve been operating and supporting SP and ENT
deployments in Europe and the Middle East
3. General observations and thoughts…
• I believe OpenStack infrastructure is not very easy to operate
(post-installation, that is…)
• I believe it is a bit hard to maintain and troubleshoot
• The community's focus is on fulfilment (“make it work”), provisioning (“configure
it”) and abstraction (“end users don’t care about the details”). Therein lies
the problem (IMHO)
• We neglected the cloud operator’s operational needs (IMHO)
• According to Mirantis (for example): running 5,000 OpenStack nodes failed
mostly because of issues around Neutron
• I’ll use the networking charter to illustrate this; the points made fit all charters
5. Controllers and Agents vs Workers/Plugins
• Most OpenStack modules operate using controllers and agents.
• Here is an example:
Controllers → Agents → Workers
APIs: for fulfilment and provisioning, abstracted
https://docs.openstack.org/developer/neutron/#neutron-stadium
6. Neutron controller data (current API):
Objects: “instance”, “port”, “network”, “router”
Very simple, abstracted, awesome for the cloud user…
…and be assured: the network is active!
7. The views of cloud operations team…
• Let’s say a ‘vm200’ instance on ‘network100’ can’t communicate (it happens…)
• Troubleshooting with premium knowledge (good support personnel)
• Assuming: Mirantis 8.0 (Liberty), mechanism: OVS and LXB, type: VXLAN
• Assuming: only RegionOne
• Assuming: you found the nova instance-to-host mapping (nova API)
• Assuming: you found the nova instance-name-to-uuid mapping (nova API) *
* since Liberty
8. • Running on host ‘node-6’: the OVS agent is there; host and agent are reachable.
• We need more details before going down to the host level…
• A DHCP server and a gateway/router are running on this network; find out where:
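Locating the DHCP server and the router on a node comes down to Neutron’s namespace naming convention: the reference agents create qdhcp-&lt;network-uuid&gt; and qrouter-&lt;router-uuid&gt; Linux network namespaces on whichever node hosts the agent. A minimal sketch (the UUID below is made up):

```python
# Neutron (reference OVS/LXB agents) places each network's dnsmasq server
# and each router inside Linux network namespaces named by resource UUID.

def dhcp_namespace(network_uuid):
    """Namespace holding the dnsmasq DHCP server for a network."""
    return "qdhcp-" + network_uuid

def router_namespace(router_uuid):
    """Namespace holding the L3 gateway/router."""
    return "qrouter-" + router_uuid

# Hypothetical network100 UUID -> the namespace to probe on node-6:
ns = dhcp_namespace("0e3f8b34-1111-2222-3333-444455556666")
print(ns)  # qdhcp-0e3f8b34-1111-2222-3333-444455556666
```

On node-6 the ops commands then become `ip netns list | grep qdhcp` and `ip netns exec <namespace> ip addr`.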
9. • More details are missing; they are available through MariaDB, not exposed in the API
(partial list):
• Is this really important data for troubleshooting?
• Well… depends what’s wrong in the network (if not being ‘active’ or ‘;-)’)
• Workers/plugins vendors place their details in MariaDB (no ops API)
10. • So, based on the findings so far, moving to the host level (yes, the MariaDB data is not
enough!):
11. • Ever wondered what’s going on in the hypervisor interface list? (partial list here):
12. • Let’s skip vNIC model-type details for now and move down to the linux bridge:
• The instance representation of a network ‘port’ inside that specific hypervisor
(assuming the linux bridge plugin)
• The bridge-side network ‘port’ inside that specific hypervisor (assuming the linux
bridge plugin)
• Thought: is it ‘active’?
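The device names seen on the hypervisor are not arbitrary: with the Liberty-era OVS hybrid plug, each Neutron port UUID is truncated and prefixed. A sketch of the convention (the UUID below is made up):

```python
# Linux caps interface names at ~15 chars, so Nova/Neutron name the
# per-port devices <prefix> + first 11 chars of the Neutron port UUID.

PREFIXES = {
    "tap": "instance-side device",
    "qbr": "per-port linux bridge (hybrid plug)",
    "qvb": "veth peer on the linux-bridge side",
    "qvo": "veth peer on the OVS side, plugged into br-int",
}

def port_devices(port_uuid):
    """Map a Neutron port UUID to its hypervisor device names."""
    short = port_uuid[:11]
    return {prefix: prefix + short for prefix in PREFIXES}

print(port_devices("dadd3a2c-6b12-4c1a-9f00-aabbccddeeff"))
```

That is why `brctl show` and `ip link` output can be matched back to a specific Neutron port without any extra lookup.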
13. • Let’s skip monitoring details for now and move down to the OpenvSwitch:
• The ovs-side network ‘port’ inside that specific hypervisor (assuming the ovs plugin)
• The ovs-side representation of the instance ‘port’
• The integration bridge inside OVS, in charge of isolation and encapsulation
• The tunneling bridge inside OVS, in charge of isolation and segmentation
• Tunneling used for this specific case (vxlan)
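Mapping a qvo device to its OVS bridge normally means eyeballing `ovs-vsctl show`; a tiny parser makes the relationship explicit. The sample output below is hypothetical, shaped like the Liberty-era layout (br-int for integration, br-tun for tunneling, patch ports between them):

```python
# Minimal parser over abbreviated, illustrative `ovs-vsctl show` output:
# answers "which OVS bridge holds this port?".

SAMPLE = """\
Bridge br-int
    Port "qvodadd3a2c-6b"
    Port patch-tun
Bridge br-tun
    Port patch-int
    Port "vxlan-c0a80202"
"""

def bridge_of(port_name, text=SAMPLE):
    bridge = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Bridge "):
            bridge = stripped.split()[1]          # remember current bridge
        elif stripped.startswith("Port ") and port_name in stripped:
            return bridge                         # port found under it
    return None

print(bridge_of("qvodadd3a2c-6b"))  # br-int
```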
14. • Now, which communication is broken? To which destinations? Depending on
the answers, we can go across to the specific tunnel destinations.
• Let’s assume vm200 has no IP address assigned, so we investigate the tunnel to
node-6 (the neutron dhcp agent is over there, see slide 7):
• Node-1 192.168.2.1 as source and node-6 192.168.2.2 as destination
(assuming in this example no routing is needed between the source and
destination of the tunnel)
15. • Finding the physical NICs used for the segmentation/tunneling from node-1 to
node-6:
• The “br-mesh” bridge in this hypervisor holds the IP for the vxlan-sys tunneling
inside the ovs
• The “br-mesh” bridge in this hypervisor is connected through pNIC ens160,
sub-interface 103 (the VLAN for the tunnel endpoint)
vi /etc/network/interfaces.d/ifcfg-ens160.103:
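The `ens160.103` name already encodes the answer: base pNIC plus VLAN sub-interface. A trivial helper:

```python
# Split a Linux VLAN sub-interface name into (physical NIC, VLAN id).

def split_subinterface(name):
    if "." in name:
        pnic, vlan = name.rsplit(".", 1)
        return pnic, int(vlan)
    return name, None  # plain interface, no VLAN tag

print(split_subinterface("ens160.103"))  # ('ens160', 103)
```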
16. • Moving to node-1 for the L3, DHCP and metadata investigations:
• Find the uuid of the dhcp service run by that specific dhcp agent on that
specific node
• The dhcp server has this vNIC port connected down at node-1
17. • vServices vNIC interface connections on node-1 (dhcp: a quick summary):
18. • vServices vNIC interface connections on node-1 (l3: a quick summary):
19. • What if we change distribution/mechanism/types? (guess what: different
discovery/collection logic and different details per object), dpdk/fd.io example:
20. • What if more than 1 VM? What if HA? What if DVR?
• Discovery x VMs x 2, Discovery x 2, Discovery x Hosts
• Only post-discovery can you start finding a fix…
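As a back-of-the-envelope model of my own (not a formula from the talk), the manual walk above multiplies out quickly:

```python
# Illustrative only: the manual walk repeats per VM (x2 for both ends of a
# path), per involved host, and per HA replica -- which is why manual
# discovery stops scaling.

def manual_steps(vms, hosts, steps_per_object=6, ha_replicas=1):
    """Rough count of CLI/DB lookups for one troubleshooting pass."""
    return steps_per_object * (vms * 2 + hosts) * ha_replicas

# Even 2 VMs across 2 hosts already means dozens of lookups:
print(manual_steps(2, 2))  # 36
```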
21. Point made (!?) stop bitching… any solution?
Yes, we are a small team that spent the last year developing a possible offering to
start solving the networking charter, focused on a ‘Networking Operations API’ (see
next).
…not a cure for cancer… but it’s pretty good, tested with real IT operations teams.
We call it ‘Calipso’.
Possible OpenStack attachments: ‘Monasca’, ‘Vitrage’, ‘Ceilometer’, ‘Neutron’, ‘Tacker’
Others: ‘Barometer’
22. • OpenStack “Operations APIs”: let’s get started…
• Exposing the needed details for the cloud operations team
• To be developed for any module suffering from lack of workers/plugins visibility
Our ‘Networking Operations API’:
• Modeled for multi-distribution and any mechanism-driver/type-driver variances
• Includes smart discovery logic, a visualization solution, monitoring, analysis
Proposition: a possible starting point
Visibility = Predictability = Stability
24. Calipso objects - examples

| OSDNA Object | Object Details | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| vService | Services Overlay (virtual) | DHCP (ip netns) | L3 GW (ip netns) | FWaaS |
| vNIC | VM NIC, Container CNI, Instance/vService vNIC | Tap to linux-bridge | VPP Virtual-Ethernet | |
| vConnector | L2 inside a host (isolation) | Linux Bridge | VPP bridge-domain | VMware Port-Group |
| vEdge | Virtual to Physical Edge | OVS | VPP | Midonet |
| pNIC / Bond | Physical Underlay Fabric Edge Ports | EPGs in ACI | Servers Eth / Ether-channels | |
| Network Segment | Virtual Segments (for any tunneling overlay) | VLAN | VXLAN Segment-ID | GRE segments |
| OTEP | Overlay Tunnel | VXLAN | Geneve | GRE |

| OSDNA View | Details | Example 1 | Example 2 | Example 3 |
|---|---|---|---|---|
| Virtual Topology | Modular links graph in Calipso discovery | vService to Network | Instance to Network | All virtual-to-physical per network |
| Policy Topology | Data from the APP driving OpenStack | App VM to DB VM | VNF to end-user | VNF chaining |
26. Calipso Discovery Logic, per environment (Environment A, B, C…):
• Each environment has an Environment_Config driving its initial scan logic.
• Scans collect through that environment’s API, DB and CLI.
Environment_Config example (fragment):
    "name" : "MyENV3",
    "host" : "10.56.20.239",
    "port" : "5673",
    "user" : "nova",
    "password" : "YVWMiKMshZhlxxxxqFu5PdT9d"
  },
  {
    "name" : "Monitoring3",
    "type" : "Sensu",
    "host" : "korlev-nsxe1.cisco.com",
    "port" : "4567"
  [removed]
],
"distribution" : "Mirantis-8.0",
"last_scanned" : "5/8/16",
"name" : "Mirantis-Liberty",
"mechanism_drivers" : ["OVS"],
"type_drivers" : "vxlan",
"operational" : "yes",
"type" : "environment"
Calipso hierarchical, modeled inventory: regions, projects, hosts, aggregates/zones,
networks, ports, instances, vNICs, vConnectors, vEdges, vServices, pNICs, OTEPs, etc.
Links and relationships analysis: Instance-vNIC, vNIC-vConnector, vConnector-vEdge,
vEdge-pNIC, pNIC-OTEP, OTEP-vConnector, vService-vNIC, Network-Port, etc.
Calipso cliques and topologies (cliques): focal_point_type (ex: instance),
clique_type: [array of links]
Real-time updates: RabbitMQ CRUD events feed per-environment Environment_Listeners
(A, B, C) running event-based scan logic; object scans go over SSH, with parsing and caching.
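A sketch of what a clique_type document could look like, based only on the two fields the slide names (focal_point_type and an array of links); the exact field names and link-type strings are my guesses, not the actual Calipso schema:

```python
# Hypothetical clique definition: starting from an 'instance' focal point,
# walk the chain of link types down to the overlay tunnel endpoint.

instance_clique = {
    "focal_point_type": "instance",
    "link_types": [                 # ordered chain from the analysis list
        "instance-vnic",
        "vnic-vconnector",
        "vconnector-vedge",
        "vedge-pnic",
        "pnic-otep",
    ],
}

print(instance_clique["focal_point_type"], len(instance_clique["link_types"]))
```

The point of the structure: the manual walk from slides 8-16 becomes one declarative document that the discovery engine can traverse for any instance.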
27. Calipso Monitoring (per environment, region/zone and host):
• A Sensu server manager is configured by Calipso, alongside Calipso Sensu checks,
the Sensu Redis DB, the Sensu API and the Sensu UI.
• Sensu clients and their transport are configured and deployed by Calipso on each
host; checks are customized and modeled per object: VPP, vNIC, LXB, OTEP and
pNIC stats/results, etc.
• Real-time status and statistics (OTEP, vNIC, pNIC, vEdge) are attached to the
Calipso hierarchical, modeled inventory (regions, projects, hosts, aggregates/zones,
networks, ports, instances, vNICs, vConnectors, vEdges, vServices, pNICs, OTEPs, etc.).
• Per-environment Calipso Sensu handlers feed the environment-aware Monitoring
Configurator and the Calipso BUS, correlated with the Calipso Discovery Logic.
• Results are ported to a TSDB for historical reporting, possibly contributing to
OpenStack health checks.
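The “customized and modeled” checks could be expressed as standard Sensu 0.x standalone check definitions; the check name, command and handler below are hypothetical, shown only to make the mechanism concrete:

```python
# Hypothetical per-host Sensu check, of the kind Calipso could generate
# per object type (vNIC, vEdge, pNIC, OTEP, ...). Sensu 0.x client-side
# check definition, serialized as JSON.
import json

vedge_check = {
    "checks": {
        "calipso-vedge-ovs": {                 # hypothetical check name
            "command": "check_ovs_status.py --bridge br-int",  # hypothetical plugin
            "interval": 60,                    # seconds between runs
            "standalone": True,                # scheduled by the client
            "handlers": ["calipso"],           # routed to the Calipso handler
        }
    }
}

print(json.dumps(vedge_check, indent=2))
```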
28. Calipso visualization: modeled for complex virtual topologies
• Calipso Discovery connects the physical and virtual elements of cloud networking,
from ANY (*Open)Stack and ANY plugin, through a model-driven discovery engine
into an inventory (running in Docker containers).
• Calipso UI and Calipso Graph provide cloud networking assurance: historical
trends, root cause and impact analysis.
• Users: the cloud network administrator and the tenant network administrator get
virtual network elements, their dependencies, status and stats, plus API
extensions for discovery/assurance.
29. OpenStack agent for an ‘Operations API’ - main modules (* all container-based
today): Discover*, Mongo DB*, Monitor*, BUS*, UI*, API*, plus external apps.
• Environment Config (init/setup) is pushed to Discovery and reflected in the UI config.
• Discovery consumes OS CRUD events (via RabbitMQ) and runs scans: “scan for all
data” (API, DB, CLI) producing full inventory data, and scheduled “scan for some
data” (API, DB, CLI) producing partial inventory/topology data; scan (temp) data
is kept during a run.
• Setup Monitor installs monitor clients + checks (Sensu clients and Sensu checks)
per monitoring config (init/setup); state/statistics check results flow back as
messages/notifications over the BUS.
• Inventory and topology data (full topology data, live updates) are served through
the API to the UI, to external apps and to an analysis app; a “Run a Scan” action
can be triggered on demand.
30. Discovery logic successfully running on:
OVS, VLANs, GREs, VXLANs:
• "Mirantis-6.0", "Mirantis-7.0", "Mirantis-8.0", "Mirantis-9.0", "Mirantis-9.1"
• "RDO-Mitaka", "RDO-Liberty", "RDO-Juno"
• "Devstack-liberty", "Devstack-Mitaka"
• "Canonical-icehouse", "Canonical-juno", "Canonical-liberty", "Canonical-mitaka"
• "Apex-Mitaka" (3-o)
• "packstack-7.0.0-0.10.dev1682"
• "Stratoscale-v2.1.6"
VPP, VLANs:
• "RDO-Mitaka", "Apex-Mitaka"
Pre-QA: Midonet, vSphere (vSwitch)
If your variance is not on this list, it means we didn’t test/validate it.
We’d appreciate your help in adapting to more variances!
32. Objects in Calipso Discovery, mapped per platform (OpenStack, Containers,
Bare Metal, VMware vSphere):

| Calipso object | OpenStack | Containers | Bare Metal | VMware vSphere |
|---|---|---|---|---|
| Region - ex: NYC, SJC | Region | N/A | N/A | DataCenter |
| Zone / Aggregate - ex: B16, Floor 2 etc… | Zone / Aggregate | Cluster | N/A | Cluster |
| Host - ex: compute node | Host | Server | N/A | Server |
| Project - ex: Coke | Project | Tenant | N/A | Tenant |
| Port | Port | Container veth | NIC | Port-group |
| Network | Network | Network | Network | Network |

Calipso Adapters (all through API): API - OpenStack, API - Contiv/Docker,
API - Cisco UCS, API - vSphere.
Calipso Monitoring: Custom Sensu Checks where applicable, otherwise N/A.
33. Objects in Calipso Discovery, continued:

| Calipso object | OpenStack (ex) | Containers | Bare Metal | VMware vSphere |
|---|---|---|---|---|
| Instance / vService | a VM, a DHCP srv | Container | A Server | VM |
| pNIC | TengigEth | pNIC | pNIC | pNIC |
| vConnector | Bridge | Bridge, BDomain | N/A | Port-group |
| vEdge | OVS, fd.io etc | OVS, fd.io | N/A | vSwitch / NSX switch |
| OTEP | VXLAN, GRE | VXLAN | N/A | VXLAN |
| vNIC / Port | vNIC / Port | Container veth, CNI | N/A | vNIC |
| Network / Network Segment | Network / Network Segment | Network / Network Segment | Network / Network Segment | Network |

Calipso Adapters:
| | OpenStack | Containers | Bare Metal | VMware vSphere |
|---|---|---|---|---|
| API | OpenStack | Contiv, Docker | Cisco UCS | vSphere |
| DB | MySQL | ETCD | | N/A |
| CLI | Linux Bash / SSH | Linux Bash / SSH / Docker | OS specific / SSH | ESXi |

Calipso Monitoring: Custom Sensu Checks for all of the above.