A RINA light implementation
Vincenzo Maffione
20/02/2017
Introduction (1)
● A Free and Open Source light implementation of RINA for Linux
● Implementation split between user-space and kernel-space
● KISS approach → codebase is clean and essential
● Focus:
○ basic functionality - do few things but do them well
○ stability and performance - support deployments with hundreds of nodes
○ minimality - avoid over-engineering
● Main goal: a baseline implementation for future RINA products
● Code and documentation available at https://github.com/vmaffione/rlite
Introduction (2)
● ~ 27 Klocs (not including blanks)
○ kernel-space: ~ 9 Klocs
○ user-space: ~ 18 Klocs
■ including tools and example applications
● Written mostly in C (some parts are C++ for convenience)
○ C: 14 Klocs
○ C++: 7 Klocs
● Network applications can be written in C
● Python bindings available to write network applications in Python
Introduction (3)
● kernel-space is implemented as a set of out-of-tree kernel modules, which run
on the unmodified Linux kernel.
○ Linux Kbuild system is used to build the modules against the running kernel
○ Build time (no parallel make): 3-15 seconds
● user-space is implemented as a set of shared libraries and programs
○ CMake is used to configure and build libraries and executables
○ Build time (no parallel make): 15-60 seconds
Basic features (1)
● Applications:
○ Flow allocation and deallocation, with QoS specification
○ Application registration and unregistration
○ Data transfer
● Stack administration:
○ Creation, deletion and configuration of IPCPs
○ Registration and enrollment among IPCPs
○ Monitoring and inspection
■ inspection of IPCPs in the system
■ inspection of RIBs
■ per-flow statistics
Basic features (2)
● QoS (supported through DTCP; see the flowspec sketch at the end of this slide):
○ Flow control
○ Retransmission control
○ Maximum allowable gap
○ Simple token-bucket rate-limiting
● Decent performance (detailed performance plots to come)
○ About 9.5 Gbps on a 10 Gbit link without flow control and retransmission
○ About 6 Gbps on a 10 Gbit link with flow control
○ A lot of room for optimizations
● Stability indicators
○ Ran 10-day-long VM-based experiments with up to 35 nodes, two levels of normal DIFs and 50 flow
allocations per second
○ Ran experiments with up to 10 levels of DIFs
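As a hedged illustration, the sketch below fills in a flow specification to request a reliable flow. The struct rina_flow_spec field names (max_sdu_gap, in_order_delivery, msg_boundaries) are assumptions based on include/rina/api.h and may differ across rlite versions:

#include <string.h>
#include <rina/api.h>

int alloc_reliable_flow(void)
{
    struct rina_flow_spec spec;

    memset(&spec, 0, sizeof(spec));
    spec.max_sdu_gap = 0;        /* no gaps allowed: enables retransmission control */
    spec.in_order_delivery = 1;  /* deliver SDUs in order */
    spec.msg_boundaries = 1;     /* preserve message boundaries */

    /* NULL dif_name lets the stack choose a DIF; names are placeholders. */
    return rina_flow_alloc(NULL, "client", "server", &spec, 0);
}

A max_sdu_gap greater than zero would instead express the maximum allowable gap listed above.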
Architecture overview
● kernel-space
○ Supports control operations
○ Implements datapath
○ Keeps state
● user-space
○ Libraries to abstract interaction with kernel-space functionalities
○ A daemon to implement management part of (many) IPCPs
○ An iproute2-like command-line tool to administer the stack
● Interactions between kernel-space and user-space only happen through character devices →
therefore through file descriptors
Kernel-space architecture (1)
● Supported functionalities:
○ IPCP creation, deletion and configuration (kernel keeps a per-IPCP data structure)
○ Flow (de)allocation (kernel keeps a per-flow data structure)
○ Application (un)registration (kernel keeps a data structure for each registered application)
○ RMT, DTP and DTCP components of the normal IPCP
○ Shim IPCP processes (e.g. interaction with network device drivers)
● State is maintained in kernel-space:
○ user-space can crash or be restarted at any time
○ user-space can recover state from kernel
Kernel-space architecture (2)
● User-space interacts with kernel-space only through two character devices
○ /dev/rlite for control operations
○ /dev/rlite-io for data transfer and synchronization
● Consequently, interactions only happen through file descriptors
● Both are “cloning devices”
○ each open() creates a new independent kernel-space instance
● Both devices support blocking and non-blocking operation
○ Standard poll() and select() widely used with the devices
Kernel-space architecture (3)
● /dev/rlite used for control operations
○ Flow (de)allocation
○ Application (un)registration
○ IPCP creation, deletion and configuration
○ Management of the PDU forwarding table
○ Interactions between user-space and kernel-space parts of IPCPs
○ Inspection and monitoring operations on flows and IPCPs
○ ...
Kernel-space architecture (4)
● Control operations follow a request/response paradigm:
○ write() to the control device to submit a request message
○ Response messages (not always present) can be read through read()
● The control device is used to avoid ioctl() and netlink
○ Easier porting to other OSes (e.g. FreeBSD)
● Request and response messages are represented by packed structs and are
serialized/deserialized during the user-space ←→ kernel-space transition
○ support for string (de)serialization
○ support for (apn, api, aen, aei) name (de)serialization
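The following sketch illustrates the request/response pattern only; the message layout here is hypothetical, since the real layouts are defined by rlite's internal headers:

#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical packed request: real message layouts are defined by rlite. */
struct fake_ctrl_msg {
    uint16_t msg_type;   /* identifies the requested operation */
    uint32_t event_id;   /* pairs a response with its request */
} __attribute__((packed));

int submit_request(void)
{
    int ctrl_fd = open("/dev/rlite", O_RDWR);
    struct fake_ctrl_msg req = { .msg_type = 1, .event_id = 42 };
    char resp[4096];

    if (ctrl_fd < 0) {
        return -1;
    }
    write(ctrl_fd, &req, sizeof(req));  /* submit the request message */
    read(ctrl_fd, resp, sizeof(resp));  /* collect the response, when present */
    close(ctrl_fd);
    return 0;
}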
Kernel-space architecture (5)
● /dev/rlite-io for data transfer and synchronization
○ read()
○ write()
○ select(), poll(), epoll()
● Application workflow:
○ Use the control device to allocate a flow (kernel-space object)
○ Bind the flow to a newly-created data transfer file descriptor - this is the only task performed by
means of ioctl()
○ Use the data transfer file descriptor to exchange SDUs and/or wait for events
○ Close file descriptor to deallocate the associated flow
● Special binding mode to exchange management SDUs
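The sketch below walks the server side of this workflow through the public librlite API, which performs the control-device and ioctl() steps internally; the application name is a placeholder and error handling is trimmed:

#include <unistd.h>
#include <rina/api.h>

int main(void)
{
    int cfd = rina_open();   /* open a control device instance */
    char buf[4096];
    int n;

    /* NULL dif_name: let the stack pick a DIF (assumed behaviour). */
    rina_register(cfd, NULL, "echo-server", 0);

    for (;;) {
        /* Wait for an incoming flow; NULLs skip collecting the remote
         * name and flowspec (assumed allowed). Returns a data transfer fd. */
        int fd = rina_flow_accept(cfd, NULL, NULL, 0);

        if (fd < 0) {
            continue;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            write(fd, buf, n);   /* echo each SDU back */
        }
        close(fd);               /* closing the fd deallocates the flow */
    }
}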
Kernel-space architecture (6)
● Usual abstract factory pattern to manage different types of IPCPs
○ normal: implementation of the regular IPCP
○ shim-loopback: supports system-local IPC, with optional queued mode to decouple TX and RX
code-paths, and optional packet drop emulation
○ shim-eth: uses network device drivers to transmit and receive SDUs, sharing the device with
the Linux network stack
○ shim-udp4: tunnels RINA traffic over a UDP socket; mostly implemented in user-space, only
data transfer is implemented in kernel-space
○ shim-tcp4: same as shim-udp4, but using a TCP socket; deprecated, since it duplicates
flow-control and congestion control done in higher layers
○ shim-hv: uses VMPI devices to transmit and receive SDUs
Some kernel-space internals
● Reference counters widely used to manage lifetime of objects (e.g. IPCPs,
flows, registered applications, PDUs)
● sk_buff-like approach to avoid copies throughout the datapath
● dynamic allocation of PDU buffers
○ The amount of header space to reserve at allocation time is precomputed by the user-space
daemon, depending on the local IPCP dependency graph
● All PDU queues are limited in size to keep memory usage under control
● Deferred work (workqueues) used only when necessary, to keep latency low
○ Example: driver transmission routine directly executes in the context of an application write()
system call, when possible
User-space libraries
● librlite (written in C)
○ main library, abstracts interactions with the rlite control device (/dev/rlite)
○ provides common utilities and helpers (application names, flow specification, control
messages, ...)
○ provides an API for RINA applications
● Other libraries
○ librlite-conf (C): extends librlite with kernel-space IPCP management functionalities
○ librlite-cdap (C++): CDAP implementation based on Google Protocol Buffers
librlite - Overview
● librlite provides API calls to interact with control device instances
○ Validation, serialization and deserialization of control messages in both directions (user →
kernel, kernel → user)
● It defines a POSIX-like API for applications:
○ Reminiscent of the socket API, to ease porting of existing socket applications...
○ … yet with the full power of RINA API (QoS support and complete naming scheme)
○ Easy to learn for grown-up network developers!
○ Documentation available at https://github.com/vmaffione/rlite/blob/master/include/rina/api.h
○ Other resources: https://github.com/IRATI/stack/wiki/Application-API
librlite - Application API
● Main API calls:
○ int rina_open() → fd
■ Opens a control device instance, returning a file descriptor.
○ int rina_flow_alloc(dif_name, local_name, remote_name, flowspec, flags) → fd
■ Issues a flow allocation request and possibly waits for the associated response. Returns a file descriptor to be
used for data transfer.
○ int rina_register(fd, dif_name, appl_name, flags)
■ Registers an application in a given DIF.
○ int rina_unregister(fd, dif_name, appl_name, flags)
■ Unregisters an application from a given DIF.
○ int rina_flow_accept(fd, flags) → remote_appl, flowspec
■ Waits for and possibly accepts an incoming flow request, where the destination application is one of those
registered to the control device referred to by fd. Returns a file descriptor to be used for data transfer.
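A minimal client sketch built on these calls; the application names are placeholders and error handling is trimmed:

#include <stdio.h>
#include <unistd.h>
#include <rina/api.h>

int main(void)
{
    char buf[128];
    int n;

    /* NULL dif_name: let the stack choose; NULL flowspec: best-effort QoS. */
    int fd = rina_flow_alloc(NULL, "echo-client", "echo-server", NULL, 0);

    if (fd < 0) {
        perror("rina_flow_alloc");
        return 1;
    }
    write(fd, "hello", 5);              /* send an SDU */
    n = read(fd, buf, sizeof(buf));     /* receive the reply */
    if (n > 0) {
        printf("received %d bytes\n", n);
    }
    close(fd);                          /* deallocates the flow */
    return 0;
}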
librlite-conf
● It is the backend for the rlite-ctl stack administration tool
● Exports the management and inspection functionalities:
○ IPCP creation
○ IPCP deletion
○ IPCP configuration
○ Fetch of current flows (with related statistics)
○ Dump state of a specific flow
○ Synchronization with uipcps daemon, to wait for the user-space part of an IPCP to show up
○ ...
librlite-cdap
● CDAP implementation using Google Protocol Buffers as concrete syntax
● Provides CDAP message constructors, serializers and deserializers
● Provides CDAP connection objects to send and receive CDAP messages
● Each CDAP connection wraps a file descriptor
○ In this way CDAP can be used over arbitrary file descriptors
○ Primarily meant to be used with /dev/rlite-io file descriptors
○ No dependencies on other parts of rlite, can be reused as a stand-alone component
Uipcps daemon - Overview
● A multi-threaded single-process daemon that implements the management part of some IPCPs
● When an IPCP is created by the kernel, the daemon gets notified, and creates the corresponding
user-space IPCP (uipcp)
● For regular IPCPs, it implements:
○ Flow allocation RIB objects
○ Directory Forwarding Table RIB objects
○ Enrollment RIB objects and enrollment state machines
○ Routing RIB objects
○ Address allocation RIB objects
● For shim-udp4 IPCPs it implements UDP socket setup and dynamic UDP port allocation
● For shim-tcp4 IPCPs it implements TCP connection setup and teardown for both client and server
side (connect(), accept(), etc.)
Uipcps daemon - Internals
● A custom event-loop thread for each IPCP
● An additional thread that implements a UNIX socket server to serve requests coming from the
rlite-ctl tool (or other future agents)
● Abstract factory pattern to manage different types of uipcps
● Reference counters used to manage uipcps lifetime
● Subsystems:
○ UNIX socket server, written in C
○ uipcps container for generic uipcp management (creation, deletion, …), written in C
○ shim-udp4 and shim-tcp4 user-space implementation, written in C
○ normal IPCP user-space implementation, written in C++ mainly because of CDAP
● C++ code confined inside the uipcp-normal statically linked library.
Uipcps daemon - Subsystems
[Diagram: uipcps daemon subsystems — UNIX socket server, uipcps container, normal and shim-udp4 uipcps — together with rlite-ctl, applications, and the librlite and librlite-cdap libraries around them]
Uipcps daemon - Event loop
● A custom event-loop on top of rlite control devices
● The event-loop thread uses select() over many file descriptors
○ rlite control devices: when events happen on the control device, event-specific callbacks get
executed
○ Other file descriptors: when an event is ready on one of those, a user-provided callback gets
executed
● Supports timers, which can be used to execute a callback after a given
amount of time
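A simplified sketch of such a select()-based loop with a single timer callback; this is illustrative, not the daemon's actual code:

#include <stddef.h>
#include <sys/select.h>

typedef void (*evcb_t)(int fd, void *opaque);

struct ev_entry {
    int fd;          /* file descriptor to monitor */
    evcb_t cb;       /* callback to run when fd is readable */
    void *opaque;    /* user data passed back to the callback */
};

void ev_loop_run(struct ev_entry *ev, int num, void (*timer_cb)(void))
{
    for (;;) {
        fd_set rfds;
        int i, n, maxfd = -1;
        /* A real loop computes this from the earliest pending timer. */
        struct timeval to = { .tv_sec = 1, .tv_usec = 0 };

        FD_ZERO(&rfds);
        for (i = 0; i < num; i++) {
            FD_SET(ev[i].fd, &rfds);
            if (ev[i].fd > maxfd) {
                maxfd = ev[i].fd;
            }
        }

        n = select(maxfd + 1, &rfds, NULL, NULL, &to);
        if (n < 0) {
            continue;            /* e.g. EINTR: retry */
        }
        if (n == 0) {
            if (timer_cb) {
                timer_cb();      /* timeout: a timer expired */
            }
            continue;
        }
        for (i = 0; i < num; i++) {
            if (FD_ISSET(ev[i].fd, &rfds)) {
                ev[i].cb(ev[i].fd, ev[i].opaque);  /* dispatch ready fd */
            }
        }
    }
}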
Uipcps daemon - Advanced features
● The uipcps container module keeps track of the IPCPs in the local system
and the flows allocated among them
○ This information is maintained in a graph of local IPCPs
○ A node for each IPCP, an edge for each inter-IPCP flow
○ Graph used for automatic computation of:
■ per-IPCP Maximum SDU size (using the constraints provided by shim DIFs)
■ per-IPCP PCI header space to be reserved at kernel buffer allocation
○ Result of computation is pushed to the kernel for optimized operation
● Optional automatic re-enrollment triggers to create N-1 flows where they are
missing
rlite-ctl
● An iproute2-like command-line tool to administer and monitor IPCP
processes
● Functionalities:
○ IPCP creation and deletion
○ IPCP configuration
○ Registration of an IPCP to a DIF
○ Enrollment between a local IPCP and a remote IPCP
○ Show list of IPCPs
○ Show RIB of a DIF
○ Show list of flows
○ Dump state of a specific flow
Common functionalities
● Common code is compiled both in user-space and kernel-space, to ease
maintenance:
○ Serialization and deserialization routines of control messages across user/kernel interface
■ Table-based serialization/deserialization: adding a new message is straightforward
○ Helper functions for RINA names - (APN, API, AEN, AEI) tuples.
Available RINA applications
● Example applications:
○ rinaperf: multi-threaded client/server capable of parallel flow allocation, implementing basic
connectivity and performance testing: ping, request-response, unidirectional bandwidth
○ rina-echo-async: single-threaded event-loop based client/server tool, capable of concurrent
flow allocation and concurrent flow management
● Real applications
○ nginx: RINA port of the popular Nginx server
○ dropbear: RINA port of the Dropbear ssh client/server
○ rina-gw: Event-loop application acting as an application gateway between a RINA network and
an IP network
■ It forwards TCP connections over RINA flows and the other way around
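The key enabler for rina-gw is that a RINA flow is just a file descriptor, so forwarding reduces to copying between two fds. A blocking, one-way sketch of that idea (the real rina-gw uses an event loop):

#include <unistd.h>

/* Copy bytes/SDUs from one fd to the other until EOF or error; from_fd
 * and to_fd can be a TCP socket and a RINA flow fd, in either order. */
void forward_oneway(int from_fd, int to_fd)
{
    char buf[4096];
    int n;

    while ((n = read(from_fd, buf, sizeof(buf))) > 0) {
        if (write(to_fd, buf, n) != n) {
            break;
        }
    }
}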
Demo
● RINA/TCP gateway, to make the TCP/IP world interact with the RINA world
● Minimally patched Nginx web server runs over RINA
[Diagram: web browsers on client hosts 1 and 2 open TCP connections across the TCP/IP network to rina-gw on the proxy host; rina-gw maps each TCP connection to a RINA flow across the RINA network towards the patched nginx on server host 1]
Demo
[Diagram: demo setup — the browser and rina-gw run on VM B, the patched nginx on VM A; the TCP connection from the browser is forwarded over a RINA flow through n.1.DIF (normal) stacked on a shim-eth DIF (e.1.DIF)]