Rlite software-architecture (1)

21 Feb 2017

  1. A RINA light implementation Vincenzo Maffione 20/02/2017
  2. Introduction (1) ● A Free and Open Source light implementation of RINA for Linux ● Implementation split between user-space and kernel-space ● KISS approach → codebase is clean and essential ● Focus: ○ basic functionality - do few things but do them well ○ stability and performance - support deployments with hundreds of nodes ○ minimality - avoid over-engineering ● Main goal: a baseline implementation for future RINA products ● Code and documentation available at
  3. Introduction (2) ● ~ 27 Klocs (not including blanks) ○ kernel-space: ~ 9 Klocs ○ user-space: ~ 18 Klocs ■ including tools and example applications ● Written mostly in C (some parts are C++ for convenience) ○ C: 14 Klocs ○ C++: 7 Klocs ● Network applications can be written in C ● Python bindings available to write network applications in Python
  4. Introduction (3) ● kernel-space is implemented as a set of out-of-tree kernel modules, which run on the unmodified Linux kernel. ○ Linux Kbuild system is used to build the modules against the running kernel ○ Build time (no parallel make): 3-15 seconds ● user-space is implemented as a set of shared libraries and programs ○ CMake is used to configure and build libraries and executables ○ Build time (no parallel make): 15-60 seconds
  5. Basic features (1) ● Applications: ○ Flow allocation and deallocation, with QoS specification ○ Application registration and unregistration ○ Data transfer ● Stack administration: ○ Creation, deletion and configuration of IPCPs ○ Registration and enrollment among IPCPs ○ Monitoring and inspection ■ inspection of IPCPs in the system ■ inspection of RIBs ■ per-flow statistics
  6. Basic features (2) ● QoS (supported through DTCP): ○ Flow control ○ Retransmission control ○ Maximum allowable gap ○ Simple token-bucket rate-limiting ● Decent performance (detailed performance plots to come) ○ About 9.5 Gbps on a 10 Gbit link without flow control and retransmission ○ About 6 Gbps on a 10 Gbit link with flow control ○ A lot of room for optimizations ● Stability indicators ○ Ran 10-day-long VM-based experiments with up to 35 nodes, two levels of normal DIFs and 50 flow allocations per second ○ Ran experiments with up to 10 levels of DIFs
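The token-bucket rate limiting listed above can be illustrated with a minimal sketch. This is not rlite code: the class name, the parameters and the injectable clock are assumptions made for the example, and the real implementation lives in the kernel datapath.

```python
import time


class TokenBucket:
    """Simple token-bucket rate limiter (illustrative, not rlite code)."""

    def __init__(self, rate_bps, burst_bytes, clock=time.monotonic):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.capacity = burst_bytes     # maximum bucket depth in bytes
        self.tokens = burst_bytes       # start with a full bucket
        self.clock = clock              # injectable clock, handy for testing
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at the bucket depth.
        now = self.clock()
        earned = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_send(self, pdu_len):
        """Return True and consume tokens if pdu_len bytes may be sent now."""
        self._refill()
        if pdu_len <= self.tokens:
            self.tokens -= pdu_len
            return True
        return False
```

A sender would call `try_send(len(pdu))` before transmitting and queue or delay the PDU when it returns False. The burst size bounds how far the flow can exceed its average rate.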
  7. Architecture overview (1) [Diagram labels: user-space: rlite-ctl, uipcps daemon, application, librlite-cdap, librlite-conf, librlite; character devices: /dev/rlite, /dev/rlite-io; kernel-space: rlite, normal, shim-eth, shim-loopback, shim-tcp4, shim-udp4, shim-hv]
  8. Architecture overview (2) ● kernel-space ○ Supports control operations ○ Implements datapath ○ Keeps state ● user-space ○ Libraries to abstract interaction with kernel-space functionalities ○ A daemon to implement management part of (many) IPCPs ○ An iproute2-like command-line tool to administer the stack ● Interactions between kernel-space and user-space only happen through character devices → therefore through file descriptors
  9. Kernel-space architecture (1) ● Supported functionalities: ○ IPCP creation, deletion and configuration (kernel keeps a per-IPCP data structure) ○ Flow (de)allocation (kernel keeps a per-flow data structure) ○ Application (un)registration (kernel keeps a data structure for each registered application) ○ RMT, DTP and DTCP components of the normal IPCP ○ Shim IPCP processes (e.g. interaction with network device drivers) ● State is maintained in kernel-space: ○ user-space can crash or be restarted at any time ○ user-space can recover state from kernel
  10. Kernel-space architecture (2) ● User-space interacts with kernel-space only through two character devices ○ /dev/rlite for control operations ○ /dev/rlite-io for data transfer and synchronization ● Consequently, interactions only happen through file descriptors ● Both are “cloning devices” ○ each open() creates a new independent kernel-space instance ● Both devices support blocking and non-blocking operation ○ Standard poll() and select() widely used with the devices
  11. Kernel-space architecture (3) ● /dev/rlite used for control operations ○ flow (de)allocation ○ Application (un)registration ○ IPCP creation, deletion and configuration ○ Management of PDU forwarding table ○ interactions between user-space and kernel-space parts of IPCPs ○ inspection and monitoring operations on flows and IPCPs ○ ...
  12. Kernel-space architecture (4) ● Control operations follow a request/response paradigm: ○ write() to the control device to submit a request message ○ Response messages (not always present) can be read through read() ● The control device is used to avoid ioctls() and netlink ○ Easier porting to other OSes (e.g. FreeBSD) ● Request and response messages are represented by packed structs and are serialized/deserialized during the user-space ←→ kernel-space transition ○ support for string (de)serialization ○ support for (apn, api, aen, aei) name (de)serialization
  13. Kernel-space architecture (5) ● /dev/rlite-io for data transfer and synchronization ○ read() ○ write() ○ select(), poll(), epoll() ● Application workflow: ○ Use the control device to allocate a flow (kernel-space object) ○ Bind the flow to a newly-created data transfer file descriptor - this is the only task performed by means of ioctl() ○ Use the data transfer file descriptor to exchange SDUs and/or wait for events ○ Close file descriptor to deallocate the associated flow ● Special binding mode to exchange management SDUs
  14. Kernel-space architecture (6) ● Usual abstract factory pattern to manage different types of IPCPs ○ normal: implementation of the regular IPCP ○ shim-loopback: supports system-local IPC, with optional queued mode to decouple TX and RX code-paths, and optional packet drop emulation ○ shim-eth: uses network device drivers to transmit and receive SDUs, sharing the device with the Linux network stack ○ shim-udp4: tunnels RINA traffic over UDP socket; mostly implemented in user-space, only data transfer is implemented in kernel-space ○ shim-tcp4: same as shim-udp4, but using a TCP socket; deprecated, since it duplicates flow-control and congestion control done in higher layers ○ shim-hv: uses VMPI devices to transmit and receive SDUs
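As a rough illustration of the factory pattern mentioned above, the sketch below keeps a registry that maps a DIF type name to the class implementing that IPCP type. All class and function names are hypothetical, and the per-type behaviour is a placeholder rather than a description of the rlite kernel modules.

```python
class Ipcp:
    """Base class: operations every IPCP type is expected to provide."""

    def __init__(self, name):
        self.name = name

    def sdu_write(self, sdu):
        raise NotImplementedError


FACTORIES = {}   # DIF type name -> class used to build IPCPs of that type


def register_factory(dif_type):
    """Class decorator that records a factory under the given type name."""
    def decorator(cls):
        FACTORIES[dif_type] = cls
        return cls
    return decorator


@register_factory("shim-loopback")
class ShimLoopback(Ipcp):
    def __init__(self, name):
        super().__init__(name)
        self.rx = []                 # SDUs delivered back to the local system

    def sdu_write(self, sdu):
        self.rx.append(sdu)          # system-local IPC: TX feeds RX directly
        return len(sdu)


@register_factory("normal")
class Normal(Ipcp):
    def sdu_write(self, sdu):
        # Placeholder only: a real normal IPCP would run DTP/DTCP and the RMT.
        return len(sdu)


def ipcp_create(name, dif_type):
    """Build an IPCP of the requested type without naming a concrete class."""
    try:
        factory = FACTORIES[dif_type]
    except KeyError:
        raise ValueError(f"no factory registered for {dif_type!r}") from None
    return factory(name)
```

With this arrangement, supporting a new IPCP type means registering one more factory. The generic creation path does not change.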
  15. Some kernel-space internals ● Reference counters widely used to manage lifetime of objects (e.g. IPCPs, flows, registered applications, PDUs) ● sk_buff-like approach to avoid copies throughout the datapath ● dynamic allocation of PDU buffers ○ The amount of header space to reserve at allocation time is precomputed by the user-space daemon, depending on the local IPCP dependency graph ● All PDU queues are limited in size to keep memory usage under control ● Deferred work (workqueues) used only when necessary, to keep latency low ○ Example: driver transmission routine directly executes in the context of an application write() system call, when possible
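The idea of reserving header space at allocation time, so that each layer can prepend its header without copying the payload, can be sketched as follows. The `PduBuffer` class is a hypothetical user-space model written for this example. It mirrors the general sk_buff-style technique, not the actual rlite buffer structure.

```python
class PduBuffer:
    """Buffer with reserved headroom, loosely modelled on the sk_buff idea."""

    def __init__(self, headroom, payload):
        # One allocation: empty headroom followed by the payload.
        self.buf = bytearray(headroom) + bytearray(payload)
        self.start = headroom           # offset of the first valid byte

    def push(self, header):
        """Prepend a header in place by moving the start offset backwards."""
        if len(header) > self.start:
            raise ValueError("not enough headroom reserved")
        self.start -= len(header)
        self.buf[self.start:self.start + len(header)] = header

    def pop(self, n):
        """Strip n header bytes on receive by moving the start offset forwards."""
        if n > len(self.buf) - self.start:
            raise ValueError("buffer shorter than header")
        header = bytes(self.buf[self.start:self.start + n])
        self.start += n
        return header

    def data(self):
        """Zero-copy view of the valid bytes (headers pushed so far + payload)."""
        return memoryview(self.buf)[self.start:]
```

If the reserved headroom covers the headers of every IPCP below the one where the SDU is written, no reallocation is needed on the transmit path. This is why the amount to reserve depends on the local IPCP dependency graph.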
  16. Architecture overview [Diagram: same architecture overview as slide 7]
  17. user-space libraries ● librlite (written in C) ○ main library, abstracts interactions with the rlite control device (/dev/rlite) ○ provides common utilities and helpers (application names, flow specification, control messages, ...) ○ provides an API for RINA applications ● Other libraries ○ librlite-conf (C): extends librlite with kernel-space IPCP management functionalities ○ librlite-cdap (C++): CDAP implementation based on Google Protocol Buffer
  18. librlite - Overview ● librlite provides API calls to interact with control device instances ○ Validation, serialization and deserialization of control messages in both directions (user → kernel, kernel → user) ● It defines a POSIX-like API for applications: ○ Reminiscent of the socket API, to ease porting of existing socket applications... ○ … yet with the full power of RINA API (QoS support and complete naming scheme) ○ Easy to learn for grown-up network developers! ○ Documentation available at ○ Other resources:
  19. librlite - Application API ● Main API calls: ○ int rina_open() → fd ■ Opens a control device instance, returning a file descriptor. ○ int rina_flow_alloc(dif_name, local_name, remote_name, flowspec, flags) → fd ■ Issues a flow allocation request and possibly waits for the associated response. Returns a file descriptor to be used for data transfer. ○ int rina_register(fd, dif_name, appl_name, flags) ■ Registers an application into a given DIF. ○ int rina_unregister(fd, dif_name, appl_name, flags) ■ Unregisters an application from a given DIF. ○ int rina_flow_accept(fd, flags) → remote_appl, flowspec ■ Waits for and possibly accepts an incoming flow request, where the destination application is one of those registered to the control device referred to by fd. Returns a file descriptor to be used for data transfer.
  20. librlite-conf ● It is the backend for the rlite-ctl stack administration tool ● Exports the management and inspection functionalities: ○ IPCP creation ○ IPCP deletion ○ IPCP configuration ○ Fetch of current flows (with related statistics) ○ Dump state of a specific flow ○ Synchronization with uipcps daemon, to wait for the user-space part of an IPCP to show up ○ ...
  21. librlite-cdap ● CDAP implementation using Google Protocol Buffer as concrete syntax ● Provides CDAP message constructors, serializers and deserializers ● Provides CDAP connections object to send and receive CDAP messages ● Each CDAP connection wraps a file descriptor ○ In this way CDAP can be used over arbitrary file descriptors ○ Primarily meant to be used with /dev/rlite-io file descriptors ○ No dependencies on other parts of rlite, can be reused as a stand-alone component
  22. Architecture overview [Diagram: same architecture overview as slide 7]
  23. Uipcps daemon - Overview ● A multi-threaded single-process daemon that implements the management part of some IPCPs ● When an IPCP is created by the kernel, the daemon gets notified, and creates the corresponding user-space IPCP (uipcp) ● For regular IPCPs, it implements: ○ Flow allocation RIB objects ○ Directory Forwarding Table RIB objects ○ Enrollment RIB objects and enrollment state machines ○ Routing RIB objects ○ Address allocation RIB objects ● For shim-udp4 IPCPs it implements UDP sockets setup and dynamic UDP port allocation ● For shim-tcp4 IPCPs it implements TCP connection setup and teardown for both client and server side (connect(), accept(), etc.)
  24. Uipcps daemon - Internals ● A custom event-loop thread for each IPCP ● An additional thread that implements a UNIX socket server to serve requests coming from the rlite-ctl tool (or other future agents) ● Abstract factory pattern to manage different types of uipcps ● Reference counters used to manage uipcps lifetime ● Subsystems: ○ UNIX socket server, written in C ○ uipcps container for generic uipcp management (creation, deletion, …), written in C ○ shim-udp4 and shim-tcp4 user-space implementation, written in C ○ normal IPCP user-space implementation, written in C++ mainly because of CDAP ● C++ code confined inside the uipcp-normal statically linked library.
  25. Uipcps daemon - Subsystems [Diagram labels: rlite-ctl, application, uipcp daemon (unix server, uipcps container, normal, shim udp4), librlite-cdap, librlite]
  26. Uipcp daemon - Event loop ● A custom event-loop on top of rlite control devices ● The event-loop thread uses select() to wait on many file descriptors ○ rlite control devices: when events happen on the control device, event-specific callbacks get executed ○ Other file descriptors: when an event is ready on one of those, a user-provided callback gets executed ● Supports timers, which can be used to execute a callback after a certain amount of time
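A select()-based loop with per-descriptor callbacks and one-shot timers, as described above, might look like the following minimal sketch. The `EventLoop` class and its method names are assumptions for illustration. A pipe stands in for an rlite control device, which is not opened here.

```python
import heapq
import os
import select
import time


class EventLoop:
    """Minimal select()-based loop with fd callbacks and one-shot timers."""

    def __init__(self):
        self.fd_cbs = {}     # fd -> callback(fd)
        self.timers = []     # heap of (deadline, sequence number, callback)
        self.seq = 0

    def add_fd(self, fd, cb):
        self.fd_cbs[fd] = cb

    def add_timer(self, delay, cb):
        self.seq += 1        # unique sequence number breaks deadline ties
        heapq.heappush(self.timers, (time.monotonic() + delay, self.seq, cb))

    def run_once(self):
        """Handle one batch of events: ready fds first, then expired timers."""
        timeout = None
        if self.timers:
            timeout = max(0.0, self.timers[0][0] - time.monotonic())
        ready = []
        if self.fd_cbs:
            ready, _, _ = select.select(list(self.fd_cbs), [], [], timeout)
        elif timeout:
            time.sleep(timeout)
        for fd in ready:
            self.fd_cbs[fd](fd)
        now = time.monotonic()
        while self.timers and self.timers[0][0] <= now:
            _, _, cb = heapq.heappop(self.timers)
            cb()


def demo():
    """Drive the loop with a pipe standing in for a control device."""
    r, w = os.pipe()
    log = []
    loop = EventLoop()
    loop.add_fd(r, lambda fd: log.append(os.read(fd, 16)))
    loop.add_timer(0.01, lambda: log.append("timer"))
    os.write(w, b"ping")
    while len(log) < 2:
        loop.run_once()
    os.close(r)
    os.close(w)
    return log
```

The select() timeout is derived from the earliest timer deadline. A single thread can therefore serve both descriptor events and timers without polling.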
  27. Uipcp daemon - Advanced features ● The uipcp-containers module keeps track of the IPCPs in the local system and the flows allocated among them ○ This information is maintained in a graph of local IPCPs ○ A node for each IPCP, an edge for each inter-IPCP flow ○ Graph used for automatic computation of: ■ per-IPCP Maximum SDU size (using the constraints provided by shim DIFs) ■ per-IPCP PCI header space to be reserved at kernel buffer allocation ○ Result of computation is pushed to the kernel for optimized operation ● Optional automatic re-enrollment triggers to create N-1 flows where they are missing
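The graph computation described above can be sketched with a small recursive walk over the local IPCP dependency graph. The IPCP names, header sizes and shim limits below are made-up example values, and the two formulas are one plausible reading of the slide, not the daemon's actual code.

```python
from functools import lru_cache

# Hypothetical local stack: IPCP name -> (own header bytes, IPCPs it runs over).
IPCPS = {
    "eth0.shim": (14, []),
    "eth1.shim": (14, []),
    "n1.normal": (20, ["eth0.shim", "eth1.shim"]),
    "n2.normal": (20, ["n1.normal"]),
}

# Hypothetical maximum SDU size offered by each shim (e.g. from the device MTU).
SHIM_MAX_SDU = {"eth0.shim": 1500, "eth1.shim": 9000}


@lru_cache(maxsize=None)
def header_space(ipcp):
    """Bytes to reserve in front of an SDU written to a flow of this IPCP."""
    own, lowers = IPCPS[ipcp]
    # Own header plus the worst case among the IPCPs underneath.
    return own + max((header_space(low) for low in lowers), default=0)


@lru_cache(maxsize=None)
def max_sdu_size(ipcp):
    """Largest SDU this IPCP can accept without exceeding any lower limit."""
    own, lowers = IPCPS[ipcp]
    if not lowers:
        return SHIM_MAX_SDU[ipcp]       # constraint provided by the shim DIF
    # The tightest lower-level limit, minus room for this IPCP's own header.
    return min(max_sdu_size(low) for low in lowers) - own
```

For the example stack, `header_space("n2.normal")` is 54 bytes and `max_sdu_size("n2.normal")` is 1460 bytes, because the 1500-byte shim is the tightest constraint below it.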
  28. rlite-ctl ● An iproute2-like command-line tool to administer and monitor IPCPs ● Functionalities: ○ IPCP creation and deletion ○ IPCP configuration ○ Registration of an IPCP to a DIF ○ Enrollment between a local IPCP and a remote IPCP ○ Show list of IPCPs ○ Show RIB of a DIF ○ Show list of flows ○ Dump state of a specific flow
  29. Common functionalities ● Common code is compiled both in user-space and kernel-space, to ease maintenance: ○ Serialization and deserialization routines of control messages across user/kernel interface ■ Table-based serialization/deserialization, adding a new message is straightforward ○ Helper functions for RINA names - (APN, API, AEN, AEI) tuples.
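Table-based serialization, as mentioned above, can be illustrated with a sketch in which one generic pair of routines is driven by a per-message layout table. The table contents, message type numbers and wire format are assumptions invented for the example. They do not reproduce the rlite message definitions.

```python
import struct

# Hypothetical layout table: message type -> (struct format of the fixed part,
# number of length-prefixed strings that follow it).
LAYOUTS = {
    1: ("<IH", 1),   # e.g. application register: event id, flags + one name
    2: ("<IB", 2),   # e.g. flow allocate: event id, QoS id + two names
}


def serialize(msg_type, fixed, strings):
    """Encode a message using only the information found in LAYOUTS."""
    fmt, nstr = LAYOUTS[msg_type]
    if len(strings) != nstr:
        raise ValueError("wrong number of strings for this message type")
    out = struct.pack("<H", msg_type) + struct.pack(fmt, *fixed)
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack("<H", len(raw)) + raw   # u16 length, then bytes
    return out


def deserialize(buf):
    """Decode a message produced by serialize()."""
    (msg_type,) = struct.unpack_from("<H", buf, 0)
    fmt, nstr = LAYOUTS[msg_type]
    fixed = struct.unpack_from(fmt, buf, 2)
    off = 2 + struct.calcsize(fmt)
    strings = []
    for _ in range(nstr):
        (n,) = struct.unpack_from("<H", buf, off)
        strings.append(buf[off + 2:off + 2 + n].decode("utf-8"))
        off += 2 + n
    return msg_type, fixed, strings
```

Adding a new message type then amounts to adding one row, for instance `LAYOUTS[3] = ("<I", 0)`, with no change to the serialization routines.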
  30. Available RINA applications ● Example applications: ○ rinaperf: multi-threaded client/server capable of parallel flow allocation, implementing basic connectivity and performance testing: ping, request-response, unidirectional bandwidth ○ rina-echo-async: single-threaded event-loop based client/server tool, capable of concurrent flow allocation and concurrent flow management ● Real applications: ○ nginx: RINA port of the popular Nginx server ○ dropbear: RINA port of the Dropbear ssh client/server ○ rina-gw: Event-loop application acting as an application gateway between a RINA network and an IP network ■ It forwards TCP connections over RINA flows and the other way around
  31. Demo ● RINA/TCP gateway, to make TCP/IP world interact with RINA world ● Minimally patched Nginx Web Server runs over RINA [Diagram labels: TCP/IP network, RINA network; client host 1 and client host 2 (web browser), proxy host (rina-gw), server host 1 (patched nginx); legend: RINA flow, TCP connection]
  32. Demo ● RINA/TCP gateway, to make TCP/IP world interact with RINA world ● Minimally patched Nginx Web Server runs over RINA [Diagram labels: VM A (patched nginx), VM B (rina-gw), browser; n.1.DIF (normal), shim-eth (e.1.DIF), TCP]