1. Improving data center management operations using wireless sensor networks
Panagiotis Garefalakis and Kostas Magoutis
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH)
Heraklion, Greece
The IEEE International Conference on the Internet of Things, November 2012, Besançon, France
3. Motivation
• Challenges:
High configuration complexity
Hardware maintenance
Software changes
• Several systems have been proposed to reduce management complexity:
AutoPilot (Microsoft), SmartFrog (HP), OpenView (HP), etc.
• Several problems remain unsolved, keeping the complexity and
the cost of running a DC high.
4. Goal
• Address three important problems:
Automatically determine the physical location of servers.
Notify administrators of any location changes.
Determine the status of servers even if the network is down.
• Our solution to these problems relies on:
An auto-configuring wireless sensor network.
A distributed monitoring and management system.
5. Wireless technology used
• IEEE 802.15.4:
Low power
250 kbit/s
Range ~100 m
• Zigbee over IEEE 802.15.4:
Up to 65,536 personal area networks, with 16 channels each.
A specific role for each device (coordinator, slave).
• Two types of messages:
Transparent mode (broadcast only, simple).
API communication mode (unicast, reliable, provides RSSI).
15. Server integration
• Access to a variety of sensors:
– Temperature
– Airflow
– Power consumption
– Rack information
• Current technology: the Baseboard Management Controller (BMC)
• Intelligent Platform Management Interface (IPMI)
19. Office environment
• Server S is moved over a 2-meter
distance.
• We compare the means of the
RSSI time series before and
after the movement using an
unpaired Student's t-test.
• The mean of the time series
for the moved server shows a
statistically reliable shift.
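The test above can be sketched in a few lines: a two-sample Student's t statistic with pooled variance, computed over the RSSI samples collected before and after the suspected move. Sample values and thresholds here are illustrative, not the paper's data.

```python
from math import sqrt
from statistics import mean, variance

def unpaired_t(before, after):
    """Two-sample (unpaired) Student's t statistic with pooled variance,
    comparing RSSI means before and after a suspected server move.
    A large |t| indicates the mean RSSI has genuinely shifted."""
    n1, n2 = len(before), len(after)
    # Pooled variance across both samples.
    sp2 = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return (mean(before) - mean(after)) / sqrt(sp2 * (1 / n1 + 1 / n2))
```

A stationary server produces |t| near zero, while a moved server (RSSI dropping from around -40 dBm to around -52 dBm in this toy example) produces a very large |t|, which the management server can flag as a movement event.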
20. Data Center
• Metallic enclosures and electromagnetic interference introduce
noise.
• The management server continuously evaluates the RSSI of
messages received from all coordinators.
21. Server movement accuracy
• The coordinator is moved
over a 1.5-meter distance.
• We compare the means of the
RSSI time series before and
after the movement using an
unpaired Student's t-test.
• The mean of the time series
for the moved coordinator
shows a statistically reliable
shift.
• Known techniques can
increase accuracy using
LQI (signal filtering).
Figure: Data center topology. Groups of servers sharing a
coordinator are shown in dashed boxes. Slave Zigbee devices are omitted.
22. Use of management interface
The server state is reported as UNREACHABLE, but the server is actually UP (network partition).
The wireless sensor reports a location change.
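The correlation logic behind this slide can be sketched as a small decision function: if the wired ping fails but the server's wireless sensor still answers, the host is most likely alive behind a wired-network partition. This is a simplification; the real Nagios handler also considers soft/hard state persistence.

```python
def classify_host(ping_ok: bool, wsn_ok: bool) -> str:
    """Correlate wired ping status with wireless sensor status.
    Sketch of the event-correlation idea: a live WSN link under a
    failed ping points to a network partition, not a dead host."""
    if ping_ok:
        return "UP"
    if wsn_ok:
        return "NETWORK PARTITION"  # host alive, wired network down
    return "DOWN"
```

This is exactly the case shown above: ping reports UNREACHABLE while the sensor still reports in, so the system flags a partition instead of a host failure.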
23. Conclusions
• Extended Nagios to take advantage of an auto-configuring WSN.
Easy to deploy.
Low capital costs.
Helps administrators by:
o Collecting sensor data and monitoring status.
o Alerting them in case of location changes.
o Identifying types of failures.
o Performing sophisticated correlation of DC states.
• In line with trends in server management technology.
25. Security Considerations
128-bit symmetric-key encryption (AES).
Hardware support provided by Zigbee on top of IEEE 802.15.4.
The coordinator performs key management (trust center).
27. Implementation - Extending Nagios
• WSN plug-in for localization.

Status code | Explanation and status message
OK | The plugin was able to check the service and it appeared to be functioning properly: "Signal-Fine Distance + distance (m)"
Warning | The plugin was able to check the service, but it appeared to violate a warning threshold or was not working properly: "Signal-Low Distance + distance (m)" or "Sensor Changed Position + distance (m)"
Critical | The plugin detected that either the service was not running or it was violating a critical threshold: "Sensor Disconnected!"
Unknown | Invalid command-line arguments were supplied to the plugin, or low-level failures internal to the plugin (such as being unable to fork or to open a TCP socket) prevented it from performing the specified operation: "Unknown State!"
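The table above follows the standard Nagios plugin convention of exit codes 0/1/2/3 plus a one-line status message. The decision logic of the WSN localization plugin could be sketched as below; the threshold value and function name are illustrative assumptions, not the paper's actual code.

```python
# Nagios plugin convention: exit code 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN,
# with a one-line human-readable status message printed on stdout.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def wsn_check(distance_m, moved, connected, warn_at=1.0):
    """Return (exit_code, message) matching the status table above.
    `warn_at` (meters) is an illustrative warning threshold."""
    if not connected:
        return CRITICAL, "Sensor Disconnected!"
    if moved:
        return WARNING, "Sensor Changed Position %.1f m" % distance_m
    if distance_m > warn_at:
        return WARNING, "Signal-Low Distance %.1f m" % distance_m
    return OK, "Signal-Fine Distance %.1f m" % distance_m
```

A wrapper script would print the message and call `sys.exit()` with the code, letting Nagios drive its state machine and event handlers from the result.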
Editor's Notes
Good morning, my name is Panagiotis Garefalakis, and today I am going to present our work on improving data center operations using wireless sensor networks. This is joint work with Kostas Magoutis at ICS-FORTH in Heraklion, Greece.
Larger and larger data centers are creating significant challenges for the management staff that needs to operate them. As a result, we need large teams of system administrators with significant expertise and associated cost. Important challenges that these teams are handling include:
Several systems tame this complexity; however, problems remain.
In this work we address... Wired network: first, second, third. Integrated... Before moving further, I would like to give you some background about the technologies used in our work.
Before moving further, I would like to give some background about the technologies used in our work. Transmission range... Supports...
This is the prototype we used. It consists of a Zigbee device at the top, mounted on a Zigbee shield in the middle, and an Arduino microcontroller at the bottom. This cost us about 60 euros at retail, paying for development features; we believe this could be about 20 euros in production, and the prototype could also be made more compact.
Each plugin contains expert information about how to gauge the status of a service and relies on script execution to mine the necessary information. Remote execution of scripts is also supported via the remote plug-in executor (RPE).
Nagios periodically invokes known plugins, which report on the status of data center resources. Whenever a service changes state, a handler can be triggered to notify an administrator or to take corrective action (such as restarting a non-responding service) via script execution. State changes and other information detected by plugins are also stored in log files.
Nagios features a complex state machine. I am not going to walk you through it, but I want you to keep in mind that there are OK and not-OK states, soft and hard states, and there can be transitions based on the persistence of a failure.
During our research we came across a significant challenge: how to determine host status. Nagios supports monitoring of a specific topology, but in a real system the network evolves. If a machine fails to report back to Nagios and the route to this host is fine, Nagios cannot determine whether the machine is up or down, so it marks it as unreachable. This is a fundamental problem if you are only using the wired network to determine connectivity. We propose an auxiliary way of determining network failures by using a wireless sensor network (WSN). In reality, a system is much more complex...
In reality, a system is much more complex (Google): a typical data center with racks and server blades interconnected via a wired network.
Moving on to our system architecture. As in a typical data center, we have racks and server blades interconnected via a wired network. We attached a wireless sensor to each machine. The main component is the management server, which is equipped with a wireless sensor, a variety of plugins, and a local DB. The WSN agent is deployed at boot time via network boot. The WSN agent should be able to determine the rack it is part of; this can be achieved by subnet grouping or other techniques such as special management interfaces (smart racks). Our system goes through two different phases: auto-configuration and data collection.
The Nagios management server broadcasts initialization messages to discover all servers. Each wireless device responds via unicast (with its rack ID), and the management server is responsible for storing the MAC address in the local repository. The management server elects the new coordinator by sending a unicast message including the addresses of the slave devices. The coordinator is responsible for creating the PAN and inviting devices to join it (distributing the workload). Grouping makes it easier to collect and organize management data; moreover, we avoid overdrawing the bandwidth limit.
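The grouping and election step described in this note can be sketched as a pure function over the discovery replies. The election rule used here (lowest MAC wins) is an illustrative assumption; the note only says the management server picks a coordinator and unicasts it the slave addresses.

```python
def group_and_elect(responses):
    """Group discovered devices by rack and elect one coordinator per rack.

    `responses` is a list of (mac, rack_id) pairs, as carried by the
    unicast replies to the broadcast INIT message. Returns a plan
    mapping rack -> (coordinator_mac, sorted slave macs).
    Election by lowest MAC is an illustrative assumption."""
    racks = {}
    for mac, rack in responses:
        racks.setdefault(rack, []).append(mac)
    plan = {}
    for rack, macs in racks.items():
        coordinator = min(macs)
        plan[rack] = (coordinator, sorted(m for m in macs if m != coordinator))
    return plan
```

The management server would then unicast each elected coordinator its slave list, and the coordinator forms the PAN and invites those devices to join.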
During this phase devices exchange only unicast messages. There are two types of communication: (i) between the management server and a coordinator, and (ii) between a coordinator and its slave devices. The management server periodically unicasts specific types of queries to coordinators across racks. The coordinators are responsible for collecting and reporting data back.
Newer server technologies provide access to these sensors through the Intelligent Platform Management Interface (IPMI). The problem shown in the first image... Baseboard Management Controller (BMC)... RF capability.
For the server localization part we used a well-known technique called trilateration. We have three static reference points a, b, c, and we want to calculate the relative position of a new device (red dot) with respect to the reference points. The new point is at the intersection of the circles, and we can measure these distances by converting signal strength to meters.
One of the powerful features of our work is event correlation. Typically, when the ping service on a host fails, a handler is called. It correlates the ping status with the status of the wireless device to determine whether the host is unreachable or the wired network is down.
Finally, I am going to give you a feel for how the management interface in our system works.
Location-tracking accuracy: a 10-by-30 m office with 30 server PCs. The management server continuously evaluates the RSSI. Measurements were taken on a per-minute basis for periods of two hours. We move server S over a distance of 2 m. The Student's t-test statistic showed that the means of R2 and R3 do not change, whereas the S time series has a reliable shift (P value < .0001). The server movement event is detected accurately! A warning message is sent to the admin and an event is thrown.
In addition to the previous experiment
False positive
To conclude
In this work we extended a popular management system in a number of different ways. Our system will easily interoperate with future server technologies. This concludes my talk, and I would be happy to take any questions.
From the motivation part, I mentioned the problem of distinguishing failure types: an event handler uses reported status to detect wired network partitions. The handler is called when the ping service on a host fails. It correlates with the status of the wireless device to determine whether the host is unreachable or the wired network is down.