This document introduces RackHD, an open source project that automates data center hardware lifecycle management. It covers RackHD's capabilities for discovery, configuration, provisioning, firmware management, and other tasks; its integration with other tools and projects; example workflows; and future plans such as expanded hardware support, improved workflows, and additional integrations.
Ask yourselves a question: what do you want your data center to be like when you grow up?
It's a pretty safe assumption that many of us want to run our data centers like AWS, Google, or Azure.
For a few years now, we have been moving more operations to the cloud. We are starting to embrace it.
However, an in-depth article in Wired described how Dropbox was able to move away from the cloud. Apple did the same thing. There are lots of factors at play, but many companies are starting to bring it all back on-prem. There is a scale factor, and at the same time new tools are being developed: tools that allow you to run your data center the same way AWS does.
The goal is to treat our infrastructure as code. The old world was rack and stack. Then we moved on to converged infrastructure. But there has to be a next step. How do we operate infrastructure in a way that is more hands-off? You can't be like AWS unless you orchestrate at the very lowest levels. We have to be able to treat physical components as if they were virtual machines. At the end of the day, we can reduce cost by delivering a predictable workload.
The biggest problem to solve is what to do with a server after it gets rolled into the data center and fitted with power and a network connection. What happens after we press that power button? The goal is to get this server operational as fast as possible. That may mean adding it to a cluster of servers for resources, or it could mean running some bare-metal application.
What is RackHD? A high-level summary
RackHD is a technology stack for enabling automated hardware management and orchestration through cohesive APIs. It serves as an abstraction layer between other M&O layers and the underlying physical hardware.
Developers can use the RackHD APIs to incorporate RackHD functionality into a larger orchestration system or to create a user interface for managing hardware services regardless of the underlying hardware in place.
RackHD is available under the Apache 2.0 license.
RackHD has the ability to discover the existing hardware resources, catalog each component, and retrieve detailed telemetry information from each resource. The retrieved information can then be used to perform low-level hardware management tasks, such as BIOS configuration, OS installation, and firmware management.
RackHD sits between the other M&O layers and the underlying physical hardware devices. User interfaces at the higher M&O layers can request hardware services from RackHD. RackHD handles the details of connecting to and managing the hardware devices.
In a data center that contains many bare-metal machines, managing and maintaining each individual node quickly becomes time consuming and unscalable, so an automated service like RackHD is essential for managing the nodes. The primary goals of RackHD are to provide REST APIs and live data feeds that enable automated solutions for managing hardware resources. The technology and architecture are built to provide a platform-agnostic solution.
Application automation services such as Heroku or Cloud Foundry are built on infrastructure service API layers (AWS, Google Compute Engine, SoftLayer, OpenStack, and others). Those services, in turn, are often installed, configured, and managed by automation in the form of software configuration management: Puppet, Chef, Ansible, etc. To automate data center rollouts and manage racks of machines, these tools are in turn built on automation that rolls software out onto servers: Cobbler, Razor, etc.
The closer you get to hardware, the less automated systems tend to become. Cobbler and SystemImager were mainstays of early data center management tooling. Razor (or Hanlon, depending on where you're looking) expanded on that base system, supported mainly by people working to implement further automation solutions.
RackHD expands hardware management and operations capabilities beyond these mainstay features.
RackHD enables deeper and fuller automation by "playing nicely" with both existing and potential future systems. It adds to existing open source efforts by providing a significant step toward enabling converged infrastructure automation.
Discovery and Cataloging
Discovers the compute, network, and storage resources and catalogs their attributes and capabilities.
Telemetry and Genealogy
Telemetry data includes genealogical details, such as hardware revisions, serial numbers, and date of manufacture.
Device Management
Powers devices on and off. Manages the firmware, power, OS installation, and base configuration of the resources.
Configuration
Configures the hardware per application requirements. This can range from the BIOS configuration on compute devices to the port configurations in a network switch.
Provisioning
Provisions a node to support the intended application workflow, for example lays down ESXi from an image repository. Reprovisions a node to support a different workload, for example changes the ESXi platform to Bare Metal CentOS.
Firmware Management
Manages all infrastructure firmware versioning.
Logging
Log information can be retrieved for particular elements or collated into a single timeline for multiple elements within the management neighborhood.
Environmental Monitoring
Aggregates environmental data from hardware resources. The data to monitor is configurable and can include power information, component status, fan performance, and other information provided by the resource.
Fault Detection
Monitors compute and storage devices for both hard and soft faults. Performs suitable responses based on pre-defined policies.
Analytics Data
Data generated by environmental and fault monitoring can be provided to analytic tools for analysis, particularly around predictive failure.
RackHD is focused on being the lowest level of automation: it interrogates hardware in a vendor-agnostic way and provisions machines with operating systems. The API can be used to pass in data through variables in the workflow configuration, so workflows can be parameterized. Since workflows also have access to all of the SKU information and other catalogs, they can be authored to react to that information.
The real power of RackHD, therefore, is that you can develop your own workflows and use the REST API to pass in dynamic configuration details. This allows you to execute a specific sequence of arbitrary tasks that satisfy your requirements.
When creating your initial workflows, it is recommended that you use the existing workflows in our code repository to see how different actions can be performed.
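As a sketch of the parameterized-workflow pattern described above, the snippet below builds a workflow-start request against RackHD's 2.0 REST API. This is an illustrative sketch, not a verified client: the server address, node ID, graph name, and option values are hypothetical, and the `POST /api/2.0/nodes/{id}/workflows` path should be checked against the API documentation for your RackHD release.

```python
import json
from urllib import request

RACKHD = "http://rackhd.example.com:9090"  # hypothetical RackHD endpoint

def start_workflow(node_id, graph_name, options):
    """Build a POST request that starts a named workflow graph on a node.

    `options` is passed through to the workflow, which is how dynamic
    configuration is injected at run time (the parameterization above)."""
    payload = json.dumps({"name": graph_name, "options": options}).encode()
    return request.Request(
        f"{RACKHD}/api/2.0/nodes/{node_id}/workflows",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example: kick off a hypothetical CentOS install with parameterized options.
# A caller would execute it against a live server with request.urlopen(req).
req = start_workflow(
    "demo-node",
    "Graph.InstallCentOS",
    {"defaults": {"version": "7", "rootPassword": "RackHDRocks!"}},
)
print(req.full_url)  # http://rackhd.example.com:9090/api/2.0/nodes/demo-node/workflows
```

The same request shape works for any graph in the repository; only the graph name and options change, which is what makes the REST API a single entry point for arbitrary task sequences.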
As software transforms industries across the world, more companies are embracing software as a core competency to differentiate themselves with customers and capture new opportunities.
(Mobile is changing consumer access: data is generated, intelligence is gathered, new features ship, and the feedback loop is constant.)
Companies like Square, Uber, Netflix, Airbnb, and Tesla continue to possess rapidly growing private market valuations and turn the heads of executives at their industries' historical leaders. What do these innovative companies have in common? How can they go from idea to product so quickly?
• Speed of innovation
• Always-available services
• Web scale
• Mobile-centric user experiences
Enterprises are following:
Kroger: DevOps adoption with PCF and an automated build pipeline
Allstate: major IT transformation; wants to "Uberize" the insurance industry
Lockheed Martin: building apps using PCF and Spring (Java FMW)
Home Depot: software transformation; facing major competition from Amazon, so it has to deliver new capabilities quickly and efficiently
OTHER:
Businesses today are constantly pressured to adopt the myriad of technical driving forces impacting software development and delivery. These driving forces include:
• Anything as a service
• Cloud computing
• Containers
• Agile
• Automation
• DevOps
• Microservices
• Business-capability teams
• Cloud-native applications
Moving to the cloud is a natural evolution of focusing on software, and cloud-native application architectures are at the center of how these companies obtained their disruptive character.
Speed
It’s become clear that speed wins in the marketplace. Businesses that are able to innovate, experiment, and deliver software-based solutions quickly are outcompeting those that follow more traditional delivery models.
Safety
It’s not enough to go extremely fast. If you get in your car and push the pedal to the floor, eventually you’re going to have a rather expensive (or deadly!) accident. Transportation modes such as aircraft and express bullet trains are built for speed and safety. Cloud-native application architectures balance the need to move rapidly with the needs of stability, availability, and durability. It’s possible and essential to have both.
So how do we go fast and safe?
Visibility
Our architectures must provide us with the tools necessary to see failure when it happens.
Fault isolation
In order to limit the risk associated with failure, we need to limit the scope of components or features that could be affected by a failure; microservices are one common approach.
Recovery
Scale:
Rather than continuing to scale vertically, innovative companies dealt with this problem through two pioneering moves:
• Rather than continuing to buy larger servers, they horizontally scaled application instances across large numbers of cheaper commodity machines. These machines were easier to acquire (or assemble) and deploy quickly.
• Poor utilization of existing large servers was improved by virtualizing several smaller servers in the same footprint and deploying multiple isolated workloads to them.
Shovel is an application that provides a service with a set of APIs wrapping the existing RackHD and Ironic APIs, allowing users to find bare-metal compute nodes dynamically discovered by RackHD and register/unregister them with Ironic (the OpenStack Bare Metal Provisioning program). Shovel also provides a poller service that monitors compute nodes and logs errors from the System Event Log (SEL) into the Ironic database. Shovel is at a 0.1 release; full state management (i.e., support for start, stop, remove, and restart commands) is currently a work in progress.
A Shovel Horizon plugin is also provided to interface with the Shovel service. The plugin adds a new panel to the admin dashboard, called rackhd, that displays a table of all the bare-metal systems discovered by RackHD. It also allows the user to see the node catalog in a table view, register/unregister a node in Ironic, display a node's SEL, and enable/register a failover node.
(ORFS-152)
Background
Related to the V2 API's notion of enabling relationships, we have sufficient information between the existing LLDP catalog and the ability to get related switch port information (the MAC-address/switch-port table from the switch) from a remote catalog. With the combined information, we should be able to process these information sources and create the relevant "links" that represent a topology in the RackHD APIs, showing which compute servers are connected to which switches, and at which port.
Goals
A mechanism that will capture the needed data from a top-of-rack switch and combine it with LLDP catalogs or node data, where available, to create or amend the underlying data and expose the topology connections.
To be able to unplug one of those physical cables and have this mechanism update the topology correctly.
To be able to plug in a physical cable adding a second, independent network connection between a compute node and a switch, and have that connection be represented in the topology.
REST resource API outputs in the V2 API that show the linkages using the relationship structure pattern defined by the V2 API.
Defined/documented events on the AMQP bus that are sent when a topology is calculated and a change is detected: specifically, an event when a new link is formed, with the details of that link, and an event when a previously existing link is broken.
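The correlation this mechanism performs can be sketched as a join on MAC addresses: the switch tells us which MACs were seen on which port, and the node catalogs tell us which MACs belong to which compute node. The dictionary shapes and field names below are illustrative assumptions, not RackHD's actual catalog schema.

```python
def derive_topology(switch_mac_table, node_catalogs):
    """Correlate a switch's MAC-address/port table with per-node NIC MACs.

    switch_mac_table: {port_name: set of MACs seen on that port}
    node_catalogs:    {node_id: set of NIC MACs from the node's catalog}
    Returns link records describing which node is cabled to which port."""
    links = []
    for port, port_macs in sorted(switch_mac_table.items()):
        for node_id, node_macs in sorted(node_catalogs.items()):
            # A MAC present in both sets means this NIC is plugged
            # into this switch port.
            for mac in sorted(node_macs & port_macs):
                links.append({"node": node_id, "mac": mac, "switchPort": port})
    return links

# Hypothetical sample data: two switch ports, two discovered nodes
switch = {"Eth1/1": {"aa:bb:cc:00:00:01"}, "Eth1/2": {"aa:bb:cc:00:00:02"}}
nodes = {"node-1": {"aa:bb:cc:00:00:01"}, "node-2": {"aa:bb:cc:00:00:02"}}
print(derive_topology(switch, nodes))
```

Re-running this derivation after each catalog refresh, and diffing the result against the previous link set, is what would drive the link-formed and link-broken AMQP events described above.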
Extending the concepts of workflow orchestration to have more knowledge of, and (potential) access to, systems after the OS has been laid down means extending in a number of new ways: knowing about network configurations and connections, accessing the host OS via SSH, and potentially grabbing significant additional telemetry or installing additional packages. The first steps are to enable SSH access to a host OS and to reflect similar information in our data models/API resources.
Goals
add representation and appropriate schemas for NICs and networks to compute nodes and switches, including VLAN-specific interfaces
add representation of an IP address and credentials used to inquire for OS-level details
a workflow task that uses this mechanism to capture OS packages with versions and store them in a catalog
expand workflow tasks to run arbitrary SSH commands with credentials from the node (https://github.com/mscdex/ssh2, https://github.com/tsmith/node-control, https://github.com/mikeal/sequest)
expand catalogs to include an OS-level view of network connections: NICs (interface names for the OS), IP address, gateway, subnet mask, and VLAN if provided/appropriate
enable IP lookups to support mapping any data from IP addresses assigned to compute servers, so that ancillary services can know which node the data relates to
a workflow task to set an updated SSH key
a workflow task to SSH into a switch and set it into ZTP/boot mode to reset it
extend the OS.Install workflows to leverage an in-band connection to verify that the machine is responding via SSH prior to completing the OS.Install workflow
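As an illustration of the OS-level network catalog in the goals above, a capture task might run `ip -o -4 addr show` over SSH and parse iproute2's one-line-per-address output. This is a sketch under that assumption: it recovers only NIC name, IP address, and subnet mask; gateway and VLAN details would need additional sources such as `ip route`.

```python
import ipaddress

def parse_ip_addr(output):
    """Parse `ip -o -4 addr show` output (one line per address) into
    catalog entries of NIC name, IP address, and subnet mask."""
    entries = []
    for line in output.strip().splitlines():
        fields = line.split()
        # iproute2 one-line format: "2: eth0    inet 10.1.1.5/24 brd ..."
        ifname, cidr = fields[1], fields[3]
        iface = ipaddress.ip_interface(cidr)
        entries.append({
            "nic": ifname,
            "ip": str(iface.ip),
            "netmask": str(iface.network.netmask),
        })
    return entries

sample = "2: eth0    inet 10.1.1.5/24 brd 10.1.1.255 scope global eth0"
print(parse_ip_addr(sample))
```

Stored as a catalog alongside the BMC and LLDP data, output like this gives workflows the in-band view of the network that the out-of-band catalogs cannot see.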