The information in this paper is based on my experience 'fixing' the storage consumption process for a large customer (3 PB SAN environment) in a managed services (outsourced) environment. The names have been changed to protect the innocent, but the processes, tools, and issues are very real.
Regardless of whether you are the customer (end-consumer) of storage or the manager/provider of storage, the accuracy of storage consumption reports is essential to maintaining IT’s reputation and customer trust.
In this paper we discuss the challenges of creating sustainable, reliable, and robust chargeback processes and tools for SAN storage consumption reporting.
9. Environment Size

Data Complexity
- Number of Array Migrations: 54
- NAS Filers: 15
- Number of Input Data Sources: 17
- Number of SAN Fabric Locations: 6
- ECC/StorageScope Servers: 3
- Server Count: 900

Other Logistics
- Array Count: 19 IBM, 4 HP, 36 EMC
10. Environment - Data Inputs

Description | Input Source | Input Description
Report template - Manually run | Ops Mgr | NAS Base Composite
Report template - Manually run | EMC StorageScope 6.0 | EMC NAS
Automatically generated via query | Billing database | Host Allocation
Automatically generated via query | Configuration database | EMC Configuration
Automatically generated via query | Configuration database | Clariion Configuration
Automatically generated via query | Configuration database | DS8K Configuration
Automatically gathered by shell scripts | EMC CLI | Symbcv output
Automatically generated via queries | Configuration database | Symdev list - fibre - nofibre
Report template - Manually run | EMC StorageScope 6.0 | Host_Consumption_Query
Manually created | HP-Compaq CLI | HP-Compaq Host-LUN
Automatically generated report | Asset Management Database | Host Status, LOB
Automatically generated report | Asset Management Database | Exclusion Host Names
Manually created | Capacity report | Array Location
Manually created | Backup reporting team | Backup Media Servers
Manually created | Application table | Application Correlation
Automatically generated report | Problem and Change Management Tools | Change Tickets
Manually created | Array management tools | Flash copy relationships
22. Sample Exception Report

Server Name | Location | Status | Usable GB | In Asset DB | IBM Supported | In Billing DB? | In TPC? | IBM & EMC?
server10 | Mars | | 1,280.00 | no | yes | yes | no | no
server11 | Pluto | | 1,121.00 | no | yes | yes | no | no
server12 | Mars | Production | 1,121.00 | yes | yes | yes | no | no
server13 | Pluto | Production | 851.35 | yes | yes | yes | no | no
server14 | Mars | Production | 851.35 | yes | yes | yes | no | no
server15 | Pluto | Decommissioned | 864.00 | yes | yes | yes | yes | no
server16 | Mars | Decommissioned | 1,120.00 | yes | yes | yes | yes | no
server17 | Pluto | Decommissioned | 385.00 | yes | yes | yes | no | yes
server18 | Mars | Decommissioned | 385.00 | yes | yes | yes | no | yes
server19 | Pluto | Production | 770.00 | yes | yes | yes | no | no
34. Appendix A – What is a SAN? (Diagram: servers connect through redundant A/B edge switches into A/B core switches, which link via inter-switch links to A/B storage switches and the backend storage arrays.)
Editor's notes
Background
SAN storage is included in the scope of this document. Server resources such as CPU, memory, network, or fibre channel ports are not included.
On the left hand side of the diagram there are two boxes representing server 1 and server 2 respectively. Each server has a line that connects it to the SAN fabric cloud, which is made up of various switches. The switches are then connected to the backend storage array. The storage array is composed of some number of physical disks that are logically grouped together to form some type of storage pool. This storage pool is then used to create volumes. These volumes are often referred to as LUNs and are presented to the host as SCSI targets. From a storage consumption perspective, the size of the LUN(s) associated with each server needs to be measured, tracked, and reported. The measurement can be reported by host based tools, storage array tools, or both.
We developed the following goals to help us understand when we were finished.

Process:
- Manual processes tend to create significant variances in the results. Repeatable processes drive repeatable results, which instill confidence in customers. Confident customers tend to pay their bills.
- Auditable and transparent processes provide tracking of changes throughout a system and provide the rationale for any changes that occur. They also allow anyone interested in how the results were derived to scrutinize the process and its outputs.
- At the storage consumption level the tools should apply very few changes to the actual data sources, performing only the minimal processing needed to summarize the storage consumption information in a format that is already agreed upon. The only changes to the source data were the inclusion of certain data such as storage tier, change management tickets, and asset management DB status information. Business rules such as cost/GB, application ID, and special adjustments were added later.
- The purpose of establishing processes to remediate data issues is that once a baseline has been established, any data inconsistencies can be identified and resolved on an ongoing basis.

Tools:
- Identify tool failure scenarios and configure SNMP traps to automate problem and incident creation for these failures and notify the operational team.
- Develop process documentation and SLA agreements between the operational and reporting teams that define the severity of incidents and the timing of remediations.
- Develop reporting/tracking tools that are re-usable.
- Introduce new tools to provide coverage for all storage platforms from both the host and storage array perspectives.

Data:
- Identify and report all storage that is allocated to hosts.
- Identify and secure secondary sources of data. We utilized array based information to compare with host based information. This provides a sanity check.
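As a minimal sketch of that array-versus-host sanity check, assuming simple per-server usable-GB totals from each source (server names, sizes, and status labels here are hypothetical, not from the actual environment):

```python
# Hypothetical sketch: cross-check host-agent-reported storage against
# array-reported allocations for the same servers.

def reconcile(host_gb, array_gb, tolerance_gb=1.0):
    """Compare host-reported vs array-reported usable GB per server.

    Returns server -> (host_gb, array_gb, status) where status is one of
    'ok', 'host_missing', 'array_missing', or 'mismatch'.
    """
    results = {}
    for server in sorted(set(host_gb) | set(array_gb)):
        h = host_gb.get(server)
        a = array_gb.get(server)
        if h is None:
            status = "host_missing"    # array shows storage the host agent does not
        elif a is None:
            status = "array_missing"   # host reports storage no array claims
        elif abs(h - a) > tolerance_gb:
            status = "mismatch"        # quantities disagree beyond tolerance
        else:
            status = "ok"
        results[server] = (h, a, status)
    return results

if __name__ == "__main__":
    host = {"server01": 500.0, "server02": 250.0}
    array = {"server01": 500.0, "server03": 120.0}
    for server, row in reconcile(host, array).items():
        print(server, row)
```

Anything other than 'ok' becomes a candidate line for the exception reporting described later in the paper.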
Business: One goal of this entire process is to develop tracking and reporting that meets the requirements of the business. In all cases business rules will need to be considered and agreed upon between the storage consumer and the storage purveyor. In most cases the consumer will want to see documented change management justification for any change in consumption.
There were a number of challenges to producing a complete and accurate storage consumption report, including challenges with tools, processes, and the size and complexity of the environment.

Environment: The size of the environment (3+ PB) and 900 SAN attached servers led to significant challenges in scope and complexity. There were operational challenges in maintaining active agents, and the number of data sources required to provide a complete and accurate picture of the environment was overwhelming the manual process.

Processes: A number of steady state processes were incomplete and created storage orphans: migrations, host renames, host decommissions, and HBA re-use. The absence of end-to-end validation and closed loop processes led to errors in reporting and challenges in tracking.

Tools:
- Manual processing: Due to a lack of integration with all the required data sources and a lack of automation, producing the storage consumption report took 40-80 hours per month. This introduced inconsistency and data errors.
- Reporting coverage: Only about 70% of the environment was reported on due to limitations in the tools for certain storage arrays. Agent based reporting provided only one source of data, so there was no ability to validate against a secondary source.

Business Rules:
- Deciding when storage becomes billable: upon allocation, or upon confirmation on the host?
- Deciding how to handle variances and determining justification criteria for variances.
- Determining acceptable back billing timelines.
- Determining which servers are valid for billing. In this case the only servers valid for billing were those listed as Production in the asset management tool.

Invoice: If business rules are not met, storage is not billed. Dispute resolution processes must also be considered, as well as interpretation of the invoice and sufficient detail to explain changes.
The size and complexity of the environment created some unique challenges in identifying, collecting, and automating the processing of the data sources.
In any large environment there will be a significant number of data sources. These data sources will vary depending on the storage platforms.
The purpose of this section is to describe the correct processes required to facilitate accurate storage consumption reporting, as well as to identify the impacts of broken or partially completed processes.
This slide presents a high level process diagram of the storage consumption and invoice preparation process. Actors:
- Consumption analyst – Responsible for gathering and processing technical information related to storage consumption. The consumption analyst provides the consumption report to the invoice creator.
- Invoice creator – Responsible for taking the storage consumption report and applying business rules to determine actual charges for a business unit. Delivers the invoice to the invoice consumer.
- Invoice consumer – The business unit representative responsible for approving the invoice. If they approve, payment is sent to the service provider. If they object to the invoice, a formal dispute is initiated.
The measurement adage 'garbage in = garbage out' holds for storage consumption measurement, and the purpose of the diagram is to show how other infrastructure management activities intersect with storage consumption measurement. All orange ovals represent processes that can impact storage consumption. The processes within the light green rectangle are those that can create orphan storage, such as hostname changes, decommissions, migrations, and BCV changes. The processes within the light grey rectangle represent server build processes such as host repurposes, HBA re-use, and new server builds. The processes in the green-blue rectangle represent production business-as-usual processes such as operational maintenance for the tools, SAN de-allocations, and SAN allocations. The processes within the greenish-brown rectangle are those that happen in the domain of the storage consumption processing. All processes impact the configuration of the environment, yet many can be performed on the server without any operational impact to the storage array, and the converse is also true. This can lead to situations where operational functionality is unaffected even though changes are not synchronized between the server and the storage array. The following pages step through each of these processes and identify the steps required to maintain both operational and reporting accuracy between the server and the storage array.
The SAN migration process is essentially moving storage from one storage platform to another. At its most basic it involves allocating new storage, migrating data, and removing the old storage. From a storage consumption perspective, orphans are created when SAN storage is migrated but the old storage is not removed from the SAN zoning and array. This prevents the proper release of the storage array capacity and creates both utilization and consumption reporting inaccuracies. The first step in the SAN migration process is to identify the scope of the migration: server(s) and storage array(s). The second step is to gather detailed information on the host, including WWPNs, storage array and LUNs, HBA levels, and microcode. If the server being migrated requires software/hardware updates, remediation is scheduled and fixes are applied. Otherwise the host preparation processes are initiated and changes are scheduled as appropriate. After the host sees the new storage and everything works, a solution review is performed and data is migrated. The very last step is the SAN disk de-allocation. In most environments this requires an additional change request, and in order to close this loop a check should be performed to ensure the SAN storage is de-allocated prior to closing the migration project. If either of the last two steps is not completed, both the server and the storage array will continue to show the old storage as long as the 'old' array is still powered on and supporting data collections. In addition to the validation of SAN removal, any arrays that are decommissioned as part of this process require a ticket to be opened with the data collection owners to disable collection from decommissioned arrays and remove any 'old' files.
Servers can have applications removed properly, file systems unmounted properly, cables unplugged, and can be shipped to the moon, but if the storage array and SAN zones are not updated to reflect the removal of the server and its SAN storage, orphaned storage is created. This 'orphaned' storage can result in a false reduction in the usable port and array capacity. It can also contribute to significant storage consumption confusion and potential false charges. The appropriate process is as follows:
- Conduct a requirements review to gather requirements and distribute to all staff affected by the implementation.
- Perform server validation and host preparation work to validate there are no active users or applications on the box. If the server is not ready, perform remediation steps.
- Review the solution through the pre-change-request scheduling and coordination meeting.
- Create the change for the server decommission: perform the server change, including removal of file systems, volumes, and disks on the server; power down and remove the server; validate server configuration changes.
- Open the SAN configuration change: perform SAN configuration changes (zoning removal, unmapping, unmasking, LUN wipe); perform misc SAN validations.
- Update the Asset Management and Configuration DBs to reflect the host status. The new host status should show decommissioned in the document of record to prevent accidental chargeback for resources.
If the storage array is not updated to reflect the correct server name, the server will still be able to access any SAN storage previously zoned to its HBAs. Updating the storage array information to reflect the new server name is not critical to the server's access to its SAN storage; however, hostname synchronization issues between the array and the server can lead to storage consumption inaccuracies resulting in over/under billing. One of the most difficult types of orphans to track down results from a server being renamed to the name of a server that was previously decommissioned but did not have its SAN storage removed properly. The appropriate process is as follows:
- Conduct a requirements review to gather requirements and distribute to all staff affected by the implementation.
- Perform server validation and preparation work: run fibre to the location for the SAN; review firewall changes (routes) to determine whether changes are needed for the IP addresses and/or host name.
- Conduct the pre-change-request scheduling and coordination meeting.
- Create the change for the server rename: conduct network preparation; complete the firewall changes; perform the physical move of the server; perform server changes; perform SAN zoning and host name entry changes to update the SAN configuration; perform misc SAN/application/db/server validations.
- Update the Asset Management and Configuration Databases.
BCVs can be removed on the host and not removed on the SAN without any operational impact. BCVs can also be removed on the SAN while the server has them only partially removed, such that they are not in use but are still reported as in use at the server level. This is another example of a reporting problem, and the only way to fix it is to have validation checks after changes are made. The appropriate process is as follows:
- Conduct a requirements review to gather requirements and distribute to all staff affected by the implementation.
- Perform server validation and host preparation work to validate that the BCVs to be removed are not in use. If the server is not ready, perform remediation.
- Review the solution through the pre-change-request scheduling and coordination meeting.
- Create the change for the removal of the BCVs from the server.
- Perform server changes to remove the BCVs and update the appropriate configuration files.
- Validate server configuration changes to ensure the BCVs no longer show.
- Open the SAN configuration change.
- Perform SAN configuration changes: unmapping and unmasking of BCV LUNs.
- Perform misc SAN validations.
- Update the Asset Management and Configuration DBs to synchronize host and SAN changes as appropriate.
HBAs can be removed from one server and installed in another, just as a server can be repurposed. From a SAN perspective the zoning follows the HBA/WWPN. If a server was not properly decommissioned, its SAN storage could still be available to the new server receiving the HBA. While this would not necessarily cause an operational problem, it could allow access to sensitive information, and the recipient server could be charged for SAN storage it does not need. The orphaned storage/zones were created as part of an incomplete decommission process. The appropriate process is as follows:
- Conduct a requirements review to gather requirements and distribute to all staff affected by the implementation.
- Perform server validation and host preparation work to validate there are no active SAN paths or devices for the HBA. If the server is not ready, perform remediation.
- Review the solution through the pre-change-request scheduling and coordination meeting.
- Create the change for the removal of the HBA.
- Perform the server change.
- Validate server configuration changes.
- Open the SAN configuration change.
- Perform SAN configuration changes: zoning creation, mapping, masking, LUN allocation.
- Perform misc SAN validations.
- Update the Asset Management and Configuration DBs to synchronize host and SAN changes as appropriate.
The storage change process represents the activities required to add or remove storage from a server. In the SAN change process, changes are executed on the storage array and switches (for new storage) before the changes on the server are implemented to see and use the storage. The process is similar to the other processes. For chargeback there are a couple of implications:
- Changes are made on the SAN before they are made on the server. This can create a timing issue in which storage is reported on the storage array but not on the server. Typically the business rules dictate that storage must be validated on the host before it is charged for.
- If host changes are abandoned for some reason after the SAN storage is allocated, the storage can be left in an orphaned state. While this is not a common occurrence, it does happen.
For environments relying on host agent data for storage consumption, operational discipline and maturity are critical. SNMP alerts should be sent to an alert management tool such as Tivoli TEC, where additional processing can initiate interactions with problem and change management tools to create and assign tickets to the storage resource management team's queue. Once a ticket is created, the agent remediation process can be initiated.
Monthly processing must create reports that identify discrepancies in the data, including the following:
- Servers that have been decommissioned but are still reporting SAN storage.
- Servers that report storage by the arrays but not by the host agent.
- Servers that report storage but do not have a valid host name entry in the asset management database.
- Audit exceptions, such as an increase/decrease in storage without a corresponding change ticket, or situations where decommissioned arrays show up in reporting.
- Storage quantity synchronization issues such as: servers that report more storage on the array than on the host; servers that report storage on the host that is not reported on the array; servers that report BCVs that the arrays do not report; amount discrepancies for total or individual LUN allocations between the agent based reporting and the array based reporting; and discrepancies between hostnames for matching WWPNs in array based versus agent based reporting.
As part of this process, any changes identified and resolved that will result in an additional increase in charges to a business unit require socialization of the amounts, changes, and justification prior to entering the invoice process.
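A few of these monthly checks can be sketched over a flattened per-server record; the field names and server names below are invented for illustration, not the production schema:

```python
# Illustrative sketch of a subset of the monthly exception checks.

def monthly_exceptions(servers):
    """servers: list of dicts with keys 'name', 'status', 'in_asset_db',
    'billed_gb', 'prev_billed_gb', and 'has_change_ticket'.
    Returns (server, reason) tuples for the exception report."""
    exceptions = []
    for s in servers:
        # Decommissioned servers should not still be reporting SAN storage.
        if s["status"] == "Decommissioned" and s["billed_gb"] > 0:
            exceptions.append((s["name"], "decommissioned server still reporting SAN"))
        # Reported storage requires a valid asset management entry.
        if not s["in_asset_db"] and s["billed_gb"] > 0:
            exceptions.append((s["name"], "no valid asset management entry"))
        # Month-over-month change without a change ticket is an audit exception.
        if s["billed_gb"] != s["prev_billed_gb"] and not s["has_change_ticket"]:
            exceptions.append((s["name"], "storage change without change ticket"))
    return exceptions
```

Each tuple becomes one line of the exception report for the analyst to investigate.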
The purpose of the exception report is to provide the information necessary to identify issues within the data that result from process breakdowns or data entry errors. Looking at the first example, server10, we have identified a server that has a significant amount of storage, 1,280 GB, but is not in the asset management database. For this type of case further investigation would ensue to determine whether it is an active server that needs to be added to the asset management DB, whether the server name is spelled incorrectly, or whether the server had been renamed. In the case of server12, the server is an active production server in the asset management database and is also in the billing database, but it does not have any TPC data. A ticket will probably need to be opened with the TPC team to remediate the TPC data collection. In the case of server15, the server is listed as decommissioned in the asset management database but is still reporting in the billing database and in TPC. It appears that the decommission process was broken and the server never had its storage properly decommissioned or the TPC agent uninstalled. In the case of server17 and server18, the servers are not reporting in TPC but are reporting both EMC and IBM storage. In this environment this is typical of an incomplete migration process and indicates that part of the storage was never de-allocated from the storage arrays.
My tool wish list.
Rather than focusing on a tools evaluation, I wanted to focus this section on identifying requirements for a tool:
- Support for all storage arrays and hosts in the environment.
- Flexibility – the ability to handle new data sources as identified and as required to support new requirements.
- Open vs. proprietary – the ability to utilize freely available, ubiquitous technologies, which allows for rapid development, modification, and ease of maintenance.
- Transparent vs. closed – all logic should be documented and visible to auditors.
- Sustainable – while some scripting skills might be needed, we didn't want highly specialized skills to maintain the code base.
- No duplicate charges – this requires logic in the reporting to invoice each LUN only once. All additional identifications of the same LUN for other servers should be excluded from the invoice.
- Sensible cluster naming convention – this was a subject of great debate and significant discussion, as cluster names mean different things to different job roles. To application folks, the application cluster is what a server cluster is referred to by. For AIX system administrators managing an HACMP cluster, the HACMP cluster name is what is referred to. We decided to use a loose clustering naming convention whereby any server is associated with another server if the servers share a LUN. For example, if Server A sees LUNs 1 and 2, Server B sees LUNs 2 and 3, and Server C sees LUNs 3 and 4, then by virtue of this relationship Servers A, B, and C are in a cluster. This caused some confusion, but it identified reporting problems fairly quickly, which in some cases were the result of configuration defects needing remediation.
- Ability to track SAN changes – if a variance must be justified with a change management record, then it is logical to assume that the change management tool should provide the fields required for tracking and identifying changes associated with servers for a given time period.
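The loose clustering rule (servers that share a LUN, transitively, form one cluster) can be sketched with a small union-find; the server and LUN names below are the illustrative ones from the Server A/B/C example, not real hosts:

```python
# Sketch of the 'loose' cluster naming convention: group servers into one
# cluster whenever they share a LUN, applied transitively.
from collections import defaultdict

def loose_clusters(server_luns):
    """server_luns: dict of server -> set of LUN ids.
    Returns a list of frozensets, one per cluster."""
    # Union-find over servers, linking any two servers that share a LUN.
    parent = {s: s for s in server_luns}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    lun_owner = {}
    for server, luns in server_luns.items():
        for lun in luns:
            if lun in lun_owner:
                union(server, lun_owner[lun])  # a shared LUN links the servers
            else:
                lun_owner[lun] = server

    clusters = defaultdict(set)
    for server in server_luns:
        clusters[find(server)].add(server)
    return [frozenset(c) for c in clusters.values()]

if __name__ == "__main__":
    # The example from the text: A sees LUNs 1,2; B sees 2,3; C sees 3,4.
    print(loose_clusters({"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}))
```

The same lun_owner pass also supports the no-duplicate-charges requirement: a LUN already assigned to one server can be excluded from the invoice lines of the others.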
Unfortunately, in our change management tool the fields required for tracking were not available. At the time of this document's creation, additional changes to the change management tools had been submitted and accepted for inclusion in future updates. The reporting logic must be able to evaluate configuration criteria and determine tier.
- Existing Server: This field should be a multi-text option list containing 2 options: YES or NO.
- Purpose of Request: This field should be a multi-text option list containing the options: Storage Allocation, Storage De-Allocation, Decommission, Re-tier, Other.
- Location or datacenter: This field should be selected from a multi-text option list of datacenter locations.
- Physical server name to SAN: The user must answer the existing server question first. If this is an existing server, this should be a searchable text field pre-filled with server names from the asset management database. This will require an additional CMI extract loaded to the DB on a nightly basis. Business rule: if it is an existing server, the server name must exist in the asset management DB.
- Tier: The tiers available from the option list should be: T1-High, T2-Medium, T3N-Low, T4N-Near Line.
- Amount of usable storage being requested: This should be a required integer field with a description/column header of: Usable Size (GB).
- Number of SAN ports: Only show this field for new servers. This should be an integer value with a default of 2. The user can adjust it.
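A hedged sketch of validating a request against these field rules; the dictionary keys and the shape of the request record are assumptions drawn from the text, not an actual change management tool schema:

```python
# Sketch: validate one storage change request against the field/business
# rules described above. Field names are invented for illustration.

VALID_TIERS = {"T1-High", "T2-Medium", "T3N-Low", "T4N-Near Line"}
VALID_PURPOSES = {"Storage Allocation", "Storage De-Allocation",
                  "Decommission", "Re-tier", "Other"}

def validate_request(req, asset_db_servers):
    """Return a list of rule violations for one change request dict."""
    errors = []
    if req.get("existing_server") not in ("YES", "NO"):
        errors.append("existing_server must be YES or NO")
    if req.get("purpose") not in VALID_PURPOSES:
        errors.append("invalid purpose of request")
    if req.get("tier") not in VALID_TIERS:
        errors.append("invalid tier")
    if not isinstance(req.get("usable_gb"), int) or req["usable_gb"] <= 0:
        errors.append("Usable Size (GB) must be a positive integer")
    # Business rule: an existing server's name must be in the asset db.
    if req.get("existing_server") == "YES" and req.get("server") not in asset_db_servers:
        errors.append("existing server name not found in asset management db")
    # SAN port count applies only to new servers, defaulting to 2.
    if req.get("existing_server") == "NO":
        req.setdefault("san_ports", 2)
    return errors
```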
In the previous consumption reporting, the storage consumption report was created manually. LOB was added in the storage consumption report and then re-created later by the billing team. This added confusion, as there were often mismatches between the storage consumption report and the invoice. There was a significant dependency on the veracity of the StorageScope data, which in turn depends on the operational discipline in the environment. There was a large list of servers that had not reported data for a significant period of time but were still included in the report. Clusters created significant confusion.
The primary changes when compared to the previous data flow are:
- TPC replaces StorageScope as a source for storage consumption information.
- EMC NAS is derived from the EMC command line interface instead of StorageScope.
- BCV information is derived directly from the arrays via CLI commands.
- Flash copy information is added to the process.
The flow, read from left to right and starting at the top, is as follows:
- NAS aggregate data is reported in NetApp Operations Manager; a CSV extract is created and emailed to the consumption analyst.
- Host agents report information on host consumption and automatically send it to the TPC server, where the data is loaded on a daily basis. A host allocation report is then extracted from TPC using a query and emailed to the consumption analyst.
- EMC NAS and EMC BCV relationships are extracted from command line tools and emailed to the consumption analyst.
- Flash copy relationships for IBM DS8000 storage arrays are derived via command line queries and emailed to the consumption analyst.
- RAID array information and relationships to LUNs are extracted via command line queries and automatically imported into a staging database. This is done for both EMC and IBM (DS8K) storage arrays. Similar information is sent to another configuration database, where data from certain arrays that are not reporting properly can be gathered to supplement the other collections.
- Device and RAID configuration information is extracted from the staging and configuration databases and emailed to the consumption analyst.
- A billing database containing customer -> server -> LUN information is fed by the staging database and used to extract host consumption reports. The host consumption reports are sent to the consumption analyst.
- Manual extracts are created based on CLI output from the HP storage arrays and emailed to the consumption analyst. A script is run against this data to create files that can be imported into the billing report.
All the other data feeds (except NAS and HP) are combined with asset management database information and problem and change management reports to create a series of reports: the billing report, the LUN detail report, and several exception reports. It is worth noting that there was not a good way to identify SAN changes. This was due to a problem with the forms in the change management tool: certain fields required to identify host changes were not required inputs for change requests. This caused, and continues to cause, significant challenges. A new change tracking tool with the required inputs is being developed but is not yet available. The billing report is a combination of the HP, NAS, and combined billing reports and is the rough draft of the invoice. At this point in the processing some basic checks should be performed:
- Validate the data to ensure that none of the decommissioned storage arrays are included in the reporting.
- Remove hosts that are decommissioned.
- Compare the array count against the previous month and identify whether any new arrays have been added or any are missing.
- Analyze the host exception reports to see if any newly identified hosts have been found.
- Apply business rules, correlate application data, and create the invoice.
- Perform QA on the invoice to ensure that every variance has appropriate justification.
- Submit the invoice.
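The first three pre-invoice checks lend themselves to a simple sketch; the array names, host names, and row shape below are made up for illustration:

```python
# Illustrative sketch of the pre-invoice sanity checks on the billing data.

def pre_invoice_checks(rows, decommissioned_arrays, prev_month_arrays):
    """rows: list of (array_id, server, status, usable_gb) billing rows.
    Returns a dict of findings for the analyst to review."""
    current_arrays = {r[0] for r in rows}
    return {
        # Decommissioned arrays must not appear in the billing data.
        "decommissioned_still_reporting":
            sorted(current_arrays & set(decommissioned_arrays)),
        # Decommissioned hosts should be removed before invoicing.
        "decommissioned_hosts":
            sorted({r[1] for r in rows if r[2] == "Decommissioned"}),
        # Month-over-month array deltas flag additions and disappearances.
        "new_arrays": sorted(current_arrays - set(prev_month_arrays)),
        "missing_arrays": sorted(set(prev_month_arrays) - current_arrays),
    }
```

Any non-empty finding would be investigated before applying business rules and creating the invoice.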
The purpose of the summary report is to provide the technical information required for billing including the vendor type, storage tier, server name, location, amount of storage and any change tickets associated with this server. The first column contains the name of the server or cluster. The second column identifies the location of the data center in which the server and storage array are located. This location should never have multiple entries as a server can only be physically located in one location. Multiple locations did come up on occasion as a result of problems with clustering or orphaned storage left by servers that had been decommissioned and shipped to another data center. The Tier column provides the tier of the storage which is used to calculate charges. The Type column refers to the vendor/manufacturer of the storage. The Raw Total is the amount of Raw disk storage in GB consumed for this server by tier and type. The Usable amount in GB provides the amount of usable disk storage consumed for this server by tier and type.
The purpose of the detailed report is to provide a breakdown of each server by LUN, with the array identifier, location, server name, tier, vendor type, LUN type, and size in raw and usable GB. This can be used to identify and resolve discrepancies. In this case the server has two types of devices: STD devices, which indicate they are sources in a source-to-BCV pairing, and BCV devices, which are the targets of the STD devices. There should be at least one BCV for every STD device.
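The STD/BCV rule is easy to check mechanically, assuming the detailed report can be flattened into (server, LUN type) rows; the server names below are hypothetical:

```python
# Sketch: flag servers whose STD (source) devices outnumber their BCV
# (target) devices, violating the at-least-one-BCV-per-STD rule.
from collections import Counter

def bcv_shortfall(devices):
    """devices: list of (server, lun_type) rows from the detailed report.
    Returns server -> shortfall for servers with more STDs than BCVs."""
    std = Counter(s for s, t in devices if t == "STD")
    bcv = Counter(s for s, t in devices if t == "BCV")
    return {s: std[s] - bcv[s] for s in std if std[s] > bcv[s]}
```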
The invoice process takes the output of the storage consumption summary report and integrates it with the business unit, application, tower, and rate elements to create an amount that is charged per server. This information can in turn be rolled up to the business unit level.
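A minimal sketch of that roll-up, with an invented per-tier rate card (the actual rates, like the cost/GB business rule mentioned earlier, are inputs agreed with the customer):

```python
# Sketch: per-server charges from usable GB and a tier rate card, then
# rolled up to the business unit. Rates and names are illustrative only.

RATE_PER_GB = {"T1-High": 1.50, "T2-Medium": 1.00,
               "T3N-Low": 0.50, "T4N-Near Line": 0.25}

def invoice_rollup(consumption, server_to_bu):
    """consumption: list of (server, tier, usable_gb) summary rows.
    Returns (per_server, per_business_unit) charge dicts."""
    per_server = {}
    for server, tier, gb in consumption:
        per_server[server] = per_server.get(server, 0.0) + gb * RATE_PER_GB[tier]
    per_bu = {}
    for server, charge in per_server.items():
        bu = server_to_bu.get(server, "UNASSIGNED")  # flag unmapped servers
        per_bu[bu] = per_bu.get(bu, 0.0) + charge
    return per_server, per_bu
```

An "UNASSIGNED" bucket surfaces servers missing a business-unit mapping, which would itself be an exception to resolve before submitting the invoice.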
Ensure there is executive sponsorship from both IT and the customers. Do not attempt to establish this type of process without the support and sponsorship of both the customer and IT, as there will be significant costs, changes, and time required of all parties. Identify a customer that is willing to work through 'true-up' processing to validate the storage allocated to each server. This process is itself a significant effort and requires some type of secondary data source, such as a host based collection, to validate the array based information. In our initial processing we did not have host agent data; we only had storage array data as a source. A very sophisticated customer had rolled his own host based data collection scripts to gather and report SAN storage consumption, which provided an immediate source for data true-up activities. Secondary customers did not have their own tools; by the time they were ready to participate in true-up activities we had established a process to harness newly deployed host agent reporting as a secondary source. Establish a baseline after all known data issues are resolved. Do not commit to strict change justification prior to data cleanup. Communicate data issues with customers and include them in the data remediation process to ensure they understand the impact of any changes in the invoice. This will avoid significant frustration and will give the customer additional time to absorb any cost increases and adjust their financial forecasts as appropriate. Establishing robust processes and tools will require significant time, energy, and communication, but will result in a customer that is confident in the invoice and willing to pay.
Questions?
Consumption and chargeback reporting are disciplines focused on identifying the assets related to specific business entities and allocating charges to those entities based on the resources being used. SAN storage consumption represents one use case within the consumption/chargeback disciplines. The focus of storage consumption is to determine the amount of storage used by each server regardless of the physical location of the storage. While the SAN environment adds some complexity to the consumption/chargeback discipline, it is beholden to the same laws that govern other types of infrastructure consumption reporting, namely, processes and tools.