Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Case Study: University of Chicago Achieves High
Availability through a Centralized and Service
Centric Approach to IT Moni...
2 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD
© 2015 CA. All rights reserved. All trademarks referenced herein belong...
3 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD
Abstract
Learn how the IT operations team at University of Chicago buil...
4 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD
Agenda
BACKGROUND
OUR APPROACH & STRATEGY
THREE PHASES OF IMPLEMENTATIO...
Background University Of Chicago
– Founded by Rockefeller in 1890
– ~6k Undergrads, ~10k Graduate Students, ~8K Staff and ...
Background Erik Giles & Abe Shaker
• Erik Giles
– Univ. of Chicago Command Center Manager (SM Best Practice Consultant)
– ...
CA UIM @ University Of Chicago
Ensures IT services availability and reliability
• Our environment at a glance
– Running CA...
CA UIM & CA Spectrum Architecture Diagram
Since Q1FY14, Spectrum has been upgraded twice from 8.6 to 9.4.2.1
CA UIM server...
Broad vs Deep
• We have been following a strategy that brings all devices
into monitoring and then slowly increases the fi...
We Chose #2 - Here Is Why
• We can see everything in our environment.
• We never have to rework or go back on
something to...
Phased Approach
• To do “broad then deep” you have to have a clear set of phases.
1. “Ping Only” to start to get everythin...
Seven Quarters of Outages*
0
5
10
15
20
25
Q1FY14 Q2FY14 Q3FY14 Q4FY14 Q1FY15 Q2FY15 Q3FY15
xMail Email/Calendar
Wireless ...
3 Nines of Hardware Uptime (last two quarters)
13
99.600%
99.620%
99.640%
99.660%
99.680%
99.700%
99.720%
99.740%
99.760%
...
All In! (Phase 1 – Connectivity)
• Our Goal was to just get every device into spectrum via a simple Ping.
• Objectives and...
Connectivity Technical Strategy
• Top Five Technical Notes
– Hardware Firewall Issues
• ICMP blocked in most locations acr...
Getting Proactive(Phase II – SNMP)
• The goal was to get as many devices on SNMP as we had licenses and integrate the rest...
SNMP Technical Strategy
• Top Eleven Technical Notes
1. The same challenges experienced w/ ICMP across all of Phase I had ...
SNMP Technical Strategy (continued)
7. Technical challenges in dealing w/ the increase of monitoring from just ICMP to SNM...
Service Centric (Phase III – Service View)
• Our goal here is to go from a hardware centric view to a services centric vie...
CA eHealth to CA UIM
• Business Advantages
– Provides far greater monitoring capability with improved scalability
– Signif...
Synthetic Transaction Monitoring
• Synthetic Transaction Monitoring is the “cool thing” in monitoring and you cannot imple...
Achievements by uChicago
Three 9’sAvailability
Reduced Outages by
300%
ENHANCED
VISIBILTY
ACROSS THE
ORGANIZATION
BETTER
S...
23 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD
Q & A
24 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD
For More Information
To learn more, please visit:
http://cainc.to/Nv2V...
Prochain SlideShare
Chargement dans…5
×

Case Study: University of Chicago Achieves High Availability through a Centralized and Service Centric Approach to IT Monitoring

2 061 vues

Publié le

Learn how the IT operations team at University of Chicago built a centralized and service centric approach to IT infrastructure monitoring. The University of Chicago is using CA Unified Infrastructure Management (CA UIM) to implement a central, integrated approach for monitoring IT systems, applications, networks, VoIP phones, data center infrastructure and business services. Be it PBX phones or data center water chillers, they are monitoring it all centrally. As breaking down organizational silos is critical to success in this approach, learn insights and tips on how to overcome this barrier. In additional, we will talk about the experience of moving to CA UIM from CA eHealth.

For more information, please visit http://cainc.to/Nv2VOe

Publié dans : Technologie
  • Identifiez-vous pour voir les commentaires

Case Study: University of Chicago Achieves High Availability through a Centralized and Service Centric Approach to IT Monitoring

  1. 1. Case Study: University of Chicago Achieves High Availability through a Centralized and Service Centric Approach to IT Monitoring Erik Giles DevOps: Agile Ops The University of Chicago Command Center Manager DO5T17S @ErikGiles Abe Shaker The University of Chicago IT Monitoring Engineer
  2. 2. 2 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD © 2015 CA. All rights reserved. All trademarks referenced herein belong to their respective companies. The content provided in this CA World 2015 presentation is intended for informational purposes only and does not form any type of warranty. The information provided by a CA partner and/or CA customer has not been reviewed for accuracy by CA. For Informational Purposes Only Terms of this Presentation
  3. 3. 3 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD Abstract Learn how the IT operations team at University of Chicago built a centralized and service centric approach to IT infrastructure monitoring. The University of Chicago is using CA Unified Infrastructure Management (CA UIM) to implement a central, integrated approach for monitoring IT systems, applications, networks, VoIP phones, data center infrastructure and business services. Be it PBX phones or data center water chillers, they are monitoring it all centrally. As breaking down organizational silos is critical to success in this approach, learn insights and tips on how to overcome this barrier. In additional, we will talk about the experience of moving to CA UIM from CA eHealth. Erik Giles Abe Shaker University Of Chicago
  4. 4. 4 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD Agenda BACKGROUND OUR APPROACH & STRATEGY THREE PHASES OF IMPLEMENTATION COMPARING CA UIM AND CA E-HEALTH QUESTIONS 1 2 3 4 5
  5. 5. Background University Of Chicago – Founded by Rockefeller in 1890 – ~6k Undergrads, ~10k Graduate Students, ~8K Staff and Researchers – 89 Nobel Prize winners – 1st Heisman Winner in 1935 (Jay Berwanger) [anyone know the original name?] – Campus extensions in New Deli, Paris, London, and Honk Kong
  6. 6. Background Erik Giles & Abe Shaker • Erik Giles – Univ. of Chicago Command Center Manager (SM Best Practice Consultant) – Running Command Centers (and their tools since 2004) – In technology since 1996 with MS in Engineering & MBA from USC – Worked in Technical Leadership at Boeing, Orbits, Publicis, IL Tollway • Abe Shaker – Lead Reporting & Monitoring Engineer for Univ. of Chicago – Working in monitoring tool management since 2010 – In technology since 1998 with degree in Electronics – Worked at State Farm, Motorola, Dept. of Veterans Affairs, DeVry
  7. 7. CA UIM @ University Of Chicago Ensures IT services availability and reliability • Our environment at a glance – Running CA UIM since June 2011 – Currently Monitor over 5,000 devices across the globe (35K alarms) – Integrate 6 “other” monitoring tools into common window – Watched 24/7/365 by the Command Center Team – Working with 13 Infrastructure Teams
  8. 8. CA UIM & CA Spectrum Architecture Diagram Since Q1FY14, Spectrum has been upgraded twice from 8.6 to 9.4.2.1 CA UIM servers were installed on RHEL 6.6 (UIM 8.0) and were upgraded to 8.2.
  9. 9. Broad vs Deep • We have been following a strategy that brings all devices into monitoring and then slowly increases the fidelity of that monitoring in phases. • This is a trade off. Do you… 1. Get early wins by working with one supportive team to show how much can be done with a strong Enterprise Monitoring tool (such as Network or Windows). 2. Build a foundation for a complete integrated monitoring solution by building in all devices at the same time and operationalizing it as everyone comes on board.
  10. 10. We Chose #2 - Here Is Why • We can see everything in our environment. • We never have to rework or go back on something to make a new technology work. • Our culture and technology can evolve at the same time. • All the infrastructure teams have skin in the game and learn from each other.
  11. 11. Phased Approach • To do “broad then deep” you have to have a clear set of phases. 1. “Ping Only” to start to get everything in the system. (five months for us) 2. Hardware based SNMP Model to get proactive. (16 months for us) 3. Service based solution to align to management and customers. (just started) • Each phase has the following – Unique Team across leadership and technical elements – Very Specific Outcomes and Objectives (hard metrics) – A plan with templated deliverables and management commitments – Standing meetings: Steering Committee (mthly), Leadership (wkly), and technical (wkly) • Must complete each phase in order (no matter how long it takes) • Tip: Tie project to Outage reduction and SM Metric Improvements
  12. 12. Seven Quarters of Outages* 0 5 10 15 20 25 Q1FY14 Q2FY14 Q3FY14 Q4FY14 Q1FY15 Q2FY15 Q3FY15 xMail Email/Calendar Wireless Data Networking Windows Voicemail and Unified Messaging UPS UChicago Time (aka Time and Attendance) Telephone Service Storage Specialty Voice Services Service Desk Remote Site Connectivity Physical Network Connections Phones & Internet Connections Mainframe Job Scheduling Mainframe Gargoyle - Student Information System Employee Self Service (ESS, Benefits Management) DNS Management Desktop Support Delphi Planning Database Management cVPN (VPN) Virtual Private Network ComEd Power Cisco (VoIP) Chalk Learning Management System Call Center AURA - Grants Reporting (Research/Business Objects) Averaging an Outage a week as we had for years up until the monitoring effort Phase I Monitoring catching many more events and our numbers go up As we do the work for Phase II Outages are dropping considerably Phase I Production Phase II Development Phase II Production Phase I Development This slide is that sold leadership on the approach
  13. 13. 3 Nines of Hardware Uptime (last two quarters) 13 99.600% 99.620% 99.640% 99.660% 99.680% 99.700% 99.720% 99.740% 99.760% 99.780% 99.800% 99.820% 99.840% 99.860% 99.880% 99.900% 99.920% 99.940% 99.960% 99.980% 100.000% 1 2 3 4 5 6 7 8 9 10 11 12 13 Q3FY15 Hardware Uptime (to date) Average Uptime – 99.954% 99.600% 99.620% 99.640% 99.660% 99.680% 99.700% 99.720% 99.740% 99.760% 99.780% 99.800% 99.820% 99.840% 99.860% 99.880% 99.900% 99.920% 99.940% 99.960% 99.980% 100.000% 1 2 3 4 5 6 7 8 9 10 11 12 13 Q2FY15 Hardware Uptime Average Uptime – 99.960% This slide is that sold the technical teams
  14. 14. All In! (Phase 1 – Connectivity) • Our Goal was to just get every device into spectrum via a simple Ping. • Objectives and Outcomes – Capture 95% of Outages in Monitoring – Monitor all uChicago Hardware – Operationalize Monitoring with 24/7 Staff – Report on Hardware uptime and alarm counts • Business Notes – We exceeded our outage capture targets (98% outages captured on ping only) – Getting all the hardware integrated was much harder than you’d think – Letting an operational team monitor other team’s hardware was a BIG cultural shift – Spent a lot of time getting the operational side right (and consistent) – Avoid politics by focusing on the data (always have data!)
  15. 15. Connectivity Technical Strategy • Top Five Technical Notes – Hardware Firewall Issues • ICMP blocked in most locations across campus – Software Firewall Issues • IPTABLE rules needed to be put in place to allow communication from our SpectroSever – Network ACLs • ACLs had to be updated on our entire distribution layer to account for the source traffic from Spectrum – Non-routable IP space • Lots of private IP space required the setup of VPNs to allow communication – Used IPs and old DNS entries • We noticed a lot of discrepancies between the list of systems to be monitored and how Spectrum was discovering them. Large blocks of re-used IPs w/ out the proper update to DNS
  16. 16. Getting Proactive(Phase II – SNMP) • The goal was to get as many devices on SNMP as we had licenses and integrate the rest so that we could be proactive at a hardware level. • Objectives and Outcomes – Reduce Outages by 50% – Reduce Severity 1-3 Incidents by 50% overall – Operationalize responses to non-Outage events (adding Major/Minor Alarms) – Real-time dashboards and reports for all hardware teams • Business Notes – We far exceeded our outage goals with a 300% reduction in Outages (before we finished) – Zero teams wanted to commit to this goal (management intervention required) – Huge emphasis on cleaning up technical debt and following processes – Even technical folks didn’t understand the difference between “events” and “utilization” – Best work was done with technical folks worked with each other (without management)
  17. 17. SNMP Technical Strategy • Top Eleven Technical Notes 1. The same challenges experienced w/ ICMP across all of Phase I had to be re-visited w/ SNMP 2. SNMP Device Certification of custom models 3. Lack of SNMP support • Encountered some critical devices that either didn’t support SNMP or had it disabled 4. NATd IPs • Had issues getting NATd models monitored 5. No use of agents • Agents were not allowed on certain servers which limited our monitoring options 6. Lots and lots of 3rd party monitoring system integrations • Instead of natively monitoring models we have a large number of third party integrations. This presented us w/ a new set of challenges – Mapping alarm and event fields between applications – Auto-clearing – Dynamic Alarm fields – Mapping of severities – Two-way communication for note entries and the ACK check mark
  18. 18. SNMP Technical Strategy (continued) 7. Technical challenges in dealing w/ the increase of monitoring from just ICMP to SNMP • Server and Network load – Massive Increase in events 8. Architectural Diagram • Ran into issues during upgrades as OneClick was installed on our SRM server (w/ out knowing) 9. LDAP Issues • Had to abandon direct Spectrum to LDAP solution as we were having that hung-session/time-out problem. Instead EEM was upgraded and re-integrated 10.Uniformity Across User Accounts • We initially had different views for different techs in the command center which led to some alarms getting missed. Locked the initial view down on their user preferences to resolve. • Noticed there were far too many user groups w/ each having their own set of permissions. Simplified permissions to better match the logical grouping of the University 11.The local_policy.jar file • We have a number of non-technical users who do not have admin rights to their computers. The Spectrum upgrade requiring the local_policy.jar file to be placed into the Java security folder required elevated permissions. 18
  19. 19. Service Centric (Phase III – Service View) • Our goal here is to go from a hardware centric view to a services centric view by regrouping all the hardware by service and then monitor and measure by service. • Objectives and Outcomes – Three 9’s (99.9%) of uptime for all uChicago IT Services (44 minutes of downtime a month) – Command Center actually monitoring “services” not just the hardware – Vanity Dashboards by Service for Customers and Management • Business Notes – (can’t give you outcomes yet because we are just starting this phase) • Our hardware is already at Four 9’s – Application Teams are very supportive of doing this work and active participants • BUT… any shortcuts you took with Hardware at SNMP will now bite you in the a__! – Having a CMDB makes this easier but if not you’ll end up building a lot of the basics of it – People are obsessed with Synthetic Transaction Monitoring (without understanding it) – With a totally new team it feels a lot like starting over from a “buy-in” perspective
  20. 20. CA eHealth to CA UIM • Business Advantages – Provides far greater monitoring capability with improved scalability – Significantly improved look and feel with lots of dashboard customization options – Better quality and “current-ness” of technical support – Clearly the futureproof option with the CA product • Technical Challenges – There was a lot of effort into getting FW rules and ACLs updated to allow SNMP polling from the new UIM servers – Currently the EEM (single sign on) solution w/ eHealth allowed Spectrum users to view eHealth content from OneClick w/ out having to re-authenticate. No single sign on solution exists yet. • Note, CA UIM doesn’t support Unix/Linux based LDAP authentication. Windows AD or eDirectory only – Migration of any monitoring setup in eHealth (i.e. network or disk utilization) to CA UIM – CA UIM’s architecture is different than eHealth which could lead to a larger application footprint • Suggested setup could have 4 different servers at the minimum – Two UIM servers (primary/standby) – External DB server (SQL or Oracle) – UMP Server (Webserver) – We have our SNMP collector housed on a separate server as well – Work will have to be put in to duplicate any reports generated in eHealth as the default/out-of-the-box reports in CA UIM differ – A learning curve will be present as you’ll have to get use to a new interface, setup procedure, application maintenance, etc…
  21. 21. Synthetic Transaction Monitoring • Synthetic Transaction Monitoring is the “cool thing” in monitoring and you cannot implement an Integrated Central solution on local enterprise systems without getting feedback that if we just did STM then none of this would mater and it’d all be easier. (usually from your cloud loving app folks) • STM does have its place and can be a key element of your strategy – It WILL find more things than traditional monitoring will do. – It WILL give you a better understanding of the customer experience. – It IS easier to setup and get started right away. • But STM is not really a complete solution and here is why – It WON’T tell you what is actually wrong with your systems, including where the problem is. – It WON’T actually end up being cheaper because to come even close to the amount of data you can get from a traditional monitoring solution would be 5X as expensive. – It ISN’T a complete solution unless all you are looking for is connectivity and latency. • We DO plan to use STM down the line (Phase IV) – It will improve our enterprise solution by identifying issues for which we didn’t configure (yet) – It will give us a solution for Cloud-Based solution that we cannot monitor traditionally – It will show us the customer experience (connectivity and latency)
  22. 22. Achievements by uChicago Three 9’sAvailability Reduced Outages by 300% ENHANCED VISIBILTY ACROSS THE ORGANIZATION BETTER SERVICE QUALITY IMPROVED SCALABILTIY FUTURE PROOF
  23. 23. 23 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD Q & A
  24. 24. 24 © 2015 CA. ALL RIGHTS RESERVED.@CAWORLD #CAWORLD For More Information To learn more, please visit: http://cainc.to/Nv2VOe CA World ’15

×