Machine Data to Readable Reports - System Monitoring, Alerting and Reporting - Ashley Fisher, Business Systems Analyst, University of the Sunshine Coast | ANZTLC15
Within a year, USC has enhanced various system administration tasks. Moving on from lengthy file and database interrogation, we are now running a proactive, instant alerting process in which incidents are captured and actioned before staff and students are impacted. A number of commercial, open-source and in-house tools have been used to facilitate these improvements, and sights are now set on shifting to self-healing incidents.
Delivered at Innovate and Educate: Teaching and Learning Conference by Blackboard, 24-27 August 2015 in Adelaide, Australia.
1. Machine Data to Readable Reports
System Monitoring, Alerting and Reporting
Ashley Fisher
University of the Sunshine Coast, Queensland.
4. Microsoft Windows Ahead
While this presentation focuses on Microsoft Windows Server and associated technologies, the concepts
and implementation of these systems are similar in other operating environments.
5. Underlying Infrastructure
• USC is Microsoft centric
• Servers are running on Windows Server 2008 R2
• Authentication through Active Directory
• Currently running Microsoft SQL Server 2008
7. Mediasite Infrastructure
• 2 Environments
• Total 12 Application Servers
• 2 SQL Clusters
• 8 F5 BigIP Pools
• 9.5 TB File Share Storage
• 380 Recorded Presentations per Week
• Approximately 1,100 hours of content viewed per Day
8. Monitoring Systems In Place
• Nagios
– Monitoring Server Availability
• Zabbix (Pictured Left)
– Monitoring Server Availability and
Performance
– Currently Proof of Concept
• Splunk
– Log Monitoring
9. Splunk
Splunk captures, indexes and correlates real-time data in a searchable repository from
which it can generate graphs, reports, alerts, dashboards and visualizations.
Splunk has a mission of making machine data accessible across an organization by
identifying data patterns, providing metrics, diagnosing problems and providing intelligence
for business operations. Splunk is a horizontal technology used for application management,
security and compliance, as well as business and web analytics.
https://en.wikipedia.org/wiki/Splunk
11. Blackboard Logging
• 67 Log files on each Blackboard host
» A lot of information we can use, and are using.
» A lot we’re potentially missing.
• Daily rotation of important logs
» Troubleshooting issues across multiple days is frustrating.
• Logs archived Monday morning, weekly
» As above, however we need to unzip the archived logs to get access to the
contained information.
12. Blackboard Database
• The activity_accumulator table retains a transcript of user activity.
• We can join it to the related tables behind it to track user login times, course access
times, and individual content item interactions.
• USC rotates our activity_accumulator table data into a backup database
every 180 days.
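The kind of join described above can be sketched as follows. This is only an illustration, assuming a heavily simplified schema: the `users` and `course_main` tables and every column name here are assumptions, and an in-memory SQLite database stands in for the real Microsoft SQL Server instance.

```python
import sqlite3

# Hypothetical, simplified schema and sample rows. The real Blackboard
# tables have many more columns and may use different names and types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (pk1 INTEGER PRIMARY KEY, user_id TEXT);
CREATE TABLE course_main (pk1 INTEGER PRIMARY KEY, course_id TEXT);
CREATE TABLE activity_accumulator (
    pk1 INTEGER PRIMARY KEY,
    user_pk1 INTEGER,
    course_pk1 INTEGER,
    event_type TEXT,
    timestamp TEXT
);
INSERT INTO users VALUES (1, 'student_a');
INSERT INTO course_main VALUES (10, 'COURSE101');
INSERT INTO activity_accumulator VALUES
    (100, 1, NULL, 'LOGIN_ATTEMPT', '2015-08-01 08:55:00'),
    (101, 1, 10, 'COURSE_ACCESS', '2015-08-01 09:00:00');
""")


def user_activity(conn, user_id):
    """Return (timestamp, event_type, course_id) rows for one user, oldest first."""
    rows = conn.execute(
        """
        SELECT aa.timestamp, aa.event_type, cm.course_id
        FROM activity_accumulator AS aa
        JOIN users AS u ON u.pk1 = aa.user_pk1
        LEFT JOIN course_main AS cm ON cm.pk1 = aa.course_pk1
        WHERE u.user_id = ?
        ORDER BY aa.timestamp
        """,
        (user_id,),
    )
    return rows.fetchall()
```

The LEFT JOIN keeps events that have no course attached, such as logins, so the transcript for a user stays complete.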
13. Student Contesting Late Submission Penalty
Students are penalised by a percentage of their received grade for late assignment submissions, and from time to time students contest the penalty.
• Traditional Method of Investigation
– Database Query (activity_accumulator)
– Individual Host Log Interrogation (Repeat)
• Lots of Steps
• Time Consuming
• Room for Error or Misinterpretation
14. Student Contesting Late Submission Penalty
Students are penalised by a percentage of their received grade for late assignment submissions, and from time to time students contest the penalty.
• Intermediate Method
– Database Query (activity_accumulator)
– Log Into Splunk
– Search string:
index="blackboard_prod" "_userpk1_"
• Few Steps
• Easy Training
• Now Dashboarded
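As a small illustration of the intermediate method, the search string can be generated from the primary key returned by the database query. This sketch assumes that `_userpk1_` stands for the user's numeric pk1 wrapped in underscores; the helper name `build_user_search` is hypothetical.

```python
def build_user_search(user_pk1, index="blackboard_prod"):
    """Build the Splunk search string for one user's log entries.

    Assumes the logs record the user as "_<pk1>_" (an assumption about
    what the placeholder on the slide means).
    """
    pk1 = int(user_pk1)  # rejects anything that is not a whole number
    if pk1 <= 0:
        raise ValueError("user_pk1 must be a positive integer")
    return f'index="{index}" "_{pk1}_"'
```

For a user whose pk1 is 12345, this yields `index="blackboard_prod" "_12345_"`, which can be pasted into the Splunk search bar.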
16. Zabbix
Zabbix is an enterprise open source monitoring solution for networks and applications(…)
It is designed to monitor and track the status of various network services, servers, and other
network hardware.
• Simple checks can verify the availability and responsiveness of standard services such
as SMTP or HTTP without installing any software on the monitored host.
• A Zabbix agent can also be installed on UNIX and Windows hosts to monitor statistics
such as CPU load, network utilization, disk space, etc.
• As an alternative to installing an agent on hosts, Zabbix includes support for monitoring
via SNMP, TCP and ICMP checks, as well as over IPMI, JMX, SSH, Telnet and using
custom parameters(…)
https://en.wikipedia.org/wiki/Zabbix
18. Our Zabbix Environment
• Zabbix holds a very template-centred view of deployment.
• The approach we’ve taken is to have ‘opt-in’ templates available for hosts.
• CPU Load, Memory Use, Network Traffic/Bandwidth and HDD Space checks are in a template added to all
hosts with an agent installed.
19. Zabbix Templates
• Example Template: ‘Core Infrastructure Connectivity’.
When this template is applied to a host, the Zabbix agent on the host will ping those
end-points locally. We can see if an individual host cannot connect to the time
servers, domain controllers or our LDAP servers.
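The idea of a per-host connectivity check can be sketched outside Zabbix as below. This is not how the Zabbix agent does it: the template pings the end-points, whereas this sketch swaps in a TCP connection attempt because ICMP needs raw-socket privileges. The end-point names in the usage note are hypothetical.

```python
import socket


def check_endpoints(endpoints, timeout=2.0):
    """Try a TCP connection to each end-point from this host.

    endpoints maps a label to a (host, port) pair.
    Returns a dict of label -> True (reachable) or False (not reachable).
    """
    results = {}
    for label, (host, port) in endpoints.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[label] = True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            results[label] = False
    return results
```

A call such as `check_endpoints({"ldap": ("ldap.example.edu", 389)})` run on each host would show which individual hosts have lost sight of a core service.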
20. Blackboard and Zabbix
• We have multiple Blackboard-specific templates. One is in line with the last example,
but it watches the availability and response times of external connectors,
such as SafeAssign and Collaborate.
22. Blackboard and Zabbix
• One very powerful tool we have is JMX monitoring, which pulls information about the
Blackboard application itself.
23. Zabbix Environment Mapping
Zabbix allows you to map
relationships between nodes, showing
where problems lie and their impact.
For example, if there was a problem with file03,
the line between bbdev01 and file03
would turn red and file03’s status would
change from OK to Problem. This is an
easy way to assess what the problem
will impact.
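The impact assessment that the map gives visually can be sketched as a walk over a dependency graph. Only the link between bbdev01 and file03 comes from the slide; the second host, bbdev02, and the direction of the dependency are assumptions made for illustration.

```python
from collections import deque

# "X depends on Y" edges. bbdev01 -> file03 is from the slide;
# bbdev02 is a hypothetical second host added for the example.
DEPENDS_ON = {
    "bbdev01": ["file03"],
    "bbdev02": ["file03"],
    "file03": [],
}


def impacted_by(failed_node, depends_on):
    """Return the sorted list of nodes that directly or indirectly depend on failed_node."""
    # Invert the edges so we can walk from a node to its dependants.
    dependants = {}
    for node, targets in depends_on.items():
        for target in targets:
            dependants.setdefault(target, []).append(node)

    seen = set()
    queue = deque([failed_node])
    while queue:
        current = queue.popleft()
        for node in dependants.get(current, []):
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return sorted(seen)
```

With this data, `impacted_by("file03", DEPENDS_ON)` lists both bbdev hosts, while a failure of bbdev01 impacts nothing else.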
24. Mediasite and Zabbix
• Mediasite is really at the forefront of our monitoring through Zabbix.
• In Nagios, we currently have 5 checks per recorder in production.
• In Zabbix so far, I have 26 individual checks per recorder.
26. The Platforms in Collaboration
• The graph below shows the available space on our production
Blackboard file share for the incident.
• Emergency maintenance was carried out on the 15th to
increase the allocated disk space.
27. The Platforms in Collaboration
• An alert was set up in Splunk to let us know, in real time, when a student
submits an assessment larger than 200 MB.
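The threshold behind such an alert can be sketched in plain Python. This is not the Splunk alert itself: the event dictionaries and their `user` and `size_bytes` fields are hypothetical, and 200 MB is taken here as 200 × 1024 × 1024 bytes, which is an assumption about how the limit is measured.

```python
# 200 MB expressed in bytes (assumes binary megabytes).
THRESHOLD_BYTES = 200 * 1024 * 1024


def large_submissions(events, threshold=THRESHOLD_BYTES):
    """Return the submission events whose size exceeds the threshold.

    Each event is a dict with hypothetical 'user' and 'size_bytes' keys.
    """
    return [event for event in events if event["size_bytes"] > threshold]
```

Anything this filter returns is a candidate for an alert, since submissions of that size are the ones most likely to eat into the file share.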
28. Self-Healing?
https://mediasite.usc.edu.au/Mediasite/Play/4af80791a9784f0bb418be531d7e31671d
The above video was the only way I could think of to present this particular part.
In the video, I have the Zabbix monitoring platform on one side and a camera feed of the remote
Mediasite recorder on the other.
As illustrated in the previous slide, a few checks are deemed “self-healing”; this is one such
scenario. If the Mediasite scheduler service fails or stops, Zabbix detects that something is
not right and sends a command to the recorder to shut the software down and force a restart
of the recorder.
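The self-healing pattern described here, detect a stopped service and then trigger a remedial action, can be sketched generically as below. This is not the Zabbix configuration used at USC: `is_running` and `restart` are hypothetical callables standing in for the real service check and the remote restart command.

```python
def heal_if_stopped(is_running, restart, max_attempts=1):
    """Check a service and try to restart it if it has stopped.

    is_running: callable returning True when the service is healthy.
    restart: callable that performs the remedial action.
    Returns "ok", "healed" or "failed".
    """
    if is_running():
        return "ok"
    for _ in range(max_attempts):
        restart()
        if is_running():
            return "healed"
    # Still down after every attempt, so a person needs to be alerted.
    return "failed"
```

Capping the number of attempts keeps a broken recorder from being restarted in a loop; a "failed" result is the point at which the incident should go to a person.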