Scaling monitoring in a large environment is challenging in several respects, including monitor customization, performance and reporting. The goal of this presentation is to share our experience at Dell monitoring a large number of servers in a constantly changing environment, with many custom monitors and new servers configured every week. Drawing on our 3 years of experience with Zabbix and Oracle, we present the positive and negative lessons from the configuration choices we made, which involved heavy use of User Macros, optimization of database queries, table partitioning and automation.
Our Processes
Server Monitoring: Windows and Linux servers
Database Monitoring: Oracle and SQL databases
ITIL Process: Incident Mgmt, Change Mgmt, Request Mgmt, …
Zabbix Admin Team: Focused on setup of monitoring
Monitoring Team (L1): Focused on watching Monitoring Incidents
Application Team (L2): Escalation for L1, define monitoring requirements
Custom Monitors: Identified by L2, created by Zabbix Admin
Baseline Monitors: Identified and created by Zabbix Admin
Global Team
Brazil
• Zabbix Admin
• Monitoring L1
• Application L2
• Developers L3
United States
• Application L2
India
• Application L2
• Developers L3
Malaysia
• Zabbix Admin
• Monitoring L1
Main Challenges
Environment Maintenances
• Frequent changes in the environment being monitored
• Issues caused by changes
Performance
• Oracle Database
• Large environment
Configuration Updates
• Constant changes on monitored items
Reporting
Table Partitioning
Our Approach - 1) Performance
Tables: HISTORY, HISTORY_LOG, HISTORY_STR, HISTORY_TEXT, HISTORY_UINT
New column: DATE_COL
- Daily partition
- Daily cleanup job (deletes old partition)
Faster queries in History
Pros:
- Keeps the size of tables under control
- Reduces housekeeping effort
Cons:
- SELECT queries do not benefit from partitioning
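The daily partition and cleanup job could be driven by a small script along these lines. This is only a sketch under stated assumptions: the slides do not show the actual job, so the partition naming scheme (P + date), the range partitioning on DATE_COL, the retention period and the function names are all hypothetical. Index handling on DROP PARTITION is left out.

```python
from datetime import date, timedelta

# Table names taken from the slide; everything else below is an assumption.
HISTORY_TABLES = ["history", "history_log", "history_str", "history_text", "history_uint"]


def partition_name(day):
    # Hypothetical naming scheme: "P" followed by YYYYMMDD.
    return "P" + day.strftime("%Y%m%d")


def daily_maintenance_ddl(today, retention_days):
    """Build the ADD/DROP PARTITION statements for one nightly run.

    Assumes each table is range-partitioned by day on DATE_COL.
    """
    tomorrow = today + timedelta(days=1)
    # Tomorrow's partition holds rows with DATE_COL < the day after tomorrow.
    upper_bound = tomorrow + timedelta(days=1)
    expired = today - timedelta(days=retention_days)
    statements = []
    for table in HISTORY_TABLES:
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION {partition_name(tomorrow)} "
            f"VALUES LESS THAN (TO_DATE('{upper_bound:%Y-%m-%d}', 'YYYY-MM-DD'))"
        )
        statements.append(
            f"ALTER TABLE {table} DROP PARTITION {partition_name(expired)}"
        )
    return statements


if __name__ == "__main__":
    for statement in daily_maintenance_ddl(date(2016, 1, 25), retention_days=30):
        print(statement)
```

Dropping a whole partition replaces row-by-row deletion of old history, which is how the cleanup job reduces housekeeping effort.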
Query Optimization
Our Approach - 1) Performance
Identify top offending queries
Debug mode in Zabbix frontend
SQL profiling tool inside Web servers
DBA Analytics
Optimize queries in code
Create new index
Apply SQL Profile
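One way to identify top offending queries from collected timings is to group executions of the same statement and rank them by total time. The sketch below assumes the timings are already available as (SQL text, seconds) pairs; that input shape and the function names are hypothetical and not taken from the slides or from any Zabbix tool.

```python
import re
from collections import defaultdict


def normalize(sql):
    """Replace literals with '?' so executions of the same statement group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def top_offenders(samples, limit=3):
    """samples: iterable of (sql_text, seconds).

    Returns [(normalized_sql, runs, total_seconds)], worst first.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for sql, seconds in samples:
        entry = totals[normalize(sql)]
        entry[0] += 1
        entry[1] += seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(sql, runs, total) for sql, (runs, total) in ranked[:limit]]


if __name__ == "__main__":
    samples = [
        ("SELECT * FROM history_uint h WHERE h.itemid='152604' ORDER BY h.clock DESC", 0.9),
        ("SELECT * FROM history_uint h WHERE h.itemid='137781' ORDER BY h.clock DESC", 0.8),
        ("SELECT hostid FROM hosts WHERE status=0", 0.05),
    ]
    for sql, runs, total in top_offenders(samples):
        print(runs, round(total, 2), sql)
```

Ranking by total time rather than average time surfaces statements that are individually cheap but run very often.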
Query Optimization
Our Approach - 1) Performance
• Web Servers
• File: /var/www/html/include/db.inc.php
• Function: DBselect
• Queries
– Last value from history with clock filter
– OLD: SELECT * FROM (SELECT * FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848 ORDER BY h.clock DESC) WHERE rownum BETWEEN 0 AND 1
– NEW: SELECT * FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848 AND h.clock = (SELECT MAX(h.clock) FROM history_uint h WHERE h.itemid='152604' AND h.clock>1453661848)
– Last value from history
– OLD: SELECT * FROM (SELECT * FROM history_uint h WHERE h.itemid='137781' ORDER BY h.clock DESC) WHERE rownum BETWEEN 0 AND 1
– NEW: SELECT * FROM history_uint h WHERE h.itemid='137781' AND h.clock = (SELECT MAX(h.clock) FROM history_uint h WHERE h.itemid='137781')
• Improvement
– Execution time (avg): 0.9 s (old) vs. 0.001 s (new)
– Hourly runs: 300k+
– Hourly savings: about 75 h of query time (across parallel executions)
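The rewrite and the savings figure can be checked with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for Oracle, which is an assumption made purely so the example runs anywhere: SQLite has no ROWNUM, so the OLD form is approximated with LIMIT 1, and the table has only the three columns needed here. The helper names are hypothetical.

```python
import sqlite3


def latest_value_old(conn, itemid, since):
    # Stand-in for the OLD form: sort all matching rows, keep the first one.
    return conn.execute(
        "SELECT itemid, clock, value FROM history_uint "
        "WHERE itemid = ? AND clock > ? ORDER BY clock DESC LIMIT 1",
        (itemid, since),
    ).fetchone()


def latest_value_new(conn, itemid, since):
    # NEW form: find MAX(clock) in a subquery and fetch only that row.
    return conn.execute(
        "SELECT itemid, clock, value FROM history_uint h "
        "WHERE h.itemid = ? AND h.clock > ? AND h.clock = ("
        "SELECT MAX(h2.clock) FROM history_uint h2 "
        "WHERE h2.itemid = ? AND h2.clock > ?)",
        (itemid, since, itemid, since),
    ).fetchone()


def hourly_savings_hours(old_seconds, new_seconds, runs_per_hour):
    # Query time saved per hour of wall-clock time, expressed in hours.
    return (old_seconds - new_seconds) * runs_per_hour / 3600


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history_uint (itemid INTEGER, clock INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO history_uint VALUES (?, ?, ?)",
    [
        (152604, 1453661900, 10),
        (152604, 1453662000, 11),
        (152604, 1453661800, 9),
        (137781, 1453662100, 5),
    ],
)

if __name__ == "__main__":
    print(latest_value_old(conn, 152604, 1453661848))  # (152604, 1453662000, 11)
    print(latest_value_new(conn, 152604, 1453661848))  # (152604, 1453662000, 11)
    print(round(hourly_savings_hours(0.9, 0.001, 300_000), 1))  # 74.9
```

Both forms return the same row here; with the slide's figures, 0.899 s saved on each of 300k hourly runs comes to roughly 75 hours of query time per hour.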
Others
Our Approach - 1) Performance
.last(0) function
Active Proxy
Items Not Supported
Actions with Delay
Passive agents
Our Approach - 2) Configuration updates
Baseline Templates
- Basic monitors, valid for all servers of that type
- Example: Windows Template with CPU Usage, Memory Usage, Disk Space monitors
- User Macros to customize thresholds per server
Generic Templates
- Specific types of monitors per template
- All Items/Triggers are the same, changing only the macro they refer to
- Example:
  - service_state[{$SVC01}]
  - service_state[{$SVC02}]
- If a server needs a new monitor, add a User Macro, link the template and enable the Item/Trigger
- Limited number of Items (covering 90% of servers)
Extended Templates
- Same concept as the Generic Templates
- Difference: number of Items/Triggers pre-configured
- Example:
  - Generic Service Template
    - 7 Items/Triggers
    - 600+ Hosts
  - Extended Service Template
    - 20 Items/Triggers
    - 30+ Hosts
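The macro-driven template idea can be illustrated with a short sketch that expands the item keys of a template for one host. It is a simplified model, not Zabbix code: the service names and function name are hypothetical, and treating "macro defined on the host" as "item enabled" is an assumption standing in for the manual enable step described on the slide.

```python
import re

# Matches user macros such as {$SVC01}.
MACRO = re.compile(r"\{\$[A-Z0-9_.]+\}")

# Seven identical items that differ only in the macro they refer to,
# mirroring the "7 Items/Triggers" of the Generic Service Template.
generic_service_template = [f"service_state[{{$SVC{i:02d}}}]" for i in range(1, 8)]


def resolve_items(template_keys, host_macros):
    """Expand user macros in item keys; skip keys whose macro the host does not define."""
    resolved = []
    for key in template_keys:
        names = MACRO.findall(key)
        if all(name in host_macros for name in names):
            resolved.append(MACRO.sub(lambda m: host_macros[m.group(0)], key))
    return resolved


if __name__ == "__main__":
    # Hypothetical host with two monitored Windows services.
    host_macros = {"{$SVC01}": "Spooler", "{$SVC02}": "W32Time"}
    print(resolve_items(generic_service_template, host_macros))
    # ['service_state[Spooler]', 'service_state[W32Time]']
```

Adding a third service to this host means defining {$SVC03} only; the template itself stays unchanged.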
Automation
Our Approach - 3) Environment Maintenances
Zabbix agent issue/installation
To manage thousands of hosts, it’s very important to fix agent issues quickly
Integration with Change Mgmt Tool
Automatically create Maintenance periods when a change is happening, which avoids alerts during code updates
Quick fix of common issues
Windows service restart, disk / partition space cleanup and others
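The Change Mgmt integration could, for example, translate a change record into a request for the Zabbix API method maintenance.create. The sketch below only builds the JSON-RPC body and sends nothing. The shape of the change record and the function name are hypothetical, and the parameter names (hostids, timeperiods, auth) follow the Zabbix API as documented around the time of this presentation; newer releases may differ, so check the API reference for the version in use.

```python
import json


def maintenance_request(change, auth_token, request_id=1):
    """Build a JSON-RPC body for maintenance.create from a change record.

    `change` is a hypothetical dict with keys: id, start_ts, end_ts, hostids.
    """
    start, end = change["start_ts"], change["end_ts"]
    if end <= start:
        raise ValueError("change window must end after it starts")
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": f"Change {change['id']}",
            "active_since": start,
            "active_till": end,
            "hostids": change["hostids"],
            # timeperiod_type 0 is a one-time period; period is in seconds.
            "timeperiods": [
                {"timeperiod_type": 0, "start_date": start, "period": end - start}
            ],
        },
        "auth": auth_token,
        "id": request_id,
    }


if __name__ == "__main__":
    change = {
        "id": "CHG001",           # hypothetical change ticket
        "start_ts": 1453661848,
        "end_ts": 1453665448,
        "hostids": ["10084"],     # hypothetical host ID
    }
    print(json.dumps(maintenance_request(change, "example-token"), indent=2))
```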
Others
Our Approach - 3) Environment Maintenances
Load Balancer Monitor
Quickly remove traffic from bad Web Server
Oracle Database
Monitor corrupted indexes, automate for quick fix
Action step delay
Wait 30min before sending event to Incident Mgmt tool
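The effect of the action step delay can be expressed as a simple rule: an incident is opened only if the problem is still unresolved once the delay has elapsed. The function below is a simplified model of that rule, not Zabbix's escalation logic; its name and parameters are hypothetical, and all times are Unix timestamps in seconds.

```python
STEP_DELAY_SECONDS = 30 * 60  # the 30 min wait from the slide


def should_open_incident(problem_since, now, recovered_at=None, delay=STEP_DELAY_SECONDS):
    """Return True if the delayed step fires for this problem.

    The step fires at problem_since + delay, and only if the problem
    had not recovered before that moment.
    """
    fire_at = problem_since + delay
    if now < fire_at:
        return False  # still inside the waiting period
    return recovered_at is None or recovered_at >= fire_at


if __name__ == "__main__":
    print(should_open_incident(0, 600))                     # False
    print(should_open_incident(0, 1800))                    # True
    print(should_open_incident(0, 2000, recovered_at=900))  # False
```

Problems that clear by themselves within the delay never reach the Incident Mgmt tool.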
Our Approach - 4) Reporting
Used
• Availability Report
– Extracted weekly by one person
– Available in shared folder for everyone
• Inventory Hosts
– Checking which groups a Host is part of
Not Used
92% of users are Zabbix Users (no access to configuration)
• IT Services and Maps
– Manual configuration
– Too many triggers (30k+)
– Too many hosts (4.5k+)
– Too many logical groups (400+)
Wish List
Maintenance flexibility
• More flexible permissions for configuring maintenances
• Allowing certain user groups to set up maintenances without modifying the configuration of the hosts
Dashboards / Reporting
• More dashboards allowing multi-group filtering
• Pre-configure report before running it (availability report)
User Macros
• Develop discovery based on User Macros, to enable dynamic setup/removal of the monitors
• User Macros on Host Groups
Templates
• Associate a template with a Host Group, so that all Hosts inside that group would be linked with that template as well
Main Take Away
Large Environment with Oracle
• Database partitioning in HISTORY tables
• User macros are really helpful for managing custom monitors
• Work with DBA to identify top offending queries, replace them in code if needed