Martin Moucka [Red Hat] | How Red Hat Uses gNMI, Telegraf and InfluxDB to Gain Network Visibility | InfluxDays NA 2021
1. How Red Hat Uses gNMI, Telegraf and InfluxDB to Gain Network Visibility
Martin Moucka - Principal Network Engineer
Red Hat
2. © 2021 InfluxData Inc. All Rights Reserved.
Agenda
• Introduction
• Scope
• Why InfluxDB?
• Architecture
• Visualizations
• Flux
3. Red Hat
The world’s leading provider of open source enterprise IT solutions
• More than 90% of the Fortune 500 use Red Hat products & solutions*
• ~13,815 employees
• 105+ offices
• 40+ countries
• The first $3 billion open source company in the world
4. Martin Moucka
Principal Network Engineer, Red Hat
● With the company for more than 7 years
● Built network automation around Ansible, using a single source of truth
● Started the transition to modern monitoring integrated with the network automation
● Tech lead of the Network Automation & Tools team
E-mail: mmoucka@redhat.com
5. Network Monitoring
Network monitoring provides insight into the network. It tracks the status of network devices (switches, routers, firewalls, etc.) and the status and performance of the network itself. It provides a graphical view of metrics (e.g. link utilization) and device status (e.g. up or down), together with alerting when something is out of service.
Key Capabilities of Network Monitoring
Performance metric visualizations. Monitor the network for performance issues and display the information visually (dashboards), so that network performance can be understood at a glance.
Network alerts. Alert on any problems that occur: discover issues from monitored data and augment alerts with relevant information that helps support teams respond quickly.
Network mapping. Visualize complex network landscapes as a map, including device and network health state.
Bandwidth monitoring. Identify where network bandwidth usage is not optimal and drive decisions to improve utilization.
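Bandwidth monitoring of this kind typically starts from interface octet counters. The following is a minimal sketch in Python, assuming two samples of a 64-bit counter (for example SNMP ifHCInOctets) and a known link speed. The function and parameter names are illustrative and do not come from the talk.

```python
COUNTER64_MAX = 2**64  # size of a 64-bit counter, used to handle wrap-around


def link_utilization(octets_prev, octets_curr, interval_s, speed_bps):
    """Return link utilization as a fraction of link speed (0.1 means 10%)."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    # Modulo arithmetic covers a single counter wrap between the two samples.
    delta_octets = (octets_curr - octets_prev) % COUNTER64_MAX
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / speed_bps
```

For example, 750,000,000 octets received over 60 seconds on a 1 Gbit/s link corresponds to 100 Mbit/s, i.e. a utilization of 0.1.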
6. Scope
• Juniper, Cisco (WLC, ASA, IOS, UCS, etc.), OpenGear, F5 and Mist
• Custom probes for synthetic monitoring
• 60+ sites
• ~ 1.6k monitored devices
• ~ 14k monitored interfaces
• 5 collectors
7. Why InfluxDB?
• Open Source with Enterprise support
• Efficient data storage
• Flexibility in integrations/languages
• Telegraf, a modular agent with support for JTI (Juniper Telemetry Interface)
• Support for an SQL-like query language
• Flux as a powerful, flexible query language
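One reason integrations are easy to write in any language is that InfluxDB accepts points in a plain-text line protocol. The sketch below shows, in Python, how a custom probe might format one point. It is an illustration only: the helper name, measurement, tags and fields are hypothetical, and the escaping covers just the common cases (commas, spaces and equals signs in keys and tags, quotes and backslashes in string fields).

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as an InfluxDB line protocol string (simplified)."""

    def esc(text, chars):
        # Backslash-escape each special character in text.
        for ch in chars:
            text = text.replace(ch, "\\" + ch)
        return text

    def field_value(value):
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integers carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    if not fields:
        raise ValueError("at least one field is required")
    head = esc(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write performance
        head += f",{esc(key, ',= ')}={esc(str(tags[key]), ',= ')}"
    body = ",".join(f"{esc(k, ',= ')}={field_value(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

In practice Telegraf or an official client library would normally do this formatting; the sketch only makes the wire format visible.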
8. Solution Architecture
[Architecture diagram. Labels recovered from the figure: Distributed Monitoring, Services / Storage, Network Devices, Telegraf/Kapacitor/InfluxDB, Probes, Visualization, Event Management, Network Automation, Troubleshooting, Alert, Event, Automation, Check / Send data, Manual intervention, Fix, Configure, Adding/Removing device, New monitored system/device.]
10. Visualizations - Immediate response
• Device detailed status
  • Interface utilization (SNMP / gNMI)
  • Interface errors (SNMP / gNMI)
  • CPU/Memory utilization (SNMP)
  • BGP neighbors status (SNMP / gNMI in progress)
  • etc.
• Site View
  • Data from probes (latency, packet loss, HTTP response time, DNS delay)
  • SLI/SLO status (Kapacitor processed + Flux query)
  • Internet link utilization (processed by Kapacitor)
  • Top talkers (from another tool via REST API)
• Wireless status
  • Statistics of WLC/APs and connected clients
15. Visualizations - Long-term planning
• Link capacity utilization
• Status page based on SLI/SLO
• Wireless AP (Cisco WLC) anomaly detection - Flux
• Compliance reporting
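The talk does not say how the SLIs behind the status page are defined. As a rough illustration, the Python sketch below assumes an availability SLI computed from probe success counts and compares it with an SLO target. The function name, the 99.9% default target and the error-budget output are all assumptions, not details from the deck.

```python
def sli_status(successes, total, slo_target=0.999):
    """Summarize an availability SLI against an SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    sli = successes / total
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total
    failures = total - successes
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        # 1.0 means an untouched budget; negative means it is overspent.
        "error_budget_left": 1.0 - failures / allowed_failures,
    }
```

With 9,995 successful probes out of 10,000 and a 99.9% target, the SLO is met and about half of the error budget remains.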
18. Flux
• Provides a very flexible, programmatic way of querying
• Allows changing data types within a query
• Within the compliance report, we combine up to 5 different measurements
• Used for access point poor-SNR anomaly detection across regions
  • Focus where it matters most
• Allows custom functions
  • Median Absolute Deviation used for anomaly detection
  • Well documented at https://www.influxdata.com/blog/anomaly-detection-with-median-absolute-deviation/
19. Median Absolute Deviation - Function
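import "math"
import "experimental"

// mad labels values that deviate strongly from the per-timestamp median.
mad = (table=<-, threshold=3.0) => {
    // Group all series by timestamp so the median is taken across series.
    data = table |> group(columns: ["_time"], mode: "by")
    med = data |> median(column: "_value")

    // Absolute deviation of each value from the median: |x_i - median(x)|
    diff = join(tables: {data: data, med: med}, on: ["_time"], method: "inner")
        |> map(fn: (r) => ({ r with _value: math.abs(x: r._value_data - r._value_med) }))
        |> drop(columns: ["_start", "_stop", "_value_med", "_value_data"])

    // Scale factor that makes MAD comparable to a standard deviation for normally distributed data.
    k = 1.4826

    // MAD = k * median(|x_i - median(x)|); timestamps where MAD is zero are dropped.
    diff_med = diff
        |> median(column: "_value")
        |> map(fn: (r) => ({ r with MAD: k * r._value }))
        |> filter(fn: (r) => r.MAD > 0.0)

    // Score each deviation and label it. Note: the divisor is the unscaled
    // median deviation (_value_diff_med), not the MAD column.
    output = join(tables: {diff: diff, diff_med: diff_med}, on: ["_time"], method: "inner")
        |> map(fn: (r) => ({ r with _value: r._value_diff / r._value_diff_med }))
        |> map(fn: (r) => ({ r with level: if r._value >= threshold then "anomaly" else "normal" }))

    return output
}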
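To make the steps of the function easier to follow, here is a Python sketch of the same logic for the values of a single timestamp. It mirrors the Flux code above, including its choice to divide by the unscaled median deviation. The function name and the sample values are illustrative only.

```python
from statistics import median

K = 1.4826  # scale factor for normally distributed data


def mad_levels(values, threshold=3.0):
    """Return (value, score, level) tuples for the values of one timestamp."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    median_deviation = median(deviations)
    mad = K * median_deviation
    if mad <= 0.0:
        # Mirrors the MAD > 0 filter: with no spread there is nothing to score.
        return []
    results = []
    for value, deviation in zip(values, deviations):
        score = deviation / median_deviation
        level = "anomaly" if score >= threshold else "normal"
        results.append((value, score, level))
    return results
```

For the values 10, 11, 12, 13 and 100, the median is 12 and the median deviation is 1, so only 100 (score 88) is labeled an anomaly at the default threshold.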
20. Median Absolute Deviation - Usage
// Poor-SNR client ratio per access point, combined with how long the ratio stays high.
pc_duration = from(bucket: "XXXXXX")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) =>
        r._measurement == "bsnAPTable" and
        r._field =~ /radio1PoorSNRClients|radio1Users/ and
        r.region == "${region}"
    )
    // One row per timestamp, with both fields as columns.
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> filter(fn: (r) => r.radio1PoorSNRClients > 0 and r.radio1Users > 0)
    // CNPR: poor-SNR clients divided by connected users.
    |> map(fn: (r) => ({ r with CNPR: float(v: r.radio1PoorSNRClients) / float(v: r.radio1Users) }))
    // Track how long CNPR has stayed at or above 10%.
    |> stateDuration(fn: (r) => r.CNPR >= 0.1, column: "duration")
    |> map(fn: (r) => ({ r with _value: float(v: r.duration) / float(v: r.CNPR) }))
    |> filter(fn: (r) => r._value > 0)
    |> truncateTimeColumn(unit: 1h)
    |> toFloat()

// Count anomalous records per access point, then merge the counts into one table.
pc_duration
    |> mad(threshold: 10.0)
    |> filter(fn: (r) => r.level == "anomaly")
    |> group(columns: ["APName"])
    |> count()
    |> group()
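The effect of the stateDuration step and the later `_value > 0` filter can be sketched in Python. This assumes the documented behaviour of Flux stateDuration: the duration is -1 when the predicate is false, 0 at the first matching record, and otherwise the time since the current run of matching records began. The function name and sample data are illustrative.

```python
def state_duration(points, predicate):
    """Return per-point durations in seconds for (timestamp_s, value) points.

    points must be sorted by timestamp. The result is -1 where the predicate
    is false, otherwise the seconds elapsed since the current run began.
    """
    durations = []
    run_start = None
    for timestamp, value in points:
        if predicate(value):
            if run_start is None:
                run_start = timestamp  # first matching point of a new run
            durations.append(timestamp - run_start)
        else:
            run_start = None  # a non-matching point resets the state
            durations.append(-1)
    return durations
```

Under that assumption, the `_value > 0` filter in the query drops both the records where the ratio is below 10% (duration -1) and the first record of each run (duration 0), so only ratios that stay high across at least two consecutive samples reach the anomaly detection.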