7. Timeline of Private Cloud History
Gen1 (2010)
Hypervisor: Xen
OS Instances: 2,000+
Management features from scratch
Gen2 (2012)
Hypervisor: VMware ESXi
OS Instances: 25,000+
Management features from scratch
Gen3 (2015)
Hypervisor: KVM
Use OpenStack API
9. Benefits
Logging enables log visualization
Makes analysis and debugging easier
From a business point of view
Shortens the time spent on troubleshooting
Leads to better customer support
10. Assumptions
Messages might be unmanageable
Growing log volume requires huge log storage
Concerns
How to handle data loss
How to parse data from different sources
13. Huge Number of Targets
Hundreds of Hypervisors (ESXi & KVM)
Tens of thousands of VMs
Cover many sorts of logs
Splunk is suited for log analytics
Need a time-series DB (InfluxDB) for performance logs
18. Huge Number of Log Files
22 log files in a single cluster
Manage logs for every Region and Availability Zone
Manage unmanageable logs
CRITICAL messages are unmanageable
Need a strong analytical storage engine
Component    # Log files
Nova         8
Keystone     1
Neutron      6
Glance       2
Cinder       5
etc.         etc.
2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
...
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder
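As a rough illustration of the parsing concern, log lines like the Cinder sample above can be split into fields with a regular expression. This is only a sketch: the layout (date, time, PID, level, logger, message) is inferred from the sample lines alone, and the function name `parse_openstack_line` and the field names are hypothetical, not part of OpenStack or Splunk.

```python
import re

# Pattern inferred only from the sample lines:
# "<date> <time> <pid> <LEVEL> <logger> <message>".
# Field names are illustrative, not an official schema.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)"
    r"(?:\s+(?P<message>.*))?$"
)


def parse_openstack_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    # Lines such as a bare "... TRACE cinder" carry no message text.
    record["message"] = record["message"] or ""
    return record
```

Lines that do not fit the pattern (for example the "..." elision) return None, so a caller can decide whether to drop them or attach them to the previous record.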
22. System Configuration
Splunk v6.4.x (as of Nov 2016)
Uses an indexer cluster and a search head cluster
Manages a large volume of data
150+ GB input size per day
30+ TB indexed data size
[Charts: input size per day; indexed data size]
31. OpenStack Hosts logs
Use the Fluentd exec plugin to get nf_conntrack_count
Metricbeat v5 for cpu, mem, diskio, filesystem, network
VMware HVs and SAN logs
Use In-house Fluentd custom plugin for getting
Output to InfluxDB and analyze in Grafana
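A minimal sketch of the kind of script a Fluentd exec source could run to collect nf_conntrack_count. It assumes the standard Linux procfs path for the counter; the function names and the JSON key names are hypothetical, and the deck does not show the actual command used.

```python
import json
import os
import time

# Standard Linux procfs location of the current connection-tracking count
# (present only when the nf_conntrack module is loaded).
CONNTRACK_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path=CONNTRACK_PATH):
    """Return the integer stored in the given file."""
    with open(path) as f:
        return int(f.read().strip())


def build_record(count, now=None):
    """Serialize one sample as a JSON line; key names are illustrative."""
    return json.dumps({
        "nf_conntrack_count": count,
        "time": int(time.time() if now is None else now),
    })


if __name__ == "__main__":
    # Print nothing when the counter is unavailable on this host.
    if os.path.exists(CONNTRACK_PATH):
        print(build_record(read_conntrack_count()))
```

Printing one JSON object per run keeps the output easy for a JSON parser on the collector side to consume.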
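33. From KVM (OpenStack)
Use libvirt Python bindings to build custom scripts
Generate JSON data and read it with the in_tail plugin

#!/usr/bin/env python
import json

import libvirt

# Read-only connection to the local hypervisor (default URI).
conn = libvirt.openReadOnly(None)
# Emit one JSON object per running domain, one per line.
for dom_id in conn.listDomainsID():
    dom = conn.lookupByID(dom_id)
    print(json.dumps({
        "uuid": dom.UUIDString(),
        "name": dom.name(),
        "id": dom.ID(),
        # vcpus()[0] is the list of per-vCPU info tuples; its length is the vCPU count.
        "vcpus": len(dom.vcpus()[0]),
    }))

From ESXi (VMware)
Get logs from vCenter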
35. Kafka Specs
Kafka v0.10.0
Runs on OpenStack with all-SSD storage
System Configuration
100 to 500 partitions and 3 replicas per topic
Back up important logs to GCS
Transfer to another Kafka cluster (if necessary)
[Diagram: Kafka, Google Cloud Storage, Kafka]
37. InfluxDB
Run InfluxDB v1.1.0 on physical servers
Post to multiple InfluxDB instances using Kafka and Fluentd
Grafana
72 dashboards for visualizing performance data
Access multiple InfluxDB instances via a load balancer
[Diagram: Kafka, Grafana]
38. Fluentd - Useful Log Collector
Fluentd can handle various log formats and makes logs easy to parse
Minimal resource usage
Redundant system
Achieve InfluxDB mirroring with Kafka and Fluentd
Minimize data loss by transporting logs to Kafka, with GCS as an additional backup
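To make the mirroring idea concrete, here is a hedged sketch of writing the same point to several InfluxDB 1.x instances. It is not the Fluentd plugin logic from the deck, which is not shown. It only builds an InfluxDB line-protocol string and pairs it with the `/write?db=...` endpoint of each instance. The helper names, host names, and database name are hypothetical, and actually sending the requests is left to the caller.

```python
def _escape(value, chars):
    """Backslash-escape each character in `chars` within str(value)."""
    out = str(value)
    for ch in chars:
        out = out.replace(ch, "\\" + ch)
    return out


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Render one point in InfluxDB line protocol."""
    if not fields:
        raise ValueError("at least one field is required")
    head = [_escape(measurement, ", ")]
    for key in sorted(tags):
        head.append("%s=%s" % (_escape(key, ",= "), _escape(tags[key], ",= ")))
    rendered = []
    for key in sorted(fields):
        value = fields[key]
        if isinstance(value, bool):        # check bool before int
            text = "true" if value else "false"
        elif isinstance(value, int):
            text = "%di" % value           # integers carry an "i" suffix
        elif isinstance(value, float):
            text = repr(value)
        else:
            # String fields: escape backslash first, then double quote.
            text = '"%s"' % _escape(value, '\\"')
        rendered.append("%s=%s" % (_escape(key, ",= "), text))
    line = ",".join(head) + " " + ",".join(rendered)
    if timestamp_ns is not None:
        line += " %d" % timestamp_ns
    return line


def mirror_writes(line, base_urls, database):
    """Return one (url, body) pair per InfluxDB instance for the same point."""
    body = line.encode("utf-8")
    return [("%s/write?db=%s" % (url.rstrip("/"), database), body)
            for url in base_urls]
```

For example, `mirror_writes(to_line_protocol("cpu", {"host": "hv01"}, {"usage": 12.5}), ["http://influx-a:8086", "http://influx-b:8086"], "perf")` yields two identical payloads, one per instance, so each database receives the full stream.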
40. Two Logging Engines
Splunk for event logs, InfluxDB for performance logs
Cover all of our requirements
Easy troubleshooting, visualization, analysis, and improvement