vSphere APIs for performance monitoring
London Workshop
October – 2010
Balaji Parimi, Staff Engineer, Ecosystem Performance, VMware, Inc.
Ravi Soundararajan, Senior Staff Engineer, Performance, VMware, Inc.
Motivation
To debug performance, why deal with this...?
Motivation
When you can deal with this instead?
More motivation
Why look at data like this…?
Before memhog: no guest swapping. After memhog: guest swaps, but host does not!
More motivation
When you can look at it like this?
Even more motivation…
Why compare resource pool performance like this…?
Even more motivation
When you can compare them like this…?
Why?
vSphere gives you awesome, helpful charts
But you don’t have to rely solely on these charts
Do you want to learn how to make your own charts?
• Keep watching
Goal
Teach you how to use our APIs for performance monitoring
Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?
Useful stats
Basics of performance monitoring in virtual infrastructure
• Find underperforming resources
• Find overcommitted resources
• Identify issues due to resource sharing among VMs
Resources we will look at
• CPU
• Memory
• Disk
• Network
Resources that we often look at: CPU
CPU basics
[Diagram: an ESX host with four physical CPUs (CPU0 to CPU3) scheduling VM0 to VM6]
VM states:
• Run: accumulating used time
• Ready: wants to run, no physical CPU available
• Wait/Idle: blocked on I/O or voluntarily descheduled
Why is my VM slow?
CPU saturated (cpu.usage.average)
Ready time? (cpu.ready.summation)
Latency to be swapped in? (cpu.swapwait.summation)
CPU saturation
• 2 vCPUs at 2.2GHz per CPU: ~4.4GHz used (look at the left y-axis)
Small ready time
• Ready time for vCPU1: 150ms (the right y-axis is relevant)
• Real-time chart: refreshes every 20s
• 150ms / 20s = 0.75% (no big deal)
Now, turn on CPU burner on same host…
CPU burner
~100% of 1 vCPU
And see what happens to original VM’s ready time
SpecJBB ready time
~2000ms per 20s refresh = 10%
(P.S. SpecJBB performance dropped by 10%)
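The ready-time arithmetic on these slides can be sketched in Python. The function name `ready_percent` and its parameters are illustrative only and are not part of any vSphere API. The sketch assumes the ready-time value is in milliseconds accumulated over one 20s real-time refresh interval, as in the examples above.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a ready-time summation (ms) into a percentage of the sample interval."""
    return 100.0 * ready_ms / (interval_s * 1000.0)


print(ready_percent(150))   # 0.75  (the "no big deal" case)
print(ready_percent(2000))  # 10.0  (after the CPU burner is started)
```

Dividing by the interval keeps the figure comparable across charts with different refresh rates.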
Latency to load in VM: cpu.swapwait.summation
Sometimes there is a latency to load VM data from disk: CPU swapwait
CPU waits 20s for data to load before the VM can run!
CPU issues: Summary
CPU saturated?
High Ready time
• Problematic if it is sustained for long periods
• Sample rule of thumb: > 20% per vCPU, investigate further
• Possible contention for CPU resources among VMs
• Workload variability? Fix with VMotion/DRS
• Resource limits on VMs? Check limits, reservations, and shares
• Actual overcommitment? Fix with VMotion/DRS/more CPUs
High SwapWait time
• Consider setting memory reservation (see next section, “Memory”)
Resources that we often look at: Memory
Memory
ESX must balance memory usage
• Page sharing to reduce memory footprint of Virtual Machines
• Ballooning to relieve memory pressure in a graceful way
• Host swapping to relieve memory pressure when ballooning is insufficient
• Compression to relieve memory pressure without host-level swapping
ESX allows overcommitment of memory
• Sum of configured memory sizes of virtual machines can be greater than physical memory if working sets fit
Memory also has limits, shares, and reservations
Host swapping can cause performance degradation
Ballooning, compression, and swapping (1)
Ballooning: Memctl driver grabs pages and gives them to ESX
• Guest OS chooses pages to give to memctl (avoids “hot” pages if possible): either free pages or pages to swap
• Unused pages are given directly to memctl
• Pages to be swapped are first written to the swap partition within the guest OS and then given to memctl
[Diagram: VM1, VM2, memctl, ESX, and the swap partition within the guest OS. Steps: 1. Balloon, 2. Reclaim, 3. Redistribute]
Ballooning, swapping, and compression (2)
Swapping: ESX reclaims pages forcibly
• Guest doesn’t pick pages… ESX may inadvertently pick “hot” pages (possible VM performance implications)
• Pages are written to the VM swap file
[Diagram: VM1, VM2, ESX, and the VSWP file (external to guest). Steps: 1. Force Swap, 2. Reclaim, 3. Redistribute]
Ballooning, swapping, and compression (3)
Compression: ESX reclaims pages, writes to in-memory cache
• Guest doesn’t pick pages… ESX may inadvertently pick “hot” pages (possible VM performance implications)
• Pages are written to an in-memory cache, which is faster than host-level swapping
[Diagram: VM1, VM2, ESX compression cache, and the swap partition within the guest. Steps: 1. Write to Compression Cache, 2. Give pages to VM2]
Ballooning, swapping, and compression
Bottom line:
• Ballooning may occur even when there is no memory pressure, just to keep memory proportions under control
• Ballooning is preferable to compression and vastly preferable to swapping
• Guest can surrender unused/free pages
• With host swapping, ESX cannot tell which pages are unused or free and may accidentally pick “hot” pages
• Even if the balloon driver has to swap to satisfy the balloon request, the guest chooses what to swap
• Can avoid swapping “hot” pages within guest
• Compression: reading from compression cache is faster than reading from disk
Swapping in Guest != Swapping in Host
DVDstore benchmark: SQL DB benchmark… uses lots of memory
About to start memory hogger program in guest
Force Guest swapping: No Host-level swapping
Before memhog: no guest swapping. After memhog: guest swaps, but host does not!
Viewing Host-level swapping with performance charts
Setup: 2 VMs…one dvdstore, one memhog, competing for host memory
Host swaps out dvdstore VM memory to fulfill memhog VM requests
Host swaps in dvdstore VM memory to fulfill dvdstore VM requests
Using Swap Rate Counters: Remember CPU SwapWait?
cpu.swapwait.summation: CPU is waiting for memory to be swapped in
Absolute Swap Counters…
Swapin, swapout (KB) show some activity but hard to detect…
And Swap Rate Counters…
SwapinRate, SwapoutRate (KBps) show activity much more clearly
Rule of thumb: host swapping > 1MBps is cause for concern
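One way to see why rate counters are easier to read is to derive a rate from the absolute counters yourself. The sketch below is an illustration, not vSphere code. It assumes the absolute swap counter is a cumulative total in KB sampled every 20 seconds, and it takes 1 MBps to mean 1024 KBps. The function names and sample values are made up.

```python
def swap_rates_kbps(cumulative_kb, interval_s=20):
    """Turn successive cumulative swap totals (KB) into per-interval rates (KBps)."""
    return [(later - earlier) / interval_s
            for earlier, later in zip(cumulative_kb, cumulative_kb[1:])]


def exceeds_rule_of_thumb(rates_kbps, limit_kbps=1024):
    """Flag intervals where the swap rate is above the 1 MBps rule of thumb."""
    return [rate > limit_kbps for rate in rates_kbps]


# Hypothetical cumulative swap-in totals, one sample per 20s refresh.
samples_kb = [500_000, 500_400, 540_400, 540_400]
rates = swap_rates_kbps(samples_kb)
print(rates)                         # [20.0, 2000.0, 0.0]
print(exceeds_rule_of_thumb(rates))  # [False, True, False]
```

The cumulative totals barely change in relative terms, while the derived rate makes the burst in the second interval obvious.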
Resources that we often look at: Disk
ESX storage stack
Different latencies for local disk vs. SAN (caching, switches, etc.)
Queuing within kernel and in hardware
vSphere shows
• Total Command Latency
• Kernel Latency
• Device Latency
• Bandwidth/IOPS
Disk performance problems 101
What should I look for to figure out if disk is an issue?
• Am I getting the IOPS I expect?
• Am I getting the bandwidth (read/write) I expect?
• Are the latencies higher than I expect?
• Where is time being spent?
What are some things I can do?
• Make sure devices are configured properly (caches, queue depths)
• Use multiple adapters and multipathing
• Check networking settings (for iSCSI/NAS)
Another disk example: Slow VM power on
Trying to Power on a VM
• Sometimes, powering on VM would take 5 seconds
• Other times, powering on VM would take 5 minutes!
Where to begin?
• Powering on a VM requires disk activity on the host, so check disk metrics for the host
Let’s look at the vSphere client…
Max disk latencies range from 100ms to 1100ms… very high! Why?
(counter name: disk.maxTotalLatency.latest)
Rule of thumb: latency > 20ms is bad. Here: 1,100ms. REALLY BAD!!!
High disk latency: Mystery solved
Host events: disk has connectivity issues, causing high latencies!
Bottom line: monitor disk latencies; issues may not be related to virtualization!
Resources that we often look at: Network
Network performance problems 101
What should I look for to figure out if network is an issue?
• Am I getting the packet rate that I expect?
• Am I getting the bandwidth (read/write) I expect?
• Is all traffic on one NIC, or spread across many NICs?
• [more advanced… not available through counters]: out-of-order packets?
What are some things I can do?
• Check host networking settings
• Full-duplex/Half-duplex
• 10Gig network vs 100Mb network?
• Firewall settings
• Check VM settings: all VMs on proper networks?
Network performance troubleshooting
Customer complains about slow network
• She’s running netperf on a GigE Link
• She sees only 200Mbps
• Why? I bet it’s that VMware stuff!!
• Note to reader: Please don’t blame VMware first ☺
Where do we start?
All VMs using same NIC (VM network)
All VMs using “VM Network” and sharing 1 physical NIC
Where do we begin? Check VM bandwidth
Measure VM Bandwidth (net.transmitted.average)
• 200 Mb/s
• Screenshot from the vSphere client
Check Host Bandwidth
Measure Host Bandwidth (net.transmitted.average)
• Host sees around 900Mbps…why is VM at 200Mbps?
• Hmm… are we sharing this NIC with multiple VMs?
All traffic is going through one NIC!
Measure per-physical-NIC traffic
Hmm… all VM traffic is going through 1 NIC
Let’s split the VMs across NICs
All traffic through one NIC on this host
Split VMs across multiple NICs. Bingo!
Network issues: Configuration woes
• Network adapter set to “full duplex, 100 Mbps”: < 0.1Mbps!
• Network adapter set to “autonegotiate”: 90Mbps
• Specific combo of switch and adapter caused this performance degradation!
Lesson: Check specs & configuration!
Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?
Stats infrastructure in vSphere
[Diagram: three ESX hosts, each running five VMs, report to vCenter Server (vpxd, tomcat), which is backed by a DB]
1. Collect 20s and 5-min host and VM stats
2. Send 5-min stats to vCenter
3. Send 5-min stats to DB
4. Rollups
Rollups (in the DB)
1. Past-Day (5-minutes) → Past-Week
2. Past-Week (30-minutes) → Past-Month
3. Past-Month (2-hours) → Past-Year
4. (Past-Year = 1 data point per day)
DB only archives historical data
• Real-time (i.e., Past hour) NOT archived at DB
• Past-day, Past-week, etc. → Stats Interval
• Stats Levels ONLY APPLY TO HISTORICAL DATA
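The rollup chain can be illustrated with a small sketch. It assumes a plain average, whereas the real rollup presumably depends on each counter's rollup type (average, summation, and so on). The `rollup` helper and the sample values are hypothetical.

```python
def rollup(samples, factor):
    """Average each full run of `factor` consecutive samples into one coarser sample."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]


# Samples per rolled-up point at each stage of the chain described above.
FIVE_MIN_TO_THIRTY_MIN = 30 // 5          # 6
THIRTY_MIN_TO_TWO_HOURS = 120 // 30       # 4
TWO_HOURS_TO_ONE_DAY = 24 // 2            # 12

# One hour of made-up 5-minute samples becomes two 30-minute points.
five_min = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
print(rollup(five_min, FIVE_MIN_TO_THIRTY_MIN))  # [35.0, 95.0]
```

Each stage trades resolution for retention, which is why short spikes visible in real-time data can disappear from the past-month or past-year view.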
Anatomy of a stats query: Past-hour (“RealTime”) Stats
[Diagram: Client, vCenter Server (vpxd, tomcat), three ESX hosts with their VMs, and the DB]
1. Query
2. Get stats from host
3. Response
No calls to DB
Note: Same code path for past-day stats within last 30 minutes
Anatomy of a stats query: Archived stats
[Diagram: Client, vCenter Server (vpxd, tomcat), three ESX hosts with their VMs, and the DB]
1. Query
2. Get stats
3. Response
No calls to ESX host (caveats apply)
Stats Level = Store this stat in the DB
Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?
Phew! Ok, How do I get these stats?
You want a chart like this?
PowerCLI
• CPU Usage for a VM for last hour:
• $vm = Get-VM -Name "Foo"
• Get-Stat -Entity $vm -Realtime -MaxSamples 180 -Stat cpu.usagemhz.average
• Grab appropriate fields from output, use graphing program, etc.
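The figure of 180 samples follows from the 20s real-time refresh interval. A tiny sketch of that arithmetic, with a hypothetical helper name:

```python
def samples_for_window(window_s: int, interval_s: int = 20) -> int:
    """Number of samples needed to cover a time window at a given sample interval."""
    return window_s // interval_s


print(samples_for_window(3600))  # 180 real-time samples cover the last hour
```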
Looks simple… What’s going on behind the scenes?
To get stats, this is what is going on FOR EACH GET-STAT CALL
• Retrieve PerformanceManager
• QueryPerfProviderSummary $vm: says what intervals are supported
• QueryAvailablePerfMetric $vm: describes available metrics
• QueryPerfCounter: verbose description of counters
• Create PerfQuerySpec: query specification to get the stats
• QueryPerf: get stats
Bottom line: The PowerCLI toolkit spares you details…Easy to use!
PowerCLI Is so easy… Why use Java / C#?
PowerCLI is great for scripting
• Stateless
• Hides details
But with Java / C#
• You can squeeze out more performance!
• Much higher scalability
Pseudo code
PowerCLI:
Get MOREF
for each Get-Stat {
    QueryAvailablePerfMetric();
    QueryPerfCounter();
    QueryPerfProviderSummary();
    create PerfQuerySpec();
    QueryPerf();
}
Java:
Get MOREF
QueryAvailablePerfMetric();
QueryPerfCounter();    (perfCounter property of PerformanceManager)
QueryPerfProviderSummary();
create PerfQuerySpec();
for each Get-Stat {
    QueryPerf();
}
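The difference between the two pseudo code variants comes down to how many calls are made per collection round. The sketch below counts them with a stand-in class. `FakePerfManager` and its methods are hypothetical and only mimic the call names used above; they do not talk to vCenter.

```python
from collections import Counter


class FakePerfManager:
    """Stand-in for a stats service; it only counts how often each call is made."""

    def __init__(self):
        self.calls = Counter()

    def query_available_perf_metric(self, entity):
        self.calls["QueryAvailablePerfMetric"] += 1
        return ["cpu.usagemhz.average"]

    def query_perf_counter(self):
        self.calls["QueryPerfCounter"] += 1
        return {"cpu.usagemhz.average": 101}

    def query_perf_provider_summary(self, entity):
        self.calls["QueryPerfProviderSummary"] += 1
        return {"refresh_rate_s": 20}

    def query_perf(self, spec):
        self.calls["QueryPerf"] += 1
        return []


def make_spec(pm, entity):
    """Do the setup work: metrics, counter IDs, provider summary, query spec."""
    pm.query_available_perf_metric(entity)
    counter_ids = pm.query_perf_counter()
    summary = pm.query_perf_provider_summary(entity)
    return {"entity": entity,
            "counter_id": counter_ids["cpu.usagemhz.average"],
            "interval_s": summary["refresh_rate_s"]}


def collect_per_call_setup(pm, entity, rounds):
    """Repeats the setup work on every round (first variant)."""
    for _ in range(rounds):
        pm.query_perf(make_spec(pm, entity))


def collect_setup_once(pm, entity, rounds):
    """Does the setup once and only repeats QueryPerf (second variant)."""
    spec = make_spec(pm, entity)
    for _ in range(rounds):
        pm.query_perf(spec)


naive, hoisted = FakePerfManager(), FakePerfManager()
collect_per_call_setup(naive, "vm-1", 10)
collect_setup_once(hoisted, "vm-1", 10)
print(sum(naive.calls.values()), sum(hoisted.calls.values()))  # 40 13
```

With ten rounds the first variant issues 40 calls and the second 13, and the gap grows linearly with the number of rounds and entities.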
Performance implications: Need to write scalable scripts!
Entities (cpu.usagemhz.average) | PowerCLI (time in secs) | Java (time in secs)
1 VM | 9.2 | 14
6 VMs | 11 | 14.5
39 VMs | 101 | 16
363 VMs | 2580 (43 minutes) | 50
Java column: highly-tuned Java stats collector
Java provides opportunities for scalable, ongoing stats collection
Let’s examine Java code in more detail…
A naïve script that works for small environments may not be suitable for large environments
GetPerfStats – Main method
Get MOREF
Get CounterIds
QueryAvailablePerfMetric
QueryPerfProviderSummary
create PerfQuerySpec
QueryPerf
GetPerfStats
Get MOREF
Get the entity MOREF
GetPerfStats
Get CounterIds
Get available counterIDs from the perfCounter property of PerformanceManager
Map human-readable stat name to counterID (e.g., cpu.usagemhz.average → 101)
QueryPerf (…) requires counterID
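The name-to-ID mapping can be built once and reused for every query. The sketch below uses a plain dataclass as a stand-in for the entries of the perfCounter property. The class `CounterInfo`, its field names, and the ID 12 are assumptions for illustration; 101 is the example value from the slide, and real IDs may differ between systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterInfo:
    """Illustrative stand-in for one entry of the perfCounter property."""
    key: int      # numeric counterID that QueryPerf needs
    group: str    # e.g. "cpu"
    name: str     # e.g. "usagemhz"
    rollup: str   # e.g. "average"


def build_counter_map(counters):
    """Map 'group.name.rollup' to the numeric counterID."""
    return {f"{c.group}.{c.name}.{c.rollup}": c.key for c in counters}


counters = [
    CounterInfo(101, "cpu", "usagemhz", "average"),
    CounterInfo(12, "cpu", "ready", "summation"),   # made-up ID
]
counter_ids = build_counter_map(counters)
print(counter_ids["cpu.usagemhz.average"])  # 101
```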
GetPerfStats
QueryPerfProviderSummary
• All VMs have same value
• All Hosts have same value etc.
Call once for a given entity type and store result
GetPerfStats
Create PerfQuerySpec
Use wild card
CSV output format
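To show why CSV output is cheap to consume, here is a sketch of a client-side parser. It assumes the sample information arrives as one comma-separated string of alternating interval and timestamp fields, and the values as a second comma-separated string; check the API reference for the exact layout. The function name and the sample data are made up.

```python
def parse_csv_series(sample_info_csv: str, value_csv: str):
    """Pair each timestamp with its value from two comma-separated strings."""
    fields = sample_info_csv.split(",")
    timestamps = fields[1::2]                 # every second field is a timestamp
    values = [int(v) for v in value_csv.split(",")]
    if len(timestamps) != len(values):
        raise ValueError("sample info and values have different lengths")
    return list(zip(timestamps, values))


# Hypothetical response fragments for three 20s samples.
info = "20,2010-10-01T10:00:00Z,20,2010-10-01T10:00:20Z,20,2010-10-01T10:00:40Z"
vals = "4390,4402,4385"
for timestamp, value in parse_csv_series(info, vals):
    print(timestamp, value)
```

Two string splits replace the construction of one object per sample, which is where the savings in serialization and client-side processing come from.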
GetPerfStats
QueryPerf
So, what is Java / C# buying us?
Avoiding redundant work
More compact return format (CSV vs. objects)
Low-overhead tracking of ongoing inventory changes
Etc.
If we dig deeper, we can optimize even more…
Digging deeper: The PerfQuerySpec architecture
To grab counters:
QueryPerf(PerfQuerySpec[] querySpec)
PerfQuerySpec: Specifies which counters to grab
PerfQuerySpec[]: [pQs1, pQs2, pQs3, …]
Array of PerfQuerySpec objects pQs1, pQs2, pQs3
Can grab multiple stats using single QueryPerf call
PerfQuerySpec fields: Entity (host, VM) | Format (CSV, normal) | MetricId | StartTime | EndTime | IntervalID (20s, 300s) | maxSample
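The shape of a query spec array can be sketched with a dataclass that mirrors the fields listed above. `QuerySpecSketch` is a hypothetical illustration, not the SDK type, and it uses counter names instead of numeric IDs purely for readability.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class QuerySpecSketch:
    """Illustrative mirror of the PerfQuerySpec fields listed on the slide."""
    entity: str                               # host or VM identifier
    metric_ids: List[str]                     # which counters to grab
    format: str = "csv"                       # "csv" or "normal"
    interval_id: int = 20                     # 20s (real-time) or 300s (5-minute)
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    max_sample: Optional[int] = None


# One array, three specs: a single QueryPerf call could carry all of them.
query_specs = [
    QuerySpecSketch("VM1", ["cpu.usagemhz.average"], max_sample=180),
    QuerySpecSketch("VM2", ["cpu.usagemhz.average"], max_sample=180),
    QuerySpecSketch("H3", ["mem.swapinRate.average"]),
]
print(len(query_specs))  # 3
```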
Complexities of QueryPerf
How Does vSphere Process QueryPerf(querySpec[])?
1. vCenter receives queryPerf request with querySpec[]
2. vCenter takes each querySpec one at a time
3. vCenter gets data for each querySpec before processing next one
Options for querySpec[]:
1. 1 entry: 1 stat or set of stats for a single entity (e.g., all CPU)
2. Multiple entries. Examples (pQs1, pQs2, pQs3):
• Each entry for a different entity: VM1,cpu.* | VM2,cpu.* | H3,mem.*
• Each entry for a different stat type, same entity: VM1,cpu.* | VM1,net.* | VM1,mem.*
Implications of QuerySpec
Format of QuerySpec Allows Multiple Client Options
1. Grab each stat one at a time
2. Grab a group of stats per entity at once
3. Grab all stats for all entities at once
4. Grab stats for a subset of entities at once
Some Tradeoffs:
1. Network processing (large result sets vs. small result sets)
2. Client aggregation overhead
3. vCenter processing (Each QueryPerf handled in a single thread)
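One way to balance these tradeoffs is to split the entities into batches and issue one query per batch from several client threads. The sketch below shows the batching pattern only. `fake_query_perf` is a hypothetical stand-in for a QueryPerf call, and the batch size and worker count are arbitrary values to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_query_perf(specs):
    """Hypothetical stand-in for one QueryPerf call; returns one row per spec."""
    return [(entity, metric, "ok") for entity, metric in specs]


def collect(entities, metric="cpu.*", batch_size=50, workers=4):
    """Issue one query per batch, several batches in parallel, and merge the rows."""
    batches = chunked([(entity, metric) for entity in entities], batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_query_perf, batches))
    return [row for batch in results for row in batch]


vms = [f"VM{i}" for i in range(1, 364)]   # 363 entities, as in the timing table
rows = collect(vms)
print(len(rows))  # 363
```

Smaller batches give smaller result sets and more parallelism on the server side; larger batches mean fewer round trips but more aggregation work per response.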
What about in-guest stats?
Using VIX APIs:
• Create a script that can get whatever stats you are interested in.
• Make the script write the stats to a file.
• Copy file from the guest.
• Session covering this topic
• PPC-15 – Guest Operations using VMware VIX APIs and Beyond
Back to the Future (1)
Now I know how to convert this… (many metrics on different charts)
Back to the Future (2)
To This (CPU, Memory, Disk, and Network on the same chart)
Combining metrics across VMs & Hosts
Comparing resource pools
Use VIX API + vSphere counters to get RP performance data
What about VMs running on a Host?
Memory usage of VMs on a Host
Summary, Part 1: Some useful Counters to monitor
Resource | Metric | Host or VM? | Description
CPU | Usage | Both | CPU % used
CPU | Ready | VM | Ready to run, but limit or no available physical CPU
CPU | SwapWait | VM | CPU time spent waiting for host-level swap-in
Memory | Swapin, swapinrate | Both | Memory ESX host swaps in from disk (per VM, or cumulative over host)
Memory | Swapout, swapoutrate | Both | Memory ESX host swaps out to disk (per VM, or cumulative over host)
Disk | commands | Both | Operations done during stats refresh interval
Disk | totalLatency | Host | End-to-end disk latency (available for reads & writes)
Disk | Usage | Both | Disk bandwidth utilized (available for reads & writes)
Network | Packets received, transmitted | Both | Operations done during stats refresh interval
Network | Usage | Both | Network bandwidth used (available for reads & writes)
For completeness…VM memory metrics
Metric | Description
Memory Active (KB) | Physical pages touched recently by a virtual machine
Memory Usage (%) | Active memory / configured memory
Memory Consumed (KB) | Machine memory mapped to a virtual machine, including its portion of shared pages. Does NOT include overhead memory.
Memory Granted (KB) | VM physical pages backed by machine memory. May be less than configured memory. Includes shared pages. Does NOT include overhead memory.
Memory Shared (KB) | Physical pages shared with other virtual machines
Memory Balloon (KB) | Physical memory ballooned from a virtual machine
Memory Swapped (KB) | Physical memory in swap file (approx. “swap out – swap in”). Swap out and Swap in are cumulative.
Overhead Memory (KB) | Machine pages used for virtualization
Host memory metrics
Metric | Description
Memory Active (KB) | Physical pages touched recently by the host
Memory Usage (%)* | Active memory / configured memory
Memory Consumed (KB) | Total host physical memory – free memory on host. Includes Overhead and Service Console memory.
Memory Granted (KB) | Sum of memory granted to all running virtual machines. Does NOT include overhead memory.
Memory Shared (KB) | Sum of memory shared for all running VMs
Shared common (KB) | Total machine pages used by shared pages
Memory Balloon (KB) | Machine pages ballooned from virtual machines
Memory Swap Used (KB) | Physical memory in swap files (approx. “swap out – swap in”). Swap out and Swap in are cumulative.
Overhead Memory (KB) | Machine pages used for virtualization
*For a cluster, mem.usage.average = (consumed + overhead)/total mem
Summary, Part 2: Cheat sheet
Rules of Thumb
• Ready Time > 20% sustained is undesirable
• Host-level swapping is bad, > 1MBps is especially bad
• Disk latencies > 20 ms are bad
• Use IOmeter to assess disk bandwidth and latency
• Network: run netperf to get network baselines
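The numeric rules of thumb can be turned into a simple check. This is a sketch under assumptions: the function name, parameter names, and warning strings are invented here, and 1 MBps is taken to mean 1024 KBps.

```python
def check_rules_of_thumb(ready_pct, host_swap_kbps, disk_latency_ms):
    """Return a list of warnings for values that break the cheat-sheet rules."""
    warnings = []
    if ready_pct > 20:
        warnings.append("ready time above 20%")
    if host_swap_kbps > 1024:            # assumes 1 MBps = 1024 KBps
        warnings.append("host swapping above 1 MBps")
    elif host_swap_kbps > 0:
        warnings.append("host swapping present")
    if disk_latency_ms > 20:
        warnings.append("disk latency above 20 ms")
    return warnings


# Values loosely based on the slow power-on example: only disk latency is off.
print(check_rules_of_thumb(ready_pct=10, host_swap_kbps=0, disk_latency_ms=1100))
# ['disk latency above 20 ms']
```

The ready-time rule refers to sustained values, so in practice a check like this would be applied to a run of samples rather than a single point.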
Summary, Part 3: SDK/API Tips and tricks
Collect static data once
• CounterIDs, metricIDs, MOREFs etc.
• Use Views to keep this data up to date.
• Reuse PerfQuerySpec as much as possible
Use CSV format
• Reduces serialization cost and the size of metadata
Choose metrics and query intervals carefully
• Query the real-time stats at a slower rate than the refresh rate
• Choose correct stats levels
Use parallelism (multi-threaded clients)
Conclusion
vSphere gives a bunch of awesome charts
If you want to see the data differently, use the API
PowerCLI is great for simple scripts
When designing for scalability, consider Java / C#
Resources
Developer Support
• Dedicated support for your organization when building solutions using vSphere APIs, PowerCLI, vSphere Web Services SDKs and many more VMware SDKs
• http://vmware.com/go/sdksupport
PowerCLI Training
• 2 day instructor led training, 40% lecture, 60% lab
• http://vmware.com/go/vsphereautomation
VMware Developer Community
• SDK Downloads, Documentation, Sample Code, Forums, Blogs
• http://developer.vmware.com
Technology Alliance Partner (TAP) Program
• Updated partner benefits
• http://www.vmware.com/partners/alliances/programs/
Disclaimer
This session may contain product features that are
currently under development.
This session/overview of the new technology represents
no commitment from VMware to deliver these features in
any generally available product.
Features are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
“These features are representative of feature areas under development. Feature commitments are
subject to change, and must not be included in contracts, purchase orders, or sales agreements of
any kind. Technical feasibility and market demand will affect final delivery.”
Backup slides
What about VMs across resource pools?
Back to the Future (2)
To This (CPU, Memory, Disk, and Network on the same chart)
Combining metrics across VMs & Hosts

Deep Dive on Delivering Amazon EC2 Instance Performance
 
Virtualisation Oversubscription - What's so scary?
Virtualisation Oversubscription - What's so scary?Virtualisation Oversubscription - What's so scary?
Virtualisation Oversubscription - What's so scary?
 

Plus de Alan Renouf (6)

Bill board
Bill boardBill board
Bill board
 
Dutch VMUG 2010 PowerCLI Presentation
Dutch VMUG 2010 PowerCLI PresentationDutch VMUG 2010 PowerCLI Presentation
Dutch VMUG 2010 PowerCLI Presentation
 
Advanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtopAdvanced performance troubleshooting using esxtop
Advanced performance troubleshooting using esxtop
 
PowerCLI & Onyx
PowerCLI & OnyxPowerCLI & Onyx
PowerCLI & Onyx
 
TA6944 PowerCLI is for Administrators!
TA6944 PowerCLI is for Administrators!TA6944 PowerCLI is for Administrators!
TA6944 PowerCLI is for Administrators!
 
VMware VI Toolkit UKVMUG
VMware VI Toolkit UKVMUGVMware VI Toolkit UKVMUG
VMware VI Toolkit UKVMUG
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

vSphere APIs for performance monitoring

  • 1. vSphere APIs for performance monitoring London Workshop October – 2010 Balaji Parimi, Staff Engineer, Ecosystem Performance, VMware, Inc. Ravi Soundararajan, Senior Staff Engineer, Performance, VMware, Inc.
  • 2. Motivation To debug performance, why deal with this...?
  • 3. Motivation When you can deal with this instead?
  • 4. More motivation Why look at data like this…? Before memhog: no guest swapping After memhog, guest swaps, but Host does not!
  • 5. More motivation When you can look at it like this?
  • 6. Even more motivation… Why compare resource pool performance like this…?
  • 7. Even more motivation When you can compare them like this…?
  • 8. Why? vSphere gives you awesome, helpful charts But you don’t have to rely solely on these charts Do you want to learn how to make your own charts? • Keep watching
  • 9. Goal Teach you how to use our APIs for performance monitoring
  • 10. Agenda What sorts of stats are useful? How does vSphere retrieve them? How can you get these stats and use them yourself?
  • 11. Useful stats Basics of performance monitoring in virtual infrastructure • Find underperforming resources • Find overcommitted resources • Identify issues due to resource sharing among VMs Resources we will look at • CPU • Memory • Disk • Network
  • 12. Resources that we often look at CPU Memory Disk Network
  • 13. CPU basics ESX CPU0 CPU1 CPU2 CPU3 VM0 VM1 VM2 VM3 VM4 Run (accumulating used time) Ready (wants to run, no physical CPU available) Wait: blocked on I/O or voluntarily descheduled VM5 VM6 Run Ready Wait/Idle
  • 14. Why is my VM slow? CPU saturated (cpu.usage.average) Ready time? (cpu.ready.summation) Latency to be swapped in? (cpu.swapwait.summation)
  • 15. CPU saturation 2 vCPUs 2.2GHz/CPU ~4.4GHz used (Look at left y-axis)
  • 16. Small ready time Ready time vCPU1: 150ms Real-time chart: refresh 20s 150ms / 20s = 0.75% (No big deal) Right y-axis is relevant
  • 17. Now, turn on CPU burner on same host… CPU burner ~100% of 1 vCPU
  • 18. And see what happens to original VM’s ready time SpecJBB ready time ~2000ms = 10% (ps. SpecJBB perf. dropped by 10%)
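The ready-time arithmetic on slides 16 and 18 can be sketched as a small helper. This is a minimal illustration, assuming the sample comes from cpu.ready.summation (milliseconds) and the real-time refresh interval is 20 seconds; the function name `ready_percent` is made up for this example.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a ready-time sample in milliseconds into a percentage
    of the sampling interval (20 s for real-time charts)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # ms of ready time divided by ms in the interval, as a percentage
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    print(ready_percent(150))   # slide 16: 150 ms in a 20 s interval
    print(ready_percent(2000))  # slide 18: ~2000 ms in a 20 s interval
```

For historical (rolled-up) data, pass the longer interval, for example `interval_s=300` for 5-minute samples.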
  • 19. Latency to load in VM: cpu.swapwait.summation Sometimes there is a latency to load VM data from disk: CPU swapwait. Here the CPU takes 20s to load in data before the VM can run!
  • 20. CPU issues: Summary CPU saturated? High Ready time • Problematic if it is sustained for long periods • Sample rule of thumb: > 20% per vCPU, investigate further • Possible contention for CPU resources among VMs • Workload variability? Fix with VMotion/DRS • Resource limits on VMs? Check limits, reservations, and shares • Actual overcommitment? Fix with VMotion/DRS/more CPUs High SwapWait time • Consider setting memory reservation (see next section, “Memory”)
  • 21. Resources that we often look at CPU Memory Disk Network
  • 22. Memory ESX must balance memory usage • Page sharing to reduce memory footprint of Virtual Machines • Ballooning to relieve memory pressure in a graceful way • Host swapping to relieve memory pressure when ballooning insufficient • Compression to relieve memory pressure without host-level swapping ESX allows over commitment of memory • Sum of configured memory sizes of virtual machines can be greater than physical memory if working sets fit Memory also has limits, shares, and reservations Host swapping can cause performance degradation
  • 23. Ballooning, compression, and swapping (1) Ballooning: memctl driver grabs pages and gives them to ESX • Guest OS chooses pages to give to memctl (avoids “hot” pages if possible): either free pages or pages to swap • Unused pages are given directly to memctl • Pages to be swapped are first written to the swap partition within the guest OS and then given to memctl [Diagram: VM1 (memctl, swap partition within guest OS), ESX, VM2; steps: 1. Balloon 2. Reclaim 3. Redistribute]
  • 24. Ballooning, swapping, and compression (2) Swapping: ESX reclaims pages forcibly • Guest doesn’t pick pages…ESX may inadvertently pick “hot” pages (possible VM performance implications) • Pages written to VM swap file [Diagram: VM1 (swap partition within guest), ESX, VM2, VSWP file (external to guest); steps: 1. Force Swap 2. Reclaim 3. Redistribute]
  • 25. Ballooning, swapping, and compression (3) Compression: ESX reclaims pages, writes to in-memory cache • Guest doesn’t pick pages…ESX may inadvertently pick “hot” pages (possible VM performance implications) • Writing pages to the in-memory cache is faster than host-level swapping [Diagram: VM1 (swap partition within guest), ESX compression cache, VM2; steps: 1. Write to Compression Cache 2. Give pages to VM2]
  • 26. Ballooning, swapping, and compression Bottom line: • Ballooning may occur even when there is no memory pressure, just to keep memory proportions under control • Ballooning is preferable to compression and vastly preferable to swapping • Guest can surrender unused/free pages • With host swapping, ESX cannot tell which pages are unused or free and may accidentally pick “hot” pages • Even if balloon driver has to swap to satisfy the balloon request, guest chooses what to swap • Can avoid swapping “hot” pages within guest • Compression: reading from compression cache is faster than reading from disk
  • 27. Swapping in Guest != Swapping in Host DVDstore benchmark: SQL DB benchmark… uses lots of memory About to start memory hogger program in guest
  • 28. Force Guest swapping: No Host-level swapping Before memhog: no guest swapping After memhog, guest swaps, but Host does not!
  • 29. Viewing Host-level swapping with performance charts Setup: 2 VMs…one dvdstore, one memhog, competing for host memory Host swaps out dvdstore VM memory to fulfill memhog VM requests Host swaps in dvdstore VM memory to fulfill dvdstore VM requests
  • 30. Using Swap Rate Counters: Remember CPU SwapWait? Cpu.swapwait.summation: CPU is waiting for memory to be swapped in
  • 31. Absolute Swap Counters… Swapin, swapout (KB) show some activity but hard to detect…
  • 32. And Swap Rate Counters… SwapinRate, SwapoutRate (KBps) show activity much more clearly Rule of thumb: host swapping > 1MBps is cause for concern
  • 33. Resources that we often look at CPU Memory Disk Network
  • 34. ESX storage stack Different latencies for local disk vs. SAN (caching, switches, etc.) Queuing within kernel and in hardware vSphere shows • Total Command Latency • Kernel Latency • Device Latency • Bandwidth/IOPS
  • 35. Disk performance problems 101 What should I look for to figure out if disk is an issue? • Am I getting the IOPs I expect? • Am I getting the bandwidth (read/write) I expect? • Are the latencies higher than I expect? • Where is time being spent? What are some things I can do? • Make sure devices are configured properly (caches, queue depths) • Use multiple adapters and multipathing • Check networking settings (for iSCSI/NAS)
  • 36. Another disk example: Slow VM power on Trying to Power on a VM • Sometimes, powering on VM would take 5 seconds • Other times, powering on VM would take 5 minutes! Where to begin? • Powering on a VM requires disk activity on host Check disk metrics for host
  • 37. Let’s look at the vSphere client… Max Disk Latencies range from 100ms to 1100ms…very high! Why? (counter name: disk.maxTotalLatency.latest) Rule of thumb: latency > 20ms is Bad. Here: 1,100ms REALLY BAD!!!
  • 38. High disk latency: Mystery solved Host events: disk has connectivity issues, which cause the high latencies! Bottom line: monitor disk latencies; issues may not be related to virtualization!
  • 39. Resources that we often look at CPU Memory Disk Network
  • 40. Network performance problems 101 What should I look for to figure out if network is an issue? • Am I getting the packet rate that I expect? • Am I getting the bandwidth (read/write) I expect? • Is all traffic on one NIC, or spread across many NICs? • [more advanced… not available through counters]: out-of-order packets? What are some things I can do? • Check host networking settings • Full-duplex/Half-duplex • 10Gig network vs 100Mb network? • Firewall settings • Check VM settings: all VMs on proper networks?
  • 41. Network performance troubleshooting Customer complains about slow network • She’s running netperf on a GigE Link • She sees only 200Mbps • Why? I bet it’s that VMware stuff!! • Note to reader: Please don’t blame VMware first ☺ Where do we start?
  • 42. All VMs using same NIC (VM network) All VMs using “VM Network” and sharing 1 physical NIC
  • 43. Where do we begin? Check VM bandwidth Measure VM Bandwidth (net.transmitted.average) • 200 Mb/s • Screenshot from the vSphere client
  • 44. Check Host Bandwidth Measure Host Bandwidth (net.transmitted.average) • Host sees around 900Mbps…why is VM at 200Mbps? • Hmm… are we sharing this NIC with multiple VMs?
  • 45. All traffic is going through one NIC! Measure per-physical-NIC traffic Hmm… all VM traffic is going through 1 NIC Let’s split the VMs across NICs All traffic through one NIC on this host
  • 46. Split VMs across multiple NICs. Bingo!
  • 47. Network issues: Configuration woes Network adapter set to “full duplex, 100 Mbps”: < 0.1Mbps! Specific combo of switch and adapter caused this performance degradation! Lesson: Check specs & configuration! Network adapter set to “autonegotiate”: 90Mbps
  • 48. Agenda What sorts of stats are useful? How does vSphere retrieve them? How can you get these stats and use them yourself?
  • 49. Stats infrastructure in vSphere ESX VM VM VM VM VM vCenter Server (vpxd, tomcat) ESX VM VM VM VM VM ESX VM VM VM VM VM DB 1. Collect 20s and 5-min host and VM stats 2. Send 5-min stats to vCenter 3. Send 5-min stats to DB 4. Rollups
  • 50. Rollups DB 1. Past-Day (5-minute samples) rolled up into Past-Week 2. Past-Week (30-minute samples) rolled up into Past-Month 3. Past-Month (2-hour samples) rolled up into Past-Year 4. Past-Year = 1 data point per day. DB only archives historical data • Real-time (i.e., Past hour) NOT archived at DB • Past-day, Past-week, etc. are the Stats Intervals • Stats Levels ONLY APPLY TO HISTORICAL DATA
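To make the rollup granularities concrete, the sketch below counts how many data points each interval yields. It assumes the sample periods given in the deck (20 s real-time, 5 minutes, 30 minutes, 2 hours, 1 day); the 30-day month and 365-day year window lengths are assumptions added for illustration.

```python
# (window length in seconds, sample period in seconds) per stats interval.
# Sample periods come from the slides; month/year lengths are assumed.
INTERVALS = {
    "real-time (past hour)": (3600, 20),
    "past day": (24 * 3600, 300),
    "past week": (7 * 24 * 3600, 1800),
    "past month (30 days assumed)": (30 * 24 * 3600, 7200),
    "past year (365 days assumed)": (365 * 24 * 3600, 86400),
}


def sample_count(window_s: int, period_s: int) -> int:
    """Number of whole samples that fit in the window."""
    return window_s // period_s


if __name__ == "__main__":
    for name, (window_s, period_s) in INTERVALS.items():
        print(f"{name}: {sample_count(window_s, period_s)} samples")
```

The real-time figure (180 samples) is the same number used as the sample limit in the PowerCLI Get-Stat example for one hour of data.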
  • 51. Anatomy of a stats query: Past-hour (“RealTime”) Stats Client ESX VM VM VM VM VM vCenter Server (vpxd, tomcat) ESX VM VM VM VM VM ESX VM VM VM VM VM DB 1. Query 2. Get stats from host 3. Response No calls to DB Note: Same code path for past-day stats within last 30 minutes
  • 52. Anatomy of a stats query: Archived stats Client ESX VM VM VM VM VM vCenter Server (vpxd, tomcat) ESX VM VM VM VM VM ESX VM VM VM VM VM DB 1. Query 3. Response No calls to ESX host (caveats apply) Stats Level = Store this stat in the DB 2. Get Stats
  • 53. Agenda What sorts of stats are useful? How does vSphere retrieve them? How can you get these stats and use them yourself?
  • 54. Phew! Ok, How do I get these stats? You want a chart like this? PowerCLI • CPU Usage for a VM for last hour: • $vm = Get-VM -Name "Foo" • Get-Stat -Entity $vm -Realtime -MaxSamples 180 -Stat cpu.usagemhz.average • Grab appropriate fields from output, use graphing program, etc.
  • 55. Looks simple… What’s going on behind the scenes? To get stats, this is what is going on FOR EACH GET-STAT CALL • Retrieve PerformanceManager • QueryPerfProviderSummary ($vm): says what intervals are supported • QueryAvailablePerfMetric ($vm): describes available metrics • QueryPerfCounter: verbose description of counters • Create PerfQuerySpec: query specification to get the stats • QueryPerf: get stats Bottom line: The PowerCLI toolkit spares you details…Easy to use!
  • 56. PowerCLI Is so easy… Why use Java / C#? PowerCLI is great for scripting • Stateless • Hides details But with Java / C# • You can squeeze out more performance! • Much higher scalability
  • 57. Pseudo code PowerCLI: Get MOREF; for each Get-Stat { QueryAvailablePerfMetric(); QueryPerfCounter(); QueryPerfProviderSummary(); create PerfQuerySpec(); QueryPerf(); } Java: Get MOREF; QueryAvailablePerfMetric(); QueryPerfCounter(); QueryPerfProviderSummary(); create PerfQuerySpec(); for each Get-Stat { QueryPerf(); } (Callout: perfCounter property of PerformanceManager)
  • 58. Performance implications: Need to write scalable scripts!
      Entities (cpu.usagemhz.average) | PowerCLI (time in secs) | Java (time in secs)
      1 VM | 9.2 | 14
      6 VMs | 11 | 14.5
      39 VMs | 101 | 16
      363 VMs | 2580 (43 minutes) | 50
      The Java column is a highly-tuned Java stats collector. Java provides opportunities for scalable, ongoing stats collection. A naïve script that works for small environments may not be suitable for large environments. Let’s examine Java code in more detail…
  • 59. GetPerfStats – Main method Get MOREF Get CounterIds QueryAvailablePerfMetric QueryProviderSummary create PerfQuerySpec QueryPerf Get MOREF QueryAvailablePerfMetric(); QueryPerfCounter(); QueryPerfProviderSummary(); create PerfQuerySpec(); for each Get-Stat { QueryPerf(); } perfCounter
  • 61. Get MOREF Get the entity MOREF
  • 63. Get CounterIds Get available counterIDs from perfCounter property of PerformanceManager Map human-readable stat name to counterID (e.g., cpu.usagemhz.average → 101) QueryPerf (…) requires counterID
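The name-to-counterID mapping described on slide 63 can be sketched as a dictionary built once and reused for every query. This is an illustration only: `CounterInfo` is a hypothetical stand-in for the entries returned by the perfCounter property, and every ID except the slide's 101 is made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CounterInfo:
    """Hypothetical stand-in for one entry of the perfCounter property."""
    key: int      # numeric counterID that QueryPerf requires
    group: str    # e.g. "cpu"
    name: str     # e.g. "usagemhz"
    rollup: str   # e.g. "average"


def build_counter_index(counters: Iterable[CounterInfo]) -> Dict[str, int]:
    """Map 'group.name.rollup' to counterID; build once, then reuse."""
    return {f"{c.group}.{c.name}.{c.rollup}": c.key for c in counters}


if __name__ == "__main__":
    sample = [
        CounterInfo(101, "cpu", "usagemhz", "average"),   # ID from the slide
        CounterInfo(9001, "cpu", "ready", "summation"),   # made-up ID
        CounterInfo(9002, "mem", "swapinRate", "average"),  # made-up ID
    ]
    index = build_counter_index(sample)
    print(index["cpu.usagemhz.average"])
```

Counter IDs should be looked up from the server at startup instead of being hard-coded, which is why the index is built from retrieved data here.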
  • 65. QueryPerfProviderSummary • All VMs have same value • All Hosts have same value etc. Call once for a given entity type and store result
  • 67. Create PerfQuerySpec Use wild card CSV output format
  • 70. So, what is Java / C# buying us? Avoiding redundant work More compact return format (CSV vs. objects) Low-overhead tracking of ongoing inventory changes Etc. If we dig deeper, we can optimize even more…
  • 71. Digging deeper: The PerfQuerySpec architecture To grab counters: QueryPerf(PerfQuerySpec[] querySpec) PerfQuerySpec: Specifies which counters to grab PerfQuerySpec[]: [pQs1, pQs2, pQs3, …] Array of PerfQuerySpec objects pQs1, pQs2, pQs3 Can grab multiple stats using a single QueryPerf call PerfQuerySpec fields: Entity (host, VM), Format (CSV, normal), MetricId, StartTime, EndTime, IntervalID (20s, 300s), maxSample
  • 72. Complexities of QueryPerf How Does vSphere Process QueryPerf(querySpec[])? 1. vCenter receives queryPerf request with querySpec[] 2. vCenter takes each querySpec one at a time 3. vCenter gets data for each querySpec before processing next one Options for querySpec[]: 1. One entry: 1 stat or set of stats for a single entity (e.g., all CPU) 2. Multiple entries. Examples: • Each entry for a different entity, e.g. [pQs1: VM1,cpu.*] [pQs2: VM2,cpu.*] [pQs3: H3,mem.*] • Each entry for a different stat type, same entity, e.g. [pQs1: VM1,cpu.*] [pQs2: VM1,net.*] [pQs3: VM1,mem.*]
  • 73. Implications of QuerySpec Format of QuerySpec Allows Multiple Client Options 1. Grab each stat one at a time 2. Grab a group of stats per entity at once 3. Grab all stats for all entities at once 4. Grab stats for a subset of entities at once Some Tradeoffs: 1. Network processing (large result sets vs. small result sets) 2. Client aggregation overhead 3. vCenter processing (Each QueryPerf handled in a single thread)
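The client options on slide 73 amount to choosing how many query specs go into each QueryPerf call. The sketch below illustrates that choice with plain Python objects; `QuerySpecSketch`, `specs_per_entity`, and `batched` are hypothetical names for this example, not SDK types, and no call to vCenter is made.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class QuerySpecSketch:
    """Hypothetical mirror of the PerfQuerySpec fields listed on slide 71."""
    entity: str                   # e.g. a VM or host identifier
    counter_ids: Tuple[int, ...]  # which counters to grab
    interval_s: int = 20          # 20 s real-time or 300 s historical
    max_sample: int = 1
    csv: bool = True              # compact CSV return format


def specs_per_entity(entities: Sequence[str],
                     counter_ids: Sequence[int]) -> List[QuerySpecSketch]:
    """Option 2: one spec per entity carrying a group of stats."""
    return [QuerySpecSketch(e, tuple(counter_ids)) for e in entities]


def batched(specs: Sequence[QuerySpecSketch],
            size: int) -> Iterator[List[QuerySpecSketch]]:
    """Option 4: split specs into subsets, one subset per query call."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(specs), size):
        yield list(specs[start:start + size])


if __name__ == "__main__":
    specs = specs_per_entity(["vm-1", "vm-2", "vm-3", "vm-4", "vm-5"], [101])
    for batch in batched(specs, 2):
        # Each batch would be the querySpec[] argument of one query call.
        print([s.entity for s in batch])
```

Smaller batches give smaller result sets and let a multi-threaded client issue calls in parallel; larger batches mean fewer round trips but more work per call on a single vCenter thread.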
  • 74. What about in-guest stats? Using VIX APIs: • Create a script that can get whatever stats you are interested in. • Make the script write the stats to a file. • Copy file from the guest. • Session covering this topic • PPC-15 – Guest Operations using VMware VIX APIs and Beyond
  • 75. Back to the Future (1) Now I know how to convert this… (many metrics on different charts)
  • 76. Back to the Future (2) To This (CPU, Memory, Disk, and Network on the same chart)
  • 79. Comparing resource pools Use VIX API + vSphere counters to get RP performance data
  • 80. What about VMs running on a Host? Memory usage of VMs on a Host
  • 81. Summary, Part 1: Some useful counters to monitor
      Resource | Metric | Host or VM? | Description
      CPU | Usage | Both | CPU % used
      CPU | Ready | VM | Ready to run, but limit or no available physical CPU
      CPU | SwapWait | VM | CPU time spent waiting for host-level swap-in
      Memory | Swapin, swapinrate | Both | Memory ESX host swaps in from disk (per VM, or cumulative over host)
      Memory | Swapout, swapoutrate | Both | Memory ESX host swaps out to disk (per VM, or cumulative over host)
      Disk | Commands | Both | Operations done during stats refresh interval
      Disk | totalLatency | Host | End-to-end disk latency (available for reads & writes)
      Disk | Usage | Both | Disk bandwidth utilized (available for reads & writes)
      Network | Packets received, transmitted | Both | Operations done during stats refresh interval
      Network | Usage | Both | Network bandwidth used (available for reads & writes)
  • 82. For completeness…VM memory metrics
      Metric | Description
      Memory Active (KB) | Physical pages touched recently by a virtual machine
      Memory Usage (%) | Active memory / configured memory
      Memory Consumed (KB) | Machine memory mapped to a virtual machine, including its portion of shared pages. Does NOT include overhead memory.
      Memory Granted (KB) | VM physical pages backed by machine memory. May be less than configured memory. Includes shared pages. Does NOT include overhead memory.
      Memory Shared (KB) | Physical pages shared with other virtual machines
      Memory Balloon (KB) | Physical memory ballooned from a virtual machine
      Memory Swapped (KB) | Physical memory in swap file (approx. “swap out – swap in”). Swap out and Swap in are cumulative.
      Overhead Memory (KB) | Machine pages used for virtualization
  • 83. Host memory metrics
      Metric | Description
      Memory Active (KB) | Physical pages touched recently by the host
      Memory Usage (%)* | Active memory / configured memory
      Memory Consumed (KB) | Total host physical memory – free memory on host. Includes Overhead and Service Console memory.
      Memory Granted (KB) | Sum of memory granted to all running virtual machines. Does NOT include overhead memory.
      Memory Shared (KB) | Sum of memory shared for all running VMs
      Shared common (KB) | Total machine pages used by shared pages
      Memory Balloon (KB) | Machine pages ballooned from virtual machines
      Memory Swap Used (KB) | Physical memory in swap files (approx. “swap out – swap in”). Swap out and Swap in are cumulative.
      Overhead Memory (KB) | Machine pages used for virtualization
      *For a cluster, mem.usage.average = (consumed + overhead)/total mem
  • 84. Summary, Part 2: Cheat sheet Rules of Thumb • Ready Time > 20% sustained is undesirable • Host-level swapping is bad, > 1MBps is especially bad • Disk latencies > 20 ms BAD • Use IOmeter to assess disk bandwidth and latency • Network • run netperf to get network baselines
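The cheat-sheet thresholds on slide 84 can be turned into a simple check over collected samples. This is a minimal sketch: the function and constant names are invented here, and treating 1 MBps as 1024 KBps is an assumption, since the swap-rate counters are reported in KBps.

```python
from typing import List

READY_PCT_LIMIT = 20.0        # ready time > 20% sustained is undesirable
SWAP_RATE_LIMIT_KBPS = 1024.0 # assumed equivalent of 1 MBps
DISK_LATENCY_LIMIT_MS = 20.0  # disk latencies > 20 ms are bad


def check_rules(ready_pct: float, swap_in_kbps: float,
                swap_out_kbps: float, disk_latency_ms: float) -> List[str]:
    """Return one warning string per rule of thumb that is violated."""
    warnings: List[str] = []
    if ready_pct > READY_PCT_LIMIT:
        warnings.append(f"ready time {ready_pct:.1f}% exceeds 20%")
    swap_kbps = max(swap_in_kbps, swap_out_kbps)
    if swap_kbps > SWAP_RATE_LIMIT_KBPS:
        warnings.append(f"host swap rate {swap_kbps:.0f} KBps exceeds 1 MBps")
    if disk_latency_ms > DISK_LATENCY_LIMIT_MS:
        warnings.append(f"disk latency {disk_latency_ms:.0f} ms exceeds 20 ms")
    return warnings


if __name__ == "__main__":
    # Values loosely based on the deck's examples: 10% ready, 1100 ms latency.
    for warning in check_rules(10.0, 0.0, 0.0, 1100.0):
        print(warning)
```

A single sample over a limit is only a hint; the deck's rules refer to sustained values, so a real collector would apply the check over a window of samples.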
  • 85. Summary, Part 3: SDK/API Tips and tricks Collect static data once • CounterIDs, metricIDs, MOREFs etc. • Use Views to keep this data up to date. • Reuse PerfQuerySpec as much as possible Use CSV format • Reduces serialization cost and the size of metadata Choose metrics and query intervals carefully • Query the real-time stats at a slower rate than the refresh rate • Choose correct stats levels Use parallelism (multi-threaded clients)
  • 86. Conclusion vSphere gives a bunch of awesome charts If you want to see the data differently, use the API PowerCLI is great for simple scripts When designing for scalability, consider Java / C#
  • 87. Resources Developer Support • Dedicated support for your organization when building solutions using vSphere APIs, PowerCLI, vSphere Web Services SDKs and many more VMware SDKs • http://vmware.com/go/sdksupport PowerCLI Training • 2 day instructor led training, 40% lecture, 60% lab • http://vmware.com/go/vsphereautomation VMware Developer Community • SDK Downloads, Documentation, Sample Code, Forums, Blogs • http://developer.vmware.com Technology Alliance Partner (TAP) Program • Updated partner benefits • http://www.vmware.com/partners/alliances/programs/
  • 88. Disclaimer This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. “These features are representative of feature areas under development. Feature commitments are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery.”
  • 90. What about VMs across resource pools?
  • 91. Back to the Future (2) To This (CPU, Memory, Disk, and Network on the same chart)