Iwan ‘e1’ Rahabok, who works as a Staff SE, Strategic Accounts in Singapore, has created an awesome vCenter Operations 5 training. It's available in PowerPoint format, and I strongly advise you to read the slide notes. The presentation serves two purposes: first, it provides in-depth training for those who are learning or evaluating vCenter Operations 5; second, it provides material that a vCenter Ops champion can share with internal colleagues (e.g. storage team, app team, etc.).
2. Document Information
This deck is part 2 of a series.
• Part 1 is Management in the Virtual World: a technical introduction.
• http://communities.vmware.com/docs/DOC-17841
This deck has prerequisites
• Intro video: http://www.youtube.com/watch?v=Z-DJuTiqKag
• VC Ops 5 technical introduction at Vault or Partner Central.
This deck only covers vCenter Operations (Enterprise + Advanced)
• Focus on concepts & ‘under the hood’ to help you understand the product more deeply.
• Does not cover: competitive, installation, configuration
• Does not run through feature after feature.
• See the official training deck for that at Vault or Partner Central.
• vCenter Operations modules that it does not cover
• Chargeback
• Infrastructure Navigator
• Configuration Manager
This is a very long training material. Use the Section feature to see how it is organised.
Further reading
• virtual-red-dot.blogspot.com
3. Table of Contents
Built for vCenter Standard
Core: Metrics, Threshold, Analytics
Badges
Heat Map
Smart Alert
Details & Charts
Capacity Management
Settings
VCM integration
Concepts & Advance Concepts
Deep dive into Metrics
Dashboard and Widgets
4. Managing Performance/Capacity in vSphere: the basic
Is it healthy?
• Every VM & ESX performing well? CPU, RAM, Network, Disk?
• Are they behaving as expected?
• Any fault on any component?
Is it enough?
• Enough CPU, RAM, Network, Disk? Future risk?
• Time remaining?
• Capacity remaining?
• Where are the “Stress points” in time?
Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I put without impacting performance?
5. Direct Mapping by vCenter Operations
Is it healthy = Health
• Workload
• Anomalies
• Faults
Is it enough = Risk
• Time remaining
• Capacity remaining
• Stress period
Is it optimised = Efficiency
• What can we reclaim?
• Density. Key ratios for management
Daily update at midnight
7. Visibility across vCenters
Sample from ASEAN Lab:
6 vCenters.
Mix of Appliance and Windows
2 are in Linked Mode (SRM)
8. Performance Troubleshooting: a day in the life…
You got an email from the app team, saying the main Intranet application was slow.
• The email was 1 hour ago. The email stated that it was slow for 1 hour, and it was ok after that.
• So it was slow between 1-2 hours ago, but ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
• You are not familiar with the application. You do not know which apps run on each VM, as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 300 VMs, 20 datastores, 1 midrange array, 10 GE FCoE
Test your vSphere knowledge!
How do you solve/approach this with just vSphere?
What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE
B: No sweat, you’re VCDX + CCIE + ITIL Master. You’re born for this.
C: SMS your wife, “Honey, I’m staying overnight at the datacenter”
D: Take blood pressure medicine so it won’t shoot up.
E: Buy the app team a very nice dinner, and tell them to keep quiet.
9. Performance Troubleshooting: a day in the life…
The minimum you need to prove
• Performance is not caused by your infrastructure, or at least not by your VMware.
• Infrastructure = VMware + Storage + Network
• Application = VM + App inside the VM
What you need to prove
• For each of the 10 VMs, the following were OK between 1 and 2 hours ago: CPU, RAM, Disk, Network
• To strengthen the above, prove that:
• The shared infrastructure was also healthy: relevant ESX, relevant Datastore
• The overall platform was also healthy.
• No relevant faults that happened 1-2 hours ago.
• Give the list of ports (that the 10 VMs use) to the network team to ensure the firewall is not dropping them.
What challenges do you face in vSphere to do the above?
• Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have?
The ideal you need to prove
• Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that caused it. In other words, application-specific + root-cause analysis
11. Challenge 1: details are lost after 1 hour
The following counters are lost:
1. Used
2. System
3. Idle
4. Latency
5. Overlap
6. Demand
7. Wait
8. Run
9. Swap wait
10. Max Limited
12. Challenge 1: details are lost after 1 hour
[Screenshots: Memory Counters and Disk Counters, each shown for <1 hour and >1 hour]
17. Deep understanding of vCenter is required
Here is a common example of why
a deep understanding of vSphere counters makes a huge difference.
Buy more RAM?
18. Deep understanding of vCenter is required
Yes, buy more RAM.
ESXi has 32 GB RAM.
It is highly used
19. Deep understanding of vCenter is required
vCenter Ops shows
very different data.
Memory is only 32%.
Plenty of headroom.
What?! It’s been high constantly for the last 24 hours! Better buy more RAM now.
But hang on! This is the ESXi-06 host in the VMware ASEAN lab. We know who uses it.
20. vCenter Ops shows
very different data.
Memory is only 32%.
Plenty of headroom.
It has just saved us from a costly RAM upgrade project.
22. Counters and Badges
A vCenter farm with 500 VMs and 50 ESX hosts will have >10000 counters!
• It is not humanly possible to look at them, let alone analyse them.
vCenter presents raw counters
• e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily Chart?
• e.g. Is memory.usage at 90% at ESXi level good or bad?
• e.g. Is IOPS of 300 good or bad for datastore XYZ?
Single counter can be misleading
• e.g. Low CPU usage does not mean the VM is getting the CPU, if there is Limit, Contention and Co-Stop.
• e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical)
Different counters have different units
• GHz, %, MB, kbps, ops/sec, ms
• This makes analysis even more complex
Derived Counters
• Standardises the scale into 0 - 100. 1 universal unit. Minimises the “translation” in our head.
• Can be >100 if demand is unmet
• Universal. Applies to CPU, RAM, Disk, Net, etc.
• Counters are derived using sophisticated formulas, not just aggregated.
• For the same counter, different objects use different formulas.
23. Samples of Derived Metric: Health
Health Score of an Object = MAX (Abnormal Workload, Faults)
• Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)),
Workload)
• Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
• Fault depends on the object:
Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)
Host = MAX (Hardware Issues, HA Issues)
Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
Network Issues = MAX (Network, DVPort, VMNic)
Network = Max_of_all_instances (Network Device)
DVPort = Max_of_all_instances (DVPort Device)
VMNic = Max_of_all_instances (VMNic Device)
Storage Issues = MAX(Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
Storage = Max_of_all_instances (Storage Device)
SCSI = Max_of_all_instances (SCSI Device)
VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
NFS server = Max_of_all_instances (NFS server Device)
Compute Issues = MAX (Error, PCIe)
CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other
Health, IPMI, BMC)
HA Issues = HA Host Status
VM = MAX (FT Issues, HA Issues)
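The fault side of the formula above is a tree of MAX operations, so the worst device decides the host's fault score. A minimal Python sketch of that roll-up for a host is shown here, together with the per-metric geometric mean. The dictionary layout, the function names and the sample scores are assumptions made for illustration; they are not the product's data model.

```python
from math import sqrt


def max_of_all_instances(scores):
    """MAX over every device instance; 0 when there are no instances."""
    return max(scores, default=0)


def host_fault_score(devices, ha_host_status=0):
    """Roll device fault scores up to a host, following the MAX tree on the slide.

    `devices` maps a device kind (e.g. "VMNic") to a list of fault scores,
    one per instance. This layout is hypothetical.
    """
    def group(kinds):
        return max(max_of_all_instances(devices.get(k, [])) for k in kinds)

    network = group(["Network", "DVPort", "VMNic"])
    storage = group(["Storage", "SCSI", "VMFS heartbeat", "NFS server", "CIM Storage"])
    compute = group(["Error", "PCIe"])
    cim = group(["Processor", "Memory", "Fan", "Voltage", "Temperature", "Power",
                 "System Board", "Battery", "Other Health", "IPMI", "BMC"])
    hardware_issues = max(network, storage, compute, cim)
    # Host = MAX (Hardware Issues, HA Issues), where HA Issues = HA Host Status
    return max(hardware_issues, ha_host_status)


def abnormal_workload_per_metric(abn_capacity, abn_demand, workload):
    """Geometric mean of the larger abnormality and the workload."""
    return sqrt(max(abn_capacity, abn_demand) * workload)


if __name__ == "__main__":
    sample = {"VMNic": [0, 50], "SCSI": [25], "Fan": [0]}
    print(host_fault_score(sample))
    print(abnormal_workload_per_metric(40, 90, 40))
```

Because every level is a MAX, a single NIC that has lost redundancy is enough to raise the whole host's fault score, which is the behaviour the slide describes.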
24. Threshold: a shift in mindset needed
vCenter sets “static” thresholds, which can be misleading
• During peak, it is common for a VM to reach high utilisation.
• A static threshold will generate alerts when it should not.
• vSphere admins quickly learn to ignore them, defeating the purpose of alerts to begin with.
• During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
• A static threshold will not generate alerts when it should have.
vCenter only sets high thresholds
• Do you set a static threshold for when CPU or RAM utilisation drops below 5%?
• A drop in IOPS across the entire storage array might be a sign of a terrible day ahead.
• vCenter will not alert when these happen:
• Utilisation drops from 75% to 1% when it should not.
• Utilisation changes from 5% to 70% when it should not.
• We need to plot both the upper range and the lower range
But each VM differs. And the same VM differs depending on day/time…
• Intelligence is required to analyse each metric and its expected “normal” behaviour.
25. Dynamic threshold & alerts
vCenter Operations uses dynamic thresholds
• They are dynamic and personalised down to the individual metric.
• Vary from object to object. 1000 VMs will each have their own thresholds.
• Vary from time to time. The same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart below.
• Vary from metric to metric. On an ESX with 12 cores, each core has its own CPU Usage threshold.
• You can set hard thresholds if you need to.
• This needs Enterprise edition. It comes with no static threshold defined.
• Steps http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
Notice the range varies
in size
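The idea of a threshold that varies with time can be illustrated with a small sketch. The code below learns a separate normal band for each hour of the day from past samples and flags values outside it, on either the high or the low side. This is not one of the algorithms vCenter Operations uses; the percentile band, the function names and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def learn_bands(samples, low_pct=5, high_pct=95):
    """Learn a (low, high) band per hour of day from (hour, value) samples.

    Each hour needs at least two samples for statistics.quantiles to work.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    bands = {}
    for hour, values in by_hour.items():
        cuts = quantiles(values, n=100)  # 99 percentile cut points
        bands[hour] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return bands


def is_anomalous(bands, hour, value):
    """True when the value falls below or above the band learnt for that hour."""
    low, high = bands[hour]
    return value < low or value > high


if __name__ == "__main__":
    # Hypothetical CPU usage history: busy at 09:00, idle at 03:00.
    history = [(9, v) for v in range(70, 90, 2)] + [(3, v) for v in range(2, 12)]
    bands = learn_bands(history)
    print(is_anomalous(bands, 9, 80))  # high usage at peak time
    print(is_anomalous(bands, 3, 80))  # the same usage in the middle of the night
    print(is_anomalous(bands, 9, 1))   # an unexpected drop at peak time
```

The same value of 80% is treated differently at 09:00 and at 03:00, and a sudden drop at peak time is flagged as well, which a single static high threshold cannot do.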
26. Dynamic Threshold Analysis
DT analysis runs nightly
• New dynamic thresholds are computed for each metric
Data categorization
• Tries to identify stat as linear, multinomial, step function, etc
• If one of those matches, that DT function is used
Otherwise: competition
• Sigma: assumes hourly cycles
• CCPD: tries to find normal cycles
• ACPD: tries to find abnormal cycles
DT Scoring
• Winner is assigned based on metric trending accuracy
The same metric may get a different DT function on a different day
[Diagram: For each metric → Data Categorization → Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT, Quantile Sigma DT, CCPD, ACPD → DT Scoring → Dynamic Thresholds]
27. Dynamic Threshold: Algorithm
[Equation: the joint density $f_{P_{1,1},P_{1,2},\ldots,P_{m,m}}(p_{1,1},p_{1,2},\ldots,p_{m,m})$, written in terms of the Gamma function; the parameter symbols are not recoverable from the slide text.]
where $\sum_{i=1}^{m-1}\sum_{j=1}^{m-1} p_{i,j} + \sum_{j=1}^{m} p_{m,j} = 1$, $0 \le p_{i,j} \le 1$, and $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$.
The marginal distribution of the $i$-th row of $J$ is a Dirichlet distribution over $(p_{i,1},\ldots,p_{i,m-1})$ for $i = 1,\ldots,m-1$, with a separate Dirichlet form for $i = m$.
It is pretty difficult for a human to beat the computer in analysing the data.
The above is one of the many algorithms applied by vCenter Operations.
28. Analytics
7 different analytics areas.
For the DT feature, there are 8 algorithms.
Only in Enterprise Edition.
These advanced features create Smart Alert.
29. Discussion Point
Raw Counters vs Derived Counters
Dynamic Threshold vs Static Threshold
30. Badge – Health
Answer complex questions like:
• How is the entire virtual data center doing? What’s the
degree of their health?
• For every cluster, host, datastore, what’s their health?
Health is a current Operational State.
• It represents what is wrong now that should be
addressed within 1 day. Thus Health needs to be scored
such that if it is red, then it really needs attention.
Weather Map
• Simple way to check that entire farm is healthy
• For child object, it is replaced with Health Trend
• Shows Health of all parent and child objects
• Each square can be VM, ESX, datastore, cluster, datacenter,
vCenter.
Value Explanation
75 – 100 Normal behaviour
50 – 75 The object experiences some problems.
25 – 50 The object might have serious problems. Check and take action as soon as possible.
0 – 25 The object is either not functioning properly or will stop functioning soon.
31. Badge – Workload
Answer complex questions like:
• For every object, how is Demand vs Supply?
• For every single VM, is CPU/Memory/Disk/Network
bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?
Workload is not utilisation or usage
• More accurate than utilisation, as it takes more factors into account than just utilisation.
Workload = (Demand/Entitlement)
• Entitlement is dynamic. Affected by shares, limit, etc.
• Demand ≠ Usage.
• Usage may mean passive usage. E.g. the RAM page is there but no write/read.
• Score is Max (CPU, RAM, Disk IO, Net IO), to draw attention to the busiest resource
Value Explanation
0 – 80 Workload is not high.
80 – 90 The object is experiencing some high resource workloads.
90 – 95 Workload on the object is approaching its capacity in ≥1 area.
>95 Workload on the object is at or over its capacity in ≥1 area.
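As a rough illustration of the Workload formula, the sketch below computes Demand/Entitlement per resource and takes the maximum, then maps the score to the bands in the table. The function names, the input dictionaries and the sample numbers are hypothetical, and how the edges of each band are assigned is an assumption.

```python
def workload_score(demand, entitlement):
    """Return (score, resource): the highest Demand/Entitlement ratio in percent."""
    ratios = {res: 100.0 * demand[res] / entitlement[res] for res in demand}
    bound_by = max(ratios, key=ratios.get)
    return ratios[bound_by], bound_by


def workload_band(score):
    """Map a Workload score to the explanation bands from the table."""
    if score > 95:
        return "at or over capacity"
    if score > 90:
        return "approaching capacity"
    if score > 80:
        return "some high resource workloads"
    return "not high"


if __name__ == "__main__":
    # Hypothetical VM: each resource uses its own unit, only the ratio matters.
    demand = {"cpu": 1.2, "ram": 3.0, "disk_io": 50.0, "net_io": 10.0}
    entitlement = {"cpu": 2.0, "ram": 4.0, "disk_io": 40.0, "net_io": 100.0}
    score, resource = workload_score(demand, entitlement)
    print(score, resource, workload_band(score))
```

In this example the VM is disk-bound: CPU and RAM look comfortable, yet the score exceeds 100 because disk demand is higher than what the VM is entitled to.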
32. Derived Metric: Demand
The chart below shows Demand in action.
I generated IOPS on a local datastore, resulting in a spike in latency (read latency went up from 3 ms to 60 ms).
Demand correspondingly went up from 4 to 100!
33. Badge – Anomalies
Answer complex questions like:
• Is our vDC doing business as usual today? Or is it a
dynamic environment with lots of unexpected
changes?
• Which VMs, ESX, cluster, datastore, etc are behaving
abnormally?
• …. and exactly which counters are the culprits?
Identifying metric abnormalities
• It needs to learn dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric.
• A month-end job means it needs 3 months.
• Normal range changes after configuration or application changes.
Anomalies score
• A high number of anomalies:
• Usually an indication of a problem
• Demand change
• Application team changes code/app
• KPI metrics impact the Anomalies score more than non-KPI metrics.
Value Explanation
0 – 50 Normal Anomaly range
50 – 75 The score exceeds the normal range.
75 – 90 The score is very high.
> 90 Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.
34. This virtual DC spans multiple vCenters.
vCenter Ops shows all the counters that
are behaving abnormally.
35. Badge – Faults
Answer complex questions like:
• What faults do we experience in our vDC?
• For every object, what faults does it have?
Specific knowledge of which vCenter Events
• Which events affect Availability and Performance of
which object?
• Pulled from active vCenter events
• Example:
• Loss of redundancy in NICs or HBAs
• Memory checksum errors
• HA failover problems
• Each fault has a default score (e.g. 25, 50, 75, 100)
• Highest individual Fault Score drives the Fault object Score
Best Practices:
• Do not change the Faults Threshold
• Use Alerts View to manage Faults. Filter it to just show Fault.
Value Explanation
0 – 25 No fault is registered on the object.
25 – 50 Faults of low importance happen on the object.
50 – 75 Faults of high importance happen on the object.
> 75 Faults of critical importance happen on the object.
36. Badge – Risk
Answer complex questions like:
• Do we have risk from performance and capacity in
our vDC? If yes, where are they and can you
quantify the seriousness?
• Which objects are at risk? What is the specific
risk?
Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
Risk is an early warning system.
• Identifies potential problems that could eventually hurt the performance
• The Risk Chart shows Risk score over the last 7 days, giving a view of the trend.
Value Explanation
0 – 50 No problems are expected in the future.
50 – 75 There is a low chance of future problems or a potential problem might occur in the far future.
75 – 100 There is a chance of a more serious problem or a problem might occur in the medium-term future.
100 The chances of a serious future problem are high or a problem might occur in the near future.
37. Badge – Time Remaining
Answer complex questions like:
• How much time do we have before we need
to buy more server, storage, network before
performance starts to degrade or we run out
of capacity?
• For every cluster, VM, datastore, how much
time do we have?
Measures time remaining before each
resource type reaches its capacity
• CPU
• Memory
• Disk (IOPS & Space)
• Network I/O
Early warning of upcoming provisioning needs
• Based on Score Provisioning buffer. Default value is 30 days.
• Set in “Capacity & Time Remaining” section
Value Time remaining
50 – 100 > 2x SP Buffer (60 days)
25 – 50 < 2x SP Buffer
<25 Near SP Buffer
0 < SP buffer (30 days)
38. Badge – Capacity Remaining
Answer complex questions like:
• How many more VMs can we put without impacting
performance or using up capacity?
• For every cluster, VM, datastore, which components
(CPU, RAM, Disk, Network) would run out first?
Early warning system
• A low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object
Percentage of Total VM “Slots” Remaining
• Based on the average size of the VM on the object (e.g. VM profile)
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
From the table, notice the value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity but around 40 days for time.
[Screenshot: 333 more VMs correlates to 77% Capacity Remaining for this object]
Value Capacity remaining
>10 >120 days
5 – 10 60 – 120 days
0 – 5 30 – 60 days
0 <30 days
39. Capacity Remaining Calculation
Determine Capacity Constraint Resource
Deployed or Powered On VMs
• Powered Off VMs only use disk space resources
• Powered On VMs use ALL of the 4 resources
Calculation Example Shown:
• Limiting Resource is Disk Space with 333 VMs
available
• Use the Deployed VM number of 99 to do the
calculation for percentage space remaining
• Determine Capacity Remaining
• 333 / (333 + 99) = 77%
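The calculation above can be sketched in a few lines of Python. The limiting resource is the one that fits the fewest additional VMs, and the percentage compares that number with the VMs already deployed. Only the 333 and 99 come from the slide; the other per-resource figures and the function name are made up for illustration.

```python
def capacity_remaining(vms_that_fit, deployed_vms):
    """Return (limiting resource, VMs remaining, percentage of VM slots remaining)."""
    limiting = min(vms_that_fit, key=vms_that_fit.get)
    remaining = vms_that_fit[limiting]
    pct = 100.0 * remaining / (remaining + deployed_vms)
    return limiting, remaining, pct


if __name__ == "__main__":
    # How many more average-sized VMs each resource could still take (hypothetical,
    # except disk space = 333 as on the slide).
    vms_that_fit = {"cpu": 500, "memory": 420, "disk_io": 600, "disk_space": 333}
    limiting, remaining, pct = capacity_remaining(vms_that_fit, deployed_vms=99)
    print(limiting, remaining, round(pct))
```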
40. Capacity and Time details
You can drill down to see details
• You can check the 9 components, as
shown on the right.
• This helps answer which components have how many days or VMs left!
• Summary = Min (all 9 components)
41. Badge – Stress
Answer complex questions like:
• In our vDC, do we have stress points or
periods? How bad is it?
• For every cluster, VM, datastore, which ones
are experiencing stress and how bad is it?
Measures long-term or chronic workload
(6 weeks)
• Chart shows weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
• Workloads > 70% = “Stressed”
• Threshold configurable as per screenshot below
Value Explanation
0 – 1 Normal score. No action needed.
1 – 5 Some of the object resources are not enough to meet the demands.
5 – 30 The object is experiencing regular resource shortage.
>30 Most of the resources on the object are constantly insufficient. The object might stop functioning properly.
42. Stress Calculation
[Chart: Workload (0 – 100) over 6 Weeks, with the Stress Line at 70 and the Stress Zone above it; the example area is labelled 12%]
Stress Score is a % and is based on area of Workload Above “Stress Line” Threshold
compared to the Total Capacity of the object
• Stress Score = (Stress area / Stress Zone) *100
• But max value can be > 100% as the workload can be >100.
Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12
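The Stress Score formula can be sketched as follows. The sketch assumes that the workload is sampled at regular intervals and that the Stress Zone is the rectangle between the Stress Line and 100 over the whole period; both are assumptions, as the slide does not spell them out. The function name and sample data are hypothetical.

```python
def stress_score(workloads, stress_line=70.0, capacity=100.0):
    """Area of workload above the stress line, as a percentage of the stress zone."""
    stress_area = sum(max(w - stress_line, 0.0) for w in workloads)
    stress_zone = (capacity - stress_line) * len(workloads)
    return 100.0 * stress_area / stress_zone


if __name__ == "__main__":
    # Ten equally spaced samples: mostly at 50, with two peaks at 88.
    print(stress_score([50] * 8 + [88] * 2))
    # Workload can be above 100, so the score can exceed 100 as well.
    print(stress_score([130] * 4))
```

With the first sample set, 12% of the zone above the line is filled, which matches the slide's example of a Stress Score of 12.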
43. Badge – Efficiency
Answer complex questions like:
• Are there optimization opportunities in our
vDC?
• How well do we do in terms of VM
provisioning? Do we get them right?
Efficiency Score factors
• Reclaimable waste
• Density ratio
Graph Depicts VMs by Percent
• Optimal – Optimally Provisioned VMs
• Waste – Over Provisioned VMs
• Stress – Under Provisioned VMs
• Not used in Efficiency Calculation (see Risk)
Three Resources Considered
• CPU
• Memory
• Disk Space
Note: VMs can appear in Stress and Waste
Value Explanation
>25 The efficiency is good. The resource use on the selected object is optimal.
10 – 25 The efficiency is good, but can be improved. Some resources are not fully used.
0 – 10 The resources on the selected object are not used in the most optimal way.
0 The efficiency is bad. Many resources are wasted.
44. Badge – Reclaimable Waste
Answer complex questions like:
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we
reclaim?
It identifies the amount of reclaimable
resources
• CPU
• Memory
• Disk
Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• Disk calculation can also include old snapshots and templates
Value Explanation
0 – 50 No resources are wasted on the selected object.
50 – 75 Some resources can be used better.
75 – 100 Many resources are underused.
100 Most of the resources on the selected object are wasted.
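A minimal sketch of the waste calculation is shown below. It treats each resource's waste score as reclaimable capacity divided by deployed capacity, in percent, and takes the maximum across CPU, RAM and disk space. Treating the per-resource score as that plain ratio is an assumption, and the function name and sample figures are hypothetical.

```python
def reclaimable_waste(reclaimable, deployed):
    """Return (waste score, worst resource, per-resource scores in percent)."""
    scores = {res: 100.0 * reclaimable[res] / deployed[res] for res in deployed}
    worst = max(scores, key=scores.get)
    return scores[worst], worst, scores


if __name__ == "__main__":
    # Hypothetical cluster: vCPUs, GB of RAM and GB of disk space.
    deployed = {"cpu": 64, "ram": 256, "disk_space": 2000}
    reclaimable = {"cpu": 16, "ram": 128, "disk_space": 300}
    score, worst, per_resource = reclaimable_waste(reclaimable, deployed)
    print(score, worst, per_resource)
```

Here half of the deployed RAM could be reclaimed, so RAM drives the overall score even though CPU and disk are sized more tightly.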
45. Badge – Density
Answer complex questions like:
• How high can we push our consolidation
ratio before we experience performance
problem?
• Now that’s a million dollar question!
• For every datacenter, cluster, ESXi, what
are our key ratios and how much head
room do we have?
Contrasts Actual vs Ideal Density
• Identify Optimal Resource Deployment
Before Contention Occurs
• Ideal is based on demand, not simple
configuration.
• High Density is good. 100 is not too high.
Value Explanation
>25 Good consolidation
10 – 25 Some resources are not fully consolidated
0 – 10 The consolidation for many resources is low
0 The resource consolidation is extremely low.
46. Badge Thresholds
There are 2 different sets of thresholds: VM and Infra (ESXi, Cluster, Datastore, etc)
Notice that a Major badge has different thresholds from its minor badges
Even “similar” badges have different thresholds. Notice Time remaining and Capacity remaining have very different thresholds.
Disable Color Threshold by clicking the Level Off
47. Using badges together
Workload High & Anomalies Low & Stress High: Add resources
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object is often running under high Workload.
Workload High & Anomalies Low & Stress Low: Not likely a big problem… a cyclical workload spike?
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object usually has enough resources
Workload High & Anomalies High: Something is amiss! Immediate attention.
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Abnormal behavior for this timeframe
If there are Alert and Fault too, then it is a sign of a major issue
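The three combinations above form a small decision table. A sketch of it in Python follows; the function name is hypothetical, and deciding when a badge counts as “high” is left to the caller, since that depends on the badge thresholds in use.

```python
def triage(workload_high, anomalies_high, stress_high):
    """Suggest an action from the Workload, Anomalies and Stress badges."""
    if not workload_high:
        # The slide only discusses objects whose Workload is high.
        return "not covered by the slide"
    if anomalies_high:
        return "Something is amiss: immediate attention"
    if stress_high:
        return "Add resources"
    return "Not likely a big problem: possibly a cyclical workload spike"


if __name__ == "__main__":
    print(triage(workload_high=True, anomalies_high=False, stress_high=True))
    print(triage(workload_high=True, anomalies_high=False, stress_high=False))
    print(triage(workload_high=True, anomalies_high=True, stress_high=False))
```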
48. Discussion Point
Is Badge the way to go?
Are these the right 11 badges?
What other badges do you need?
49. Heat Map
A great way to show a lot of information on 1 screen.
Built-in heat maps
• Basic:
• Storage: space, IO
• CPU
• RAM
• Network
• Advanced (or composite)
• Health
• Workload
• Capacity
Custom heat map or cold map
• Since we can change the color, we can actually create a cold map.
• In a cold map, the bigger the size, the colder it is (the less utilised it is). The bluer it is, the less utilised it is.
• Hence it focuses on Waste
Heat map can quickly highlight information, as it can present relative information. It is good for relative comparison among VMs.
Heat map is a 2 dimensional chart. So it takes 2 parameters. You cannot choose >2 data. For example, you cannot show the following at the same time:
• IOPS, Latency and Throughput. Also, these 3 have different units so it’s hard to combine using Super Metric.
• ESX, VM and Datastore.
50. Storage: Datastore + VM vs workload + latency
Since all the datastores are on the same array, how do we quickly tell the relative
workload generated by every one of them?
• This answers: which datastores are heavily loaded?
For each of these datastores, how do we know the relative workload generated by
the VM?
• This answers: which VMs dominate within a datastore?
For every VM, how do we know whether its performance is reasonable?
• This answers: which VM has a storage bottleneck?
How do we show all the above data in one page, without the need to show a lot of
numbers?
• And we still want to be able to drill down to each VM and datastore.
51. Each square is a VM. They are grouped by datastore.
Bigger square: bigger throughput
Color: latency.
52. Storage: Throughput vs Latency at cluster level
Which cluster is generating high storage workload?
Are they getting the SLA they ask? What’s the latency? The cluster owner wants to
know that his entire cluster is getting <10 ms latency.
We expect these X, Y, Z clusters to be doing little work. Can we prove this?
Basically, the same concept as the previous slide, but from the cluster point of view, as Cluster & Datastore have a Many-to-Many relationship.
55. Storage: Throughput vs Latency at VM level
Can we show at VM level now?
That’s why you need a 24” monitor
56. Storage: Space vs Latency
Any big VM that is not getting the SLA we agreed on?
57. Storage: Datastore space contention
Do we have space contention on any of the datastores? If yes, how bad is the contention?
• While we use thick provisioning at vSphere level (and thin at array level), we still face space risk from snapshots, vRAM increase, new VM, new vDisk, storage vMotion, storage DRS, etc.
Are the datastores uniformly sized?
59. CPU: Contention vs Usage at cluster level
Which clusters are doing the most work? Which are not doing much?
How is the CPU workload on every cluster?
For each of those clusters, can we see if there is CPU contention?
60. CPU: Contention vs Usage at host level
Same questions as the previous slide, but for hosts.
We can expect some “drill down” in this heat map
61. CPU: Contention vs Usage at VM level
Can we show at VM level now?
That’s why you need a 24” full HD
monitor
62. VM Health
Current Health
• Are all the VMs healthy? Especially those VMs which have high workload!
• Which VMs are experiencing problems?
• Are more demanding VMs less healthy?
• Can we see this by cluster? By host?
Future Health
• Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and
network for every single VM!
• For those VMs which are not ok, can we be specific on which value will run out first? Can we
“drill down” to individual VM?
64. VM: color by capacity, size by workload
This now shows a future projection. We can see that the VM vCenter 5 is red: its capacity will run out within 30 days. So we click on it to drill down.
65. Drill down to specific VM
Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.
We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the
physical world as it’s fixed, but unwise in virtual world.
66. Drill down to specific VM
Showing value in absolute terms is good, but can be confusing. vCenter Ops can also
show in %
67. Discussion Point
Which heat maps are useful for you?
What other heat maps or cold maps do you need?
68. Smart Alert vs Normal Alert
Smart Alert
• Relies on the advanced analytics instead of simple raw counters.
• Not static, as it is based on Dynamic Threshold
• Examples:
• Early warning alerts: use total anomalies to predict when a problem is happening, sometimes before users are impacted
• KPI predictive: prediction that a KPI might soon go abnormal due to an event occurring that has preceded the KPI going
abnormal on previous occasions
• Fingerprint: set of metric anomalies matches previously seen problem (and associated resolution)
Comparison
Advanced Edition | Enterprise Edition
Provides alerts on minor badges. E.g. Workload YES, Health NO | Provides alerts on any counter (raw, badge, super metric)
Can only do infrastructure-level alerts | Can do application-level alerts
Good for alerts on single objects (e.g. VM) | Good for single or multi objects
Driven by the badge’s changing color | Driven by threshold anomaly breaches and KPI threshold breaches
Not customisable | Highly customisable
Cannot do alerts at Resource Pool or Folder | Can do it
70. Alert
When does Alert happen
• When a badge changes color
• When a fault happens
• VC Ops own alert
• A component in VC Ops itself has failed.
• VC Ops cannot get data
Can do SNMP and SMTP
• Both are set on the Administration Web page. The URL format is https://VM-IP/admin/
71. Advance edition: Alert main window
Filter by the 11 badges
Filter the VC Ops own alert: system or environment
73. Enterprise edition: Alert main window
New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, Classic (static)
We can also color the row by criticality, and specify period (start – end)
78. Anomalies – Symptoms Window
The example is from an ESXi host with 11 VMs.
Example of an ESXi Anomalies symptom window.
• It shows 3 resource types: VM, Datastore, Host System
• The VM resource kind has 7 metric groups with anomaly.
The VM resource kind (30 out of 71 Symptom)
• 71 – Total number of Symptoms under VM object
• We’re reporting on an ESX here, and VM is a child of host. So all children
metrics are included.
• The metric group comes from the vSphere adapter + VC Ops own.
• 30 – Total number of Displayed Symptoms
• Based on the limit of 5 metrics shown for each Metric Group
• The metric group (CPU Usage, network, Summary, etc) are specified by the
adapter
• Subcategory Network (3 of 11)
• 11 – The total number of VMs associated with this ESX. This is not the
number of symptoms.
• 3 – The total number of VMs that have one or more Network symptoms.
Metrics will not be identical among VMs.
Most will be similar though.
A multi-vCPU VM will have more vCPU metrics than a 1-vCPU VM.
Different VMs will have different anomalies,
as they have different workloads.
80. Storage in vCenter Operations
Automatic learning of storage
performance.
Calculating both Demand and
Normal rate.
81. vSphere 5 Performance Chart (fat client)
Can only choose 1 component at a time, e.g. cannot show CPU and RAM at the same time.
82. vSphere 5 Performance Chart (fat client)
Can only show 1 chart at a time.
Hence can only show 2 units at a time.
83. vCenter Operation charts
Can show >1 chart at a time. Can combine/split charts.
Can show different data types from different objects.
Line is color coded, showing when threshold is breached.
84. Capacity Management in vSphere is hard
CPU Optimizations
vSMP, Shares, Reservations, Limits
Memory Optimizations
Transparent Page Sharing, Memory Ballooning, Memory Compression
Storage Optimizations
Thin Provisioning, Linked-Clones
Clusters
DRS, HA, FT, vMotion, Storage vMotion
Workload Flux
VMs growing/shrinking, added/removed
[Diagram: vSphere capacity split into Reserved Capacity, Usable Capacity, Used Capacity and Remaining Capacity (?), with “36 days remaining”]
85. Capacity Management
Analyze
• What are my historical utilization trends?
• What resources have been requested vs. needed?
• How many more VMs will fit in my current farm?
Optimize
• How can I use my resources more efficiently?
• What VMs should be right-sized?
• Can I reclaim over-provisioned or unused capacity?
Forecast
• When will I run out of capacity?
• What if I add, remove, reconfigure capacity?
• Can I defer infrastructure investments?
86. Understanding Behavior
Need to understand the weekly pattern
• Business week
• Weekend
• E.g. workload spike at 9am on Mondays
Accomplish through roll-ups
• Roll up the weeks in a month to compute the typical week for the month
• Roll up the typical week of each month into a typical week for the quarter
Differs from performance management roll-ups
• Older performance data gets less granular. vCenter loses accuracy
• Older capacity data maintains its granularity
[Figure: roll-up hierarchy, Month 1 / Month 2 / Month 3 into Quarter 1 into Year 1]
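The roll-up idea can be sketched in a few lines of Python. This is only an illustration, not how vCenter Operations implements it: it assumes each week is a list of 168 hourly samples and that the "typical week" is the per-slot average. The function name `typical_week` and the sample numbers are hypothetical.

```python
from statistics import mean

HOURS_PER_WEEK = 7 * 24


def typical_week(weeks):
    """Roll up several weeks of hourly samples into one typical week.

    Assumption: "typical" means the per-slot average; the product may use
    a different statistic.
    """
    if not weeks:
        raise ValueError("need at least one week of data")
    if any(len(week) != HOURS_PER_WEEK for week in weeks):
        raise ValueError("each week must hold 168 hourly samples")
    # zip(*weeks) groups the same hour-of-week slot across all weeks.
    return [mean(slot) for slot in zip(*weeks)]


# Four weeks of a flat 20% load with a spike at 9am on Mondays
# (slot 9, taking Monday 00:00 as slot 0).
month = []
for spike in (80, 90, 85, 85):
    week = [20.0] * HOURS_PER_WEEK
    week[9] = float(spike)
    month.append(week)

typical_month_week = typical_week(month)
# Rolling up months into a quarter reuses the same function.
typical_quarter_week = typical_week([typical_month_week] * 3)
```

Because the roll-up works slot by slot, the Monday 9am spike is still visible in the typical week, whereas a plain average over time would flatten it.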
86
94. What-if
Visualise
• Add or remove VMs.
• Add based on existing VMs as profiles
• Add based on spec you supply
• Add, remove, or update hosts.
• Modify CPU and RAM only. No Network.
• Add, remove, or update datastores.
• Update means increase or decrease size.
• No IOPS yet.
At a cluster level or host level
• Cannot do at datacenter or higher level
• Host level does not make sense when host has HA & DRS turned on
You can add multiple what-if scenarios
• You can combine them or compare them on the same chart
• You cannot save them. Changes are lost upon log-off.
• You can export the scenario results to an Adobe PDF or CSV file.
94
117. Capacity Planning: Is the VM really sized properly?
Setting a threshold of under-utilisation alone is not enough
We need to calculate the degree of under-utilisation.
117
121. Tips
Number of intervals and data points used for analysis
• Tied to your business cycles.
• Pick the correct number of data points and the interval type to represent a typical business cycle.
• Match the number of intervals used for trend view and the number of data points used for forecasting
• Stay with the default forecasting algorithm settings
Leverage buffer settings to accommodate unforeseen usage spikes or future business growth.
• VC Ops 5 does not yet have a “future incoming VM” concept
Leverage business hours to eliminate off-peak usage
Don’t be afraid to play with global settings
• They are just knobs used for data analysis
• Raw data is not modified when global settings are changed
121
122. Change Events Correlated with Performance
Overview
• Integration between vCM and vC Ops Mgr for change events
• Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs
• Launch in context into vCM to see full details of changes and potentially remediate them
Benefits
• Enable Operations to quickly understand and resolve performance issues arising from configuration changes (reduce MTTR)
• Drive efficient & effective troubleshooting by correlating Guest OS configuration changes with VM performance degradations
• In larger enterprises, help bridge the gap between VMware Admin and Guest OS Admin
122
123. VCM Events in vC Ops – Event Collected
vC Ops does not pull in every event from vCenter
• Only events that could affect health or workload (vSphere Knowledge!)
Adapter only pulls in change events for Guest OSs
• No ESX/i Host configuration changes (these come from the vCenter Adapter)
• Guest OS has to be managed by VCM
Events Collected
Reboot
Software Install/Uninstall
Windows Registry
IP/Networking changes
Device Driver changes
Memory/CPU changes
Windows Firewall
Patches
123
124. Event Types in vC Ops Mgr
Circle Events are vCM-initiated
• Change log in vCM updated when change is completed
• Time = Occurred time
Diamonds are non-vCM-initiated
• Change log in vCM updated when vCM collects from VM
• Time = Collected time
Always Blue Events – “Might” have minimal impact
vCM events for VMs follow the normal vC Ops display rules
• vCM Events appear for the VM Object itself
• vCM Events appear on an ESX host if you enable Child Events
124
129. Terms
The terms Attribute, Metric, Counter mean the same thing.
• CPU Ready Time is an attribute.
• CPU Ready Time from the VM ABC123 is a metric.
• vSphere uses the word Counter. VC Ops uses Attribute and Metric.
• As there are many attributes, they are grouped together. This is called Attribute Package.
Resource provides the Metrics.
• Examples of resources: host, VM, datastore, cluster, etc.
• So a resource provides many attributes.
• Resources are pulled via an Adapter.
Kind
• In VC Ops, there are many kinds of resources. So there is a term Resource Kind, that you need to get used to.
• VC Ops uses different adapters to talk to different sources. 1 type of adapter per source. So there is a term Adapter Kind.
[Figure: hierarchy of Adapter, Resources and Attributes]
Advanced terms
• Container. Super Metrics. Application. Tier. KPI
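The relationship between these terms can be pictured as a small data model. The sketch below is purely illustrative: the class names `Resource` and `Metric` and the field layout are assumptions made for this example, not VC Ops objects or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    """An attribute as observed on one specific resource."""
    resource_name: str
    attribute: str


@dataclass
class Resource:
    name: str
    resource_kind: str   # e.g. "VM", "Host", "Datastore"
    adapter_kind: str    # e.g. "VMware Adapter"
    attributes: list = field(default_factory=list)

    def metrics(self):
        """Each attribute of this resource yields one metric."""
        return [Metric(self.name, attr) for attr in self.attributes]


# "CPU Ready Time" is an attribute; "CPU Ready Time from VM ABC123" is a metric.
vm = Resource(
    name="ABC123",
    resource_kind="VM",
    adapter_kind="VMware Adapter",
    attributes=["CPU Ready Time", "Memory Ballooning"],  # illustrative names
)
vm_metrics = vm.metrics()
```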
129
130. Adapter, Resource, Attribute, Package
VC Ops Adapter Source of data
VMware Adapter vSphere 5
VCM Adapter VCM 5.4
VC Ops Adapter VC Ops 5
Container Adapter
Adapter Kind = adapter type. VMware Adapter is an example of an Adapter Kind.
1 Adapter Kind can pull many kinds of objects from the source. This is called Resource Kind.
To make management of attributes easier, they are put into a Package. Inside a package, metrics are grouped for ease of use.
Container Adapter is not actually an adapter. It’s a group or container that can hold other objects.
[Screenshot callout: This is the actual Resource Kind brought by VMware Adapter]
130
131. Actual Resource Kinds
Sample adapters with their associated resource kinds.
This is a special & built-in adapter. This monitors VC Ops itself! VC Ops is just an application, which also needs monitoring.
This is another special & built-in “adapter”. Technically, this is actually not an adapter, as it’s just a container.
131
132. vSphere resource kinds
Unlike the Advanced edition, we can utilise Folder and Resource Pool
• This means you can create Super Metric at this level.
• Complement vCenter.
Not used?
ESX Host
Not used?
No vApp, no Datastore Group, no vDS as of VC Ops 5.
132
134. Attribute & Attribute Package
Package
• A collection of Attributes from 1 Resource with the same collection interval. That’s all!
• Need to map it to objects
• Super Metric must be placed into a package
• A package cannot come from multiple resources. See screen below.
• Cannot create a package that has both VM and ESXi
• There is a default package called All Attributes.
134
141. Resource Kind: Tags
What’s the difference between Applications and Application? It looks like Application is from the Container adapter, which is built-in.
Maintenance schedule contains the time a particular object is on scheduled downtime. It is used to tell VC Ops to ignore the object during that time; otherwise VC Ops would raise an alert, as the behaviour is unexpected. It would think the health has dropped!
So in this screen, ignore maintenance schedule, as it should not be part of Resource Kind.
The range for Health. This is not the same as the Health badge in VC Ops Advanced, as this is universal and applies beyond vSphere. Health in the Advanced edition includes Fault, which is vSphere specific.
Tier is a special container. Again, this is universal, so name your tiers properly to avoid renaming them later on.
Only 1 value here. This means the entire VC Ops.
141
142. Resource Kind: Tags
You can control which resource kinds are shown
• In the picture below, ESX was hidden.
142
147. Monitoring the big workload
You have convinced your CIO to virtualise the remaining 50% of the servers.
Your CIO needs you to prove, supported by performance charts, that the platform has served every VM well, meeting the SLA in the past 1 quarter.
• Tier 1 cluster SLA: 2% CPU Ready, 0 RAM Ballooning, 10 ms disk latency, 0 dropped packets.
• Tier 2 cluster SLA: 4% CPU Ready, 5% RAM Ballooning, 20 ms disk latency, 0 dropped packets.
• Tier 3 cluster SLA: 6% CPU Ready, 10% RAM Ballooning, 30 ms disk latency, 0 dropped packets.
You have 500 VMs on 50 ESXi hosts, 8 clusters, 40 datastores, 5 RDMs.
You must prove that:
• Not a single Tier 1 VM has >2% CPU Ready in the past 1 quarter. The underlying ESXi also has <2% CPU contention.
• Not a single Tier 1 VM has >10 ms disk latency in the past 1 quarter. The underlying ESXi also has <10 ms disk latency.
• Etc., for each Tier and each component (CPU, RAM, Disk, Net)
What kind of charts do you need to show?
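To make the requirement concrete, here is a rough sketch of the check that the charts have to support: every sample of every counter is compared against the tier limit. The limits come from the slide; the sample layout, the counter names and the function `sla_breaches` are hypothetical, and ballooning is assumed to be expressed in percent for all tiers.

```python
# Per-tier SLA limits taken from the slide above.
SLA_LIMITS = {
    1: {"cpu_ready_pct": 2, "ballooning_pct": 0, "disk_latency_ms": 10, "dropped_packets": 0},
    2: {"cpu_ready_pct": 4, "ballooning_pct": 5, "disk_latency_ms": 20, "dropped_packets": 0},
    3: {"cpu_ready_pct": 6, "ballooning_pct": 10, "disk_latency_ms": 30, "dropped_packets": 0},
}


def sla_breaches(tier, samples):
    """Return (sample index, counter, value) for every value above the tier limit."""
    limits = SLA_LIMITS[tier]
    return [
        (index, counter, sample[counter])
        for index, sample in enumerate(samples)
        for counter, limit in limits.items()
        if sample[counter] > limit
    ]


# Made-up samples for one VM.
vm_samples = [
    {"cpu_ready_pct": 1.2, "ballooning_pct": 0, "disk_latency_ms": 7, "dropped_packets": 0},
    {"cpu_ready_pct": 2.6, "ballooning_pct": 0, "disk_latency_ms": 9, "dropped_packets": 0},
    {"cpu_ready_pct": 0.8, "ballooning_pct": 0, "disk_latency_ms": 14, "dropped_packets": 0},
]
tier1_breaches = sla_breaches(1, vm_samples)
tier2_breaches = sla_breaches(2, vm_samples)
```

In this made-up data the averages (about 1.5% CPU Ready, 10 ms latency) stay within the Tier 1 limits, yet two individual samples breach them. "Not a single VM" therefore calls for charts of peaks, not averages.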
147
149. Super Metric: Functions
2 types:
• Looping functions: take multiple input values
• Average, sum, min, max, count, combine, etc.
• More practical or useful than single functions
• Single functions: take 1 value
• Absolute, round up, round down, square root, etc.
The xxxN functions do not work on just the immediate children. They look down (or up) the number of levels specified in the formula.
• This ‘2’ tells the function to look down two levels for the metric.
• Putting -2 means look up.
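The level argument of the xxxN functions can be illustrated with a toy resource tree. This is a sketch only: `Node`, `resources_at` and `avg_n` are hypothetical names, and it assumes the function reads resources exactly N levels away (positive looks down, negative looks up), which the slide does not spell out.

```python
class Node:
    """A resource in a parent/child hierarchy, e.g. cluster > host > VM."""

    def __init__(self, name, metrics=None, parent=None):
        self.name = name
        self.metrics = metrics or {}
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def resources_at(node, depth):
    """Resources exactly `depth` levels below (positive) or above (negative) node."""
    if depth == 0:
        return [node]
    if depth < 0:
        return resources_at(node.parent, depth + 1) if node.parent else []
    found = []
    for child in node.children:
        found.extend(resources_at(child, depth - 1))
    return found


def avg_n(node, metric, depth):
    """Average of `metric` over the resources `depth` levels away from node."""
    values = [r.metrics[metric] for r in resources_at(node, depth) if metric in r.metrics]
    return sum(values) / len(values) if values else None


cluster = Node("cluster-1")
host1 = Node("host-1", {"cpu_usage": 40}, parent=cluster)
host2 = Node("host-2", {"cpu_usage": 50}, parent=cluster)
vm_a = Node("vm-a", {"cpu_usage": 10}, parent=host1)
vm_b = Node("vm-b", {"cpu_usage": 30}, parent=host1)
vm_c = Node("vm-c", {"cpu_usage": 50}, parent=host2)

cluster_vm_avg = avg_n(cluster, "cpu_usage", 2)   # VMs are 2 levels below the cluster
vm_host_value = avg_n(vm_a, "cpu_usage", -1)      # look up 1 level, to the host
```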
149
150. Super Metric: hierarchy
Example: super metric for Average CPU usage of a cluster
VM is 2 levels down from cluster.
150
153. Super Metric: Operators
To calculate a value for each VM based on metrics for that VM, use the ‘$This’ operator.
Another example: max ( $This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)
Finds the maximum value among these:
• CPUavg metric for the resource to which the super metric is assigned (so this is dynamic)
• CPUavg metric for a specific resource called ESXi-Host-003 (so this is hardcoded)
• CPUavg metric for all resources of type VM (so this is universal for all VMs)
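The three kinds of reference in that formula can be mimicked with a small resolver. This is a toy sketch, not the VC Ops formula parser: the `inventory` layout and the functions `resolve` and `super_max` are assumptions, and a target that matches a resource name is treated as that resource before it is treated as a resource kind.

```python
def resolve(ref, assigned_to, inventory):
    """Expand one 'target:metric' reference into a list of values."""
    target, metric = ref.split(":")
    if target == "$This":
        names = [assigned_to]                 # dynamic: the assigned resource
    elif target in inventory:
        names = [target]                      # hardcoded: one named resource
    else:
        names = [n for n, r in inventory.items() if r["kind"] == target]  # a whole kind
    return [inventory[n]["metrics"][metric] for n in names]


def super_max(refs, assigned_to, inventory):
    """Maximum over every value that the references expand to."""
    return max(v for ref in refs for v in resolve(ref, assigned_to, inventory))


# Made-up inventory for illustration.
inventory = {
    "cluster-1": {"kind": "Cluster", "metrics": {"CPUavg": 35}},
    "ESXi-Host-003": {"kind": "Host", "metrics": {"CPUavg": 55}},
    "vm-a": {"kind": "VM", "metrics": {"CPUavg": 20}},
    "vm-b": {"kind": "VM", "metrics": {"CPUavg": 70}},
}
refs = ["$This:CPUavg", "ESXi-Host-003:CPUavg", "VM:CPUavg"]
result = super_max(refs, "cluster-1", inventory)
```

Assigning the same formula to a different resource changes only the `$This` term; the hardcoded host and the VM-wide term stay the same.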
153
159. Discussion Point
Think of super metrics that you need.
Explain why and how you will need them.
159
160. Applications and Application Tiers
App Teams often view things from their own application-centric perspective. We can create a custom dashboard showing their “Application”.
Even better if we add non-vSphere data, like Hyperic. This gives app-level info and Guest-OS-level info, which is not available in the vSphere adapter.
Define your own hierarchy and relationship
160
164. What counters do you check?
Counters to check per component, for ESX and for VM:
CPU
• ESX: Usage or Utilisation: overall CPU utilisation (to get overall utilisation of the entire box). Usage or Utilisation: individual core utilisation (to see distribution and if any particular core is maxed out). Wait (wait for IO. To see if it’s IO bound). Ready (VM unable to run, waiting for core). Co-Stop (if there are large VMs).
• VM: Usage or Utilisation: overall CPU utilisation. Usage or Utilisation: individual core utilisation. Wait (wait for IO. To see if it’s IO bound). Ready (VM unable to run, waiting for core). Co-Stop (if there are large VMs).
RAM
• ESX: Ballooning. Active or Active Write.
• VM: Ballooning. Active or Active Write.
Storage
• ESX: Latency: kernel latency, device latency. Throughput. IOPS.
• VM: Guest Latency. Device Latency. Throughput. IOPS.
Network
• ESX: Dropped packets. Throughput.
• VM: Dropped packets. Throughput.
Others
• vSphere Replication? System? Cluster service?
164
165. Test your vSphere knowledge!
How are Disk, Datastore, Adapter and Path related?
165
166. CPU counters
Test your vSphere knowledge! Which one is ESX, which one is VM? How do you know?
Test your vSphere knowledge! What can stop/block a VM from getting the CPU it was configured with?
No more Collection Level limitation. VC Ops collects them all and analyses them all.
Changing the collection level in vCenter does not impact VC Ops, as VC Ops gets its data from the “real-time” statistics.
166
167. %OVRLP and %SYS
[Figure: timeline of World 1 and World 2 across Run, Wait and Ready, marking %RUN, %SYS and %OVRLP. %RUN continues to accumulate, but %OVRLP kicks in.]
%OVRLP: Overlapping time. A world still wants CPU but is interrupted by another world. A high number normally means ESX is experiencing heavy IO.
%USED = %RUN + %SYS - %OVRLP
As a result, the overlap value does not incorrectly inflate %USED.
%SYS: A high number means heavy IO or interrupts.
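A worked example of the %USED formula, as a minimal sketch. The function name `pct_used` and the input values are made up for illustration.

```python
def pct_used(run, sys, ovrlp):
    """%USED = %RUN + %SYS - %OVRLP, with all values in percent."""
    return run + sys - ovrlp


# A world that ran 60% of the interval, had 5% system time charged to it,
# and was overlapped by other work for 3% of the interval.
used = pct_used(run=60.0, sys=5.0, ovrlp=3.0)
```

Here `used` is 62.0: the 3% of overlap is subtracted so that it is not counted as part of this world's own usage.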
167
173. Storage counters: VM
[Diagram: VM with Drive 1, Drive 2, Drive 3; each on a vDisk (Virtual Disk: VMDK, RDM; e.g. scsi0:0, scsi0:2); vDisks on a VMFS Datastore, an NFS Datastore or an RDM; Datastores on Disks]
173
220. The VC Relationship
There are 2 widgets that are vSphere related.
Use the advanced edition instead.
• Enterprise edition can access Advanced edition UI at the same time. Just open another window
or tab.
221
221. Interaction between widgets
Controlled at the dashboard level, not at the individual widget level
Providing widget and Receiving widget
222
224. Practice session: creating your dashboard
Goal: have a dashboard to help you investigate all non-local datastores quickly
• Be able to plot chart for all non-local datastores for comparison.
Answer:
• Create a tag called Storage from the Environment screen.
• Create 1 tag value: Shared Datastore
• Tag all the non-local datastores with this tag value
• Done manually. Simply drag all the rows
• Create a dashboard with 4 widgets
• Health Status
• This is where you show the overall health of all Non-Local Datastores
• Resources
• This is where you show all the members of Non-Local Datastore tags
• Metric Selector
• All the metrics will appear here.
• Select the metric you want
• Metric Graph or Metric Sparklines
• Choose Sparklines if you have lots of graphs.
225
234. Major Steps in implementation
Define who needs what → Create Super Metrics → Create Applications → Create Tags → Create Heat Maps → Create Dashboards
Begin with the end in mind
• Every Super Metric must serve a particular role
• Role, not individual. A person can & will have many heatmaps/dashboards.
• Decide if you need the following non-standard info
• Application-level & Guest-OS-level info
• Info from physical machines (UNIX, X64, etc)
• Info from physical storage and network (switch, FW, router, etc)
Think in terms of application
• A great way to complement vSphere as vCenter does not have this object.
235
235. Who needs to see what
CIO or CTO
• Simple Dashboard.
• Big picture. Tend to be application focused.
• No absolute data. Normalised to 0-100.
• Focus on long term.
• Averaged data. A 30-minute spike will not show up.
• Updated daily.
Group Head
• e.g. Head of Infra, Head of Apps
Dept Head
• e.g. Head of Storage, Head of Server, Head of Network, Head of Databases
Admin/Architect
• e.g. Storage Admin, Network Admin, App Owner, VM Owner
• Rich Dashboard. Ideally Full HD screen.
• Specific info.
• Absolute data + Normalised Data.
• Focus on short term.
• Actual data. A 5-minute spike will be visible.
• Updated every 2 minutes.
236
236. Who needs to see what (samples)
Roles and info presented
CIO
• Health of overall IT in the past 1 month
• Health of key applications in the past 1 month
CTO
• As above, but with more technical content, and tailored to him.
Head of Applications
• Health of all key apps in the past 1 month, with the ability to do 1 level drill down for each app.
• Capacity projection for all key apps.
Head of Infrastructure
• Health of Storage
• Health of Network
• Health of Servers (VMware and Physical)
• Health of VM
Head of Storage
• A higher level, simpler dashboard than Storage Admin
Head of Network
VMware Team
An App Owner
• The infra is providing each of the VMs in my App with the resources it needs
237
237. Designing Super Metric
Leverage existing derived metrics
Leverage objects for which vCenter cannot provide performance data
• Application, Resource Pool, Folder, Location, can now have performance counters
Minimise static alerts.
Know what a good range is for the end result
Build a simple table to avoid super metric sprawl and duplicating existing metrics
• Below is an example, showing 2 Super Metrics.
Name: VM SLA
• Purpose: Shows that a VM gets the resources it wants from infrastructure based on the defined SLA.
• Target Role: VM Owner
• Formula: VM SLA = 100% - Max (CPU, RAM, Disk, Network)
• CPU = CPU Contention %.
• RAM = RAM ballooning %.
• Disk = % above threshold latency. Tier 1 Disk SLA is 10 ms. Tier 2 Disk SLA is 20 ms. Tier 3 Disk SLA is 30 ms.
• Network = Packet Drop %.
• Good Range: >99% (Tier 1 cluster), >97% (Tier 2 cluster), >95% (Tier 3 cluster)
Name: Infra SLA
• Purpose: Show that the underlying infra has the resources for all the VMs on it
• Target Role: VMware Admin
• Formula: Infra SLA = 100% - Max (Host Cluster, Datastore Cluster)
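The VM SLA formula in the table can be tried out with a short sketch. It is an illustration under assumptions, not the actual Super Metric definition: "% above threshold latency" is read here as the share of latency samples above the tier's disk SLA, and the function names and sample values are hypothetical.

```python
# Disk SLA and good range per tier, as given in the table.
DISK_SLA_MS = {1: 10, 2: 20, 3: 30}
GOOD_RANGE_PCT = {1: 99, 2: 97, 3: 95}


def pct_above_threshold(latencies_ms, threshold_ms):
    """Share of samples (in percent) whose latency exceeds the threshold."""
    if not latencies_ms:
        return 0.0
    above = sum(1 for latency in latencies_ms if latency > threshold_ms)
    return 100.0 * above / len(latencies_ms)


def vm_sla(cpu_contention_pct, ballooning_pct, disk_latencies_ms, packet_drop_pct, tier):
    """VM SLA = 100% - Max(CPU, RAM, Disk, Network)."""
    disk_pct = pct_above_threshold(disk_latencies_ms, DISK_SLA_MS[tier])
    return 100.0 - max(cpu_contention_pct, ballooning_pct, disk_pct, packet_drop_pct)


# Made-up Tier 1 VM: one of four latency samples is above 10 ms.
score = vm_sla(
    cpu_contention_pct=0.5,
    ballooning_pct=0.0,
    disk_latencies_ms=[4, 6, 12, 5],
    packet_drop_pct=0.0,
    tier=1,
)
in_good_range = score > GOOD_RANGE_PCT[1]
```

Taking the maximum means the worst component alone decides the score: here disk pulls the VM SLA down to 75% even though CPU, RAM and network are healthy.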
238
238. Custom Heat Map or Cold Map
Heat Map and Cold Map per component
CPU
• Heat Map: Resource pool: size by CPU utilisation
• Cold Map: Least utilised VM: size by vCPU count, color by RAM + CPU usage (a Super Metric)
RAM
• Heat Map: Most RAM intensive VMs, grouped by ESX. Size by RAM utilisation, color by health
Disk
• Heat Map: Most disk intensive VMs, grouped by ESX. Size by disk utilisation, color by health
• Cold Map: Least utilised disk: size by GB, color by % of free
Network
• Heat Map: Most network intensive VMs, grouped by ESX. Size by network utilisation, color by health
• Cold Map: Most idle VMs, grouped by host
Capacity
• Heat Map: VMs with file system that will run out soon. Color by % left, size by GB left.
Health
• Heat Map: VM health, grouped by cluster. Color by health, size by workload.
Design consideration
• Use Super Metric so the info is richer.
• Group VMs by 1 consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore, as 1 datastore can span multiple clusters.
239
240. Choice of Tools
vCenter Operations
• 1-15 minutes accuracy (for other sources)
• 5 minutes accuracy (for vSphere)
• The problem need not be reproducible, but it should last >5 minutes, preferably 15 minutes (3 samples)
vCenter
• 20 – 300 seconds accuracy
• Reproducible performance issue
• Requirements: you already have some idea what causes it
esxtop
• 2 – 20 seconds accuracy. Short burst problem.
• Reproducible performance issue
• Requirements: you already know which ESX & VM has the problem.
vSCSIStat
• Specific for storage, low level analysis
241