vCenter Operations 5: Level 300 training

Singapore, Q2 2012
Iwan ‘e1’ Rahabok, VCAP-DCD

Staff SE, Strategic Accounts

e1@vmware.com | Skype: e1_ang | 9119-9226 | Linkedin.com/in/e1ang

© 2010 VMware Inc. All rights reserved

Document Information
 This deck is part 2 of a series.
• Part 1 is Management in the Virtual World: a technical introduction.
• http://communities.vmware.com/docs/DOC-17841

 This deck has pre-requisites
• Intro video: http://www.youtube.com/watch?v=Z-DJuTiqKag
• VC Ops 5 technical introduction at Vault or Partner Central.
 This deck only covers vCenter Operations (Enterprise + Advanced)
• Focus on concept & ‘under the hood’ to help you understand the product more deeply.
• Does not cover: competitive, installation, configuration
• Does not run through feature after feature.
  • See the official training deck for that at Vault or Partner Central.
• vCenter Operations modules that it does not cover
  • Chargeback
  • Infrastructure Navigator
  • Configuration Manager

This is a very long training material. Use the Section feature to see how it is organised.

 Further reading
• virtual-red-dot.blogspot.com

2

Table of Contents

 Built for vCenter Standard
 Core: Metrics, Threshold, Analytics
 Badges
 Heat Map
 Smart Alert
 Details & Charts
 Capacity Management
 Settings
 VCM integration
 Concepts & Advance Concepts
 Deep dive into Metrics
 Dashboard and Widgets

3

Managing Performance/Capacity in vSphere: the basics

Is it healthy?
• Every VM & ESX performing well? CPU, RAM, Network, Disk?
• Are they behaving expectedly?
• Any fault on any component?

Is it enough?
• Enough CPU, RAM, Network, Disk? Future risk?
• Time remaining?
• Capacity remaining?
• Where are the “Stress points” in time?

Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I put without impacting performance?

4

Direct Mapping by vCenter Operations
 Is it healthy = Health
• Workload
• Anomalies
• Faults
 Is it enough = Risk
• Time remaining
• Capacity remaining
• Stress period
 Is it optimised = Efficiency
• What can we reclaim?
• Density. Key ratios for management

 Daily update at midnight

5

Visibility across vCenters

Sample from ASEAN Lab:
6 vCenters.
A mix of Appliance and Windows.
2 are in Linked Mode (SRM).

7

Performance Troubleshooting: a day in the life…
 You got an email from the app team, saying the main Intranet application was slow.
• The email was 1 hour ago. The email stated that it was slow for 1 hour, and it was ok after that.
• So it was slow between 1-2 hours ago, but ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
• You are not familiar with the applications. You do not know which apps run on each VM as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 300 VM, 20 datastores, 1 midrange array, 10 GE FCoE

Test your vSphere knowledge!
How do you solve/approach this with just vSphere?

What do you do?
 A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE
 B: No sweat, you’re VCDX + CCIE + ITIL Master. You’re born for this.
 C: SMS your wife, “Honey, I’m staying overnight at the datacenter”
 D: Take a blood pressure medicine so it won’t shoot up.
 E: Buy the app team a very nice dinner, and tell them to keep quiet.

8

Performance Troubleshooting: a day in the life…
 The minimum you need to prove
• Performance is not caused by your infrastructure, or at least not by your VMware.
• Infrastructure = VMware + Storage + Network
• Application = VM + App inside the VM

 What you need to prove
• For each of the 10 VM, the following was ok between 1-2 hours ago: CPU, RAM, Disk, network
• To strengthen the above, prove that:
• The shared infrastructure was also healthy: relevant ESX, relevant Datastore
• The overall platform was also healthy.
• No relevant faults that happened 1-2 hours ago.
• Give the list of ports (that the 10 VM use) to network team to ensure the firewall is not dropping them.
 What challenges do you face in vSphere to do the above?
• Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have?
 The ideal you need to prove
• Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that caused it. In other words, application-specific + root-cause analysis

9

Challenge 1: details are lost after 1 hour

10


The following counters are lost:
1. Used
2. System
3. Idle
4. Latency
5. Overlap
6. Demand
7. Wait
8. Run
9. Swap wait
10. Max Limited

11


[Screenshots: Memory Counters and Disk Counters, each shown for <1 hour and >1 hour]

12

Challenge 2: no application awareness

14

Deep understanding of vCenter is required

Here is a common example of why
a deep understanding of vSphere counters makes a huge difference.

Buy more RAM?

17


Yes, buy more RAM.
ESXi has 32 GB RAM.
It is highly used

18


vCenter Ops shows
very different data.
Memory is only 32%.
Plenty of headroom.

What?! It’s been high constantly for the last 24 hours! Better buy more RAM now.

But hang on! This is the ESXi-06 host in the VMware ASEAN lab. We know who uses it.

19


It just saves us from a
costly RAM upgrade
project

20

Live Demo
1 engine, 2 UI.
Dashboard..
Badges.
Configuration

21

Counters and Badges
 A vCenter farm with 500 VM and 50 ESX will have >10000 counters!
• It is not humanly possible to look at them, let alone analyse them.
 vCenter presents raw counters
• e.g. What does Ready Time of 1500 in Real Time chart mean? Is value of 2000 in Real Time chart better than value of 75000 in Daily Chart?
• e.g. Is memory.usage at 90% at ESXi level good or bad?
• e.g. Is IOPS of 300 good or bad for datastore XYZ?
 Single counter can be misleading
• e.g. Low CPU usage does not mean VM is getting the CPU, if there is Limit, Contention and Co-Stop.
• e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical)
 Different counters have different units
• GHz, %, MB, kbps, ops/sec, ms
• This makes analysis even more complex

Derived Counters
• Standardises the scale into 0 - 100. 1 universal unit. Minimises the “translation” in our head.
• Can be >100 if demand is unmet
• Universal. Apply to CPU, RAM, Disk, Net, etc.
• Counters are derived using sophisticated formulas, not just aggregated.
• For the same counter, different objects use different formulas.
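
The Ready Time question above can be made concrete. vCenter reports CPU Ready as a summation in milliseconds over the chart's sampling interval, so the raw value only means something relative to that interval. A minimal sketch, assuming a 20-second interval for the Real Time chart and a 300-second interval for the Daily chart; the function and constant names are illustrative and not part of any product API:

```python
def ready_percent(summation_ms, interval_s):
    """Convert a CPU Ready summation (ms) into a percentage of the sampling interval."""
    return summation_ms * 100.0 / (interval_s * 1000.0)

REAL_TIME_INTERVAL_S = 20   # assumed sampling interval of the Real Time chart
DAILY_INTERVAL_S = 300      # assumed sampling interval of the Daily chart

real_time = ready_percent(2000, REAL_TIME_INTERVAL_S)
daily = ready_percent(75000, DAILY_INTERVAL_S)
print(real_time, daily)
```

Under these assumptions the smaller raw number is also the smaller percentage here, but that is not true in general, which is the point of the slide: the same raw value reads very differently on different charts.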

22

Samples of Derived Metric: Health
 Health Score of an Object = MAX (Abnormal Workload, Faults)
• Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)),
Workload)
• Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
• Fault depends on the object:
Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)

Host = MAX (Hardware Issues, HA Issues)
Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
Network Issues = MAX (Network, DVPort, VMNic)
Network = Max_of_all_instances (Network Device)
DVPort = Max_of_all_instances (DVPort Device)
VMNic = Max_of_all_instances (VMNic Device)
Storage Issues = MAX(Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
Storage = Max_of_all_instances (Storage Device)
SCSI = Max_of_all_instances (SCSI Device)
VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
NFS server = Max_of_all_instances (NFS server Device)
Compute Issues = MAX (Error, PCIe)
CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other
Health, IPMI, BMC)
HA Issues = HA Host Status

VM = MAX (FT Issues, HA Issues)
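
The host part of the fault tree above is a nested MAX, which can be sketched as follows. This is an illustration only: the dictionary keys, the example scores and the treatment of CIM Storage as a list of instances are assumptions made for the sketch, not the product's data model.

```python
def max_of_all_instances(scores):
    """Highest fault score across all device instances (0 if there are none)."""
    return max(scores, default=0)

def host_fault_score(host):
    """Roll a host's fault scores up with nested MAX, mirroring the tree above."""
    network = max(max_of_all_instances(host.get(k, []))
                  for k in ("network", "dvport", "vmnic"))
    storage = max(max_of_all_instances(host.get(k, []))
                  for k in ("storage", "scsi", "vmfs_heartbeat",
                            "nfs_server", "cim_storage"))
    compute = max(host.get("error", 0), host.get("pcie", 0))
    cim = max(host.get("cim", {}).values(), default=0)
    hardware = max(network, storage, compute, cim)
    ha = host.get("ha_host_status", 0)
    return max(hardware, ha)

# Hypothetical host: the worst single fault (temperature) drives the score.
example_host = {
    "vmnic": [0, 50],
    "scsi": [25],
    "cim": {"fan": 0, "temperature": 75},
    "ha_host_status": 0,
}
print(host_fault_score(example_host))
```

Because every level uses MAX, one serious fault anywhere in the tree is enough to drive the host's fault score.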

23

Threshold: a shift in mindset needed
 vCenter sets “static” threshold, which can be misleading
• During peak, it is common for VM to reach high utilisation.
• Static threshold will generate alerts when they should not.
• vSphere admin quickly learns to ignore them, defeating the purpose of alert to begin with.
• During non-peak, it might be abnormal for VM to reach even 50% utilisation.
• Static threshold will not generate alerts when they should have.
 vCenter only sets high threshold
• Do you set static threshold when CPU or RAM utilisation drops below 5%?
• A drop in entire array storage IOPS might be a sign of a terrible day ahead.
• Will not alert when these happen:
  • Utilisation drops from 75% to 1% when it should not.
  • Utilisation changes from 5% to 70% when it should not.
• We need to plot both upper range and lower range
 But each VM differs. And the same VM differs depending on day/time…
• Intelligence is required to analyse each metric and its expected “normal” behaviour.

24

Dynamic threshold & alerts
 vCenter Operations uses dynamic threshold
• It is dynamic and personalised down to individual metric.
• Varies from object to object. 1000 VM will have their own threshold.
• Varies from time to time. The same CPU Usage counter has different thresholds at different times. This caters for peaks. See the chart below.
• Varies from metric to metric. On an ESX with 12 cores, each core has its own CPU Usage threshold.
• You can fix hard thresholds if you need to.
  • This needs Enterprise edition. It comes with no static threshold defined.
  • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html

Notice the range varies
in size
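
To illustrate the idea of a threshold that varies per metric and per time of day, here is a deliberately simple sketch: it learns a band of mean ± k standard deviations for each hour of the day from history. This is not one of vCenter Operations' actual algorithms; the function names, the value of k and the sample data are assumptions made for the illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_bands(samples, k=2.0):
    """Build a (low, high) band per hour of day from (hour, value) history."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: (mean(v) - k * pstdev(v), mean(v) + k * pstdev(v))
            for h, v in by_hour.items()}

def is_anomalous(bands, hour, value):
    """True when the value falls outside the learned band for that hour."""
    low, high = bands[hour]
    return value < low or value > high

# Hypothetical CPU usage history: busy at 9am, quiet at 3am.
history = [(9, v) for v in (80, 82, 78, 81, 79)] + [(3, v) for v in (5, 6, 4, 5, 5)]
bands = hourly_bands(history)

print(is_anomalous(bands, 9, 80))  # busy hour, usual load
print(is_anomalous(bands, 3, 50))  # quiet hour, unusual load
print(is_anomalous(bands, 9, 5))   # busy hour, unusual drop
```

A static 70% threshold would alert on the first case and stay silent on the other two, which is the opposite of what we want.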

25

Dynamic Threshold Analysis
 DT analysis runs nightly
• New dynamic thresholds are computed for each metric
 Data categorization
• Tries to identify stat as linear, multinomial, step function, etc
• If one of those matches, that DT function is used
 Otherwise: competition
• Sigma: assumes hourly cycles
• CCPD: tries to find normal cycles
• ACPD: tries to find abnormal cycles
• Winner is assigned based on metric trending accuracy
 The same metric may get different DT function on different day

[Diagram boxes, for each metric: Data Categorization; Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT, Quantile Sigma DT; CCPD; ACPD; DT Scoring; Dynamic Thresholds]

26

Dynamic Threshold: Algorithm

[Equation: the joint probability density $f_{P_{1,1},P_{1,2},\ldots,P_{m,m}}(p_{1,1},p_{1,2},\ldots,p_{m,m})$ of the probabilities $p_{i,j}$, expressed with Gamma functions. The parameter symbols did not survive extraction.]

where $\sum_{i=1}^{m-1}\sum_{j=1}^{m-1} p_{i,j} + \sum_{i=m,\,j=1}^{m} p_{i,j} = 1$, $0 \le p_{i,j} \le 1$, and $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$.

The marginal distribution of the $i$th row of J is a Dirichlet distribution, with one parameter set for rows $i = 1, \ldots, m-1$ and another for row $i = m$.

It is pretty difficult for a human to beat the computer in analysis of the data.
The above is one of the many algorithms applied by vCenter Operations.

27

Analytics

7 different analytics areas.
For DT feature, there are 8
algorithms.

Only in
Enterprise Edition

These advanced
features create
Smart Alerts.

28

Discussion Point

Raw Counters vs Derived Counters
Dynamic Threshold vs Static Threshold

29

Badge – Health
 Answer complex questions like:
• How is the entire virtual data center doing? What’s the
degree of their health?
• For every cluster, host, datastore, what’s their health?
 Health is a current Operational State.
• It represents what is wrong now that should be
addressed within 1 day. Thus Health needs to be scored
such that if it is red, then it really needs attention.

 Weather Map
• Simple way to check that entire farm is healthy
• For child object, it is replaced with Health Trend
• Shows Health of all parent and child objects
• Each square can be VM, ESX, datastore, cluster, datacenter,
vCenter.

Value      Explanation
75 – 100   Normal behaviour
50 – 75    The object experiences some problems.
25 – 50    The object might have serious problems. Check and take action as soon as possible.
0 – 25     The object is either not functioning properly or will stop functioning soon.

30

Badge – Workload
• For every object, how is Demand vs Supply?
• For every single VM, is CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?

 Workload is not utilisation or usage
• More accurate than utilisation as it takes into account many factors, not just utilisation.

 Workload = (Demand/Entitlement)
• Entitlement is dynamic. Affected by shares, limit, etc.
• Demand ≠ Usage.
  • Usage may mean passive usage. E.g. the RAM page is there but no write/read.
• Score is Max (CPU, RAM, Disk IO, Net IO)
  • To bring up the attention

Value      Explanation
0 – 80     Workload is not high.
80 – 90    The object is experiencing some high resource workloads.
90 – 95    Workload on the object is approaching its capacity in ≥1 area.
>95        Workload on the object is at or over its capacity in ≥1 area.
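
The Workload formula and table above can be sketched in a few lines. The resource names, units and sample numbers are hypothetical, and the treatment of the band edges (80, 90 and 95 belong to the higher band) is an assumption, since the table does not say which side they fall on.

```python
def workload_score(demand, entitlement):
    """Return (score, per-resource scores); the score is the max across resources."""
    per_resource = {r: 100.0 * demand[r] / entitlement[r] for r in demand}
    return max(per_resource.values()), per_resource

def workload_band(score):
    """Map a Workload score to the explanation in the table above."""
    if score > 95:
        return "at or over capacity in >=1 area"
    if score >= 90:
        return "approaching capacity in >=1 area"
    if score >= 80:
        return "some high resource workloads"
    return "not high"

# Hypothetical VM: CPU in MHz, RAM in MB, disk and network in arbitrary IO units.
demand = {"cpu": 1200, "ram": 4096, "disk_io": 50, "net_io": 10}
entitlement = {"cpu": 2000, "ram": 4096, "disk_io": 200, "net_io": 1000}

score, detail = workload_score(demand, entitlement)
print(score, workload_band(score), detail)
```

Taking the maximum rather than the average is what "brings up the attention": the VM above is only RAM bound, yet its Workload score is 100.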

31

Derived Metric: Demand

The chart below shows Demand in action.
I generated IOPS on a local datastore,
resulting in a spike in latency (read latency went
up from 3 ms to 60 ms).
Demand correspondingly went up from 4 to 100!

32

Badge – Anomalies
• Is our vDC doing business as usual today? Or is it a dynamic environment with lots of unexpected changes?
• Which VMs, ESX, cluster, datastore, etc are behaving abnormally?
• …. and exactly which counters are the culprits?
 Identifying metric abnormalities
• It needs to learn the dynamic range of “Normal” for each metric, so give it >3 cycles per metric.
• A month-end job means it needs 3 months.
• Normal range changes after configuration or application changes.
 Anomalies score
• A high number of anomalies:
  • Usually an indication of a problem
  • Demand change
  • Application team changes code/app
• KPI metrics impact the Anomalies score more than non-KPI metrics.

Value      Explanation
0 – 50     Normal Anomaly range
50 – 75    The score exceeds the normal range.
75 – 90    The score is very high.
> 90       Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.

33

This virtual DC spans multiple vCenters.
vCenter Ops shows all the counters that
are behaving abnormally.

34

Badge – Faults
• What faults do we experience in our vDC?
• For every object, what faults does it have?
 Specific knowledge of which vCenter Events
• Which events affect Availability and Performance of which object?
• Pulled from active vCenter events
• Example:
  • Loss of redundancy in NICs or HBAs
  • Memory checksum errors
  • HA failover problems
• Each fault has a default score (e.g. 25, 50, 75, 100)
• Highest individual Fault Score drives the object’s Fault Score
 Best Practices:
• Do not change the Faults Threshold
• Use Alerts View to manage Faults. Filter it to just show Fault.

Value      Explanation
0 – 25     No fault is registered on the object.
25 – 50    Faults of low importance happen on the object.
50 – 75    Faults of high importance happen on the object.
> 75       Faults of critical importance happen on the object.

35

Badge – Risk
• Do we have risk from performance and capacity in our vDC? If yes, where are they and can you quantify the seriousness?
• Which objects are at risk? What is the specific risk?

 Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
 Risk is an early warning system.
• Identifies potential problems that could eventually hurt the performance
• The Risk Chart shows Risk score over the last 7 days, giving a view of the trend.

Value      Explanation
0 – 50     No problems are expected in the future.
50 – 75    There is a low chance of future problems or a potential problem might occur in the far future.
75 – 100   There is a chance of a more serious problem or a problem might occur in the medium-term future.
100        The chances of a serious future problem are high or a problem might occur in the near future.

36

Badge – Time Remaining
• How much time do we have before we need to buy more server, storage, network before performance starts to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much time do we have?

 Measures time remaining before each resource type reaches its capacity
• CPU
• Memory
• Disk (IOPS & Space)
• Network I/O
 Early warning of upcoming provisioning needs
• Based on Score Provisioning (SP) buffer. Default value is 30 days.
• Set in “Capacity & Time Remaining” section

Value      Time remaining
50 – 100   > 2x SP Buffer (60 days)
25 – 50    < 2x SP Buffer
<25        Near SP Buffer
0          < SP Buffer (30 days)

37

Badge – Capacity Remaining
• How many more VMs can we put without impacting performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?

 Early warning system
• A low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object

 Percentage of Total VM “Slots” Remaining
• Based on the average size of the VM on the object (e.g. VM profile)
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.

 From the table, notice value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity but around 40 days for time.

[Screenshot caption: 333 more VMs correlates to 77% Capacity Remaining for this object]

Value      Capacity remaining
>10        >120 days
5 – 10     60 – 120 days
0 – 5      30 – 60 days
0          <30 days

38

Capacity Remaining Calculation

 Determine Capacity Constraint Resource
 Deployed or Powered On VMs
• Powered Off VMs only use disk space resources
• Powered On VMs use ALL of the 4 resources
 Calculation Example Shown:
• Limiting Resource is Disk Space with 333 VMs
available
• Use the Deployed VM number of 99 to do the
calculation for percentage space remaining
• Determine Capacity Remaining
• 333 / (333 + 99) = 77%
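
The calculation above is small enough to check by hand, and a sketch makes the two steps explicit: pick the constraining resource, then turn its remaining VM slots into a percentage. Only the 333 and 99 come from the slide; the slot counts for the other resources are made-up numbers for illustration.

```python
def limiting_resource(slots_remaining):
    """The resource with the fewest VM slots left constrains capacity."""
    return min(slots_remaining, key=slots_remaining.get)

def capacity_remaining_pct(vms_remaining, vms_deployed):
    """Percentage of total VM slots still free on the object."""
    return 100.0 * vms_remaining / (vms_remaining + vms_deployed)

# Disk Space (333) is from the slide; the other slot counts are hypothetical.
slots = {"cpu": 520, "memory": 410, "disk_space": 333, "network_io": 900}

constraint = limiting_resource(slots)
pct = capacity_remaining_pct(slots[constraint], vms_deployed=99)
print(constraint, round(pct))
```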

39

Capacity and Time details

 You can drill down to see details
• You can check the 9 components, as
shown on the right.
• This helps answer the question which
components have how many days or
VM left!
• Summary = Min (all 9 components)

40

Badge – Stress

• In our vDC, do we have stress points or periods? How bad is it?
• For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
 Measures long-term or chronic workload (6 weeks)
• Chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
• Workloads > 70% = “Stressed”
• Threshold is configurable as per screenshot below

Value      Explanation
0 – 1      Normal score. No action needed.
1 – 5      Some of the object resources are not enough to meet the demands.
5 – 30     The object is experiencing regular resource shortage.
>30        Most of the resources on the object are constantly insufficient. The object might stop functioning properly.

41

Stress Calculation

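[Chart: Workload Line plotted over 6 weeks on a 0 to 100 scale. The Stress Zone lies above the Stress Line at 70; the workload area inside the Stress Zone is labelled 12%.]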
 Stress Score is a % and is based on area of Workload Above “Stress Line” Threshold
compared to the Total Capacity of the object
• Stress Score = (Stress area / Stress Zone) *100
• But max value can be > 100% as the workload can be >100.
 Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12
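
A minimal sketch of the Stress Score formula, assuming equally spaced workload samples and assuming the Stress Zone is the band between the Stress Line and 100 across the whole period. The function name and the sample series are illustrative.

```python
def stress_score(workloads, stress_line=70.0, capacity=100.0):
    """Area of workload above the stress line, as a % of the stress zone."""
    stress_area = sum(max(w - stress_line, 0.0) for w in workloads)
    stress_zone = (capacity - stress_line) * len(workloads)
    return 100.0 * stress_area / stress_zone

# Hypothetical series: 88 samples at the stress line, 12 samples at full capacity.
samples = [70.0] * 88 + [100.0] * 12
print(stress_score(samples))

# Workload above 100 (unmet demand) pushes the score past 100.
print(stress_score([200.0] * 10) > 100)
```

With this definition a workload that sits exactly on the Stress Line contributes nothing, so the score reflects only the time and depth spent above it.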

42

Badge – Efficiency
• Are there optimization opportunities in our vDC?
• How well do we do in terms of VM provisioning? Do we get them right?
 Efficiency Score factors
• Reclaimable waste
• Density ratio
 Graph Depicts VMs by Percent
• Optimal – Optimally Provisioned VMs
• Waste – Over Provisioned VMs
• Stress – Under Provisioned VMs
  • Not used in Efficiency Calculation (see Risk)
 Three Resources Considered
• CPU
• Memory
• Disk Space
 Note: VMs can appear in Stress and Waste

Value      Explanation
>25        The efficiency is good. The resource use on the selected object is optimal.
10 – 25    The efficiency is good, but can be improved. Some resources are not fully used.
0 – 10     The resources on the selected object are not used in the most optimal way.
0          The efficiency is bad. Many resources are wasted.

43

Badge – Reclaimable Waste
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we reclaim?

 It identifies the amount of reclaimable resources
• CPU
• Memory
• Disk
 Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• Disk calculation can also include old snapshots and templates

Value      Explanation
0 – 50     No resources are wasted on the selected object.
50 – 75    Some resources can be used better.
75 – 100   Many resources are underused.
100        Most of the resources on the selected object are wasted.
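
The two formulas above combine as follows. This is a sketch with hypothetical capacities (vCPUs, GB of RAM, GB of disk); the function name is illustrative.

```python
def waste_score(reclaimable, deployed):
    """Return (score, per-resource %); the score is the max across resources."""
    per_resource = {r: 100.0 * reclaimable[r] / deployed[r] for r in deployed}
    return max(per_resource.values()), per_resource

# Hypothetical cluster: 10 of 40 vCPUs, 64 of 128 GB RAM, 500 of 2000 GB disk reclaimable.
reclaimable = {"cpu": 10, "ram": 64, "disk_space": 500}
deployed = {"cpu": 40, "ram": 128, "disk_space": 2000}

score, detail = waste_score(reclaimable, deployed)
print(score, detail)
```

As with Workload, using the maximum means the most over-provisioned resource sets the badge.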

44

Badge – Density

• How high can we push our consolidation
ratio before we experience performance
problem?
• Now that’s a million dollar question!
• For every datacenter, cluster, ESXi, what
are our key ratios and how much head
room do we have?
 Contrasts Actual vs Ideal Density
• Identify Optimal Resource Deployment
Before Contention Occurs
• Ideal is based on demand, not simple
configuration.
• High Density is good. 100 is not too high.

Value Explanation

>25 Good consolidation

10 – 25 Some resources are not fully consolidated

0 – 10 The consolidation for many resources is low

0 The resource consolidation is extremely low.

45

Badge Thresholds

There are 2 different thresholds:
VM and Infra (ESXi, Cluster,
Datastore, etc)

Notice that Major badge has
different threshold to its minor
badges

Even “similar” badges have
different threshold. Notice Time
remaining and Capacity
remaining have very different
thresholds.

Disable Color Threshold by
Clicking the Level Off

46

Using badges together
 Workload High & Anomalies Low & Stress High: Add resources
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Normal behavior for this timeframe
• Stress – Object is often running under high Workload.
 Workload High & Anomalies Low & Stress Low: Not likely a big problem… a cyclical workload spike?
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Normal behavior for this timeframe
• Stress – Object usually has enough resources
 Workload High & Anomalies High: Something is amiss! Immediate attention.
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Abnormal behavior for this timeframe
 If there are Alert and Fault too, then it is a sign of a major issue
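
The three combinations above amount to a small decision table. A sketch, with the function name and the wording of the fallback case being assumptions (the slide only covers the cases where Workload is high):

```python
def interpret_badges(workload_high, anomalies_high, stress_high):
    """Map badge states to the reading given on the slide."""
    if not workload_high:
        return "not covered by the slide"
    if anomalies_high:
        return "Something is amiss! Immediate attention."
    if stress_high:
        return "Add resources"
    return "Not likely a big problem: a cyclical workload spike?"

print(interpret_badges(True, False, True))
print(interpret_badges(True, False, False))
print(interpret_badges(True, True, False))
```

Note that Anomalies is checked before Stress: when behaviour is abnormal, the slide asks for immediate attention regardless of the Stress value.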

47

Discussion Point

Is Badge the way to go?
Are these the right 11 badges?
What other badges do you need?

48

Heat Map

 Built-in heat maps
• Basic:
  • Storage: space, IO
  • CPU
  • RAM
  • Network
• Advanced (or composite)
  • Health
  • Workload
  • Capacity
 Custom heat map or cold map
• Since we can change the color, we can actually create a cold map.
• In a cold map, the bigger the size, the colder it is (less utilised it is). The bluer it is, the less utilised it is.
• Hence it focuses on Waste

A great way to show a lot of information on 1 screen.
Heat map can quickly highlight information, as it can present relative information. It is good for relative comparison among VMs.

Heat map is a 2 dimensional chart. So it takes 2 parameters. You cannot choose >2 data. For example, you cannot show the following at the same time:
• IOPS, Latency and Throughput. Also, these 3 have different units so it’s hard to combine using Super Metric.
• ESX, VM and Datastore.

49

Storage: Datastore + VM vs workload + latency

 Since all the datastores are on the same array, how do we quickly tell the relative
workload generated by every one of them?
• This answers: which datastores are heavily loaded?
 For each of these datastores, how do we know the relative workload generated by
the VM?
• This answers: which VMs dominate within a datastore?
 For every VM, how do we know its performance is reasonable?
• This answers: which VM has a storage bottleneck?
 How do we show all the above data in one page, without the need to show a lot of
numbers?
• And we still want to be able to drill down to each VM and datastore.

50

Each square is a VM. They are grouped by datastore.
Bigger square: bigger throughput
Color: latency.

51

Storage: Throughput vs Latency at cluster level

 Which cluster is generating high storage workload?
 Are they getting the SLA they ask? What’s the latency? The cluster owner wants to
know that his entire cluster is getting <10 ms latency.
 We expect these X, Y, Z clusters to be doing little work. Can we prove this?

Basically, the same concept from
previous slide, but looking from cluster
point of view as Cluster & Datastore has
a Many-to-Many relationship.

52

Storage: Throughput vs Latency at cluster level

53

Storage: Throughput vs Latency at host level

54

Storage: Throughput vs Latency at VM level

Can we show at VM level now?
That’s why you need a 24” monitor.

55

Storage: Space vs Latency

 Any big VM that is not getting the SLA we agreed on?

56

Storage: Datastore space contention

 Do we have space contention at any of the datastore? If yes, how bad is the
contention?
• While we use thick provision at vSphere level (and thin at array level), we still have risk of space
from snapshots, vRAM increase, new VM, new vDisk, storage vMotion, storage DRS, etc.
 Are the datastores uniformly sized?

57

Storage: Space contention

 We use thin provisioning

58

CPU: Contention vs Usage at cluster level

 Which clusters are doing the most work? Which are not doing much?
 How is the CPU workload on every cluster?
 For each of those clusters, can we see if there is CPU contention?

59

CPU: Contention vs Usage at host level

 Same questions as the previous slide, but for host.
 We can expect some “drill down” in this heat map

60

CPU: Contention vs Usage at VM level

Can we show at VM level now?
That’s why you need a 24” full HD
monitor.

61

VM Health

 Current Health
• Are all the VMs healthy? Especially those VMs which have high workload!
• Which VMs are experiencing problems?
• Are more demanding VMs less healthy?
• Can we see this by cluster? By host?
 Future Health
• Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and
network for every single VM!
• For those VMs which are not ok, can we be specific on which value will run out first? Can we
“drill down” to individual VM?

62

VM: color by health, size by workload

63

VM: color by capacity, size by workload
 This now shows the future projection. We can see that the VM vCenter 5 is red: its capacity will run out within 30 days. So we click on it to drill down.

64

Drill down to specific VM
 Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.
 We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the
physical world as it’s fixed, but unwise in virtual world.

65

Drill down to specific VM

 Showing value in absolute terms is good, but can be confusing. vCenter Ops can also
show in %

66

Discussion Point

Which heat maps are useful for you?
What other heat maps or cold maps do you need?

67

Smart Alert vs Normal Alert
 Smart Alert
• Relies on the advanced analytics instead of simple raw counters.
• Not static, as it is based on Dynamic Threshold
• Examples:
• Early warning alerts: use total anomalies to predict when a problem is happening, sometimes before users are impacted
• KPI predictive: prediction that a KPI might soon go abnormal due to an event occurring that has preceded the KPI going
abnormal on previous occasions
• Fingerprint: set of metric anomalies matches previously seen problem (and associated resolution)

 Comparison
Advanced Edition | Enterprise Edition
Provides alerts on Minor Badges. E.g. Workload YES, Health NO | Provides alerts on any counter (raw, badge, super metric)
Can only do infrastructure-level alerts | Can do application-level alerts
Good for alerts on single objects (e.g. VM) | Good for single or multiple objects
Driven by the badge’s changing color | Driven by threshold anomaly breaches and KPI threshold breaches
Not customisable | Highly customisable
Cannot do alerts at Resource Pool or Folder | Can do it

68

Application-level smart alert

 Needs Enterprise edition.

69

Alert

 When does Alert happen
• When a badge changes color
• When a fault happens
• VC Ops own alert
• A component in VC Ops itself has failed.
• VC Ops cannot get data
 Can do SNMP and SMTP
• Both are set on the Administration Web page. The URL format is https://VM-IP/admin/

70

Advanced edition: Alert main window

 Filter by the 11 badges
 Filter the VC Ops own alert: system or environment

71

Enterprise edition: Alert main window
 New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, Classic (static)
 We can also color the row by criticality, and specify period (start – end)

73

Enterprise edition: alert detail

74

Email Notification Rules

76

Email Notification Rules

77

Anomalies – Symptoms Window
 Example of an ESXi Anomalies symptom window, taken from an ESXi host with 11 VMs.
• It shows 3 resource type: VM, Datastore, Host System
• The VM resource kind has 7 metric groups with anomaly.
 The VM resource kind (30 out of 71 Symptom)
• 71 – Total number of Symptoms under VM object
• We’re reporting on an ESX here, and VM is a child of host. So all children
metrics are included.
• The metric group comes from the vSphere adapter + VC Ops own.
• 30 – Total number of Displayed Symptoms
• Based on the limit of 5 metrics shown for each Metric Group
• The metric group (CPU Usage, network, Summary, etc) are specified by the
adapter

• Subcategory Network (3 of 11)
• 11 – The total number of VMs associated with this ESX. This is not the
number of symptoms.
• 3 – The total number of VMs that have one or more Network symptoms.

Metrics will not be identical among VMs, although most will be similar.
A multi-vCPU VM will have more vCPU metrics than a 1 vCPU VM.
Different VMs will have different anomalies, as they have different workloads.

78

vCenter Operations presents
datastore with all the details

79

Storage in vCenter Operations

Automatic learning of storage
performance.
Calculating both Demand and
Normal rate.

80

vSphere 5 Performance Chart (fat client)

Can only choose 1 component at a time,
e.g. cannot show CPU and RAM at the same time.

81

vSphere 5 Performance Chart (fat client)
Can only show 1 chart at a time.
Hence can only show 2 units at a time.

82

vCenter Operation charts

Can show >1 charts at a time. Can combine/split charts.
Can show different data type from different objects.
Line is color coded, showing when threshold is breached.

83

Capacity Management in vSphere is hard

CPU Optimizations: vSMP, Shares, Reservations, Limits
Memory Optimizations: Transparent Page Sharing, Memory Ballooning, Memory Compression
Storage Optimizations: Thin Provisioning, Linked-Clones
Clusters: DRS, HA, FT, vMotion, Storage vMotion
Workload Flux: VMs growing/shrinking, added/removed

[Diagram labels: Usable Capacity, Reserved Capacity, Used Capacity, Remaining Capacity (?). vSphere: 36 days remaining]

84

Capacity Management

Analyze
 What are my historical utilization trends?
 What resources have been requested vs. needed?
 How many more VMs will fit in my current farm?

Optimize
 How can I use my resources more efficiently?
 What VMs should be right-sized?
 Can I reclaim over-provisioned or unused capacity?

Forecast
 When will I run out of capacity?
 What if I add, remove, reconfigure capacity?
 Can I defer infrastructure investments?

85

Understanding Behavior
 Need to understand the weekly pattern
• Business week
• Weekend
• E.g. workload spike at 9am on Mondays
 Accomplish through roll-ups
• Roll up the weeks in a month to compute the typical week for the month
• Roll up the typical week of each month into a typical week for the quarter
 Differs from performance management roll-ups
• Older performance data gets less granular, so vCenter loses accuracy
• Older capacity data maintains its granularity
[Diagram: roll-up hierarchy of Month 1, Month 2, Month 3 into Quarter 1, and quarters into Year 1]
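
The roll-up idea above can be sketched in a few lines of Python. This is only an illustration, not how VC Ops implements it: the hourly sampling, the plain averaging and the function names `typical_week` and `typical_week_for_quarter` are all assumptions.

```python
from statistics import mean

HOURS_PER_WEEK = 7 * 24  # assumption: one sample per hour, week starts Monday 00:00


def typical_week(weeks):
    """Roll up several weeks of hourly samples into one typical week.

    Each hour-of-week slot is averaged across the weeks, so a spike that
    recurs at the same time every week (e.g. Monday 9am) is preserved.
    """
    for week in weeks:
        if len(week) != HOURS_PER_WEEK:
            raise ValueError("each week needs exactly 168 hourly samples")
    return [mean(slot) for slot in zip(*weeks)]


def typical_week_for_quarter(months):
    """Roll up the typical week of each month into one typical week for the quarter."""
    return typical_week([typical_week(weeks) for weeks in months])


# Hypothetical data: 10% CPU all week, except a 90% spike on Monday at 9am.
MONDAY_9AM = 9
week = [10.0] * HOURS_PER_WEEK
week[MONDAY_9AM] = 90.0
quarter = typical_week_for_quarter([[week] * 4, [week] * 4, [week] * 4])
print(quarter[MONDAY_9AM], quarter[MONDAY_9AM + 1])
```

Because the roll-up keeps one value per hour of the week, the Monday 9am spike is still visible in the quarterly view. Averaging the whole week into a single number would hide it.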

86

Planning > Summary > Export

90

Planning > Summary > Resources

91


92


93

What-if

 Visualise
• Add or remove VMs.
  • Add based on existing VMs as profiles
  • Add based on a spec you supply
• Add, remove, or update hosts.
  • Modify CPU and RAM only. No Network.
• Add, remove, or update datastores.
  • Update means increase or decrease size.
  • No IOPS yet.
 At a cluster level or host level
• Cannot be done at datacenter or higher level
• Host level does not make sense when the host has HA & DRS turned on
 You can add multiple what-if scenarios
• You can combine them or compare them on the same chart
• You cannot save them. Changes are lost upon log-off.
• You can export the scenario results to an Adobe PDF or CSV file.

94

Average VM Capacity (trend view)

96

Modeling a what-if scenario

Change Supply: Change Host/Datastore
Change Demand: Change VM
• Based on existing VMs
• New VM spec

99


100

Modeling a what-if scenario – Specifying VM Configuration

101

Modeling a what-if scenario – Using Existing VMs

Columns you can see

102


103


104

Modeling a what-if scenario – Changing hosts

105

Modeling a what-if scenario – Changing datastores

106


107

[Chart annotations: capacity state today; VM count capacity; current capacity cross-over point; actual VMs deployed]

109

Common VM distribution

110

Reclaim waste capacity

113

VMs can appear in Stress and Waste at the Same Time

Undersized for CPU

Oversized for Memory

114

Powered-Off VM and Idle VM: setting

115

Capacity Planning: Is the VM really sized properly?

 Setting a threshold of under-utilisation alone is not enough

We need to calculate the degree of under-utilisation.

117

Oversized VM & Undersized VM

118

Oversized VMs - Calculation

Same concept applies to undersize.
Same concept applies to idle VM.

119

Planning > Summary tab

Planning > Views tab

120

Tips

 Number of intervals and data points used for analysis
• Tie them to your business cycles.
• Pick the correct number of data points and the interval type to represent a typical business cycle.
• Match the number of intervals used for the trend view with the number of data points used for forecasting
• Stay with the default forecasting algorithm settings
 Leverage buffer settings to accommodate unforeseen usage spikes or future business growth.
• VC Ops 5 does not yet have a “future incoming VM” concept
 Leverage business hours to eliminate off-peak usage
 Don’t be afraid to play with global settings
• They are just knobs used for data analysis
• Raw data is not modified when global settings are changed

121

Change Events Correlated with Performance

 Overview
• Integration between vCM and vC Ops Mgr for change events
• Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs
• Launch in context into vCM to see full details of changes and potentially remediate them
 Benefits
• Enable Operations to quickly understand and resolve performance issues arising from configuration changes (reduce MTTR)
• Drive efficient & effective troubleshooting by correlating Guest OS configuration changes with VM performance degradations
• In larger enterprises, help bridge the gap between VMware Admin and Guest OS Admin

122

VCM Events in vC Ops – Event Collected

 vC Ops does not pull in every event from vCenter
• Only events that could affect health or workload (vSphere Knowledge!)
 Adapter only pulls in change events for Guest OSs
• No ESX/i Host configuration changes (these come from the vCenter Adapter)
• Guest OS has to be managed by VCM
 Events collected
• Reboot
• Software Install/Uninstall
• Windows Registry
• IP/Networking changes
• Device Driver changes
• Memory/CPU changes
• Windows Firewall
• Patches

123

Event Types in vC Ops Mgr

 Circle Events are vCM Initiated
• Change log in vCM updated when change is completed
• Time = Occurred time
 Diamond events are non-vCM-initiated
• Change log in vCM updated when vCM collects from VM
• Time = Collected time
 Always Blue Events – “Might” have minimal impact
 vCM events for VMs follow the normal vC Ops display rules
• vCM Events appear for the VM Object itself
• vCM Events appear on an ESX host if you enable Child Events

124

vCM Change Events Correlated with Performance

 A pop-up for a vCM event related to uninstalling a piece of software on the VM
in question

127

vCM Change Events Correlated with Performance

128

Terms
 The terms Attribute, Metric and Counter mean roughly the same thing.
• CPU Ready Time is an attribute.
• CPU Ready Time from the VM ABC123 is a metric.
• vSphere uses the word Counter. VC Ops uses Attribute and Metric.
• As there are many attributes, they are grouped together. This group is called an Attribute Package.
 A Resource provides the Metrics.
• Examples of resources: host, VM, datastore, cluster, etc.
• So a resource provides many attributes.
• Resources are pulled via an Adapter.
 Kind
• In VC Ops, there are many kinds of resources. So there is a term, Resource Kind, that you need to get used to.
• VC Ops uses different adapters to talk to different sources, with 1 type of adapter per source. So there is a term, Adapter Kind.
 Advanced terms
• Container, Super Metric, Application, Tier, KPI
[Diagram: an Adapter pulls several Resources; each Resource provides Attributes]
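
A small Python sketch may help fix the vocabulary. It is not the VC Ops data model: the class names, the fields and the second attribute name are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    """An attribute observed on one specific resource."""
    resource_name: str
    attribute: str

    def label(self):
        return f"{self.attribute} from {self.resource_name}"


@dataclass
class Resource:
    """One monitored object, pulled in through an adapter."""
    name: str
    resource_kind: str   # e.g. "VM", "Host", "Datastore"
    adapter_kind: str    # e.g. "VMware Adapter"
    attributes: list = field(default_factory=list)

    def metrics(self):
        # An attribute becomes a metric once it is tied to a concrete resource.
        return [Metric(self.name, attribute) for attribute in self.attributes]


# "Memory Balloon" is a hypothetical attribute name used only for this example.
vm = Resource("ABC123", "VM", "VMware Adapter", ["CPU Ready Time", "Memory Balloon"])
for metric in vm.metrics():
    print(metric.label())
```

The point of the sketch: "CPU Ready Time" on its own is an attribute, while "CPU Ready Time from ABC123" is a metric.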

129

Adapter, Resource, Attribute, Package

VC Ops Adapter       Source of data
VMware Adapter       vSphere 5
VCM Adapter          VCM 5.4
VC Ops Adapter       VC Ops 5
Container Adapter    -

Adapter Kind = adapter type. VMware Adapter is an example of an Adapter Kind.
1 Adapter Kind can pull many kinds of objects from the source. Each of these is called a Resource Kind.
To make management of attributes easier, they are put into a Package. Inside a package, metrics are grouped for ease of use.
Container Adapter is not actually an adapter. It’s a group or container that can hold other objects.
[Screenshot callout: This is the actual Resource Kind brought by VMware Adapter]

130

Actual Resource Kinds
 Sample adapters with their associated resource kinds.

• This is a special & built-in adapter. It monitors VC Ops itself! VC Ops is just an application, which also needs monitoring.
• This is another special & built-in “adapter”. Technically, it is not an adapter, as it’s just a container.

131

vSphere resource kinds

 Unlike the Advanced edition, we can utilise Folder and Resource Pool
• This means you can create a Super Metric at this level.
• This complements vCenter.

Not used?

ESX Host

Not used?

No vApp, no Datastore Group, no vDS as of VC Ops 5.

132

Resource Kind: default settings

133

Attribute & Attribute Package
 Package
• A collection of Attributes from 1 Resource with the same collection interval. That’s all!
• Need to map it to objects
• Super Metric must be placed into a package
• A package cannot come from multiple resources. See screen below.
• Cannot create a package that has both VM and ESXi
• There is a default package called All Attributes.

134

Editing a resource property

139

Resource Kind: Tags

What’s the difference between Applications and Application? It looks like Application is from the Container adapter, which is built-in.

Maintenance schedule contains the time a particular object is on scheduled downtime. It tells VC Ops to ignore the object during that time. Otherwise VC Ops would raise an alert, as the behaviour is unexpected. It would think the health has dropped!
So in this screen, ignore the maintenance schedule, as it should not be part of Resource Kind.

The range for Health. This is not the same as the Health badge in VC Ops Advanced, as this one is universal and applies beyond vSphere. Health in the Advanced edition includes Fault, which is vSphere specific.

Tier is a special container. Again, this is universal, so name your tiers properly to avoid renaming them later on.

Only 1 value here. This means the entire VC Ops.

141

Resource Kind: Tags

 You can control which resource kinds
are shown
• In the picture below, ESX was hidden.

142

Drag selected objects to the tag value

144

VC Ops generated metrics

146

Monitoring the big workload

 You have convinced your CIO to virtualise the remaining 50% of the servers.
 Your CIO needs you to prove, supported by performance charts, that the platform has
served every VM well, meeting the SLA in the past 1 quarter.
• Tier 1 cluster SLA: 2% CPU Ready, 0 RAM Ballooning, 10 ms disk latency, 0 dropped packets.
• Tier 2 cluster SLA: 4% CPU Ready, 5% RAM Ballooning, 20 ms disk latency, 0 dropped packets.
• Tier 3 cluster SLA: 6% CPU Ready, 10% RAM Ballooning, 30 ms disk latency, 0 dropped packets.
 You have 500 VMs on 50 ESXi hosts, 8 clusters, 40 datastores, 5 RDMs.
 You must prove that:
• Not a single Tier 1 VM has >2% CPU Ready in the past 1 quarter. The underlying ESXi also has <2% CPU contention.
• Not a single Tier 1 VM has >10 ms disk latency in the past 1 quarter. The underlying ESXi also has <10 ms disk latency.
• Etc., for each Tier and each component (CPU, RAM, Disk, Net)

What kind of charts do you need to show?
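
One way to think about the proof is as a check of every sample against the tier limits. The Python sketch below shows the idea under stated assumptions: the limits are copied from the SLA bullets above, while the metric key names, the sample layout and the function `sla_breaches` are hypothetical.

```python
# SLA limits per tier, taken from the slide. The key names are hypothetical.
SLA_LIMITS = {
    1: {"cpu_ready_pct": 2, "ram_balloon_pct": 0, "disk_latency_ms": 10, "dropped_packets": 0},
    2: {"cpu_ready_pct": 4, "ram_balloon_pct": 5, "disk_latency_ms": 20, "dropped_packets": 0},
    3: {"cpu_ready_pct": 6, "ram_balloon_pct": 10, "disk_latency_ms": 30, "dropped_packets": 0},
}


def sla_breaches(tier, samples):
    """Return (timestamp, metric, value) for every sample above the tier limit.

    samples is a list of (timestamp, {metric_name: value}) tuples for one VM.
    """
    limits = SLA_LIMITS[tier]
    breaches = []
    for timestamp, metrics in samples:
        for name, value in metrics.items():
            if name in limits and value > limits[name]:
                breaches.append((timestamp, name, value))
    return breaches


# Hypothetical samples for one VM.
samples = [
    ("2012-04-02 09:00", {"cpu_ready_pct": 1.2, "disk_latency_ms": 8}),
    ("2012-04-02 09:05", {"cpu_ready_pct": 3.1, "disk_latency_ms": 9}),
]
print(sla_breaches(1, samples))
```

Note that "not a single VM" is a statement about peaks, so a chart of averages across VMs or across time cannot prove it. A single sample above the limit, like the second one here, is a breach for Tier 1 even though the average CPU Ready is still within the limit.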

147

Super Metric: Functions
 2 types:
• Looping functions: take multiple input values
  • Average, sum, min, max, count, combine, etc.
  • More practical or useful than single functions
• Single functions: take 1 value
  • Absolute, round up, round down, square root, etc.
 The xxxN functions do not work on just the immediate children. They look down (or up) the number of levels specified in the formula.
• A ‘2’ tells the function to look down two levels for the metric.
• Putting -2 means look up.

149

Super Metric: hierarchy

 Example: super metric for Average CPU usage of a cluster

VM is 2 levels down from cluster.
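
The level argument of the xxxN functions can be pictured as a walk through the resource hierarchy. The sketch below is plain Python, not super metric syntax. The `Resource` class and the helpers `resources_at` and `avg_n` are hypothetical, and it assumes the function collects resources exactly N levels away.

```python
from statistics import mean


class Resource:
    """A node in a parent-child hierarchy such as cluster > host > VM."""

    def __init__(self, name, cpu_usage=None):
        self.name = name
        self.cpu_usage = cpu_usage
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def resources_at(resource, levels):
    """Positive levels look down the hierarchy, negative levels look up."""
    if levels == 0:
        return [resource]
    if levels < 0:
        if resource.parent is None:
            return []
        return resources_at(resource.parent, levels + 1)
    found = []
    for child in resource.children:
        found.extend(resources_at(child, levels - 1))
    return found


def avg_n(resource, levels):
    """Average CPU usage of the resources found that many levels away."""
    values = [r.cpu_usage for r in resources_at(resource, levels) if r.cpu_usage is not None]
    return mean(values) if values else None


# Hypothetical inventory: one cluster, two hosts, three VMs.
cluster = Resource("Cluster-A")
host1 = cluster.add(Resource("esx-01"))
host2 = cluster.add(Resource("esx-02"))
vm1 = host1.add(Resource("vm-1", cpu_usage=20.0))
vm2 = host1.add(Resource("vm-2", cpu_usage=40.0))
vm3 = host2.add(Resource("vm-3", cpu_usage=60.0))
print(avg_n(cluster, 2))
```

With a level of 1 the function would only reach the hosts, which carry no VM-level value in this example. That is why the cluster-level average needs a 2.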

150

Super Metric: Operators

 To calculate a value for each VM based on metrics for that VM, use the ‘$This’
operator.

 Another example: max ( $This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)
 Finds the maximum value among these
• CPUavg metric for the resource to which the super metric is assigned (so this is dynamic)
• CPUavg metric for a specific resource called ESXi-Host-003 (so this is hardcoded)
• CPUavg metric for all resources of type VM (so this is universal for all VM)
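
The three kinds of reference in that formula can be mimicked with a toy resolver. This is a Python illustration only, not the VC Ops formula engine: the inventory layout, the kind name "Host" and the helpers `resolve` and `super_max` are assumptions.

```python
def resolve(reference, this_resource, inventory):
    """Turn a 'target:metric' reference into a list of values.

    target can be '$This' (the resource the super metric is assigned to),
    the name of a specific resource, or a resource kind such as 'VM'.
    """
    target, metric = reference.split(":")
    if target == "$This":
        names = [this_resource]          # dynamic
    elif target in inventory:
        names = [target]                 # hardcoded resource name
    else:
        names = [name for name, res in inventory.items() if res["kind"] == target]
    return [inventory[name]["metrics"][metric] for name in names]


def super_max(references, this_resource, inventory):
    """Maximum over every value the references resolve to."""
    values = [v for ref in references for v in resolve(ref, this_resource, inventory)]
    return max(values)


# Hypothetical inventory and values.
inventory = {
    "ESXi-Host-003": {"kind": "Host", "metrics": {"CPUavg": 55.0}},
    "vm-a": {"kind": "VM", "metrics": {"CPUavg": 30.0}},
    "vm-b": {"kind": "VM", "metrics": {"CPUavg": 70.0}},
}
formula = ["$This:CPUavg", "ESXi-Host-003:CPUavg", "VM:CPUavg"]
print(super_max(formula, "vm-a", inventory))
```

Only the `$This` part changes when the same super metric is assigned to a different resource. The other two references resolve to the same values everywhere.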

153

Discussion Point

Think of super metrics that you need.
Explain why and how you will need them.

159

Applications and Application Tiers
 App teams often view things from their own application-centric perspective. We can create a custom dashboard showing their “Application”.
 Even better if we add non-vSphere data, e.g. from Hyperic. This gives app-level info and Guest OS-level info, which is not available from the vSphere adapter.
 Define your own hierarchy and relationships

160

Drag selected objects to the tag value

161

Parent-Child Resource Relationships

162

What counters do you check?

CPU
• ESX
  • Usage or Utilisation: overall CPU utilisation (to get the overall utilisation of the entire box)
  • Usage or Utilisation: individual core utilisation (to see the distribution and whether any particular core is maxed out)
  • Wait (waiting for IO. To see if it’s IO bound)
  • Ready (VM unable to run, waiting for a core)
  • Co-Stop (if there are large VMs)
• VM
  • Usage or Utilisation: overall CPU utilisation
  • Usage or Utilisation: individual core utilisation
  • Wait (waiting for IO. To see if it’s IO bound)
  • Ready (VM unable to run, waiting for a core)
  • Co-Stop (if there are large VMs)

RAM
• ESX: Ballooning; Active or Active Write
• VM: Ballooning; Active or Active Write

Storage
• ESX: Latency (kernel latency, device latency); Throughput; IOPS
• VM: Guest Latency; Device Latency; Throughput; IOPS

Network
• ESX: Dropped packets; Throughput
• VM: Dropped packets; Throughput

Others
• ESX: vSphere Replication? System? Cluster service?
164

How are Disk, Datastore,
Adapter and Path related?

165

CPU counters

Which one is ESX, which one is VM? How do you know?

What can stop/block a VM from getting the CPU it was configured with?

No more Collection Level limitation. VC Ops collects them all and analyses them all.
Changing the collection level in vCenter does not impact VC Ops, as VC Ops gets its data from the “real-time” statistics.

166

%OVRLP and %SYS

[Diagram: timeline of World 1 (%RUN, %SYS, %OVRLP) and World 2 (%RUN) moving through the Run, Wait and Ready states. Callout: %RUN continues to accumulate, but %OVRLP kicks in.]

%OVRLP: overlapping time. A world still wants CPU but is interrupted by another world. A high number normally means ESX is experiencing heavy IO.
%USED = %RUN + %SYS - %OVRLP
As a result, the overlap value does not incorrectly inflate %USED.
%SYS: a high number means heavy IO or interrupts.
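
A quick worked example of the formula, in Python. The input numbers are made up for illustration and the function name `pct_used` is hypothetical.

```python
def pct_used(run, system, overlap):
    """%USED = %RUN + %SYS - %OVRLP, as given on the slide."""
    return run + system - overlap


# Hypothetical readings for one world: %RUN = 60, %SYS = 5, %OVRLP = 3.
print(pct_used(60.0, 5.0, 3.0))
```

Without the subtraction, the 3% during which the world was interrupted would be counted as if the world had used it.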

167

Memory counters
ESXi VM

168

Storage counters: ESXi host
Datastore Disk

Storage Adapter or Storage Path

169

ESXi: Adapter, Device and Path

1 adapter can access many Devices (LUNs).
1 Device is accessed via many paths.
1 path can only access 1 Device.
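
These three rules describe a simple data structure: each path belongs to exactly one adapter and one device. A minimal Python sketch, with hypothetical path and LUN names:

```python
from collections import defaultdict

# Hypothetical inventory: each path maps to exactly one (adapter, device) pair.
PATHS = {
    "vmhba2:C0:T0:L0": ("vmhba2", "LUN-0"),
    "vmhba2:C0:T0:L1": ("vmhba2", "LUN-1"),
    "vmhba3:C0:T0:L0": ("vmhba3", "LUN-0"),
    "vmhba3:C0:T0:L1": ("vmhba3", "LUN-1"),
}


def devices_by_adapter(paths):
    """1 adapter can access many devices."""
    result = defaultdict(set)
    for adapter, device in paths.values():
        result[adapter].add(device)
    return dict(result)


def paths_by_device(paths):
    """1 device is accessed via many paths."""
    result = defaultdict(list)
    for path, (_adapter, device) in paths.items():
        result[device].append(path)
    return dict(result)


print(devices_by_adapter(PATHS))
print(paths_by_device(PATHS))
```

Because a path is a dictionary key with a single value, the rule "1 path can only access 1 Device" holds by construction.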

170

ESXi: Adapter, Device and Path

[Diagram: an ESXi 5.0 host with a vmnic and two storage adapters (Storage Adapter 1 = vmhba2, Storage Adapter 2 = vmhba3). Six storage paths lead down to an NFS datastore, two VMFS datastores and an RDM, which sit on three disks.]

172

Storage counters: VM
Virtual Disk (VMDK, RDM)

[Diagram: a VM with three drives (Drive 1, Drive 2, Drive 3), each a vDisk (e.g. scsi0:0, scsi0:2). The vDisks are backed by a VMFS datastore, an NFS datastore and an RDM, which sit on disks.]

173

Network counters

ESXi

VM

174

Other Counters: ESXi Host
vSphere Replication System (vmkernel)

See next 2 slides for info

Cluster Service

Power

175

A long list of vmkernel resources. Some are familiar, such as vMotion, FT, hostd, vpxa, DCUI, logging.

177

Dashboard: creating a new Tab

181

Application Overview and Application Detail

183

Scoreboard: Health or Workload

196

Metric Graph (Rolling View)

208

The VC Relationship

 There are 2 widgets that are vSphere related.
 Use the advanced edition instead.
• The Enterprise edition can access the Advanced edition UI at the same time. Just open another window or tab.

221

Interaction between widgets

 Controlled at the dashboard level, not at the individual widget level
 Providing widget and Receiving widget

222


223


224

Practice session: creating your dashboard

 Goal: have a dashboard to help you investigate all non-local datastores quickly
• Be able to plot charts for all non-local datastores for comparison.
 Answer:
• Create a tag called Storage from the Environment screen.
  • Create 1 tag value: Shared Datastore
  • Tag all the non-local datastores with this tag value
  • Done manually. Simply drag all the rows
• Create a dashboard with 4 widgets
  • Health Status
    • This is where you show the overall health of all non-local datastores
  • Resources
    • This is where you show all the datastores tagged with the Shared Datastore tag value
  • Metric Selector
    • All the metrics will appear here.
    • Select the metric you want
  • Metric Graph or Metric Sparklines
    • Choose Sparklines if you have lots of graphs.

225

vCenter “equivalent” dashboard

227

Major Steps in implementation

Define who needs what > Create Super Metrics > Create Applications > Create Tags > Create Heat Maps > Create Dashboards

 Begin with the end in mind
• Every Super Metric must serve a particular role
• Role, not individual. A person can & will have many heatmaps/dashboards.
• Decide if you need the following non-standard info
• Application-level & Guest-OS-level info
• Info from physical machines (UNIX, X64, etc)
• Info from physical storage and network (switch, FW, router, etc)
 Think in terms of application
• A great way to complement vSphere as vCenter does not have this object.

235

Who needs to see what

CIO or CTO
• Simple Dashboard.
• Big picture. Tend to be application focused.
• No absolute data. Normalised to 0-100.
• Focus on long term.
• Averaged data. A 30-minute spike will not show up.
• Updated daily.

Group Head
e.g. Head of Infra, Head of Apps

Dept Head
e.g. Head of Storage, Head of Server, Head of Network, Head of Databases

Admin/Architect
e.g. Storage Admin, Network Admin, App Owner, VM Owner
• Rich Dashboard. Ideally Full HD screen.
• Specific info.
• Absolute data + Normalised Data.
• Focus on short term.
• Actual data. A 5-minute spike will be visible.
• Updated every 2 minutes.
236

Who needs to see what (samples)

Roles and info presented
• CIO
  • Health of overall IT in the past 1 month
  • Health of key applications in the past 1 month
• CTO
  • As above, but with more technical content, and tailored to him.
• Head of Applications
  • Health of all key apps in the past 1 month, with the ability to do 1 level drill down for each app.
  • Capacity projection for all key apps.
• Head of Infrastructure
  • Health of Storage
  • Health of Network
  • Health of Servers (VMware and Physical)
  • Health of VM
• Head of Storage
  • A higher level, simpler dashboard than Storage Admin
• Head of Network
• VMware Team
• An App Owner
  • The infra is providing each of the VMs in my App with the resources it needs

237

Designing Super Metric
 Leverage existing derived metrics
 Leverage objects for which vCenter cannot provide performance data
• Application, Resource Pool, Folder, Location can now have performance counters
 Minimise static alerts.
 Know what a good range is for the end result
 Build a simple table to avoid super metric sprawl and duplicating existing metrics
• Below is an example, showing 2 Super Metrics.

Super Metric: VM SLA
• Purpose: shows that a VM gets the resources it wants from the infrastructure, based on the defined SLA.
• Target Role: VM Owner
• Formula: VM SLA = 100% - Max (CPU, RAM, Disk, Network)
  • CPU = CPU Contention %.
  • RAM = RAM ballooning %.
  • Disk = % above threshold latency. Tier 1 Disk SLA is 10 ms.
  • Network = Packet Drop %.
• Good Range: >99% (Tier 1 cluster), >97% (Tier 2 cluster), >95% (Tier 3 cluster)

Super Metric: Infra SLA
• Purpose: shows that the underlying infra has the resources for all the VMs on it
• Target Role: VMware Admin
• Formula: Infra SLA = 100% - Max (Host Cluster, Datastore Cluster)
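
The VM SLA formula is easy to sanity-check with a few numbers. The Python sketch below is only an illustration: the function names and the input values are hypothetical, while the formula and the good ranges come from the table above.

```python
# Good range per tier, from the table: the VM SLA must be above this value.
GOOD_RANGE = {1: 99.0, 2: 97.0, 3: 95.0}


def vm_sla(cpu_contention_pct, ram_balloon_pct, disk_above_threshold_pct, packet_drop_pct):
    """VM SLA = 100% - Max (CPU, RAM, Disk, Network)."""
    worst = max(cpu_contention_pct, ram_balloon_pct, disk_above_threshold_pct, packet_drop_pct)
    return 100.0 - worst


def in_good_range(sla, tier):
    return sla > GOOD_RANGE[tier]


# Hypothetical VM: 0.5% CPU contention, no ballooning, 2% of samples above
# the disk latency threshold, no dropped packets.
sla = vm_sla(0.5, 0.0, 2.0, 0.0)
print(sla, in_good_range(sla, 1), in_good_range(sla, 2))
```

Using Max rather than an average means the single worst component decides the score. A VM with perfect CPU, RAM and network still fails a Tier 1 SLA if its disk is above the threshold 2% of the time.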

238

Custom Heat Map or Cold Map

CPU
• Heat Map: Resource pool: size by CPU utilisation
• Cold Map: Least utilised VM: size by vCPU count, color by RAM + CPU usage (a Super Metric)

RAM
• Heat Map: Most RAM intensive VMs, grouped by ESX. Size by RAM utilisation, color by health

Disk
• Heat Map: Most disk intensive VMs, grouped by ESX. Size by disk utilisation, color by health
• Cold Map: Least utilised disk: size by GB, color by % of free

Network
• Heat Map: Most network intensive VMs, grouped by ESX. Size by network utilisation, color by health
• Cold Map: Most idle VMs, grouped by host

Capacity
• Heat Map: VMs with file system that will run out soon. Color by % left, size by GB left.

Health
• Heat Map: VM health, grouped by cluster. Color by health, size by workload.

 Design consideration
• Use Super Metric so the info is richer.
• Group VMs by 1 consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore, as 1 datastore can span multiple clusters.

239

vCenter: network impact of vCenter Ops

240

Choice of Tools

 vCenter Operations
• 1-15 minutes accuracy (for other sources)
• 5 minutes accuracy (for vSphere)
• The problem does not need to be reproducible, but it should last >5 minutes, preferably 15 minutes (3 samples)
 vCenter
• 20 – 300 seconds accuracy
• Reproducible performance issue
• Requirement: you already have some idea what causes it
 esxtop
• 2 – 20 seconds accuracy. Short burst problem.
• Reproducible performance issue
• Requirement: you already know which ESX & VM has the problem.
 vscsiStats
• Specific for storage, low level analysis

241
