vCenter Operations 5: Level 300 training
Singapore, Q2 2012
Iwan ‘e1’ Rahabok, VCAP-DCD

Staff SE, Strategic Accounts

e1@vmware.com | Skype: e1_ang | 9119-9226 | Linkedin.com/in/e1ang




                                                                    © 2010 VMware Inc. All rights reserved
Document Information
 This deck is part 2 of a series.
    • Part 1 is Management in the Virtual World: a technical introduction.
       • http://communities.vmware.com/docs/DOC-17841

 This deck has prerequisites
    • Intro video: http://www.youtube.com/watch?v=Z-DJuTiqKag
    • VC Ops 5 technical introduction at Vault or Partner Central.
 This deck only covers vCenter Operations (Enterprise + Advanced)
    • Focuses on concepts and ‘under the hood’ details to help you understand the product more deeply.
       • Does not cover: competitive, installation, configuration
    • Does not run through feature after feature.
       • See the official training deck for that at Vault or Partner Central.
    • vCenter Operations modules that it does not cover
       • Chargeback
       • Infrastructure Navigator
       • Configuration Manager
 This is a very long training deck. Use the Section feature to see how it is organised.

 Further reading
    • virtual-red-dot.blogspot.com

2
Table of Contents

 Built for vCenter Standard
 Core: Metrics, Threshold, Analytics
 Badges
 Heat Map
 Smart Alert
 Details & Charts
 Capacity Management
 Settings
 VCM integration
 Concepts & Advance Concepts
 Deep dive into Metrics
 Dashboard and Widgets



3
Managing Performance/Capacity in vSphere: the basics



     Is it healthy?       Is it enough?        Is it optimised?

    • Every VM & ESX      • Enough CPU, RAM,   • Which VMs need
      performing well?      Network, Disk?       adjustment?
      CPU, RAM,             Future risk?       • What are my key
      Network, Disk?      • Time remaining?      ratios?
    • Are they behaving   • Capacity           • How much can I
      expectedly?           remaining?           claim back from
    • Any fault on any    • Where are the        “fat” VMs?
      component?            “Stress points”    • How many more
                            in time?             VMs can I put
                                                 without impacting
                                                 performance?




4
Direct Mapping by vCenter Operations
                                        Is it healthy = Health
                                          • Workload
                                          • Anomalies
                                          • Faults
                                        Is it enough = Risk
                                          • Time remaining
                                          • Capacity remaining
                                          • Stress period
                                        Is it optimised = Efficiency
                                          • What can we reclaim?
                                          • Density. Key ratios for management

                                        Daily update at midnight




5
Bird's-eye view




6
Visibility across vCenters




                                Sample from ASEAN Lab:
                                       6 vCenters.
                              Mix of Appliance and Windows
                                2 are in Linked Mode (SRM)




7
Performance Troubleshooting: a day in the life…
 You got an email from the app team, saying the main Intranet application was slow.
    • The email was sent 1 hour ago. It stated that the application was slow for 1 hour, and was OK after that.
       • So it was slow between 1 and 2 hours ago, but is OK now.
       • You did a check. Everything has indeed been OK for the past hour.
    • The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM
       • You are not familiar with the application. You do not know which apps run on each VM, as you have no access to the Guest OS.
       • Your environment: 1 VC, 4 clusters, 30 hosts, 300 VMs, 20 datastores, 1 midrange array, 10 GE FCoE



                                       Test your vSphere knowledge!
                              How do you solve/approach this with just vSphere?

What do you do?
 A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE
 B: No sweat, you’re VCDX + CCIE + ITIL Master. You were born for this.
 C: SMS your wife, “Honey, I’m staying overnight at the datacenter”
 D: Take your blood pressure medicine so it won’t shoot up.
 E: Buy the app team a very nice dinner, and tell them to keep quiet.


8
Performance Troubleshooting: a day in the life…
 The minimum you need to prove
    • The performance problem is not caused by your infrastructure, or at least not by your VMware layer.
       • Infrastructure = VMware + Storage + Network
       • Application = VM + App inside the VM

 What you need to prove
    • For each of the 10 VMs, the following was OK between 1 and 2 hours ago: CPU, RAM, Disk, Network
    • To strengthen the above, prove that:
       • The shared infrastructure was also healthy: relevant ESX hosts, relevant datastores
       • The overall platform was also healthy.
    • No relevant faults happened 1-2 hours ago.
    • Give the list of ports (that the 10 VMs use) to the network team to ensure the firewall is not dropping them.
 What challenges do you face in vSphere to do the above?
    • Group discussion: what limitations do you face if vCenter + vMA + PowerGUI + RVTools is all you have?
 The ideal you need to prove
    • Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that
      caused it. In other words, application-specific + root-cause analysis




9
Challenge 1: details are lost after 1 hour




10
Challenge 1: details are lost after 1 hour




                                             The following counters are lost:
                                             1. Used
                                             2. System
                                             3. Idle
                                             4. Latency
                                             5. Overlap
                                             6. Demand
                                             7. Wait
                                             8. Run
                                             9. Swap wait
                                             10. Max Limited
11
Challenge 1: details are lost after 1 hour

        Memory Counters                      Disk Counters

     <1 hour        >1 hour            <1 hour               >1 hour




12
13
Challenge 2: no application awareness




14
15
16
Deep understanding of vCenter is required




                       Here is a common example of why
        a deep understanding of vSphere counters makes a huge difference.




Buy more RAM?




 17
Deep understanding of vCenter is required




                                             Yes, buy more RAM.
                                            ESXi has 32 GB RAM.
                                               It is highly used




18
Deep understanding of vCenter is required




                                                        vCenter Ops shows
                                                        very different data.
                                                        Memory is only 32%.
                                                        Plenty of headroom.




      What?! It’s been high constantly for the last 24 hours! Better buy more RAM now.


     But hang on! This is the ESXi-06 host in the VMware ASEAN lab. We know who uses it.

19
vCenter Ops shows
     very different data.
     Memory is only 32%.
     Plenty of headroom.


     It just saved us from a
      costly RAM upgrade
              project




20
Live Demo
     1 engine, 2 UI.
     Dashboard.
     Badges.
     Configuration




21
Counters and Badges
 A vCenter farm with 500 VMs and 50 ESX hosts will have >10000 counters!
     • It is not humanly possible to look at them all, let alone analyse them.
 vCenter presents raw counters
     • e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart
       better than a value of 75000 in the Daily chart?
     • e.g. Is memory.usage at 90% at ESXi level good or bad?
     • e.g. Is IOPS of 300 good or bad for datastore XYZ?
 A single counter can be misleading
     • e.g. Low CPU usage does not mean the VM is getting the CPU it needs, if there is Limit, Contention or Co-Stop.
     • e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical)
 Different counters have different units
     • GHz, %, MB, kbps, ops/sec, ms
     • This makes analysis even more complex

Derived Counters
 Standardise the scale into 0 - 100.
 1 universal unit. Minimises the “translation” in our head.
 Can be >100 if demand is unmet.
 Universal. Apply to CPU, RAM, Disk, Net, etc.
 Counters are derived using sophisticated formulas, not just aggregated.
 For the same counter, different objects use different formulas.
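The Ready Time question above can be worked through with a small sketch. It assumes that the Ready counter is a summation in milliseconds per sampling interval, and that the Real Time chart samples every 20 seconds while the Daily chart samples every 300 seconds. These intervals are assumptions not stated in this deck, and the helper name `ready_percent` is made up for illustration. The number of vCPUs is ignored.

```python
# Assumed vCenter chart sampling intervals in seconds (not stated in the deck).
INTERVAL_SECONDS = {"realtime": 20, "daily": 300}


def ready_percent(ready_ms: float, chart: str) -> float:
    """Convert a CPU Ready summation value (ms) into a % of the sampling interval."""
    interval_ms = INTERVAL_SECONDS[chart] * 1000
    return 100 * ready_ms / interval_ms


# The three raw values mentioned on the slide.
for value, chart in [(1500, "realtime"), (2000, "realtime"), (75000, "daily")]:
    print(f"{value} ms on the {chart} chart = {ready_percent(value, chart):.1f}%")
```

Under these assumptions, the raw value of 75000 on the Daily chart is the worst of the three once normalised, which is the kind of "translation" a 0 - 100 derived counter is meant to remove.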




22
Samples of Derived Metric: Health
 Health Score of an Object = MAX (Abnormal Workload, Faults)
     • Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)),
       Workload)
     • Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
     • Fault depends on the object:
          Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)

          Host = MAX (Hardware Issues, HA Issues)
                 Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
                        Network Issues = MAX (Network, DVPort, VMNic)
                              Network = Max_of_all_instances (Network Device)
                              DVPort = Max_of_all_instances (DVPort Device)
                              VMNic = Max_of_all_instances (VMNic Device)
                        Storage Issues = MAX(Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
                               Storage = Max_of_all_instances (Storage Device)
                               SCSI = Max_of_all_instances (SCSI Device)
                               VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
                               NFS server = Max_of_all_instances (NFS server Device)
                        Compute Issues = MAX (Error, PCIe)
                        CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other Health, IPMI, BMC)
                 HA Issues = HA Host Status

          VM = MAX (FT Issues, HA Issues)
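The MAX-based fault roll-up above can be sketched as follows. This is a minimal illustration, not the product's implementation: the dictionary keys are hypothetical names, each key is assumed to already hold the Max_of_all_instances value for that device type, the CIM sub-items are collapsed into one `cim` key, and a missing key is treated as 0 (no fault).

```python
def worst(issues: dict, *names: str) -> float:
    """MAX over the named fault scores; a missing entry counts as 0."""
    return max(issues.get(name, 0) for name in names)


def fault_score(object_type: str, issues: dict) -> float:
    """Roll fault scores up with MAX, following the hierarchy on the slide."""
    if object_type == "cluster":
        return worst(issues, "ha_insufficient_failover_resources",
                     "ha_failover_in_progress", "ha_cannot_find_master")
    if object_type == "host":
        hardware = max(
            worst(issues, "network", "dvport", "vmnic"),           # Network Issues
            worst(issues, "storage", "scsi", "vmfs_heartbeat",
                  "nfs_server", "cim_storage"),                    # Storage Issues
            worst(issues, "error", "pcie"),                        # Compute Issues
            worst(issues, "cim"),                                  # CIM Issues (collapsed)
        )
        return max(hardware, worst(issues, "ha_host_status"))      # MAX (Hardware, HA)
    if object_type == "vm":
        return worst(issues, "ft_issues", "ha_issues")
    raise ValueError(f"unknown object type: {object_type}")


def health_raw(abnormal_workload: float, faults: float) -> float:
    """Health Score of an Object = MAX (Abnormal Workload, Faults), as stated on the slide."""
    return max(abnormal_workload, faults)


host_issues = {"vmnic": 75, "scsi": 25, "ha_host_status": 50}
print(fault_score("host", host_issues))                   # worst single issue wins
print(health_raw(40, fault_score("host", host_issues)))
```

Because every level uses MAX, a single serious fault anywhere in the hierarchy surfaces at the object level instead of being averaged away.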


23
Threshold: a shift in mindset needed
 vCenter sets “static” thresholds, which can be misleading
     • During peak, it is common for a VM to reach high utilisation.
       • A static threshold will generate alerts when it should not.
       • vSphere admins quickly learn to ignore them, defeating the purpose of alerts to begin with.
     • During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
       • A static threshold will not generate alerts when it should.
 vCenter only sets high thresholds
     • Do you set a static threshold for when CPU or RAM utilisation drops below 5%?
       • A drop in IOPS across the entire storage array might be a sign of a terrible day ahead.
     • It will not alert when these happen:
       • Utilisation drops from 75% to 1% when it should not.
       • Utilisation changes from 5% to 70% when it should not.
     • We need to plot both the upper range and the lower range
 But each VM differs. And the same VM differs depending on day/time…
     • Intelligence is required to analyse each metric and its expected “normal” behaviour.



24
Dynamic threshold & alerts
 vCenter Operations uses dynamic thresholds
     • They are dynamic and personalised down to the individual metric.
        • Vary from object to object. Each of 1000 VMs will have its own thresholds.
        • Vary from time to time. The same CPU Usage counter has different thresholds at different times. This caters for
          peaks. See the chart below.
        • Vary from metric to metric. On an ESX host with 12 cores, each core has its own CPU Usage threshold.
     • You can set hard thresholds if you need to.
        • This needs Enterprise edition. It comes with no static thresholds defined.
        • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
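To make the idea of a per-metric, per-time-of-day range concrete, here is a deliberately simple sketch. It is not the algorithm vCenter Operations uses: it just learns a band of mean ± k standard deviations for each hour of the day, and flags values outside the band in either direction. The function names and the sample history are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_bands(samples, k=2.0):
    """samples: iterable of (hour_of_day, value). Returns {hour: (lower, upper)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {
        hour: (mean(values) - k * pstdev(values), mean(values) + k * pstdev(values))
        for hour, values in by_hour.items()
    }


def is_abnormal(bands, hour, value):
    """True if the value falls below the lower or above the upper range for that hour."""
    lower, upper = bands[hour]
    return value < lower or value > upper


# Hypothetical CPU utilisation history: quiet at 03:00, busy at 14:00.
history = [(3, 4), (3, 5), (3, 6), (14, 70), (14, 75), (14, 80)]
bands = hourly_bands(history)

print(is_abnormal(bands, 14, 78))  # high, but normal for the peak hour
print(is_abnormal(bands, 14, 1))   # a drop that a high-only static threshold would miss
print(is_abnormal(bands, 3, 70))   # a spike during quiet hours
```

The same value of 70 is normal at 14:00 and abnormal at 03:00, which is the behaviour a single static threshold cannot express.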




                                                                                                         Notice the range varies
                                                                                                         in size




25
Dynamic Threshold Analysis
 DT analysis runs nightly
     • New dynamic thresholds are computed for each metric
 Data categorization
     • Tries to identify the stat as linear, multinomial, step function, etc.
     • If one of those matches, that DT function is used
 Otherwise: competition
     • Sigma: assumes hourly cycles
     • CCPD: tries to find normal cycles
     • ACPD: tries to find abnormal cycles
     • Winner is assigned based on metric trending accuracy
 The same metric may get a different DT function on a different day

[Diagram: for each metric, Data Categorization leads to Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT and Quantile Sigma DT, with CCPD and ACPD, followed by DT Scoring and the resulting Dynamic Thresholds.]


26
Dynamic Threshold: Algorithm

                                                                       m 1 m  1       m
                                                                                                    
                                                          0,0     i , j    i , j   m 1 m 1                                                                    0,0 1
                                                                                                                               i , j 1    m 1 m  1          
                                                                                                                               m                                m

                                                                                                        pi , j  i 1 pi , j   1     pi , j  i 1 pi , j  
                                                                          i 1 j 1   i  m, j 1                    i , j 1
      P1,1,P1,2 ,...,Pm,m ( p1,1, p1,2 ,..., pm,m )               m 1 m  1
                                                         0,0      i , j      i , j    i 1 j 1
                                                                                           m
                                                                                                                               m, j               i 1 j 1
                                                                                                                                                               m, j     
                                                       
                                                                    i 1 j 1        i  m , j 1    
                                                                                                      
              m 1 m  1          m                                                     
     where       pi , j 
               i 1 j 1
                                
                              i  m , j 1
                                             pi , j  1 0  pi , j  1 and   z    t z 1e  t dt
                                                       ,
                                                                                        0




       The marginal distribution of the i th row of J is:
                                                                       m 1
                                                                                                                                               
                                              Dirichlet                i , j , i ,1, i ,2 ,..., i ,m 1               for i  1 m  1
                                                                                                                                          ,...,
                                                                       j 1                                                                  
       ( pi ,1,..., pi ,m 1 )                                                                                                                 
                                                                                        m
                                                                                                                                              
                                              Dirichlet               0,0           m, j  , m,1, m,2 ,..., m,m , 0,0  for i  m  
                                                                                                                                  
                                                                                       j 1                                                
                                                                                  m 1 m  1                m
                                                where   0,0     i , j                                     i , j
                                                                                   i 1 j 1            i  m , j 1



                    It is pretty difficult for a human to beat the computer at analysing the data.
                    The above is one of the many algorithms applied by vCenter Operations.


27
Analytics

7 different analytics areas.
For the DT feature, there are 8
algorithms.




Only in
Enterprise Edition




These advanced
features create
Smart Alerts.




28
Discussion Point



                   Raw Counters vs Derived Counters
              Dynamic Threshold vs Static Threshold




29
Badge – Health
 Answers complex questions like:
     • How is the entire virtual data center doing? How healthy is it?
     • For every cluster, host and datastore, what is its health?
 Health is the current operational state.
     • It represents what is wrong now that should be addressed within 1 day. Thus Health needs to be scored
       such that if it is red, then it really needs attention.

 Weather Map
     • Simple way to check that the entire farm is healthy
        • For a child object, it is replaced with Health Trend
     • Shows Health of all parent and child objects
        • Each square can be a VM, ESX host, datastore, cluster, datacenter or vCenter.

       Value          Explanation
       75 – 100       Normal behaviour.
       50 – 75        The object is experiencing some problems.
       25 – 50        The object might have serious problems. Check and take action as soon as possible.
       0 – 25         The object is either not functioning properly or will stop functioning soon.
30
Badge – Workload
 Answers complex questions like:
     • For every object, how is Demand vs Supply?
     • For every single VM, is it CPU/Memory/Disk/Network bound?
     • Is any VM not getting what it is entitled to?
     • What’s the normal workload range for every object in our vDC?

 Workload is not utilisation or usage
     • More accurate than utilisation, as it takes into account more factors than just utilisation.

 Workload = (Demand/Entitlement)
     • Entitlement is dynamic. Affected by shares, limit, etc.
     • Demand ≠ Usage.
        • Usage may mean passive usage. E.g. the RAM page is there but there is no write/read.
     • Score is Max (CPU, RAM, Disk IO, Net IO)
        • To draw attention to the most constrained resource

       Value          Explanation
       0 – 80         Workload is not high.
       80 – 90        The object is experiencing some high resource workloads.
       90 – 95        Workload on the object is approaching its capacity in ≥1 area.
       >95            Workload on the object is at or over its capacity in ≥1 area.
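The Workload formula and score bands above can be sketched as follows. This is an illustration under assumptions: the demand and entitlement figures are invented, the resource names are hypothetical, and the slide does not say which band a boundary value such as exactly 90 belongs to, so the sketch puts boundary values in the lower band.

```python
def workload_score(demand: dict, entitlement: dict):
    """Workload = Demand / Entitlement per resource; the score is the MAX across resources."""
    per_resource = {r: 100 * demand[r] / entitlement[r] for r in demand}
    bound_by = max(per_resource, key=per_resource.get)
    return per_resource[bound_by], bound_by


def workload_band(score: float) -> str:
    """Map a score to the explanation bands in the table (boundary handling is assumed)."""
    if score > 95:
        return "at or over capacity"
    if score > 90:
        return "approaching capacity"
    if score > 80:
        return "some high resource workloads"
    return "not high"


# Hypothetical VM: CPU in MHz, RAM in MB, disk in IOPS, network in Mbps.
demand = {"cpu": 1200, "ram": 3072, "disk": 460, "net": 10}
entitlement = {"cpu": 2000, "ram": 4096, "disk": 500, "net": 100}

score, bound_by = workload_score(demand, entitlement)
print(score, bound_by, workload_band(score))
```

Taking the maximum rather than an average means a VM that is disk bound shows a high Workload even when its CPU and network are nearly idle.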



31
Derived Metric: Demand




                The chart below shows Demand in action.
                I generated IOPS on a local datastore,
                resulting in a spike in latency (read latency went
                up from 3 ms to 60 ms).
                Demand correspondingly went up from 4 to 100!




32
Badge – Anomalies
 Answers complex questions like:
     • Is our vDC doing business as usual today? Or is it a dynamic environment with lots of unexpected changes?
     • Which VMs, ESX hosts, clusters, datastores, etc. are behaving abnormally?
     • … and exactly which counters are the culprits?
 Identifying metric abnormalities
     • It needs to learn the dynamic range of “Normal” for each metric, so give it >3 cycles per metric.
        • A month-end job means it needs 3 months.
        • Normal range changes after configuration or application changes.
 Anomalies score
     • A high number of anomalies:
        • Usually an indication of a problem
        • Demand change
        • Application team changed the code/app
     • KPI metrics impact the Anomalies score more than non-KPI metrics.

       Value          Explanation
       0 – 50         Normal Anomaly range.
       50 – 75        The score exceeds the normal range.
       75 – 90        The score is very high.
       > 90           Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.

33
This virtual DC spans multiple vCenters.
     vCenter Ops shows all the counters that
     are behaving abnormally.




34
Badge – Faults
 Answers complex questions like:
     • What faults do we experience in our vDC?
     • For every object, what faults does it have?
 Specific knowledge of vCenter events
     • Which events affect Availability and Performance of which object?
     • Pulled from active vCenter events
     • Examples:
        • Loss of redundancy in NICs or HBAs
        • Memory checksum errors
        • HA failover problems
     • Each fault has a default score (e.g. 25, 50, 75, 100)
     • The highest individual Fault score drives the object's Fault score
 Best Practices:
     • Do not change the Faults Threshold
     • Use Alerts View to manage Faults. Filter it to show only Faults.

       Value          Explanation
       0 – 25         No fault is registered on the object.
       25 – 50        Faults of low importance happen on the object.
       50 – 75        Faults of high importance happen on the object.
       > 75           Faults of critical importance happen on the object.



35
Badge – Risk
 Answers complex questions like:
     • Do we have performance and capacity risks in our vDC? If yes, where are they, and can you quantify their seriousness?
     • Which objects are at risk? What is the specific risk?

 Risk Score takes into account
     • Time Remaining
     • Capacity Remaining
     • Stress
 Risk is an early warning system.
     • Identifies potential problems that could eventually hurt performance
     • The Risk Chart shows the Risk score over the last 7 days, giving a view of the trend.

       Value          Explanation
       0 – 50         No problems are expected in the future.
       50 – 75        There is a low chance of future problems, or a potential problem might occur in the far future.
       75 – 100       There is a chance of a more serious problem, or a problem might occur in the medium-term future.
       100            The chances of a serious future problem are high, or a problem might occur in the near future.



36
Badge – Time Remaining
 Answers complex questions like:
     • How much time do we have before we need to buy more servers, storage or network, before
       performance starts to degrade or we run out of capacity?
     • For every cluster, VM and datastore, how much time do we have?

 Measures time remaining before each resource type reaches its capacity
     • CPU
     • Memory
     • Disk (IOPS & Space)
     • Network I/O
 Early warning of upcoming provisioning needs
     • Based on the Score Provisioning (SP) buffer. Default value is 30 days.
     • Set in the “Capacity & Time Remaining” section

       Value          Time remaining
       50 – 100       > 2x SP buffer (60 days)
       25 – 50        < 2x SP buffer
       < 25           Near SP buffer
       0              < SP buffer (30 days)




37
Badge – Capacity Remaining
 Answers complex questions like:
     • How many more VMs can we place without impacting performance or using up capacity?
     • For every cluster, VM and datastore, which component (CPU, RAM, Disk, Network) would run out first?

 Early warning system
     • Even a low score of 1 means you still have >30 days.
     • Measures how many more VMs can be placed on the object

 Percentage of total VM “slots” remaining
     • Based on the average size of the VMs on the object (e.g. VM profile)
     • Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.

 From the table, notice that the value is not linear
     • It is also not the same as the Time Remaining threshold.
     • A value of 30 means >120 days for capacity but around 40 days for time.

333 more VMs correlates to 77% Capacity Remaining for this object.

       Value          Capacity remaining
       >10            >120 days
       5 – 10         60 – 120 days
       0 – 5          30 – 60 days
       0              <30 days

38
Capacity Remaining Calculation

 Determine the capacity-constraining resource
 Deployed or Powered On VMs
     • Powered Off VMs only use disk space
     • Powered On VMs use ALL 4 resources
 Calculation example shown:
     • The limiting resource is Disk Space, with room for 333 more VMs
     • Use the Deployed VM count of 99 to calculate the percentage
       remaining
     • Determine Capacity Remaining
       • 333 / (333 + 99) = 77%
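The calculation above can be reproduced in a few lines. Only the disk space figure (333) and the deployed VM count (99) come from the slide; the other per-resource figures and the function name are hypothetical, added so that the "limiting resource" step has something to choose from.

```python
def capacity_remaining(vms_remaining_by_resource: dict, deployed_vms: int):
    """Pick the limiting resource, then express the remaining VM slots as a percentage."""
    limiting = min(vms_remaining_by_resource, key=vms_remaining_by_resource.get)
    remaining = vms_remaining_by_resource[limiting]
    percent = 100 * remaining / (remaining + deployed_vms)
    return limiting, percent


# Disk space (333) is from the slide; the other values are made up for illustration.
vms_remaining = {"cpu": 520, "memory": 410, "disk_io": 600, "disk_space": 333}

limiting, percent = capacity_remaining(vms_remaining, deployed_vms=99)
print(f"Limiting resource: {limiting}, Capacity Remaining: {percent:.0f}%")
```

The resource with the fewest remaining VM slots sets the score, so adding CPU or memory would not change the result here until disk space is addressed.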




39
Capacity and Time details

 You can drill down to see details
     • You can check the 9 components, as
      shown on the right.
       • This helps answer the question of
         how many days or VMs each
         component has left.
     • Summary = Min (all 9 components)




40
Badge – Stress

 Answers complex questions like:
     • In our vDC, do we have stress points or periods? How bad are they?
     • For every cluster, VM and datastore, which ones are experiencing stress, and how bad is it?
 Measures long-term or chronic workload (6 weeks)
     • Chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
     • Workloads > 70% = “Stressed”
       • Threshold is configurable, as per the screenshot below

       Value          Explanation
       0 – 1          Normal score. No action needed.
       1 – 5          Some of the object's resources are not enough to meet the demands.
       5 – 30         The object is experiencing regular resource shortages.
       > 30           Most of the resources on the object are constantly insufficient. The object might stop functioning properly.


41
Stress Calculation

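     [Chart: Workload Line plotted over 6 Weeks on a 0 to 100 scale. The Stress Zone is the band between 70 and 100; the area where the Workload Line rises above 70 is labelled 12%.]
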
 Stress Score is a % and is based on area of Workload Above “Stress Line” Threshold
     compared to the Total Capacity of the object
     • Stress Score = (Stress area / Stress Zone) *100
     • But max value can be > 100% as the workload can be >100.
 Example
     • Stress Line is 70% Workload
     • 12% of the area is above the 70% threshold
     • Stress Score is 12
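
A rough sketch of the area calculation, for intuition only. It assumes equally spaced workload samples, and it assumes the Stress Zone is the band between the Stress Line and 100% workload over the whole period; the slide does not spell out either detail, and the function name is made up.

```python
def stress_score(workload: list[float], stress_line: float = 70.0) -> float:
    """Area of workload above the stress line as a % of the stress zone.

    Assumption: samples are equally spaced, so each sample counts as one
    unit of time and areas reduce to simple sums.
    """
    if not 0.0 <= stress_line < 100.0:
        raise ValueError("stress_line must be in [0, 100)")
    if not workload:
        return 0.0
    stress_area = sum(max(0.0, w - stress_line) for w in workload)
    stress_zone = (100.0 - stress_line) * len(workload)
    return 100.0 * stress_area / stress_zone


print(stress_score([50.0, 60.0]))    # 0.0, never above the stress line
print(stress_score([70.0, 100.0]))   # 50.0, half of the zone is filled
print(stress_score([130.0, 130.0]))  # 200.0, workload >100 pushes the score >100
```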

42
Badge – Efficiency
 Answer complex questions like:
     • Are there optimization opportunities in our
      vDC?
     • How well do we do in terms of VM
      provisioning? Do we get them right?
 Efficiency Score factors
     • Reclaimable waste
     • Density ratio
 Graph Depicts VMs by Percent
     • Optimal – Optimally Provisioned VMs
     • Waste – Over Provisioned VMs
     • Stress – Under Provisioned VMs
       • Not used in Efficiency Calculation (see Risk)
 Three Resources Considered
     • CPU
     • Memory
     • Disk Space
 Note: VMs can appear in Stress and Waste at the same time

     Value       Explanation
     >25         The efficiency is good. The resource use on the selected object is optimal.
     10 – 25     The efficiency is good, but can be improved. Some resources are not fully used.
     0 – 10      The resources on the selected object are not used in the most optimal way.
     0           The efficiency is bad. Many resources are wasted.

43
Badge – Reclaimable Waste
 Answer complex questions like:
     • Have we over-provisioned the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
     • For every cluster, VM, datastore, what can we reclaim?
 It identifies the amount of reclaimable resources
     • CPU
     • Memory
     • Disk
 Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
     • Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
     • Disk calculation can also include old snapshots and templates

     Value       Explanation
     0 – 50      No resources are wasted on the selected object.
     50 – 75     Some resources can be used better.
     75 – 100    Many resources are underused.
     100         Most of the resources on the selected object are wasted.
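
The two formulas on this slide combine as in the sketch below. The function name, resource keys and sample numbers are invented for illustration; the slide only gives the ratio and the Max().

```python
def waste_score(reclaimable: dict[str, float], deployed: dict[str, float]) -> float:
    """Highest reclaimable/deployed ratio across resources, as a percentage."""
    scores = [
        100.0 * reclaimable.get(resource, 0.0) / capacity
        for resource, capacity in deployed.items()
        if capacity > 0
    ]
    return max(scores, default=0.0)


# Invented sample numbers: CPU in GHz, RAM and disk in GB.
deployed = {"cpu": 100.0, "ram": 200.0, "disk": 1000.0}
reclaimable = {"cpu": 20.0, "ram": 100.0, "disk": 300.0}
print(waste_score(reclaimable, deployed))  # 50.0, RAM is the most wasted resource
```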




44
Badge – Density

 Answer complex questions like:
     • How high can we push our consolidation
       ratio before we experience performance
       problem?
        • Now that’s a million dollar question! 
     • For every datacenter, cluster, ESXi, what
       are our key ratios and how much head
       room do we have?
 Contrasts Actual vs Ideal Density
     • Identify Optimal Resource Deployment
       Before Contention Occurs
        • Ideal is based on demand, not simple
          configuration.
       • High Density is good. 100 is not too high.

     Value       Explanation
     >25         Good consolidation
     10 – 25     Some resources are not fully consolidated
     0 – 10      The consolidation for many resources is low
     0           The resource consolidation is extremely low.


45
Badge Thresholds




There are 2 different sets of thresholds: VM and Infra (ESXi, Cluster, Datastore, etc.)

Notice that a Major badge has different thresholds from its minor badges.

Even “similar” badges have different thresholds. Notice that Time remaining and Capacity remaining have very different thresholds.


                                   Disable Color Threshold by
                                   Clicking the Level Off

46
Using badges together
 Workload High & Anomalies Low & Stress High
     • Workload – Object is Running Hot. Potentially Starving for Resources
     • Anomalies – Normal Behavior for this timeframe
     • Stress – Object is often running under high Workload.
     • Conclusion: Add resources
 Workload High & Anomalies Low & Stress Low
     • Workload – Object is Running Hot. Potentially Starving for Resources
     • Anomalies – Normal Behavior for this timeframe
     • Stress – Object usually has enough resources
     • Conclusion: Not likely a big problem… a cyclical workload spike?
 Workload High & Anomalies High
     • Workload – Object is Running Hot. Potentially Starving for Resources
     • Anomalies – Abnormal behavior for this timeframe
     • Conclusion: Something is amiss! Immediate attention.
 If there are Alerts and Faults too, then it is a sign of a major issue
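
The three combinations read naturally as a small decision function. The sketch below is just a restatement of this slide; the function name and the boolean inputs are made up, and how "high" is decided for each badge is left outside the sketch.

```python
def triage(workload_high: bool, anomalies_high: bool, stress_high: bool) -> str:
    """Map the badge combination on this slide to its conclusion."""
    if not workload_high:
        return "Not covered by this slide"
    if anomalies_high:
        # Stress does not matter in this case on the slide.
        return "Something is amiss! Immediate attention."
    if stress_high:
        return "Add resources"
    return "Not likely a big problem; a cyclical workload spike?"


print(triage(workload_high=True, anomalies_high=False, stress_high=True))
```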




47
Discussion Point


                       Is Badge the way to go?
                    Are these the right 11 badges?
                   What other badges do you need?




48
Heat Map

 Built-in heat maps
     • Basic:
       • Storage: space, IO
       • CPU
       • RAM
       • Network
     • Advance (or composite)
       • Health
       • Workload
       • Capacity
 Custom heat map or cold map
     • Since we can change the color, we can actually create a cold map.
     • In a cold map, the bigger the size, the colder (less utilised) it is. The bluer it is, the less utilised it is.
     • Hence it focuses on Waste
 A heat map is a great way to show a lot of information on 1 screen.
 A heat map can quickly highlight information, as it presents relative information. It is good for relative comparison among VMs.
 A heat map is a 2 dimensional chart, so it takes 2 parameters. You cannot choose >2 data. For example, you cannot show the following at the same time:
     • IOPS, Latency and Throughput. Also, these 3 have different units so it’s hard to combine them using a Super Metric.
     • ESX, VM and Datastore.

49
Storage: Datastore + VM vs workload + latency

 Since all the datastores are on the same array, how do we quickly tell the relative
     workload generated by every one of them?
     • This answers: which datastores are heavily loaded?
 For each of these datastores, how do we know the relative workload generated by
     the VM?
     • This answers: which VMs dominate within a datastore?
 For every VM, how do we know that performance is reasonable?
     • This answers: which VM has a storage bottleneck?
 How do we show all the above data in one page, without the need to show a lot of
     numbers?
     • And we still want to be able to drill down to each VM and datastore.




50
Each square is a VM. They are grouped by datastore.
     Bigger square: bigger throughput
     Color: latency.




51
Storage: Throughput vs Latency at cluster level

 Which cluster is generating high storage workload?
 Are they getting the SLA they asked for? What’s the latency? The cluster owner wants to
     know that his entire cluster is getting <10 ms latency.
 We expect these X, Y, Z clusters to be doing little work. Can we prove this?




                     Basically, the same concept from
                     previous slide, but looking from cluster
                     point of view as Cluster & Datastore has
                     a Many-to-Many relationship.




52
Storage: Throughput vs Latency at cluster level




53
Storage: Throughput vs Latency at host level




54
Storage: Throughput vs Latency at VM level




             Can we show at VM level now?
             That’s why you need a 24” monitor 




55
Storage: Space vs Latency

 Any big VM that is not getting the SLA we agreed on?




56
Storage: Datastore space contention

 Do we have space contention at any of the datastores? If yes, how bad is the contention?
     • While we use thick provisioning at vSphere level (and thin at array level), we still risk running out of space
       from snapshots, vRAM increase, new VM, new vDisk, storage vMotion, storage DRS, etc.
 Are the datastores uniformly sized?




57
Storage: Space contention

 We use thin provisioning




58
CPU: Contention vs Usage at cluster level

 Which clusters are doing the most work? Which are not doing much?
 How is the CPU workload on every cluster?
 For each of those clusters, can we see if there is CPU contention?




59
CPU: Contention vs Usage at host level

 Same questions as the previous slide, but for hosts.
 We can expect some “drill down” in this heat map




60
CPU: Contention vs Usage at VM level




             Can we show at VM level now?
             That’s why you need a 24” full HD
             monitor 




61
VM Health

 Current Health
     • Are all the VMs healthy? Especially those VMs which have high workload!
     • Which VMs are experiencing problems?
     • Are more demanding VMs less healthy?
     • Can we see this by cluster? By host?
 Future Health
     • Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and
       network for every single VM!
     • For those VMs which are not ok, can we be specific on which value will run out first? Can we
       “drill down” to individual VM?




62
VM: color by health, size by workload




63
VM: color by capacity, size by workload
 This is now showing a future projection. We can see that the VM vCenter 5 is red: its capacity will run out within 30
     days. So we click on it to drill down.




64
Drill down to specific VM
 Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.
 We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the
     physical world as it’s fixed, but unwise in virtual world.




65
Drill down to specific VM

 Showing value in absolute terms is good, but can be confusing. vCenter Ops can also
     show in %




66
Discussion Point



                Which heat maps are useful for you?
          What other heat maps or cold maps do you need?




67
Smart Alert vs Normal Alert
 Smart Alert
     • Relies on the advanced analytics instead of simple raw counters.
     • Not static, as it is based on Dynamic Thresholds
     • Examples:
        • Early warning alerts: use total anomalies to predict when a problem is happening, sometimes before users are impacted
        • KPI predictive: prediction that a KPI might soon go abnormal due to an event occurring that has preceded the KPI going
          abnormal on previous occasions
        • Fingerprint: set of metric anomalies matches previously seen problem (and associated resolution)

 Comparison

      Advanced Edition                                               Enterprise Edition
      Provides alerts on minor badges. E.g. Workload YES, Health NO  Provides alerts on any counter (raw, badge, super metric)
      Can only do infrastructure-level alerts                        Can do application-level alerts
      Good for alerts on single objects (e.g. VM)                    Good for single or multiple objects
      Driven by the badge’s changing color                           Driven by threshold anomaly breaches and KPI threshold breaches
      Not customisable                                               Highly customisable
      Cannot do alerts at Resource Pool or Folder                    Can do it


68
Application-level smart alert

 Needs Enterprise edition.




69
Alert

 When does an Alert happen
     • When a badge changes color
     • When a fault happens
     • VC Ops own alert
       • A component in VC Ops itself has failed.
       • VC Ops cannot get data
 Can do SNMP and SMTP
     • Both are set on the Administration Web page. The URL format is https://VM-IP/admin/




70
Advance edition: Alert main window

 Filter by the 11 badges
 Filter the VC Ops own alert: system or environment




71
72
Enterprise edition: Alert main window
 New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, Classic (static)
 We can also color the row by criticality, and specify period (start – end)




73
Enterprise edition: alert detail




74
75
Email Notification Rules




76
Email Notification Rules




77
Anomalies – Symptoms Window
 The example is an ESXi Anomalies symptom window, from an ESXi host with 11 VMs.
     • It shows 3 resource types: VM, Datastore, Host System
     • The VM resource kind has 7 metric groups with anomalies.
 The VM resource kind (30 out of 71 Symptoms)
     • 71 – Total number of Symptoms under the VM object
        • We’re reporting on an ESX host here, and VM is a child of host. So all children
          metrics are included.
        • The metric groups come from the vSphere adapter + VC Ops own.
     • 30 – Total number of Displayed Symptoms
        • Based on the limit of 5 metrics shown for each Metric Group
        • The metric groups (CPU Usage, Network, Summary, etc.) are specified by the
          adapter
     • Subcategory Network (3 of 11)
        • 11 – The total number of VMs associated with this ESX host. This is not the
          number of symptoms.
        • 3 – The total number of VMs that have one or more Network symptoms.
 Metrics will not be identical among VMs.
     • Most will be similar though.
     • A multi-vCPU VM will have more vCPU metrics than a 1-vCPU VM.
 Different VMs will have different anomalies
     • They have different workloads.
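
To see how "30 out of 71" can arise from a per-group display limit, here is a small sketch. The per-group counts are invented so that they add up to the slide's numbers, and all group names other than CPU Usage, Network and Summary are placeholders. It also assumes each symptom corresponds to one metric in the group.

```python
def displayed_symptoms(symptoms_per_group: dict[str, int], limit: int = 5) -> tuple[int, int]:
    """Return (displayed, total) when at most `limit` are shown per metric group."""
    total = sum(symptoms_per_group.values())
    displayed = sum(min(count, limit) for count in symptoms_per_group.values())
    return displayed, total


# 7 metric groups with invented counts that total 71.
groups = {
    "CPU Usage": 30, "Memory": 15, "Network": 10, "Summary": 6,
    "Disk": 5, "Datastore": 3, "System": 2,
}
print(displayed_symptoms(groups))  # (30, 71)
```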

78
vCenter Operations presents
     datastore with all the details




79
Storage in vCenter Operations

             Automatic learning of storage
             performance.
             Calculating both Demand and
             Normal rate.




80
vSphere 5 Performance Chart (fat client)




Can only choose 1 component at a time.
E.g. cannot show CPU and RAM at the same time.




81
vSphere 5 Performance Chart (fat client)
         Can only show 1 chart at a time.
         Hence can only show 2 units at a time.




82
vCenter Operation charts

      Can show >1 charts at a time. Can combine/split charts.
      Can show different data type from different objects.
      Line is color coded, showing when threshold is breached.




83
Capacity Management in vSphere is hard


     • CPU Optimizations: vSMP, Shares, Reservations, Limits
     • Memory Optimizations: Transparent Page Sharing, Memory Ballooning, Memory Compression
     • Storage Optimizations: Thin Provisioning, Linked-Clones
     • Clusters: DRS, HA, FT, vMotion, Storage vMotion
     • Workload Flux: VMs growing/shrinking, added/removed

     [Diagram: vSphere capacity, labelled Reserved Capacity, Usable Capacity, Used Capacity and Remaining Capacity (“?”). Result shown: 36 days remaining.]




84
Capacity Management


 Analyze
     • What are my historical utilization trends?
     • What resources have been requested vs. needed?
     • How many more VMs will fit in my current farm?
 Optimize
     • How can I use my resources more efficiently?
     • What VMs should be right-sized?
     • Can I reclaim over-provisioned or unused capacity?
 Forecast
     • When will I run out of capacity?
     • What if I add, remove, reconfigure capacity?
     • Can I defer infrastructure investments?


85
Understanding Behavior
 Need to understand the weekly pattern
     • Business week
     • Weekend
     • E.g. workload spike at 9am on Mondays
 Accomplish through roll-ups
     • Roll-up weeks in a month to compute the typical week for the month
     • Roll-up the typical week of each month into a typical week for the quarter
 Differs from performance management roll-ups
     • Older performance data gets less granular. vCenter loses accuracy
     • Older capacity data maintains its granularity

     [Diagram: roll-up hierarchy. Month 1, Month 2 and Month 3 roll up into Quarter 1, which rolls up into Year 1.]
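
One way to picture the roll-up: keep the same weekly time slots and combine several weeks slot by slot. The sketch below uses a plain average, which is an assumption; the slide does not say how a "typical week" is computed. The function name is made up.

```python
from statistics import mean


def typical_week(weeks: list[list[float]]) -> list[float]:
    """Combine several weeks into one typical week, slot by slot.

    Each week is a list of values for the same time slots (e.g. 168 hourly
    slots), so the weekly pattern keeps its granularity after the roll-up.
    """
    if not weeks:
        return []
    slots = len(weeks[0])
    if any(len(week) != slots for week in weeks):
        raise ValueError("all weeks must have the same number of slots")
    return [mean(values) for values in zip(*weeks)]


# Two slots per week to keep the example short.
month1 = typical_week([[10.0, 20.0], [30.0, 40.0]])
month2 = typical_week([[20.0, 20.0], [20.0, 20.0]])
quarter = typical_week([month1, month2])
print(month1, quarter)  # [20.0, 30.0] [20.0, 25.0]
```

The Monday 9am spike stays visible in the typical week because values are only combined with the same slot from other weeks, never with neighbouring hours.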




86
87
88
89
Planning  Summary  Export




90
Planning  Summary  Resources




91
Planning  Summary  Resources




92
Planning  Summary  Resources




93
What-if

 Visualise
     • Add or remove VMs.
       • Add based on existing VMs as profiles
       • Add based on spec you supply
     • Add, remove, or update hosts.
       • Modify CPU and RAM only. No Network.
     • Add, remove, or update datastores.
       • Update means increase or decrease size.
       • No IOPS yet.
 At a cluster level or host level
     • Cannot do at datacenter or higher level
     • Host level does not make sense when host has HA & DRS turned on
 You can add multiple what-if scenario
     • You can combine them or compare them on the same chart
     • You cannot save. Changes lost upon log-off.
     • You can export the scenario results to an Adobe PDF or CSV file.

94
3 choice of views




95
Average VM Capacity (trend view)




96
97
98
Modeling a what-if scenario




           Change Supply → Change Host/Datastore
           Change Demand → Change VM
               • Based on existing VMs
               • New VM spec




99
Modeling a what-if scenario




100
Modeling a what-if scenario – Specifying VM Configuration




101
Modeling a what-if scenario – Using Existing VMs




                                            Columns you can see




102
Modeling a what-if scenario – Using Existing VMs




103
Modeling a what-if scenario – Using Existing VMs




104
Modeling a what-if scenario – Changing hosts




105
Modeling a what-if scenario – Changing datastores




106
Modeling a what-if scenario




107
108
Capacity state
     [Chart: VM count capacity versus Actual VMs deployed over time, with “today” marked and the current capacity cross-over point highlighted.]




109
Common VM distribution




110
Datastore waste




111
112
Reclaim waste capacity




113
VMs can appear in Stress and Waste at the Same Time




                            Undersized for CPU




                         Oversized for Memory

114
Powered-Off VM and Idle VM: setting




115
Powered-off VMs




116
Capacity Planning: Is the VM really sized properly?

 Setting a threshold of under-utilisation alone is not enough




                  We need to calculate the degree of under-utilisation.




117
Oversized VM & Undersized VM




118
Oversized VMs - Calculation




                  Same concept applies to undersize.
                   Same concept applies to idle VM.




119
Planning  Summary tab




      Planning  Views tab




120
Tips

 Number of intervals and data points used for analysis
      • Tied to your business cycles.
      • Pick the correct number of data points and the interval type to represent a typical business cycle.
      • Match the number of intervals used for trend view and the number of data points used for forecasting
      • Stay with default forecasting algorithm settings
 Leverage buffer settings to accommodate for unforeseen usage spikes or future
      business growth.
      • VC Ops 5 does not yet have “future incoming VM” concept
 Leverage business hours to eliminate off-peak usage
 Don’t be afraid, play with global settings
      • They are just knobs used for data analysis
      • Raw data is not modified when global settings are changed




121
Change Events Correlated with Performance

 Overview
      • Integration between vCM and vC Ops Mgr for change events
      • Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs
      • Launch in context into vCM to see full details of changes and potentially remediate them
 Benefits
      • Enable Operations to quickly understand and resolve performance issues arising from
        configuration changes (reduce MTTR)
      • Drive efficient & effective troubleshooting by correlating Guest OS configuration changes w/
        VM performance degradations
      • In larger enterprise, help bridge gap between VMware Admin and Guest OS Admin




122
VCM Events in vC Ops – Event Collected

 vC Ops does not pull in every event from vCenter
      • Only events that could affect health or workload (vSphere Knowledge!)
 Adapter only pulls in change events for Guest OSs
      • No ESX/i Host configurations changes (these come from vCenter Adapter)
      • Guest OS has to be managed by VCM
 Events collected
      • Reboot
      • Software Install/Uninstall
      • Windows Registry
      • IP/Networking changes
      • Device Driver changes
      • Memory/CPU changes
      • Windows Firewall
      • Patches



123
Event Types in vC Ops Mgr

 Circle Events are vCM Initiated
      • Change log in vCM updated when change is completed
      • Time = Occurred time
 Diamonds are non-VCM-initiated
      • Change log in vCM updated when vCM collects from VM
      • Time = Collected time
 Always Blue Events – “Might” have minimal impact
 vCM events on VMs follow the normal vC Ops display rules
      • vCM Events appear for the VM Object itself
      • vCM Events appear on an ESX host if you enable Child Events




124
125
126
vCM Change Events Correlated with Performance


   A pop-up for a vCM event related to uninstalling a piece of software on the VM
       in question




 127
vCM Change Events Correlated with Performance




 128
Terms
 The terms Attribute, Metric, Counter mean largely the same thing.
      • CPU Ready Time is an attribute.
      • CPU Ready Time from the VM ABC123 is a metric.
      • vSphere uses the word Counter. VC Ops uses Attribute and Metric.
      • As there are many attributes, they are grouped together. This is called an Attribute Package.
 Resource provides the Metrics.
      • Examples of resources: host, VM, datastore, cluster, etc.
      • So a resource provides many attributes.
      • Resources are pulled via an Adapter.
 Kind
      • In VC Ops, there are many kinds of resources.
        So there is a term, Resource Kind, that you need to get used to.
      • VC Ops uses different adapters to talk to different sources. 1 type of
        adapter per source. So there is a term, Adapter Kind.
 Advance terms
      • Container. Super Metrics. Application. Tier. KPI

      [Diagram: one Adapter feeding three Resources, each Resource providing an Attribute.]
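
The relationship between these terms can be sketched as a tiny data model. This is only a reading aid: the class and field names are made up and do not reflect how VC Ops stores things internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    """An attribute as measured on one specific resource."""
    resource_name: str
    attribute: str


@dataclass
class Resource:
    """An object pulled in by an adapter, e.g. a VM or a host."""
    name: str
    resource_kind: str   # e.g. "Virtual Machine"
    adapter_kind: str    # e.g. "VMware Adapter"
    attributes: list[str] = field(default_factory=list)

    def metrics(self) -> list[Metric]:
        # Attribute + resource = metric.
        return [Metric(self.name, attribute) for attribute in self.attributes]


vm = Resource("ABC123", "Virtual Machine", "VMware Adapter", ["CPU Ready Time"])
print(vm.metrics())
```

"CPU Ready Time" on its own is the attribute; the pair (ABC123, CPU Ready Time) is the metric.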



129
Adapter, Resource, Attribute, Package


      VC Ops                              Adapter                          Source of data

                                     VMware Adapter                       vSphere 5

                                     VCM Adapter                          VCM 5.4

                                     VC Ops Adapter                       VC Ops 5

                                     Container Adapter




 Adapter Kind = adapter type. VMware Adapter is an example of Adapter Kind.
 1 Adapter Kind can have many kinds of objects that it pulls from the source. This is called Resource Kind.
 To make management of attributes easier, they are put into a Package. Inside a package, metrics are grouped for ease of use.
 Container Adapter is not actually an adapter. It’s a group or container that can hold other objects.

 [Callout on screenshot: This is the actual Resource Kind brought by VMware Adapter]




130
Actual Resource Kinds
 Sample adapters with their associated resource kinds.




                                 This is a special & built-in adapter. It monitors VC Ops itself!
                                 VC Ops is just an application, which also needs monitoring.

                                 This is another special & built-in “adapter”. Technically, this is
                                 actually not an adapter, as it’s just a container.




131
vSphere resource kinds

 Unlike the Advanced edition, we can utilise Folder and Resource Pool
      • This means you can create Super Metric at this level.
      • Complement vCenter.


                                                                                        Not used?


                                                                                        ESX Host

                                                                                        Not used?




                                            No vApp, no Datastore Group, no vDS as of
                                            VC Ops 5.




132
Resource Kind: default settings




133
Attribute & Attribute Package
 Package
      • A collection of Attributes from 1 Resource with the same collection interval. That’s all!
         • Need to map it to objects
         • Super Metric must be placed into a package
      • A package cannot come from multiple resources. See screen below.
         • Cannot create a package that has both VM and ESXi
      • There is a default package called All Attributes.




134
135
136
137
138
Editing a resource property




139
140
Resource Kind: Tags


                      What’s the difference between Applications and Application? Looks like
                      Application is from the Container adapter, which is built-in.



                      Maintenance schedule contains the time a particular object is on scheduled
                      downtime. It tells VC Ops to ignore the object during that time; otherwise it would raise an alert, as the
                      behaviour is unexpected. It would think the health dropped!
                      So in this screen, ignore maintenance schedule as it should not be part of
                      Resource Kind.


                      The range for Health. This is not the same as the Health badge in VC Ops
                      Advance, as this is universal and applies beyond vSphere. Health in the Advance
                      edition includes Fault, which is vSphere specific.




                      Tier is a special container. Again, this is universal, so name your tier properly to
                      avoid changing name later on.




                      Only 1 value here. This means the entire VC Ops.




141
Resource Kind: Tags

 You can control which resource kinds
      are shown
      • In the picture below, ESX was hidden.




142
Predefined Tags




143
Drag selected objects to the tag value


144
Resource Kind: Tags




145
VC Ops generated metrics




146
Monitoring the big workload

 You have convinced your CIO to virtualise the remaining 50% of the servers.
 Your CIO needs you to prove, supported by performance charts, that the platform has
      served every VM well, meeting the SLA in the past 1 quarter.
      • Tier 1 cluster SLA: 2% CPU Ready, 0 RAM Ballooning, 10 ms disk latency, 0 dropped packets.
      • Tier 2 cluster SLA: 4% CPU Ready, 5% RAM Ballooning, 20 ms disk latency, 0 dropped packets.
      • Tier 3 cluster SLA: 6% CPU Ready, 10% RAM Ballooning, 30 ms disk latency, 0 dropped packets.
 You have 500 VM on 50 ESXi, 8 clusters, 40 datastores, 5 RDM.
 You must prove that:
      • Not a single Tier 1 VM has >2% CPU Ready in the past 1 quarter. The underlying ESXi also has
        <2% CPU contention.
      • Not a single Tier 1 VM has >10 ms disk latency in the past 1 quarter. The underlying ESXi also has
        <10 ms disk latency.
      • Etc, for each Tier and each component (CPU, RAM, Disk, Net)


                         What kind of charts do you need to show?
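
One way to frame the proof is as a per-tier threshold check over the quarter's samples. The sketch below is illustrative only: the limits mirror the tier SLAs on this slide, while the counter names, the sample layout and the function `sla_breaches` are assumptions made for the example, not VC Ops output or API.

```python
# SLA limits per tier, copied from the slide. Counter names are hypothetical.
SLA = {
    1: {"cpu_ready_pct": 2, "balloon_pct": 0, "disk_latency_ms": 10, "dropped_packets": 0},
    2: {"cpu_ready_pct": 4, "balloon_pct": 5, "disk_latency_ms": 20, "dropped_packets": 0},
    3: {"cpu_ready_pct": 6, "balloon_pct": 10, "disk_latency_ms": 30, "dropped_packets": 0},
}


def sla_breaches(tier, samples):
    """Return (sample_index, counter, value) for every sample above the tier limit."""
    limits = SLA[tier]
    breaches = []
    for i, sample in enumerate(samples):
        for counter, limit in limits.items():
            value = sample.get(counter, 0)
            if value > limit:
                breaches.append((i, counter, value))
    return breaches


# Made-up samples for one Tier 1 VM; a real check would cover the whole quarter.
vm_samples = [
    {"cpu_ready_pct": 1.2, "balloon_pct": 0, "disk_latency_ms": 4, "dropped_packets": 0},
    {"cpu_ready_pct": 2.5, "balloon_pct": 0, "disk_latency_ms": 12, "dropped_packets": 0},
]

print(sla_breaches(1, vm_samples))
```

An empty list for every VM and every host in a tier is the "not a single VM" evidence the CIO is asking for; any entry points at the chart you need to show.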


147
Super Metrics




148
Super Metric: Functions
 2 types:
      • Looping functions: take multiple input values
         • Average, sum, min, max, count, combine, etc.
         • More practical or useful than single functions
      • Single functions: take 1 value
         • Absolute, round up, round down, square root, etc.
 The xxxN functions do not work on just the immediate children. They look down
      (or up) the number of levels specified in the formula.
      • In the example, the ‘2’ tells the function to look
        down two levels for
        the metric.
      • Putting -2 means look up two levels instead.
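
To make the depth idea concrete, here is a minimal sketch, assuming a toy resource tree held in plain dictionaries. The names `values_at_depth` and `avg_n` and all the numbers are hypothetical; this is not how VC Ops implements super metrics, and looking up (a negative depth) is not modelled.

```python
import math

# Hypothetical resource tree: cluster -> hosts -> VMs, each with a CPU usage value.
tree = {
    "name": "Cluster-A", "cpu": None,
    "children": [
        {"name": "esx-01", "cpu": 40.0, "children": [
            {"name": "vm-1", "cpu": 10.0, "children": []},
            {"name": "vm-2", "cpu": 30.0, "children": []},
        ]},
        {"name": "esx-02", "cpu": 60.0, "children": [
            {"name": "vm-3", "cpu": 50.0, "children": []},
        ]},
    ],
}


def values_at_depth(node, depth):
    """Collect the metric from resources exactly `depth` levels below `node`."""
    if depth == 0:
        return [node["cpu"]]
    values = []
    for child in node["children"]:
        values.extend(values_at_depth(child, depth - 1))
    return values


def avg_n(node, depth):
    """Looping function: averages the many input values found `depth` levels down."""
    values = values_at_depth(node, depth)
    return sum(values) / len(values)


print(avg_n(tree, 1))              # depth 1: average over the hosts
print(avg_n(tree, 2))              # depth 2: average over the VMs
print(math.floor(avg_n(tree, 2)))  # a single function applied to one value
```

The looping function consumes a list of values, while the single function (here, round down) works on one value at a time.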




149
Super Metric: hierarchy

 Example: super metric for Average CPU usage of a cluster




                                                             VM is 2 level down
                                                             from cluster.




150
151
152
Super Metric: Operators

 To calculate a value for each VM based on metrics for that VM, use the ‘$This’
      operator.




 Another example: max ( $This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)
 This finds the maximum value among the following:
      • The CPUavg metric for the resource to which the super metric is assigned (so this is dynamic)
      • The CPUavg metric for a specific resource called ESXi-Host-003 (so this is hardcoded)
      • The CPUavg metric for all resources of type VM (so this is universal for all VMs)
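
The three kinds of operand can be sketched as follows. This is an illustration under stated assumptions: the `cpu_avg` dictionary, its values, the resource-kind label "HostSystem" and the function `super_metric_max` are all made up for the example, and only ESXi-Host-003 and CPUavg come from the slide.

```python
# Hypothetical latest CPUavg values, keyed by (resource kind, resource name).
cpu_avg = {
    ("HostSystem", "ESXi-Host-003"): 55.0,
    ("HostSystem", "ESXi-Host-001"): 80.0,
    ("VM", "vm-1"): 20.0,
    ("VM", "vm-2"): 50.0,
}


def super_metric_max(this_resource):
    """Evaluate max($This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg) for one resource."""
    this_value = cpu_avg[this_resource]                     # dynamic: depends on assignment
    fixed_value = cpu_avg[("HostSystem", "ESXi-Host-003")]  # hardcoded resource
    vm_values = [v for (kind, _), v in cpu_avg.items() if kind == "VM"]  # every VM
    return max([this_value, fixed_value] + vm_values)


print(super_metric_max(("HostSystem", "ESXi-Host-001")))
print(super_metric_max(("VM", "vm-1")))
```

Only the first operand changes when the same super metric is assigned to a different resource; the other two give the same values everywhere.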




153
154
155
Super Metric: package




156
157
158
Discussion Point



               Think of super metrics that you need.
              Explain why and how you will need them.




159
Applications and Application Tiers
 App teams often view things from their own application-centric perspective. We can create a custom dashboard showing their
      “Application”.
 Even better if we add non-vSphere data, such as data from Hyperic. This gives app-level and Guest OS-level info, which is
      not available in the vSphere adapter.
 Define your own hierarchy and relationships.




160
Drag selected objects to the tag value




161
Parent-Child Resource Relationships




162
163
What counters do you check?

Component   ESX                                                      VM
CPU         Usage or Utilisation: overall CPU utilisation (to get    Usage or Utilisation: overall CPU utilisation
            overall utilisation of the entire box)
            Usage or Utilisation: individual core utilisation (to    Usage or Utilisation: individual core utilisation
            see distribution and if any particular core is maxed
            out)
            Wait (wait for IO. To see if it’s IO bound)              Wait (wait for IO. To see if it’s IO bound)
            Ready (VM unable to run, waiting for core)               Ready (VM unable to run, waiting for core)
            Co-Stop (if there are large VMs)                         Co-Stop (if there are large VMs)
RAM         Ballooning                                               Ballooning
            Active or Active Write                                   Active or Active Write
Storage     Latency: kernel latency, device latency.                 Guest Latency
            Device Latency                                           Throughput
            Throughput                                               IOPS
            IOPS
Network     Dropped packets                                          Dropped packets
            Throughput                                               Throughput
Others      vSphere Replication?                                     System?
            Cluster service?
164
Test your vSphere knowledge!
      How are Disk, Datastore,
      Adapter and Path related?




165
CPU counters




               Test your vSphere knowledge!
               Which one is ESX, which one
               is VM? How do you know?

               Test your vSphere knowledge!
               What can stop or block a VM
               from getting the CPU it was
               configured with?


               No more Collection Level
               limitation. VC Ops collects
               all the counters and analyses
               them all.
               Changing the collection level in
               vCenter does not impact VC
               Ops, as VC Ops reads from the
               “real-time” statistics.




166
%OVRLP and %SYS


        [Diagram: world states Run, Wait and Ready; a timeline shows World 1 with %RUN, %SYS and %OVRLP,
         and World 2 with %RUN. Note on the diagram: %RUN continues to accumulate, but %OVRLP kicks in.]



              %OVRLP   Overlapping time. A world still wants CPU but is interrupted by another world.
                       A high number normally means ESX is experiencing heavy IO.
                       %USED = %RUN + %SYS - %OVRLP
                       As a result, the overlap value does not incorrectly inflate %USED.
              %SYS     A high number means heavy IO or interrupts.
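
The %USED formula is simple arithmetic. A minimal sketch with made-up percentages (the function name `pct_used` is hypothetical):

```python
def pct_used(run, system, ovrlp):
    """%USED = %RUN + %SYS - %OVRLP, with all inputs and the result in percent."""
    return run + system - ovrlp


# Made-up values: 60% run, 5% system time, 3% overlap.
print(pct_used(run=60.0, system=5.0, ovrlp=3.0))
```

Subtracting the overlap keeps time spent servicing another world from being counted as part of this world's usage.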



167
Memory counters
         ESXi     VM




168
Storage counters: ESXi host
           Datastore                    Disk




      Storage Adapter or Storage Path




169
ESXi: Adapter, Device and Path




                                  1 adapter can access many Devices (LUNs).
                                 1 Device is accessed via many paths.
                                   1 path can only access 1 Device.
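
These three rules can be captured in a small data model. The sketch below is illustrative: the `Path` class, the path names and the LUN names are assumptions, not values read from an ESXi host.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    adapter: str  # the adapter the path goes through, e.g. "vmhba2"
    device: str   # the single device (LUN) this path reaches


# Hypothetical paths: lun-0 is reachable through two adapters.
paths = [
    Path("vmhba2:C0:T0:L0", "vmhba2", "lun-0"),
    Path("vmhba3:C0:T0:L0", "vmhba3", "lun-0"),
    Path("vmhba2:C0:T0:L1", "vmhba2", "lun-1"),
]

devices_by_adapter = defaultdict(set)
paths_by_device = defaultdict(list)
for p in paths:
    devices_by_adapter[p.adapter].add(p.device)  # 1 adapter, many devices
    paths_by_device[p.device].append(p.name)     # 1 device, many paths

print(sorted(devices_by_adapter["vmhba2"]))
print(paths_by_device["lun-0"])
```

Because each `Path` carries exactly one `device`, the "1 path can only access 1 Device" rule is enforced by the structure itself.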




170
ESXi: Disk




171
ESXi: Adapter, Device and Path




                                                     ESXi 5.0
      vmnic         Storage Adapter 1                                                Storage Adapter 2
                                           vmhba2                                                          vmhba3




              Storage Path          Storage Path    Storage Path          Storage Path                   Storage Path          Storage Path
                                                                               vmhba3



      NFS                    VMFS                                  VMFS                                                 RDM
  Datastore             Datastore                             Datastore




                             Disk                                  Disk                                                 Disk




172
Storage counters: VM
       Virtual Disk (VMDK, RDM)


                                                VM

                                   Drive 1     Drive 2            Drive 3
                                    vDisk       vDisk             vDisk
                                                        scsi0:0           scsi0:2



           Datastore               VMFS         NFS                RDM
                                  Datastore   Datastore




                                    Disk                           Disk


             Disk




173
Network counters

        ESXi




        VM




174
Other Counters: ESXi Host
        vSphere Replication   System (vmkernel)




                                       See
                                       next
                                     2 slides
                                     for info




          Cluster Service




              Power




175
176
A long list of vmkernel
      resources. Some are familiar,
      such as vMotion, FT, hostd,
      Vpxa, DCUI, logging




177
178
Widget




179
Widget: Full List




180
Dashboard: creating a new Tab




181
Alerts




182
Application Overview and Application Detail




183
184
185
Data Distribution




186
187
188
Health Status




189
Health Status




190
Health Tree




191
Health Tree




192
Health Tree




193
Advanced Health Tree




194
Advanced Health Tree




195
Scoreboard: Health or Workload




196
197
Scoreboard: Generic




198
199
Heat Map




200
201
202
Mashup Charts




203
204
Mashup Chart




205
Metric Graph




206
207
Metric Graph (Rolling View)




208
209
Metric Selector




210
Metric Sparklines




211
212
214
Resources




215
216
Tag Selector




217
Top-N Analysis




218
219
Geographic




220
The VC Relationship

 There are 2 widgets that are vSphere related.
 Use the Advanced edition instead.
      • The Enterprise edition can access the Advanced edition UI at the same time. Just open another window
        or tab.




221
Interaction between widgets

 Controlled at the dashboard level, not at the individual widget level
 There is a providing widget and a receiving widget




222
Interaction between widgets




223
Interaction between widgets




224
Practice session: creating your dashboard

 Goal: have a dashboard that helps you investigate all non-local datastores quickly
      • Be able to plot charts for all non-local datastores for comparison.
 Answer:
      • Create a tag called Storage from the Environment screen.
        • Create 1 tag value: Shared Datastore
      • Tag all the non-local datastores with this tag value
        • Done manually. Simply drag all the rows
      • Create a dashboard with 4 widgets
        • Health Status
           • This is where you show the overall health of all non-local datastores
        • Resources
           • This is where you show all the datastores tagged as non-local (the Shared Datastore tag value)
        • Metric Selector
           • All the metrics will appear here.
           • Select the metric you want
        • Metric Graph or Metric Sparklines
           • Choose Sparklines if you have lots of graphs.

225
226
vCenter “equivalent” dashboard




227
Configuration




228
229
230
Cross Silo




231
Fingerprint




232
Maintenance Mode




233
Maintenance Schedules




234
Major Steps in implementation


 Define who needs what -> Create Super Metrics -> Create Applications -> Create Tags -> Create Heat Maps -> Create Dashboards




 Begin with the end in mind
      • Every Super Metric must serve a particular role
        • Role, not individual. A person can & will have many heatmaps/dashboards.
      • Decide if you need the following non-standard info
        • Application-level & Guest-OS-level info
        • Info from physical machines (UNIX, x64, etc.)
        • Info from physical storage and network (switch, FW, router, etc.)
 Think in terms of applications
      • A great way to complement vSphere, as vCenter does not have this object.




235
Who needs to see what

 CIO or CTO
      • Simple dashboard.
      • Big picture. Tends to be application focused.
      • No absolute data. Normalised to 0-100.
      • Focus on long term.
      • Averaged data. A 30-minute spike will not show up.
      • Updated daily.
 Group Head, e.g. Head of Infra, Head of Apps
 Dept Head, e.g. Head of Storage, Head of Server, Head of Network, Head of Databases
 Admin/Architect, e.g. Storage Admin, Network Admin, App Owner, VM Owner
      • Rich dashboard. Ideally on a Full HD screen.
      • Specific info.
      • Absolute data + normalised data.
      • Focus on short term.
      • Actual data. A 5-minute spike will be visible.
      • Updated every 2 minutes.
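
Why an averaged view hides a spike can be shown with simple arithmetic. The sketch below assumes made-up CPU samples at 5-minute intervals; the numbers are illustrative, not product data.

```python
# One day of hypothetical CPU usage samples at 5-minute intervals (288 samples).
samples = [20.0] * 288

# A 30-minute spike to 100% is six consecutive 5-minute samples.
for i in range(100, 106):
    samples[i] = 100.0

daily_average = sum(samples) / len(samples)
peak = max(samples)

print(round(daily_average, 2))  # what a daily averaged view reports
print(peak)                     # what a view of the actual samples reveals
```

The daily average moves by less than two percentage points, so the spike is effectively invisible to the CIO view, while the admin view of the raw samples shows it clearly.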
236
Who needs to see what (samples)

Roles                    Info presented
CIO                      Health of overall IT in the past 1 month
                         Health of key applications in the past 1 month
CTO                      As above, but with more technical content, and tailored to him.
Head of Applications     Health of all key apps in the past 1 month, with the ability to do 1 level drill down for each app.
                         Capacity projection for all key apps.
Head of Infrastructure   Health of Storage
                         Health of Network
                         Health of Servers (VMware and Physical)
                         Health of VM
Head of Storage          A higher level, simpler dashboard than Storage Admin
Head of Network
VMware Team
An App Owner             The infra is providing each of the VMs in my App with the resources it needs




237
Designing Super Metric
 Leverage existing derived metrics.
 Leverage objects for which vCenter cannot provide performance data
      • Application, Resource Pool, Folder and Location can now have performance counters
 Minimise static alerts.
 Know what a good range is for the end result.
 Build a simple table to avoid super metric sprawl and duplication of existing metrics
      • Below is an example, showing 2 Super Metrics.

Name        Purpose                          Target Role    Formula                                           Good Range
VM SLA      Shows that a VM gets the         VM Owner       VM SLA = 100% - Max (CPU, RAM, Disk, Network)     >99% (Tier 1 cluster)
            resources it wants from                         CPU = CPU Contention %.                           >97% (Tier 2 cluster)
            infrastructure based on the                     RAM = RAM ballooning %.                           >95% (Tier 3 cluster)
            defined SLA.                                    Disk = % above threshold latency.
                                                            Network = Packet Drop %.
                                                            Tier 1 Disk SLA is 10 ms.
                                                            Tier 2 Disk SLA is 20 ms.
                                                            Tier 3 Disk SLA is 30 ms.
Infra SLA   Show that the underlying infra   VMware Admin   Infra SLA = 100% - Max (Host Cluster, Datastore
            has the resources for all the                   Cluster)
            VMs on it
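
The VM SLA formula in the table can be worked through as plain arithmetic. This is a sketch under stated assumptions: the function names and input values are hypothetical, and the thresholds are the Good Range values from the table. It is not super metric syntax.

```python
def vm_sla(cpu_contention_pct, ram_balloon_pct, disk_over_threshold_pct, packet_drop_pct):
    """VM SLA = 100% - Max(CPU, RAM, Disk, Network), with all inputs in percent."""
    worst = max(cpu_contention_pct, ram_balloon_pct,
                disk_over_threshold_pct, packet_drop_pct)
    return 100.0 - worst


# Good Range per tier, taken from the table: the score must be above this value.
GOOD_RANGE = {1: 99.0, 2: 97.0, 3: 95.0}


def meets_sla(tier, score):
    """True when the VM SLA score is inside the good range for the tier."""
    return score > GOOD_RANGE[tier]


# Made-up inputs: disk latency was above the tier threshold 2% of the time.
score = vm_sla(0.5, 0.0, 2.0, 0.0)
print(score, meets_sla(1, score), meets_sla(2, score))
```

Taking the maximum means a single bad component is enough to pull the score down, which matches the intent that the VM must get every resource it wants.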

238
Custom Heat Map or Cold Map

Component   Heat Map                                                   Cold Map
CPU         Resource pool: size by CPU utilisation,                    Least utilised VM: size by vCPU count, color by RAM + CPU
                                                                       usage (a Super Metric)
RAM         Most RAM intensive VMs, grouped by ESX. Size by RAM
            utilisation, color by health
Disk        Most disk intensive VMs, grouped by ESX. Size by disk      Least utilised disk: size by GB, color by % of free
            utilisation, color by health
Network     Most network intensive VMs, grouped by ESX. Size by        Most idle VMs, grouped by host
            network utilisation, color by health
Capacity    VMs with file system that will run out soon. Color by %
            left, size by GB left.
Health      VM health, grouped by cluster. Color by health, size by
            workload.


 Design considerations
      • Use Super Metrics so the info is richer.
      • Group VMs by 1 consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore, as 1
        datastore can span multiple clusters.




239
vCenter: network impact of vCenter Ops




240
Choice of Tools

 vCenter Operations
      • 1-15 minutes accuracy (for other sources)
      • 5 minutes accuracy (for vSphere)
      • The issue does not need to be reproducible, but it should last >5 minutes, preferably 15 minutes (3 samples)
 vCenter
      • 20 – 300 seconds accuracy
      • Reproducible performance issue
      • Requirements: you already have some idea of what causes it
 esxtop
      • 2 – 20 seconds accuracy. Short burst problem.
      • Reproducible performance issue
      • Requirements: you already know which ESX & VM have the problem.
 vscsiStats
      • Specific to storage, low-level analysis


241
242
243
244
245

vCenter Infrastructure Navigator 1.1 - What's NewvCenter Infrastructure Navigator 1.1 - What's New
vCenter Infrastructure Navigator 1.1 - What's New
 
E1000 is faster than VMXNET3
E1000 is faster than VMXNET3E1000 is faster than VMXNET3
E1000 is faster than VMXNET3
 
vSphere 5 What's New - Profile Driven Storage
vSphere 5 What's New - Profile Driven StoragevSphere 5 What's New - Profile Driven Storage
vSphere 5 What's New - Profile Driven Storage
 
Introduction - vSphere 5 High Availability (HA)
Introduction - vSphere 5 High Availability (HA)Introduction - vSphere 5 High Availability (HA)
Introduction - vSphere 5 High Availability (HA)
 
Introduction - vSphere Storage Appliance
Introduction - vSphere Storage ApplianceIntroduction - vSphere Storage Appliance
Introduction - vSphere Storage Appliance
 
What’s new in vShield 5
What’s new in vShield 5What’s new in vShield 5
What’s new in vShield 5
 
What’s New in vCloud Director 1.5
What’s New in vCloud Director 1.5What’s New in vCloud Director 1.5
What’s New in vCloud Director 1.5
 
vSphere 5 - Image Builder and Auto Deploy
vSphere 5 - Image Builder and Auto DeployvSphere 5 - Image Builder and Auto Deploy
vSphere 5 - Image Builder and Auto Deploy
 
What’s New in VMware vCenter Site Recovery Manager v5.0
What’s New in VMware vCenter Site Recovery Manager v5.0What’s New in VMware vCenter Site Recovery Manager v5.0
What’s New in VMware vCenter Site Recovery Manager v5.0
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
Vblock Infrastructure Packages — integrated best-of-breed packages from VMwar...
Vblock Infrastructure Packages — integrated best-of-breed packages from VMwar...Vblock Infrastructure Packages — integrated best-of-breed packages from VMwar...
Vblock Infrastructure Packages — integrated best-of-breed packages from VMwar...
 
Managing V Sphere With The Vesi
Managing V Sphere With The VesiManaging V Sphere With The Vesi
Managing V Sphere With The Vesi
 
Managing V Sphere With The Vesi
Managing V Sphere With The VesiManaging V Sphere With The Vesi
Managing V Sphere With The Vesi
 

Dernier

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Dernier (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

vCenter Operations 5: Level 300 training

  • 1. vCenter Operations 5: Level 300 training. Singapore, Q2 2012. Iwan ‘e1’ Rahabok, VCAP-DCD, Staff SE, Strategic Accounts. e1@vmware.com | Skype: e1_ang | 9119-9226 | Linkedin.com/in/e1ang. © 2010 VMware Inc. All rights reserved
  • 2. Document Information ▪ This deck is part 2 of a series. • Part 1 is Management in the Virtual World: a technical introduction. • http://communities.vmware.com/docs/DOC-17841 ▪ This deck has prerequisites • Intro video: http://www.youtube.com/watch?v=Z-DJuTiqKag • VC Ops 5 technical introduction at Vault or Partner Central. ▪ This deck only covers vCenter Operations (Enterprise + Advanced) • Focuses on concepts and ‘under the hood’ details to help you understand the product more deeply. • Does not cover: competitive, installation, configuration • Does not run through feature after feature. • See the official training deck for that at Vault or Partner Central. • vCenter Operations modules that it does not cover: • Chargeback • Infrastructure Navigator • Configuration Manager ▪ This is very long training material. Use the Section feature to see how it is organised. ▪ Further reading • virtual-red-dot.blogspot.com
  • 3. Table of Contents ▪ Built for vCenter Standard ▪ Core: Metrics, Threshold, Analytics ▪ Badges ▪ Heat Map ▪ Smart Alert ▪ Details & Charts ▪ Capacity Management ▪ Settings ▪ VCM integration ▪ Concepts & Advanced Concepts ▪ Deep dive into Metrics ▪ Dashboard and Widgets
  • 4. Managing Performance/Capacity in vSphere: the basics ▪ Is it healthy? • Is every VM & ESX performing well? CPU, RAM, Network, Disk? • Are they behaving as expected? • Any fault on any component? ▪ Is it enough? • Enough CPU, RAM, Network, Disk? Future risk? • Time remaining? • Capacity remaining? • Where are the “Stress points” in time? ▪ Is it optimised? • Which VMs need adjustment? • What are my key ratios? • How much can I claim back from “fat” VMs? • How many more VMs can I put without impacting performance?
  • 5. Direct Mapping by vCenter Operations ▪ Is it healthy = Health • Workload • Anomalies • Faults ▪ Is it enough = Risk • Time remaining • Capacity remaining • Stress period ▪ Is it optimised = Efficiency • What can we reclaim? • Density. Key ratios for management ▪ Daily update at midnight
  • 7. Visibility across vCenters. Sample from ASEAN Lab: 6 vCenters, a mix of Appliance and Windows. 2 are in Linked Mode (SRM).
  • 8. Performance Troubleshooting: a day in the life… ▪ You got an email from the app team, saying the main Intranet application was slow. • The email was sent 1 hour ago. It stated that the application was slow for 1 hour and was OK after that. • So it was slow between 1 and 2 hours ago, but is OK now. • You did a check. Everything has indeed been OK in the past hour. • The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM. • You are not familiar with the application. You do not know what apps run on each VM as you have no access to the Guest OS. • Your environment: 1 VC, 4 clusters, 30 hosts, 300 VMs, 20 datastores, 1 midrange array, 10 GE FCoE. Test your vSphere knowledge! How do you solve/approach this with just vSphere? What do you do? ▪ A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE. ▪ B: No sweat, you’re VCDX + CCIE + ITIL Master. You were born for this. ▪ C: SMS your wife, “Honey, I’m staying overnight at the datacenter.” ▪ D: Take blood pressure medicine so it won’t shoot up. ▪ E: Buy the app team a very nice dinner, and tell them to keep quiet.
  • 9. Performance Troubleshooting: a day in the life… ▪ The minimum you need to prove • The performance problem is not caused by your infrastructure, or at least not by your VMware layer. • Infrastructure = VMware + Storage + Network • Application = VM + App inside the VM ▪ What you need to prove • For each of the 10 VMs, the following were OK between 1 and 2 hours ago: CPU, RAM, disk, network • To strengthen the above, prove that: • The shared infrastructure was also healthy: relevant ESX, relevant datastore • The overall platform was also healthy. • No relevant faults happened 1-2 hours ago. • Give the list of ports (that the 10 VMs use) to the network team to ensure the firewall is not dropping them. ▪ What challenges do you face in vSphere to do the above? • Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have? ▪ The ideal you need to prove • Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that caused it. In other words, application-specific + root-cause analysis.
  • 10. Challenge 1: details are lost after 1 hour
  • 11. Challenge 1: details are lost after 1 hour. The following counters are lost: 1. Used 2. System 3. Idle 4. Latency 5. Overlap 6. Demand 7. Wait 8. Run 9. Swap wait 10. Max Limited
  • 12. Challenge 1: details are lost after 1 hour. [Screenshots: Memory counters and Disk counters, each shown for <1 hour and >1 hour]
  • 13.
  • 14. Challenge 2: no application awareness
  • 15.
  • 16.
  • 17. Deep understanding of vCenter is required. Here is a common example of why a deep understanding of vSphere counters makes a huge difference. Buy more RAM?
  • 18. Deep understanding of vCenter is required. Yes, buy more RAM. The ESXi host has 32 GB RAM and it is highly utilised.
  • 19. Deep understanding of vCenter is required. What?! It’s been high constantly for the last 24 hours! Better buy more RAM now. But hang on! This is the ESXi-06 host in the VMware ASEAN lab. We know who uses it. vCenter Ops shows very different data: memory is only at 32%. Plenty of headroom.
  • 20. vCenter Ops shows very different data: memory is only at 32%. Plenty of headroom. It just saved us from a costly RAM upgrade project.
  • 21. Live Demo: 1 engine, 2 UIs. Dashboard. Badges. Configuration.
  • 22. Counters and Badges ▪ A vCenter farm with 500 VMs and 50 ESX hosts will have >10000 counters! • It is not humanly possible to look at them, let alone analyse them. ▪ vCenter presents raw counters • e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart? • e.g. Is memory.usage at 90% at ESXi level good or bad? • e.g. Is an IOPS of 300 good or bad for datastore XYZ? ▪ A single counter can be misleading • e.g. Low CPU usage does not mean the VM is getting the CPU, if there is Limit, Contention and Co-Stop. • e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical) ▪ Different counters have different units • GHz, %, MB, kbps, ops/sec, ms • This makes analysis even more complex ▪ Derived Counters • Standardise the scale into 0 - 100. 1 universal unit. Minimises the “translation” in our head. • Can be >100 if demand is unmet. • Universal. Apply to CPU, RAM, Disk, Net, etc. • Counters are derived using sophisticated formulas, not just aggregated. • For the same counter, different objects use different formulas.
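The Ready Time question on this slide is a good illustration of why raw counters need translating. A minimal sketch, assuming CPU Ready is reported as a summation in milliseconds over the chart's sampling interval, and assuming the usual intervals of 20 seconds for the Real Time chart and 300 seconds for the Daily chart; the function name `ready_percent` is purely illustrative.

```python
# Sampling interval per chart type, in seconds (assumed: 20 s Real Time, 300 s Daily).
INTERVAL_SECONDS = {"realtime": 20, "daily": 300}

def ready_percent(ready_ms: float, chart: str) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the sampling interval."""
    interval_ms = INTERVAL_SECONDS[chart] * 1000
    return ready_ms / interval_ms * 100

# The three raw values quoted on the slide.
for ms, chart in [(1500, "realtime"), (2000, "realtime"), (75000, "daily")]:
    print(f"{ms} ms on the {chart} chart = {ready_percent(ms, chart):.1f}% ready")
```

Under these assumptions, 2000 ms in Real Time is about 10% ready while 75000 ms in the Daily chart is about 25%, so the raw numbers cannot be compared without knowing the interval.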
  • 23. Samples of Derived Metric: Health ▪ Health Score of an Object = MAX (Abnormal Workload, Faults) • Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)), Workload) • Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric) • Fault depends on the object: • Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master) • Host = MAX (Hardware Issues, HA Issues) • Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues) • Network Issues = MAX (Network, DVPort, VMNic), where Network = Max_of_all_instances (Network Device), DVPort = Max_of_all_instances (DVPort Device), VMNic = Max_of_all_instances (VMNic Device) • Storage Issues = MAX (Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage), where Storage = Max_of_all_instances (Storage Device), SCSI = Max_of_all_instances (SCSI Device), VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device), NFS server = Max_of_all_instances (NFS server Device) • Compute Issues = MAX (Error, PCIe) • CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other Health, IPMI, BMC) • HA Issues = HA Host Status • VM = MAX (FT Issues, HA Issues)
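The formulas on this slide compose a geometric mean with nested MAX operations. A minimal sketch with made-up input scores; all function names are hypothetical, and because the slide does not define "Score Aggregation", a plain maximum stands in for it here. This only mirrors the formulas as written and is not the product's implementation.

```python
from math import sqrt

def abnormal_workload_per_metric(abn_capacity, abn_demand, workload):
    """Geometric mean of (the worse of the two abnormalities) and the workload."""
    return sqrt(max(abn_capacity, abn_demand) * workload)

def host_fault_score(hardware_issues, ha_issues):
    """Host = MAX(Hardware Issues, HA Issues); Hardware Issues is itself a MAX."""
    return max(max(hardware_issues.values()), ha_issues)

def health_input(abnormal_workloads, fault_score):
    """MAX(Abnormal Workload, Faults); per-object aggregation assumed to be max."""
    return max(max(abnormal_workloads), fault_score)

# Made-up scores for one host.
per_metric = [
    abnormal_workload_per_metric(20, 40, 90),  # CPU: sqrt(40 * 90)
    abnormal_workload_per_metric(10, 5, 40),   # RAM: sqrt(10 * 40)
]
faults = host_fault_score(
    {"network": 0, "storage": 25, "compute": 0, "cim": 0}, ha_issues=0
)
print(per_metric, faults, health_input(per_metric, faults))
```

Using MAX at every level means a single bad component is never averaged away by many healthy ones.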
  • 24. Threshold: a shift in mindset needed ▪ vCenter sets “static” thresholds, which can be misleading • During peak, it is common for a VM to reach high utilisation. • A static threshold will generate alerts when it should not. • The vSphere admin quickly learns to ignore them, defeating the purpose of the alert to begin with. • During non-peak, it might be abnormal for a VM to reach even 50% utilisation. • A static threshold will not generate alerts when it should. ▪ vCenter only sets a high threshold • Do you set a static threshold for when CPU or RAM utilisation drops below 5%? • A drop in IOPS across the entire storage array might be a sign of a terrible day ahead. • It will not alert when these happen: • Utilisation drops from 75% to 1% when it should not. • Utilisation changes from 5% to 70% when it should not. • We need to plot both the upper range and the lower range ▪ But each VM differs. And the same VM differs depending on day/time… • Intelligence is required to analyse each metric and its expected “normal” behaviour.
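The two missed cases on this slide (75% dropping to 1%, and 5% jumping to 70%) can be shown with a few lines. A minimal sketch, assuming a hypothetical static high threshold of 90% and hand-picked expected ranges; the function names and numbers are illustrative only.

```python
def static_alert(value: float, high: float = 90.0) -> bool:
    """High-only check: alert only when the value exceeds a fixed mark."""
    return value > high

def range_alert(value: float, lower: float, upper: float) -> bool:
    """Two-sided check: alert when the value leaves the expected range either way."""
    return value < lower or value > upper

# (label, observed value, expected range at that time) -- all made up.
cases = [
    ("busy hour, drops to 1%", 1.0, (60.0, 85.0)),
    ("quiet hour, jumps to 70%", 70.0, (2.0, 10.0)),
    ("busy hour, normal 75%", 75.0, (60.0, 85.0)),
]
for label, value, (lo, hi) in cases:
    print(label, "| static:", static_alert(value), "| range:", range_alert(value, lo, hi))
```

The static check stays silent in all three cases, while the two-sided range flags the first two and correctly ignores the third.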
  • 25. Dynamic threshold & alerts ▪ vCenter Operations uses dynamic thresholds • They are dynamic and personalised down to the individual metric. • They vary from object to object. 1000 VMs will each have their own thresholds. • They vary from time to time. The same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart: notice the range varies in size. • They vary from metric to metric. On an ESX with 12 cores, each core has its own CPU Usage threshold. • You can set hard thresholds if you need to. • This needs Enterprise edition. It comes with no static thresholds defined. • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
  • 26. Dynamic Threshold Analysis ▪ DT analysis runs nightly • New dynamic thresholds are computed for each metric ▪ Data categorization • Tries to identify the stat as linear, multinomial, step function, etc. • If one of those matches, that DT function is used ▪ Otherwise: competition • Sigma: assumes hourly cycles • CCPD: tries to find normal cycles • ACPD: tries to find abnormal cycles • Winner is assigned based on metric trending accuracy ▪ The same metric may get a different DT function on a different day. [Diagram: for each metric, Data Categorization feeds the candidate DT functions (Linear, Multinomial, Sparse, Step Function, Quantile, Sigma, CCPD, ACPD), then DT Scoring, then Dynamic Thresholds]
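To give a feel for what a threshold that "assumes hourly cycles" could look like, here is a deliberately simplified sketch. It is not the product's Sigma algorithm: it assumes a band of mean ± k standard deviations computed separately for each hour of the day, with k = 2 chosen arbitrarily, and every name and sample value is made up.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_bands(samples, k=2.0):
    """samples: iterable of (hour_of_day, value). Returns {hour: (lower, upper)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        mu, sigma = mean(values), pstdev(values)
        bands[hour] = (max(0.0, mu - k * sigma), mu + k * sigma)
    return bands

# Made-up history: CPU usage is busy at 09:00 and quiet at 03:00.
history = [(9, v) for v in (70, 75, 80, 72, 78)] + [(3, v) for v in (4, 5, 6, 5, 5)]
bands = hourly_bands(history)

def is_anomaly(hour, value):
    """True when the value falls outside the learned band for that hour."""
    lower, upper = bands[hour]
    return value < lower or value > upper

print(bands[9], bands[3])
print(is_anomaly(9, 75), is_anomaly(3, 75), is_anomaly(9, 5))
```

The same 75% reading is normal at 09:00 and anomalous at 03:00, and a 5% reading at 09:00 is flagged even though no high threshold was crossed.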
  • 27. Dynamic Threshold: Algorithm. [Equation: a joint density over probabilities $p_{i,j}$ with $0 \le p_{i,j} \le 1$, written in terms of the Gamma function $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$; the marginal distribution of the $i$-th row of $J$ is a Dirichlet distribution for $i = 1, \ldots, m-1$, with a separate Dirichlet form for $i = m$.] It is pretty difficult for a human to beat the computer in analysis of the data. The above is one of the many algorithms applied by vCenter Operations.
  • 28. Analytics: 7 different analytics areas. For the DT feature, there are 8 algorithms. Only in Enterprise Edition. These advanced features create Smart Alerts.
  • 29. Discussion Point: Raw Counters vs Derived Counters. Dynamic Threshold vs Static Threshold.
  • 30. Badge – Health ▪ Answers complex questions like: • How is the entire virtual data center doing? What is its degree of health? • For every cluster, host and datastore, what is its health? ▪ Health is the current operational state. • It represents what is wrong now and should be addressed within 1 day. Health is therefore scored such that red really does need attention. ▪ Weather Map • A simple way to check that the entire farm is healthy • For a child object, it is replaced with Health Trend • Shows Health of all parent and child objects • Each square can be a VM, ESX, datastore, cluster, datacenter or vCenter. ▪ Score table (Value: Explanation) • 75 – 100: Normal behaviour. • 50 – 75: The object experiences some problems. • 25 – 50: The object might have serious problems. Check and take action as soon as possible. • 0 – 25: The object is either not functioning properly or will stop functioning soon.
  • 31. Badge – Workload ▪ Answers complex questions like: • For every object, how is Demand vs Supply? • For every single VM, is it CPU/Memory/Disk/Network bound? • Is any VM not getting what it is entitled to? • What’s the normal workload range for every object in our vDC? ▪ Workload is not utilisation or usage • More accurate than utilisation, as it takes into account more factors than just utilisation. ▪ Workload = (Demand/Entitlement) • Entitlement is dynamic. Affected by shares, limit, etc. • Demand ≠ Usage. • Usage may mean passive usage. E.g. the RAM page is there but there is no write/read. • Score is Max (CPU, RAM, Disk IO, Net IO) • The maximum is used to draw attention to the most constrained resource. ▪ Score table (Value: Explanation) • 0 – 80: Workload is not high. • 80 – 90: The object is experiencing some high resource workloads. • 90 – 95: Workload on the object is approaching its capacity in ≥1 area. • >95: Workload on the object is at or over its capacity in ≥1 area.
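The Workload formula on this slide (Demand divided by Entitlement, then the maximum across the four resources) can be sketched as follows. All figures and names are made up for illustration; real Demand and Entitlement values are derived by the product and are not simple inputs like these.

```python
def workload_score(demand, entitlement):
    """Return the most constrained resource and its Demand/Entitlement in percent."""
    scores = {res: demand[res] / entitlement[res] * 100 for res in demand}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]

# Made-up figures for one VM (units differ per resource but cancel in the ratio).
demand = {"cpu_mhz": 1800, "ram_mb": 3000, "disk_iops": 450, "net_kbps": 2000}
entitlement = {"cpu_mhz": 2000, "ram_mb": 4096, "disk_iops": 400, "net_kbps": 10000}

resource, score = workload_score(demand, entitlement)
print(resource, round(score, 1))
```

In this made-up case disk I/O is the bottleneck and its score is above 100, which is what happens when demand exceeds what the object is entitled to.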
  • 32. Derived Metric: Demand. The chart shows Demand in action. I generated IOPS on a local datastore, resulting in a spike in latency (read latency went up from 3 ms to 60 ms). Demand correspondingly went up from 4 to 100!
  • 33. Badge – Anomalies ▪ Answers complex questions like: • Is our vDC doing business as usual today? Or is it a dynamic environment with lots of unexpected changes? • Which VMs, ESX hosts, clusters, datastores, etc. are behaving abnormally? • … and exactly which counters are the culprits? ▪ Identifying metric abnormalities • It needs to learn the dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric. • A month-end job means it needs 3 months. • The normal range changes after configuration or application changes. ▪ Anomalies score • A high number of anomalies is usually an indication of: • A problem • A demand change • The application team changing code/app • KPI metrics impact the Anomalies score more than non-KPI metrics. ▪ Score table (Value: Explanation) • 0 – 50: Normal Anomaly range. • 50 – 75: The score exceeds the normal range. • 75 – 90: The score is very high. • >90: Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.
  • 34. This virtual DC spans multiple vCenters. vCenter Ops shows all the counters that are behaving abnormally.
  • 35. Badge – Faults ▪ Answers complex questions like: • What faults do we experience in our vDC? • For every object, what faults does it have? ▪ Specific knowledge of vCenter Events • Which events affect Availability and Performance of which object? • Pulled from active vCenter events • Examples: • Loss of redundancy in NICs or HBAs • Memory checksum errors • HA failover problems • Each fault has a default score (e.g. 25, 50, 75, 100) • The highest individual Fault Score drives the object’s Fault Score ▪ Best Practices: • Do not change the Faults Threshold • Use Alerts View to manage Faults. Filter it to show just Faults. ▪ Score table (Value: Explanation) • 0 – 25: No fault is registered on the object. • 25 – 50: Faults of low importance happen on the object. • 50 – 75: Faults of high importance happen on the object. • >75: Faults of critical importance happen on the object.
  • 36. Badge – Risk ▪ Answers complex questions like: • Do we have risk from performance and capacity in our vDC? If yes, where is it, and can you quantify the seriousness? • Which objects are at risk? What is the specific risk? ▪ Risk Score takes into account • Time Remaining • Capacity Remaining • Stress ▪ Risk is an early warning system. • Identifies potential problems that could eventually hurt performance • The Risk Chart shows the Risk score over the last 7 days, giving a view of the trend. ▪ Score table (Value: Explanation) • 0 – 50: No problems are expected in the future. • 50 – 75: There is a low chance of future problems, or a potential problem might occur in the far future. • 75 – 100: There is a chance of a more serious problem, or a problem might occur in the medium-term future. • 100: The chances of a serious future problem are high, or a problem might occur in the near future.
  • 37. Badge – Time Remaining ▪ Answers complex questions like: • How much time do we have before we need to buy more server, storage or network capacity, before performance starts to degrade or we run out of capacity? • For every cluster, VM and datastore, how much time do we have? ▪ Measures time remaining before each resource type reaches its capacity • CPU • Memory • Disk (IOPS & Space) • Network I/O ▪ Early warning of upcoming provisioning needs • Based on the Score Provisioning (SP) buffer. Default value is 30 days. • Set in the “Capacity & Time Remaining” section ▪ Score table (Value: Time remaining) • 50 – 100: > 2x SP buffer (60 days) • 25 – 50: < 2x SP buffer • <25: Near SP buffer • 0: < SP buffer (30 days)
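One simple way to picture "time remaining" is to extend a usage trend until it meets capacity and compare the result with the provisioning buffer. The sketch below assumes a straight-line least-squares trend over daily samples, which is an assumption for illustration and not necessarily how vCenter Operations forecasts; the names, data and three-band classification are hypothetical.

```python
def days_until_full(daily_usage, capacity):
    """Fit a linear trend to daily samples; days until the trend reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    return max(0.0, (capacity - daily_usage[-1]) / slope)

def buffer_band(days, buffer_days=30):
    """Compare days remaining with the provisioning buffer (default 30 days)."""
    if days is None or days > 2 * buffer_days:
        return "more than 2x buffer"
    if days > buffer_days:
        return "less than 2x buffer"
    return "inside buffer"

# Made-up datastore usage in GB, growing 10 GB per day, against 1000 GB capacity.
usage_gb = [500, 510, 520, 530, 540]
days = days_until_full(usage_gb, capacity=1000)
print(days, buffer_band(days))
```

Repeating this per resource (CPU, memory, disk, network) and taking the smallest result gives the resource that will run out first.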
  • 38. Badge – Capacity Remaining ▪ Answers complex questions like: • How many more VMs can we put without impacting performance or using up capacity? • For every cluster, VM and datastore, which component (CPU, RAM, Disk, Network) would run out first? ▪ Early warning system • A low score of 1 means you still have >30 days. • Measures how many more VMs can be placed on the object. [Screenshot callout: 333 more VMs correlates to 77% Capacity Remaining for this object] ▪ Percentage of total VM “slots” remaining • Based on the average size of the VMs on the object (i.e. the VM profile) • Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc. ▪ Score table (Value: Capacity remaining) • >10: >120 days • 5 – 10: 60 – 120 days • 0 – 5: 30 – 60 days • 0: <30 days ▪ From the table, notice the value is not linear • It is also not the same as the Time Remaining threshold. • A value of 30 means >120 days for capacity but around 40 days for time.
  • 39. Capacity Remaining Calculation
      • Determine the capacity constraint resource (the limiting resource).
      • Deployed or Powered On VMs:
          • Powered Off VMs only use disk space.
          • Powered On VMs use all 4 resources.
      • Calculation example shown:
          • The limiting resource is Disk Space, with 333 VMs available.
          • Use the Deployed VM number of 99 to calculate the percentage of space remaining.
          • Capacity Remaining = 333 / (333 + 99) = 77%
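The calculation on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not VC Ops code: the function names are made up, and the CPU, memory and network figures are hypothetical. Only the 333 (disk space) and 99 (deployed VMs) come from the slide.

```python
def limiting_resource(vms_remaining_by_resource):
    """Return (resource, vms_left) for the resource that fits the fewest more VMs."""
    return min(vms_remaining_by_resource.items(), key=lambda item: item[1])


def capacity_remaining_pct(vms_remaining, vms_deployed):
    """Percentage of total VM slots that are still free."""
    total_slots = vms_remaining + vms_deployed
    if total_slots == 0:
        return 0.0
    return 100.0 * vms_remaining / total_slots


# Hypothetical "more VMs that fit" per resource; only disk_space=333 is from the slide.
remaining = {"cpu": 410, "memory": 520, "disk_space": 333, "network": 900}

name, vms_left = limiting_resource(remaining)
print(name, vms_left)                                # disk_space 333
print(round(capacity_remaining_pct(vms_left, 99)))   # 77
```

Taking the minimum first mirrors the "determine the capacity constraint resource" step; the percentage is then computed only for that resource.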
  • 40. Capacity and Time details  You can drill down to see details. • You can check the 9 components, as shown on the right. • This answers the question: how many days or VMs does each component have left? • Summary = Min (all 9 components)
  • 41. Badge – Stress
      • Answers complex questions like:
          • In our vDC, do we have stress points or periods? How bad is it?
          • For every cluster, VM and datastore, which ones are experiencing stress, and how bad is it?
      • Measures long-term or chronic workload (6 weeks).
          • The chart shows a weekly breakdown of Stress for each day and hour, averaged over the last 6 weeks.
          • Workloads > 70% = “Stressed”
          • The threshold is configurable, as per the screenshot below.
      • Value and explanation:
          • 0 – 1: Normal score. No action needed.
          • 1 – 5: Some of the object resources are not enough to meet the demands.
          • 5 – 30: The object is experiencing regular resource shortage.
          • >30: Most of the resources on the object are constantly insufficient. The object might stop functioning properly.
  • 42. Stress Calculation
      • [Chart: workload (0 to 100) over 6 weeks. The Stress Line sits at 70 and the Stress Zone is the band above it. The workload line fills 12% of the zone.]
      • Stress Score is a % and is based on the area of workload above the “Stress Line” threshold, compared to the total capacity of the object.
          • Stress Score = (Stress area / Stress Zone) * 100
          • The value can exceed 100%, as the workload can be >100.
      • Example:
          • Stress Line is 70% workload.
          • 12% of the area is above the 70% threshold.
          • Stress Score is 12.
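A minimal sketch of the Stress Score formula, under one stated assumption: the Stress Zone is taken to be the band between the Stress Line and 100% capacity across the whole period, and the workload is a list of evenly spaced samples. The slide does not spell out how the areas are measured, so the function name, the sampling and the sample values are all illustrative.

```python
def stress_score(workload, stress_line=70.0, capacity=100.0):
    """Stress Score = (stress area / stress zone) * 100.

    workload: evenly spaced workload samples in percent.
    Assumption: the stress zone is the band from stress_line to capacity.
    """
    if not workload:
        raise ValueError("need at least one workload sample")
    # Area of the workload curve that lies above the stress line.
    stress_area = sum(max(0.0, w - stress_line) for w in workload)
    # Area of the whole band between the stress line and full capacity.
    stress_zone = (capacity - stress_line) * len(workload)
    return 100.0 * stress_area / stress_zone


# Hypothetical samples: only one of five crosses the 70% line.
print(stress_score([60, 70, 88, 70, 60]))   # 12.0

# Workload above 100 pushes the score past 100.
print(stress_score([130, 130]))             # 200.0
```

The second call shows why the maximum is not capped at 100: workload itself can exceed 100.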
  • 43. Badge – Efficiency
      • Answers complex questions like:
          • Are there optimization opportunities in our vDC?
          • How well do we do in terms of VM provisioning? Do we get it right?
      • Efficiency Score factors:
          • Reclaimable waste
          • Density ratio
      • The graph depicts VMs by percent:
          • Optimal – optimally provisioned VMs
          • Waste – over-provisioned VMs
          • Stress – under-provisioned VMs. Not used in the Efficiency calculation (see Risk).
      • Three resources considered:
          • CPU
          • Memory
          • Disk Space
      • Note: VMs can appear in both Stress and Waste.
      • Value and explanation:
          • >25: The efficiency is good. The resource use on the selected object is optimal.
          • 10 – 25: The efficiency is good, but can be improved. Some resources are not fully used.
          • 0 – 10: The resources on the selected object are not used in the most optimal way.
          • 0: The efficiency is bad. Many resources are wasted.
  • 44. Badge – Reclaimable Waste
      • Answers complex questions like:
          • Have we over-provisioned the VMs in terms of CPU, RAM and Disk? If yes, what is the degree of over-provisioning?
          • For every cluster, VM and datastore, what can we reclaim?
      • It identifies the amount of reclaimable resources:
          • CPU
          • Memory
          • Disk
      • Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
          • Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
          • The disk calculation can also include old snapshots and templates.
      • Value and explanation:
          • 0 – 50: No resources are wasted on the selected object.
          • 50 – 75: Some resources can be used better.
          • 75 – 100: Many resources are underused.
          • 100: Most of the resources on the selected object are wasted.
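The two formulas on this slide combine as in the sketch below. The function names and all capacity figures are hypothetical, and expressing each ratio on a 0 to 100 scale is an assumption made so the result lines up with the value table.

```python
def waste_pct(reclaimable, deployed):
    """Reclaimable capacity as a percentage of deployed capacity."""
    if deployed <= 0:
        raise ValueError("deployed capacity must be positive")
    return 100.0 * reclaimable / deployed


def waste_score(cpu, ram, disk):
    """Waste Score = Max(CPU, RAM, Disk Space waste scores)."""
    return max(cpu, ram, disk)


# Hypothetical object: (reclaimable, deployed) per resource.
cpu = waste_pct(10, 100)     # GHz
ram = waste_pct(128, 256)    # GB
disk = waste_pct(2, 10)      # TB

print(cpu, ram, disk)               # 10.0 50.0 20.0
print(waste_score(cpu, ram, disk))  # 50.0
```

Because the score is a maximum, one badly over-provisioned resource (RAM here) sets the badge even when the other two look fine.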
  • 45. Badge – Density
      • Answers complex questions like:
          • How high can we push our consolidation ratio before we experience performance problems? Now that’s a million dollar question! :-)
          • For every datacenter, cluster and ESXi host, what are our key ratios and how much headroom do we have?
      • Contrasts actual vs ideal density.
          • Identifies optimal resource deployment before contention occurs.
          • Ideal is based on demand, not simple configuration.
          • High density is good. 100 is not too high.
      • Value and explanation:
          • >25: Good consolidation.
          • 10 – 25: Some resources are not fully consolidated.
          • 0 – 10: The consolidation for many resources is low.
          • 0: The resource consolidation is extremely low.
  • 46. Badge Thresholds  There are 2 different sets of thresholds: VM and Infra (ESXi, Cluster, Datastore, etc).  Notice that a major badge has different thresholds from its minor badges.  Even “similar” badges have different thresholds: notice that Time Remaining and Capacity Remaining have very different thresholds.  Disable a color threshold by clicking the level off.
  • 47. Using badges together
      • Workload High & Anomalies Low & Stress High: add resources.
          • Workload – The object is running hot. Potentially starving for resources.
          • Anomalies – Normal behavior for this timeframe.
          • Stress – The object is often running under high workload.
      • Workload High & Anomalies Low & Stress Low: not likely a big problem… a cyclical workload spike?
          • Workload – The object is running hot. Potentially starving for resources.
          • Anomalies – Normal behavior for this timeframe.
          • Stress – The object usually has enough resources.
      • Workload High & Anomalies High: something is amiss! Immediate attention.
          • Workload – The object is running hot. Potentially starving for resources.
          • Anomalies – Abnormal behavior for this timeframe.
      • If there are Alerts and Faults too, then it is a sign of a major issue.
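The three combinations above read naturally as a small decision function. This is just a restatement of the slide in Python; the function name and the boolean inputs (which assume you have already decided what counts as "high" for each badge) are illustrative.

```python
def interpret_badges(workload_high, anomalies_high, stress_high):
    """Map the badge combination from the slide to its suggested reading."""
    if not workload_high:
        # The slide only discusses cases where Workload is high.
        return "Not covered by the slide"
    if anomalies_high:
        return "Something is amiss! Immediate attention."
    if stress_high:
        return "Add resources"
    return "Not likely a big problem (a cyclical workload spike?)"


print(interpret_badges(True, False, True))    # Add resources
print(interpret_badges(True, False, False))   # Not likely a big problem (a cyclical workload spike?)
print(interpret_badges(True, True, False))    # Something is amiss! Immediate attention.
```

Anomalies are checked before Stress because the slide treats "Workload High & Anomalies High" as urgent regardless of the Stress badge.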
  • 48. Discussion Point Is Badge the way to go? Are these the right 11 badges? What other badges do you need? 48
  • 49. Heat Map
      • A great way to show a lot of information on 1 screen. A heat map can quickly highlight information, as it presents relative information. It is good for relative comparison among VMs.
      • Built-in heat maps:
          • Basic:
              • Storage: space, IO
              • CPU
              • RAM
              • Network
          • Advance (or composite):
              • Health
              • Workload
              • Capacity
      • Custom heat map or cold map:
          • Since we can change the color, we can actually create a cold map.
          • In a cold map, the bigger the size, the colder (less utilised) it is. The bluer it is, the less utilised it is.
          • Hence it focuses on Waste.
      • A heat map is a 2-dimensional chart, so it takes 2 parameters. You cannot choose more than 2. For example, you cannot show the following at the same time:
          • IOPS, Latency and Throughput. These 3 also have different units, so it’s hard to combine them using a Super Metric.
          • ESX, VM and Datastore.
  • 50. Storage: Datastore + VM vs workload + latency  Since all the datastores are on the same array, how do we quickly tell the relative workload generated by each of them? • This answers: which datastores are heavily loaded?  For each of these datastores, how do we know the relative workload generated by each VM? • This answers: which VMs dominate within a datastore?  For every VM, how do we know its performance is at a reasonable number? • This answers: which VM has a storage bottleneck?  How do we show all the above data on one page, without the need to show a lot of numbers? • And we still want to be able to drill down to each VM and datastore.
  • 51. Each square is a VM. They are grouped by datastore. Bigger square: bigger throughput. Color: latency.
  • 52. Storage: Throughput vs Latency at cluster level  Which cluster is generating high storage workload?  Are they getting the SLA they asked for? What’s the latency? The cluster owner wants to know that his entire cluster is getting <10 ms latency.  We expect these X, Y, Z clusters to be doing little work. Can we prove this?  Basically, this is the same concept as the previous slide, but looking from the cluster point of view, as Cluster & Datastore have a many-to-many relationship.
  • 53. Storage: Throughput vs Latency at cluster level 53
  • 54. Storage: Throughput vs Latency at host level 54
  • 55. Storage: Throughput vs Latency at VM level  Can we show at VM level now? That’s why you need a 24” monitor :-)
  • 56. Storage: Space vs Latency  Any big VM that is not getting the SLA we agreed on? 56
  • 57. Storage: Datastore space contention  Do we have space contention on any of the datastores? If yes, how bad is the contention? • While we use thick provisioning at vSphere level (and thin at array level), we still have space risk from snapshots, vRAM increase, new VMs, new vDisks, Storage vMotion, Storage DRS, etc.  Are the datastores uniformly sized?
  • 58. Storage: Space contention  We use thin provisioning 58
  • 59. CPU: Contention vs Usage at cluster level  Which clusters are doing the most work? Which are not doing much?  How is the CPU workload on every cluster?  For each of those clusters, can we see if there is CPU contention? 59
  • 60. CPU: Contention vs Usage at host level  Same questions as the previous slide, but for hosts.  We can expect some “drill down” in this heat map.
  • 61. CPU: Contention vs Usage at VM level  Can we show at VM level now? That’s why you need a 24” full HD monitor :-)
  • 62. VM Health  Current Health • Are all the VMs healthy? Especially those VMs which have high workload! • Which VMs are experiencing problems? • Are more demanding VMs less healthy? • Can we see this by cluster? By host?  Future Health • Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and network for every single VM! • For those VMs which are not ok, can we be specific on which value will run out first? Can we “drill down” to individual VM? 62
  • 63. VM: color by health, size by workload 63
  • 64. VM: color by capacity, size by workload  This now shows the future projection. We can see that the VM vCenter 5 is red: its capacity will run out within 30 days. So we click on it to drill down.
  • 65. Drill down to specific VM  Screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.  We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the physical world as it’s fixed, but unwise in virtual world. 65
  • 66. Drill down to specific VM  Showing value in absolute terms is good, but can be confusing. vCenter Ops can also show in % 66
  • 67. Discussion Point Which heat maps are useful for you? What other heat maps or cold maps do you need? 67
  • 68. Smart Alert vs Normal Alert
      • Smart Alert:
          • Relies on the advanced analytics instead of simple raw counters.
          • Not static, as it is based on Dynamic Thresholds.
          • Examples:
              • Early warning alerts: use total anomalies to predict when a problem is happening, sometimes before users are impacted.
              • KPI predictive: a prediction that a KPI might soon go abnormal, because an event is occurring that has preceded the KPI going abnormal on previous occasions.
              • Fingerprint: a set of metric anomalies matches a previously seen problem (and its associated resolution).
      • Comparison, Advanced Edition vs Enterprise Edition:
          • Alert source. Advanced: alerts on minor badges (e.g. Workload YES, Health NO). Enterprise: alerts on any counter (raw, badge, super metric).
          • Level. Advanced: infrastructure-level alerts only. Enterprise: can do application-level alerts.
          • Scope. Advanced: good for alerts on single objects (e.g. a VM). Enterprise: good for single or multiple objects.
          • Trigger. Advanced: driven by the badge changing color. Enterprise: driven by threshold anomaly breaches and KPI threshold breaches.
          • Customisation. Advanced: not customisable. Enterprise: highly customisable.
          • Resource Pool or Folder alerts. Advanced: cannot do. Enterprise: can do.
  • 69. Application-level smart alert  Needs Enterprise edition. 69
  • 70. Alert  When does an Alert happen? • When a badge changes color • When a fault happens • VC Ops own alerts • A component in VC Ops itself has failed. • VC Ops cannot get data.  Can do SNMP and SMTP • Both are set on the Administration web page. The URL format is https://VM-IP/admin/
  • 71. Advance edition: Alert main window  Filter by the 11 badges  Filter the VC Ops own alert: system or environment 71
  • 73. Enterprise edition: Alert main window  New alerts: Early Warning, KPI Breach, KPI Prediction, KPI High Threshold breach, Classic (static)  We can also color the row by criticality, and specify period (start – end) 73
  • 78. Anomalies – Symptoms Window
      • Example of an ESXi Anomalies symptom window, from an ESXi host with 11 VMs.
          • It shows 3 resource types: VM, Datastore, Host System.
          • The VM resource kind has 7 metric groups with anomalies.
      • The VM resource kind (30 out of 71 Symptoms):
          • 71 – Total number of symptoms under the VM object.
              • We’re reporting on an ESX host here, and a VM is a child of the host, so all children metrics are included.
              • The metric groups come from the vSphere adapter + VC Ops itself.
          • 30 – Total number of displayed symptoms.
              • Based on the limit of 5 metrics shown for each metric group.
              • The metric groups (CPU Usage, Network, Summary, etc) are specified by the adapter.
      • Subcategory Network (3 of 11):
          • 11 – The total number of VMs associated with this ESX host. This is not the number of symptoms.
          • 3 – The total number of VMs that have one or more Network symptoms.
      • Metrics will not be identical among VMs, though most will be similar. A multi-vCPU VM will have more vCPU metrics than a 1-vCPU VM.
      • Different VMs will have different anomalies, as they have different workloads.
  • 79. vCenter Operations presents datastore with all the details 79
  • 80. Storage in vCenter Operations Automatic learning of storage performance. Calculating both Demand and Normal rate. 80
  • 81. vSphere 5 Performance Chart (fat client) Can only choose 1 component at a time. e.g. cannot show CPU and RAM at the same time. 81
  • 82. vSphere 5 Performance Chart (fat client) Can only show 1 chart at a time. Hence can only show 2 units at a time. 82
  • 83. vCenter Operation charts Can show >1 charts at a time. Can combine/split charts. Can show different data type from different objects. Line is color coded, showing when threshold is breached. 83
  • 84. Capacity Management in vSphere is hard
      • CPU Optimizations: vSMP, Shares, Reservations, Limits
      • Memory Optimizations: Transparent Page Sharing, Memory Ballooning, Memory Compression
      • Storage Optimizations: Thin Provisioning, Linked-Clones
      • Clusters: DRS, HA, FT, vMotion, Storage vMotion
      • Workload Flux: VMs growing/shrinking, added/removed
      • [Diagram: vSphere capacity split into Reserved Capacity, Usable Capacity, Used Capacity and an unknown (“?”) Remaining Capacity, with “36 days remaining”.]
  • 85. Capacity Management
      • Analyze:
          • What are my historical utilization trends?
          • What resources have been requested vs. needed?
          • How many more VMs will fit in my current farm?
      • Optimize:
          • How can I use my resources more efficiently?
          • What VMs should be right-sized?
          • Can I reclaim over-provisioned or unused capacity?
      • Forecast:
          • When will I run out of capacity?
          • What if I add, remove or reconfigure capacity?
          • Can I defer infrastructure investments?
  • 86. Understanding Behavior
      • Need to understand the weekly pattern:
          • Business week
          • Weekend
          • E.g. workload spike at 9am on Mondays
      • Accomplished through roll-ups:
          • Roll up the weeks in a month to compute the typical week for the month.
          • Roll up the typical weeks of each month into a typical week for the quarter.
      • Differs from performance management roll-ups:
          • Older performance data gets less granular. vCenter loses accuracy.
          • Older capacity data maintains its granularity.
      • [Diagram: Year 1 contains Quarter 1, which contains Month 1, Month 2 and Month 3.]
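The roll-up idea can be sketched as slot-by-slot averaging. The slide does not say how VC Ops computes the "typical week", so plain averaging is an assumption here, as are the function name and the sample data. A real week would have one slot per hour (168); three slots are used to keep the example short.

```python
from statistics import mean


def typical_week(weeks):
    """Roll several weeks up into one typical week by averaging each time slot.

    weeks: list of weeks, each a list of workload values with one value per slot.
    """
    if not weeks:
        raise ValueError("need at least one week")
    slots = len(weeks[0])
    if any(len(week) != slots for week in weeks):
        raise ValueError("all weeks must have the same number of slots")
    return [mean(week[i] for week in weeks) for i in range(slots)]


# Hypothetical month: the middle slot is the Monday 9am spike.
month_1 = typical_week([[10, 50, 20], [20, 70, 20], [30, 60, 20], [20, 60, 20]])
print(month_1)   # [20, 60, 20]

# The same function rolls typical months up into a typical week for the quarter.
quarter_1 = typical_week([month_1, [20, 80, 20], [20, 70, 20]])
print(quarter_1)   # [20, 70, 20]
```

The point of the sketch is the second bullet group above: the spike survives the roll-up because each slot is averaged only with the same slot of other weeks, so granularity within the week is kept.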
  • 90. Planning  Summary  Export 90
  • 91. Planning  Summary  Resources 91
  • 92. Planning  Summary  Resources 92
  • 93. Planning  Summary  Resources 93
  • 94. What-if  Visualise • Add or remove VMs. • Add based on existing VMs as profiles • Add based on a spec you supply • Add, remove, or update hosts. • Modify CPU and RAM only. No Network. • Add, remove, or update datastores. • Update means increase or decrease size. • No IOPS yet.  At a cluster level or host level • Cannot do at datacenter or higher level • Host level does not make sense when the host has HA & DRS turned on  You can add multiple what-if scenarios • You can combine them or compare them on the same chart • You cannot save. Changes are lost upon log-off. • You can export the scenario results to an Adobe PDF or CSV file.
  • 95. 3 choices of views
  • 96. Average VM Capacity (trend view) 96
  • 99. Modeling a what-if scenario
      • Change Supply: change Host/Datastore
      • Change Demand: change VM (based on existing VMs, or a new VM spec)
  • 100. Modeling a what-if scenario 100
  • 101. Modeling a what-if scenario – Specifying VM Configuration 101
  • 102. Modeling a what-if scenario – Using Existing VMs Columns you can see 102
  • 103. Modeling a what-if scenario – Using Existing VMs 103
  • 104. Modeling a what-if scenario – Using Existing VMs 104
  • 105. Modeling a what-if scenario – Changing hosts 105
  • 106. Modeling a what-if scenario – Changing datastores 106
  • 107. Modeling a what-if scenario 107
  • 109. Capacity state today  [Chart labels: VM count capacity; current capacity cross-over point; actual VMs deployed.]
  • 114. VMs can appear in Stress and Waste at the same time  Example: undersized for CPU, oversized for Memory.
  • 115. Powered-Off VM and Idle VM: setting 115
  • 117. Capacity Planning: Is the VM really sized properly?  Setting a threshold of under-utilisation alone is not enough. We need to calculate the degree of under-utilisation.
  • 118. Oversized VM & Undersized VM 118
  • 119. Oversized VMs - Calculation Same concept applies to undersize. Same concept applies to idle VM. 119
  • 120. Planning  Summary tab Planning  Views tab 120
  • 121. Tips  No of intervals and data points used for analysis • Tied to your business cycles. • Pick correct number of data points and the interval type to represent a typical business cycle. • Match no of intervals used for trend view and no of data points used for forecasting • Stay with default forecasting algorithm settings  Leverage buffer settings to accommodate for unforeseen usage spikes or future business growth. • VC Ops 5 does not yet have “future incoming VM” concept  Leverage business hours to eliminate off-peak usage  Don’t be afraid, play with global settings • They are just knobs used for data analysis • Raw data is not modified when global settings are changed 121
  • 122. Change Events Correlated with Performance  Overview • Integration between vCM and vC Ops Mgr for change events • Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs • Launch in context into vCM to see full details of changes and potentially remediate them  Benefits • Enable Operations to quickly understand and resolve performance issues arising from configuration changes (reduce MTTR) • Drive efficient & effective troubleshooting by correlating Guest OS configuration changes w/ VM performance degradations • In larger enterprise, help bridge gap between VMware Admin and Guest OS Admin 122
  • 123. VCM Events in vC Ops – Events Collected
      • vC Ops does not pull in every event from vCenter.
          • Only events that could affect health or workload (vSphere knowledge!)
      • The adapter only pulls in change events for Guest OSs.
          • No ESX/i host configuration changes (these come from the vCenter Adapter).
          • The Guest OS has to be managed by VCM.
      • Events collected:
          • Reboot
          • Software Install/Uninstall
          • Windows Registry
          • IP/Networking changes
          • Device Driver changes
          • Memory/CPU changes
          • Windows Firewall
          • Patches
  • 124. Event Types in vC Ops Mgr
      • Circle events are vCM-initiated.
          • The change log in vCM is updated when the change is completed.
          • Time = Occurred time
      • Diamond events are non-vCM-initiated.
          • The change log in vCM is updated when vCM collects from the VM.
          • Time = Collected time
      • Always blue events – “might” have minimal impact.
      • vCM events for VMs follow the normal vC Ops display rules.
          • vCM events appear for the VM object itself.
          • vCM events appear on an ESX host if you enable Child Events.
  • 127. vCM Change Events Correlated with Performance  A pop-up for a vCM event related to uninstalling a piece of software on the VM in question 127
  • 128. vCM Change Events Correlated with Performance 128
  • 129. Terms
      • The terms Attribute, Metric and Counter mean the same thing.
          • CPU Ready Time is an attribute.
          • CPU Ready Time from the VM ABC123 is a metric.
          • vSphere uses the word Counter. VC Ops uses Attribute and Metric.
          • As there are many attributes, they are grouped together. This is called an Attribute Package.
      • A Resource provides the Metrics.
          • Examples of resources: host, VM, datastore, cluster, etc.
          • So a resource provides many attributes.
          • Resources are pulled via an Adapter.
      • Kind
          • In VC Ops, there are many kinds of resources. So there is a term, Resource Kind, that you need to get used to.
          • VC Ops uses different adapters to talk to different sources: 1 type of adapter per source. So there is a term, Adapter Kind.
      • Advance terms
          • Container. Super Metrics. Application. Tier. KPI.
      • [Diagram: an Adapter pulls several Resources; each Resource has Attributes.]
  • 130. Adapter, Resource, Attribute, Package
      • VC Ops adapter and its source of data:
          • VMware Adapter: vSphere 5
          • VCM Adapter: VCM 5.4
          • VC Ops Adapter: VC Ops 5
          • Container Adapter
      • Adapter Kind = adapter type. VMware Adapter is an example of an Adapter Kind.
      • 1 Adapter Kind can pull many kinds of objects from the source. This is called Resource Kind.
      • To make management of attributes easier, they are put into a Package. Inside a package, metrics are grouped for ease of use.
      • Screenshot callout: this is the actual Resource Kind brought by the VMware Adapter.
      • Container Adapter is not actually an adapter. It’s a group or container that can hold other objects.
  • 131. Actual Resource Kinds
      • Sample adapters with their associated resource kinds.
      • Callout 1: This is a special & built-in adapter. It monitors VC Ops itself! VC Ops is just an application, which also needs monitoring.
      • Callout 2: This is another special & built-in “adapter”. Technically, it is not an adapter, as it’s just a container.
  • 132. vSphere resource kinds
      • Unlike the Advanced edition, we can utilise Folder and Resource Pool.
          • This means you can create a Super Metric at this level.
          • Complements vCenter.
      • Screenshot callouts: “Not used?” (on 2 entries), “ESX Host”.
      • No vApp, no Datastore Group, no vDS as at VC Ops 5.
  • 133. Resource Kind: default settings 133
  • 134. Attribute & Attribute Package  Package • A collection of Attributes from 1 Resource with the same collection interval. That’s all! • Need to map it to objects • Super Metric must be placed into a package • A package cannot come from multiple resources. See screen below. • Cannot create a package that has both VM and ESXi • There is a default package called All Attributes. 134
  • 139. Editing a resource property 139
  • 141. Resource Kind: Tags  What’s the difference between Applications and Application? It looks like Application comes from the Container adapter, which is built in.  Maintenance Schedule contains the times when a particular object is on scheduled downtime. It tells VC Ops to ignore the object during that time; otherwise VC Ops would raise an alert because the behaviour is unexpected (it would think the health had dropped). So in this screen, ignore Maintenance Schedule, as it should not be part of Resource Kind.  The range for Health: this is not the same as the Health badge in VC Ops Advance, as this one is universal and applies beyond vSphere. Health in the Advance edition includes Fault, which is vSphere specific.  Tier is a special container. Again, this is universal, so name your tiers properly to avoid renaming them later on.  Only 1 value here. This means the entire VC Ops.
  • 142. Resource Kind: Tags  You can control which resource kinds are shown • In the picture below, ESX was hidden. 142
  • 144. Drag selected objects to the tag value 144
  • 146. VC Ops generated metrics 146
  • 147. Monitoring the big workload  You have convinced your CIO to virtualise the remaining 50% of the servers.  Your CIO needs you to prove, supported by performance charts, that the platform has served every VM well, meeting the SLA in the past 1 quarter. • Tier 1 cluster SLA: 2% CPU Ready, 0 RAM Ballooning, 10 ms disk latency, 0 drop packets. • Tier 2 cluster SLA: 4% CPU Ready, 5% RAM Ballooning, 20 ms disk latency, 0 drop packets. • Tier 3 cluster SLA: 6% CPU Ready, 10% RAM Ballooning, 30 ms disk latency, 0 drop packets.  You have 500 VM on 50 ESXi, 8 clusters, 40 datastores, 5 RDM.  You must prove that: • Not a single Tier 1 VM has >2% CPU Ready in the past 1 quarter. The underlying ESXi also has <2% CPU contention. • Not a single Tier 1 VM has >10 ms disk latency in the past 1 quarter. The underlying ESXi also has <10 ms disk latency. • Etc, for each Tier and each component (CPU, RAM, Disk, Net) What kind of charts do you need to show? 147
  • 149. Super Metric: Functions
      • 2 types:
          • Looping functions: take multiple input values.
              • Average, sum, min, max, count, combine, etc.
              • More practical or useful than single functions.
          • Single functions: take 1 value.
              • Absolute, round up, round down, square root, etc.
      • The xxxN functions do not work on just the immediate children. They look down (or up) the number of levels specified in the formula.
          • A ‘2’ tells the function to look down two levels for the metric.
          • Putting -2 means look up.
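To make the "look down N levels" behaviour concrete, here is a small Python model of a resource hierarchy. It is not VC Ops formula syntax: the `Resource` class, `avg_n` and the metric name are hypothetical, and the sketch assumes the function reads the metric from resources exactly N levels below the starting object. Looking up (negative N) is left out.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Resource:
    name: str
    metrics: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def descendants_at(resource, depth):
    """Return the resources exactly `depth` levels below `resource`."""
    if depth == 0:
        return [resource]
    found = []
    for child in resource.children:
        found.extend(descendants_at(child, depth - 1))
    return found


def avg_n(resource, metric, depth):
    """Looping function: average `metric` over resources `depth` levels down."""
    values = [r.metrics[metric] for r in descendants_at(resource, depth)
              if metric in r.metrics]
    if not values:
        raise ValueError("no resource at that depth reports the metric")
    return mean(values)


# Cluster -> hosts -> VMs, so the VMs are 2 levels down from the cluster.
vm1 = Resource("vm1", {"cpu_usage": 20})
vm2 = Resource("vm2", {"cpu_usage": 40})
vm3 = Resource("vm3", {"cpu_usage": 60})
host1 = Resource("host1", children=[vm1, vm2])
host2 = Resource("host2", children=[vm3])
cluster = Resource("cluster", children=[host1, host2])

print(avg_n(cluster, "cpu_usage", 2))   # 40
```

With a depth of 1 the same call would look at the hosts only, which in this toy data report no `cpu_usage` and therefore raise an error.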
  • 150. Super Metric: hierarchy  Example: a super metric for the average CPU usage of a cluster. VMs are 2 levels down from the cluster.
  • 153. Super Metric: Operators  To calculate a value for each VM based on metrics for that VM, use the ‘$This’ operator.  Another example: max ( $This:CPUavg, ESXi-Host-003:CPUavg, VM:CPUavg)  Finds the maximum value among these • CPUavg metric for the resource to which the super metric is assigned (so this is dynamic) • CPUavg metric for a specific resource called ESXi-Host-003 (so this is hardcoded) • CPUavg metric for all resources of type VM (so this is universal for all VM) 153
  • 159. Discussion Point Think of super metrics that you need. Explain why and how you will need them. 159
  • 160. Applications and Application Tiers  App teams often view things from their own application-centric perspective. We can create a custom dashboard showing their “Application”.  Even better if we add non-vSphere data, like Hyperic. This gives app-level and Guest-OS-level info, which is not available in the vSphere adapter.  Define your own hierarchy and relationships.
  • 161. Drag selected objects to the tag va 161
  • 164. What counters do you check?
      • CPU (ESX and VM):
          • Usage or Utilisation: overall CPU utilisation (on ESX, to get the overall utilisation of the entire box)
          • Usage or Utilisation: individual core utilisation (on ESX, to see the distribution and whether any particular core is maxed out)
          • Wait (wait for IO, to see if it’s IO bound)
          • Ready (VM unable to run, waiting for a core)
          • Co-Stop (if there are large VMs)
      • RAM (ESX and VM):
          • Ballooning
          • Active or Active Write
      • Storage:
          • ESX: Latency (kernel latency, device latency), Throughput, IOPS
          • VM: Guest Latency, Device Latency, Throughput, IOPS
      • Network (ESX and VM):
          • Drop packets
          • Throughput
      • Others (ESX only):
          • vSphere Replication? System? Cluster service?
  • 165. Test your vSphere knowledge! How are Disk, Datastore, Adapter and Path related? 165
  • 166. CPU counters  Test your vSphere knowledge! Which one is ESX, which one is VM? How do you know?  Test your vSphere knowledge! What can stop or block a VM from getting the CPU it was configured with?  No more Collection Level limitation: VC Ops collects them all and analyses them all. Changing the collection level in vCenter does not impact VC Ops, as VC Ops gets its data from the “real-time” statistics.
  • 167. %OVRLP and %SYS
      • [Diagram: a timeline of Run, Wait and Ready time. World 1 shows %RUN, %SYS and %OVRLP; World 2 shows %RUN and %OVRLP. %RUN continues to accumulate, but %OVRLP kicks in.]
      • %OVRLP: overlapping time. A world still wants CPU but is interrupted by another world. A high number normally means ESX is experiencing heavy IO.
      • %USED = %RUN + %SYS - %OVRLP. As a result, the overlap value does not incorrectly inflate %USED.
      • %SYS: a high number means heavy IO or interrupts.
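A worked instance of the %USED formula, with made-up counter values (the slide gives the formula but no numbers). The function name is illustrative only.

```python
def pct_used(pct_run, pct_sys, pct_ovrlp):
    """%USED = %RUN + %SYS - %OVRLP, as stated on the slide."""
    return pct_run + pct_sys - pct_ovrlp


# Hypothetical world: 60% run time, 5% system time, 3% of it overlapped.
print(pct_used(60.0, 5.0, 3.0))   # 62.0

# With no overlap, %USED is simply run time plus system time.
print(pct_used(60.0, 5.0, 0.0))   # 65.0
```

Subtracting %OVRLP removes the slice of time in which this world was interrupted to do work on behalf of another world, which is why the overlap does not inflate %USED.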
  • 168. Memory counters ESXi VM 168
  • 169. Storage counters: ESXi host Datastore Disk Storage Adapter or Storage Path 169
  • 170. ESXi: Adapter, Device and Path  1 adapter can have many Devices (LUNs). 1 Device is accessed via many paths. 1 path can only access 1 Device.
  • 172. ESXi: Adapter, Device and Path  [Diagram labels, top to bottom: ESXi 5.0 with a vmnic and Storage Adapters 1 and 2 (vmhba2, vmhba3); several Storage Paths under each adapter; NFS Datastore, two VMFS Datastores and an RDM; Disks.]
  • 173. Storage counters: VM  [Diagram labels, top to bottom: a VM with Drive 1, Drive 2 and Drive 3; Virtual Disks (VMDK, RDM) such as scsi0:0 and scsi0:2; VMFS Datastore, NFS Datastore and RDM; Disks.]
  • 174. Network counters ESXi VM 174
  • 175. Other Counters: ESXi Host
      • vSphere Replication
      • System (vmkernel): see next 2 slides for info
      • Cluster Service
      • Power
  • 177. A long list of vmkernel resources. Some are familiar, such as vMotion, FT, hostd, Vpxa, DCUI, logging 177
  • 181. Dashboard: creating a new Tab 181
  • 183. Application Overview and Application Detail 183
  • 196. Scoreboard: Health or Workload 196
  • 208. Metric Graph (Rolling View) 208
  • 220. The VC Relationship  There are 2 widgets that are vSphere related.  Use the advanced edition instead. • Enterprise edition can access Advanced edition UI at the same time. Just open another window or tab. 221
  • 221. Interaction between widget  Controlled at the dashboard level, not individual widget  Providing widget and Receiving widget 222
  • 224. Practice session: creating your dashboard
      • Goal: have a dashboard that helps you investigate all non-local datastores quickly.
          • Be able to plot charts for all non-local datastores for comparison.
      • Answer:
          • Create a tag called Storage from the Environment screen.
          • Create 1 tag value: Shared Datastore.
          • Tag all the non-local datastores with this tag value.
              • Done manually. Simply drag all the rows.
          • Create a dashboard with 4 widgets:
              • Health Status: this is where you show the overall health of all non-local datastores.
              • Resources: this is where you show all the members of the non-local datastore tag.
              • Metric Selector: all the metrics will appear here. Select the metric you want.
              • Metric Graph or Metric Sparklines: choose Sparklines if you have lots of graphs.
  • 234. Major Steps in implementation
      • Steps, in order: Define who needs what, Create Super Metrics, Create Applications, Create Tags, Create Heat Maps, Create Dashboards.
      • Begin with the end in mind.
          • Every Super Metric must serve a particular role.
          • Role, not individual. A person can & will have many heat maps/dashboards.
          • Decide if you need the following non-standard info:
              • Application-level & Guest-OS-level info
              • Info from physical machines (UNIX, X64, etc)
              • Info from physical storage and network (switch, FW, router, etc)
      • Think in terms of application.
          • A great way to complement vSphere, as vCenter does not have this object.
  • 235. Who needs to see what
      • Roles, from top to bottom:
          • CIO or CTO
          • Group Head, e.g. Head of Infra, Head of Apps
          • Dept Head, e.g. Head of Storage, Head of Server, Head of Network, Head of Databases
          • Admin/Architect, e.g. Storage Admin, Network Admin, App Owner, VM Owner
      • Towards the top: Simple Dashboard. Big picture. Tends to be application focused. No absolute data; normalised to 0-100. Focus on long term. Averaged data: a 30-minute spike will not show up. Updated daily.
      • Towards the bottom: Rich Dashboard, ideally on a Full HD screen. Specific info. Absolute data + normalised data. Focus on short term. Actual data: a 5-minute spike will be visible. Updated every 2 minutes.
  • 236. Who needs to see what (samples)
      • Roles and info presented:
          • CIO: Health of overall IT in the past 1 month. Health of key applications in the past 1 month.
          • CTO: As above, but with more technical content, and tailored to him.
          • Head of Applications: Health of all key apps in the past 1 month, with the ability to do a 1-level drill down for each app. Capacity projection for all key apps.
          • Head of Infrastructure: Health of Storage. Health of Network. Health of Servers (VMware and Physical). Health of VM.
          • Head of Storage: A higher-level, simpler dashboard than Storage Admin.
          • Head of Network
          • VMware Team
          • An App Owner: The infra is providing each of the VMs in my App with the resources it needs.
  • 237. Designing Super Metric
      Leverage existing derived metrics.
      Leverage objects for which vCenter cannot provide performance data.
        • Application, Resource Pool, Folder and Location can now have performance counters.
      Minimise static alerts.
      Know what a good range is for the end result.
      Build a simple table to avoid super metric sprawl and duplicating existing metrics.
        • Below is an example, showing 2 Super Metrics.
      Super Metric 1: VM SLA
        • Purpose: shows that a VM gets the resources it wants from the infrastructure, based on the defined SLA.
        • Target role: VM Owner
        • Formula: VM SLA = 100% - Max (CPU, RAM, Disk, Network)
          • CPU = CPU contention %
          • RAM = RAM ballooning %
          • Disk = % above threshold latency
          • Network = packet drop %
          • Tier 1 Disk SLA is 10 ms. Tier 2 Disk SLA is 20 ms. Tier 3 Disk SLA is 30 ms.
        • Good range: >99% (Tier 1 cluster), >97% (Tier 2 cluster), >95% (Tier 3 cluster)
      Super Metric 2: Infra SLA
        • Purpose: shows that the underlying infra has the resources for all the VMs on it.
        • Target role: VMware Admin
        • Formula: Infra SLA = 100% - Max (Host Cluster, Datastore Cluster)
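The two example formulas can be sketched in a few lines of Python to see how the numbers behave. This is not vCenter Operations super metric syntax. All function names are hypothetical, and reading "Disk = % above threshold latency" as "percentage of latency samples above the tier's threshold" is an assumption, as is treating the Infra SLA inputs as percentages.

```python
# Disk latency thresholds per tier in milliseconds, taken from the slide.
DISK_SLA_MS = {1: 10, 2: 20, 3: 30}
# Lower bound of the good range per tier in percent, taken from the slide.
GOOD_RANGE_PCT = {1: 99.0, 2: 97.0, 3: 95.0}


def pct_above_threshold(latencies_ms, threshold_ms):
    """Assumed reading of the Disk term: share of samples above the threshold."""
    if not latencies_ms:
        return 0.0
    over = sum(1 for value in latencies_ms if value > threshold_ms)
    return 100.0 * over / len(latencies_ms)


def vm_sla(cpu_contention_pct, ram_balloon_pct, disk_latencies_ms,
           packet_drop_pct, tier):
    """VM SLA = 100% - Max(CPU, RAM, Disk, Network); the worst term decides."""
    disk_pct = pct_above_threshold(disk_latencies_ms, DISK_SLA_MS[tier])
    worst = max(cpu_contention_pct, ram_balloon_pct, disk_pct, packet_drop_pct)
    return 100.0 - worst


def infra_sla(host_cluster_pct, datastore_cluster_pct):
    """Infra SLA = 100% - Max(Host Cluster, Datastore Cluster).

    Both inputs are assumed to be percentages; the slide does not define them.
    """
    return 100.0 - max(host_cluster_pct, datastore_cluster_pct)


def in_good_range(sla_pct, tier):
    """True when the SLA value is above the tier's good-range lower bound."""
    return sla_pct > GOOD_RANGE_PCT[tier]


# Hypothetical VM: 19 disk samples at 5 ms and one at 12 ms.
latencies = [5] * 19 + [12]
tier1 = vm_sla(0.5, 0.0, latencies, 0.1, tier=1)
tier3 = vm_sla(0.5, 0.0, latencies, 0.1, tier=3)
print(tier1, in_good_range(tier1, 1))
print(tier3, in_good_range(tier3, 3))
```

Because the formula takes the maximum, a single bad resource dominates the result. The same VM fails the Tier 1 range because one sample exceeded 10 ms, yet passes the Tier 3 range, where the threshold is 30 ms.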
  • 238. Custom Heat Map or Cold Map
      CPU
        • Heat Map: Resource pool: size by CPU utilisation
        • Cold Map: Least utilised VM: size by vCPU count, color by RAM + CPU usage (a Super Metric)
      RAM
        • Heat Map: Most RAM intensive VMs, grouped by ESX. Size by RAM utilisation, color by health
      Disk
        • Heat Map: Most disk intensive VMs, grouped by ESX. Size by disk utilisation, color by health
        • Cold Map: Least utilised disk: size by GB, color by % of free
      Network
        • Heat Map: Most network intensive VMs, grouped by ESX. Size by network utilisation, color by health
        • Cold Map: Most idle VMs, grouped by host
      Capacity
        • Heat Map: VMs with file systems that will run out soon. Color by % left, size by GB left.
      Health
        • Heat Map: VM health, grouped by cluster. Color by health, size by workload.
      Design considerations
        • Use Super Metrics so the info is richer.
        • Group VMs by 1 consistent hierarchy only. If you group by cluster, it won't make sense to further group by datastore, as 1 datastore can span multiple clusters.
  • 239. vCenter: network impact of vCenter Ops
  • 240. Choice of Tools
      vCenter Operations
        • 1 to 15 minute accuracy (for other sources)
        • 5 minute accuracy (for vSphere)
        • The issue does not need to be reproducible, but the problem should last more than 5 minutes, preferably 15 minutes (3 samples).
      vCenter
        • 20 to 300 second accuracy
        • Reproducible performance issue
        • Requirement: you already have some idea what causes it.
      esxtop
        • 2 to 20 second accuracy. Suited to short-burst problems.
        • Reproducible performance issue
        • Requirement: you already know which ESX host and VM have the problem.
      vscsiStats
        • Specific to storage; low-level analysis
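The selection logic on this slide can be written down as a small lookup. This is a rough sketch only: the function and table names are hypothetical, each tool's finest interval is taken from the lower end of the accuracy range on the slide, and "lasts more than 5 minutes" is simplified to "at least 300 seconds". vscsiStats is left out because the slide gives no interval for it.

```python
# (tool name, finest sampling interval in seconds, needs a reproducible issue)
# Intervals are the lower ends of the accuracy ranges quoted on the slide.
TOOLS = [
    ("esxtop", 2, True),
    ("vCenter", 20, True),
    ("vCenter Operations", 300, False),
]


def candidate_tools(problem_duration_s, reproducible):
    """Return the tools whose sampling interval can catch the problem.

    A tool is a candidate when the problem lasts at least one sampling
    interval and, if the tool needs it, the issue can be reproduced.
    """
    return [
        name
        for name, interval_s, needs_repro in TOOLS
        if problem_duration_s >= interval_s and (reproducible or not needs_repro)
    ]


print(candidate_tools(10, reproducible=True))    # short burst
print(candidate_tools(600, reproducible=False))  # long, not reproducible
```

A 10-second reproducible burst leaves only esxtop, while a 10-minute problem that cannot be reproduced leaves only vCenter Operations. The sketch ignores the other requirements on the slide, such as already knowing which ESX host and VM are affected.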