Solving congestion problems in Storage Area Networks (SAN)
Paresh Gupta, Technical Marketing Engineer, Cisco
Ed Mazurek, Technical Leader, Services, Cisco
Nov, 2015
Agenda
Education, Experience, Experiment on new innovations on Cisco MDS and DCNM for solving SAN congestion
Build robust and self-healing storage area networks
Why care about SAN congestion now?
• 16 Gbps FC adoption leading to heterogeneous speeds: ports at 1/2/4/8/16 Gbps are part of the same fabric
• Increased pressure on OpEx: maximize the utilization of existing infrastructure
• Flash storage: pushing network infrastructure to its limits; shift in response time from milliseconds (ms) to microseconds (µs)
• Legacy applications: older HW/SW will be around
• Data explosion leading to scaled-out architecture: increased number of host and storage ports in the same network; Collapsed core → Edge-Core, Edge-Core → Edge-Core-Edge
What is SAN congestion?
Congestion within switches
Congestion between switches
• Ability to switch traffic between all ports at all rates at all frame sizes
• Containing congestion from affecting other ports
• Predictable & consistent performance
• Reliable performance
Congestion by external elements
• Slow Drain (misbehaving host or storage ports)
• Over-utilized Inter Switch Links (ISL)
• Inappropriate oversubscription ratio
Cisco MDS Architecture eliminates congestion within the switch
Frame & credit processing in MDS switch
[Figure: Cisco MDS switch with ports, VOQs and XBAR interfaces on Line Card 1 and Line Card 2, the fabric modules (XBAR), and the Arbiter on the active supervisor; arrows labelled Req, Grant, Frame, R_RDY and credit]
1. Initiator sends FC frame
2. MDS receives the frame in its entirety and stores it
3. Frame is transmitted to the VOQ
4. XBAR interface requests a grant from the Arbiter to transmit the frame to the egress port via XBAR
5. Arbiter grants the request to the XBAR interface to forward the frame; the grant is only sent when the egress port has buffer space available
6. FC frame is forwarded to XBAR, then R_RDY is sent back since the buffer is now free
7. FC frame is forwarded to the egress line card
8. ASIC forwards the frame to the target
9. Credit is returned to the Arbiter
Cisco MDS architecture advantage
• Consistent throughput & latency
• Predictable performance at different traffic loads & type
• Drops corrupt frames by CRC checking at all stages
• Never drops good frames under congestion
• Non-blocking arbitrated crossbar architecture
Education
Understanding Slow Drain
Fibre Channel Flow Control: B2B Credits
• B2B credits are not negotiated – just agreed to
• Each side informs the other side of the number of buffer credits it has
[Figure: the N-port of a storage disk tells the Fibre Channel switch "I have 1 RX B2B credit"; the F-port of the switch answers "OK. I have 3 B2B credits". F-port has three credits, N-port has one credit.]
Fibre Channel Flow Control: Traffic Flow
• MDS Rx buffer queue is decremented by 1 B2B credit for each received frame
• R_RDY is sent to the sender when the frame occupying the buffer is handled
• For each frame sent, an R_RDY (B2B credit) should be returned
• R_RDYs are not sent reliably – they can be corrupted/lost
[Figure: the storage disk N-port sends Frame1, Frame2 and Frame3 to the switch F-port; the switch returns an R_RDY as a buffer is freed]
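The credit accounting described on this slide can be sketched in a few lines of Python. This is a minimal model, not MDS code: the class name `B2BSender` and its methods are hypothetical, and it assumes the credit count agreed at login stays fixed and that no R_RDY is lost.

```python
from typing import List


class B2BSender:
    """Transmit side of one FC link under buffer-to-buffer credit flow control.

    Hypothetical illustration: names are not taken from any Cisco API.
    """

    def __init__(self, peer_rx_credits: int) -> None:
        # Number of receive buffers the peer advertised at login.
        self.max_credits = peer_rx_credits
        self.tx_credits = peer_rx_credits

    def can_send(self) -> bool:
        return self.tx_credits > 0

    def send_frame(self) -> None:
        # Every transmitted frame consumes exactly one credit.
        if not self.can_send():
            raise RuntimeError("0 Tx credits: sender must wait for R_RDY")
        self.tx_credits -= 1

    def receive_r_rdy(self) -> None:
        # The peer freed one buffer; never exceed the advertised count.
        if self.tx_credits < self.max_credits:
            self.tx_credits += 1


def demo_credit_history() -> List[int]:
    """Send three frames to a peer with three credits, then get one R_RDY."""
    sender = B2BSender(peer_rx_credits=3)
    history = []
    for _ in range(3):
        sender.send_frame()
        history.append(sender.tx_credits)
    sender.receive_r_rdy()
    history.append(sender.tx_credits)
    return history
```

After the third frame the sender sits at zero credits and cannot transmit until an R_RDY arrives, which is the condition the slow-drain counters later in the deck measure.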
Lossless Fibre Channel fabric
• Disk 1 sends a frame to Server 1
• Switch 1 sends R_RDY after it transmits the frame to Switch 2
• Switch 2 sends R_RDY after it transmits the frame to Server 1
• Server 1 sends R_RDY after the frame is consumed by the HBA
[Figure: Disk 1, Switch 1, Switch 2 and Server 1 in a chain; frames flow toward Server 1 and an R_RDY flows back on each link]
Lossless Fibre Channel fabric
• Server 1 cannot process frames → does not return R_RDY
• No available B2B credits on the port connected to Server 1
• No available B2B credits on ISL ports
• Disk 1 stops transmitting → fabric becomes lossless
[Figure: the same chain with every buffer between Disk 1 and Server 1 holding a frame; back pressure propagates from Server 1 through Switch 2 and Switch 1 to Disk 1]
Slow Drain situation
• B2B credits exhausted on ISL
• No R_RDY sent to Disk 1 or to Disk 2
• Effect of ‘slow Server 1’ on the flow Disk 2 → Server 2
[Figure: Disk 1 and Disk 2 attach to Switch 1, Server 1 and Server 2 attach to Switch 2; buffers on the shared ISL are full of frames and back pressure reaches both disks]
Slow Drain situation
• One slow device impacts all other devices sharing the same switches and ISL
• Unpredictability of slow drain devices
[Figure: the same topology with Server 1 marked as the slow node and Disk 1, Disk 2 and Server 2 marked as impacted nodes]
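The effect of one slow node on an unrelated flow can be reproduced with a toy simulation. The sketch below is an illustration under simplifying assumptions that are not in the slides: one frame crosses the ISL per tick, the two flows alternate, Server 2 accepts frames immediately, and Server 1 never returns an R_RDY. The function and parameter names are hypothetical.

```python
from typing import List


def simulate_shared_isl(ticks: int, isl_buffers: int = 4,
                        slow_server_credits: int = 1) -> List[int]:
    """Return cumulative frames delivered to healthy Server 2 after each tick."""
    stuck_for_server1 = 0                  # frames parked in Switch 2's ISL buffers
    server1_credits = slow_server_credits  # never replenished (no R_RDY from Server 1)
    delivered_to_server2 = 0
    history = []
    for tick in range(ticks):
        # Assumption: Disk 1 and Disk 2 frames alternate on the ISL.
        destination = "server1" if tick % 2 == 0 else "server2"
        if stuck_for_server1 < isl_buffers:    # the ISL still has a free credit
            if destination == "server2":
                delivered_to_server2 += 1      # forwarded at once, buffer freed
            elif server1_credits > 0:
                server1_credits -= 1           # last frame Server 1 accepts
            else:
                stuck_for_server1 += 1         # buffer stays occupied
        # Otherwise the ISL is out of credits and neither flow can move.
        history.append(delivered_to_server2)
    return history
```

With four ISL buffers and one credit on the slow port, the Disk 2 to Server 2 flow delivers four frames and then stalls for good, even though Server 2 itself is healthy.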
Reasons for Slow Drain
• Edge devices
  • Server performance problems: application or OS
  • Host bus adapter (HBA) problems: driver or physical failure
  • Speed mismatches: one fast device and one slow device
  • Non-graceful virtual machine exit on a virtualized server, resulting in packets held in HBA buffers
  • Storage subsystem performance problems, including overload
• Inter Switch Links (ISL)
  • The existence of slow drain edge devices
  • Lack of B2B credits for the distance the ISL is traversing
    • Ex: 4 credits per km @ 8 Gbps
  • Edge devices with faster speeds than ISLs, even when port-channeled
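The distance rule of thumb above (4 credits per km at 8 Gbps) is simple arithmetic. The helper below is a hedged sketch: the function name is hypothetical, only the 8 Gbps figure comes from the slide, and rates for other speeds must be supplied by the caller rather than assumed.

```python
import math

# Rule of thumb quoted on the slide for an 8 Gbps ISL.
CREDITS_PER_KM_AT_8G = 4


def isl_credits_needed(distance_km: float,
                       credits_per_km: float = CREDITS_PER_KM_AT_8G) -> int:
    """Minimum B2B credits to keep an ISL of the given length busy.

    credits_per_km defaults to the 8 Gbps value; pass another rate
    explicitly for other speeds (none is given in the source).
    """
    if distance_km < 0 or credits_per_km <= 0:
        raise ValueError("distance must be >= 0 and rate must be > 0")
    return math.ceil(distance_km * credits_per_km)
```

For example, `isl_credits_needed(10)` gives 40 credits for a 10 km link at 8 Gbps; an ISL configured with fewer credits would spend part of its time waiting for R_RDY even without any slow device in the fabric.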
Cisco MDS & DCNM
Slow Drain Advantage
Detection Troubleshooting Automatic Recovery
MDS & DCNM Slow Drain Advantage
Detection and Troubleshooting
• Slow Port
• Stuck Port
• Slow Port Monitoring
• Credit transition to zero
• Credit and remaining credit
• Info of dropped frames
• See frames in ingress Q
• OBFL logging
• History graph
• HW Assisted
• TXWait period for frames
• LR Rcvd B2B
• Enhancements in 6.2(9) and 6.2(13)
DCNM, from 7.1(1)
• Fabric-wide visibility
• Automatic collection and graphical display of counters
• Reduced false positives
Automatic Recovery
Step 0: Prevent head-of-line blocking
Step 1: Alert only – manual recovery
Step 2: Frame in switch > congestion-drop timeout? Drop it!
Step 3: Frame in egress queue > no-credit-drop timeout? Drop it!
Step 4: Send Link Reset (LR) or flap the port (part of FC standard)
Step 5: Link flap
Step 6: Shut down the port
Virtual Output queues
Stuck Port Recovery
Port flap *
Congestion drop
No-credit-drop
Detection
1 ms
Action
Immediate
6.2(9)
SNMP Trap *
Error disable port*
6.2(13)
Enhanced
6.2(13)
Enhanced
6.2(13)
Enhanced
* = using Port Monitor
MDS & DCNM Slow Drain Advantage
Slow Port Monitoring (from 6.2(9))
• Shows real-time delay of data traffic on all ports
• Measures the duration for which frames could not be transmitted out of a port due to unavailability of transmit B2B credits
• Monitoring at as low as 1 ms
• Hardware assisted! No overhead on CPU
• Recommendation: Always turn it on!
mds9700(config)# system timeout slowport-monitor ?
<1-500> Configure number of milliseconds
default Default timeout value for HW slowport monitoring
mds9700(config)# system timeout slowport-monitor default ?
mode Enter the port mode
mds9700(config)# system timeout slowport-monitor default mode ?
E E mode
F F mode
Understanding Slowport Monitor output
Mds9706# show process creditmon slowport-monitor-events
Module: 01 Slowport Detected: YES
=====================================================================
Interface = fc1/18
------------------------------------------------------------
| admin | slowport | oper | Timestamp
| delay | detection | delay |
| (ms) | count | (ms) |
------------------------------------------------------------
| 1 | 0 | 9 | Wed Jul 2 19:47:35.038 2014
| 1 | 128 | 9 | Wed Jul 2 19:47:19.922 2014
| 1 | 127 | 4 | Wed Jul 2 19:47:19.618 2014
| 1 | 119 | 10 | Wed Jul 2 19:47:19.518 2014
| 1 | 109 | 10 | Wed Jul 2 19:47:19.418 2014
| 1 | 101 | 10 | Wed Jul 2 19:47:19.318 2014
| 1 | 100 | 4 | Wed Jul 2 19:47:19.118 2014
| 1 | 93 | 10 | Wed Jul 2 19:47:19.017 2014
| 1 | 83 | 10 | Wed Jul 2 19:47:18.917 2014
| 1 | 74 | 12 | Wed Jul 2 19:47:18.818 2014
• admin delay (ms): configured delay via slowport-monitor
• slowport detection count: number of times the delay was detected
• oper delay (ms): actual delay seen by the port
• Timestamp: when the delay was observed
From 6.2(9)
Slow Port Monitoring displays R_RDY delay in real time and stores it in the logging buffer
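For offline analysis, the event rows can be pulled out of captured CLI text. The parser below is a sketch that assumes the row layout shown in the sample output above (four pipe-separated columns, timestamp in the form `Wed Jul 2 19:47:35.038 2014`); the names `SlowportEvent` and `parse_slowport_events` are hypothetical, and real output may differ between releases.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class SlowportEvent:
    admin_delay_ms: int
    detection_count: int
    oper_delay_ms: int
    timestamp: datetime


def parse_slowport_events(output: str) -> List[SlowportEvent]:
    """Extract data rows; header and separator lines are skipped."""
    events = []
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 5:
            continue
        try:
            admin, count, oper = (int(cell) for cell in cells[1:4])
            stamp = datetime.strptime(cells[4], "%a %b %d %H:%M:%S.%f %Y")
        except ValueError:
            continue  # header rows such as "| admin | slowport | oper | Timestamp"
        events.append(SlowportEvent(admin, count, oper, stamp))
    return events


SAMPLE = """\
| admin | slowport | oper | Timestamp
| delay | detection | delay |
| 1 | 0 | 9 | Wed Jul 2 19:47:35.038 2014
| 1 | 128 | 9 | Wed Jul 2 19:47:19.922 2014
| 1 | 119 | 10 | Wed Jul 2 19:47:19.518 2014
"""
```

`max(e.oper_delay_ms for e in parse_slowport_events(SAMPLE))` then gives the worst delay seen in the captured window.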
Slowport monitor integration with Port-Monitor (from 6.2(13))
[Figure: operational delay (ms) plotted against time (seconds); an event fires when the delay crosses the threshold, and the action is an SNMP trap]
PMON configuration: TX-Slowport-Oper-Delay (default, from 6.2(13))
• Monitoring interval: 1 second
• Threshold type: Absolute (delay value in ms)
• Rising threshold: 50 ms
• Falling threshold: 0 ms
• Action: Trap and syslog
What this means in English
Event: “If a port remains at zero TX B2B credits for a continuous span of 50 ms within a 1-second polling interval”
Action: Generate an SNMP trap and syslog
Understanding TXWait
• Hardware counter with nanosecond visibility
• Increments every 2-3 ns while the port is at 0 Tx credits and there are frames queued for transmit
• Reported in units of 2.5 µs
• txwait * 2.5 / 1000000 = seconds of time the port was unable to transmit
• Why is it reported in 2.5 µs ticks?
  • Because of FICON requirements
  • Nanoseconds are too fine-grained to interpret
• Example: 5642973696 * 2.5 / 1000000 = 14107 seconds
  • MDS was not able to transmit for around 14107 seconds since the counter was last cleared
mds9710-1# show interface fc1/1 counters | include wait
5642973696 2.5us Tx waits due to lack of transmit credits
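The tick-to-seconds conversion is easy to script when comparing counters across many ports. A minimal sketch, using only the 2.5 µs tick size stated on the slide (the function name is hypothetical):

```python
TXWAIT_TICK_US = 2.5  # each TxWait count represents 2.5 microseconds


def txwait_seconds(ticks: int) -> float:
    """Convert a raw TxWait counter value into seconds at zero Tx credits."""
    return ticks * TXWAIT_TICK_US / 1_000_000
```

For the counter shown above, `txwait_seconds(5642973696)` is about 14107.4 seconds, which matches the worked figure on the slide. Because the counter is cumulative since it was last cleared, the number only becomes meaningful when compared with how long ago that was.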
Percentage of TxWait
• An intuitive way of reporting how long frames could not be transmitted.
• In the output below, frames could not be transmitted out of port fc1/13 for 1% of the last 1 second, 5% of the last 1 minute, and so on, due to lack of transmit B2B credits
MDS9700# show interface fc1/13 counters
<snip>
5 Transmit B2B credit transitions to zero
2 Receive B2B credit transitions to zero
0 2.5us TxWait due to lack of transmit credits
Percentage Tx credits not available for last 1s/1m/1h/72h: 1%/5%/3%/2%
32 receive B2B credit remaining
128 transmit B2B credit remaining
128 low priority transmit B2B credit remaining
<snip>
TXWait – Health report of port
• Graphical display of time when credits were not available
• 3 graphs per port
  • Last 60 seconds
  • Last 60 minutes
  • Last 72 hours
• Top 3 rows (read vertically): actual txwait in ms
• Middle 10 rows: graph plotted using #
• Bottom 2 rows (last 60 seconds)
• Example: at the 15th second, TXWAIT = 989 ms; at the 35th second, TXWAIT = 752 ms
mds9710-1# show process creditmon txwait-history
TxWait history for port fc1/13:
==============================
79998 79993 999999
08887 58882 9899999
000000000000299870000000000000000029994000000000000362999500
1000 ### ### ######
900 #### ### ######
800 #### #### ######
700 ##### #### ######
600 ##### #### ######
500 ##### #### ######
400 ##### #### ######
300 ##### ##### ######
200 ##### ##### ######
100 ##### ##### #######
0....5....1....1....2....2....3....3....4....4....5....5....6
0 5 0 5 0 5 0 5 0 5 0
Credit Not Available per second (last 60 seconds)
# = TxWait (ms)
From
6.2(13)
OBFL – Granular, long duration reporting
• The TXWAIT delta value is logged into OBFL periodically (every 20 seconds) if the delta value is >= 100 ms.
• Displays TXWAIT time in 2.5 µs ticks as well as in seconds.
• The congestion value is displayed as a percentage over a period of 20 seconds.
• The timestamp of the event occurrence is also recorded.
switch# show logging onboard txwait
Notes:
- sampling period is 20 seconds
- only txwait delta value >= 100 ms are logged
---------------------------------
Module: 1 txwait count
---------------------------------
-----------------------------------------------------------------------------
| Interface | Delta TxWait Time | Congestion | Timestamp |
| | 2.5us ticks | seconds | | |
-----------------------------------------------------------------------------
| fc1/11 | 3435973 | 08 | 42% | Sun Sep 30 05:23:05 2001 |
| fc1/11 | 6871947 | 17 | 85% | Sun Sep 30 05:22:25 2001 |
From
6.2(13)
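The seconds and congestion columns in the OBFL output follow from the tick delta and the 20-second sampling period. The sketch below reproduces the two sample rows; truncating to whole numbers is an assumption made to match the displayed values, and the function names are hypothetical.

```python
from typing import Tuple

TICK_US = 2.5            # TxWait tick size
SAMPLE_PERIOD_S = 20     # OBFL sampling period from the notes above
LOG_THRESHOLD_MS = 100   # only deltas >= 100 ms are logged


def obfl_columns(delta_ticks: int) -> Tuple[int, int]:
    """Return (whole seconds, whole percent congestion) for one sample."""
    seconds = delta_ticks * TICK_US / 1_000_000
    percent = 100 * seconds / SAMPLE_PERIOD_S
    # Assumption: the CLI truncates rather than rounds.
    return int(seconds), int(percent)


def would_be_logged(delta_ticks: int) -> bool:
    """True when the delta reaches the 100 ms logging threshold."""
    return delta_ticks * TICK_US / 1000 >= LOG_THRESHOLD_MS
```

`obfl_columns(3435973)` gives 8 seconds and 42%, and `obfl_columns(6871947)` gives 17 seconds and 85%, in line with the fc1/11 rows shown above.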
PMON configuration: TXWait (default)
• Monitoring interval: 1 second
• Threshold type: Delta
• Rising threshold: 40% (translates to 400 ms with a 1-second monitoring interval)
• Falling threshold: 0%
• Action: Trap and syslog
What this means in English
Event: “If the sum of all the durations (with ns granularity) when the port was at 0 TX credits exceeds 400 ms in a 1-second polling interval”
Action: Generate an SNMP trap and syslog
Congestion Drop timeout
• MDS timestamps each received frame
• A frame is dropped if it cannot be delivered to the egress port within the timeout
• The drop is logged
• Can be configured from 100 ms to 500 ms (500 ms default)
• Lowering the timeout drops frames sooner and reduces the effects of slow drain devices
[Figure: an MDS switch with ingress buffers full of frames waiting for an egress port that has no credits]
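The timestamp-and-age check can be pictured as follows. This is an illustrative sketch, not the switch's implementation: frames are modelled as `(frame_id, arrival_ms)` tuples, the function name is hypothetical, and dropping only when the age strictly exceeds the timeout is an assumption.

```python
from typing import List, Tuple

MIN_TIMEOUT_MS = 100
MAX_TIMEOUT_MS = 500  # also the default, per the slide


def drop_expired(frames: List[Tuple[str, float]], now_ms: float,
                 timeout_ms: int = MAX_TIMEOUT_MS) -> Tuple[List[str], List[str]]:
    """Split queued frames into (kept, dropped) by age since arrival."""
    if not MIN_TIMEOUT_MS <= timeout_ms <= MAX_TIMEOUT_MS:
        raise ValueError("congestion-drop timeout must be 100-500 ms")
    kept, dropped = [], []
    for frame_id, arrival_ms in frames:
        age_ms = now_ms - arrival_ms
        (dropped if age_ms > timeout_ms else kept).append(frame_id)
    return kept, dropped
```

With frames that arrived at 0, 350 and 600 ms and the clock at 700 ms, the default 500 ms timeout drops only the oldest frame, while a 200 ms timeout drops the first two. That is the sense in which a lower timeout frees buffers sooner.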
no-credit-drop timeout (enhanced since 6.2(9))
• Frames are dropped in the egress queue if credits are unavailable for the no-credit-drop timeout
[Figure: Disk 1 and Disk 2 send through MDS 1 and MDS 2 to Server 1 and Server 2; frames are dropped from the egress queue of the slow port, back pressure is released on each hop, and R_RDYs flow again]
Troubleshooting Slow Drain
Methodology
Cisco recommends troubleshooting slow drain in the following order:
Level 3: Extreme Delay
Level 2: Retransmission
Level 1: Latency
Troubleshooting Slow Drain
Methodology – Follow Congestion to Source
• If Rx congestion, then find ports communicating with this port that have Tx congestion
  • Zoning defines which devices communicate with this port
  • Understand the topology
• If the port communicating with the port showing Rx congestion is FCIP
  • Check for TCP retransmits
  • Check for overutilization of FCIP
[Figure: a switch with an F port and an E port; one side shows "Rx Credits 0 Remaining", the other "Tx Credits 0 Remaining", with an arrow marking the direction of congestion]
Troubleshooting Slow Drain
Methodology – Follow Congestion to Source
• If Tx congestion is found
  • If F port, then the attached device is the slow drain device; if not,
  • If E port, then go to the adjacent switch and continue troubleshooting
• Continue to track through the fabric until the destination F port is discovered
[Figure: two switches joined by E ports, each with an F port; "Rx Credits 0 Remaining" and "Tx Credits 0 Remaining" mark the congested ports along the path]
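The follow-the-congestion walk can be expressed as a small graph traversal. The sketch below is illustrative only: the `Port` record, the `fabric` dictionary and the switch names are hypothetical stand-ins for data an operator would collect from interface counters and topology, not output of any Cisco tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Port:
    name: str
    mode: str           # "F" (edge device) or "E" (ISL)
    tx_congested: bool  # e.g. Tx credits at zero / TxWait rising
    neighbor: str       # attached device for F, adjacent switch for E


def find_slow_drain_devices(fabric: Dict[str, List[Port]],
                            start_switch: str) -> List[str]:
    """Follow Tx congestion across E ports until congested F ports are found."""
    culprits, visited, todo = [], set(), [start_switch]
    while todo:
        switch = todo.pop()
        if switch in visited:
            continue
        visited.add(switch)
        for port in fabric.get(switch, []):
            if not port.tx_congested:
                continue
            if port.mode == "F":
                culprits.append(port.neighbor)   # attached device is slow
            elif port.mode == "E":
                todo.append(port.neighbor)       # continue on adjacent switch
    return culprits


EXAMPLE_FABRIC = {
    "switch1": [Port("fc1/1", "F", False, "Disk1"),
                Port("fc1/2", "E", True, "switch2")],
    "switch2": [Port("fc1/1", "E", False, "switch1"),
                Port("fc1/2", "F", True, "Server1"),
                Port("fc1/3", "F", False, "Server2")],
}
```

Starting at `switch1`, the walk crosses the congested ISL and stops at the congested F port on `switch2`, naming `Server1` as the slow drain device.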
Port-monitor Alerting
• Port-monitor allows monitoring of several counters relating to slow drain
  • credit-loss-reco: credit loss recovery counter
  • lr-rx: the number of link resets received by the fc-port
  • lr-tx: link resets transmitted by the fc-port
  • timeout-discards: timeout discards counter
  • tx-credit-not-available: credit not available counter (in 100 ms increments)
  • tx-discards: Tx discards counter
  • slowport-count: number of slowport events
  • slowport-oper-delay: slowport operational delay
  • txwait: amount of time at 0 Tx credits with packets queued
New!
Note: There are other counters that are valuable and should also be considered for inclusion in monitoring, but they are not part of slow drain
RMON event severities
• Event indicates severity in alert
  • 1 – Fatal
  • 2 – Critical
  • 3 – Error
  • 4 – Warning
  • 5 – Informational
Categorize counters as different severities for better visual impact in DCNM
mds9513(config-port-monitor)# show rmon events
Event 1 is active, owned by PMON@FATAL
Description is FATAL(1)
Event firing causes log and trap to community public, last fired never
Event 2 is active, owned by PMON@CRITICAL
Description is CRITICAL(2)
Event firing causes log and trap to community public, last fired never
Event 3 is active, owned by PMON@ERROR
Description is ERROR(3)
Event firing causes log and trap to community public, last fired never
Event 4 is active, owned by PMON@WARNING
Description is WARNING(4)
Event firing causes log and trap to community public, last fired 2014/02/21-17:13:11
Event 5 is active, owned by PMON@INFO
Description is INFORMATION(5)
Event firing causes log and trap to community public, last fired 2014/03/08-08:25:19
Separate policies or single policy
• Port-monitor allows separate policies
  • F, FL ports (access)
  • E, TL ports (trunks)
  • Both F ports and E ports
• Only one policy type per port can be active at a time
• Note: port-type access includes F port connections to NPV switches that can carry several logins
• Note: NP ports are not currently monitored
MDS9513(config-port-monitor)# port-type ?
access-port Configure port-monitoring for access ports
all Configure port-monitoring for all ports
trunks Configure port-monitoring for trunk ports
Command parameters
• counter <name> poll-interval <interval> delta rising-threshold <rthresh> event <id> falling-threshold <fthres> event <id> <portguard errordisable | flap>
  • poll-interval – Seconds – How often should this counter be checked?
  • delta – Compare the current value with the value at the previous poll interval
  • absolute – Match the actual value
  • rising-threshold – How much the counter must increase in this poll interval to trigger
  • event – Indicates severity of alert: info, warning, error, etc.
  • falling-threshold – How much the counter must decrease in this poll interval to reset
  • portguard – Optional – Action to take when rising-threshold is reached
    • errordisable – Place the port in error-disable state. Requires manual shut/no shut to re-activate
    • flap – shut/no shut the port
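How delta versus absolute thresholds and the rising/falling pair interact can be sketched as a small state machine. This is a hedged model of the usual RMON-style hysteresis (a rising alert fires once and re-arms only after the value reaches the falling threshold), not the port-monitor implementation; the class `CounterPolicy` and its method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CounterPolicy:
    name: str
    rising: int
    falling: int
    delta: bool = True  # True: compare change per poll; False: absolute value
    _previous: Optional[int] = field(default=None, repr=False)
    _alarmed: bool = field(default=False, repr=False)

    def poll(self, raw_value: int) -> Optional[str]:
        """Feed one sample per poll interval; return 'rising', 'falling' or None."""
        if self.delta:
            # First sample has no predecessor, so the change is taken as 0.
            value = 0 if self._previous is None else raw_value - self._previous
            self._previous = raw_value
        else:
            value = raw_value
        if not self._alarmed and value >= self.rising:
            self._alarmed = True
            return "rising"
        if self._alarmed and value <= self.falling:
            self._alarmed = False
            return "falling"
        return None
```

For a delta policy with rising threshold 5 and falling threshold 1, cumulative samples of 0, 2, 9, 12, 12 produce a single rising event at the third poll and a falling event at the fifth, so a counter that stays high does not generate an alert on every interval.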
Port-monitor alerting – Example
port-monitor name AllPorts
port-type all
no monitor counter link-loss
no monitor counter sync-loss
no monitor counter signal-loss
no monitor counter invalid-words
no monitor counter invalid-crc
counter tx-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter lr-rx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter lr-tx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter timeout-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter credit-loss-reco poll-interval 60 delta rising-threshold 1 event 2 falling-threshold 0 event 2
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
no monitor counter rx-datarate
no monitor counter tx-datarate
no monitor counter err-pkt-from-port
no monitor counter err-pkt-to-xbar
no monitor counter err-pkt-from-xbar
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 4 falling-threshold 0 event 4
counter txwait poll-interval 1 delta rising-threshold 20 event 4 falling-threshold 0 event 4
port-monitor activate AllPorts
Notes:
• "port-type all": the policy applies to Access (F) and Trunk (E) ports
• Counters listed with "no monitor counter" are not monitored
• The above monitors 9 slow drain counters and does not monitor 10 others
• Event 2 – Critical
• Event 3 – Error
• Event 4 – Warning
MDS9710-1# show port-monitor AllPorts
Policy Name : AllPorts
Admin status : Not Active
Oper status : Not Active
Port type : All Ports
---------------------------------------------------------------------------------------------------------
Counter Threshold Interval Rising Threshold event Falling Threshold event PMON Portguard
------- --------- -------- ---------------- ----- ------------------ ----- --------------
TX Discards Delta 60 50 3 10 3 Not enabled
LR RX Delta 60 5 2 1 2 Not enabled
LR TX Delta 60 5 2 1 2 Not enabled
Timeout Discards Delta 60 50 3 10 3 Not enabled
Credit Loss Reco Delta 60 1 2 0 2 Not enabled
TX Credit Not Available Delta 1 10% 4 0% 4 Not enabled
slowport-count Delta 1 5 4 0 4 Not enabled
slowport-oper-delay Absolute 1 50ms 4 0ms 4 Not enabled
txwait Delta 1 20% 4 0% 4 Not enabled
----------------------------------------------------------------------------------------------------------
Activation and output
DCNM event log
Port-monitor
DCNM Demo
Guidance on configuration
• Configure slowport-monitor @ 10-25ms for both E & F ports
  • system timeout slowport-monitor 10 mode e
  • system timeout slowport-monitor 10 mode f
• Configure congestion-drop on F ports
  • system timeout congestion-drop 200ms mode f
• Configure no-credit-drop on F ports
  • system timeout no-credit-drop <ms> mode f
  • 200ms – safe, 100ms – aggressive, 50ms – very aggressive
• Configure port-monitor policy(s)
  • Use samples included in the port-monitor section
Refining no-credit-drop timeout
Step 1: Enable Slowport Monitoring on all devices (No performance impact!)
Step 2: Monitor end device performance (R_RDY delay)
  Either “show process creditmon slowport-monitor-events”
  Or better, “show logging onboard slowport-monitor-events”
Step 3: Define typical R_RDY delay on slow ports (average, peak, variance, etc.)
Step 4: Use the (typical value + variance) as the no-credit-drop timeout
Result: Automatic recovery the moment a port sees R_RDY delay more than ‘typical’
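Steps 3 and 4 can be sketched with the standard library. The helper below is one possible reading, not a Cisco formula: it takes "typical value" as the mean of observed oper-delay samples and "variance" as their standard deviation (so the result stays in milliseconds), both of which are assumptions, and the function name is hypothetical.

```python
import math
import statistics
from typing import List


def suggest_no_credit_drop_ms(oper_delays_ms: List[int]) -> int:
    """Suggest a timeout as typical delay plus spread, rounded up to whole ms."""
    typical = statistics.mean(oper_delays_ms)
    # Assumption: "variance" on the slide is read as standard deviation.
    spread = statistics.pstdev(oper_delays_ms)
    return math.ceil(typical + spread)


# Oper-delay column from the slowport-monitor sample shown earlier in the deck.
SAMPLE_DELAYS_MS = [9, 9, 4, 10, 10, 10, 4, 10, 10, 12]
```

For the sample delays this yields 12 ms. Any such value should still be checked against the platform's supported range and the safe/aggressive guidance given earlier before it is configured.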
Predicting slow drain: Fabric Benchmarking
• Find the delay values on ports for acceptable application performance
• Upside variance of the delay value may lead to degraded application performance
• Use the following for fabric benchmarking
  • DCNM slow drain analysis
  • MDS Slowport-monitor
  • MDS TxWait health graph
  • MDS TxWait percentage congestion
  • Slowdrain SNMP MIBs
  • Port-monitor alerts
Fabric Benchmarking using slowport-monitor and TxWait
• slowport-monitor at 10ms on E & F ports
• congestion-drop on F ports at 200ms
• no-credit-drop on F ports (200ms – safe, 100ms – aggressive, 50ms – very aggressive)
• Configure port-monitor policy(s)
Education, Experience, Experiment
Use Cisco MDS & DCNM to build robust and self-healing storage area networks
Resources
• YouTube Videos
• Understanding Slow Drain: Detection, Troubleshooting & Automatic Recovery:
https://www.youtube.com/watch?v=wEz3z6NLaBU&list=PL_ju2fKFbFzVMZgXAHV9kZ6FT93BuG0eB
• Detecting and Troubleshooting Slow Drain using Cisco Prime DCNM:
https://www.youtube.com/watch?v=tijVaIatQgQ
• White Paper on “Slow Drain Device Detection and Congestion Avoidance”:
http://www.cisco.com/c/en/us/products/collateral/storage-networking/mds-9700-series-multilayer-directors/white_paper_c11-729444.html
• Cisco Live Session: BRKSAN-3446 - SAN Congestion! Understanding, Troubleshooting, Mitigating in a
Cisco Fabric (2015 San Diego)
https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=83668&backBtn=true
• Solving Congestion problems in SAN (Feb 2015, enhancements till NX-OS 6.2(9)):
http://www.slideshare.net/Ciscodatacenter/solving-congestion-problems-in-storage-area-networks
• Generation 4 Slow Drain Counters commands and troubleshooting:
http://www.cisco.com/c/en/us/support/docs/storage-networking/mds-9509-multilayer-director/116098-trouble-gen4-00.html
• MDS 9148 Slow Drain Counters and Commands:
http://www.cisco.com/c/en/us/support/docs/storage-networking/mds-9100-series-multilayer-fabric-switches/116401-trouble-mds9148-00.html
Slow Drain Reference
Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015

Contenu connexe

Tendances

Ericsson TN Cards in Details
Ericsson TN Cards in DetailsEricsson TN Cards in Details
Ericsson TN Cards in Detailsibrahimnabil17
 
Huawei XPIC Hardware Connection
Huawei XPIC Hardware ConnectionHuawei XPIC Hardware Connection
Huawei XPIC Hardware Connectionibrahimnabil17
 
Different Types of Backhaul
Different Types of BackhaulDifferent Types of Backhaul
Different Types of Backhaul3G4G
 
vdocuments.net_sp420-technical-description (1).pdf
vdocuments.net_sp420-technical-description (1).pdfvdocuments.net_sp420-technical-description (1).pdf
vdocuments.net_sp420-technical-description (1).pdfgebreyesusweldegebri2
 
LTE network: How it all comes together architecture technical poster
LTE network: How it all comes together architecture technical posterLTE network: How it all comes together architecture technical poster
LTE network: How it all comes together architecture technical posterDavid Swift
 
IP Mobile Backhaul Presentation
IP Mobile Backhaul PresentationIP Mobile Backhaul Presentation
IP Mobile Backhaul PresentationAviat Networks
 
LTE Testing | 4G Testing
LTE Testing | 4G TestingLTE Testing | 4G Testing
LTE Testing | 4G TestingIxia
 
SDH/SONET alarms & performance monitoring
SDH/SONET alarms & performance monitoringSDH/SONET alarms & performance monitoring
SDH/SONET alarms & performance monitoringMapYourTech
 
Fundamentals of sdh
Fundamentals of sdhFundamentals of sdh
Fundamentals of sdhsreejithkt
 
Microwave Huawei RTN Hardware Structure
Microwave Huawei RTN Hardware StructureMicrowave Huawei RTN Hardware Structure
Microwave Huawei RTN Hardware Structureibrahimnabil17
 
52528672 microwave-planning-and-design
52528672 microwave-planning-and-design52528672 microwave-planning-and-design
52528672 microwave-planning-and-designfat_zeq
 

Tendances (20)

Ericsson TN Cards in Details
Ericsson TN Cards in DetailsEricsson TN Cards in Details
Ericsson TN Cards in Details
 
Huawei XPIC Hardware Connection
Huawei XPIC Hardware ConnectionHuawei XPIC Hardware Connection
Huawei XPIC Hardware Connection
 
IPRAN BASICS.pdf
IPRAN BASICS.pdfIPRAN BASICS.pdf
IPRAN BASICS.pdf
 
Sdh concept
Sdh conceptSdh concept
Sdh concept
 
SICAM AK 3 automation
SICAM AK 3 automationSICAM AK 3 automation
SICAM AK 3 automation
 
Different Types of Backhaul
Different Types of BackhaulDifferent Types of Backhaul
Different Types of Backhaul
 
Introduction to LTE
Introduction to LTEIntroduction to LTE
Introduction to LTE
 
Bsc configuration
Bsc configurationBsc configuration
Bsc configuration
 
ZTE BTS Manual
ZTE BTS ManualZTE BTS Manual
ZTE BTS Manual
 
vdocuments.net_sp420-technical-description (1).pdf
vdocuments.net_sp420-technical-description (1).pdfvdocuments.net_sp420-technical-description (1).pdf
vdocuments.net_sp420-technical-description (1).pdf
 
lte advanced
lte advancedlte advanced
lte advanced
 
LTE network: How it all comes together architecture technical poster
LTE network: How it all comes together architecture technical posterLTE network: How it all comes together architecture technical poster
LTE network: How it all comes together architecture technical poster
 
IP Mobile Backhaul Presentation
IP Mobile Backhaul PresentationIP Mobile Backhaul Presentation
IP Mobile Backhaul Presentation
 
LTE Testing | 4G Testing
LTE Testing | 4G TestingLTE Testing | 4G Testing
LTE Testing | 4G Testing
 
E1 To Stm
E1 To Stm E1 To Stm
E1 To Stm
 
SDH/SONET alarms & performance monitoring
SDH/SONET alarms & performance monitoringSDH/SONET alarms & performance monitoring
SDH/SONET alarms & performance monitoring
 
Bhb sicam ak_eng
Bhb sicam ak_engBhb sicam ak_eng
Bhb sicam ak_eng
 
Fundamentals of sdh
Fundamentals of sdhFundamentals of sdh
Fundamentals of sdh
 
Microwave Huawei RTN Hardware Structure
Microwave Huawei RTN Hardware StructureMicrowave Huawei RTN Hardware Structure
Microwave Huawei RTN Hardware Structure
 
52528672 microwave-planning-and-design
52528672 microwave-planning-and-design52528672 microwave-planning-and-design
52528672 microwave-planning-and-design
 

Similaire à Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015

Oow2007 performance
Oow2007 performanceOow2007 performance
Oow2007 performanceRicky Zhu
 
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance ConsiderationsShawn Wells
 
Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2...
Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2...Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2...
Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2...Scalar, Inc.
 
Presentation deploying cloud based services
Presentation   deploying cloud based servicesPresentation   deploying cloud based services
Presentation deploying cloud based servicesxKinAnx
 
Scalar DB: Universal Transaction Manager
Scalar DB: Universal Transaction ManagerScalar DB: Universal Transaction Manager
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015

  • 1. Solving congestion problems in Storage Area Networks (SAN). Paresh Gupta, Technical Marketing Engineer, Cisco. Ed Mazurek, Technical Leader, Services, Cisco. Nov 2015
  • 2. Agenda: Build robust and self-healing storage area networks
  • 3. Agenda: Education, Experience, Experiment. Build robust and self-healing storage area networks
  • 4. Agenda: Education, Experience, Experiment on new innovations on Cisco MDS and DCNM for solving SAN congestion. Build robust and self-healing storage area networks
  • 5. Why care about SAN congestion now? • 16 Gbps FC adoption leading to heterogeneous speeds: ports at 1/2/4/8/16 Gbps part of the same fabric • Increased pressure on OpEx: maximize the utilization of existing infrastructure • Flash storage: pushing network infrastructure to its limits; shift in response time from milliseconds (ms) to microseconds (µs) • Legacy applications: older HW/SW will be around • Data explosion leading to scaled-out architecture: increased number of host and storage ports in the same network; Collapsed core → Edge-Core, Edge-Core → Edge-Core-Edge
  • 6. What is SAN congestion? Congestion within switches Congestion between switches • Ability to switch traffic between all ports at all rates at all frame sizes • Containing congestion from affecting other ports • Predictable & consistent performance • Reliable performance Congestion by external elements • Slow Drain (Misbehaving host or storage ports) • Over-utilized Inter Switch Links (ISL) • Inappropriate oversubscription ratio
  • 7. Cisco MDS Architecture eliminates congestion within the switch
  • 8. Frame & credit processing in MDS switch. [Diagram: Cisco MDS with Line Card 1, Line Card 2, Active Supervisor (Arbiter), Fabric Modules (XBAR), XBAR interface, VOQ, and ports; flows labeled Req, Grant, Frame, R_RDY, credit.] Step 1: Initiator sends FC frame. Step 2: MDS receives the frame in its entirety and stores it. Step 3: Frame is transmitted to the VOQ. Step 4: XBAR interface requests the Arbiter for a grant to transmit the frame to the egress port via XBAR. Step 5: Arbiter grants the request to the XBAR interface to forward the frame; the grant is only sent when the egress port has buffer space available. Step 6: FC frame is forwarded to the XBAR, then R_RDY is sent back since the buffer is now free. Step 7: FC frame is forwarded to the egress line card. Step 8: ASIC forwards the frame to the target. Step 9: Credit is returned to the Arbiter.
  • 9. Cisco MDS architecture advantage. [Diagram: Cisco MDS with Line Card 1, Line Card 2, Active Supervisor (Arbiter), Fabric Modules (XBAR), XBAR interface, VOQ, and ports.] • Predictable throughput & latency • Consistent performance at different traffic loads & types • Drops corrupt frames by CRC checking at all stages • Non-blocking arbitrated crossbar architecture • Never drops good frames under congestion
  • 10. What is SAN congestion? Congestion within switches Congestion between switches • Ability to switch traffic between all ports at all rates at all frame sizes • Containing congestion from affecting other ports • Predictable & consistent performance • Reliable performance Congestion by external elements • Slow Drain (Misbehaving host or storage ports) • Over-utilized Inter Switch Links (ISL) • Inappropriate oversubscription ratio Education
  • 12. Fibre Channel Flow Control: B2B Credits • B2B credits are not negotiated – just agreed to • Each side informs the other side of the number of buffer credits it has. [Diagram: the storage disk N-port announces "I have 1 RX B2B credit"; the Fibre Channel switch F-port answers "OK. I have 3 B2B credits". The F-port has three credits and the N-port has one credit.]
  • 13. Fibre Channel Flow Control: Traffic Flow • MDS Rx buffer queue is decremented by 1 B2B credit for each received frame • R_RDY is sent to the sender when the frame occupying the buffer is handled • For each frame sent, an R_RDY (B2B credit) should be returned • R_RDYs are not sent reliably – they can be corrupted or lost. [Diagram: a storage disk N-port sends Frame1, Frame2, and Frame3 to a Fibre Channel switch F-port, which returns R_RDY.]
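The credit accounting on slides 12 and 13 can be illustrated with a small simulation. This is a minimal sketch, not how an MDS implements it: the class `CreditedLink` and its method names are hypothetical, and it models only one direction of a link where the receiver advertised three buffers.

```python
from collections import deque
from typing import Optional


class CreditedLink:
    """Toy model of one direction of a Fibre Channel link with B2B credits."""

    def __init__(self, receiver_credits: int) -> None:
        # The sender starts with as many credits as the receiver advertised.
        self.tx_credits = receiver_credits
        self.rx_buffers = deque()

    def send(self, frame: str) -> bool:
        """Transmit only if at least one credit is available."""
        if self.tx_credits == 0:
            return False  # the sender waits; nothing is dropped
        self.tx_credits -= 1
        self.rx_buffers.append(frame)
        return True

    def receiver_handles_one(self) -> Optional[str]:
        """Receiver frees one buffer and returns R_RDY, restoring one credit."""
        if not self.rx_buffers:
            return None
        frame = self.rx_buffers.popleft()
        self.tx_credits += 1  # the R_RDY reaches the sender
        return frame


link = CreditedLink(receiver_credits=3)
results = [link.send(f"Frame{i}") for i in range(1, 5)]
print(results)  # the fourth send is refused because no credits remain
link.receiver_handles_one()
print(link.send("Frame4"))  # succeeds once an R_RDY has been returned
```

A receiver that stops calling `receiver_handles_one` leaves the sender at zero credits indefinitely, which is the starting point of the slow drain scenario on the following slides.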
  • 14. Lossless Fibre Channel fabric • Disk 1 sends a frame to Server 1 • Switch 1 sends R_RDY after it transmits the frame to Switch 2 • Switch 2 sends R_RDY after it transmits the frame to Server 1 • Server 1 sends R_RDY after the frame is consumed by the HBA. [Diagram: Disk 1, Switch 1, Switch 2, Server 1, with the frame moving forward and R_RDY returned on each hop.]
  • 15. Lossless Fibre Channel fabric • Server 1 cannot process frames → does not return R_RDY • No available B2B credits on the port connected to Server 1 • No available B2B credits on ISL ports • Disk 1 stops transmitting → fabric becomes lossless. [Diagram: frames queued on every hop between Disk 1 and Server 1; back pressure propagates from Server 1 toward Disk 1 as R_RDYs are withheld.]
  • 16. Slow Drain situation • B2B credits exhausted on ISL • No R_RDY sent to Disk 1 or to Disk 2 • Effect of ‘slow Server 1’ on the Disk 2 to Server 2 flow. [Diagram: Disk 1 and Disk 2 attached to Switch 1, Server 1 and Server 2 attached to Switch 2; frames queued on every hop and back pressure reaching both disks.]
  • 17. Slow Drain situation • One slow device impacts all other devices sharing the same switches and ISL • Unpredictability of slow drain devices. [Diagram: same topology as the previous slide, with Server 1 labeled Slow Node and the other devices labeled Impacted Nodes.]
  • 18. Reasons for Slow Drain • Edge devices: server performance problems (application or OS); host bus adapter (HBA) problems (driver or physical failure); speed mismatches (one fast device and one slow device); non-graceful virtual machine exit on a virtualized server, resulting in packets held in HBA buffers; storage subsystem performance problems, including overload • Inter Switch Links (ISL): the existence of slow drain edge devices; lack of B2B credits for the distance the ISL is traversing (example: 4 credits per km at 8 Gbps); edge devices with faster speeds than ISLs, even when port-channeled
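The distance figure on slide 18 (4 credits per km at 8 Gbps) can be turned into a rough sizing helper. This is a sketch under stated assumptions: the function name `min_b2b_credits` is hypothetical, and the linear scaling with link speed is an extrapolation from the single figure on the slide, not something the slide states. Actual requirements also depend on frame size.

```python
import math

CREDITS_PER_KM_AT_8G = 4  # figure quoted on the slide


def min_b2b_credits(distance_km: float, speed_gbps: float) -> int:
    """Estimate the B2B credits needed to keep a long-distance ISL busy."""
    # Assumption: the requirement scales linearly with link speed.
    per_km = CREDITS_PER_KM_AT_8G * speed_gbps / 8
    return math.ceil(distance_km * per_km)


print(min_b2b_credits(10, 8))   # 10 km at 8 Gbps
print(min_b2b_credits(10, 16))  # same distance at 16 Gbps, under the scaling assumption
```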
  • 19. Cisco MDS & DCNM Slow Drain Advantage
  • 20. MDS & DCNM Slow Drain Advantage: Detection, Troubleshooting, Automatic Recovery
  • 21. MDS & DCNM Slow Drain Advantage: Detection, Troubleshooting, Automatic Recovery. Items on the slide: Slow Port; Stuck Port; Slowport Monitor; Credit transition to zero; Credit and remaining credit; Info of dropped frames; See frames in ingress Q; OBFL logging; History graph; TXWait period for frames; LR Rcvd B2B. Release tags on the slide: 6.2(9), 6.2(13)
  • 22. MDS & DCNM Slow Drain Advantage: Detection, Troubleshooting, Automatic Recovery. Items on the slide: Slow Port; Stuck Port; Slow Port Monitoring; Credit transition to zero; Credit and remaining credit; Info of dropped frames; See frames in ingress Q; OBFL logging; History graph; HW assisted; TXWait period for frames; LR Rcvd B2B. DCNM: fabric-wide visibility; automatic collection and graphical display of counters; reduced false positives. Release tags on the slide: 6.2(9), 6.2(13), 7.1(1)
  • 23. MDS & DCNM Slow Drain Advantage. Detection and Troubleshooting items: Slow Port; Stuck Port; Slow Port Monitoring; Credit transition to zero; Credit and remaining credit; Info of dropped frames; See frames in ingress Q; OBFL logging; History graph; HW assisted; TXWait period for frames; LR Rcvd B2B. DCNM: fabric-wide visibility; automatic collection and graphical display of counters; reduced false positives. Release tags: 6.2(9), 6.2(13), 7.1(1). Automatic Recovery steps: Step 0: Prevent head-of-line blocking. Step 1: Alert only – manual recovery. Step 2: Frame in switch > congestion-drop timeout? Drop it! Step 3: Frame in egress queue > no-credit-drop timeout? Drop it! Step 4: Send Link Reset (LR) or flap the port (part of FC standard). Step 5: Link flap. Step 6: Shut down the port. Automatic Recovery features: Virtual Output Queues; Stuck Port Recovery; Port flap *; Congestion drop; No-credit-drop; SNMP Trap *; Error disable port *. Other labels on the slide: Detection 1 ms; Action Immediate; 6.2(9); 6.2(13) Enhanced (three times). * = using Port Monitor
  • 24. Slow Port Monitoring • Shows real-time delay of data traffic on all ports • Duration for which frames could not be transmitted out of a port due to unavailability of transmit B2B credits • Monitoring at as low as 1 ms • Hardware assisted! No overhead on CPU • Recommendation: Always turn it on! From 6.2(9)
      mds9700(config)# system timeout slowport-monitor ?
        <1-500>  Configure number of milliseconds
        default  Default timeout value for HW slowport monitoring
      mds9700(config)# system timeout slowport-monitor default ?
        mode  Enter the port mode
      mds9700(config)# system timeout slowport-monitor default mode ?
        E  E mode
        F  F mode
  • 25. Understanding Slowport Monitor output. Column meanings: admin delay = delay configured via slowport-monitor; slowport detection count = number of times the delay was detected; oper delay = actual delay seen by the port; Timestamp = when the delay was observed. From 6.2(9)
      Mds9706# show process creditmon slowport-monitor-events
      Module: 01 Slowport Detected: YES
      =====================================================================
      Interface = fc1/18
      ------------------------------------------------------------
      | admin | slowport  | oper  | Timestamp                   |
      | delay | detection | delay |                             |
      | (ms)  | count     | (ms)  |                             |
      ------------------------------------------------------------
      | 1     | 0         | 9     | Wed Jul 2 19:47:35.038 2014
      | 1     | 128       | 9     | Wed Jul 2 19:47:19.922 2014
      | 1     | 127       | 4     | Wed Jul 2 19:47:19.618 2014
      | 1     | 119       | 10    | Wed Jul 2 19:47:19.518 2014
      | 1     | 109       | 10    | Wed Jul 2 19:47:19.418 2014
      | 1     | 101       | 10    | Wed Jul 2 19:47:19.318 2014
      | 1     | 100       | 4     | Wed Jul 2 19:47:19.118 2014
      | 1     | 93        | 10    | Wed Jul 2 19:47:19.017 2014
      | 1     | 83        | 10    | Wed Jul 2 19:47:18.917 2014
      | 1     | 74        | 12    | Wed Jul 2 19:47:18.818 2014
  • 26. Slow Port Monitoring: displays R_RDY delay in real time & stores it in the logging buffer. Slowport monitor integration with Port-Monitor. [Chart: operational delay (ms) against event time (seconds), with a threshold line; action: SNMP trap.] From 6.2(13)
      | oper  | Timestamp                   |
      | delay |                             |
      | (ms)  |                             |
      ---------------------------------------
      | 9     | Wed Jul 2 19:47:35.038 2014
      | 9     | Wed Jul 2 19:47:19.922 2014
      | 4     | Wed Jul 2 19:47:19.618 2014
      | 10    | Wed Jul 2 19:47:19.518 2014
  • 27. PMON configuration: TX-Slowport-Oper-Delay (default) • Monitoring interval: 1 second • Threshold type: Absolute (delay value in ms) • Rising threshold: 50 ms • Falling threshold: 0 ms • Action: Trap and Syslog. What this means in English: Event: “If a port remains at zero TX B2B credits for a continuous span of 50 ms in a 1-second polling interval”. Action: Generate an SNMP trap and syslog. From 6.2(13)
  • 28. Understanding TXWait • Hardware counter with nanosecond visibility • Increments every 2-3 ns when the port is at 0 Tx credits and there are frames queued for transmit • Reported in units of 2.5 µs • txwait * 2.5 / 1000000 = seconds the port was unable to transmit • Why reported in 2.5 µs ticks? • Because of FICON requirements • Nanoseconds are too fast to interpret • Example: 5642973696 * 2.5 / 1000000 = 14107 seconds, so the MDS was not able to transmit for around 14107 seconds since the counter was last cleared
      mds9710-1# show interface fc1/1 counters | include wait
        5642973696 2.5us Tx waits due to lack of transmit credits
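The conversion on slide 28 is simple enough to script when comparing counters across many ports. A minimal sketch, where the helper name `txwait_seconds` is hypothetical and the input is the raw counter value copied from the slide's example:

```python
TXWAIT_TICK_US = 2.5  # each TxWait tick represents 2.5 microseconds


def txwait_seconds(ticks: int) -> float:
    """Convert a raw TxWait counter value into seconds without transmit credits."""
    return ticks * TXWAIT_TICK_US / 1_000_000


# Counter value from the slide's "show interface fc1/1 counters" example.
print(int(txwait_seconds(5642973696)))  # 14107
```

Because the counter accumulates since it was last cleared, the result is only meaningful relative to that point in time.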
  • 29. Percentage of TxWait • Intuitive way of reporting how long frames could not be transmitted • In the output below, frames could not be transmitted out of port fc1/13 for 1% of the last 1 second, 5% of the last 1 minute, and so on, due to lack of transmit B2B credits
      MDS9700# show interface fc1/13 counters
      fc1/13
        <snip>
        5 Transmit B2B credit transitions to zero
        2 Receive B2B credit transitions to zero
        0 2.5us TxWait due to lack of transmit credits
        Percentage Tx credits not available for last 1s/1m/1h/72h: 1%/5%/3%/2%
        32 receive B2B credit remaining
        128 transmit B2B credit remaining
        128 low priority transmit B2B credit remaining
        <snip>
  • 30. TXWait – Health report of port • Graphical display of time when credits were not available • 3 graphs per port: last 60 seconds, last 60 minutes, last 72 hours • Top 3 rows (read vertically): actual TxWait in ms • Middle 10 rows: graph plotted using # • Bottom 2 rows (last 60 seconds) • Example: at the 15th second, TXWAIT = 989 ms; at the 35th second, TXWAIT = 752 ms. From 6.2(13)
      mds9710-1# show process creditmon txwait-history
      TxWait history for port fc1/13:
      [ASCII graph not recoverable from the transcript: Credit Not Available per second (last 60 seconds); y-axis 0 to 1000 in steps of 100; x-axis 0 to 60 seconds; # = TxWait (ms)]
  • 31. OBFL – Granular, long duration reporting • TXWAIT delta value is logged periodically (every 20 seconds) into OBFL if the delta value is >= 100 ms • Displays TXWAIT time in 2.5 µs ticks as well as in seconds • Congestion value is displayed as a percentage over the 20-second period • Timestamp of the event occurrence is also recorded. From 6.2(13)
      switch# show logging onboard txwait
      Notes:
        - sampling period is 20 seconds
        - only txwait delta value >= 100 ms are logged
      ---------------------------------
      Module: 1 txwait count
      ---------------------------------
      -----------------------------------------------------------------------------
      | Interface | Delta TxWait Time       | Congestion | Timestamp                |
      |           | 2.5us ticks | seconds   |            |                          |
      -----------------------------------------------------------------------------
      | fc1/11    | 3435973     | 08        | 42%        | Sun Sep 30 05:23:05 2001 |
      | fc1/11    | 6871947     | 17        | 85%        | Sun Sep 30 05:22:25 2001 |
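The seconds and congestion columns in the OBFL output follow from the tick count and the 20-second sampling period. The sketch below reproduces the two rows shown on the slide. The function name `obfl_row` is hypothetical, and rounding down is an assumption: it matches both rows, but the slide does not state the rounding rule.

```python
SAMPLING_PERIOD_US = 20 * 1_000_000  # 20-second sampling period
LOG_THRESHOLD_US = 100_000           # only deltas >= 100 ms are logged


def obfl_row(delta_ticks: int):
    """Return (seconds, congestion %, logged?) for a TxWait delta in 2.5 us ticks."""
    micros = delta_ticks * 5 // 2  # 2.5 us per tick, kept in integer arithmetic
    seconds = micros // 1_000_000  # assumption: the CLI rounds down
    percent = micros * 100 // SAMPLING_PERIOD_US
    return seconds, percent, micros >= LOG_THRESHOLD_US


print(obfl_row(3435973))  # (8, 42, True)
print(obfl_row(6871947))  # (17, 85, True)
```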
  • 32. PMON configuration: TXWait (default) • Monitoring interval: 1 second • Threshold type: Delta • Rising threshold: 40% (translates to 400 ms with a 1-second monitoring interval) • Falling threshold: 0% • Action: Trap and Syslog. What this means in English: Event: “If the aggregate or sum of all the durations (with ns granularity) when the port was at 0 TX credits exceeds 400 ms in a 1-second polling interval”. Action: Generate an SNMP trap and syslog
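The two default port-monitor policies (slides 27 and 32) react to different patterns: TX-Slowport-Oper-Delay looks at the longest continuous stall, while TXWait looks at the sum of all stalls in the interval. The sketch below contrasts them. The function names and the list-of-stalls input are hypothetical, and treating a value equal to the threshold as a trigger is an assumption.

```python
from typing import List


def slowport_oper_delay_event(stalls_ms: List[float], rising_ms: float = 50) -> bool:
    """True if any single continuous zero-credit stall reaches the threshold."""
    return max(stalls_ms, default=0) >= rising_ms


def txwait_event(stalls_ms: List[float], interval_ms: float = 1000,
                 rising_pct: float = 40) -> bool:
    """True if the summed zero-credit time reaches the percentage threshold."""
    return sum(stalls_ms) / interval_ms * 100 >= rising_pct


many_short = [40] * 11  # 440 ms in total, but no single stall reaches 50 ms
one_long = [60]         # one 60 ms stall, only 6% of the interval

print(slowport_oper_delay_event(many_short), txwait_event(many_short))  # False True
print(slowport_oper_delay_event(one_long), txwait_event(one_long))      # True False
```

Under these assumptions the two counters complement each other, since each catches a stall pattern that the other misses.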
  • 33. Congestion Drop timeout • MDS timestamps each received frame • A frame is dropped if it cannot be delivered to the egress port within the timeout • Logging is done • Can be configured from 100 ms to 500 ms (500 ms default) • Lowering the value times out frames quicker and reduces the effects of slow drain devices. [Diagram: frames queued in the buffers of an MDS switch.]
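The congestion-drop behavior on slide 33 amounts to an age check against each frame's receive timestamp. A minimal sketch, assuming a hypothetical `QueuedFrame` record and millisecond timestamps. It illustrates the rule only and is not the switch's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueuedFrame:
    name: str
    received_ms: int  # timestamp applied when the frame entered the switch


def apply_congestion_drop(queue: List[QueuedFrame], now_ms: int,
                          timeout_ms: int = 500
                          ) -> Tuple[List[QueuedFrame], List[QueuedFrame]]:
    """Split the queue into (kept, dropped) based on frame age."""
    if not 100 <= timeout_ms <= 500:
        raise ValueError("timeout must be between 100 and 500 ms")
    kept = [f for f in queue if now_ms - f.received_ms <= timeout_ms]
    dropped = [f for f in queue if now_ms - f.received_ms > timeout_ms]
    return kept, dropped


queue = [QueuedFrame("A", 0), QueuedFrame("B", 300), QueuedFrame("C", 450)]
kept, dropped = apply_congestion_drop(queue, now_ms=600)
print([f.name for f in dropped])  # ['A'] with the 500 ms default
kept, dropped = apply_congestion_drop(queue, now_ms=600, timeout_ms=200)
print([f.name for f in dropped])  # ['A', 'B'] with a lower timeout
```

The second call shows the effect of lowering the timeout: more frames age out, which frees buffers sooner at the cost of more discards.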
  • 34. no-credit-drop timeout • Frames are dropped in the egress queue if credits are unavailable for the no-credit-drop timeout • Enhanced since 6.2(9) • [Diagram: Server 1, Server 2, Disk 1 and Disk 2 attached to MDS 1 and MDS 2; labels: "Drop frames from egress queue of Slow Port", "BackPressure Released" (three times), R_RDY]
  • 35. Agenda: Education, Experience, Experiment • Build robust and self-healing storage area networks • On new innovations on Cisco MDS and DCNM for solving SAN congestion
  • 36. Troubleshooting Slow Drain: Methodology • Cisco recommends troubleshooting slow drain in the following order: • Level 3: Extreme Delay • Level 2: Retransmission • Level 1: Latency
  • 37. Troubleshooting Slow Drain: Methodology – Follow Congestion to Source • If a port shows Rx congestion, find the ports communicating with it that have Tx congestion • Zoning defines which devices communicate with this port • Understand the topology • If the port communicating with the Rx-congested port is FCIP: • Check for TCP retransmits • Check for overutilization of FCIP • [Diagram: F and E ports; labels: "Rx Credits 0 Remaining", "Tx Credits 0 Remaining", "Congestion"]
  • 38. Troubleshooting Slow Drain: Methodology – Follow Congestion to Source • If Tx congestion is found: • On an F port, the attached device is the slow drain device • On an E port, go to the adjacent switch and continue troubleshooting • Continue to track through the fabric until the destination F port is discovered • [Diagram: F, E, E and F ports in a chain; labels: "Rx Credits 0 Remaining", "Tx Credits 0 Remaining", "Congestion"]
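The "follow congestion to source" rule is a graph walk: start at a switch, follow Tx-congested E ports to the adjacent switch, and stop at Tx-congested F ports. The sketch below shows that walk over a hand-written, entirely hypothetical fabric description. In practice the congestion flags would come from the counters discussed earlier, and zoning would narrow the search.

```python
from collections import deque

# Hypothetical two-switch fabric: switch -> port -> attributes.
FABRIC = {
    "MDS1": {
        "fc1/1": {"type": "F", "tx_congested": False, "peer": "Server1"},
        "fc1/2": {"type": "E", "tx_congested": True, "peer": "MDS2"},
    },
    "MDS2": {
        "fc1/1": {"type": "E", "tx_congested": False, "peer": "MDS1"},
        "fc1/5": {"type": "F", "tx_congested": True, "peer": "Disk1"},
        "fc1/6": {"type": "F", "tx_congested": False, "peer": "Disk2"},
    },
}


def find_slow_drain(fabric, start_switch):
    """Return (switch, port, device) for every Tx-congested F port reached."""
    suspects = []
    visited = set()
    todo = deque([start_switch])
    while todo:
        switch = todo.popleft()
        if switch in visited or switch not in fabric:
            continue
        visited.add(switch)
        for port, info in fabric[switch].items():
            if not info["tx_congested"]:
                continue
            if info["type"] == "F":
                # Tx congestion on an F port: attached device is the suspect.
                suspects.append((switch, port, info["peer"]))
            elif info["type"] == "E":
                # Tx congestion on an E port: continue on the adjacent switch.
                todo.append(info["peer"])
    return suspects


if __name__ == "__main__":
    print(find_slow_drain(FABRIC, "MDS1"))
    # prints: [('MDS2', 'fc1/5', 'Disk1')]
```

The visited set keeps the walk from looping when ISLs are congested in both directions.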
  • 39. Port-monitor Alerting • Port-monitor allows monitoring of several counters relating to slow drain: • credit-loss-reco: credit loss recovery counter • lr-rx: number of link resets received by the fc-port • lr-tx: number of link resets transmitted by the fc-port • timeout-discards: timeout discards counter • tx-credit-not-available: credit not available counter (in 100 ms increments) • tx-discards: Tx discards counter • slowport-count: number of slowport events • slowport-oper-delay: slowport operational delay • txwait: amount of time at 0 Tx credits with packets queued • Note: There are other counters that are valuable and should also be considered for inclusion in monitoring but are not part of slow drain • [Slide callout: New!]
  • 40. Port-monitor Alerting: RMON event severities • Event indicates severity in alert: 1 – Fatal, 2 – Critical, 3 – Error, 4 – Warning, 5 – Informational • Categorize counters as different severities for better visual impact in DCNM
      mds9513(config-port-monitor)# show rmon events
      Event 1 is active, owned by PMON@FATAL
       Description is FATAL(1)
       Event firing causes log and trap to community public, last fired never
      Event 2 is active, owned by PMON@CRITICAL
       Description is CRITICAL(2)
       Event firing causes log and trap to community public, last fired never
      Event 3 is active, owned by PMON@ERROR
       Description is ERROR(3)
       Event firing causes log and trap to community public, last fired never
      Event 4 is active, owned by PMON@WARNING
       Description is WARNING(4)
       Event firing causes log and trap to community public, last fired 2014/02/21-17:13:11
      Event 5 is active, owned by PMON@INFO
       Description is INFORMATION(5)
       Event firing causes log and trap to community public, last fired 2014/03/08-08:25:19
  • 41. Port-monitor Alerting: Separate policies or single policy • Port-monitor allows separate policies for: • F, FL ports (access) • E, TL ports (trunks) • Both F ports and E ports • Only one policy type per port can be active at a time • Note: port-type access includes F port connections to NPV switches that can carry several logins • Note: NP ports are not currently monitored
      MDS9513(config-port-monitor)# port-type ?
        access-port  Configure port-monitoring for access ports
        all          Configure port-monitoring for all ports
        trunks       Configure port-monitoring for trunk ports
  • 42. Port-monitor Alerting: Command parameters • counter <name> poll-interval <interval> delta rising-threshold <rthresh> event <id> falling-threshold <fthresh> event <id> <portguard errordisable | flap> • poll-interval: seconds; how often this counter is checked • delta: compare the current value with the value at the previous poll interval • absolute: match the actual value • rising-threshold: how much the counter must increase in this poll interval to trigger • event: indicates severity of alert (info, warning, error, etc.) • falling-threshold: how much the counter must decrease in this poll interval to reset • portguard: optional; action to take when rising-threshold is reached • errordisable: place the port in error-disable state; requires manual shut/no shut to re-activate • flap: shut/no shut the port
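How the rising and falling thresholds interact is easiest to see on sample data. The sketch below assumes RMON-style hysteresis: a rising event fires once when the value reaches the rising threshold, and cannot fire again until the value has dropped to the falling threshold. The `evaluate` function and the sample counter values are hypothetical, and the exact boundary comparisons (>= and <=) are assumptions.

```python
def evaluate(samples, rising, falling, mode="delta"):
    """Return (poll_index, event, value) tuples for a series of polled samples.

    mode="delta" compares each sample with the previous poll;
    mode="absolute" uses the sampled value directly.
    """
    events = []
    armed = True      # a rising event may fire only while armed
    previous = None
    for index, raw in enumerate(samples):
        if mode == "delta":
            if previous is None:   # first poll has nothing to compare with
                previous = raw
                continue
            value = raw - previous
            previous = raw
        else:
            value = raw
        if armed and value >= rising:
            events.append((index, "rising", value))
            armed = False
        elif not armed and value <= falling:
            events.append((index, "falling", value))
            armed = True
    return events


if __name__ == "__main__":
    # Cumulative tx-discards counter polled every 60 s (made-up values),
    # checked against rising-threshold 50 and falling-threshold 10.
    polls = [0, 10, 70, 130, 135, 135]
    print(evaluate(polls, rising=50, falling=10))
    # prints: [(2, 'rising', 60), (4, 'falling', 5)]
```

The second interval with 60 discards (poll 3) produces no new alert, which is the point of the falling threshold: one alert per congestion episode rather than one per poll.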
  • 43. Port-monitor Alerting: Example • Policy applies to Access (F) and Trunk (E) ports • Counters under "no monitor" are not monitored • Event 2 – Critical, Event 3 – Error, Event 4 – Warning • Note: The policy below monitors 9 slow drain counters and does not monitor 10 others
      port-monitor name AllPorts
        port-type all
        no monitor counter link-loss
        no monitor counter sync-loss
        no monitor counter signal-loss
        no monitor counter invalid-words
        no monitor counter invalid-crc
        counter tx-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
        counter lr-rx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
        counter lr-tx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
        counter timeout-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
        counter credit-loss-reco poll-interval 60 delta rising-threshold 1 event 2 falling-threshold 0 event 2
        counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
        no monitor counter rx-datarate
        no monitor counter tx-datarate
        no monitor counter err-pkt-from-port
        no monitor counter err-pkt-to-xbar
        no monitor counter err-pkt-from-xbar
        counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 4 falling-threshold 0 event 4
        counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 4 falling-threshold 0 event 4
        counter txwait poll-interval 1 delta rising-threshold 20 event 4 falling-threshold 0 event 4
      port-monitor activate AllPorts
  • 44. Port-monitor Alerting: Activation and output
      MDS9710-1# show port-monitor AllPorts
      Policy Name  : AllPorts
      Admin status : Not Active
      Oper status  : Not Active
      Port type    : All Ports
      ----------------------------------------------------------------------------------------------------------
      Counter                  Threshold  Interval  Rising Threshold  event  Falling Threshold  event  PMON Portguard
      -------                  ---------  --------  ----------------  -----  -----------------  -----  --------------
      TX Discards              Delta      60        50                3      10                 3      Not enabled
      LR RX                    Delta      60        5                 2      1                  2      Not enabled
      LR TX                    Delta      60        5                 2      1                  2      Not enabled
      Timeout Discards         Delta      60        50                3      10                 3      Not enabled
      Credit Loss Reco         Delta      60        1                 2      0                  2      Not enabled
      TX Credit Not Available  Delta      1         10%               4      0%                 4      Not enabled
      slowport-count           Delta      1         5                 4      0                  4      Not enabled
      slowport-oper-delay      Absolute   1         50ms              4      0ms                4      Not enabled
      txwait                   Delta      1         20%               4      0%                 4      Not enabled
      ----------------------------------------------------------------------------------------------------------
  • 47. Guidance on configuration • Configure slowport-monitor at 10-25 ms for both E & F ports • system timeout slowport-monitor 10 mode e • system timeout slowport-monitor 10 mode f • Configure congestion-drop on F ports • system timeout congestion-drop 200ms mode f • Configure no-credit-drop on F ports • system timeout no-credit-drop <ms> mode f • 200 ms: safe, 100 ms: aggressive, 50 ms: very aggressive • Configure port-monitor policy(s) • Use samples included in port-monitor section
  • 48. Agenda: Education, Experience, Experiment • Build robust and self-healing storage area networks • On new innovations on Cisco MDS and DCNM for solving SAN congestion
  • 49. Refining no-credit-drop timeout • Step 1: Enable Slowport Monitoring on all devices (no performance impact!) • Step 2: Monitor end device performance (R_RDY delay) using either "show process creditmon slowport-monitor-events" or, better, "show logging onboard slowport-monitor-events" • Step 3: Define typical R_RDY delay on slow ports (average, peak, variance, etc.) • Step 4: Use (typical value + variance) as the no-credit-drop timeout • Result: Automatic recovery the moment a port sees R_RDY delay more than 'typical'
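Steps 3 and 4 amount to simple statistics over the observed R_RDY delays. The sketch below is one possible reading, not Cisco's formula: it takes "typical" as the mean and interprets "variance" loosely as one population standard deviation, so the result stays in milliseconds. The function name and sample delays are made up, and the suggested value would still need checking against what the platform accepts for no-credit-drop.

```python
import math
import statistics


def suggest_no_credit_drop_ms(delays_ms):
    """Summarise R_RDY delays (ms) and suggest a no-credit-drop timeout.

    Assumption: suggested = mean + one standard deviation, rounded up.
    """
    typical = statistics.mean(delays_ms)
    spread = statistics.pstdev(delays_ms)
    return {
        "typical_ms": typical,
        "peak_ms": max(delays_ms),
        "spread_ms": spread,
        "suggested_timeout_ms": math.ceil(typical + spread),
    }


if __name__ == "__main__":
    # Hypothetical delays collected from slowport-monitor events on one port.
    print(suggest_no_credit_drop_ms([10, 20, 30, 20, 20]))
```

A more conservative choice would add several standard deviations or use the observed peak. The right margin depends on how much application impact a dropped frame carries.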
  • 50. Predicting slow drain: Fabric Benchmarking • Find the delay values on ports for acceptable application performance • Upside variance of the delay value may lead to degraded application performance • Use the following for fabric benchmarking: • DCNM slow drain analysis • MDS Slowport-monitor • MDS TxWait health graph • MDS TxWait percentage congestion • Slowdrain SNMP MIBs • Port-monitor alerts
  • 51. Fabric Benchmarking using slowport-monitor and TxWait • slowport-monitor at 10 ms on E & F ports • congestion-drop on F ports at 200 ms • no-credit-drop on F ports (200 ms: safe, 100 ms: aggressive, 50 ms: very aggressive) • Configure port-monitor policy(s) • Education, Experience, Experiment: use Cisco MDS & DCNM to build robust and self-healing storage area networks
  • 53. Slow Drain Reference • YouTube Videos • Understanding Slow Drain: Detection, Troubleshooting & Automatic Recovery: https://www.youtube.com/watch?v=wEz3z6NLaBU&list=PL_ju2fKFbFzVMZgXAHV9kZ6FT93BuG0eB • Detecting and Troubleshooting Slow Drain using Cisco Prime DCNM: https://www.youtube.com/watch?v=tijVaIatQgQ • White Paper on "Slow Drain Device Detection and Congestion Avoidance": http://www.cisco.com/c/en/us/products/collateral/storage-networking/mds-9700-series-multilayer-directors/white_paper_c11-729444.html • Cisco Live Session: BRKSAN-3446 - SAN Congestion! Understanding, Troubleshooting, Mitigating in a Cisco Fabric (2015 San Diego): https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=83668&backBtn=true • Solving Congestion problems in SAN (Feb 2015, enhancements till NX-OS 6.2(9)): http://www.slideshare.net/Ciscodatacenter/solving-congestion-problems-in-storage-area-networks • Generation 4 Slow Drain Counters commands and troubleshooting: http://www.cisco.com/c/en/us/support/docs/storage-networking/mds-9509-multilayer-director/116098-trouble-gen4-00.html • MDS 9148 Slow Drain Counters and Commands: http://www.cisco.com/c/en/us/support/docs/storage-networking/mds-9100-series-multilayer-fabric-switches/116401-trouble-mds9148-00.html