Today’s storage area networks (SANs) face tremendous pressure from the phenomenal growth of digital information and the need to access it quickly and efficiently. Worldwide data is projected to grow roughly tenfold by 2020. It’s little wonder, then, that storage administrators rank slow drain and related SAN congestion issues as their number-one concern. Left unaddressed, these issues can have a domino effect, degrading the performance of even totally unrelated applications.
Find out how the Cisco Data Center Network Manager tool provides centralized monitoring and reporting of slow drain conditions across your entire fabric, enabling you to easily pinpoint the exact sources of congestion. Discover how these solutions maximize the performance of your existing SAN as we reveal:
• Common causes of slow drain
• Best practices for avoiding congestion
• Tools for Cisco Nexus and MDS switches that speed detection and recovery
• Recent innovations that fully automate resolution
4. Agenda
Education Experience Experiment
Build robust and self-healing storage area networks using new innovations on Cisco MDS and DCNM for solving SAN congestion
5. Why care about SAN congestion now?
• 16 Gbps FC adoption leading to heterogeneous speeds: ports at 1/2/4/8/16 Gbps are part of the same fabric
• Increased pressure on OpEx: maximize the utilization of existing infrastructure
• Flash storage: pushing network infrastructure to its limits; response times shift from milliseconds (ms) to microseconds (µs)
• Legacy applications: older HW/SW will be around
• Data explosion leading to scaled-out architectures: increased number of host and storage ports in the same network; topologies growing from collapsed core to edge-core to edge-core-edge
6. What is SAN congestion?
• Congestion within switches: the ability to switch traffic between all ports, at all rates, at all frame sizes; containing congestion so it does not affect other ports
• Congestion between switches: predictable & consistent performance; reliable performance
• Congestion by external elements: slow drain (misbehaving host or storage ports); over-utilized Inter-Switch Links (ISLs); inappropriate oversubscription ratio
8. Frame & credit processing in MDS switch
(Diagram: two line cards with ports and VOQs, an active supervisor Arbiter, and fabric modules (XBAR) reached through XBAR interfaces.)
1. Initiator sends an FC frame.
2. MDS receives the frame in its entirety and stores it.
3. The frame is transmitted to the VOQ.
4. The XBAR interface requests a grant from the Arbiter to transmit the frame to the egress port via the XBAR.
5. The Arbiter grants the request to the XBAR interface to forward the frame; the grant is sent only when the egress port has buffer space available.
6. The FC frame is forwarded to the XBAR, then an R_RDY is sent back since the buffer is now free.
7. The FC frame is forwarded to the egress line card.
8. The ASIC forwards the frame to the target.
9. The credit is returned to the Arbiter.
9. Cisco MDS architecture advantage
(Diagram: the same line card / Arbiter / XBAR architecture as slide 8.)
Under congestion, the Cisco MDS delivers:
• Predictable throughput & latency: consistent performance at different traffic loads & types
• Never drops a good frame: non-blocking arbitrated crossbar architecture
• Drops corrupt frames: CRC checking at all stages
10. What is SAN congestion? (verbatim recap of slide 6, marking the start of the Education section)
12. Fibre Channel Flow Control: B2B Credits
• B2B credits are not negotiated, just agreed to
• Each side informs the other side of the number of buffer credits it has
(Diagram: a storage disk N port announces "I have 1 RX B2B credit"; the Fibre Channel switch F port answers "OK. I have 3 B2B credits." The F port has three credits; the N port has one credit.)
13. Fibre Channel Flow Control: Traffic Flow
• The MDS Rx buffer queue is decremented by 1 B2B credit for each received frame
• An R_RDY is sent to the sender once the frame occupying the buffer has been handled
• For each frame sent, an R_RDY (B2B credit) should be returned
• R_RDYs are not sent reliably: they can be corrupted or lost
(Diagram: frames flowing from the storage disk to the Fibre Channel switch, with an R_RDY returned as each buffer frees up.)
14. Lossless Fibre Channel fabric
• Disk 1 sends a frame to Server 1
• Switch 1 sends an R_RDY after it transmits the frame to Switch 2
• Switch 2 sends an R_RDY after it transmits the frame to Server 1
• Server 1 sends an R_RDY after the frame is consumed by its HBA
(Diagram: Disk 1 to Switch 1 to Switch 2 to Server 1; each hop returns an R_RDY once its buffer is freed.)
15. Lossless Fibre Channel fabric
• Server 1 cannot process frames, so it does not return R_RDYs
• No B2B credits remain available on the port connected to Server 1
• No B2B credits remain available on the ISL ports
• Disk 1 stops transmitting, so the fabric remains lossless
(Diagram: frames queue up from Server 1 back through Switch 2 and Switch 1 to Disk 1 as back pressure propagates hop by hop.)
16. Slow Drain situation
• B2B credits are exhausted on the ISL
• No R_RDY is sent to Disk 1, nor to Disk 2
• The slow Server 1 now affects the unrelated Disk 2 to Server 2 flow
(Diagram: a second flow, Disk 2 to Server 2, crosses the same ISL; back pressure from Server 1 stalls it too.)
17. Slow Drain situation
• One slow device impacts all other devices sharing the same switches and ISL
• Slow drain devices are unpredictable
(Diagram: Server 1 is the slow node; Disk 1, Disk 2, and Server 2 are the impacted nodes.)
18. Reasons for Slow Drain
• Edge devices
  • Server performance problems: application or OS
  • Host bus adapter (HBA) problems: driver or physical failure
  • Speed mismatches: one fast device and one slow device
  • Non-graceful virtual machine exit on a virtualized server, resulting in packets held in HBA buffers
  • Storage subsystem performance problems, including overload
• Inter-Switch Links (ISLs)
  • The existence of slow drain edge devices
  • Lack of B2B credits for the distance the ISL traverses (e.g., 4 credits per km at 8 Gbps)
  • Edge devices with faster speeds than ISLs, even when port-channeled
21–22. MDS & DCNM Slow Drain Advantage (progressive builds of the summary completed on slide 23 below)
23. MDS & DCNM Slow Drain Advantage

Detection (slow port and stuck port):
• Slowport Monitor, hardware assisted (from 6.2(9))
• Credit transition to zero
• TXWait period for frames (from 6.2(13))
• LR Rcvd B2B (from 6.2(13))

Troubleshooting:
• Credit and remaining credit
• Info on dropped frames
• Visibility into frames in the ingress queue
• OBFL logging
• History graph (from 6.2(13))
• DCNM: fabric-wide visibility, automatic collection and graphical display of counters, reduced false positives (from 7.1(1))

Automatic Recovery:
• Step 0: Prevent head-of-line blocking (virtual output queues)
• Step 1: Alert only, manual recovery (SNMP trap, using Port Monitor; enhanced in 6.2(13))
• Step 2: Frame in switch longer than the congestion-drop timeout? Drop it! (congestion drop)
• Step 3: Frame in the egress queue longer than the no-credit-drop timeout? Drop it! (no-credit-drop, 1 ms detection, immediate action; from 6.2(9))
• Step 4: Send a Link Reset (LR) or flap the port, part of the FC standard (stuck port recovery)
• Step 5: Link flap (using Port Monitor; enhanced in 6.2(13))
• Step 6: Shut down the port (error-disable, using Port Monitor; enhanced in 6.2(13))
24. Slow Port Monitoring (from 6.2(9))
• Shows the real-time delay of data traffic on all ports
• Measures the duration for which frames could not be transmitted out of a port due to unavailability of transmit B2B credits
• Monitoring granularity as low as 1 ms
• Hardware assisted! No overhead on the CPU
• Recommendation: always turn it on!

mds9700(config)# system timeout slowport-monitor ?
  <1-500>  Configure number of milliseconds
  default  Default timeout value for HW slowport monitoring
mds9700(config)# system timeout slowport-monitor default ?
  mode  Enter the port mode
mds9700(config)# system timeout slowport-monitor default mode ?
  E  E mode
  F  F mode
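A minimal sketch of turning it on with an explicit threshold (10 ms here mirrors the guidance on slide 47; per the help output above, any value from 1 to 500 ms is accepted):

mds9700(config)# system timeout slowport-monitor 10 mode e
mds9700(config)# system timeout slowport-monitor 10 mode f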
26. Slow Port Monitoring (from 6.2(13))
• Displays the R_RDY delay in real time and stores it in the logging buffer
• Slowport monitor integrates with Port-Monitor: each event records the time (seconds), operational delay (ms), threshold, and action (SNMP trap)

-----------------------------------------------
| oper delay (ms) | Timestamp                 |
-----------------------------------------------
|  9   | Wed Jul  2 19:47:35.038 2014         |
|  9   | Wed Jul  2 19:47:19.922 2014         |
|  4   | Wed Jul  2 19:47:19.618 2014         |
| 10   | Wed Jul  2 19:47:19.518 2014         |
27. PMON configuration: TX-Slowport-Oper-Delay (default, from 6.2(13))
• Monitoring interval: 1 second
• Threshold type: absolute (delay value in ms)
• Rising threshold: 50 ms
• Falling threshold: 0 ms
• Action: trap and syslog

What this means in English
Event: “If a port remains at zero Tx B2B credits for a continuous span of 50 ms within a 1 second polling interval”
Action: generate an SNMP trap and a syslog message
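In port-monitor CLI terms, this default corresponds to the counter line that also appears in the sample policy on slide 43:

counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 4 falling-threshold 0 event 4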
28. Understanding TXWait
• A hardware counter with nanosecond visibility
• Increments every 2-3 ns when the port is at 0 Tx credits and there are frames queued for transmit
• Reported in units of 2.5 µs: txwait * 2.5 / 1,000,000 = seconds of time the port was unable to transmit
• Why report in 2.5 µs ticks? Because of FICON requirements, and nanoseconds are too fast to interpret
• Example: 5642973696 * 2.5 / 1,000,000 = 14107 seconds; the MDS was unable to transmit for around 14107 seconds since the counter was last cleared

mds9710-1# show interface fc1/1 counters | include wait
  5642973696 2.5us Tx waits due to lack of transmit credits
29. Percentage of TxWait
• An intuitive way of reporting how long frames could not be transmitted
• In the output below, frames could not be transmitted out of port fc1/13 for 1% of the last 1 second, 5% of the last 1 minute, and so on, due to lack of transmit B2B credits

MDS9700# show interface fc1/13 counters
fc1/13
<snip>
    5 Transmit B2B credit transitions to zero
    2 Receive B2B credit transitions to zero
    0 2.5us TxWait due to lack of transmit credits
    Percentage Tx credits not available for last 1s/1m/1h/72h: 1%/5%/3%/2%
    32 receive B2B credit remaining
    128 transmit B2B credit remaining
    128 low priority transmit B2B credit remaining
<snip>
30. TXWait – Health report of port (from 6.2(13))
• Graphical display of the time when credits were not available
• 3 graphs per port: last 60 seconds, last 60 minutes, last 72 hours
• Top 3 rows (read vertically): actual TxWait in ms
• Middle 10 rows: graph plot using #
• Bottom 2 rows: seconds axis (last 60 seconds)
• Example: at the 15th second, TxWait = 989 ms; at the 35th second, TxWait = 752 ms
mds9710-1# show process creditmon txwait-history
TxWait history for port fc1/13:
==============================
79998 79993 999999
08887 58882 9899999
000000000000299870000000000000000029994000000000000362999500
1000 ### ### ######
900 #### ### ######
800 #### #### ######
700 ##### #### ######
600 ##### #### ######
500 ##### #### ######
400 ##### #### ######
300 ##### ##### ######
200 ##### ##### ######
100 ##### ##### #######
0....5....1....1....2....2....3....3....4....4....5....5....6
0 5 0 5 0 5 0 5 0 5 0
Credit Not Available per second (last 60 seconds)
# = TxWait (ms)
31. OBFL – Granular, long-duration reporting (from 6.2(13))
• The TxWait delta value is logged into OBFL periodically (every 20 seconds) if the delta is >= 100 ms
• Displays the TxWait time in 2.5 µs ticks as well as in seconds
• Congestion is displayed as a percentage over the 20 second sampling period
• The timestamp of the event occurrence is also recorded

switch# show logging onboard txwait
Notes:
- sampling period is 20 seconds
- only txwait delta value >= 100 ms are logged
---------------------------------
 Module: 1 txwait count
---------------------------------
----------------------------------------------------------------------------
| Interface | Delta TxWait Time     | Congestion | Timestamp                |
|           | 2.5us ticks | seconds |            |                          |
----------------------------------------------------------------------------
| fc1/11    | 3435973     | 08      | 42%        | Sun Sep 30 05:23:05 2001 |
| fc1/11    | 6871947     | 17      | 85%        | Sun Sep 30 05:22:25 2001 |
32. PMON configuration: TXWait (default)
• Monitoring interval: 1 second
• Threshold type: delta
• Rising threshold: 40% (translates to 400 ms with a 1 second monitoring interval)
• Falling threshold: 0%
• Action: trap and syslog

What this means in English
Event: “If the aggregate (sum) of all the durations, measured with ns granularity, during which the port was at 0 Tx credits exceeds 400 ms in a 1 second polling interval”
Action: generate an SNMP trap and a syslog message
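Expressed as a port-monitor counter line (the same syntax as the sample policy on slide 43, which opts for a more aggressive 20% rising threshold), this default would look roughly like:

counter txwait poll-interval 1 delta rising-threshold 40 event 4 falling-threshold 0 event 4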
33. Congestion Drop timeout
• The MDS timestamps each received frame
• A frame is dropped, and the drop is logged, if it cannot be delivered to the egress port within the timeout
• Configurable from 100 ms to 500 ms (500 ms default)
• Lowering the value times out frames more quickly and reduces the effects of slow drain devices
(Diagram: frames queued inside an MDS, waiting on buffer credits that are not being returned.)
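For example, a tighter timeout on F ports (the value is in milliseconds; slide 47 uses 200 ms as a safe starting point) could be set like this:

system timeout congestion-drop 200 mode f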
34. no-credit-drop timeout (enhanced since 6.2(9))
• Frames are dropped from the egress queue of the slow port if credits are unavailable for the no-credit-drop timeout
(Diagram: as frames are dropped from the egress queue facing Server 1, the back pressure is released hop by hop, R_RDYs flow again, and the Disk 2 to Server 2 flow resumes.)
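A minimal sketch of enabling it on F ports, using the syntax from slide 47 (pick the timeout per your risk tolerance: 200 ms safe, 100 ms aggressive, 50 ms very aggressive):

system timeout no-credit-drop 200 mode f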
35. Education Experience Experiment
Build robust and self-healing storage area networks using new innovations on Cisco MDS and DCNM for solving SAN congestion
36. Troubleshooting Slow Drain – Methodology
Cisco recommends troubleshooting slow drain in the following order:
• Level 3: Extreme Delay
• Level 2: Retransmission
• Level 1: Latency
37. Troubleshooting Slow Drain – Methodology: Follow Congestion to Source
• If a port shows Rx congestion, find the ports communicating with it that show Tx congestion
  • Zoning defines which devices communicate with this port
  • Understand the topology
• If the port communicating with the port showing Rx congestion is FCIP:
  • Check for TCP retransmits
  • Check for over-utilization of the FCIP link
(Diagram: congestion traced from an F port with 0 Rx credits remaining to an E port with 0 Tx credits remaining.)
38. Troubleshooting Slow Drain – Methodology: Follow Congestion to Source (continued)
• When Tx congestion is found:
  • If it is an F port, the attached device is the slow drain device; if not,
  • If it is an E port, go to the adjacent switch and continue troubleshooting
• Continue tracking through the fabric until the destination F port is discovered
(Diagram: congestion followed across E ports, switch by switch, to the culprit F port.)
39. Port-monitor Alerting (new!)
Port-monitor allows monitoring of several counters relating to slow drain:
• credit-loss-reco: credit loss recovery counter
• lr-rx: the number of link resets received by the FC port
• lr-tx: link resets transmitted by the FC port
• timeout-discards: timeout discards counter
• tx-credit-not-available: credit-not-available counter (in 100 ms increments)
• tx-discards: Tx discards counter
• slowport-count: number of slowport events
• slowport-oper-delay: slowport operational delay
• txwait: amount of time at 0 Tx credits with packets queued
Note: there are other counters that are valuable and should also be considered for monitoring, but they are not part of slow drain.
40. Port-monitor Alerting – RMON event severities
Each event indicates the severity of the alert:
• 1: Fatal
• 2: Critical
• 3: Error
• 4: Warning
• 5: Informational
Categorizing counters at different severities gives better visual impact in DCNM.
mds9513(config-port-monitor)# show rmon events
Event 1 is active, owned by PMON@FATAL
Description is FATAL(1)
Event firing causes log and trap to community public, last fired never
Event 2 is active, owned by PMON@CRITICAL
Description is CRITICAL(2)
Event firing causes log and trap to community public, last fired never
Event 3 is active, owned by PMON@ERROR
Description is ERROR(3)
Event firing causes log and trap to community public, last fired never
Event 4 is active, owned by PMON@WARNING
Description is WARNING(4)
Event firing causes log and trap to community public, last fired 2014/02/21-17:13:11
Event 5 is active, owned by PMON@INFO
Description is INFORMATION(5)
Event firing causes log and trap to community public, last fired 2014/03/08-08:25:19
41. Port-monitor Alerting – Separate policies or a single policy
• Port-monitor allows separate policies for:
  • F and FL ports (access)
  • E and TL ports (trunks)
  • Both F ports and E ports
• Only one policy type per port can be active at a time
• Note: port-type access includes F port connections to NPV switches, which can carry several logins
• Note: NP ports are not currently monitored

MDS9513(config-port-monitor)# port-type ?
  access-port  Configure port-monitoring for access ports
  all          Configure port-monitoring for all ports
  trunks       Configure port-monitoring for trunk ports
42. Port-monitor Alerting – Command parameters
counter <name> poll-interval <interval> {delta | absolute} rising-threshold <rthresh> event <id> falling-threshold <fthresh> event <id> [portguard {errordisable | flap}]
• poll-interval: in seconds; how often this counter is checked
• delta: compare the current value with the value at the previous poll interval
• absolute: match the actual value
• rising-threshold: how much the counter must increase in a poll interval to trigger
• event: indicates the severity of the alert (info, warning, error, etc.)
• falling-threshold: how much the counter must decrease in a poll interval to reset
• portguard (optional): the action to take when the rising threshold is reached, as shown in the sketch after this list
  • errordisable: place the port in the error-disabled state; requires a manual shut/no shut to reactivate
  • flap: shut/no shut the port
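As a hedged illustration of portguard (the sample policy on slide 43 omits it; the counter name and thresholds here simply reuse that policy's credit-loss-reco line), appending "portguard flap" makes the switch shut/no shut the port when the rising threshold fires:

counter credit-loss-reco poll-interval 60 delta rising-threshold 1 event 2 falling-threshold 0 event 2 portguard flap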
43. Port-monitor Alerting – Example
port-monitor name AllPorts
port-type all
no monitor counter link-loss
no monitor counter sync-loss
no monitor counter signal-loss
no monitor counter invalid-words
no monitor counter invalid-crc
counter tx-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter lr-rx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter lr-tx poll-interval 60 delta rising-threshold 5 event 2 falling-threshold 1 event 2
counter timeout-discards poll-interval 60 delta rising-threshold 50 event 3 falling-threshold 10 event 3
counter credit-loss-reco poll-interval 60 delta rising-threshold 1 event 2 falling-threshold 0 event 2
counter tx-credit-not-available poll-interval 1 delta rising-threshold 10 event 4 falling-threshold 0 event 4
no monitor counter rx-datarate
no monitor counter tx-datarate
no monitor counter err-pkt-from-port
no monitor counter err-pkt-to-xbar
no monitor counter err-pkt-from-xbar
counter tx-slowport-count poll-interval 1 delta rising-threshold 5 event 4 falling-threshold 0 event 4
counter tx-slowport-oper-delay poll-interval 1 absolute rising-threshold 50 event 4 falling-threshold 0 event 4
counter txwait poll-interval 1 delta rising-threshold 20 event 4 falling-threshold 0 event 4
port-monitor activate AllPorts
Notes:
• port-type all: the policy applies to access (F) and trunk (E) ports
• The "no monitor counter" lines disable counters that this policy does not monitor
• Event severities used: event 2 = Critical, event 3 = Error, event 4 = Warning
• The policy above monitors 9 slow drain counters and does not monitor 10 others
44. Port-monitor Alerting – Activation and output
MDS9710-1# show port-monitor AllPorts
Policy Name : AllPorts
Admin status : Not Active
Oper status : Not Active
Port type : All Ports
---------------------------------------------------------------------------------------------------------
Counter Threshold Interval Rising Threshold event Falling Threshold event PMON Portguard
------- --------- -------- ---------------- ----- ------------------ ----- --------------
TX Discards Delta 60 50 3 10 3 Not enabled
LR RX Delta 60 5 2 1 2 Not enabled
LR TX Delta 60 5 2 1 2 Not enabled
Timeout Discards Delta 60 50 3 10 3 Not enabled
Credit Loss Reco Delta 60 1 2 0 2 Not enabled
TX Credit Not Available Delta 1 10% 4 0% 4 Not enabled
slowport-count Delta 1 5 4 0 4 Not enabled
slowport-oper-delay Absolute 1 50ms 4 0ms 4 Not enabled
txwait Delta 1 20% 4 0% 4 Not enabled
----------------------------------------------------------------------------------------------------------
47. Guidance on configuration
• Configure slowport-monitor at 10-25 ms for both E and F ports
  • system timeout slowport-monitor 10 mode e
  • system timeout slowport-monitor 10 mode f
• Configure congestion-drop on F ports
  • system timeout congestion-drop 200 mode f
• Configure no-credit-drop on F ports
  • system timeout no-credit-drop <ms> mode f
  • 200 ms: safe; 100 ms: aggressive; 50 ms: very aggressive
• Configure port-monitor policy(s)
  • Use the samples included in the port-monitor section
48. Education Experience Experiment
Build robust and self-healing storage area networks using new innovations on Cisco MDS and DCNM for solving SAN congestion
49. Refining the no-credit-drop timeout
• Step 1: Enable slowport monitoring on all devices (no performance impact!)
• Step 2: Monitor end-device performance (R_RDY delay) using either "show process creditmon slowport-monitor-events" or, better, "show logging onboard slowport-monitor-events"
• Step 3: Define the typical R_RDY delay on slow ports (average, peak, variance, etc.)
• Step 4: Use the (typical value + variance) as the no-credit-drop timeout
Result: automatic recovery the moment a port sees an R_RDY delay greater than "typical"
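A sketch of step 4, assuming the benchmarking in steps 2 and 3 yielded a typical-plus-variance value of 120 ms (a hypothetical number; substitute your own measurement):

system timeout no-credit-drop 120 mode f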
50. Predicting slow drain: Fabric Benchmarking
• Find the delay values on ports at which application performance is still acceptable
• Upward variance in the delay value may lead to degraded application performance
• Use the following for fabric benchmarking:
  • DCNM slow drain analysis
  • MDS slowport-monitor
  • MDS TxWait health graph
  • MDS TxWait percentage congestion
  • Slow drain SNMP MIBs
  • Port-monitor alerts
51. Fabric Benchmarking using slowport-monitor and TxWait
• slowport-monitor at 10 ms on E & F ports
• congestion-drop on F ports at 200 ms
• no-credit-drop on F ports (200 ms safe, 100 ms aggressive, 50 ms very aggressive)
• Configure port-monitor policy(s)
Education Experience Experiment: use Cisco MDS & DCNM to build robust and self-healing storage area networks.