SnapDiff performance issue

The factors affecting SnapDiff throughput are:
• The hardware platform [How many cores does the NetApp filer have? SnapDiff is CPU intensive, so the more cores the
better, especially if you run parallel SnapDiff jobs against volumes on the same node. For testing, run just
one job, note the performance, then slowly increase the number of parallel jobs or stagger them]
• The version of Data ONTAP [Check for any known bugs]
• Disk I/O latency [Is the volume/aggregate close to the edge, i.e. more than 90% full?]
• Volumes with a large number of files [e.g. a million files in a single directory]
• SnapDiff sessions running in parallel with disk-intensive workloads
• Make sure SnapDiff jobs do not overlap with the SnapVault and SnapMirror update cycles
• Disable network teaming on the Media Agent (MA) server and test the jobs again [Use Wireshark: a packet trace between
the controller and the application host server might help diagnose network-level issues, especially packet
truncation. Refer to NetApp KB ID: 1028373]
• Ensure the network drivers are up to date on the MA
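The advice above to start with one job and then slowly add parallel jobs, or stagger them, can be sketched as a simple start-time plan. This is only an illustration: the function name, the volume names, the 30-minute gap and the limit of two jobs per slot are all assumptions, not values from NetApp or CommVault, and the real schedule would be configured in the backup application itself.

```python
from datetime import datetime, timedelta


def stagger_schedule(volumes, first_start, gap_minutes, max_parallel):
    """Assign start times so that at most `max_parallel` jobs START in the same slot.

    Hypothetical helper: it only spaces out start times. It does not know how
    long each job runs, so long jobs can still overlap with the next slot.
    """
    schedule = []
    for i, volume in enumerate(volumes):
        slot = i // max_parallel  # which time slot this volume falls into
        start = first_start + timedelta(minutes=slot * gap_minutes)
        schedule.append((volume, start))
    return schedule


# Example values are made up for illustration.
plan = stagger_schedule(
    ["vol1", "vol2", "vol3", "vol4", "vol5"],
    first_start=datetime(2019, 6, 1, 22, 0),
    gap_minutes=30,
    max_parallel=2,
)
for volume, start in plan:
    print(volume, start.strftime("%H:%M"))
```

Starting with max_parallel=1 and raising it only while throughput holds up mirrors the "run one job, then slowly increase" approach described in the list.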
Symptoms:
'Invalid session-id' basically translates to a time-out between ONTAP and the Media Agent.
SnapDiff is simply a 'Snapshot Differencing' tool: by looking at the metadata changes between two snapshots, it quickly
returns the list of file/folder names that have been modified, deleted or added since the previous snapshot. In short, a
SnapDiff-enabled backup job has 4 phases. SnapDiff session-related errors such as 'invalid' or 'max-sessions exceeded' can
occur in phases 2 and 3:
1) Backup application takes a snapshot:
This should be quick, ideally less than 2 seconds. [Because of the way WAFL works, no data is copied
here, and the snapshot is taken at the controller level]
2) NetApp creates a session ID with CommVault for SnapDiff communication [snapdiff-iter-start]
Establishing a session should be quick as well.
3) Repeated snapdiff-iter-next API calls pull the index, i.e. the file/folder names [snapdiff-iter-next]
This is the phase in which the actual index is updated on the Media Agent side, so time-outs are most likely to
occur here. For session-ID time-outs, look at the 'mgwd logs' and 'audit logs' on the NetApp side. This is the key communication
phase: it is likely that indexing is taking too long on the MA (NASiDA <----> CVD) and hitting the time-out,
in other words the session is dead or invalid before the next snapdiff-iter-next arrives at NetApp.
Note: the snapdiff-iter-next API is repeated until 100% of the diffs are retrieved.
4) End of the SnapDiff session [snapdiff-iter-end]
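To make phases 2 to 4 concrete, here is a minimal, self-contained simulation of the session lifecycle. It is a toy model only: FakeFiler, run_backup, the idle time-out value, the batch size and the exception names are hypothetical and do not represent the ONTAP API or CommVault code. It only shows how slow indexing between two snapdiff-iter-next calls can leave the session invalid.

```python
class InvalidSession(Exception):
    """Raised by the toy filer when a session is unknown or has idled too long."""


class MaxSessionsExceeded(Exception):
    """Raised by the toy filer when no more sessions can be opened."""


class FakeFiler:
    """In-memory stand-in for the controller side (an assumption, not ONTAP)."""

    def __init__(self, diffs, idle_timeout, max_sessions, batch_size=2):
        self.diffs = diffs
        self.idle_timeout = idle_timeout
        self.max_sessions = max_sessions
        self.batch_size = batch_size
        self.sessions = {}  # session id -> [cursor, time of last call]
        self.next_id = 1

    def iter_start(self, now):
        # Phase 2: the session limit is enforced when a session is opened.
        if len(self.sessions) >= self.max_sessions:
            raise MaxSessionsExceeded("max-sessions exceeded")
        sid = self.next_id
        self.next_id += 1
        self.sessions[sid] = [0, now]
        return sid

    def iter_next(self, sid, now):
        # Phase 3: the session dies if the client waited too long between calls.
        session = self.sessions.get(sid)
        if session is None or now - session[1] > self.idle_timeout:
            self.sessions.pop(sid, None)
            raise InvalidSession("invalid session-id")
        cursor = session[0]
        batch = self.diffs[cursor:cursor + self.batch_size]
        session[0] = cursor + len(batch)
        session[1] = now
        done = session[0] >= len(self.diffs)
        return batch, done

    def iter_end(self, sid):
        # Phase 4: release the session.
        self.sessions.pop(sid, None)


def run_backup(filer, index_seconds_per_batch):
    """Walk phases 2-4 on a simulated clock and return the indexed entries."""
    clock = 0.0
    sid = filer.iter_start(clock)
    indexed = []
    try:
        done = False
        while not done:
            batch, done = filer.iter_next(sid, clock)
            indexed.extend(batch)
            clock += index_seconds_per_batch  # time spent indexing on the client
    finally:
        filer.iter_end(sid)
    return indexed


diffs = ["/a", "/b", "/c", "/d", "/e"]
print(run_backup(FakeFiler(diffs, idle_timeout=60, max_sessions=1), 10))
try:
    run_backup(FakeFiler(diffs, idle_timeout=60, max_sessions=1), 120)
except InvalidSession as exc:
    print("backup failed:", exc)
```

With 10 seconds of indexing per batch the loop completes; with 120 seconds per batch the second snapdiff-iter-next arrives after the simulated 60-second idle limit and the session is rejected.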
Please talk to CommVault support to determine whether this is indeed the issue. In some cases, increasing the time-out on the MA can
help, but ideally one should investigate why indexing takes that long; on an SSD-based Windows/Linux MA in particular, it should not
be necessary to increase the time-out.
IMP: Please note the difference between 'invalid session' and 'max sessions' errors. Invalid sessions can be the result of
a time-out during the snapdiff-iter-next iterations. Max-sessions errors, however, mean you have reached the limit on parallel
SnapDiff jobs, and the error should affect the snapdiff-iter-start phase. Troubleshoot your case according to the scenario
that applies.
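The distinction above can be captured in a small triage helper. This is a sketch under stated assumptions: the function name is hypothetical, and matching on the substrings "invalid", "max" and "session" is an assumption based only on the error wording quoted in this note; the exact strings in your logs may differ.

```python
def classify_snapdiff_error(message):
    """Map an error message to the likely phase and a first troubleshooting hint.

    Hypothetical helper: substring matching is an assumption, not a documented
    error format.
    """
    text = message.lower()
    if "max" in text and "session" in text:
        return ("snapdiff-iter-start",
                "parallel session limit reached; reduce or stagger SnapDiff jobs")
    if "invalid" in text and "session" in text:
        return ("snapdiff-iter-next",
                "session likely timed out; check indexing time on the Media Agent")
    return ("unknown", "not one of the two session errors discussed here")


for msg in ("Invalid session-id", "max-sessions exceeded"):
    phase, hint = classify_snapdiff_error(msg)
    print(f"{msg!r}: phase={phase}; {hint}")
```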

Workaround: [If you cannot afford to stop the jobs while the investigation is pending, follow this workaround]
1) Skip cataloging for SnapDiff jobs.
2) Run cataloging later, during quiet hours.
3) If you need to restore any data, don't panic: you can use the live browse feature.
SKIP:
Skip Cataloging for File System IntelliSnap: Depending on many factors, such as the number of files on the file server
volume, the cataloging operation might be resource intensive. Skipping the cataloging operation during backups results in
faster IntelliSnap backup jobs. You can run the cataloging phase during off-peak times as a separate job on the storage
policy.
Run catalog later: [off-peak]
If you skip cataloging during IntelliSnap backups, you can later run cataloging on the IntelliSnap backup jobs, or you can use
the backup copy of the IntelliSnap job for cataloging and moving the job to media. To configure cataloging to run later, see
Configuring a Storage Policy for Cataloging Snapshots.
For more info, refer to this CommVault KB:
https://documentation.commvault.com/commvault/v11_sp15/article?p=36227.htm
Live browse:
To restore data, you can perform a live browse from the snapshot.
https://documentation.commvault.com/commvault/v11_sp15/article?p=36231.htm
SnapDiff FAQ:
https://www.slideshare.net/AshwinPawar/snapdiff
ashwinwriter@gmail.com
June, 2019
