Diagnostics
● SP Diags (stinger)
● Spdiag Tool (galaxy)
● SunVTS
● PC Check
● HDT (sundiag replacement)
● CSTH
● Herd & EDAC
● MCAT
● Bonnie
● Memtest
● Web Pages
● Decoder Tools
Sun Confidential: Internal Only 3
Day 2 PM and Acknowledgements
• I have borrowed/stolen/copied* the following in this
presentation.
• Newisys decoder from Barry Wright
• HDT from Bernward Schwartz
• SGR from
http://panacea/twiki/bin/view/SGR/WebHome
SP Diags for V20/40z
• Not to be confused with “spdiag” Tool
• Bootable CD (nsv 2.2.0.6 or above required) or SP based
• Enable diagnostic boot in BIOS for bootable CD
• NSV installed on remote system and mounted
locally by NFS
SP Diags for V20/40z
● Install diags:
cp -r /mnt/cdrom/nsv_file /mnt/nsv/
cd /mnt/nsv/
unzip -a *.zip
chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
Note: Now ensure NFS is enabled on the server and that it can export the file system
sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
sp update diags -p /mnt/diags/DIAGS_version#
SP Diags for V20/40z
● diags start (for standalone)
● diags start -n (on-line nic,disk,mem)
● diags get state (confirm diags are loaded)
● diags get tests (list diagnostics tests)
● diags run tests -av
● diags run tests -av >/mnt/log/diags.log
● diags terminate
Ensure the diags, BIOS, and drivers are compatible; diags will fail to run otherwise
SP Diags for V20/40z
• diags -h (shows all syntax)
• diags -a -v (full test)
• Bootable CD
> diags terminate -n
> diags start -n
> diags run tests -a -v >diags.out &
> tail -f diags.out
The “spdiag” Tool (Galaxy)
• SP based diagnostic
• Test i2c , voltage , fans , temp
• Stop IPMI: /etc/init.d/ipmistack stop
• /usr/local/bin/spdiag 1 g4 i2ctst
• Reboot SP
PC Check
• Supplemental/Tools CD and now boot menu
• AMD based X2100,X2100M2,X2200M2 and all new
X4x40 platforms
• All Intel based platforms
• Monitor and keyboard
• Serial port
• Scripts , burn-in tests , loopback
PC Check
• Front Menu:
System Information menu
Advanced Diagnostics Tests
Immediate Burn-in Testing
Deferred Burn-in Testing
Create Diagnostic Partition
Show Results Summary
Print Results Report
PC Check
• Burn-in Testing:
> quick.tst - requires user input, no time-out
> noinput.tst – no user input, good first test
> full.tst – requires loopback & user input
• Command Line:
> Example: pccheck cpu.tst /BD
> pccheck /? - shows all flags
> pccheck suncsi.tst /IS /BD /KS /MH 30 /HMD 1m /HDD 1m /SD 5m
SUNvts
• What are you trying to test/replicate ?
• Local or bootable CD-ROM
• Galaxy 2.2 cd contains vts6.3
• GUI or command line
• Unsupported platforms:
> /opt/SUNWvts/lib/conf/platform.conf
> smbios | grep Product
> Boot with graphics head
> Edit tty boot: console=ttya,ttya-mode="9600,8,n,1,-"
HDT (Hardware Debug Tool)
• PLEASE USE WITH CAUTION !!!
• Will hang the host if OS running
• Reboot SP after use
http://panacea/twiki/bin/view/Products/Galaxydiag
• Additional tools:
• /usr/local/bin/collectHostStatus.sh
-nohdtl disables hdt test
• /usr/local/bin/collectDebugInfo.sh
Platform Specifics
• On G4:
> HDT uses some signals over the i2c bus => IPMI on the SP has
to be shut down. SP should be rebooted when done with hdt
diags.
> JTAG chain goes through all CPU modules => All slots must have
CPU or filler module inserted for HDT to work on G4
> Direct access to all CPUs; default is CPU 0
• Other Platforms:
> Only CPU0 in JTAG chain
> no i2c involved, only used for platform identification
From Bernward Schwarte's presentation.
Getting Started & Cautions
• hdt or hdtl?
> Current hdt binary and some documentation at:
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/SpBasedHdtDiag (under Galaxy->Pre-OS-Diagnostics)
> Copy to SP: scp hdt sunservice@<SPIP>:/coredump
> ssh sunservice@<SPIP> password: changeme
> cd /coredump (or check /usr/local/bin for the built in copy)
• Caution: All hdt commands stop the CPUs. Some hdt commands will reset or power-cycle the system.
• All command line parameters are interpreted as hex values!
• ./hdt prints the syntax of all available commands
• ./hdt -pd 0 18 0
• hdt leaves the CPU in HDT mode when exiting; use the "-e" option to exit HDT mode
Available Commands/Diagnostics
• Basics:
> Single HDT command: -h * Note: this is not -help
> Access io- and memory space: -mr, -mw, -ir, -iw
> Access CPU registers: -rd, -rr, -rw
> Single step: -hs
• Control:
> Reset : -xr [b c]
> Stop at reset: -xs [b c p]
> Resource init: -hi (sets up HT routing and resources)
> Power On/Off: -o [0 off, 1 cycle, 2 on]
> Set CPU: -c (G4 only)
> Breakpoints: -bps -bpm -bpc
> Exit: -e
Diagnostics
• Extended:
> PCI configuration space access: -pr, -pw, -pd, -ps
> “Dump” commands
– Machine check: -dm
– DIMM SPD: -dd
– CMOS: -dc
– SIO: -ds
– Flash: -df
> HT link testing: -a
– Power-cycles, stops at the reset vector, sets all HT links, then warm resets
HDT Not Working
> Depending on System state HDT can be non-functional
> To capture some system/error state:
– Reset system and stop at reset vector: hdt -xs b
– Init HT routing and PCI bridge enumeration hdt -hi
– Dump Machine check and HT link status: hdt -dm -dl
hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.7.0
-------------------------------------------------------
hdtDiag: Error, HDT command failed, no CFF cpu 0
hdtDiag: SysIdent: HDT access failing
hdtDiag: defaulting to G12X
HDT
• Check Versions 0.8.0 , 0.8.3 , 0.9.9, 1.3, 1.4.1 etc
• ./hdt -xs
• ./hdt -hi
• ./hdt -l -q or try ./hdt -l -a
• ./hdt -e
• Reboot SP
CSTH (Continuous System Telemetry Harness)
• Calls ipmitool to create a telemetry stream of:
> volt,temp,current,fans and PSU variables
● Collect data and submit for analysis to engineering:
● ./start-csth-ipmi <spname> <splogin> <sppasswd> [--interval <numsecs>]
● Example:
➢ ./start-csth-ipmi test-sp admin test.pass 60 &
➢ ./stop-csth-ipmi test-sp
CSTH (Example From an x4200)
HERD (Hardware Error Report Decode)
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
•Hardware error report and decoding from mcelog or
via the command line with kernel 2.6.4 or above
•Installed as RPM on top of SLES and Red Hat
•Provided by Sun Microsystems
•Will report errors to messages file and service
processor (if applicable)
•Same command line options as mcelog
•When using the herd -e function, herd must be run on the same host that reported the errors.
HERD (Hardware Error Report Decode)
•Example from console / logs:
Mar 5 18:03:01 va64-x2200c-gmp03 herd: HARDWARE ERROR. This is *NOT* a software problem!
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Please contact your hardware vendor
Mar 5 18:03:01 va64-x2200c-gmp03 herd: CPU 0 4 northbridge
Mar 5 18:03:01 va64-x2200c-gmp03 herd: TSC fcc73b11cf
Mar 5 18:03:01 va64-x2200c-gmp03 herd: ADDR 142110
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Northbridge Chipkill ECC error
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Chipkill ECC syndrome = 11ea
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit46 = corrected ecc error
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit57 = processor context corrupt
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit61 = error uncorrected
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
Mar 5 18:03:01 va64-x2200c-gmp03 herd: STATUS b675410011080a13 MCGSTATUS 0
•Example of running herd manually (pre herd install):
# herd -e 142110
000000142110: Cpu Node 0, DIMM 2
EDAC (Kernel 2.6.20.xx and above)
Two examples from an x2200: EDAC failing to decode, then EDAC decoding correctly.
Example of EDAC not working:
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
<< Multiple CEs in quick succession, or a DIMM layout that EDAC does not understand
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: Failed to translate InputAddr to csrow for address 0xbb2c2fc0
Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac
Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac Error Overflow set
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
^^ Failed to translate because the overflow bit is set
This happens if more than one error has occurred before EDAC gets to it, or if EDAC does not understand the DIMM layout.
Here is the correct format of EDAC's output:
Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
<< Always CPU 0 (the CPU reporting the error)
Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac
^^ This event identifies the actual offending CPU, which in this instance is CPU 0. (The label is not used by default, but Sun or the customer may populate it.)
Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC error
<< Decode below:
MC0: CE error at page 0x100010, offset 0x10. EDAC reports a page frame number plus an offset within that page, so with 4 KB pages the address is (0x100010 << 12) + 0x10 = 0x100010010. Grain = 8, which is Chipkill. Row 0, Channel 0 = CPU0, DIMM0.
Row      Channel 0   Channel 1
======   =========   =========
csrow0   DIMM_A0     DIMM_B0
csrow1   DIMM_A0     DIMM_B0
csrow2   DIMM_A1     DIMM_B1
csrow3   DIMM_A1     DIMM_B1
If single-rank DIMMs (1 GB or less) are fitted, csrow1 and csrow3 are not used/available.
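The row/channel lookup above can be sketched in a few lines of Python. This is only an illustration: the regular expression is written against the one k8_edac CE line shown on this slide, and the helper names (parse_ce, dimm_label) are made up for the example rather than taken from any Sun or EDAC tool.

```python
import re

# Pattern written against the single k8_edac CE line on this slide (assumption:
# other EDAC drivers or kernel versions may format the line differently).
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

def parse_ce(line):
    """Return the numeric fields of a CE line, or None if it does not match."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "mc": int(fields["mc"]),
        "page": int(fields["page"], 16),
        "offset": int(fields["offset"], 16),
        "grain": int(fields["grain"]),
        "syndrome": int(fields["syndrome"], 16),
        "row": int(fields["row"]),
        "channel": int(fields["channel"]),
    }

def dimm_label(row, channel):
    """Map csrow/channel to a label using the x2200 table on this slide."""
    bank = "A" if channel == 0 else "B"   # channel 0 -> DIMM_Ax, channel 1 -> DIMM_Bx
    return f"DIMM_{bank}{row // 2}"       # csrow0/1 -> slot 0, csrow2/3 -> slot 1

LINE = ('Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, '
        'grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac')
event = parse_ce(LINE)
print(event["mc"], dimm_label(event["row"], event["channel"]))
```

For the sample line this reports memory controller 0 and DIMM_A0, which agrees with the "CPU0, DIMM0" reading given on the slide.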
EDAC Continued:
•Example output from the SP (not created by edac):
1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
>>> Edac log should be here but does not show - Instead, you just see the BIOS scrubber results <<<
2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
4 | 02/25/2008 | 18:28:00 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
5 | 02/25/2008 | 19:28:02 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
6 | 02/25/2008 | 20:28:04 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
7 | 02/25/2008 | 21:28:06 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
8 | 02/25/2008 | 22:28:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
9 | 02/25/2008 | 23:28:10 | Memory CPU0 DIMM0 | Correctable ECC | Asserted ... and so on ...
Cat /proc/mc/0 for a row/column summary of the events that have occurred.
Use EDAC or herd, not both! They both try to grab /dev/mce events and report them (rmmod k8_edac to remove EDAC).
And remember, the SEL is your friend, so always get an IPMI dump before escalating or decoding.
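When the SEL is long, a quick tally per DIMM helps before escalating. The sketch below assumes the pipe-separated layout of the SP output shown on this slide; the function name and the sample text are illustrative only.

```python
from collections import Counter

def count_correctable_ecc(sel_text):
    """Count 'Correctable ECC' SEL entries per sensor name."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # not a SEL record in the layout shown on this slide
        sensor, event = fields[3], fields[4]
        if event == "Correctable ECC":
            counts[sensor] += 1
    return counts

# First three records copied from the slide.
SAMPLE = """\
1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
"""
print(count_correctable_ecc(SAMPLE).most_common())
```

A DIMM that dominates the tally, as CPU0 DIMM0 does here, is the first candidate to check against the EDAC or herd decode.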
The “mcelog”
• Linux kernels after 2.6.4 do not print recoverable
machine check errors
• Messages are saved in /var/log/mcelog
• mcelog reads errors from /dev/mcelog and then deletes the entries
• Typically run as a cron job:
> /usr/sbin/mcelog >> /var/log/mcelog
> *Note this is not collected by sysreport
• RedHat implemented as a daemon
• See RedHat advisory RHEA-2006-0134-7
MCAT (Machine Check Analysis Tool)
Takes input from a Windows Event Log entry and decodes the output:
Event Source 62 - WMIxWDM
Processor Number : 0
Bank Number : 4
Time Stamp (0x): 01C856C4 58A8C10D
Error Status (0x): D4714000 E1080A13
Error Address (0x): 00000000 A047BF50
Error Misc. (0x): 00000000 00000000
Single bit errors:
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Second error
Error valid
Bus Error Code:
Participation processor: Local node responded to the request (RES)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0A13
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
ChipKill Syndrome: 0xE1E2
Error address at 2564 MB
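The fields MCAT and herd print can be recovered from the raw status value by hand. The Python sketch below is not MCAT's implementation; the bit positions are an assumption based on the family 0Fh northbridge status layout and have only been checked against the two status values that appear in this deck (D4714000 E1080A13 here and b675410011080a13 in the herd example).

```python
def decode_nb_status(status):
    """Split a 64-bit northbridge (bank 4) MCi_STATUS value into fields.

    Bit positions are assumptions for family 0Fh parts; verify them against
    the AMD documentation for the processor revision in hand.
    """
    def bit(n):
        return bool((status >> n) & 1)

    syndrome_low = (status >> 47) & 0xFF    # bits 54:47, syndrome [7:0]
    syndrome_high = (status >> 24) & 0xFF   # bits 31:24, chipkill syndrome [15:8]
    return {
        "valid": bit(63),
        "overflow": bit(62),            # MCAT prints this as "Second error"
        "uncorrected": bit(61),
        "enabled": bit(60),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "corrected_ecc": bit(46),
        "uncorrected_ecc": bit(45),
        "ext_error_code": (status >> 16) & 0xF,
        "error_code": status & 0xFFFF,
        "syndrome": (syndrome_high << 8) | syndrome_low,
    }

mcat = decode_nb_status(0xD4714000E1080A13)
print(hex(mcat["syndrome"]), hex(mcat["ext_error_code"]), hex(mcat["error_code"]))
```

For the MCAT sample this yields syndrome 0xe1e2, extended error code 0x8 and error code 0xa13, matching the decoded output on the slide; the herd sample yields syndrome 0x11ea with bits 46, 57 and 61 set.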
MCAT Continued
• FRU output for this failing platform, gathered by running ipmitool fru:
FRU Device Description : p0.fru (ID 6)
Product Manufacturer : ADVANCED MICRO DEVICES
Product Name : DUAL CORE AMD OPTERON(TM) 275
Product Part Number : 0F21
Product Version : 02
FRU Device Description : p0.d0.fru (ID 8)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D7010058
FRU Device Description : p0.d1.fru (ID 9)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D7010056
FRU Device Description : p0.d2.fru (ID 10)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D701A6F4
FRU Device Description : p0.d3.fru (ID 11)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D701A6EE
Manual Diagnosis
Windows reporting decode is performed as follows:
Processor Number: 0 means CPU 0 (if it said 4, that would be socket CPU4, not core 4).
Error address at 2564 MB (i.e. between 2 and 3 GBytes).
From the FRU information, each DIMM is 1 GByte.
The DIMMs are numbered from closest to the CPU outwards, based on the mapping.
(DIMMs should be populated from the outside inward but are mapped from closest to the CPU outwards.)
The BIOS sets up memory from DIMM0/1 outwards.
Assuming "optimal defaults":
Our Opterons use a 128-bit wide data path. DIMM0 and DIMM1 are used as a pair.
These are single-rank DIMMs and they are all the same, so chip-select interleaving is in effect.
The first 128KB are on DIMM0 and 1. The second 128KB are on DIMM2 and 3.
2564/128 = 20.03, which is in the DIMM0 and DIMM1 pair.
(Always replace Opteron platform DIMMs in pairs.)
Manual Diagnosis
Manual diagnosis continued:
ChipKill Syndrome: 0xE1E2
Looking this up in table 26 of the AMD BIOS and Kernel Developer's Guide shows this is symbol 0x1a which, according to the text above table 26, maps to the upper 64 bits of the 128-bit data path.
DIMM0 (symbols 00h-0fh) provides the low 64 bits; DIMM1 (symbols 10h-1fh) provides the high 64 bits.
The check bits for the lower 64 bits are symbols 20h-21h and the check bits for the upper 64 bits are symbols 22h-23h.
Technical documentation including the AMD BIOS and Kernel Developer's Guide is available from AMD via:
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html
Remember to download the correct document for your processor revision:
Single/dual core Opteron for x2100, x2200, x4100, x4200, x4500, x4600 etc. is document family 0fh.
Quad core Opteron for supported platforms is document family 10h.
Manual Diagnosis
Chipkill Syndrome Table for 0Fh CPUs, data bits 0-63
Symbol 1h 2h 3h 4h 5h 6h 7h 8h 9h ah bh ch dh eh fh
00h e821 7c32 9413 bb44 5365 c776 2f57 dd88 35a9 a1ba 499b 66cc 8eed 1afe f2df
01h 5d31 a612 fb23 9584 c8b5 3396 6ea7 eac8 b7f9 4cda 11eb 7f4c 227d d95e 846f
02h 0001 0002 0003 0004 0005 0006 0007 0008 0009 000a 000b 000c 000d 000e 000f
03h 2021 3032 1013 4044 6065 7076 5057 8088 a0a9 b0ba 909b c0cc e0ed f0fe d0df
04h 5041 a082 f0c3 9054 c015 30d6 6097 e0a8 b0e9 402a 106b 70fc 20bd d07e 803f
05h be21 d732 6913 2144 9f65 f676 4857 3288 8ca9 e5ba 5b9b 13cc aded c4fe 7adf
06h 4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f
07h 74e1 9872 ec93 d6b4 a255 4ec6 3a27 6bd8 1f39 f3aa 874b bd6c c98d 251e 51ff
08h 15c1 2a42 3f83 cef4 db35 e4b6 f177 4758 5299 6d1a 78db 89ac 9c6d a3ee b62f
09h 3d01 1602 2b03 8504 b805 9306 ae07 ca08 f709 dc0a e10b 4f0c 720d 590e 640f
0ah 9801 ec02 7403 6b04 f305 8706 1f07 bd08 2509 510a c90b d60c 4e0d 3a0e a20f
0bh d131 6212 b323 3884 e9b5 5a96 8ba7 1cc8 cdf9 7eda afeb 244c f57d 465e 976f
0ch e1d1 7262 93b3 b834 59e5 ca56 2b87 dc18 3dc9 ae7a 4fab 642c 85fd 164e f79f
0dh 6051 b0a2 d0f3 1094 70c5 a036 c067 20e8 40b9 904a f01b 307c 502d 80de e08f
0eh a4c1 f842 5c83 e6f4 4235 1eb6 ba77 7b58 df99 831a 27db 9dac 396d 65ee c12f
0fh 11c1 2242 3383 c8f4 d935 eab6 fb77 4c58 5d99 6e1a 7fdb 84ac 956d a6ee b72f
Manual Diagnosis
Chipkill Syndrome Table for 0Fh CPUs, data bits 64-127
Symbol 1h 2h 3h 4h 5h 6h 7h 8h 9h ah bh ch dh eh fh
10h 45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f
11h 63e1 b172 d293 14b4 7755 a5c6 c627 28d8 4b39 99aa fa4b 3c6c 5f8d 8d1e eeff
12h b741 d982 6ec3 2254 9515 fbd6 4c97 33a8 84e9 ea2a 5d6b 11fc a6bd c87e 7f3f
13h dd41 6682 bbc3 3554 e815 53d6 8e97 1aa8 c7e9 7c2a a16b 2ffc f2bd 497e 943f
14h 2bd1 3d62 16b3 4f34 64e5 7256 5987 8518 aec9 b87a 93ab ca2c e1fd f74e dc9f
15h 83c1 c142 4283 a4f4 2735 65b6 e677 f858 7b99 391a badb 5cac df6d 9dee 1e2f
16h 8fd1 c562 4ab3 a934 26e5 6c56 e387 fe18 71c9 3b7a b4ab 572c d8fd 924e 1d9f
17h 4791 89e2 ce73 5264 15f5 db86 9c17 a3b8 e429 2a5a 6dcb f1dc b64d 783e 3faf
18h 5781 a9c2 fe43 92a4 c525 3b66 6ce7 e3f8 b479 4a3a 1dbb 715c 26dd d89e 8f1f
19h bf41 d582 6ac3 2954 9615 fcd6 4397 3ea8 81e9 eb2a 546b 17fc a8bd c27e 7d3f
1ah 9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf
1bh cce1 4472 8893 fdb4 3155 b9c6 7527 56d8 9a39 12aa de4b ab6c 678d ef1e 23ff
1ch a761 f9b2 5ed3 e214 4575 1ba6 bcc7 7328 d449 8a9a 2dfb 913c 365d 688e cfef
1dh ff61 55b2 aad3 7914 8675 2ca6 d3c7 9e28 6149 cb9a 34fb e73c 185d b28e 4def
1eh 5451 a8a2 fcf3 9694 c2c5 3e36 6a67 ebe8 bfb9 434a 171b 7d7c 292d d5de 818f
1fh 6fc1 b542 da83 19f4 7635 acb6 c377 2e58 4199 9b1a f4db 37ac 586d 82ee ed2f
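The table lookup can be sketched in Python as below. Only three rows are loaded here (06h, 10h and 1ah, copied from the tables above) to keep the example short; a real helper would load all 32. Reading the column as a bitmask within a 4-bit symbol, so that a single-bit error sits at data bit symbol*4 + bit, is an assumption; it does agree with the Newisys decoder output later in this deck, which maps syndrome 5E34 to data bit 66.

```python
# Excerpt only: three rows copied from the family 0Fh tables above.
# Columns are the error bitmasks 1h..fh within the symbol.
CHIPKILL_ROWS = {
    0x06: "4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f",
    0x10: "45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f",
    0x1A: "9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf",
}

def find_symbol(syndrome):
    """Return (symbol, bitmask) for a syndrome, or None if it is not loaded."""
    for symbol, row in CHIPKILL_ROWS.items():
        for bitmask, cell in enumerate(row.split(), start=1):
            if int(cell, 16) == syndrome:
                return symbol, bitmask
    return None

def dimm_in_pair(symbol):
    """Symbols 00h-0fh are the low 64 bits (DIMM0), 10h-1fh the high 64 (DIMM1)."""
    return 0 if symbol < 0x10 else 1

def data_bit(symbol, bitmask):
    """Data bit for a single-bit error (assumed 4 bits per symbol), else None."""
    if bitmask & (bitmask - 1):
        return None                      # more than one bit set in the symbol
    return symbol * 4 + bitmask.bit_length() - 1

for syndrome in (0xE1E2, 0x5E34, 0xA1E8):
    symbol, bitmask = find_symbol(syndrome)
    print(hex(syndrome), hex(symbol), dimm_in_pair(symbol), data_bit(symbol, bitmask))
```

The three syndromes are the ones seen elsewhere in this deck: 0xE1E2 from MCAT, 0x5E34 from the Newisys decoder and 0xA1E8 from the EDAC log.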
Manual Diagnosis
ECC Syndrome Table (for completeness) for 0Fh CPUs
(Single Error Correction, Double Error Detection):
n=0 n=1 n=2 n=3 n=4 n=5 n=6 n=7
Bit (0+n) ce cb d3 d5 d6 d9 da dc
Bit (8+n) 23 25 26 29 2a 2c 31 34
Bit (16+n) 0e 0b 13 15 16 19 1a 1c
Bit (24+n) e3 e5 e6 e9 ea ec f1 f4
Bit (32+n) 4f 4a 52 54 57 58 5b 5d
Bit (40+n) a2 a4 a7 a8 ab ad b0 b5
Bit (48+n) 8f 8a 92 94 97 98 9b 9d
Bit (56+n) 62 64 67 68 6b 6d 70 75
Bit (64+n) 01 02 04 08 10 20 40 80
*Typically used for single DIMM configurations
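Because each row of this table covers eight consecutive bits, the lookup reduces to an index into a flat list. The Python sketch below copies the values from the table above; treating bits 64 and up as the check bits is an assumption, not something the slide states.

```python
# Values copied row by row from the table above: row r, column n is bit 8*r + n.
ECC_CELLS = """
ce cb d3 d5 d6 d9 da dc
23 25 26 29 2a 2c 31 34
0e 0b 13 15 16 19 1a 1c
e3 e5 e6 e9 ea ec f1 f4
4f 4a 52 54 57 58 5b 5d
a2 a4 a7 a8 ab ad b0 b5
8f 8a 92 94 97 98 9b 9d
62 64 67 68 6b 6d 70 75
01 02 04 08 10 20 40 80
""".split()

# Invert the table: syndrome value -> failing bit number.
SYNDROME_TO_BIT = {int(cell, 16): index for index, cell in enumerate(ECC_CELLS)}

def failing_bit(syndrome):
    """Return the bit number for a single-bit ECC syndrome, or None."""
    return SYNDROME_TO_BIT.get(syndrome)

print(failing_bit(0xCE), failing_bit(0x75), failing_bit(0x01))
```

A syndrome that is not in the table returns None, which for this code means the error was not a correctable single-bit error.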
Other Tools/Diags (Un-supported)
• Bonnie
> Benchmark to measure performance of filesystem
http://www.textuality.com/bonnie/
• Memtest86+
> Standalone bootable diagnostic
http://www.memtest.org/
http://www.memtest86.com/ Original version
• Other memory tool
http://people.redhat.com/dledford/memtest.html
http://sourceforge.net/
• Netperf or ttcp: network tools (search the web for them)
SGR
• Situation appraisal – Recognise a problem
• Problem Analysis - Find True Cause
http://systems-tsc/twiki/pub/SGR/SgrtOnlineHelp/PA-guide.pdf
The Steps in FTC are:
* Define a Problem Statement
* Describe the problem with a Problem Specification
* Develop Possible Causes from either Experience or Differences and Changes
* Identify the Most Probable Cause
* Test the Most Probable Cause against the Problem Specification
* Verify the Most Probable Cause
Newisys MCE Decoder v20/40z
What to gather from inventory get all -v:
1. How many CPUs?
2. How many DIMMs per CPU?
3. What is the part number of the DIMM?
NOTE: This is for V20/40z ONLY and only works on Northbridge errors
Details from CPU0 explained
• Here you see 4 identical DIMMs on CPU0.
• The DIMM manufacturer part # is: 36VDDF25672G-40BD2
Name          Type    OEM               Manufacture Date  Hardware Revision  Part #
CPU 0 DIMM 0  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
CPU 0 DIMM 1  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
CPU 0 DIMM 2  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
CPU 0 DIMM 3  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
DDR 0 VRM     memvrm  S-SCI448          2005-05-27        A01                S01479
CPU 0 VRM     vrm     NA
Determine Type & Rank of DIMM
** DIMMs can be single rank or dual rank. For a description of the differences see:
http://pts-platform.uk/twiki/bin/view/Products/ProdFAQv2040z or
http://en.wikipedia.org/wiki/DIMM#Ranking
Browse to the Qualified Memory page:
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Stinger/StingerQualifiedMemory
Compare your DIMM Manufacture part number to the list: 36VDDF25672G-40BD2
This equates to a 2GB Micron Dual Rank DIMM:
Micron:
512MB: MT18VDDF6472G-40BG3 Die: G Single Rank SPD 1.0
1GB: MT18VDDF12872G-40BD3 Die: D Single Rank SPD 1.0
2GB: MT36VDDF25672G-40BD2 Die: D Dual Rank SPD 1.0
Now we are ready to populate the Memory Decode Tool
Warning! Decode Tool is Sun Internal
https://supportcenter.newisys.com/edbug/edbug_int.pl?auth=dfqsdftqw11p4jdvhasovygm82cbcfrk
This link cannot be shared with customers.
It is internal for Sun use only.
The link has the account and password in it.
Populate the Decode Tool
Information for Decode Tool
Enter the CPU that has the machine check: (From the Error)
0, 1, 2, or 3
Enter the platform type:
2100 = V20z
4300 = V40z
Enter the machine check status: (From the Error)
Enter the machine check address: (From the Error)
Specify which CPUs have DIMMs: (From inventory get all -v)
Specify which DIMMs are populated on each CPU: (From Inventory get all -v)
Specify the DIMM type: (Rank from Qualified Memory Page)
BIOS defaults: Leave this at the default (Place a √ in DIMM interleaving, 128 bit
DIMM interface, and Chipkill ECC enabled. No √ in Node interleaving)
Result Output
Only one error is present
Error details:
K8_CPU-0 is reporting this corrected error:
DRAM chipkill ECC error found by scrubber
The DRAM error was at address '00000000 9C6A0B30' (2 GB range)
This error is related to DIMM 1 on K8_CPU-0
The ECC syndrome ('5E34'x) maps to a correctable error at data bit 66
Within the DIMM, this would be an error at physical bit 2
Processor was responding to another source of the transaction
Transaction was a read
Error classification:
Error type: DRAM ECC
Error severity: Corrected
Error enabled: yes
Error recovered: yes
Possible sympathy: no
Error address: '000000009C6A0B30'x
Address type: Physical
Anything Else........
• Newisys Machine Check (northbridge only)
• V20/40z only
• http://systems-tsc/twiki/pub/Products/ProdTroubleshootingV20z/V20z-V40z-Memory-DIMM
• Windows Debugging
> http://www.microsoft.com/whdc/devtools/debugging/default.mspx
• MCAT
• http://www.amd.com/us-en/assets/content_type/utilities/mcatsetup.exe
Machine Check Analysis Tool (MCAT) is a command line utility
that takes Windows System Event Log (.evt) file as an argument
and decodes the MCA Error logs into human readable format.
MCAT can alternatively take in MCE Error information as raw
register hexadecimal values as command line argument as well.
Diagnostics Complete!
45

Contenu connexe

Tendances

HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightLinaro
 
Kernel Features for Reducing Power Consumption on Embedded Devices
Kernel Features for Reducing Power Consumption on Embedded DevicesKernel Features for Reducing Power Consumption on Embedded Devices
Kernel Features for Reducing Power Consumption on Embedded DevicesRyo Jin
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingAnne Nicolas
 
ARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreGlobalLogic Ukraine
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause AnalysisEric Sloof
 
Linux Kernel Platform Development: Challenges and Insights
 Linux Kernel Platform Development: Challenges and Insights Linux Kernel Platform Development: Challenges and Insights
Linux Kernel Platform Development: Challenges and InsightsGlobalLogic Ukraine
 
Spectre meltdown performance_tests - v0.3
Spectre meltdown performance_tests - v0.3Spectre meltdown performance_tests - v0.3
Spectre meltdown performance_tests - v0.3David Pasek
 
Hpe Proliant DL325 Gen10 Server Datasheet
Hpe Proliant DL325 Gen10 Server DatasheetHpe Proliant DL325 Gen10 Server Datasheet
Hpe Proliant DL325 Gen10 Server Datasheet美兰 曾
 
Global counters (ssh log)
Global counters (ssh log)Global counters (ssh log)
Global counters (ssh log)David Derrej
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLinaro
 
建構嵌入式Linux系統於SD Card
建構嵌入式Linux系統於SD Card建構嵌入式Linux系統於SD Card
建構嵌入式Linux系統於SD Card艾鍗科技
 
Placas base evolucion[1]
Placas base evolucion[1]Placas base evolucion[1]
Placas base evolucion[1]zuzanitah
 
使用XMPP進行遠端設備控制
使用XMPP進行遠端設備控制使用XMPP進行遠端設備控制
使用XMPP進行遠端設備控制艾鍗科技
 

Tendances (20)

Debugging linux
Debugging linuxDebugging linux
Debugging linux
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with Coresight
 
Nvidia smi.1
Nvidia smi.1Nvidia smi.1
Nvidia smi.1
 
Dx diag
Dx diagDx diag
Dx diag
 
Kernel Features for Reducing Power Consumption on Embedded Devices
Kernel Features for Reducing Power Consumption on Embedded DevicesKernel Features for Reducing Power Consumption on Embedded Devices
Kernel Features for Reducing Power Consumption on Embedded Devices
 
Embedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debuggingEmbedded Recipes 2019 - Introduction to JTAG debugging
Embedded Recipes 2019 - Introduction to JTAG debugging
 
ARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/Spectre
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
 
Linux Kernel Platform Development: Challenges and Insights
 Linux Kernel Platform Development: Challenges and Insights Linux Kernel Platform Development: Challenges and Insights
Linux Kernel Platform Development: Challenges and Insights
 
mmmm
mmmmmmmm
mmmm
 
Spectre meltdown performance_tests - v0.3
Spectre meltdown performance_tests - v0.3Spectre meltdown performance_tests - v0.3
Spectre meltdown performance_tests - v0.3
 
Hpe Proliant DL325 Gen10 Server Datasheet
Hpe Proliant DL325 Gen10 Server DatasheetHpe Proliant DL325 Gen10 Server Datasheet
Hpe Proliant DL325 Gen10 Server Datasheet
 
Global counters (ssh log)
Global counters (ssh log)Global counters (ssh log)
Global counters (ssh log)
 
LAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel AwarenessLAS16-403: GDB Linux Kernel Awareness
LAS16-403: GDB Linux Kernel Awareness
 
建構嵌入式Linux系統於SD Card
建構嵌入式Linux系統於SD Card建構嵌入式Linux系統於SD Card
建構嵌入式Linux系統於SD Card
 
Log
LogLog
Log
 
Placas base evolucion[1]
Placas base evolucion[1]Placas base evolucion[1]
Placas base evolucion[1]
 
Tuned
TunedTuned
Tuned
 
使用XMPP進行遠端設備控制
使用XMPP進行遠端設備控制使用XMPP進行遠端設備控制
使用XMPP進行遠端設備控制
 

En vedette

Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory EventsAero Plane
 
One Sample Hypothesis Tips
One Sample Hypothesis   TipsOne Sample Hypothesis   Tips
One Sample Hypothesis Tipsprussin86
 
One Sample Hypothesis Tips
One  Sample  Hypothesis    TipsOne  Sample  Hypothesis    Tips
One Sample Hypothesis Tipsprussin86
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tipsprussin86
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tipsprussin86
 
Driving The Platform 2
Driving The Platform 2Driving The Platform 2
Driving The Platform 2Aero Plane
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tipsprussin86
 
Information Gathering 2
Information Gathering 2Information Gathering 2
Information Gathering 2Aero Plane
 

En vedette (10)

Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory Events
 
One Sample Hypothesis Tips
One Sample Hypothesis   TipsOne Sample Hypothesis   Tips
One Sample Hypothesis Tips
 
Tri county tourism
Tri county tourismTri county tourism
Tri county tourism
 
One Sample Hypothesis Tips
One  Sample  Hypothesis    TipsOne  Sample  Hypothesis    Tips
One Sample Hypothesis Tips
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tips
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tips
 
Onewordinspiration
OnewordinspirationOnewordinspiration
Onewordinspiration
 
Driving The Platform 2
Driving The Platform 2Driving The Platform 2
Driving The Platform 2
 
One Sample Hypothesis - Tips
One Sample Hypothesis - TipsOne Sample Hypothesis - Tips
One Sample Hypothesis - Tips
 
Information Gathering 2
Information Gathering 2Information Gathering 2
Information Gathering 2
 



Advanced Diagnostics 2

  • 2. Diagnostics
      ● SP Diags (Stinger)
      ● spdiag Tool (Galaxy)
      ● SunVTS
      ● PC Check
      ● HDT (sundiag replacement)
      ● CSTH
      ● HERD & EDAC
      ● MCAT
      ● Bonnie
      ● Memtest
      ● Web Pages
      ● Decoder Tools
  • 3. Day 2 PM and Acknowledgements
      Sun Confidential: Internal Only
      • I have borrowed/stolen/copied* the following in this presentation:
      • Newisys decoder from Barry Wright
      • HDT from Bernward Schwartz
      • SGR from http://panacea/twiki/bin/view/SGR/WebHome
  • 4. SP Diags for V20/40z
      • Not to be confused with the “spdiag” tool
      • Bootable CD (NSV 2.2.0.6 or above required) or SP based
      • Enable diagnostic boot in the BIOS for the bootable CD
      • NSV installed on a remote system and mounted locally by NFS
  • 5. SP Diags for V20/40z
      ● Install diags:
          cp -r /mnt/cdrom/nsv_file /mnt/nsv/
          cd /mnt/nsv/
          unzip -a *.zip
          chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
          chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
      Note: Now ensure NFS is enabled on the server and that it can export the file system.
          sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
          sp update diags -p /mnt/diags/DIAGS_version#
  • 6. SP Diags for V20/40z
      ● diags start (for standalone)
      ● diags start -n (on-line: NIC, disk, mem)
      ● diags get state (confirm diags are loaded)
      ● diags get tests (list diagnostic tests)
      ● diags run tests -av
      ● diags run tests -av >/mnt/log/diags.log
      ● diags terminate
      Ensure that the diags, BIOS, and drivers are compatible; diags will fail to run otherwise.
  • 7. SP Diags for V20/40z
      • diags -h: shows all syntax
      • diags -a -v: full test
      • Bootable CD:
          > diags terminate -n
          > diags start -n
          > diags run tests -a -v >diags.out &
          > tail -f diags.out
  • 8. The “spdiag” Tool (Galaxy)
      • SP based diagnostic
      • Tests i2c, voltage, fans, temp
      • Stop IPMI: /etc/init.d/ipmistack stop
      • /usr/local/bin/spdiag 1 g4 i2ctst
      • Reboot SP
  • 9. PC Check
      • Supplemental/Tools CD and now boot menu
      • AMD based X2100, X2100M2, X2200M2 and all new X4x40 platforms
      • All Intel based platforms
      • Monitor and keyboard
      • Serial port
      • Scripts, burn-in tests, loopback
  • 10. PC Check
      • Front Menu:
          System Information menu
          Advanced Diagnostics Tests
          Immediate Burn-in Testing
          Deferred Burn-in Testing
          Create Diagnostic Partition
          Show Results Summary
          Print Results Report
  • 11. PC Check
      • Burn-in Testing:
          > quick.tst - requires user input, no time-out
          > noinput.tst - no user input, good first test
          > full.tst - requires loopback & user input
      • Command Line:
          > Example: pccheck cpu.tst /BD
          > pccheck /? - shows all flags
          > pccheck suncsi.tst /IS /BD /KS /MH 30 /HMD 1m /HDD 1m /SD 5m
  • 12. SunVTS
      • What are you trying to test/replicate?
      • Local or bootable CD-ROM
      • Galaxy 2.2 CD contains VTS 6.3
      • GUI or command line
      • Unsupported platforms:
          > /opt/SUNWvts/lib/conf/platform.conf
          > smbios | grep Product
          > Boot with graphics head
          > Edit tty boot console=ttya,ttya-mode="9600,8,n,1,-"
  • 13. SunVTS
  • 14. HDT (Hardware Debug Tool)
      • PLEASE USE WITH CAUTION !!!
      • Will hang the host if the OS is running
      • Reboot SP after use
        http://panacea/twiki/bin/view/Products/Galaxydiag
      • Additional tools:
      • /usr/local/bin/collectHostStatus.sh (-nohdtl disables the hdt test)
      • /usr/local/bin/collectDebugInfo.sh
  • 15. Platform Specifics
      • On G4:
          > HDT uses some signals over the i2c bus => IPMI on the SP has to be shut down. The SP should be rebooted when done with hdt diags.
          > JTAG chain goes through all CPU modules => all slots must have a CPU or filler module inserted for HDT to work on G4
          > Direct access to all CPUs; default is CPU 0
      • Other Platforms:
          > Only CPU0 in JTAG chain
          > No i2c involved, only used for platform identification
      From Bernward Schwarte presentation.
  • 16. Getting Started & Cautions
      • hdt or hdtl?
          > Current hdt binary and some documentation at: http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/SpBasedHdtDiag under Galaxy->Pre-OS-Diagnostics
          > Copy to SP: scp hdt sunservice@<SPIP>:/coredump
          > ssh sunservice@<SPIP> password: changeme
          > cd /coredump (or check /usr/local/bin for the built-in copy)
      • Caution: All hdt commands stop the CPUs. Some hdt commands will reset/power-cycle the system.
      • All command line parameters are interpreted as hex values!
      • ./hdt prints the syntax of all available commands
      • ./hdt -pd 0 18 0
      • hdt leaves the CPU in HDT mode when exiting; use the “-e” option to exit HDT mode
  • 17. Available Commands/Diagnostics
      • Basics:
          > Single HDT command: -h (Note: this is not -help)
          > Access I/O and memory space: -mr, -mw, -ir, -iw
          > Access CPU registers: -rd, -rr, -rw
          > Single step: -hs
      • Control:
          > Reset: -xr [b c]
          > Stop at reset: -xs [b c p]
          > Resource init: “-hi” sets up HT routing and resources
          > Power On/Off: -o [0 off, 1 cycle, 2 on]
          > Set CPU: -c (G4 only)
          > Breakpoints: -bps -bpm -bpc
          > Exit: -e
  • 18. Diagnostics
      • Extended:
          > PCI configuration space access: -pr, -pw, -pd, -ps
          > “Dump” commands
              – Machine check: -dm
              – DIMM SPD: -dd
              – CMOS: -dc
              – SIO: -ds
              – Flash: -df
          > HT link testing: -a
              – Power-cycles, stops at reset vector, sets all HT links, warm resets
  • 19. HDT Not Working
          > Depending on system state, HDT can be non-functional
          > To capture some system/error state:
              – Reset system and stop at reset vector: hdt -xs b
              – Init HT routing and PCI bridge enumeration: hdt -hi
              – Dump machine check and HT link status: hdt -dm -dl
      hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.7.0
      -------------------------------------------------------
      hdtDiag: Error, HDT command failed, no CFF cpu 0
      hdtDiag: SysIdent: HDT access failing
      hdtDiag: defaulting to G12X
  • 20. HDT
      • Check versions: 0.8.0, 0.8.3, 0.9.9, 1.3, 1.4.1 etc.
      • ./hdt -xs
      • ./hdt -hi
      • ./hdt -l -q or try ./hdt -l -a
      • ./hdt -e
      • Reboot SP
  • 21. CSTH (Continuous System Telemetry Harness)
      • Calls ipmitool to create a telemetry stream of:
          > volt, temp, current, fans and PSU variables
      ● Collect data and submit it to engineering for analysis:
      ● ./start-csth-ipmi <spname> <splogin> <sppasswd> [--interval <numsecs>]
      ● Example:
          ➢ ./start-csth-ipmi test-sp admin test.pass 60 &
          ➢ ./stop-csth-ipmi test-sp
  • 22. CSTH (Example From an x4200)
  • 23. HERD (Hardware Error Report Decode)
      http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
      • Hardware error report and decoding from mcelog or via the command line with kernel 2.6.4 or above
      • Installed as an RPM on top of SLES and Red Hat
      • Provided by Sun Microsystems
      • Will report errors to the messages file and the service processor (if applicable)
      • Same command line options as mcelog
      • When using the herd -e function, it must be run on the same host that reported the errors
  • 24. HERD (Hardware Error Report Decode)
      • Example from console / logs:
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: HARDWARE ERROR. This is *NOT* a software problem!
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: Please contact your hardware vendor
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: CPU 0 4 northbridge
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: TSC fcc73b11cf
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: ADDR 142110
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: Northbridge Chipkill ECC error
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: Chipkill ECC syndrome = 11ea
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit46 = corrected ecc error
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit57 = processor context corrupt
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit61 = error uncorrected
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
          Mar 5 18:03:01 va64-x2200c-gmp03 herd: STATUS b675410011080a13 MCGSTATUS 0
      • Example of running herd manually (pre herd install):
          # herd -e 142110
          000000142110: Cpu Node 0, DIMM 2
  • 25. EDAC (Kernel 2.6.20.xx and above)
      Two examples of EDAC output, not working and working (x2200).
      EDAC not working:
          Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
              << Multiple CE in quick succession or DIMM layout
          Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: Failed to translate InputAddr to csrow for address 0xbb2c2fc0
          Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac
          Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac Error Overflow set
          Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
              ^^ Failed to translate due to overflow bit set
      This happens if more than one error has occurred before EDAC gets to it, or if EDAC does not understand the DIMM layout.
      Here is the correct format of EDAC's output:
          Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
              << Always CPU 0 (reporting error)
          Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac
              ^^ This event tells you the actual offending CPU, which in this instance is CPU 0. (The label is not used by default, but Sun or the customer may populate it.)
          Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC error
              << Decode below:
      MC0: CE error page 0x100010 with offset 0x10 gives address 0x100010010 (the page number shifted left by 12 bits for 4 KB pages, plus the offset)
      Grain = 8, which is Chipkill
      Row 0, Channel 0 = CPU0, DIMM0
                   Channel 0   Channel 1
          csrow0 | DIMM_A0   | DIMM_B0   |
          csrow1 | DIMM_A0   | DIMM_B0   |
          csrow2 | DIMM_A1   | DIMM_B1   |
          csrow3 | DIMM_A1   | DIMM_B1   |
      With single rank DIMMs (1GB or less), csrow1 and csrow3 are not used/available.
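The decode on this slide can be sketched in Python. The block below pulls the fields out of a k8_edac CE line and looks up the slot in the csrow/channel table. The regular expression, the function name and the `CSROW_TABLE` dictionary are illustrative assumptions written against the sample line above, not part of EDAC itself; only the table contents come from the slide.

```python
import re

# (csrow, channel) -> DIMM slot, copied from the table on the slide.
CSROW_TABLE = {
    (0, 0): "DIMM_A0", (0, 1): "DIMM_B0",
    (1, 0): "DIMM_A0", (1, 1): "DIMM_B0",
    (2, 0): "DIMM_A1", (2, 1): "DIMM_B1",
    (3, 0): "DIMM_A1", (3, 1): "DIMM_B1",
}

# Hypothetical pattern matched against the sample CE line; other kernel
# versions may word the message differently.
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce_line(line):
    """Return the decoded fields of a CE line, or None if it has no location."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    row = int(match.group("row"))
    channel = int(match.group("channel"))
    return {
        "mc": int(match.group("mc")),
        "page": int(match.group("page"), 16),
        "offset": int(match.group("offset"), 16),
        "grain": int(match.group("grain")),
        "syndrome": int(match.group("syndrome"), 16),
        "row": row,
        "channel": channel,
        "slot": CSROW_TABLE.get((row, channel)),
    }
```

Lines of the "CE - no information available" kind carry no row or channel, so the function returns None for them, which mirrors the "not working" case above.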
  • 26. EDAC Continued:
      • Example output from the SP (not created by EDAC):
          1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
              >>> The EDAC log should be here but does not show. Instead, you just see the BIOS scrubber results <<<
          2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          4 | 02/25/2008 | 18:28:00 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          5 | 02/25/2008 | 19:28:02 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          6 | 02/25/2008 | 20:28:04 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          7 | 02/25/2008 | 21:28:06 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          8 | 02/25/2008 | 22:28:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          9 | 02/25/2008 | 23:28:10 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
          ... and so on ...
      Do a cat of /proc/mc/0 to get a row/column summary of the events that occurred.
      It's EDAC or HERD, not both! They both try to grab /dev/mce events and report. (rmmod k8_edac to remove.)
      And remember, the SEL log is your friend, so always get an IPMI dump first before escalating or decoding.
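When the SEL holds many entries like the ones above, a quick tally per sensor shows which DIMM keeps reporting. The following is a minimal Python sketch that assumes the six pipe-separated columns shown in the example (record number, date, time, sensor, event, state); the function name is hypothetical and real SEL output may contain other record shapes, which are simply skipped here.

```python
from collections import Counter


def count_correctable_ecc(sel_lines):
    """Count asserted 'Correctable ECC' SEL events per sensor name."""
    counts = Counter()
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 6:
            continue  # skip comments, blank lines and other record layouts
        _record, _date, _time, sensor, event, state = fields
        if event == "Correctable ECC" and state == "Asserted":
            counts[sensor] += 1
    return counts
```

Feeding it the nine records from the slide would report all nine against "Memory CPU0 DIMM0".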
  • 27. The “mcelog”
      • Linux kernels after 2.6.4 do not print recoverable machine check errors
      • Messages are saved in /var/log/mcelog
      • mcelog reads errors from /dev/mcelog and then deletes the entries
      • Typically run as a cron job:
          > /usr/sbin/mcelog >> /var/log/mce
          > *Note: this is not collected by sysreport
      • Red Hat implemented it as a daemon
      • See Red Hat advisory RHEA-2006-0134-7
  • 28. MCAT (Machine Check Analysis Tool)
      Takes input from a Windows Event Log entry and decodes the output:
          Event Source 62 - WMIxWDM
          Processor Number : 0
          Bank Number : 4
          Time Stamp (0x): 01C856C4 58A8C10D
          Error Status (0x): D4714000 E1080A13
          Error Address (0x): 00000000 A047BF50
          Error Misc. (0x): 00000000 00000000
          Single bit errors: Correctable ECC error
          Error address valid in MCi_ADDR
          Error reporting enabled
          Second error
          Error valid
          Bus Error Code:
          Participation processor: Local node responded to the request (RES)
          Time-out: Request did not time out
          Memory transaction type: Generic read (RD)
          I/O: DRAM memory access (MEM)
          Cache level: Generic (LG)
          North Bridge Error MC4:
          Extended Error Code: 0x8 - ChipKill ECC Error
          Error Code: 0x0A13 DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
          ChipKill Syndrome: 0xE1E2
          Error address at 2564 MB
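To see where MCAT gets these fields from, the 64-bit status value can be split by hand. The Python sketch below is not MCAT's code: the function name is hypothetical, and the bit positions are an assumption based on the family 0Fh northbridge (MC4) status layout. They do reproduce the syndrome, extended error code and error code printed above for status D4714000 E1080A13.

```python
def decode_nb_status(status):
    """Split a 64-bit northbridge (MC4) status value into named fields.

    Bit positions are assumed from the family 0Fh MC4 status layout.
    """
    def bit(position):
        return bool((status >> position) & 1)

    syndrome_low = (status >> 47) & 0xFF   # bits 54:47
    syndrome_high = (status >> 24) & 0xFF  # bits 31:24
    return {
        "valid": bit(63),             # "Error valid"
        "overflow": bit(62),          # "Second error"
        "uncorrected": bit(61),
        "enabled": bit(60),           # "Error reporting enabled"
        "misc_valid": bit(59),
        "addr_valid": bit(58),        # "Error address valid in MCi_ADDR"
        "context_corrupt": bit(57),
        "corrected_ecc": bit(46),     # "Correctable ECC error"
        "uncorrected_ecc": bit(45),
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "ext_error_code": (status >> 16) & 0xF,
        "error_code": status & 0xFFFF,
    }


# Status from the MCAT example: high word D4714000, low word E1080A13.
fields = decode_nb_status(0xD4714000E1080A13)
print(hex(fields["syndrome"]), hex(fields["ext_error_code"]), hex(fields["error_code"]))
```

The same function applied to the STATUS value in the HERD log (b675410011080a13) yields syndrome 0x11ea with bits 46, 57 and 61 set, matching what herd printed.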
  • 29. MCAT Continued
      • This can be gathered by running ipmitool fru. FRU output for this failing platform:
          FRU Device Description : p0.fru (ID 6)
          Product Manufacturer : ADVANCED MICRO DEVICES
          Product Name : DUAL CORE AMD OPTERON(TM) 275
          Product Part Number : 0F21
          Product Version : 02

          FRU Device Description : p0.d0.fru (ID 8)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D7010058

          FRU Device Description : p0.d1.fru (ID 9)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D7010056

          FRU Device Description : p0.d2.fru (ID 10)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D701A6F4

          FRU Device Description : p0.d3.fru (ID 11)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D701A6EE
  • 30. Manual Diagnosis
      Windows reporting decode is performed as follows:
      Processor Number: 0 = CPU 0 (if it said 4, it would be socket CPU4, not core 4).
      Error address at 2564 MB (i.e. between 2 and 3 GBytes).
      From the FRU information, each DIMM is 1 GByte.
      The DIMMs are numbered from closest to the CPU outwards, based on mapping. (DIMMs should be populated from the outside inward but are mapped closest to the CPU outwards.)
      The BIOS sets up memory from DIMM0/1 outwards.
      Assuming "optimal defaults":
      Our Opterons use a 128-bit wide data path. DIMM0 and DIMM1 are used as a pair.
      These are single-rank DIMMs, but they are all the same, so "chipselect interleaving" is in use.
      The first 128KB are on DIMM0 and 1. The second 128KB are on DIMM2 and 3.
      2564/128 = 20.03 ----> which is in the DIMM0 and DIMM1 pair.
      (Always replace Opteron platform DIMMs in pairs.)
  • 31. Manual Diagnosis
      Manual diagnosis continued:
      ChipKill Syndrome: 0xE1E2
      Looking this up in table 26 of the AMD BIOS and Kernel Writer's Guide shows this is symbol 0x1a which, according to the text above table 26, maps to the upper 64 bits of the 128-bit data path.
      DIMM0, symbols 00h-0fh, provides the low 64 bits; DIMM1, symbols 10h-1fh, provides the high 64 bits.
      The check bits for the lower 64 bits are symbols 20h-21h and the check bits for the upper 64 bits are symbols 22h-23h.
      Technical documentation, including the AMD BIOS and Kernel Writer's Guide, is available from AMD via:
        http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html
      Remember though to download the correct document for your processor revision:
      Single/dual core Opteron for x2100, x2200, x4100, x4200, x4500, x4600 etc. is document family 0fh.
      Quad core Opteron for supported platforms is document family 10h.
  • 32. Manual Diagnosis
      Chipkill Syndrome Table for 0Fh CPUs, data bits 0-63
          Symbol 1h   2h   3h   4h   5h   6h   7h   8h   9h   ah   bh   ch   dh   eh   fh
          00h    e821 7c32 9413 bb44 5365 c776 2f57 dd88 35a9 a1ba 499b 66cc 8eed 1afe f2df
          01h    5d31 a612 fb23 9584 c8b5 3396 6ea7 eac8 b7f9 4cda 11eb 7f4c 227d d95e 846f
          02h    0001 0002 0003 0004 0005 0006 0007 0008 0009 000a 000b 000c 000d 000e 000f
          03h    2021 3032 1013 4044 6065 7076 5057 8088 a0a9 b0ba 909b c0cc e0ed f0fe d0df
          04h    5041 a082 f0c3 9054 c015 30d6 6097 e0a8 b0e9 402a 106b 70fc 20bd d07e 803f
          05h    be21 d732 6913 2144 9f65 f676 4857 3288 8ca9 e5ba 5b9b 13cc aded c4fe 7adf
          06h    4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f
          07h    74e1 9872 ec93 d6b4 a255 4ec6 3a27 6bd8 1f39 f3aa 874b bd6c c98d 251e 51ff
          08h    15c1 2a42 3f83 cef4 db35 e4b6 f177 4758 5299 6d1a 78db 89ac 9c6d a3ee b62f
          09h    3d01 1602 2b03 8504 b805 9306 ae07 ca08 f709 dc0a e10b 4f0c 720d 590e 640f
          0ah    9801 ec02 7403 6b04 f305 8706 1f07 bd08 2509 510a c90b d60c 4e0d 3a0e a20f
          0bh    d131 6212 b323 3884 e9b5 5a96 8ba7 1cc8 cdf9 7eda afeb 244c f57d 465e 976f
          0ch    e1d1 7262 93b3 b834 59e5 ca56 2b87 dc18 3dc9 ae7a 4fab 642c 85fd 164e f79f
          0dh    6051 b0a2 d0f3 1094 70c5 a036 c067 20e8 40b9 904a f01b 307c 502d 80de e08f
          0eh    a4c1 f842 5c83 e6f4 4235 1eb6 ba77 7b58 df99 831a 27db 9dac 396d 65ee c12f
          0fh    11c1 2242 3383 c8f4 d935 eab6 fb77 4c58 5d99 6e1a 7fdb 84ac 956d a6ee b72f
  • 33. Manual Diagnosis
      Chipkill Syndrome Table for 0Fh CPUs, data bits 64-127
          Symbol 1h   2h   3h   4h   5h   6h   7h   8h   9h   ah   bh   ch   dh   eh   fh
          10h    45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f
          11h    63e1 b172 d293 14b4 7755 a5c6 c627 28d8 4b39 99aa fa4b 3c6c 5f8d 8d1e eeff
          12h    b741 d982 6ec3 2254 9515 fbd6 4c97 33a8 84e9 ea2a 5d6b 11fc a6bd c87e 7f3f
          13h    dd41 6682 bbc3 3554 e815 53d6 8e97 1aa8 c7e9 7c2a a16b 2ffc f2bd 497e 943f
          14h    2bd1 3d62 16b3 4f34 64e5 7256 5987 8518 aec9 b87a 93ab ca2c e1fd f74e dc9f
          15h    83c1 c142 4283 a4f4 2735 65b6 e677 f858 7b99 391a badb 5cac df6d 9dee 1e2f
          16h    8fd1 c562 4ab3 a934 26e5 6c56 e387 fe18 71c9 3b7a b4ab 572c d8fd 924e 1d9f
          17h    4791 89e2 ce73 5264 15f5 db86 9c17 a3b8 e429 2a5a 6dcb f1dc b64d 783e 3faf
          18h    5781 a9c2 fe43 92a4 c525 3b66 6ce7 e3f8 b479 4a3a 1dbb 715c 26dd d89e 8f1f
          19h    bf41 d582 6ac3 2954 9615 fcd6 4397 3ea8 81e9 eb2a 546b 17fc a8bd c27e 7d3f
          1ah    9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf
          1bh    cce1 4472 8893 fdb4 3155 b9c6 7527 56d8 9a39 12aa de4b ab6c 678d ef1e 23ff
          1ch    a761 f9b2 5ed3 e214 4575 1ba6 bcc7 7328 d449 8a9a 2dfb 913c 365d 688e cfef
          1dh    ff61 55b2 aad3 7914 8675 2ca6 d3c7 9e28 6149 cb9a 34fb e73c 185d b28e 4def
          1eh    5451 a8a2 fcf3 9694 c2c5 3e36 6a67 ebe8 bfb9 434a 171b 7d7c 292d d5de 818f
          1fh    6fc1 b542 da83 19f4 7635 acb6 c377 2e58 4199 9b1a f4db 37ac 586d 82ee ed2f
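The table lookup can be mechanised. The Python sketch below copies only three rows from the syndrome tables (symbols 06h, 10h and 1ah, enough for the syndromes that appear in this deck); a real decoder would need all 32 rows. The names are hypothetical. The data-bit calculation is an assumption, namely that symbol n covers data bits 4n to 4n+3 and that the column value is a bitmask of the failing bits within that symbol. It agrees with the Newisys decoder output later in the deck (syndrome 5E34 maps to data bit 66) but has not been checked against AMD's documentation.

```python
# Rows copied from the chipkill syndrome tables; columns run 1h..fh.
SYNDROME_ROWS = {
    0x06: "4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f",
    0x10: "45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f",
    0x1A: "9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf",
}


def build_lookup(rows):
    """Map each syndrome value to its (symbol, column) position in the table."""
    lookup = {}
    for symbol, text in rows.items():
        for column, word in enumerate(text.split(), start=1):
            lookup[int(word, 16)] = (symbol, column)
    return lookup


LOOKUP = build_lookup(SYNDROME_ROWS)


def locate(syndrome):
    """Return (symbol, column, half, data_bits), or None if not in the rows."""
    hit = LOOKUP.get(syndrome)
    if hit is None:
        return None
    symbol, column = hit
    # Per the slide: symbols 00h-0fh are the low 64 bits (DIMM0),
    # symbols 10h-1fh the high 64 bits (DIMM1).
    half = "upper 64 bits (DIMM1)" if symbol >= 0x10 else "lower 64 bits (DIMM0)"
    # Assumption: the column is a bitmask within the 4-bit symbol.
    data_bits = [4 * symbol + i for i in range(4) if column & (1 << i)]
    return symbol, column, half, data_bits


print(locate(0xE1E2))
```

For 0xE1E2 this lands on symbol 1ah, column 2h, in the upper half, which is the result the manual diagnosis reaches.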
  • 34. Manual Diagnosis
      ECC Syndrome Table (for completeness) for 0Fh CPUs (Single Error Correction, Double Error Detection):
                       n=0 n=1 n=2 n=3 n=4 n=5 n=6 n=7
          Bit (0+n)    ce  cb  d3  d5  d6  d9  da  dc
          Bit (8+n)    23  25  26  29  2a  2c  31  34
          Bit (16+n)   0e  0b  13  15  16  19  1a  1c
          Bit (24+n)   e3  e5  e6  e9  ea  ec  f1  f4
          Bit (32+n)   4f  4a  52  54  57  58  5b  5d
          Bit (40+n)   a2  a4  a7  a8  ab  ad  b0  b5
          Bit (48+n)   8f  8a  92  94  97  98  9b  9d
          Bit (56+n)   62  64  67  68  6b  6d  70  75
          Bit (64+n)   01  02  04  08  10  20  40  80
      *Typically used for single DIMM configurations
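Reading this table backwards (from syndrome to bit) is the usual direction during diagnosis. A small Python sketch of that reverse lookup follows; the rows are copied from the table above, and the names are hypothetical.

```python
# Base bit number -> the eight syndromes for bits base+0 .. base+7.
ECC_ROWS = {
    0: "ce cb d3 d5 d6 d9 da dc",
    8: "23 25 26 29 2a 2c 31 34",
    16: "0e 0b 13 15 16 19 1a 1c",
    24: "e3 e5 e6 e9 ea ec f1 f4",
    32: "4f 4a 52 54 57 58 5b 5d",
    40: "a2 a4 a7 a8 ab ad b0 b5",
    48: "8f 8a 92 94 97 98 9b 9d",
    56: "62 64 67 68 6b 6d 70 75",
    64: "01 02 04 08 10 20 40 80",
}

SYNDROME_TO_BIT = {
    int(word, 16): base + n
    for base, text in ECC_ROWS.items()
    for n, word in enumerate(text.split())
}


def failing_bit(syndrome):
    """Return the bit number for a single-bit syndrome, or None if unlisted."""
    return SYNDROME_TO_BIT.get(syndrome)
```

A syndrome that is not in the table returns None, which in a SEC-DED scheme would point to something other than a single-bit error.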
  • 35. Other Tools/Diags (Unsupported)
      • Bonnie
          > Benchmark to measure filesystem performance
            http://www.textuality.com/bonnie/
      • Memtest86+
          > Standalone bootable diagnostic
            http://www.memtest.org/
            http://www.memtest86.com/ (original version)
      • Other memory tool
            http://people.redhat.com/dledford/memtest.html
            http://sourceforge.net/
      • Netperf or ttcp - google for them - network tools
  • 36. SGR
      • Situation Appraisal - Recognise a problem
      • Problem Analysis - Find True Cause
        http://systems-tsc/twiki/pub/SGR/SgrtOnlineHelp/PA-guide.pdf
      The steps in FTC are:
          * Define a Problem Statement
          * Describe the problem with a Problem Specification
          * Develop Possible Causes from either Experience or Differences and Changes
          * Identify the Most Probable Cause
          * Test the Most Probable Cause against the Problem Specification
          * Verify the Most Probable Cause
  • 37. Newisys MCE Decoder V20/40z
      What to gather from inventory get all -v:
          1. How many CPUs?
          2. How many DIMMs per CPU?
          3. What is the part number of the DIMM?
      NOTE: This is for V20/40z ONLY and only works on Northbridge errors.
  • 38. Details from CPU0 explained
      • Here you see 4 identical DIMMs on CPU0.
      • The DIMM manufacturer part # is: 36VDDF25672G-40BD2
          Name          Type    OEM               Manufacture Date  Hardware Revision  Part #
          CPU 0 DIMM 0  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
          CPU 0 DIMM 1  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
          CPU 0 DIMM 2  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
          CPU 0 DIMM 3  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
          DDR 0 VRM     memvrm  S-SCI448          2005-05-27        A01                S01479
          CPU 0 VRM     vrm     NA
  • 39. Determine Type & Rank of DIMM
      ** DIMMs can be single rank or dual rank. For a description of the differences see:
        http://pts-platform.uk/twiki/bin/view/Products/ProdFAQv2040z
        or http://en.wikipedia.org/wiki/DIMM#Ranking
      Browse to the Qualified Memory page:
        http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Stinger/StingerQualifiedMemory
      Compare your DIMM manufacturer part number to the list: 36VDDF25672G-40BD2
      This equates to a 2GB Micron Dual Rank DIMM:
          Micron:
          512MB: MT18VDDF6472G-40BG3   Die: G  Single Rank  SPD 1.0
          1GB:   MT18VDDF12872G-40BD3  Die: D  Single Rank  SPD 1.0
          2GB:   MT36VDDF25672G-40BD2  Die: D  Dual Rank    SPD 1.0
      Now we are ready to populate the Memory Decode Tool.
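The part-number comparison is a simple table lookup, sketched here in Python. The dictionary holds only the three Micron entries quoted on this slide, not the full Qualified Memory page. The function name is hypothetical, and the handling of the missing "MT" prefix is an assumption drawn from the example, where the inventory reports 36VDDF25672G-40BD2 while the list shows MT36VDDF25672G-40BD2.

```python
# Micron entries quoted on the slide: part number -> (size, rank).
QUALIFIED_MICRON = {
    "MT18VDDF6472G-40BG3": ("512MB", "Single Rank"),
    "MT18VDDF12872G-40BD3": ("1GB", "Single Rank"),
    "MT36VDDF25672G-40BD2": ("2GB", "Dual Rank"),
}


def lookup_dimm(part_number):
    """Return (size, rank) for an inventory part number, or None if unknown."""
    part = part_number.strip().upper()
    if part in QUALIFIED_MICRON:
        return QUALIFIED_MICRON[part]
    # Assumption: the inventory output drops the leading "MT".
    return QUALIFIED_MICRON.get("MT" + part)


print(lookup_dimm("36VDDF25672G-40BD2"))
```

The rank returned here is what goes into the "DIMM type" field of the decode tool.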
  • 40. Warning! Decode Tool is Sun Internal
      https://supportcenter.newisys.com/edbug/edbug_int.pl?auth=dfqsdftqw11p4jdvhasovygm82cbcfrk
      This link cannot be shared with customers. It is for internal Sun use only. The link has the account and password in it.
  • 41. Populate the Decode Tool
  • 42. Information for Decode Tool
      Enter the CPU that has the machine check (from the error): 0, 1, 2, or 3
      Enter the platform type: 2100 = V20z, 4300 = V40z
      Enter the machine check status (from the error)
      Enter the machine check address (from the error)
      Specify which CPUs have DIMMs (from inventory get all -v)
      Specify which DIMMs are populated on each CPU (from inventory get all -v)
      Specify the DIMM type (rank from the Qualified Memory page)
      BIOS defaults: leave these at the default (place a √ in DIMM interleaving, 128 bit DIMM interface, and Chipkill ECC enabled; no √ in Node interleaving)
  • 43. Result Output
      Only one error is present
      Error details:
          K8_CPU-0 is reporting this corrected error:
          DRAM chipkill ECC error found by scrubber
          The DRAM error was at address '00000000 9C6A0B30' (2 GB range)
          This error is related to DIMM 1 on K8_CPU-0
          The ECC syndrome ('5E34'x) maps to a correctable error at data bit 66
          Within the DIMM, this would be an error at physical bit 2
          Processor was responding to another source of the transaction
          Transaction was a read
      Error classification:
          Error type: DRAM ECC
          Error severity: Corrected
          Error enabled: yes
          Error recovered: yes
          Possible sympathy: no
          Error address: '000000009C6A0B30'x
          Address type: Physical
  • 44. Anything Else........
      • Newisys Machine Check (northbridge only)
      • V20/40z only
      • http://systems-tsc/twiki/pub/Products/ProdTroubleshootingV20z/V20z-V40z-Memory-DIMM
      • Windows Debugging
          > http://www.microsoft.com/whdc/devtools/debugging/default.mspx
      • MCAT
      • http://www.amd.com/us-en/assets/content_type/utilities/mcatsetup.exe
      Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA error logs into human readable format. MCAT can alternatively take MCE error information as raw register hexadecimal values on the command line.