HKG18-116 - RAS Solutions for Arm64 Servers

Linaro
ARM64 Server RAS Solutions
Jonathan (Zhixiong) Zhang
Cavium Inc.
Agenda
● Overview
● Solutions
● Building blocks
● Reflections
Overview
Reliability, Availability, Serviceability
RAS is one of the most important aspects of ARM64 servers’ success
Progress
● RAS on ARM64
● SFO17
● Fu Wei
● Focus on OS and ACPI
Progress
● RAS on AARCH64
● SFO17
● Fu Wei, Supreeth Venkatesh
● Focus on prototype solution
Discussion Focus
● RAS is a 3-dimensional problem!
● 2 dimensions: from CPU, to board, to rack
● 1 dimension: from hardware, to firmware, to software
● Product-ready solutions
Who needs what – end user, apps developer
● Accurate error reporting
● Error reporting at the right level
● Low cycle stealing
● Uncorrectable errors quarantined
Who needs what – data centre admin
● Accurate error reporting
● Reliable error reporting
● Uniform error handling policy
● Actionable error reporting
Who needs what – Platform Vendor
Validation through production software stack
● Example: Memory stability has to be validated by running all hardware threads and exercising all DIMMs with an OS memory test tool.
● Example: PCIe tuning issues can be spotted through thousands of reboots.
Who needs what – SoC Vendor
Validation through production software stack
● Validation/diag tools do not exercise the SoC enough.
● Subsystem design/programming issues need to be unmasked earlier in the engineering cycle.
Cost-effective support process
● Expedited root-cause analysis.
● Reduced reliance on hardware debuggers.
Solutions
Early boot time error notification
● POST code, to diagnose a halted boot.
● SEL record, to diagnose a successful boot with a component failure.
Run time error notification
Catastrophic Error
1. GPIO interrupt. Out Of Band.
2. SEL record. Out Of Band.
3. BERT record. In Band.
4. Crash dump. Out Of Band and In Band; catastrophic error only.
Non-catastrophic Error
1. GPIO interrupt. Out Of Band.
2. SEL record. Out Of Band.
3. HEST record. In Band
4. GHESv2 notification. In Band
a. Polled
b. Interrupt
c. SDEI
d. SEA
e. SEI
DRAM error handling
Boot time issue
● Reasons:
○ DIMM training failure.
○ SPD access failure.
● Responses:
○ Halt the boot.
○ Continue booting with SEL record sent.
Run time issue
● Correctable error
● Uncorrectable error in non-secure memory
○ Recoverable.
○ Unrecoverable, in non-OS-critical region.
○ Unrecoverable, in OS-critical region.
● Uncorrectable error in secure memory
DRAM error reporting
Error location types of concern:
a) Apps developer: Physical address
b) Data center admin: FRU (Field Replaceable Unit)
c) Platform vendor: DRAM address
Channel interleaving complicates address translation:
[Diagram: DRAM address, transaction address, physical address]
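To illustrate why interleaving complicates the translation, here is a minimal sketch. It assumes a hypothetical two-channel system interleaved at a 256-byte granule; `NUM_CHANNELS`, `GRANULE`, and both function names are illustrative, and real memory controllers typically use vendor-specific (often hashed) mappings.

```python
# Hypothetical interleave parameters (assumptions, not from the talk).
NUM_CHANNELS = 2
GRANULE = 256  # bytes per interleave stripe


def physical_to_channel(phys_addr):
    """Split a physical address into (channel, channel-local address)."""
    stripe = phys_addr // GRANULE
    channel = stripe % NUM_CHANNELS
    local = (stripe // NUM_CHANNELS) * GRANULE + phys_addr % GRANULE
    return channel, local


def channel_to_physical(channel, local):
    """Inverse mapping: rebuild the physical address for error reporting."""
    stripe = (local // GRANULE) * NUM_CHANNELS + channel
    return stripe * GRANULE + local % GRANULE
```

Even in this toy mapping, consecutive 256-byte blocks land on different channels, so a DRAM-side address cannot be reported to the OS without applying the inverse function.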
Memory scrubbing
DRAM scrubbing helps prevent future errors.
a. Firmware-initiated on-demand scrubbing, upon memory error.
b. Firmware-initiated interval-based scrubbing, as defined by the user.
c. OS-initiated on-demand scrubbing.
1. Platform exposes scrubbing support capability to the OS through the ACPI RASF table.
2. OS sets up, starts, and stops scrubbing through the RASF PCC channel.
PCIe error handling -- workflow
Firmware-first handling is requested.
1. A PCIe device (such as an endpoint device) detects an error.
2. The error is reported up through the AER (Advanced Error Reporting) mechanism.
3. The PCIe Root Complex on the SoC issues an error interrupt.
4. The firmware secure interrupt handler analyzes and reports the error.
PCIe error handling -- ACPI
a) _OSC: The OS conveys APEI support to the platform.
b) _OSC: When the OS requests control over PCIe AER, the platform does not grant it.
c) APEI error source: GHESv2 error source; no PCIe RP AER structure, no PCIe device AER structure.
d) CPER: PCIe error section.
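The _OSC negotiation in item b) can be sketched as a bit-mask operation. This is a toy model, not firmware code: `platform_osc_grant` is a hypothetical name, and the constant assumes the AER control bit of the PCI host bridge _OSC Control field is bit 3 (the value Linux names `OSC_PCI_EXPRESS_AER_CONTROL`).

```python
# Assumed bit position of "PCI Express AER control" in the _OSC Control field.
OSC_PCI_EXPRESS_AER_CONTROL = 0x08


def platform_osc_grant(requested_control, firmware_first=True):
    """Return the control bits the platform grants back to the OS.

    In a firmware-first design the platform clears the AER bit, so the OS
    sees that native AER control was not granted.
    """
    granted = requested_control
    if firmware_first:
        granted &= ~OSC_PCI_EXPRESS_AER_CONTROL
    return granted
```

An OS that requested `0x1F` would get `0x17` back here and would then rely on the GHESv2 error source for PCIe errors instead of handling AER natively.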
PCIe error handling – SEL sample
a) Multiple SEL records are generated per error.
b) Some are generated by the error producer, others by the error consumer.
c) SEL records hold info such as error type (bus error, PERR, SERR, etc.), BDF numbers, vendor/device IDs, and error ID (bad TLP, bad DLLP, etc.).
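As a small illustration of the BDF field, the sketch below decodes the standard 16-bit PCI bus/device/function encoding (8-bit bus, 5-bit device, 3-bit function). How a given SEL record actually stores the BDF is vendor-specific; this example assumes the record carries that standard encoding, and the function names are illustrative.

```python
def decode_bdf(bdf):
    """Split a 16-bit PCI BDF value into (bus, device, function)."""
    bus = (bdf >> 8) & 0xFF
    device = (bdf >> 3) & 0x1F
    function = bdf & 0x7
    return bus, device, function


def format_bdf(bdf):
    """Render a BDF in the bus:device.function notation used by lspci."""
    bus, device, function = decode_bdf(bdf)
    return f"{bus:02x}:{device:02x}.{function}"
```

For example, `format_bdf(0x0308)` yields `"03:01.0"`, which an administrator can match against the OS view of the device tree.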
Power/Thermal management
Sensors
a. Platform sensors
b. DIMM sensors
c. On-chip sensors
Defenses
d. DIMM-temperature-based memory throttling
e. Closed Loop Thermal Throttling based on other sensor temperatures
f. Last defense: software- and hardware-based thermal trip
Power optimizations
g. Autonomous DVFS
h. Turbo mode and CPPC
Catastrophic Error -- Detection
a) Hardware error interrupt handler in Trusted Firmware – A (a.k.a. ARM Trusted Firmware).
b) Secure watchdog bite in on-chip management controller:
1. Watchdog signal 0 routed as SPI to GIC.
2. Watchdog signal 1 routed to the platform, such as the on-chip management controller.
c) BMC (Baseboard Management Controller) detects that the heartbeat from the on-chip management controller has stopped.
d) OS-initiated, through the ACPI PDTT mechanism.
Catastrophic Error – crash dump
a) Contains data needed for failure analysis.
b) Vendor proprietary tool provided to analyze the failure.
c) Modularized to provide fault tolerance.
d) Saved in both non-volatile storage and BMC.
Building Blocks
Early boot time considerations
Report issues that occur before interrupts can be enabled:
a. Check the error status register at certain checkpoints.
b. Check the error status register right before interrupts are enabled.
c. Enable the hardware error interrupt as soon as possible.
Prevent false positives:
d. Filter out noise generated during the training/initialization process.
e. Clear status registers before enabling the hardware error interrupt.
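The ordering above can be sketched with a toy register model. Everything here is hypothetical (`ErrorStatusRegister`, `bring_up`, and the noise mask are illustrative names, not firmware APIs); the point is only the sequence: check, filter, report, clear, then enable.

```python
class ErrorStatusRegister:
    """Toy model of a hardware error status register (assumption)."""

    def __init__(self):
        self.bits = 0
        self.interrupt_enabled = False

    def read(self):
        return self.bits

    def clear(self):
        self.bits = 0


def bring_up(reg, noise_mask, report):
    """Last checkpoint before the hardware error interrupt is enabled."""
    # Report errors latched so far, ignoring bits known to be training noise.
    reportable = reg.read() & ~noise_mask
    if reportable:
        report(reportable)
    # Clear stale status so the interrupt does not fire on old bits.
    reg.clear()
    reg.interrupt_enabled = True
```

Clearing before enabling means the first interrupt taken afterwards corresponds to a new error rather than to leftover training-time status.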
SEL record
I2C bus access conflict resolution
a. Dedicated I2C connection between Trusted Firmware – A and BMC.
b. Dedicated I2C connection between non-secure world (UEFI/OS) and BMC.
BMC to handle simultaneous SEL record access
[Diagram: ARM TF and UEFI/OS each connect to the BMC over a separate I2C link]
Configuration
Early boot loader configuration
a. Some customers want boot to be halted upon a single DIMM training failure.
b. Other customers want boot to continue unless all DIMMs fail.
ARM TF configuration (for example, memory CE threshold needs):
c. Length of sliding window.
d. The threshold number.
Enablement of display/update of configurations in OS
e. Stored in a partition in BootRom, separate from the UEFI variable partition.
f. Exposed through UEFI variable runtime service.
g. Protected from firmware update.
h. Able to be reset to factory default.
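The two memory CE threshold parameters above (sliding window length and threshold number) could be combined as in the following sketch. The class name, the use of seconds as the time unit, and the "greater than or equal" trigger rule are assumptions for illustration, not details from the talk.

```python
from collections import deque


class CorrectableErrorThreshold:
    """Count correctable errors (CEs) inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, now):
        """Record one CE at time `now`; return True if the threshold is reached."""
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With a 10-second window and a threshold of 3, three errors at t = 20, 22, and 25 trigger a report, while isolated errors at t = 0 and t = 5 do not; tuning both values per customer is what the configuration partition makes possible.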
Error injection
Memory Error
● Inject into a specific physical memory address
● Inject into a specific DIMM location
PCIe Error
● Inject through Root Port
● Inject through End Point device
● Inject through PCIe analyzer
Reflections
Hardware design
Reduce cycle stealing
a. Route hardware error interrupt to on-chip management controller
b. Manage Correctable Error threshold in hardware
Standardized design
c. Support RAS extension
Firmware RAS Features
● Exception Handling Framework in Trusted Firmware – A.
○ A framework for triaging RAS errors in EL3 prior to handling or delegation to lower EL
○ Upstream support available for triaging RAS error interrupts and delegation through SDEI
○ Support for v8.2 RAS extension under review
■ Error synchronization barrier
■ Standard Error records
○ Support for v8.4 RAS extensions planned next
● Software Delegated Exception Interface dispatcher in Trusted Firmware – A.
○ A mechanism to deliver high-priority non-maskable events to Normal world
○ Hooks into EHF as a way of delegating RAS errors to Normal world
○ Upstream support available for physical SDEI dispatcher
● RAS error handling in a Secure partition
○ Support for CPER creation in an EDK2 Standalone MM partition in S-EL0 under development
○ Enables isolation of platform-specific code in an EL3-managed sandbox
● DRAM error handling prototyped on SGI-575 using above components
Kernel RAS Features
● Memory failure handling merged
● All Arm specific APEI notifications (SEA, SEI, SDEI) work
● Patches to make APEI support aware of multiple NMI sources under review
● Support for SDEI Client driver merged
● Support in APEI for registering SDEI notification handler under review
● Support for IESB in v8.2 RAS Extension merged
○ SErrors normally unmasked when in kernel
● Errors in KVM-Guest memory forwarded to Qemu/Kvmtool
● Support for notifying KVM-Guest under development
User space daemon
BMC
ARM Server Base Manageability Guide
Firmware First vs. Kernel First
● In general, the server industry asks for firmware first.
● Kernel first is desired for some products.
● An ACPI table that describes the RAS error node topology to the kernel is available for review.
● https://connect.arm.com/dropzone/systemarch/DEN0061A-RAS_ACPI-1.0
alp1.pdf
Challenges
New technologies
a. NVDIMM: ACPI NVDIMM Sub Team working on a number of proposals.
b. New DIMM technologies: DDR5.
c. New bus technologies: PCIe 5.0; CCIX; Gen-Z.
SoC complexities
d. SoC instead of chip set.
e. SoC interconnect.
f. Large number of cores, large caches, high-speed buses
g. On-Chip controllers (PCIe/SATA/USB, etc.)
Acknowledgement
● Achin Gupta (ARM)
● Charles Garcia-Tobin (ARM)
Thank You
#HKG18
HKG18 keynotes and videos on: connect.linaro.org
For further information: jon.zhixiong.zhang@gmail.com, www.linaro.org
HKG18-116 - RAS Solutions for Arm64 Servers

  • 1. ARM64 Server RAS Solutions Jonathan (Zhixiong) Zhang Cavium Inc.
  • 2. Agenda ● Overview ● Solutions ● Building blocks ● Reflections
  • 3. Overview Reliability, Availability, Serviceability RAS is one of the most important aspects of ARM64 servers’ success
  • 4. Progresses ● RAS on ARM64 ● SFO17 ● Fu Wei ● Focus on OS and ACPI
  • 5. Progresses ● RAS on AARCH64 ● SFO17 ● Fu Wei, Supreeth Venkatesh ● Focus on prototype solution
  • 6. Discussion Focus ● RAS is a 3 dimensional problem! ● 2 Dimensions: from CPU, to board, to rack ● 1 dimension: from hardware, to firmware to software ● Product ready solutions
  • 7. Who needs what – end user, apps developer● Accurate error reporting ● Right level error reporting ● Low cycle stealing ● Quarantined uncorrectable error
  • 8. Who needs what – data centre admin ● Accurate error reporting ● Reliable error reporting ● Uniformed error handling policy ● Actionable error reporting
  • 9. Who needs what – Platform Vendor Validation through production software stack ● Example: Memory stability has to be validated through running all hardware threads, exercising all DIMMs through OS memory test tool. ● Example: PCIe tuning issue can be spotted through thousands of reboots.
  • 10. Who needs what – SOC Vendor Validation through production software stack ● Validation/diag tool does not exercise the SOC enough. ● Subsystem design/programming issue needs to be unmasked earlier in the engineering cycle. Cost-effective support process ● Expedited problem root cause. ● Reduced reliant on hardware debugger.
  • 12. Early boot time error notification ● Post code, to diagnose halted boot. ● SEL record, to diagnose successful boot with component failure
  • 13. Run time error notification Catastrophic Error 1. GPIO interrupt. Out Of Band. 2. SEL record. Out Of Band. 3. BERT record. In Band 4. Crash dump. Out Of Band and In Band; Catastrophic error only. Non-catastrophic Error 1. GPIO interrupt. Out Of Band. 2. SEL record. Out Of Band. 3. HEST record. In Band 4. GHESv2 notification. In Band a. Polled b. Interrupt c. SDEI d. SEA e. SEI
  • 14. DRAM error handling Boot time issue ● Reasons: ● DIMM training failure. ● SPD access failure. ● Responses: ○ Halt the boot. ○ Continue booting with SEL record sent. Run time issue ● Correctable error ● Uncorrectable error in non-secure memory ● Recoverable. ● Unrecoverable, in non OS critical region ● Unrecoverable, in OS critical region. ● Uncorrectable error in secure memory
  • 15. DRAM error reporting Error location types concerned: a) Apps developer: Physical address b) Data center admin: FRU (Field Replaceable Unit) c) Platform vendor: DRAM address Channel interleaving complicates address translation: DRAM address ↔ Transaction address ↔ Physical address
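To illustrate why channel interleaving complicates the translation, here is a minimal sketch that assumes a plain round-robin interleave. The granule size, the channel count, and both function names are hypothetical; a real SoC may hash address bits or interleave at several levels.

```python
GRANULE = 256   # assumed interleave granule in bytes (hypothetical)
CHANNELS = 2    # assumed number of memory channels (hypothetical)


def physical_to_channel(pa):
    """Map a physical address to (channel, channel-local address)."""
    block, offset = divmod(pa, GRANULE)
    channel = block % CHANNELS
    local = (block // CHANNELS) * GRANULE + offset
    return channel, local


def channel_to_physical(channel, local):
    """Inverse mapping: rebuild the physical address from a channel-local one."""
    block, offset = divmod(local, GRANULE)
    return (block * CHANNELS + channel) * GRANULE + offset
```

Even in this simplified model, consecutive granules of physical memory land on different channels, so a data centre admin who needs the failing FRU and an apps developer who needs the physical address are looking at two different views of the same error.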
  • 16. Memory scrubbing DRAM scrubbing prevents future errors from happening. a. Firmware initiated on-demand scrubbing, upon memory error. b. Firmware initiated interval based scrubbing, as defined by user. c. OS initiated on-demand scrubbing. 1. Platform exposes scrubbing support capability to OS through ACPI RASF table. 2. OS sets up, starts, and stops scrubbing through RASF PCC channel.
  • 17. PCIe error handling -- workflow Firmware-first handling is being requested. 1. PCIe device (such as end point device) detects error. 2. The error is reported up through AER (Advanced Error Reporting) mechanism. 3. PCIe Root Complex on the SoC issues error interrupt. 4. Firmware secure interrupt handler analyzes and reports the error.
  • 18. PCIe error handling -- ACPI a) _OSC: OS conveys APEI support to platform. b) _OSC: When OS requests control over PCIe AER, platform does not grant the control. c) APEI error source: GHESv2 error source, no PCIe RP AER structure, no PCIe device AER structure. d) CPER: PCIe error section.
  • 19. PCIe error handling – SEL sample a) Multiple SEL records generated per error. b) Some are generated from error producer, others from error consumer. c) SEL records hold info such as error type (bus error, PERR, SERR, etc.), BDF numbers, vendor/device IDs, error ID (bad TLP, bad DLLP, etc.).
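The BDF numbers carried in such records use the standard PCI packing: 8 bits of bus, 5 bits of device, 3 bits of function. A small decoding sketch follows; the function names are illustrative and the SEL record layout itself is not modelled here.

```python
def decode_bdf(bdf):
    """Split a 16-bit PCI BDF value into (bus, device, function)."""
    if not 0 <= bdf <= 0xFFFF:
        raise ValueError("BDF must fit in 16 bits")
    bus = (bdf >> 8) & 0xFF      # bits 15:8
    device = (bdf >> 3) & 0x1F   # bits 7:3
    function = bdf & 0x7         # bits 2:0
    return bus, device, function


def format_bdf(bdf):
    """Render a BDF in the familiar bb:dd.f hexadecimal notation."""
    bus, device, function = decode_bdf(bdf)
    return f"{bus:02x}:{device:02x}.{function:x}"
```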
  • 20. Power/Thermal management Sensors a. Platform sensors b. DIMM sensors c. On-chip sensors Defenses d. DIMM temperature based memory throttling e. Other sensor temperature based Close Loop Thermal Throttling f. Last defense: Software and hardware based thermal trip Power optimizations g. Autonomous DVFS h. Turbo mode and CPPC
  • 21. Catastrophic Error -- Detection a) Hardware error interrupt handler in Trusted Firmware – A (AKA. ARM Trusted Firmware). b) Secure watchdog bite in on-chip management controller i. Watchdog signal 0 routed as SPI to GIC. ii. Watchdog signal 1 routed to the platform, such as on-chip management controller c) BMC (Board Management Controller) detected heart beat stop from on-chip management controller d) OS initiated through ACPI PDTT mechanism
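The two watchdog signals can be pictured as a two-stage timeout: the first expiry raises signal 0 (an interrupt that firmware can still service), and a second expiry without a refresh raises signal 1 towards the platform. The sketch below assumes both stages use the same timeout, which is a modelling assumption; the class and method names are hypothetical.

```python
class TwoStageWatchdog:
    """Toy model of a watchdog with two escalating output signals."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_refresh = 0

    def refresh(self, now):
        """Record that the watched software is still alive at time `now`."""
        self.last_refresh = now

    def signals(self, now):
        """Return (ws0, ws1) as seen at time `now`."""
        elapsed = now - self.last_refresh
        ws0 = elapsed >= self.timeout        # stage 1: routed as SPI to the GIC
        ws1 = elapsed >= 2 * self.timeout    # stage 2: routed to the platform
        return ws0, ws1
```

In this model a handler that runs on signal 0 and calls `refresh` prevents escalation, while a hung system lets signal 1 reach the management controller.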
  • 22. Catastrophic Error – crash dump a) Contains data needed for failure analysis. b) Vendor proprietary tool provided to analyze the failure. c) Modularized to provide fault tolerance. d) Saved in both non-volatile storage and BMC.
  • 24. Early boot time considerations Report issues that happen before interrupts can be enabled: a. Check error status register at certain check points. b. Check error status register right before interrupt is enabled. c. Enable hardware error interrupt as soon as possible. Prevent false positives: d. Filter out noise that occurs during the training/initialization process. e. Clear status registers before enabling hardware error interrupt.
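The ordering of these steps can be sketched with a simulated status register. Everything here is hypothetical: the register is a plain integer, the noise mask stands in for bits known to toggle during training, and no real hardware interface is implied.

```python
class ErrorStatusBlock:
    """Simulated error status register with a polling checkpoint helper."""

    def __init__(self):
        self.status = 0          # pending error bits (simulated register)
        self.irq_enabled = False
        self.reported = []       # (checkpoint label, error bits) tuples

    def checkpoint(self, label, noise_mask=0):
        """Poll the register, report real errors, then clear everything seen."""
        real = self.status & ~noise_mask   # drop bits known to be training noise
        if real:
            self.reported.append((label, real))
        self.status = 0                    # clear so stale bits cannot fire later

    def enable_interrupt(self, noise_mask=0):
        """Do a final poll-and-clear, then switch to interrupt-driven reporting."""
        self.checkpoint("pre-irq-enable", noise_mask)
        self.irq_enabled = True
```

Calling `checkpoint` after each initialization phase and `enable_interrupt` as early as possible covers both goals: nothing is lost before interrupts work, and nothing stale fires once they do.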
  • 25. SEL record I2C bus access conflict resolution a. Dedicated I2C connection between Trusted Firmware – A and BMC. b. Dedicated I2C connection between non-secure world (UEFI/OS) and BMC. BMC to handle simultaneous SEL record access. [Diagram: ARM TF and UEFI/OS each connect to the BMC over a separate I2C link.]
  • 26. Configuration Early boot loader configuration a. Some customers want boot to be halted upon single DIMM training failure. b. Other customers want boot to continue unless all DIMMs fail. ARM TF configuration (For example, memory CE threshold needs) c. Length of sliding window. d. The threshold number. Enablement of display/update of configurations in OS e. Stored in a partition in BootRom, separated from UEFI variable partition. f. Exposed through UEFI variable run time service. g. Protected from firmware update. h. Able to be set to factory default.
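The two memory CE settings (sliding window length and threshold number) combine naturally into a windowed counter. The following is a minimal sketch of one way that could work, assuming timestamps in seconds that never decrease; the class name and the policy of reporting once the count reaches the threshold are assumptions, not the talk's implementation.

```python
from collections import deque


class CorrectableErrorThreshold:
    """Count correctable errors inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()   # timestamps of recent correctable errors

    def record(self, timestamp):
        """Log one correctable error; return True if the threshold is reached."""
        self._events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

With a 60 second window and a threshold of 3, three errors at t=0, 10, 20 trip the threshold, while errors spread further apart than the window never do.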
  • 27. Error injection Memory Error ● Inject into a specific physical memory ● Inject into a specific DIMM location PCIe Error ● Inject through Root Port ● Inject through End Point device ● Inject through PCIe analyzer
  • 29. Hardware design Reduce cycle stealing a. Route hardware error interrupt to on-chip management controller b. Manage Correctable Error threshold in hardware Standardized design c. Support RAS extension
  • 30. Firmware RAS Features ● Exception Handling Framework in Trusted Firmware – A. ○ A framework for triaging RAS errors in EL3 prior to handling or delegation to lower EL ○ Upstream support available for triaging RAS error interrupts and delegation through SDEI ○ Support for v8.2 RAS extension under review ■ Error synchronization barrier ■ Standard Error records ○ Support for v8.4 RAS extensions planned next ● Software Delegated Exception Interface dispatcher in Trusted Firmware – A. ○ A mechanism to deliver high priority non-maskable events to Normal world ○ Hooks into EHF as a way of delegating RAS errors to Normal world ○ Upstream support available for physical SDEI dispatcher ● RAS error handling in a Secure partition ○ Support for CPER creation in an EDK2 Standalone MM partition in S-EL0 under development ○ Enables isolation of platform specific code in an EL3-managed sandbox ● DRAM error handling prototyped on SGI-575 using above components
  • 31. Kernel RAS Features ● Memory failure handling merged ● All Arm specific APEI notifications (SEA, SEI, SDEI) work ● Patches to make APEI support aware of multiple NMI sources under review ● Support for SDEI Client driver merged ● Support in APEI for registering SDEI notification handler under review ● Support for IESB in v8.2 RAS Extension merged ○ SErrors normally unmasked when in kernel ● Errors in KVM-Guest memory forwarded to Qemu/Kvmtool ● Support for notifying KVM-Guest under development
  • 33. BMC
  • 34. ARM Server Base Manageability Guide
  • 35. Firmware First vs. Kernel First ● In general, firmware first is requested by the server industry. ● Kernel first is desired for some products. ● ACPI table available for review that describes the RAS error node topology to the kernel. ● https://connect.arm.com/dropzone/systemarch/DEN0061A-RAS_ACPI-1.0 alp1.pdf
  • 36. Challenges New technologies a. NVDIMM: ACPI NVDIMM Sub Team working on a number of proposals. b. New DIMM technologies: DDR5. c. New bus technologies: PCIe 5.0; CCIX; Gen-Z. SoC complexities d. SoC instead of chip set. e. SoC interconnect. f. Large number of cores, big size of caches, fast speed buses g. On-Chip controllers (PCIe/SATA/USB, etc.)
  • 37. Acknowledgement ● Achin Gupta (ARM) ● Charles Garcia-Tobin (ARM)
  • 38. Thank You #HKG18 HKG18 keynotes and videos on: connect.linaro.org For further information: jon.zhixiong.zhang@gmail.com www.linaro.org