Title: RAS: What is it? Why do we need it?
A 101-style introduction to RAS, its purpose, and how we use it on ARM64. Covers the current status of implementation in the ASWG specs and the Linux kernel, and plans for future features that are essential for ARM64. Followed by a discussion period.
Speakers: Yazen Ghannam, Fu Wei
LAS16-200: Firmware Summit - RAS: What is it? Why do we need it?
1. RAS: What is it? Why do we need it?
Harb Abdulhamid (Qualcomm)
Fu Wei (Red Hat)
Yazen Ghannam (AMD)
2. ENGINEERS AND DEVICES
WORKING TOGETHER
What is it?
● Reliability
○ Computation needs to be correct and reliable.
○ Failures and errors need to be detected and reported.
○ Computation needs to fail when an error is not handled.
● Availability
○ System needs to remain available as long as possible.
○ Errors should be corrected and failures handled so that operation can continue.
● Serviceability
○ System should provide information to the administrator to aid in system servicing.
○ Service time needs to be minimized to maximize uptime.
3. Why do we need it?
● Increase in system uptime (productivity)
● Less time spent debugging bad or failing hardware (productivity/cost)
● Fewer hardware replacement calls (cost/mindshare)
4. Hardware Architecture (How do we do it?)
● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)
○ Architectural features/extensions.
○ Defines a register set that can be used for multiple devices (IMPORTANT!).
○ Poll for correctable errors.
○ APIC LVT or SMI interrupts for correctable thresholding and deferred errors.
○ MCE for uncorrectable errors.
● PCI-E: Advanced Error Reporting (AER)
○ Similar concepts to MCE/MCA.
● Implementation-specific features
○ ECC in memory controllers
○ ECC in I/O RAMs
○ Poison/bad data markers
○ Flooding I/O links (e.g. Sync Flood)
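To make the shared MCA register set more concrete, here is a minimal sketch that decodes the architectural flag bits of an x86 MCi_STATUS value and separates correctable from uncorrectable errors. The helper names (`decode_mci_status`, `classify`) and the sample value are hypothetical, and the sketch ignores vendor-specific bits such as deferred-error flags.

```python
# Bit positions of the architectural flags in an x86 MCi_STATUS register.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # hardware did not correct the error
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds the error address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    decoded = {name: bool((status >> bit) & 1)
               for name, bit in MCI_STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF               # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


def classify(status):
    """Very coarse classification: no error, correctable, or uncorrectable."""
    decoded = decode_mci_status(status)
    if not decoded["VAL"]:
        return "no error"
    return "uncorrectable" if decoded["UC"] else "correctable"


# Hypothetical sample: valid, uncorrected, address valid, error code 0x0135.
sample = (1 << 63) | (1 << 61) | (1 << 58) | 0x0135
print(classify(sample), hex(decode_mci_status(sample)["mca_error_code"]))
```

Because every bank uses the same layout, one decoder like this can serve any device that reports through the register set, which is the point of the "IMPORTANT!" note above.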
5. Platform Firmware (How do we do it?)
● Platform Firmware has intimate knowledge of the system and can handle RAS features not available through standardized mechanisms.
● Privileged code runs on the main cores or a separate microcontroller.
● Can mask registers from OS view and handle interrupts.
● Handling can be done without the OS's knowledge, and information can be exposed to the OS if desired.
● Preferably, will use a standard mechanism, like ACPI, to inform the OS of errors.
● Can directly inform the sysadmin of errors using sideband communications like a baseboard management controller (BMC).
● Can pinpoint bad hardware for easy replacement.
6. Kernel (How do we do it?)
● Error Detection and Correction (EDAC) for system-specific handling and decoding.
● ISA-specific handling in arch/.
● Drivers for PCI-E AER and ACPI.
● Ideally, most RAS code in the kernel would be obsoleted by Platform Firmware handling of errors.
● The kernel would then be responsible only for reporting errors received through standard mechanisms (e.g. ACPI).
● The kernel could also perform error handling relevant at the kernel level (e.g. killing processes or retiring bad/poisoned pages).
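As an illustration of what the kernel's EDAC layer exposes today, the following sketch reads the per-memory-controller corrected and uncorrected error counters from the EDAC sysfs tree. It assumes a Linux system with an EDAC driver loaded; the function name `read_edac_counts` is hypothetical, and the root path is a parameter so the code can be pointed at a test directory.

```python
import glob
import os


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mc* dir."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None  # attribute missing or unreadable
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    # Prints an empty dict on systems without a loaded EDAC driver.
    print(read_edac_counts())
```

A rising `ce_count` on one controller is the kind of signal an administrator can use to schedule a replacement before an uncorrectable error occurs.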
7. User-space (How do we do it?)
● mcelog
○ Generally considered obsolete.
○ x86 only.
○ Reads data from /dev/mcelog.
● rasdaemon
○ More actively developed.
○ Can be updated to handle various platforms.
○ Reads data from kernel tracepoints.
○ Can effectively obsolete EDAC modules for error decoding.
9. ACPI APEI BERT
● Scenario: record errors in an emergency (OS crash/reset)
● BERT: Boot Error Record Table
● Mechanism: report unhandled errors that occurred in a previous boot.
○ WHERE are the error records
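The "WHERE" part of BERT can be sketched as follows: the table is a standard 36-byte ACPI header followed by the length and physical address of the Boot Error Region. This is a minimal parser under that assumption; the function name `parse_bert` and the sample values in the usage lines are hypothetical, and the code only locates the region rather than reading the error records inside it.

```python
import struct

# 36-byte ACPI table header, then Boot Error Region Length (4 bytes)
# and Boot Error Region address (8 bytes), all little-endian.
BERT_FORMAT = "<4sIBB6s8sI4sIIQ"


def parse_bert(table):
    """Return where the boot error region lives, given raw BERT bytes."""
    if len(table) < struct.calcsize(BERT_FORMAT):
        raise ValueError("table too short to be a BERT")
    (signature, length, revision, checksum, oem_id, oem_table_id,
     oem_revision, creator_id, creator_revision,
     region_length, region_address) = struct.unpack_from(BERT_FORMAT, table)
    if signature != b"BERT":
        raise ValueError("not a BERT table")
    return {
        "revision": revision,
        "oem_id": oem_id.decode("ascii", "replace").strip(),
        # An ACPI table is valid when its bytes sum to zero modulo 256.
        "checksum_ok": sum(table[:length]) % 256 == 0,
        "region_length": region_length,
        "region_address": region_address,
    }


# Build a hypothetical table in memory to exercise the parser.
raw = bytearray(struct.pack(BERT_FORMAT, b"BERT", 48, 1, 0, b"OEMID ",
                            b"OEMTABLE", 1, b"TEST", 1, 0x1000, 0x80000000))
raw[9] = (-sum(raw)) % 256  # byte 9 is the header checksum field
print(parse_bert(bytes(raw)))
```

On Linux the raw table is typically readable by root at /sys/firmware/acpi/tables/BERT when the platform provides one.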
12. ACPI APEI HEST
● Scenario: record errors at runtime (the OS can still work)
● HEST: Hardware Error Source Table
● Mechanism: a standardized way for platforms to describe their error sources, using Error Source Structures:
○ HOW to inform
○ WHERE are the error records
○ WHEN records can be freed
13. ACPI APEI HEST
● Error Source Structure:
○ For IA-32: MCE/CMC/NMI
○ For PCI: AER Root Port/Endpoint/Bridge
○ Generic Hardware: GHES v1/v2
● For ARM64: GHES v2
○ HOW to inform: Notification Structure
○ WHERE are the error records: Error Status Address (GAS: Generic Address Structure)
○ WHEN records can be freed: Read Ack Register
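The "WHEN" step for GHES v2 can be sketched as a small calculation. Alongside the Read Ack Register, a GHES v2 entry carries a Read Ack Preserve mask and a Read Ack Write value; after consuming a record, the OS reads the register, keeps the preserved bits, sets the write bits, and writes the result back so firmware knows the record buffer can be reused. The function name and the sample values below are hypothetical, and the actual register access (through the GAS) is left out.

```python
MASK64 = (1 << 64) - 1


def ghes_v2_ack_value(current, preserve, write, bit_offset=0):
    """Compute the value to write to the Read Ack Register.

    current    -- value just read from the Read Ack Register
    preserve   -- Read Ack Preserve mask from the GHES v2 entry
    write      -- Read Ack Write value from the GHES v2 entry
    bit_offset -- bit offset of the register field within the GAS
    """
    kept = current & (preserve << bit_offset)
    return (kept | (write << bit_offset)) & MASK64


# Hypothetical platform: bit 0 is the ack bit, all other bits are preserved.
print(hex(ghes_v2_ack_value(0xFF00, 0xFFFFFFFE, 0x1)))
```

Without this acknowledgment (GHES v1), firmware cannot tell whether the OS has finished with a record before overwriting it.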
15. ACPI APEI ERST
● Scenario: record and retrieve errors in persistent storage
● ERST: Error Record Serialization Table
● Mechanism: abstracts the operations and provides the details necessary to communicate with on-board persistent storage
● Plan B: use the UEFI runtime variable services to carry out error record persistence operations
16. ACPI APEI EINJ
● Scenario: test the OSPM error handling stack
● EINJ: Error Injection Table
● Mechanism: abstracts the operations and provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software.
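As a sketch of how a test tool might interpret what EINJ offers, the following decodes an error-type bitmask into the error classes defined by the ACPI EINJ error type encoding. The function name `decode_einj_types` is hypothetical, and the sample mask is made up; on Linux, the supported types are typically listed under the debugfs directory apei/einj when the EINJ driver is loaded.

```python
# Error type bits as defined for the ACPI EINJ error type field.
EINJ_ERROR_TYPES = {
    0x001: "Processor Correctable",
    0x002: "Processor Uncorrectable non-fatal",
    0x004: "Processor Uncorrectable fatal",
    0x008: "Memory Correctable",
    0x010: "Memory Uncorrectable non-fatal",
    0x020: "Memory Uncorrectable fatal",
    0x040: "PCI Express Correctable",
    0x080: "PCI Express Uncorrectable non-fatal",
    0x100: "PCI Express Uncorrectable fatal",
    0x200: "Platform Correctable",
    0x400: "Platform Uncorrectable non-fatal",
    0x800: "Platform Uncorrectable fatal",
}


def decode_einj_types(mask):
    """Return the names of all error types whose bit is set in mask."""
    return [name for bit, name in sorted(EINJ_ERROR_TYPES.items())
            if mask & bit]


# Hypothetical platform that supports only memory error injection.
for name in decode_einj_types(0x008 | 0x010 | 0x020):
    print(name)
```

A test harness could use such a list to choose which injections to attempt and then check that each one surfaces through the expected reporting path (for example, GHES).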
17. RAS on ARM64
● Architectural support for RAS is not available, but it is not needed.
● In other words, there is no need to follow the same historical path as other architectures.
● Focus should be on Platform Firmware handling of errors.
● Reporting should be through standard methods like ACPI.
● Will possibly need to implement kernel-relevant error handling based on information received from Platform Firmware.
18. Current Work
● Add support for ACPI RAS features.
● Testing Platform Firmware to OS interface.
● No platform-specific RAS feature testing.
● Using modified QEMU for testing.
19. Future Work
● Finish ACPI implementation.
● Investigate kernel handling of poisoned pages and processes.
● Investigate I/O-related error handling in the Kernel.