Las16 200 - firmware summit - ras what is it- why do we need it

RAS: What is it? Why do we need it?
Harb Abdulhamid (Qualcomm)
Fu Wei (Red Hat)
Yazen Ghannam (AMD)

ENGINEERS AND DEVICES
WORKING TOGETHER
What is it?
● Reliability
○ Computation needs to be correct and reliable.
○ Failures and errors need to be detected and reported.
○ Computation needs to fail when an error is not handled.
● Availability
○ System needs to remain available as long as possible.
○ Errors should be corrected and failures handled so that operation can continue.
● Serviceability
○ System should provide information to the administrator to aid in system servicing.
○ Service time needs to be minimized to maximize uptime.
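The link between service time and uptime can be made concrete with the usual steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the sample numbers are assumptions, not figures from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mtbf_hours: mean time between failures.
    mttr_hours: mean time to repair (the "service time").
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(avail: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical system: same failure rate, two different service times.
    for mttr in (8.0, 1.0):
        a = availability(2000.0, mttr)
        print(f"MTTR {mttr:>4} h -> availability {a:.5f}, "
              f"{yearly_downtime_hours(a):.1f} h down per year")
```

Shortening the repair time raises availability even when the failure rate is unchanged, which is the serviceability argument in the bullets above.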

Why do we need it?
● Increase in system uptime (productivity)
● Less time spent debugging bad or failing hardware (productivity/cost)
● Fewer hardware replacement calls (cost/mindshare)

Hardware Architecture (How do we do it?)
● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)
○ Architectural features/extensions.
○ Defines a register set that can be used for multiple devices (IMPORTANT!).
○ Poll for correctable errors.
○ APIC LVT or SMI interrupts for correctable thresholding and deferred errors.
○ MCE for uncorrectable errors.
● PCI-E: Advanced Error Reporting (AER)
○ Similar concepts to MCE/MCA.
● Implementation-specific features
○ ECC in memory controllers
○ ECC in I/O RAMs
○ Poison/bad data markers
○ Flooding I/O links (e.g. Sync Flood)
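As an illustration of what "a register set that can be used for multiple devices" means in practice, here is a minimal sketch that decodes the common flag bits of one machine check bank status register. It assumes the widely documented IA32_MCi_STATUS layout (VAL at bit 63 down to PCC at bit 57, MCA error code in bits 15:0); the class and function names are hypothetical, and vendor-specific fields are left out.

```python
from dataclasses import dataclass


@dataclass
class McStatus:
    valid: bool            # bit 63, VAL: the register holds a logged error
    overflow: bool         # bit 62, OVER: an earlier error was overwritten
    uncorrected: bool      # bit 61, UC: hardware could not correct the error
    enabled: bool          # bit 60, EN: reporting was enabled for this error
    misc_valid: bool       # bit 59, MISCV: the MISC register has extra info
    addr_valid: bool       # bit 58, ADDRV: the ADDR register has an address
    context_corrupt: bool  # bit 57, PCC: processor context may be corrupt
    mca_error_code: int    # bits 15:0
    model_specific_code: int  # bits 31:16


def _bit(value: int, position: int) -> bool:
    return bool((value >> position) & 1)


def decode_mc_status(status: int) -> McStatus:
    """Decode the architectural flag bits of a 64-bit bank status value."""
    return McStatus(
        valid=_bit(status, 63),
        overflow=_bit(status, 62),
        uncorrected=_bit(status, 61),
        enabled=_bit(status, 60),
        misc_valid=_bit(status, 59),
        addr_valid=_bit(status, 58),
        context_corrupt=_bit(status, 57),
        mca_error_code=status & 0xFFFF,
        model_specific_code=(status >> 16) & 0xFFFF,
    )


if __name__ == "__main__":
    # Made-up value: valid, uncorrected, address valid, error code 0x0135.
    print(decode_mc_status((1 << 63) | (1 << 61) | (1 << 58) | 0x0135))
```

Because every bank uses the same layout, one decoder like this can serve caches, memory controllers, and other units alike.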

Platform Firmware (How do we do it?)
● Platform Firmware has intimate knowledge of the system and can handle RAS features not available through standardized mechanisms.
● Privileged code runs on the main cores or on a separate microcontroller.
● Can mask registers from the OS’s view and handle interrupts.
● Handling can be done without the OS’s knowledge, and information can be exposed to the OS if desired.
● Preferably uses a standard mechanism, like ACPI, to inform the OS of errors.
● Can directly inform the sysadmin of errors over sideband channels such as a baseboard management controller (BMC).
● Can pinpoint bad hardware for easy replacement.

Kernel (How do we do it?)
● Error Detection and Correction (EDAC) for system-specific handling and decoding.
● ISA-specific handling in /arch.
● Drivers for PCI-E AER and ACPI.
● Ideally, most RAS code in the Kernel would be made obsolete by Platform Firmware handling of errors.
● The Kernel would then be responsible only for reporting errors received through standard mechanisms (e.g. ACPI).
● The Kernel could also perform error handling that is relevant at the kernel level (e.g. killing processes or retiring bad/poisoned pages).
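For a sense of what the kernel's EDAC reporting looks like from the outside, the sketch below reads the per-memory-controller corrected and uncorrected error counters that EDAC drivers expose through sysfs. It assumes the usual /sys/devices/system/edac/mc layout with ce_count and ue_count files; the function name is hypothetical, and on a system without a loaded EDAC driver the result is simply empty.

```python
from pathlib import Path

# Usual location of EDAC memory-controller nodes on Linux (assumed present
# only when an EDAC driver is loaded for the platform).
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {"mc0": {"ce": <int>, "ue": <int>}, ...} for each controller."""
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip nodes that are unreadable or hold unexpected content.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```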

User-space (How do we do it?)
● mcelog
○ Generally considered obsolete.
○ x86 only.
○ Reads data from /dev/mcelog.
● rasdaemon
○ More active.
○ Can be updated to handle various platforms.
○ Reads data from Kernel tracepoints.
○ Can effectively obsolete EDAC modules for error decoding.

ACPI (How do we do it?)
● We’ll get into this next...

ACPI APEI BERT
● Scenario: record errors in an emergency (OS crash/reset)
● BERT: Boot Error Record Table
● Mechanism: reports unhandled errors that occurred in a previous boot.
○ WHERE are the error records
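The "WHERE" for BERT is carried by two fields that follow the standard 36-byte ACPI table header: the Boot Error Region Length (4 bytes) and the Boot Error Region physical address (8 bytes). The following is a minimal parsing sketch under that assumption; the names Bert and parse_bert are hypothetical, and reading the region itself (physical memory) is out of scope.

```python
import struct
from typing import NamedTuple

# Standard ACPI table header: signature, length, revision, checksum,
# OEM ID, OEM table ID, OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")  # 36 bytes
# BERT body: Boot Error Region Length, Boot Error Region (physical address).
BERT_BODY = struct.Struct("<IQ")               # 12 bytes


class Bert(NamedTuple):
    region_length: int
    region_address: int


def parse_bert(table: bytes) -> Bert:
    """Validate a raw BERT table and return where the error region lives."""
    if len(table) < ACPI_HEADER.size + BERT_BODY.size:
        raise ValueError("table too short to be a BERT")
    signature, length, *_rest = ACPI_HEADER.unpack_from(table, 0)
    if signature != b"BERT":
        raise ValueError(f"unexpected signature {signature!r}")
    if length > len(table):
        raise ValueError("header length exceeds available data")
    # ACPI checksum rule: all bytes of the table sum to zero modulo 256.
    if sum(table[:length]) & 0xFF != 0:
        raise ValueError("bad table checksum")
    region_length, region_address = BERT_BODY.unpack_from(table, ACPI_HEADER.size)
    return Bert(region_length, region_address)
```

On Linux the raw table, when the firmware provides one, can usually be read by root from /sys/firmware/acpi/tables/BERT and passed to a parser like this.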

UEFI spec CPER

ACPI APEI BERT

ACPI APEI HEST
● Scenario: record errors at runtime (the OS can still work)
● HEST: Hardware Error Source Table
● Mechanism: a standardized way for platforms to describe their error sources, using one Error Source Structure per source:
○ HOW to inform
○ WHERE are the error records
○ WHEN records can be freed
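For generic error sources, the error records found at the "WHERE" location begin with a Generic Error Status Block. A rough sketch of decoding its 20-byte header is shown here, assuming the ACPI layout of five little-endian 32-bit fields (Block Status, Raw Data Offset, Raw Data Length, Data Length, Error Severity); the function name and the returned dictionary keys are hypothetical.

```python
import struct

# Block Status, Raw Data Offset, Raw Data Length, Data Length, Error Severity.
STATUS_HEADER = struct.Struct("<IIIII")  # 20 bytes

SEVERITY = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "none"}


def parse_status_block(raw: bytes) -> dict:
    """Decode the fixed header of a Generic Error Status Block."""
    block_status, raw_off, raw_len, data_len, severity = \
        STATUS_HEADER.unpack_from(raw, 0)
    return {
        # Bit 0: an uncorrectable error is present in the block.
        "uncorrectable_valid": bool(block_status & 0x1),
        # Bit 1: a correctable error is present in the block.
        "correctable_valid": bool(block_status & 0x2),
        # Bits 13:4: number of error data entries that follow the header.
        "entry_count": (block_status >> 4) & 0x3FF,
        "raw_data_offset": raw_off,
        "raw_data_length": raw_len,
        "data_length": data_len,
        "severity": SEVERITY.get(severity, "unknown"),
    }


if __name__ == "__main__":
    # Made-up block: one correctable entry with 72 bytes of error data.
    sample = STATUS_HEADER.pack(0x2 | (1 << 4), 0, 0, 72, 2)
    print(parse_status_block(sample))
```

A Block Status of zero means there is nothing to consume, which is how a polling OS can tell an empty record area from a fresh error.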

ACPI APEI HEST
● Error Source Structure types:
○ For IA-32: MCE/CMC/NMI
○ For PCI: AER Root Port/Endpoint/Bridge
○ Generic hardware: GHES v1/v2
● For ARM64: GHES v2
○ HOW to inform: Notification Structure
○ WHERE are the error records: Error Status Address (GAS: Generic Address Structure)
○ WHEN records can be freed: Read Ack Register
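The GHES v2 "WHERE" and "WHEN" fields can be sketched as follows. A Generic Address Structure is 12 bytes (address space ID, register bit width, register bit offset, access size, 64-bit address). For the acknowledgement, GHES v2 pairs the Read Ack Register with Read Ack Preserve and Read Ack Write masks; the value computation below follows my understanding of that read-modify-write sequence and should be treated as an assumption. The names are hypothetical and the actual register access is left out.

```python
import struct
from typing import NamedTuple

# Address space ID, register bit width, register bit offset, access size,
# 64-bit address.
GAS = struct.Struct("<BBBBQ")  # 12 bytes


class GenericAddress(NamedTuple):
    space_id: int     # e.g. 0 = system memory, 1 = system I/O
    bit_width: int
    bit_offset: int
    access_size: int
    address: int


def parse_gas(raw: bytes, offset: int = 0) -> GenericAddress:
    """Decode one Generic Address Structure from a table blob."""
    return GenericAddress(*GAS.unpack_from(raw, offset))


def ack_value(current: int, gas: GenericAddress,
              preserve: int, write: int) -> int:
    """Value the OS would write back to the Read Ack Register.

    current:  value just read from the register
    preserve: Read Ack Preserve mask (bits to keep)
    write:    Read Ack Write mask (bits to set)
    """
    value = current & (preserve << gas.bit_offset)
    value |= write << gas.bit_offset
    return value & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    # Made-up register description in system memory space.
    gas = parse_gas(GAS.pack(0, 64, 0, 4, 0x90000000))
    print(gas, hex(ack_value(0xF0, gas, 0xFFFFFFFE, 0x1)))
```

Until the OS performs this write, firmware treats the record buffer as still in use, which is what lets it avoid overwriting a record the OS has not finished reading.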

ACPI APEI HEST

ACPI APEI ERST
● Scenario: record and retrieve errors in persistent storage
● ERST: Error Record Serialization Table
● Mechanism: abstracted operations; the table provides the details necessary to communicate with on-board persistent storage
● Plan B: use the UEFI runtime variable services to carry out error record persistence operations

ACPI APEI EINJ
● Scenario: test the OSPM error handling stack
● EINJ: Error Injection Table
● Mechanism: abstracted operations; provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software.
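Part of that generic interface is a query for which error types the platform can inject, returned as a bitmask. A small sketch of decoding it is given below, assuming the bit assignments in the ACPI EINJ error type definition (bits 0 to 11); the function name is hypothetical and vendor-defined bits are ignored.

```python
# Bit positions from the ACPI EINJ error type definition (bits 0..11).
EINJ_ERROR_TYPES = {
    0: "Processor Correctable",
    1: "Processor Uncorrectable non-fatal",
    2: "Processor Uncorrectable fatal",
    3: "Memory Correctable",
    4: "Memory Uncorrectable non-fatal",
    5: "Memory Uncorrectable fatal",
    6: "PCI Express Correctable",
    7: "PCI Express Uncorrectable non-fatal",
    8: "PCI Express Uncorrectable fatal",
    9: "Platform Correctable",
    10: "Platform Uncorrectable non-fatal",
    11: "Platform Uncorrectable fatal",
}


def decode_error_types(mask: int) -> list:
    """List the injectable error types named by a capability bitmask."""
    return [name for bit, name in sorted(EINJ_ERROR_TYPES.items())
            if mask & (1 << bit)]


if __name__ == "__main__":
    # Made-up mask: a platform that can inject only memory errors.
    for name in decode_error_types(0x38):
        print(name)
```

On Linux, the einj module exposes this information through debugfs (an available_error_type file under apei/einj), so a test harness rarely needs to read the table directly.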

RAS on ARM64
● Architectural support for RAS is not available, but it is also not needed.
● In other words, there is no need to follow the same historical path as other architectures.
● Focus should be on Platform Firmware handling of errors.
● Reporting should be through standard methods like ACPI.
● Kernel-relevant error handling will possibly need to be implemented, based on information received from Platform Firmware.

Current Work
● Add support for ACPI RAS features.
● Testing Platform Firmware to OS interface.
● No platform-specific RAS feature testing.
● Using modified QEMU for testing.

Future Work
● Finish ACPI implementation.
● Investigate kernel handling of poisoned pages and processes.
● Investigate I/O-related error handling in the Kernel.

Demo

Thank You
#LAS16
For further information: www.linaro.org
LAS16 keynotes and videos on: connect.linaro.org
