7. Xen RAS latest progress
• Core error recovery
– A new MCA error type: an error in the current processor execution context
– The CPU tags it as action required; it must be handled before execution resumes
– Currently 2 types of architecturally defined core errors:
• Data Load Error
• Instruction Fetch Error
• APEI support
– ACPI Platform Error Interfaces
– Brings existing h/w error mechanisms together into a coherent infrastructure
– Consists of 4 separate tables
• Boot Error Record Table
• Error Record Serialization Table
• Error Injection Table
• Hardware Error Source Table
– Linux 3.0 as dom0 saves us much effort
• Much of the dom0 APEI code is reused
• Little maintenance effort; benefits from kernel improvements
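To illustrate the two core error types listed under core error recovery above, here is a minimal sketch of how a handler might recognize them from an IA32_MCi_STATUS value. The bit positions and the two MCA error codes follow the Intel SDM machine-check chapter as best understood here; the macro names, the enum, and classify_core_error() are hypothetical names chosen for this example and are not taken from the Xen source.

```c
#include <stdint.h>

/* IA32_MCi_STATUS flag bits (positions per the Intel SDM; names are illustrative). */
#define MCI_STATUS_VAL  (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN   (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_PCC  (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S    (1ULL << 56)  /* signaled through a machine check */
#define MCI_STATUS_AR   (1ULL << 55)  /* action required before resuming */

/* Architectural MCA error code, held in the low 16 bits of MCi_STATUS. */
#define MCACOD_MASK     0xffffULL
#define MCACOD_DATA     0x0134ULL     /* data load error */
#define MCACOD_INSTR    0x0150ULL     /* instruction fetch error */

enum core_error {
    CORE_ERR_NONE,         /* not an action-required core error */
    CORE_ERR_DATA_LOAD,
    CORE_ERR_INSTR_FETCH,
};

/* Classify one bank status value as one of the two core error types. */
static enum core_error classify_core_error(uint64_t status)
{
    const uint64_t required = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
                              MCI_STATUS_S | MCI_STATUS_AR;

    /* All of VAL, UC, EN, S and AR must be set for an action-required error. */
    if ((status & required) != required)
        return CORE_ERR_NONE;

    /* A corrupt processor context cannot be recovered from. */
    if (status & MCI_STATUS_PCC)
        return CORE_ERR_NONE;

    switch (status & MCACOD_MASK) {
    case MCACOD_DATA:
        return CORE_ERR_DATA_LOAD;
    case MCACOD_INSTR:
        return CORE_ERR_INSTR_FETCH;
    default:
        return CORE_ERR_NONE;
    }
}
```

A caller would read the status MSR of each bank and pass the raw value in; anything the function does not classify falls through to the existing MCA handling.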
Intel Confidential
8. Core Error Recovery
• Xen core error recovery
– Basically the same MCA infrastructure as uncore error recovery
– MCE exception ISR
• MCE broadcast to all logical processors
• Check whether the error falls in the hypervisor or guest range
– If in hypervisor
• Reset system
– Worst case: execution cannot resume
– If in guest
• Trigger vMCE to affected guest
• Trigger vIRQ to dom0 for logging
• Error contained in guest
– Medium case: error in guest kernel, kill the guest
– Best case: error in guest app, kill the app
– Code done; fine-grained testing needs kernel core recovery support
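The decision flow on this slide can be sketched as a small pure function. This is a simplified model, not the Xen implementation: in practice the hypervisor only injects the vMCE, and the guest kernel itself decides whether to kill an application or go down. All type and function names below are hypothetical.

```c
#include <stdbool.h>

/* Where the faulting execution context was found (illustrative categories). */
enum error_owner {
    OWNER_HYPERVISOR,
    OWNER_GUEST_KERNEL,
    OWNER_GUEST_APP,
};

enum recovery_action {
    ACTION_RESET_SYSTEM,  /* worst case: execution cannot resume */
    ACTION_KILL_GUEST,    /* medium case: error in guest kernel */
    ACTION_KILL_APP,      /* best case: error in guest application */
};

struct recovery_plan {
    enum recovery_action action;
    bool inject_vmce;     /* deliver a virtual MCE to the affected guest */
    bool notify_dom0;     /* raise a vIRQ so dom0 can log the error */
};

/* Map the owner of the faulting context to the outcome listed on the slide. */
static struct recovery_plan plan_core_recovery(enum error_owner owner)
{
    struct recovery_plan plan = { ACTION_RESET_SYSTEM, false, false };

    if (owner == OWNER_HYPERVISOR)
        return plan;  /* nothing to contain; the system is reset */

    /* Error is contained in a guest: notify the guest and log via dom0. */
    plan.inject_vmce = true;
    plan.notify_dom0 = true;
    plan.action = (owner == OWNER_GUEST_KERNEL) ? ACTION_KILL_GUEST
                                                : ACTION_KILL_APP;
    return plan;
}
```

Keeping the decision separate from the side effects makes it easy to unit test, which matters because real core errors are hard to trigger.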
9. APEI support
• Xen APEI support
– BERT
• Boot Error Record Table
– For unhandled fatal errors that occurred in a previous boot
• Xen BERT
– Owned by dom0; waiting for kernel BERT support
– ERST
• Error Record Serialization Table
– Save/retrieve fatal errors to/from persistent storage
• Hypervisor ERST:
– Save error
• Dom0 ERST:
– Retrieve/clear error
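The ERST split above (hypervisor saves, dom0 retrieves and clears) can be modeled with a toy record store. This is only a sketch of the division of labor: an in-memory array stands in for persistent storage, and the firmware-defined serialization steps that real ERST access goes through are not modeled. Every name here is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STORE_SLOTS 4
#define RECORD_MAX  64

struct error_record {
    uint64_t id;               /* 0 marks an empty slot */
    size_t   len;
    uint8_t  data[RECORD_MAX];
};

/* Stand-in for the platform's persistent error storage. */
static struct error_record store[STORE_SLOTS];
static uint64_t next_id = 1;

/* Hypervisor side: save a fatal error record. Returns its id, or 0 on failure. */
static uint64_t store_save(const void *buf, size_t len)
{
    if (len > RECORD_MAX)
        return 0;
    for (size_t i = 0; i < STORE_SLOTS; i++) {
        if (store[i].id == 0) {
            store[i].id = next_id++;
            store[i].len = len;
            memcpy(store[i].data, buf, len);
            return store[i].id;
        }
    }
    return 0;  /* no free slot */
}

/* Dom0 side: retrieve the first stored record, or NULL if none is left. */
static const struct error_record *store_first(void)
{
    for (size_t i = 0; i < STORE_SLOTS; i++)
        if (store[i].id != 0)
            return &store[i];
    return NULL;
}

/* Dom0 side: clear a record after it has been logged. Returns 0 or -1. */
static int store_clear(uint64_t id)
{
    if (id == 0)
        return -1;
    for (size_t i = 0; i < STORE_SLOTS; i++) {
        if (store[i].id == id) {
            memset(&store[i], 0, sizeof(store[i]));
            return 0;
        }
    }
    return -1;  /* no such record */
}
```

After a reboot, dom0 would loop on store_first() and store_clear() until the store is empty, logging each record on the way.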
10. APEI support
• Xen APEI support
– EINJ
• Error Injection Table
– Mechanism through which OSPM can inject h/w errors
• Xen EINJ
– Owned by dom0
– Tested with the error types the current BIOS makes available
– HEST
• Hardware Error Source Table
– Platform level description of error sources and error notifications
• Xen HEST
– Dom0 owns the SCI logic because of ACPICA
– Hypervisor owns the NMI logic; the Xen APEI NMI handler is not ready yet
– Needs BIOS support for more error sources and notifications
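EINJ testing is limited to the error types the BIOS reports as injectable, which firmware exposes as a bitmask. The sketch below decodes such a mask into names, using the bit assignments from the ACPI specification's EINJ error type definition as best recalled here (bits 0 to 11); vendor-defined and later-added bits are ignored. The array and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Names indexed by bit position in the EINJ error type bitmask. */
static const char *const einj_type_names[] = {
    "Processor Correctable",
    "Processor Uncorrectable non-fatal",
    "Processor Uncorrectable fatal",
    "Memory Correctable",
    "Memory Uncorrectable non-fatal",
    "Memory Uncorrectable fatal",
    "PCI Express Correctable",
    "PCI Express Uncorrectable non-fatal",
    "PCI Express Uncorrectable fatal",
    "Platform Correctable",
    "Platform Uncorrectable non-fatal",
    "Platform Uncorrectable fatal",
};

#define EINJ_NUM_TYPES (sizeof(einj_type_names) / sizeof(einj_type_names[0]))

/*
 * Fill out[] with the names of the error types set in mask, lowest bit first.
 * Writes at most cap entries and returns how many were written.
 */
static size_t einj_list_types(uint32_t mask, const char *out[], size_t cap)
{
    size_t n = 0;

    for (size_t bit = 0; bit < EINJ_NUM_TYPES && n < cap; bit++) {
        if (mask & (UINT32_C(1) << bit))
            out[n++] = einj_type_names[bit];
    }
    return n;
}
```

A test script could use such a list to skip injection cases the platform does not support instead of reporting them as failures.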
11. Robustness enhancement
• Xen RAS robustness enhancement
– Xen RAS robustness challenges
• Buggy BIOS
• Some error types not yet supported by h/w
• Hard to trigger errors and automate testing
– Our work to enhance Xen RAS robustness
• Code cleanup & enhancement
• Currently supported errors were triggered and tested
• QA added error-simulator tools and auto test scripts
• EINJ enabling greatly helps debugging & testing
• Robustness enhancement will continue as new platforms support more error types
13. Call for collaboration
• I/O error handling
– PCIe AER, Advanced Error Reporting
– For devices assigned to dom0/PV domU
• Basically reuse dom0/domU AER logic
– For devices assigned to HVM guests
• Need PCIe AER support in qemu
– VALinux has done some work on standard qemu
– Port it to Xen qemu to add AER support