LCA13: BOF Reliability Accessibility and Serviceability (RAS)
[Figure: group photograph at Linaro Connect in Copenhagen, Monday 29 Oct 2012]
LCA13, March 6, 2013
Robert Richter <robert.richter@calxeda.com>
www.linaro.org
Existing subsystems and tools
• mcelog (http://www.mcelog.org/)
• EDAC drivers
• generic tracepoint: trace_mce_record() (implemented only on x86)
• some other arch-specific kernel implementations (alpha, powerpc, sparc)
mcelog
• x86 shared kernel MCE code for Intel and AMD
  ● NMI handler
  ● tracepoint
  ● MCE device (/dev/mcelog) for Intel
  ● AMD driver only logs to console
• buffers not mmap'ed
• mcelog tool works only for Intel
EDAC drivers
• many separate drivers available
• unified sysfs layout
• edac-util (using sysfs)
• memory only, no other events (CPU, I/O, power, etc.)
• only polls sysfs, not event-driven
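As a rough illustration of the polling model: a tool that relies on sysfs has to re-read the error counters and compare snapshots to notice new errors. The sketch below assumes the per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`; the function names are hypothetical and are not taken from edac-util.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller sysfs tree.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_MC_ROOT):
    """Snapshot corrected (ce) and uncorrected (ue) error counts per controller."""
    snapshot = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        counts = {}
        for name in ("ce_count", "ue_count"):
            try:
                counts[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                counts[name] = None  # file missing or unreadable
        snapshot[mc_dir.name] = counts
    return snapshot


def new_errors(previous, current):
    """Return only the counters that grew between two snapshots."""
    deltas = {}
    for mc, counts in current.items():
        before = previous.get(mc, {})
        for name, value in counts.items():
            old = before.get(name) or 0
            if value is not None and value > old:
                deltas.setdefault(mc, {})[name] = value - old
    return deltas
```

A caller would take a snapshot, sleep, take another, and pass both to `new_errors()`. The loop has to wake up on a timer even when nothing happened, which is the cost of not being event-driven.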
Implementing RAS for ARM
• mcelog not suitable: maintained and developed by Intel only
• EDAC: memory errors only, a collection of individual drivers
• only arch-dependent RAS solutions exist
Add another RAS solution for ARM?
No, let's implement an arch-independent RAS solution for Linux.
RAS daemon
• follow Borislav Petkov's proposal for a generic RAS daemon
• patch set, not yet upstream: https://lkml.org/lkml/2011/4/23/72
• implement a generic, arch-independent daemon
• use tracepoints and the perf_event subsystem to collect events in a kernel ring buffer
• reuse the perf_event userland code to access the event buffers
• add the RAS daemon under tools/ in the kernel repository
• provide a reference implementation for x86 and ARM
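To make the tracepoint-plus-perf_event idea a little more concrete: a userland consumer selects a tracepoint for perf_event_open() by its numeric id, which the kernel exposes in an `id` file under the tracing directory. The sketch below only looks up that id for `mce:mce_record` (the event behind trace_mce_record()) and fills in the two attribute fields that identify it; it does not open or mmap the buffer. The mount points tried and the helper names are assumptions, and none of this is code from the patch set.

```python
from pathlib import Path

PERF_TYPE_TRACEPOINT = 2  # value of the enum in linux/perf_event.h

# Assumed tracing mount points; which one exists depends on the system.
TRACING_ROOTS = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def tracepoint_id(system, event, tracing_roots=TRACING_ROOTS):
    """Return the numeric id of a tracepoint, or None if it cannot be read."""
    for root in tracing_roots:
        id_file = Path(root) / "events" / system / event / "id"
        try:
            return int(id_file.read_text().strip())
        except (OSError, ValueError):
            continue  # not mounted here, no permission, or no such event
    return None


def tracepoint_attr(system, event, tracing_roots=TRACING_ROOTS):
    """Return the perf_event_attr fields that select this tracepoint."""
    tp_id = tracepoint_id(system, event, tracing_roots)
    if tp_id is None:
        return None
    return {"type": PERF_TYPE_TRACEPOINT, "config": tp_id}


if __name__ == "__main__":
    # Usually needs root; prints None where the event is unavailable.
    print(tracepoint_attr("mce", "mce_record"))
```

Because the lookup is keyed only by subsystem and event name, the same code would serve any architecture that defines a suitable tracepoint, which is the point of the generic approach.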
RAS daemon - Advantages
• generic approach
• reuse of existing code (esp. perf code)
• event-driven and mmap'ed buffers
• supported by the kernel community; x86 maintainers seem to like it
• standard, flexible framework makes it easy for the kernel community to add features (same as for perf)
RAS daemon - Work to do
• start with memory error counting (ECC)
• kernel:
  ● add persistent events to kernel:
    – always enabled since early boot
    – handle multiple users of event buffers
  ● add machine check drivers for ARM
• userland:
  ● sharing code in tools/perf needs some rework
  ● implement initial version of a RAS daemon with basic feature set (logging and statistics)
  ● define and add more features
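The "logging and statistics" feature set could start as small as the sketch below: each decoded memory error event is logged and added to a running total. The event layout (a dict with `controller` and `kind` keys), the class name, and the logger name are all hypothetical; how events would actually be decoded from the kernel buffer is left open here.

```python
import logging
from collections import Counter

log = logging.getLogger("ras-sketch")  # hypothetical logger name


class MemoryErrorStats:
    """Logs each memory error event and keeps running totals."""

    def __init__(self):
        self.totals = Counter()

    def handle(self, event):
        # `event` is a hypothetical decoded record, for example
        # {"controller": "mc0", "kind": "corrected"}.
        key = (event["controller"], event["kind"])
        self.totals[key] += 1
        log.warning("memory error on %s (%s), total so far: %d",
                    key[0], key[1], self.totals[key])

    def summary(self):
        """Return totals keyed by (controller, kind)."""
        return dict(self.totals)
```

In an event-driven daemon, `handle()` would be called once per record as it arrives, so no polling interval is needed to keep the totals current.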