This document discusses soft errors in electronic devices and NTT's practices to prevent network outages from soft errors. It consists of the following:
1. An introduction to soft error problems like non-reproducible errors and silent errors in networks.
2. An explanation of soft error mechanisms where cosmic rays can cause nuclear reactions and bit errors in electronic devices.
3. How soft error rates are increasing with device miniaturization.
4. NTT's four-step practices: 1) specifying requirements, 2) simulating soft error rates, 3) applying mitigation techniques like ECC and recovery, and 4) testing devices with a neutron source.
5. Test results showing the effectiveness of the mitigation techniques.
The Action Against Soft-Errors to Prevent Service Outage
1. Copyright 2015 QuEST Forum. All Rights Reserved.
The action against Soft-errors
to prevent service outages
NTT Network Service Systems Laboratories
Hidenori Iwashita
2015 APAC QuEST Forum APAC Best Practices Conference
April 2015
2. Agenda
1. Soft error problems
Laboratory non-reproducible errors
Silent errors
2. Soft error mechanisms
Soft errors are caused by cosmic rays
3. The increase of soft errors
With miniaturization of LSI design rules, soft errors are
increasing rapidly
4. Practices
Soft error test using a compact accelerator neutron source
5. Results
6. Conclusion
NTT can reduce service outages and failure recovery costs due
to soft errors.
3. 1. Soft error problems
Laboratory non-reproducible errors
Network System
Network operations center
① Error
② Alarm
Manufacturer factory
③ Return
④ Tests
⑤ Test OK
4. 1. Soft error problems
Silent errors
Network System
Network operations center
① User complaint: “I can’t connect!”
• Not alarmed
• Fault node unknown
Prolonged
Significant failure
Press release (Newspaper, TV)
5. 2. Soft error mechanisms
Neutrons generated by cosmic rays
[Figure: cosmic rays (high energy particles) from the Sun and from supernova explosions reach the Earth. Nuclear reactions in the atmosphere: a high energy particle destroys a nucleus (O or N) and produces neutrons, protons, muons and π-mesons.]
6. 2. Soft error mechanisms
Nuclear reactions in the device
[Figure: neutrons strike the network system. Inside the device, a neutron destroys a silicon nucleus (a proton is labeled), and the resulting secondary ions cause a soft error (bit error).]
7. 3. The increase of soft errors
Miniaturization of LSI design rule (highly integrated)
Soft errors increase
Past: only in space or the sky
Current: at ground level
8. 3. The increase of soft errors
How often do soft errors occur?
The FPGA contains large-capacity SRAM.
Since SRAMs have a lower critical charge (are more sensitive), soft errors occur more frequently.
Without soft error mitigation, the soft error rate of an FPGA exceeds 10,000 FIT.
E.g. FPGA × 6 per unit, × 1000 units in the network
⇒ about 1.5 devices fail per day
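The "about 1.5 devices per day" figure can be checked with a short calculation. The sketch below assumes the standard definition of FIT (failures per 10^9 device-hours) and reads "FPGA×6" as six FPGAs per unit; the function name is illustrative only.

```python
# FIT = failures per 10^9 device-hours (standard reliability unit).
HOURS_PER_FIT_UNIT = 1e9


def failures_per_day(fit_per_device: float, devices: int) -> float:
    """Expected failures per day for a population of identical devices."""
    failures_per_hour = fit_per_device * devices / HOURS_PER_FIT_UNIT
    return failures_per_hour * 24


# Slide figures: 10,000 FIT per FPGA, 1000 units in the network.
# Assumption: "FPGA x 6" means six FPGAs in each unit.
rate = failures_per_day(fit_per_device=10_000, devices=6 * 1000)
print(f"{rate:.2f} failures per day")  # 1.44
```

The result, roughly 1.4 failures per day, is consistent with the slide's rounded value.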
10. 4. Practices
Step 1. Specifying requirements
Planned network scale
E.g. 1000 units on the network
Specify requirements
E.g. 1 failure per month on the network
⇒ about 1300 FIT / unit
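The conversion from a network-level requirement to a per-unit FIT budget can be sketched as follows. This assumes FIT means failures per 10^9 device-hours and an average month of 730 hours; both the helper name and the month length are assumptions, not taken from the slides.

```python
def fit_budget_per_unit(failures: float, period_hours: float, units: int) -> float:
    """Per-unit FIT allowed so the whole network sees `failures` in `period_hours`."""
    return failures / (period_hours * units) * 1e9


# Assumption: an average month of 24 * 365 / 12 = 730 hours.
HOURS_PER_MONTH = 24 * 365 / 12

# Slide example: 1 failure per month across 1000 units.
budget = fit_budget_per_unit(failures=1, period_hours=HOURS_PER_MONTH, units=1000)
print(round(budget))  # 1370
```

Under these assumptions the budget comes out near 1370 FIT per unit, the same order as the slide's "about 1300 FIT"; the exact value depends on the month length and rounding used.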
11. 4. Practices
Step 2. Simulating soft errors
Device      Design rule [nm]  Size [Mb]  Soft error rate [FIT]
CPU SRAM    65                2          200
FPGA SRAM   28                100        10000
ASIC SRAM   90                2          150
DRAM ①      40                500        10
DRAM ②      40                500        10
DRAM ③      40                500        10
DRAM ④      40                500        10
SRAM ①      65                10         1000
SRAM ②      65                1          100
SRAM ③      65                10         1000
SRAM ④      65                2          200
SRAM ⑤      65                10         1000
Flash Mem   90                50         50
[Figure (example): substrate layout showing the FPGA, ASIC, CPU, SRAM, DRAM and flash memory devices, with four devices marked "High".]
By simulation, we identify the devices with high soft error rates.
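As a rough illustration of how the simulated rates might be checked against the Step 1 requirement, the sketch below sums the example table and flags the largest contributors. The 1000 FIT threshold for "high" is a hypothetical choice, and summing raw rates assumes every bit upset becomes a unit failure, which is a simplification.

```python
# (device, simulated soft error rate in FIT), copied from the example table.
DEVICES = [
    ("CPU SRAM", 200), ("FPGA SRAM", 10000), ("ASIC SRAM", 150),
    ("DRAM 1", 10), ("DRAM 2", 10), ("DRAM 3", 10), ("DRAM 4", 10),
    ("SRAM 1", 1000), ("SRAM 2", 100), ("SRAM 3", 1000),
    ("SRAM 4", 200), ("SRAM 5", 1000),
    ("Flash Mem", 50),
]

BUDGET_FIT = 1300          # per-unit requirement from Step 1
HIGH_THRESHOLD_FIT = 1000  # hypothetical cut-off for a "high" device

total = sum(fit for _, fit in DEVICES)
high = [name for name, fit in DEVICES if fit >= HIGH_THRESHOLD_FIT]

print(f"total = {total} FIT, budget = {BUDGET_FIT} FIT")  # total = 13740 FIT
print("over budget:", total > BUDGET_FIT)                 # True
print("high-rate devices:", high)
```

With this threshold, four devices are flagged (the FPGA SRAM and three of the SRAMs), and the unmitigated total is roughly ten times the budget, which is why countermeasures are needed on those devices.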
12. 4. Practices
Step 3. Apply soft error countermeasures
Select the soft error countermeasures appropriate to each function.
(1) Reducing soft errors
Use devices with low soft error rates.
• MRAM: special device, low spec, high cost
(2) Protection from soft errors
Use memory devices with error correction functions such as ECC*.
• 1 bit correction, 2 bit detection: low cost
• 2 bit correction, 3 bit detection: high cost
(3) Recovery from soft errors
Systems automatically restart or overwrite if a soft error occurs.
• Firmware: low cost
• ASIC: long-term development
*Error Correction Code
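To make "1 bit correction, 2 bit detection" concrete, here is a toy sketch of an extended Hamming (8,4) code, a textbook single-error-correcting, double-error-detecting scheme. It is not the ECC used in any particular memory device (real memories typically protect much wider words), and the function names are illustrative.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit codeword (index 0 = overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bit positions
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                 # makes the whole word even parity
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(code)                           # do not modify the caller's list
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                     # XOR of set positions locates a flip
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[syndrome] ^= 1                     # single-bit error: flip it back
        status = "corrected"
    else:
        return "uncorrectable", None            # two flips: detected, not fixable
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                # simulate one soft error
print(decode(word))         # ('corrected', 11)
word[2] ^= 1                # simulate a second soft error in the same word
print(decode(word))         # ('uncorrectable', None)
```

A single upset is repaired transparently, while a double upset is only reported, which is where the recovery measures in (3) would take over.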
13. 4. Practices
Step 4. Soft error tests with real products
We developed soft error testing technology using Hokkaido
University’s compact accelerator-driven neutron source.
[Photo: Hokkaido University’s compact accelerator-driven neutron source]
15. 5. Results
We measured the devices with the accelerator neutron source to confirm the reduction in soft error rate.
[Figure: Comparison of neutron soft error rates (scale 0 to 10), in three panels: FPGA based device vs. ASIC based device; w/o vs. w/ ECC function; w/o vs. w/ auto recovery function. The panels are annotated with reductions of 80%, 90% and 80%.]
On the real network, the number of soft errors decreased substantially.
16. 6. Conclusion
We successfully reproduced soft errors using a compact
accelerator-driven neutron source.
We were able to investigate soft error tolerance, and check
the fault detection process and the process of switching to a
backup network system.
We conclude that NTT can reduce service outages and
failure recovery costs due to soft errors.
17. Message
Have you ever experienced troubles with unknown causes on your network?
They might be caused by soft errors!
Soft errors can be dealt with!
We hope that all carriers and manufacturers around the world will be freed from these problems!