2. 2
A Multi-Phase Decision on Reliability Growth
with Latent Failure Modes
Tongdan Jin, Ph.D.
Ingram School of Engineering
Texas State University, TX 78666, USA
6pm Pacific Time on Feb. 9, 2014
3. 3
Contents
• The Need for Reliability Growth Planning
• Reliability Growth considering Latent Failure Modes
• Multi-Phase Reliability Growth Management
• Applications to Electronic Equipment
• Conclusion
5. 5
Reliability Growth for Capital Equipment
• Large and complex capital goods
• Long service time
• Prohibitive downtime cost
• Expensive maintenance, repair, and overhaul (MRO)
• Integrated product-service system
6. 6
Reliability Growth Management
[Figure: Product life cycle, from Design and Development, through Prototype and Pilot Phase, to Volume Production, Field Use and After-Sales Support, annotated with Reliability Growth Testing (RGT) and Reliability Growth Planning (RGP).]
7. 7
Why Do We Need RGP?
• Shorter Time-To-Market
• Cuts in Testing Budget
• Dispersed Design, Manufacturing, and Integration
• Usage Diversity
• Variable System Configuration
[Figure 3: Compressed System Design Cycle. Timeline from t0 to t4 showing basic subsystems 1 to 3 (basic design), advanced subsystems 4 to 6, and volume manufacturing and shipping.]
8. 8
Reliability Post New Product Introduction
[Figure: System MTBF and field system population (system install base) versus chronological time, with the target MTBF marked.]
10. 10
System Failure Mode Categories
[Figure: Failure breakdown by root-cause category (Hardware, Design, Mfg, Process, Software, NFF) for four different modules A, B, C, and D; vertical axis from 0% to 50%. Data from >100 systems shipped within one year.]
11. 11
RGP Program: A Synergy of ECO and CA
[Figure: RGP program flow linking Product Design & Manufacturing, In-service Systems, Spare Inventory, Repair Center, and Retrofit Team through an ECO loop and a retrofit loop, with new system shipping and installation and spare batches.]
1. Failure mode analysis
2. Reliability growth prediction
3. CA implementations
ECO = Engineering Change Order
CA = Corrective Actions
13. 13
Failure Intensity Rate w/o Latent Failures
$$\lambda_s(t \mid t_c) = \sum_{i=1}^{m} \lambda_{A,i}(t) + \sum_{i=1}^{n} \lambda_{B,i}(t)$$
Where:
λA,i(t) = failure intensity for failure mode i in A
λB,i(t) = failure intensity for failure mode i in B
m = number of failure modes in A by time tc
n = number of failure modes in B by time tc
[Figure: Failure intensity versus time from t0 to tc for failure modes λ1(t) to λ4(t) (curves a to d), grouped into "No trends" and "Trends".]
14. 14
Crow/AMSAA Growth Model
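$$\hat{\beta} = \frac{N}{\sum_{i=1}^{N} \ln(t_s / t_i)}, \qquad \hat{\lambda} = \frac{N}{t_s^{\hat{\beta}}}$$
Failure Intensity:
$$\hat{\lambda}(t) = \hat{\lambda} \hat{\beta} t^{\hat{\beta} - 1}$$
Where ts = termination time, ti = ith failure arrival time
Hypothesis Testing:
H0: β = 1, HPP
H1: β ≠ 1, NHPP
Reject H0 if
$$\hat{\beta} < \frac{2N}{\chi^2_{2N,\,1-\alpha/2}} \quad \text{or} \quad \hat{\beta} > \frac{2N}{\chi^2_{2N,\,\alpha/2}}$$
where χ² with subscript (2N, p) is the p-quantile of the chi-square distribution with 2N degrees of freedom
HPP = Homogeneous Poisson Process
NHPP = Non-homogeneous Poisson Process
[Figure: Various failure intensity models. Failure intensity versus time for β = 1, β = 0.5, and β = 1.5, with λ = 1 for all.]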
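The Crow/AMSAA estimates above can be sketched in a few lines of Python. This is a minimal illustration for time-truncated data, not the author's implementation; the failure arrival times and the names `crow_amsaa_mle` and `failure_intensity` are hypothetical.

```python
import math

def crow_amsaa_mle(failure_times, t_s):
    """Return (beta_hat, lambda_hat) for failure arrival times observed up to t_s."""
    n = len(failure_times)
    # beta_hat = N / sum(ln(t_s / t_i))
    beta_hat = n / sum(math.log(t_s / t_i) for t_i in failure_times)
    # lambda_hat = N / t_s ** beta_hat
    lambda_hat = n / t_s ** beta_hat
    return beta_hat, lambda_hat

def failure_intensity(t, beta_hat, lambda_hat):
    """Estimated failure intensity lambda_hat * beta_hat * t ** (beta_hat - 1)."""
    return lambda_hat * beta_hat * t ** (beta_hat - 1.0)

# Hypothetical failure arrival times (days), observation ends at t_s = 100 days.
times = [5.0, 12.0, 30.0, 55.0, 90.0]
beta_hat, lambda_hat = crow_amsaa_mle(times, t_s=100.0)
print(beta_hat, lambda_hat, failure_intensity(100.0, beta_hat, lambda_hat))
```

A value of beta_hat below 1 indicates a decreasing failure intensity (reliability growth), and a value above 1 indicates an increasing one; whether the departure from 1 is significant is what the hypothesis test decides.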
17. 17
What Is a Latent Failure Mode?
1. Also known as a dormant failure mode
2. Hibernated
3. Depends on customer usage
4. May be caused by design weakness,
5. Software bugs, and
6. Electrostatic discharge (ESD)
7. Others …
19. 19
Surfaced and Latent Failure Modes
[Figure: Surfaced and latent failure modes.]
A latent failure mode becomes a surfaced failure mode once it has occurred.
20. 20
Reliability Model with Latent Failure Modes
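$$\lambda_s(t \mid t_c) = \sum_{i=1}^{m} \lambda_i + \sum_{i=1}^{n} \lambda_i \beta_i t^{\beta_i - 1} + \sum_{j=1}^{k} \gamma_j(t)$$
Where
• k = the number of new latent failure modes that occur in T.
• γj(t) = the failure intensity for the jth latent failure mode.
• t > tc.
The last term is the projected latent failure intensity after tc.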
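As a rough illustration of how the three terms of this model combine, the sketch below evaluates the system failure intensity at a time t > tc. All numerical values are hypothetical, and the function name `system_intensity` is an assumption made for this example.

```python
def system_intensity(t, const_modes, trend_modes, latent_modes):
    """Sum of constant, trending (Crow/AMSAA), and latent failure intensities at time t.

    const_modes:  list of lambda_i for failure modes with no trend
    trend_modes:  list of (lambda_i, beta_i) pairs for failure modes with a trend
    latent_modes: list of callables gamma_j(t) for latent failure modes
    """
    constant = sum(const_modes)
    trending = sum(lam * beta * t ** (beta - 1.0) for lam, beta in trend_modes)
    latent = sum(gamma(t) for gamma in latent_modes)
    return constant + trending + latent

# Hypothetical inputs: two constant modes, one decreasing mode, one latent mode.
value = system_intensity(
    t=100.0,
    const_modes=[0.01, 0.02],
    trend_modes=[(0.05, 0.5)],
    latent_modes=[lambda t: 0.001],
)
print(value)
```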
21. 21
Estimate Cumulative Latent Failure Intensity
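With the latent sum replaced by the cumulative latent failure intensity Γa(t), the model becomes
$$\lambda_s(t \mid t_c) = \sum_{i=1}^{m} \lambda_i + \sum_{i=1}^{n} \lambda_i \beta_i t^{\beta_i - 1} + \Gamma_a(t)$$
where Γa(t) stands for
$$\sum_{j=1}^{k} \gamma_j(t)$$
[Eqs. (3) to (5): the model above, with k and Γa(t) estimated from kc, T, and Tc, where kc = number of latent failure modes that occurred in Tc.]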
22. 22
Summary of Latent Failure Mode Prediction
• Step 1: Estimate λi(t) for surfaced failure mode i at tc using the Crow/AMSAA model
• Step 2: Obtain λs(t|tc) using Eq. (2) on slide 14.
• Step 3: Estimate k and Γa(t) using Eqs. (4) and (5)
• Step 4: Obtain the reliability growth model, Eq. (3)
For more details, please also refer to T. Jin, H. Liao, M. Kilari, “Reliability growth modeling
for in-service systems considering latent failure modes,” Microelectronics Reliability, vol. 50,
no. 3, 2010, pp. 324-331.
24. 24
Why We Need the CA Effectiveness Estimate
Resources ($) are spent on CA through
1. Retrofit
2. ECO
The CA effectiveness function links the $ spent on CA to the % reduction of a failure mode.
25. 25
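Modeling CA (or Fix) Effectiveness
Effectiveness = (Failure rate before CA − Failure rate after CA) / (Failure rate before CA)
Effectiveness Model:
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
where x = CA budget ($); b and c are to be determined
[Figure: Effectiveness h(x), from 0 to 1, versus CA budget x ($), from 0 to c, for b > 1, b = 1, and b < 1.]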
For more details on effectiveness function, please refer to T. Jin, Y. Yu, and F. Belkhouche, “Reliability growth using retrofit or
engineering change order-a budget-based decision making,” in Proceedings of IERC Conference, 2009, pp. 2152-2157.
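To get a feel for how the exponent b shapes the effectiveness curve, the following sketch evaluates h(x) = (x/c)^b at a fixed budget for three values of b. The values of x and c are hypothetical, and restricting x to the range from 0 to c is an assumption read off the axis of the figure.

```python
def ca_effectiveness(x, b, c):
    """Effectiveness h(x) = (x / c) ** b for a CA budget x (assumed 0 <= x <= c)."""
    if not 0.0 <= x <= c:
        raise ValueError("budget x must lie between 0 and c")
    return (x / c) ** b

# Hypothetical budget of $5,000 against c = $10,000.
for b in (0.5, 1.0, 2.0):
    print(b, ca_effectiveness(5000.0, b, 10000.0))
```

With b < 1 the first dollars spent buy the largest reduction, with b = 1 the return is linear, and with b > 1 most of the benefit arrives only as the budget approaches c.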
26. 26
An Example: ECO or Retrofit
A type of relay used on a PCB module fails constantly due to a known failure mechanism. Two options are available for corrective actions:
1. Replace all on-board relays upon the failure return of the module
2. Proactively recall all modules and replace the relays with new types having much higher reliability
CA Option   Cost ($)   CA Effectiveness
ECO         Low        Low
Retrofit    High       High
27. 27
An Illustrative Example
The current failure rate of a type of relay is 2×10⁻⁸ faults per hour. Upon the implementation of CA, the rate is reduced to 5×10⁻⁹.
The CA effectiveness is 0.75, that is,
$$\frac{2 \times 10^{-8} - 5 \times 10^{-9}}{2 \times 10^{-8}} = 0.75$$
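The same arithmetic can be checked in code. The sketch below also inverts the effectiveness model h(x) = (x/c)^b to estimate the budget needed for that effectiveness; the values b = 1 and c = $10,000 are hypothetical, since the slides leave b and c to be determined.

```python
def effectiveness(rate_before, rate_after):
    """Fractional reduction in failure rate achieved by a corrective action."""
    return (rate_before - rate_after) / rate_before

def budget_for(target_effectiveness, b, c):
    """Invert h(x) = (x / c) ** b to get the budget x that reaches a target effectiveness."""
    return c * target_effectiveness ** (1.0 / b)

e = effectiveness(2e-8, 5e-9)                # relay example from the slide
x_needed = budget_for(e, b=1.0, c=10000.0)   # hypothetical b and c
print(e, x_needed)
```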
28. 28
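Incorporate h(x) into λs(t|tc)
$$h(x) = \left(\frac{x}{c}\right)^{b}$$
$$\lambda_s(t \mid t_c) = \sum_{i=1}^{m} \lambda_i + \sum_{i=1}^{n} \lambda_i \beta_i t^{\beta_i - 1} + \Gamma_a(t)$$
Applying the effectiveness (xi/ci)^bi to each surfaced failure mode gives
$$\lambda_s(t; \mathbf{x}) = \sum_{i=1}^{m} \left[1 - \left(\frac{x_i}{c_i}\right)^{b_i}\right] \lambda_i + \sum_{i=1}^{n} \left[1 - \left(\frac{x_i}{c_i}\right)^{b_i}\right] \lambda_i \beta_i t^{\beta_i - 1} + \Gamma_a(t)$$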
29. 29
Optimization Formulation
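Min:
$$g(\mathbf{x}) = \sum_{i=1}^{m} x_i \qquad \text{(RGP budget)}$$
Subject to:
$$\lambda_s(t; \mathbf{x}) \le \lambda_0 \qquad \text{(target reliability)}$$
xi ≥ 0 for i = 1, 2, …, m
Where
x = {x1, x2, …, xm}
xi = CA budget for failure mode i, for i = 1, 2, …, m.
λ0 = target system failure intensity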
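The formulation can be illustrated on a toy instance. The sketch below uses a brute-force grid search, which is not the solution method of the cited paper and is only practical for a handful of failure modes; a nonlinear programming solver would normally be used. All parameter values are hypothetical, and assigning one budget to every surfaced failure mode (constant and trending alike) is an assumption of this example.

```python
import itertools

def intensity_after_ca(t, x, const_modes, trend_modes, latent):
    """System failure intensity at time t after spending budgets x on the surfaced modes.

    const_modes: list of (lambda_i, b_i, c_i)
    trend_modes: list of (lambda_i, beta_i, b_i, c_i)
    latent:      aggregate latent failure intensity at time t
    """
    total = latent
    k = 0
    for lam, b, c in const_modes:
        total += (1.0 - (x[k] / c) ** b) * lam
        k += 1
    for lam, beta, b, c in trend_modes:
        total += (1.0 - (x[k] / c) ** b) * lam * beta * t ** (beta - 1.0)
        k += 1
    return total

def cheapest_plan(t, target, const_modes, trend_modes, latent, steps=20):
    """Search a budget grid for the lowest total spend that meets the target intensity."""
    caps = [m[-1] for m in const_modes] + [m[-1] for m in trend_modes]
    grids = [[c * s / steps for s in range(steps + 1)] for c in caps]
    best = None
    for x in itertools.product(*grids):
        if intensity_after_ca(t, x, const_modes, trend_modes, latent) <= target:
            if best is None or sum(x) < sum(best):
                best = x
    return best  # None if no grid point meets the target

# Hypothetical instance: one constant mode, one trending mode, fixed latent intensity.
plan = cheapest_plan(
    t=100.0,
    target=0.014,
    const_modes=[(0.02, 1.0, 1000.0)],
    trend_modes=[(0.05, 0.5, 1.0, 2000.0)],
    latent=0.001,
)
print(plan, sum(plan))
```

In this instance the search puts the whole budget on the constant failure mode, because each dollar spent there removes more failure intensity than a dollar spent on the trending mode.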
30. 30
Topic V:
Numerical Example
(Driving Electronic Equipment
Reliability)
The example is taken from the following paper:
T. Jin, Y. Yu, H.-Z. Huang, “A multiphase decision model for reliability growth considering
stochastic latent failures,” IEEE Transactions on Systems, Man and Cybernetics, Part A,
vol. 43, no. 4, 2013, pp. 958-966.
31. 31
Overview of the Planning Horizon
Phase 1 (Day 1-90)
• Collect field data
• Identify surfaced failure modes
• Reliability prediction for Phase 2
• Resource allocation for Phase 2
Phase 2 (Day 91-220)
• Collect field data
• Identify latent failure modes
• Reliability prediction
• Implement CA/ECO
• Resource allocation for Phase 3
Phase 3 (Day 221-350)
• Collect field data
• Identify new latent failure modes
• Reliability prediction
• Implement CA/ECO
• Resource allocation for Phase 4 (next)
32. 32
Failure Inter-Arrival Times in Phase 1
i Failure Mode Days 7 14 15 21 84 85 87 89
1 Open Diode 1 1 1 1 1 1
2 Power Supply 1
3 EEPROM 1 1
4 Cold Solder 1
5 NFF 1
6 Flux Contam 1
Note: The number in each cell is the failure quantity.
39. 39
Conclusions
1. New designs are often subject to both component (hardware) and non-component failures. Some failure modes are dormant.
2. RGP is a multi-disciplinary, cross-functional team effort, as it involves design, manufacturing, testing, operation, and maintenance, as well as latent failures.
3. We propose a CA effectiveness function and integrate it into the RGP model to achieve the reliability target at a lower cost.
4. An accurate reliability growth prediction is useful, yet it is more beneficial to industry to know when the reliability goal will be reached and how much resource (labor and budget) is required.
40. 40
References
1. D. S. Jackson, H. Pant, M. Tortorella, “Improved reliability-prediction and field-reliability-data analysis for field-
replaceable units,” IEEE Transactions on Reliability, vol. 51, no. 1, 2002, pp. 8-16.
2. J. T. Duane, “Learning curve approach to reliability monitoring,” IEEE Transactions on Aerospace, vol. 2, no. 2, 1964, pp.
563-566.
3. L. H. Crow, “Reliability analysis for complex, repairable systems,” SIAM Reliability and Biometry, 1974, pp. 379-410.
4. M. Xie, M. Zhao, “Reliability growth plot-an underutilized tool in reliability analysis,” Microelectronics and Reliability,
vol. 36, no. 6, 1996, pp. 797-805.
5. D. W. Coit, “Economic allocation of test times for subsystem-level reliability growth testing,” IIE Transactions on Quality
and Reliability Engineering, vol. 30, no. 12, 1998, pp. 1143-1151.
6. M. Krasich, J. Quigley, L. Walls, “Modeling reliability growth in the system design process,” in Proceedings of Annual
Reliability and Maintainability Symposium, 2004, pp. 424-430.
7. S. Inoue, S. Yamada, “Generalized discrete software reliability modeling with effect of program size,” IEEE Transactions on
Systems, Man and Cybernetics, Part A, vol. 37, no. 2, 2007, pp. 170-179.
8. P. M. Ellner, J. B. Hall, “An approach to reliability growth planning based on failure mode discovery and correction using
AMSAA projection methodology,” in Proceedings of Annual Reliability and Maintainability Symposium, 2006, pp. 266-
272.
9. T. Jin, H. Liao, M. Kilari, “Reliability growth modeling for in-service systems considering latent failure modes,”
Microelectronics Reliability, vol. 50, no. 3, 2010, pp. 324-331.
10. T. Jin, Y. Yu, H.-Z. Huang, “A multiphase decision model for reliability growth considering stochastic latent failures,” IEEE
Transactions on Systems, Man and Cybernetics, Part A, vol. 43, no. 4, 2013, pp. 958-966.
11. L. Attardi, G. Pulcini, “A new model for repairable systems with bounded failure intensity,” IEEE Transactions on
Reliability, vol. 54, no. 4, 2005, pp. 572-582.
12. M. S. Bazaraa, H. D. Sherali, C. M. Shetty, Nonlinear Programming: Theory and Algorithms, 3rd edition, 2006, John Wiley & Sons,
New York.