This is a four-part lecture series. The course is designed for reliability engineers working in the electronics, opto-electronics and photonics industries. It explains the roles of Highly Accelerated Life Testing (HALT) in the design and manufacturing efforts, with the emphasis on design (HALT in manufacturing is the well-known approach of the late Gregg Hobbs), and teaches what could and should be done to design, when high reliability is a must, a product with a predicted, specified (“prescribed”) and, if necessary, even controlled low probability of field failure.
Part 3: • Design for Reliability (DfR)
• Probabilistic Design for Reliability (PDfR): role, attributes, challenges, pitfalls
• Safety margin and safety factor
• Practical examples: assemblies subjected to thermal and/or dynamic loading
Part 4: • More general PDfR approach
• New Qualification Approaches Needed?
• One effective way to improve the existing QT practices and specifications
2. ASQ Reliability Division
Short Course Series
One of the monthly webinars on topics of interest to reliability engineers.
To view recorded webinars (available to ASQ Reliability Division members only), visit asq.org/reliability
To sign up for the live webinars, which are free and open to anyone, visit reliabilitycalendar.org and select English
Webinars to find links to register for upcoming events
http://reliabilitycalendar.org/The_Reliability_Calendar/Short_Courses/Short_Courses.html
3. PROBABILISTIC DESIGN for RELIABILITY (PDfR) CONCEPT,
the Roles of Failure Oriented Accelerated Testing (FOAT)
and Predictive Modeling (PM), and
a Novel Approach to Qualification Testing (QT)
“You can see a lot by observing”
Yogi Berra, American Baseball Player
“It is easy to see, it is hard to foresee”
Benjamin Franklin, American Scientist and Statesman
E. Suhir
Bell Laboratories, Physical Sciences and Engineering Research Division, Murray Hill, NJ (ret),
University of California, Dept. of Electrical Engineering, Santa Cruz, CA,
University of Maryland, Dept. of Mechanical Engineering, College Park, MD, and
ERS Co. LLC, 727 Alvina Ct. Los Altos, CA, 94024, USA
Tel. 650-969-1530, cell. 408-410-0886, e-mail: suhire@aol.com
Four hour ASQ-IEEE RS Webinar short course
Dr. E. Suhir
January 3-6, 2011 Page 1
4. Contents
Session I
1. Introduction: background, motivation, incentive
2. Reliability engineering as part of applied probability and Probabilistic Risk
Management (PRM) bodies of knowledge
3. Failure Oriented Accelerated Testing (FOAT): its role, attributes, challenges, pitfalls
and interaction with other accelerated test categories
Session II
4. Predictive Modeling (PM): FOAT cannot do without it
5. Example of a FOAT: physics, modeling, experimentation, prediction
Session III
6. Probabilistic Design for Reliability (PDfR), its role and significance
Session IV
7. General PDfR approach using probability density functions (pdf)
8. Twelve steps to be conducted to add value to the existing practice
9. Do electronic industries need new approaches to qualify their devices into products?
10. Concluding remarks
5. Session III
6. PROBABILISTIC DESIGN FOR RELIABILITY,
ITS ROLE and SIGNIFICANCE
“Probable is what usually happens”
Aristotle, Greek philosopher
“Probability is the very guide of life”
Marcus Tullius Cicero,
Roman philosopher and statesman
6. Design-for-Reliability
Design for reliability (DfR) is a set of approaches, methods and best practices
that are supposed to be used during the design phase of the product to minimize
the likelihood (risk) that the product will not meet the reliability requirements,
objectives and expectations.
While 50% of the total actual cost of an electronic product is due to the cost of
materials, 15% to the cost of labor, 30% to the overhead costs and only 5% to
the design effort, this effort influences about 70% of the total cost of the product
(“Six Sigma”, M. Harry and R. Schroeder).
If reliability is taken care of during the design phase, the final cost of the product
does not go up. If a reliability problem is detected during engineering, the cost of
the product goes up by a factor of 10. If the problem is caught in the production
phase, the cost of the product increases by a factor of 100 or more.
7. Deterministic approach
The deterministic approach is based on the concept that reliability is assured by
introducing a sufficiently high deterministic safety factor, which is defined as the ratio
of the capacity (“strength”) C of the system to the demand (“load”) D:
$$\delta = SF = \frac{C}{D}.$$
The level of the safety factor SF is chosen depending on the consequences of
failure, acceptable risks, the available and trustworthy information about the capacity
and the demand, the accuracy with which these characteristics are determined,
possible costs and social benefits, variability of materials and structural parameters,
construction (manufacturing, fabrication) procedures, etc.
In a particular problem the capacity and demand could be different from the strength
and load, and the role of these characteristics can be played by, say, acceptable and
actual current, voltage, light intensity, electrical resistance; traffic capacity and traffic
flow; culvert size and the quantity of water; critical (buckling) and actual compressive
stresses; etc.
The safety factors in engineering are established from previous experience
with the considered system in its anticipated environmental or operation conditions.
8. Probabilistic approach
The probabilistic DfR (PDfR) approach is based on the probabilistic risk management
(PRM) concept and, if applied broadly and consistently, brings the probability
measure (dimension) to each of the design characteristics of interest. Using AT data,
particularly FOAT data, and PM techniques, it enables one to establish the
probability of the possible (anticipated) failure under the given operation conditions
and for the given moment of time in operation.
After the probabilistic PMs are developed, one should use sensitivity analyses to
determine the most feasible materials and geometric characteristics of the design, so
that the lowest probability of failure is achieved.
In other cases, the probabilistic DfR approach enables one to find the most feasible
compromise between the reliability and cost effectiveness of the product.
When the probabilistic DfR (PRM) approach is used, the reliability criteria (specifications)
are based on the acceptable (allowable) probability of failure for the given product.
9. Basic Principles Underlying our PDfR Approach-1
Not all products require the PDfR approach, but only those for which high
reliability is crucial and for which there is reason to believe that the reliability
might not be high enough for particular applications.
Nobody and nothing is perfect. The difference between a reliable and an unreliable
system (device) is in the level of the probability of failure in the field under the
given (anticipated) loading (environmental) conditions and after the given
(specified) time in operation.
The probability of failure in the field is the ultimate and a “reliable” criterion
(“judge”) of the product’s reliability.
This probability can be established through a specially designed and carefully
conducted DFOAT aimed at understanding the physics of failure and choosing the
right predictive DFOAT model (e.g., Arrhenius, Coffin-Manson, crack propagation,
demand-capacity “interference”, etc.) for the anticipated loading conditions or
their combination (say, thermal+vibrations).
10. Basic Principles Underlying our PDfR Approach-2
The reliability of a product is due to the reliability of its one or two most vulnerable
(most unreliable) functional or structural elements, and it is for these elements that the
adequate DFOAT should be designed and conducted.
Sensitivity analyses are a must after the physics of the anticipated failure is
established, the appropriate predictive model is agreed upon, and the acceptable
probability of failure in the field is specified, but prior to the final decision about
launching the mass production of the product.
DFOAT is not necessarily a destructive test, but is always a test to failure, a test to
determine the limits of reliable operation and the probability that these limits are
exceeded.
DFOAT cannot do without predictive modeling, and it is only through predictive
modeling that the probability of failure in the field could be found (established).
11. Basic Principles Underlying our PDfR Approach-3
Time- and labor-consuming a-posteriori “statistics-of-failure” can be successfully
replaced, to a great extent, by the anticipated a-priori “probability-of-failure”
confirmed by some statistical data (for the mean and STD values of the probability
distribution of interest, but not for the probability-distribution function itself).
The PDfR concept enables one to qualify a viable device (system) into a reliable-in-the-field
product, with the predicted, prescribed (specified) and even, if necessary,
controlled probability of failure in the field.
Technical diagnostics, prognostication and health monitoring could be effective
means to anticipate, establish and prevent possible field failures.
PDfR has to do with the DfR, and not with the Manufacturing-for-Reliability (MfR).
Burn-ins could be viewed as a special type of FOAT intended for MfR objectives
and are always a must, whatever DfR approach is considered.
12. Reliability function
The simplest objects (items) in reliability engineering are those that do not lend
themselves to restoration (repair) and have to be replaced after the first failure.
The reliability of such items is due entirely to their dependability, i.e., probability
of non-failure, which is the probability that no failure occurs during
the given period of time. The dependence of this probability on time is known as
the reliability function.
As with any other probability, the dependability of a sufficiently large population of
non-repairable items can be substituted by the frequency, and therefore the
reliability function can be sought as
$$R(t) = \frac{N_s(t)}{N_0},$$
where $N_0$ is the total number of items being tested and $N_s(t)$ is the number of
items that are still sound by the time t.
13. Failure rate
Differentiating the relationship
$$R(t) = \frac{N_s(t)}{N_0}$$
with respect to time t, we have:
$$\frac{dR(t)}{dt} = \frac{1}{N_0}\frac{dN_s(t)}{dt} = -\frac{1}{N_0}\frac{dN_f(t)}{dt},$$
where $N_f(t) = N_0 - N_s(t)$ is the number of the failed items.
The failure rate is introduced as follows:
$$\lambda(t) = \frac{1}{N_s(t)}\frac{dN_f(t)}{dt}.$$
As evident from this formula, the failure rate is the ratio of the number of items that
fail per unit time at the time t to the number of items that remain sound by this time. The
failure rate characterizes the change in the dependability of an item in the course of its
lifetime.
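The statistical definitions above can be tried on test data. The following is a minimal sketch, not the author's procedure: the function name, the failure times and the finite-difference window `dt` are all hypothetical, and the derivative of the number of failed items is approximated by a forward difference.

```python
from bisect import bisect_right


def reliability_and_failure_rate(failure_times, n_total, t, dt):
    """Estimate R(t) = N_s(t)/N_0 and lambda(t) = (dN_f/dt)/N_s(t) from test data."""
    times = sorted(failure_times)
    n_failed_t = bisect_right(times, t)          # N_f(t): items failed by time t
    n_failed_next = bisect_right(times, t + dt)  # N_f(t + dt)
    n_sound = n_total - n_failed_t               # N_s(t): items still sound at t
    reliability = n_sound / n_total
    if n_sound == 0:
        return reliability, float("nan")
    # Forward-difference estimate of dN_f/dt, divided by N_s(t)
    failure_rate = (n_failed_next - n_failed_t) / dt / n_sound
    return reliability, failure_rate


# Hypothetical test: 10 items, 6 observed failure times (hours)
failures = [120.0, 340.0, 560.0, 610.0, 900.0, 1500.0]
R, lam = reliability_and_failure_rate(failures, n_total=10, t=500.0, dt=200.0)
print(f"R(500 h) = {R:.3f}, lambda(500 h) = {lam:.5f} 1/h")
```

With so few items the estimates are crude; the frequency substitutes for the probability only for a sufficiently large population.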
15. Probabilistic and statistical definitions
of the reliability function
Considering $\lambda(t) = \frac{1}{N_s(t)}\frac{dN_f(t)}{dt}$, the formula
$$\frac{dR(t)}{dt} = \frac{1}{N_0}\frac{dN_s(t)}{dt} = -\frac{1}{N_0}\frac{dN_f(t)}{dt}$$
yields:
$$\frac{dR(t)}{dt} = -\lambda(t)\frac{N_s(t)}{N_0} = -\lambda(t)R(t), \quad \text{or} \quad \frac{dR(t)}{R} = -\lambda(t)\,dt,$$
so that
$$\ln R(t) = -\int_0^t \lambda(\tau)\,d\tau.$$
Hence,
$$R(t) = \exp\left[-\int_0^t \lambda(\tau)\,d\tau\right].$$
The reliability function R(t) satisfies the obvious initial condition R(0)=1. The
above formula for the reliability function expresses the probabilistic definition of
this function, while the formula
$$R(t) = \frac{N_s(t)}{N_0}$$
provides its statistical definition.
16. Exponential formula of reliability (revisited).
Probability of failure
When the failure rate is time independent, the formula
$$R(t) = \exp\left[-\int_0^t \lambda(\tau)\,d\tau\right]$$
leads to the exponential formula of reliability:
$$R(t) = e^{-\lambda t}.$$
The function
$$f(t) = -\frac{dR(t)}{dt} = \lambda(t)\exp\left[-\int_0^t \lambda(\tau)\,d\tau\right]$$
is the probability density function for the flow of failures, or the failure frequency.
The probability of a failure during the time t can be evaluated as
$$Q(t) = 1 - R(t) = \int_0^t f(\tau)\,d\tau.$$
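As an illustration of the formulas on this slide, the sketch below evaluates R(t) by numerically integrating a failure rate and then forms Q(t) = 1 − R(t). The function name, the trapezoidal rule and both failure-rate values are assumptions chosen for illustration only.

```python
import math


def reliability(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral_0^t lambda(tau) d tau), using the trapezoidal rule."""
    h = t / steps
    integral = 0.5 * (failure_rate(0.0) + failure_rate(t)) * h
    integral += sum(failure_rate(i * h) for i in range(1, steps)) * h
    return math.exp(-integral)


t = 10_000.0  # hours in operation (hypothetical)

# Constant failure rate: reduces to the exponential formula R = exp(-lambda*t)
lam = 2.0e-5  # 1/hour, hypothetical
R_const = reliability(lambda tau: lam, t)
Q_const = 1.0 - R_const  # probability of failure during the time t

# Linearly increasing (wear-out type) failure rate lambda(tau) = a*tau,
# for which R(t) = exp(-a*t^2/2)
a = 1.0e-9  # 1/hour^2, hypothetical
R_wearout = reliability(lambda tau: a * tau, t)

print(f"constant rate: R = {R_const:.6f}, Q = {Q_const:.6f}")
print(f"increasing rate: R = {R_wearout:.6f}")
```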
17. Stress-strength (“interference”) concept
The curve on the right should be obtained experimentally, based on accelerated life testing and
on the accumulated experience. The bearing capacity of the structure should be such that the
probability of failure, P(t), is sufficiently low, and the safety factor (SF) is not lower than the
specified value, say, SF=1.4. In a simplified analysis the curve on the right could be substituted,
in particular, by a constant value, which, if a conservative approach is taken, should be sufficiently
low.
[Figure: two overlapping probability density curves, the demand (left) and the capacity (right).]
Left curve (“Demand”, D): probability density function for a particular mechanical or thermal
characteristic (response) of the tile structure to the given environmental factor
at the given moment of time.
Right curve (“Capacity”, C): capability of the tile structure with respect to the
particular mechanical or thermal loading (may or may not be time-dependent). In the current analysis
we assume that the bearing capacity for a particular
reliability characteristic is either a constant value or a
normally distributed random variable with a known
(evaluated) mean and standard deviation.
The larger the overlap of these two curves, the higher the probability of failure and the lower
the safety factor. After these two curves are evaluated (established) for each reliability characteristic of
interest and for each moment of time (separately for the take-off and landing processes), we evaluate
the probability density function, f(ψ), for the safety margin, ψ=C-D, its mean, <ψ>, and standard
deviation, ŝ, and the safety factor, SF= <ψ>/ŝ. It should not be lower than the specified value, say,
SF=1.4.
18. Probability of non-failure (dependability)
The “reliability” (actually, “dependability”) of a non-repairable item is defined as the
probability of non-failure, P = P {C>D}, i.e., as the probability that the item’s bearing
capacity (“strength”), C, during the time, t, of operation under the given stress
conditions, will always be greater than the demand (“loading”), D.
Although the probability of failure is never zero, it can be made, if a probabilistic
approach is used, as low as necessary. If the probability distributions f (C) and g (D)
(probability density functions) for the random variables C and D are known, then the
probability, P, of non-failure (reliability, dependability) can be evaluated as
$$P = \int_0^\infty f_\psi(\psi)\,d\psi,$$
where $f_\psi(\psi)$ is the probability density function of the margin of safety ψ=C-D, which is
also a random variable.
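A minimal sketch of this integral for one special case: it assumes that the capacity C and the demand D are independent and normally distributed, so that the margin ψ = C − D is also normal and the integral from zero to infinity is available in closed form. The function name and the numerical values are hypothetical.

```python
import math
from statistics import NormalDist


def probability_of_non_failure(mean_c, std_c, mean_d, std_d):
    """P{C > D} for independent, normally distributed capacity C and demand D."""
    mean_margin = mean_c - mean_d           # mean of psi = C - D
    std_margin = math.hypot(std_c, std_d)   # sqrt(std_c^2 + std_d^2)
    # P = integral of f_psi from 0 to infinity = 1 - F_psi(0)
    p = 1.0 - NormalDist(mean_margin, std_margin).cdf(0.0)
    return mean_margin, std_margin, p


# Hypothetical capacity and demand, in the same (arbitrary) stress units
mean_psi, std_psi, P = probability_of_non_failure(100.0, 8.0, 70.0, 6.0)
print(f"<psi> = {mean_psi:.1f}, std = {std_psi:.1f}, P = {P:.6f}")
```

For other distributions of C and D the density of the margin would have to be obtained by convolution or by simulation.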
19. Safety factor -1
Direct use of the probability of non-failure is often inconvenient, since, for highly
reliable items, this probability is expressed by a number which is very close to one,
and, for this reason, even significant changes in the item’s (system’s) design, which have
an appreciable impact on the item’s reliability, may have a minor effect on the
probability of non-failure.
In those cases when both the mean value, <ψ>, and the standard deviation, ŝ, of the
margin of safety (or any other suitable characteristic of the item’s reliability, such as
stress, temperature, displacement, affected area, etc.), are available, the safety factor
(safety index, reliability index)
SF=δ= <ψ>/ŝ
can be used as a suitable reliability criterion.
20. Safety factor-2
After the capacity and the demand curves are established for each reliability
characteristic of interest and for each moment of time, the probability density
function $f(\psi)$ for the safety margin $\psi = C - D$ should be determined. Then,
for normally distributed capacity and demand, the mean value
$$\langle\psi\rangle = \int_0^\infty f(\psi)\,\psi\,d\psi$$
of the safety margin and its standard deviation
$$s_\psi = \sqrt{\int_0^\infty f(\psi)\,(\psi - \langle\psi\rangle)^2\,d\psi}$$
should be evaluated.
The safety factor could be found as the ratio of the mean value of the safety margin
to its standard deviation:
$$SF = \delta = \frac{\langle\psi\rangle}{s_\psi}.$$
21. Safety factor-3
The SF should not be lower than its specified value for the characteristic of interest.
This value should reflect the state-of-the-art in the given area of engineering, cost and
time-to-market considerations, and should account for the consequences of failure.
If the computed SF does not meet the specification requirements, the design should be
revised (improved) until the required level of safety (reliability) is met.
The required level of safety could also be established based on the level of the
probability
$$P(\psi) = \int_\psi^\infty f(\psi)\,d\psi$$
of non-failure. This formula defines the probability that the safety margin ψ=C−D
is found between the given value and infinity, i.e., is higher than the given (specified)
value of this margin.
22. The SF and the probability P(ψ) of exceeding a certain level of the safety margin
are related
If the reliability characteristic of interest (such as, e.g., the safety margin,
ψ) is distributed in accordance with the normal law
$$f_\psi(\psi) = \frac{1}{\sqrt{2\pi D_\psi}}\exp\left[-\frac{(\psi - \langle\psi\rangle)^2}{2D_\psi}\right],$$
where $D_\psi$ is the variance of ψ, then the probability of non-failure is related to the safety factor
SF as
$$P = \frac{1}{2}\left[1 + \Phi\left(\frac{SF}{\sqrt{2}}\right)\right],$$
where
$$\Phi(\alpha) = \frac{2}{\sqrt{\pi}}\int_0^\alpha e^{-t^2}\,dt$$
is the probability integral (Laplace function).
P          SF
0.999000   3.0901
0.999900   3.7194
0.999990   4.5255
0.999999   4.7518
1.0        ∞
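The relation between P and SF can be evaluated in both directions with a few lines of code. This sketch assumes a normally distributed safety margin with failure defined as a non-positive margin, in which case the probability of non-failure equals the standard normal cumulative distribution function evaluated at SF, that is ½[1 + erf(SF/√2)]. The function names are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # zero mean, unit standard deviation


def p_from_sf(sf):
    """Probability of non-failure for a normal margin with safety factor sf."""
    return _STANDARD_NORMAL.cdf(sf)


def sf_from_p(p):
    """Safety factor needed to reach the probability of non-failure p (0 < p < 1)."""
    return _STANDARD_NORMAL.inv_cdf(p)


for p in (0.999, 0.9999):
    print(f"P = {p}: SF = {sf_from_p(p):.4f}")
print(f"SF = 3.0: P = {p_from_sf(3.0):.6f}")
```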
23. Safety factor-4
SF establishes both the upper limit of the reliability characteristic of interest
(through the mean value of the corresponding margin of safety) and the accuracy
with which this characteristic is defined (through the corresponding standard
deviation).
The structure of the SF indicates that it is acceptable that a system characterized by
a high mean value of the safety margin (i.e., a system whose bearing capacity with
respect to a certain stress/reliability-characteristic, not necessarily mechanical, is
significantly higher than the level of loading) has a less accurately defined deviation
from this mean value than a system characterized by a low mean value of the safety
margin (i.e., a system whose bearing capacity is much closer to the possible level of
loading). In other words, the uncertainty in the evaluation of the safety margin
should be smaller for a more vulnerable design.
24. Safety factor (SF) and coefficient of variability (COV)
The safety factor (SF) is the reciprocal of the coefficient of variability (COV). The latter is
defined as the ratio of the standard deviation to the mean value of the random
variable of interest.
While the COV is the characteristic of uncertainty of the random variable of
interest, the SF is the characteristic of certainty of the random parameter (stress-
at-failure, the highest possible temperature, the ultimate displacement, the
affected area, etc.) that is responsible for the non-failure of the item.
If the reliability characteristic of interest (for a non-repairable item) is a random
variable that is determined by just two independent non-random quantities (say,
the mean value and the standard deviation), then the safety factor, SF, determines
completely the probability of non-failure (reliability): the larger the SF is, the
higher is the probability of non-failure.
25. Time-to-failure (TTF), MTTF and the corresponding SF
Usually the capacity (strength), C, and/or the demand (loading), D, change in time.
Failure occurs when the demand (loading), D, becomes equal to or greater than the
bearing capacity (strength), C, of the item. The moment of this random event is the time-at-failure
(TAF), and the duration of operation until this moment is the random variable
known as time-to-failure (TTF).
Thus, TTF is the time from the beginning of operation until the moment of time when
the demand (loading) D becomes equal to or higher than the bearing capacity C, i.e.,
when the safety margin ψ=C−D becomes zero or negative.
The corresponding safety factor, SF, is the ratio of the MTTF to the STD of the TTF:
SF=MTTF/STD
26. Mean time-to-failure and reliability function
Mean-time-to-failure (MTTF) is the mean time of the item operation until it fails.
Hence, it can be computed as
$$\langle t\rangle = \int_0^\infty f(t)\,t\,dt.$$
Since $f(t) = -\frac{dR(t)}{dt}$, we have (using integration by parts):
$$\langle t\rangle = \int_0^\infty f(t)\,t\,dt = -\int_0^\infty \frac{dR(t)}{dt}\,t\,dt = -\left[R(t)\,t\right]_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt,$$
and the variance of the TTF can be found as
$$D_t = \int_0^\infty f(t)\,(t - \langle t\rangle)^2\,dt = \int_0^\infty f(t)\,t^2\,dt - \langle t\rangle^2 = 2\int_0^\infty R(t)\,t\,dt - \langle t\rangle^2.$$
The corresponding SF is
$$\delta = SF = \frac{MTTF}{STD} = \frac{\langle t\rangle}{\sqrt{D_t}}.$$
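The two integrals over the reliability function lend themselves to a quick numerical check. The sketch below is an illustration under stated assumptions: the function name, the trapezoidal rule, the truncation of the infinite upper limit at `t_max`, and the exponential reliability function with a hypothetical failure rate are all choices made here, not part of the course material.

```python
import math


def mttf_std_sf(reliability, t_max, steps=100_000):
    """MTTF = int R dt, variance = 2 int R t dt - MTTF^2 (integrals truncated at t_max)."""
    h = t_max / steps
    mean = 0.0
    second_moment = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoidal end-point weights
        r = reliability(t)
        mean += weight * r * h
        second_moment += weight * 2.0 * r * t * h
    std = math.sqrt(second_moment - mean ** 2)
    return mean, std, mean / std


lam = 1.0e-3  # hypothetical constant failure rate, 1/hour
mttf, std, sf = mttf_std_sf(lambda t: math.exp(-lam * t), t_max=40.0 / lam)
print(f"MTTF = {mttf:.2f} h, STD = {std:.2f} h, SF = {sf:.4f}")
```

For the exponential law both the MTTF and the STD equal 1/λ, so the computed safety factor should come out close to one.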
27. Example #1
As a simple example, examine a device whose MTTF, τ, during steady-state operation is described
by the Boltzmann-Arrhenius equation
$$\tau = \tau_0 \exp\left(\frac{U}{kT}\right).$$
The failure rate is therefore
$$\lambda = \frac{1}{\tau} = \frac{1}{\tau_0}\exp\left(-\frac{U}{kT}\right).$$
If the Weibull law is used to predict the probability of failure, then the probability
of non-failure (dependability) can be evaluated on the basis of the following probability distribution
function:
$$P = \exp\left[-(\lambda t)^\beta\right] = \exp\left[-\left(\frac{t}{\tau_0}\right)^\beta \exp\left(-\frac{\beta U}{kT}\right)\right],$$
where β is a shape parameter. Solving
this equation for the absolute temperature T, we obtain:
$$T = -\frac{U}{k\,\ln\left[\frac{\tau_0}{t}\,(-\ln P)^{1/\beta}\right]}.$$
28. Example #1 (cont)
Let, for the given type of failure (say, surface charge accumulation), the ratio $U/k$ be $U/k = 11600$ K,
the $\tau_0$ value predicted on the basis of the ALT be $\tau_0 = 5\times10^{-8}$ hours, and the shape parameter β
turn out to be close to β = 2 (Rayleigh distribution). Let the allowable (specified) probability of
failure at the end of the device’s service time of, say, t = 40,000 hours be $Q = 10^{-5}$ (it is acceptable
that one out of a hundred thousand devices fails). Then the above formula, with P = 1 − Q, indicates that the steady-state
operation temperature should not exceed T = 349.8 K = 76.8 °C, and the thermal management
tools should be designed accordingly. This rather elementary example gives a feeling of how the
PDfR concept works and what kind of information one could expect using it.
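The arithmetic of this example can be reproduced with a short script. It is only a sketch: the function and parameter names are hypothetical, and U/k is taken as 11600 K (an activation energy of about 1 eV), the value that reproduces the quoted temperature of 349.8 K.

```python
import math


def max_operating_temperature(u_over_k, tau0, t, q_allowed, beta):
    """Solve P = exp[-(t/tau0)^beta * exp(-beta*U/(k*T))] for T, with P = 1 - Q."""
    minus_ln_p = -math.log1p(-q_allowed)  # -ln(1 - Q), accurate for small Q
    argument = (tau0 / t) * minus_ln_p ** (1.0 / beta)
    return -u_over_k / math.log(argument)


T = max_operating_temperature(
    u_over_k=11600.0,  # U/k in kelvin (assumed, see the note above)
    tau0=5.0e-8,       # hours
    t=40_000.0,        # service time, hours
    q_allowed=1.0e-5,  # allowable probability of failure
    beta=2.0,          # Weibull shape parameter (Rayleigh case)
)
print(f"T_max = {T:.1f} K")
```

Rerunning the function with a different allowable probability of failure or service time shows how sensitive the temperature limit is to the specification.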
29. Example #2
Let, for instance, the absolute temperature T be distributed in accordance with the
Rayleigh law, so that the probability that a certain level $T_*$ is exceeded is
determined as
$$P(T > T_*) = \exp\left(-\frac{T_*^2}{T_0^2}\right),$$
where $T_0$ is the most likely value of the absolute temperature T. Then, using the
Boltzmann-Arrhenius relationship
$$\tau = \tau_0 \exp\left(\frac{U_a}{kT}\right),$$
we conclude that the probability that the random MTTF τ (“random”, because
of the uncertainty in the level of the most likely temperature) is below a certain level $\tau_*$
(probability of failure is defined in this case as the probability that the specified level
is not achieved) can be found as
30. Example #2 (cont)
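$$P(\tau < \tau_*) = \exp\left(-\frac{T_*^2}{T_0^2}\right) = \exp\left[-\left(\frac{U_a}{kT_0 \ln\frac{\tau_*}{\tau_0}}\right)^2\right],$$
where $T_* = U_a / \left[k \ln(\tau_*/\tau_0)\right]$ is the temperature at which the MTTF equals $\tau_*$.
Solving this equation for the most likely (specified) $T_0$ value, we find:
$$T_0 = \frac{U_a}{k \ln\frac{\tau_*}{\tau_0}\,\sqrt{-\ln P}}.$$
This formula indicates how the (most likely) level of the device temperature should be
established, so that the probability that the specified level $\tau_*$ of the MTTF is not
achieved is sufficiently low.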
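A small sketch of this result, with a round-trip check: the first function evaluates the probability that the MTTF falls below the specified level for a given temperature parameter, and the second inverts it. It assumes the temperature law P(T > T*) = exp(−T*²/T0²) used in this example; all names and numerical inputs are hypothetical.

```python
import math


def probability_mttf_below(t0, ua_over_k, tau_star, tau0):
    """P(tau < tau_star) = exp{-[Ua / (k*T0*ln(tau_star/tau0))]^2}."""
    t_star = ua_over_k / math.log(tau_star / tau0)  # temperature at which tau = tau_star
    return math.exp(-((t_star / t0) ** 2))


def temperature_parameter(p_allowed, ua_over_k, tau_star, tau0):
    """T0 = Ua / [k * ln(tau_star/tau0) * sqrt(-ln P)]."""
    return ua_over_k / (math.log(tau_star / tau0) * math.sqrt(-math.log(p_allowed)))


# Hypothetical inputs: Ua/k in kelvin, times in hours
ua_over_k, tau0, tau_star, p_allowed = 11600.0, 5.0e-8, 40_000.0, 1.0e-5
T0 = temperature_parameter(p_allowed, ua_over_k, tau_star, tau0)
p_check = probability_mttf_below(T0, ua_over_k, tau_star, tau0)
print(f"T0 = {T0:.1f} K, P(tau < tau*) = {p_check:.2e}")
```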
31. Reliability of repairable items
Reliability of complex items (products) depends not only on their dependability,
but on their repairability as well.
It is important that the products are designed in such a way that their gradual and
potential failures could be easily detected and eliminated in due time, and that the
detected damages (defects), such as, say, fatigue cracks, could be removed
before a catastrophic failure process commences.
The reliability of complex products is characterized, first of all, by their
availability, which is defined as an ability of an item (system) to perform its
required function at the given time or over a stated period of time, with
consideration of its dependability, repairability, maintainability and maintenance
support.
A high level of reliability of complex products can be achieved by employing the
most feasible combination of dependability, on the one hand, and
repairability, maintainability and maintenance support, on the other.
32. Availability index-1
The non-steady-state (time-dependent) operational availability index K(t) is defined
as the probability that the item of interest will be available to the user at the given
moment t of time and will operate failure-free during the given time T beginning
with this moment.
The steady-state availability index K is the time-independent probability that the
item will operate (will be available) failure-free during the time T, beginning with
an arbitrary moment t of time that is sufficiently remote from the beginning of
operations (so that the “infant mortality” portion of the “bathtub” curve is
excluded).
The most often used availability characteristic of the Class II and Class III items,
whose normal operation includes regular repairs (say, workstations or other
complex and expensive electronic systems), is the availability index $K_a$, defined
as the steady-state probability that the item will be available at an arbitrary
moment of time taken between the preplanned preventive maintenance activities.
33. Availability index-2
The availability index $K_a$ can be computed by the formula
$$K_a = \frac{1}{1 + \sum_{i=1}^{n} \frac{t_i^{r}}{t_i^{f}}},$$
where $t_i^{f}$ is the mean time between successive failures for the i-th item in the
system, and $t_i^{r}$ is the mean-time-to-repair for this item.
The index $K_a$ indicates the percentage of time, during which the system is in the
working (available) condition.
The use of the index $K_a$ enables one to make assessments of the unforeseen
idle times and to consider these times at early stages of the design of the product.
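The availability index formula translates directly into code. In this sketch the function name and the mean times of the three items are hypothetical.

```python
def availability_index(items):
    """K_a = 1 / (1 + sum of t_repair/t_fail over the items of the system).

    items: iterable of (mean time between failures, mean time to repair) pairs,
    both in the same time units.
    """
    return 1.0 / (1.0 + sum(t_repair / t_fail for t_fail, t_repair in items))


# Hypothetical system of three items: (mean time between failures, mean time to repair), hours
system = [(2000.0, 10.0), (5000.0, 25.0), (1000.0, 5.0)]
K_a = availability_index(system)
print(f"K_a = {K_a:.4f}")
```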
34. Operational Availability Index
The operational availability index K (t ) can be calculated for situations,
when the probability of failure-free operation during the time interval t is
independent of the beginning of this interval, by the formula
$$K(t) = K_a R(t),$$
where R(t) is the dependability of the item.
This formula determines the probability that two events take place:
1) the item is available at the arbitrary moment of time with the probability Ka and
2) will operate failure-free during the time period of the duration t.
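A brief sketch of the operational availability index as the product of the two probabilities. The exponential dependability R(t) = exp(−λt) is an assumption made here for illustration, as are the function name and the numerical values.

```python
import math


def operational_availability(k_a, failure_rate, t):
    """K(t) = K_a * R(t), with R(t) = exp(-failure_rate * t) assumed for illustration."""
    return k_a * math.exp(-failure_rate * t)


# Hypothetical values: availability index, failure rate (1/hour), mission duration (hours)
K = operational_availability(k_a=0.985, failure_rate=1.0e-4, t=100.0)
print(f"K(100 h) = {K:.4f}")
```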
35. Session IV
7. GENERAL PDfR APPROACH
USING PROBABILITY DENSITY FUNCTIONS (PDF)
“Education is man’s going forward from cocksure ignorance to thoughtful uncertainty”,
Donald B. Clark, Australian author, “Scrapbook”
“There are things in this world, far more important than the most splendid discoveries –
It is the methods by which they were made”
Gottfried Leibnitz, German mathematician
36. PDfR Characteristics
The appropriate electrical, optical, mechanical, thermal, and other physical
characteristics that determine the functional performance, mechanical
(physical/structural) reliability and/or environmental durability of the
design/device/apparatus of interest should be established.
Examples are: appropriate electrical parameters (current, voltage, etc.), light
output, heat transfer capability, mechanical ultimate and fatigue strength, fracture
toughness, maximum and/or minimum temperatures, maximum
accelerations/decelerations, etc.
37. Factors that affect the PDfR characteristics-1
Establish the electrical, optical, mechanical, thermal, environmental and other
possible (say, human) stress (loading) factors (conditions) that might affect the
reliability characteristics, i.e., characteristics that determine (affect) the short- and
long-term reliability of the object (structure) of interest.
Examples are: high and/or low temperatures, high electrical current or voltage,
electrical and/or optical properties of materials, mechanical and thermal stresses,
displacements, maximum temperatures, size of the affected areas, etc.
This should be done separately for each characteristic of interest and, if necessary,
for each manufacturing process and for different phases of manufacturing, testing
and/or operations.
38. Factors that affect the PDfR characteristics-2
Based on the physical nature of the particular environmental/loading factor
(electrical, optical, mechanical, environmental) and on the available information about
it, establish whether this factor should be treated as a non-random (deterministic) value,
or should/could be treated as a random variable with the given (assumed)
probability distribution function.
At this stage one could treat random characteristics of interest as nonrandom
functions of random factors, and establish the probability distribution functions
for the random factors using experimental data, and/or Monte-Carlo simulations,
and/or finite-element analyses (FEA), and/or evaluations based on analytical
(“mathematical”) modeling, etc.
39. Factors that affect the PDfR characteristics-3
Let, for instance, the absolute temperature T be distributed in accordance with the
Rayleigh law, so that the probability that a certain level $T_*$ is exceeded is
determined as
$$P(T > T_*) = \exp\left(-\frac{T_*^2}{T_0^2}\right),$$
where $T_0$ is the most likely value of the absolute temperature T.
Then, using the Boltzmann-Arrhenius relationship
$$\tau = \tau_0 \exp\left(\frac{U_a}{kT}\right),$$
we conclude that the probability that the random mean-time-to-failure τ (“random”,
because of the uncertainty in the level of the most likely temperature) is
below a certain level $\tau_*$
40. Factors that affect the PDfR characteristics-4
(probability of failure, which is defined in this case as the probability that the specified
level is not achieved) can be found as
$$P(\tau < \tau_*) = \exp\left(-\frac{T_*^2}{T_0^2}\right) = \exp\left[-\left(\frac{U_a}{kT_0 \ln\frac{\tau_*}{\tau_0}}\right)^2\right].$$
Solving this equation for the $T_0$ value, we find:
$$T_0 = \frac{U_a}{k \ln\frac{\tau_*}{\tau_0}\,\sqrt{-\ln P}}.$$
This formula indicates how the (most likely) level of the device temperature should be
established, so that the probability that the specified level $\tau_*$ of the MTTF is not
achieved is sufficiently low.
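Treating the MTTF as a non-random function of a random temperature can also be checked by simulation. The sketch below is a Monte-Carlo illustration under stated assumptions: the temperature is sampled by inverse transform from the law P(T > T*) = exp(−T*²/T0²) used above, the comparison of τ with τ* is done on logarithms to avoid overflow at very low sampled temperatures, and all names and numerical values are hypothetical.

```python
import math
import random


def simulate_probability_mttf_below(t0, ua_over_k, tau0, tau_star, n=200_000, seed=0):
    """Monte-Carlo estimate of P(tau < tau_star), tau = tau0*exp(Ua/(k*T))."""
    rng = random.Random(seed)
    log_tau0, log_tau_star = math.log(tau0), math.log(tau_star)
    below = 0
    for _ in range(n):
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        temp = t0 * math.sqrt(-math.log(u))    # inverse-transform sample of T
        # Compare ln(tau) with ln(tau_star); temp == 0 corresponds to an infinite tau
        if temp > 0.0 and log_tau0 + ua_over_k / temp < log_tau_star:
            below += 1
    return below / n


# Hypothetical inputs: temperatures in kelvin, times in hours
t0, ua_over_k, tau0, tau_star = 400.0, 11600.0, 5.0e-8, 8000.0
p_simulated = simulate_probability_mttf_below(t0, ua_over_k, tau0, tau_star)
t_star = ua_over_k / math.log(tau_star / tau0)
p_closed_form = math.exp(-((t_star / t0) ** 2))
print(f"simulated: {p_simulated:.4f}, closed form: {p_closed_form:.4f}")
```

The two numbers should agree to within the sampling error, which for this sample size is on the order of 0.001.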
41. Choose appropriate basic probability distributions-1
After the reliability characteristics are established and the factors affecting these
characteristics are selected, one should choose the adequate probability
distributions for the factors (conditions) that affect the short- and long-term
reliability characteristics.
For those factors (conditions) that should be treated as random variables,
establish (accept) the physically meaningful probability distribution laws.
When the actual experimental information is not available, assume, based on
general physical considerations, the most suitable (or the most conservative)
laws of the probability distribution (e.g., uniform, exponential, normal, Weibull,
Rayleigh, etc.).
42. Choose appropriate basic probability distributions-2
Here are some general considerations that can be used in practical applications.
Since the exponential distribution has the largest entropy (the largest uncertainty)
of all the distributions with the same mean, this distribution should be considered, if
no other information, except the expected (mean) value, is available. The
exponentially distributed random variable is always positive. The safety factor for an
exponentially distributed random variable is always “one”.
If the random process of failures can be treated as a simple Poisson flow with a
constant intensity, then the time interval between two adjacent consecutive failures
has an exponential distribution. The most likely value of the exponentially distributed
random variable, t, is at the initial moment of time t=0.
43. Choose appropriate basic probability distributions-3
If the physical nature of a random environmental factor is such that it can be only
positive (e.g., acceleration during take-off of an aircraft, or a current for an
electronic module) or only negative (e.g., deceleration during landing or during
drop tests of a cell phone), its most likely value is certainly non-zero.
If only this value (or the mean) is available, then the Rayleigh law could be
employed. This law is also (like the exponential law) a single-parametric law.
The safety factor, when the Rayleigh distribution is used, is always
$\delta = \frac{1}{\sqrt{1 + \frac{4}{\pi}}} = 0.6633$
44. Choose appropriate basic probability distributions-4
If a normally distributed random variable has a finite variance and zero mean, and
changes periodically with a constant or next-to-constant frequency, but with a
random amplitude and random phase angle, then its amplitudes obey the Rayleigh
law of distribution (the corresponding energies, being proportional to the squared
amplitudes, are then exponentially distributed).
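A short simulation can illustrate the statement about amplitudes. In this sketch (an illustration under assumed values; `sigma`, the sample size and the function name are not from the lecture) the process is written as x(t) = a cos(wt) + b sin(wt) with independent zero-mean normal a and b, so that the amplitude is sqrt(a^2 + b^2). The sample mean of the amplitude is compared with the Rayleigh mean, sigma*sqrt(pi/2), and the mean squared amplitude (proportional to the energy) with 2*sigma^2.

```python
import math
import random
import statistics

def amplitude_samples(sigma, n, seed=1):
    """Amplitudes sqrt(a^2 + b^2) of x(t) = a*cos(wt) + b*sin(wt), a, b ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]

sigma = 2.0  # hypothetical standard deviation of the normal process
amps = amplitude_samples(sigma, 100_000)

mean_amp = statistics.fmean(amps)
rayleigh_mean = sigma * math.sqrt(math.pi / 2.0)  # theoretical mean of the Rayleigh law
mean_energy = statistics.fmean([a * a for a in amps])  # mean squared amplitude

print(f"sample mean amplitude = {mean_amp:.3f}, Rayleigh mean = {rayleigh_mean:.3f}")
print(f"mean squared amplitude = {mean_energy:.3f}, 2*sigma^2 = {2.0 * sigma**2:.3f}")
```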
If the expected (mean) value and the variance are known, and the physical nature
of the random environmental factor is such that the probability density function is
symmetric with respect to the mean value (which coincides with the median and
the most likely value), then the normal distribution should be accepted, especially
(but not necessarily) if the random variable can be either positive or negative.
45. Choose appropriate basic probability distributions-5
It is noteworthy that if the safety factor, defined as the ratio of the mean value of
the safety margin to its standard deviation, is significant (which is typically the
case), then application of the normal law of the distribution of the safety factor is
acceptable: its negative values, although possible in principle, are
characterized by negligibly low probabilities and need not be considered.
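How quickly the probability of a negative value drops can be seen from the standard normal distribution function. The sketch below assumes a normally distributed quantity whose mean-to-standard-deviation ratio is the safety factor; the listed ratios are illustrative choices, not values from the lecture.

```python
import math

def prob_negative(safety_factor):
    """P(value < 0) for a normal variable with mean/std equal to `safety_factor`."""
    # Standard normal distribution function evaluated at -safety_factor.
    return 0.5 * math.erfc(safety_factor / math.sqrt(2.0))

for sf in (1.0, 2.0, 3.0, 5.0):  # hypothetical safety factors
    print(f"mean/std = {sf:.0f}: P(negative value) = {prob_negative(sf):.2e}")
```

For a ratio of three the probability is already of the order of one in a thousand, and it falls off rapidly beyond that.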
If the expected (mean) value and the variance are known, and the physical nature
of the random environmental factor is such that the probability density function is
highly asymmetric (skewed) with respect to its mean or the most likely value,
then the Weibull distribution, or the distribution of the absolute value of a normal
random variable, or a truncated normal distribution, or a log-normal distribution
can be used.
46. Establish appropriate
cumulative probability distributions-1
Treating each reliability characteristic of interest as a non-random function
(output) of a random argument (input) due to a particular external or internal
factor, evaluate the probability density function of this characteristic for the
assumed (accepted, determined) law of the probability distribution of the
environmental factor.
Time could enter as an independent parameter into the computed response.
For some factors, the input could be considered as a non-random (deterministic)
value.
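The step of treating a characteristic as a non-random function of a random argument can be sketched with the usual change-of-variables rule, f_Y(y) = f_X(x(y)) |dx/dy|. The example below is hypothetical: it assumes a Rayleigh-distributed input x and a quadratic response y = c*x^2 (neither is taken from the lecture), evaluates the resulting density, integrates it numerically, and compares the result with a direct Monte Carlo simulation.

```python
import math
import random

def rayleigh_pdf(x, scale):
    """Probability density of the (assumed) Rayleigh-distributed input."""
    if x < 0:
        return 0.0
    return (x / scale**2) * math.exp(-x * x / (2.0 * scale**2))

def output_pdf(y, c, scale):
    """Density of y = c*x^2 by change of variables: f_Y(y) = f_X(x(y)) * |dx/dy|."""
    if y <= 0:
        return 0.0
    x = math.sqrt(y / c)
    dx_dy = 1.0 / (2.0 * math.sqrt(c * y))
    return rayleigh_pdf(x, scale) * dx_dy

def output_cdf(y0, c, scale, steps=2000):
    """P(y <= y0) by midpoint-rule integration of the output density."""
    h = y0 / steps
    return sum(output_pdf((i + 0.5) * h, c, scale) for i in range(steps)) * h

def monte_carlo_cdf(y0, c, scale, n=100_000, seed=3):
    """P(y <= y0) estimated by sampling the input and applying the response."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = scale * math.sqrt(-2.0 * math.log(1.0 - rng.random()))  # Rayleigh sample
        if c * x * x <= y0:
            hits += 1
    return hits / n

c, scale, y0 = 3.0, 1.5, 10.0  # hypothetical response coefficient, input scale, level
analytic = output_cdf(y0, c, scale)
simulated = monte_carlo_cdf(y0, c, scale)
print(f"P(y <= {y0}) analytic = {analytic:.4f}, simulated = {simulated:.4f}")
```

For this particular pair of assumptions the output turns out to be exponentially distributed with mean 2*c*scale^2, which gives an independent check on both numbers.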
47. Establish appropriate
cumulative probability distributions-2
Determine the cumulative probability distribution functions for all the probability
density functions that affect the given mechanical or thermal characteristic of
interest.
Such a convolution of the constituent laws of distribution considers, in the most
accurate and non-conservative way, the probabilistic input of each of the
environmental parameters that affect the particular mechanical, electrical, optical
or thermal characteristic.
Cumulative distributions account for the likelihood that the maxima of different
important factors might not occur simultaneously.
48. Establish appropriate
cumulative probability distributions-3
If the number of random variables does not exceed two, the convolution could be
carried out analytically.
If the number of random variables is three or more, one should “teach” a
computer how to obtain a cumulative law of distribution.
Since the above distributions are based on the transient responses of the
mechanical (thermal) characteristics of interest to the time-dependent
environmental excitations (parameters), these distributions determine the
probability that at the given moment of time the given characteristic is
below/above the given value of this characteristic.
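One simple way to “teach” a computer the cumulative law for three or more random variables is Monte Carlo sampling of their sum. The sketch below is a minimal illustration under stated assumptions: the three contributions (normal, exponential and Rayleigh), their parameters and the helper name `sample_sum_cdf` are hypothetical, and the variables are taken to be independent and additive.

```python
import bisect
import math
import random

def sample_sum_cdf(samplers, n=100_000, seed=2):
    """Empirical cumulative distribution of a sum of independent random variables."""
    rng = random.Random(seed)
    totals = sorted(sum(draw(rng) for draw in samplers) for _ in range(n))

    def cdf(x):
        # Fraction of simulated totals that do not exceed x.
        return bisect.bisect_right(totals, x) / n

    return cdf

# Three hypothetical, independent contributions to one characteristic of interest.
samplers = [
    lambda rng: rng.gauss(10.0, 2.0),                                   # normal
    lambda rng: rng.expovariate(1.0 / 3.0),                             # exponential, mean 3
    lambda rng: 1.5 * math.sqrt(-2.0 * math.log(1.0 - rng.random())),   # Rayleigh, scale 1.5
]

cdf = sample_sum_cdf(samplers)
for level in (10.0, 15.0, 20.0, 25.0):
    print(f"P(total <= {level:.0f}) = {cdf(level):.3f}")
```

The same helper can be checked against a case with a known answer, for example the sum of three identical exponential variables, whose cumulative law is available in closed form.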
49. Probabilistic reliability criteria
Determine, for each point of time after the given duration of operation (mission):
the safety factors and other reliability criteria for the characteristics that
determine the performance, reliability, durability and safety of the system;
the probability of non-failure, P(t), for the established (accepted) safety factor, at
each point of time; and
the mean time-to-failure, MTTF, for the established (accepted) safety factor, the
standard deviation, STD, of the time-to-failure, and the safety factor SF = MTTF/STD
for the time-to-failure.
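These criteria can be estimated directly from a sample of times-to-failure. The sketch below is an illustration only: the failure times are made-up numbers, and the probability of non-failure is estimated empirically as the fraction of units that survive beyond time t, which is one simple choice among several.

```python
import statistics

def reliability_summary(times_to_failure):
    """MTTF, STD of the time-to-failure, and the safety factor SF = MTTF/STD."""
    mttf = statistics.fmean(times_to_failure)
    std = statistics.stdev(times_to_failure)  # sample standard deviation
    return {"MTTF": mttf, "STD": std, "SF": mttf / std}

def prob_non_failure(times_to_failure, t):
    """Empirical P(t): fraction of units whose time-to-failure exceeds t."""
    survivors = sum(1 for ttf in times_to_failure if ttf > t)
    return survivors / len(times_to_failure)

# Hypothetical times-to-failure, in hours.
ttf_hours = [820.0, 950.0, 1010.0, 1100.0, 1180.0, 1250.0, 1320.0, 1400.0]

summary = reliability_summary(ttf_hours)
print(f"MTTF = {summary['MTTF']:.1f} h, STD = {summary['STD']:.1f} h, SF = {summary['SF']:.2f}")
for t in (500.0, 1000.0, 1300.0):
    print(f"P({t:.0f} h) = {prob_non_failure(ttf_hours, t):.3f}")
```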
50. 8. Twelve steps to be conducted
to add value to the existing practice
“The man who removes a mountain begins by carrying away small stones”
Chinese saying
“Give me a fruitful error any time, full of seeds, bursting with
its own corrections. You can keep your sterile truth for yourself”
Vilfredo Pareto, Italian engineer, sociologist, and economist
51. Some important preliminary steps
Establish, as the manufacturer of a particular product, the list of possible failures and
suitable failure criteria, as far as the functional, mechanical (physical) and
environmental failures are concerned.
Find out the similar requirements that the customer specifies (desires) regarding
lifetimes (minimum and mean time to failure), failure rates (considering, for a particular
product, if necessary, the wear-out portion of the bath-tub curve), probability of failure
(for non-reparable products), availability specifications, etc.
Identify active and passive parts, reparable and non-reparable parts, the most
vulnerable (least reliable) parts (e.g., solder joint interconnections, materials prone to
creep or aging, etc.), the feasibility of introducing redundancy, etc.
As a customer, evaluate the ability of a particular manufacturer to make parts with
consistent quality, and, as a manufacturer, establish your company’s ability to
produce such parts.
52. Twelve steps to be conducted to add value to the
existing practice-1
1) Develop a detailed list of possible electrical, mechanical (structural), thermal, and
environmental failures that should be considered, in one way or another, in the
particular design (package, inverter, module, structure, etc.)
2) Make, based on the existing experience and best practices, the preliminary decision on
the materials and geometries in the physical design and packaging of the product and
its units/subunits/assemblies
3) Conduct predictive modeling (using FEA or other simulation packages, as well as
analytical ("mathematical") modeling wherever possible) of the stresses and other
failure criteria (say, elevated temperatures or electrical characteristics), considering
steady-state and transient thermal, stress/strain and electrical fields
4) Consider possible loading in actual use conditions (electrical, thermal, mechanical,
dynamic, as well as their combinations) and distinguish between short-term high-
level loading (related to the ultimate strength of the structure) and long-term low-level
loading (related to the fatigue strength of the structure)
53. Twelve steps to be conducted to add value to the
existing practice-2
5) Review the existing qualification standards for similar structures, keeping in mind,
however, that these standards were designed for conditions (power, geometry,
materials, use) that are similar to, but different from, those we will be dealing with;
come up with the preliminary levels of acceptable stresses, accelerations,
temperatures, voltages, currents, etc.
6) Having in mind FOAT procedures, decide on the constitutive relationships (formulas,
FEA procedures, plots) that govern the failure mechanisms in question (Arrhenius-type
equations for high-temperature "baking", Miner-type for the materials that are
expected to work within the elastic range, Erdogan-Paris-type for brittle materials,
etc.)
7) Design, conduct and interpret the results of the FOAT and, based on this testing,
predict the reliability characteristics of the assemblies, joints, subunits and units of
interest
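Two of the constitutive relationships named in step 6 can be written down in a few lines. The sketch below uses the standard textbook forms of the Arrhenius acceleration factor and of the linear damage accumulation (Palmgren-Miner) rule; the activation energy, the temperatures and the cycle counts are hypothetical numbers chosen for illustration, and a real FOAT would have to establish them experimentally.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant, eV/K

def arrhenius_acceleration(ea_ev, t_use_k, t_test_k):
    """Acceleration factor of a test at t_test_k relative to use at t_use_k (kelvin)."""
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_test_k))

def miner_damage(blocks):
    """Linear damage sum over (applied_cycles, cycles_to_failure) pairs."""
    return sum(applied / to_failure for applied, to_failure in blocks)

# Hypothetical inputs: 0.7 eV activation energy, 55 C use, 125 C "baking".
af = arrhenius_acceleration(ea_ev=0.7, t_use_k=328.15, t_test_k=398.15)
print(f"Arrhenius acceleration factor = {af:.1f}")

# Hypothetical loading history: two stress levels with known cycles-to-failure.
damage = miner_damage([(1_000, 10_000), (2_000, 5_000)])
print(f"accumulated damage = {damage:.2f} (failure is predicted when the sum reaches 1)")
```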
54. Twelve steps to be conducted to add value to the
existing practice-3
8) Based on the obtained information, the state-of-the-art in the area in question and the
requirements of the existing specifications, decide on the allowable (acceptable)
values of the characteristics of failure, with consideration of the economically and
technically feasible lifetime of the module and its major subassemblies
9) Write the first draft of the qualification specs (in other words, revise, if necessary,
the existing JEDEC specs) for the module and its units/subunits of interest
10) Develop root cause analysis (RCA) methodologies
11) Decide on the burn-in conditions and establish adequate service for collecting field
failures
12) Conduct, on a permanent basis, revisions of the designs and the reliability
specifications.
55. 9. DO ELECTRONIC INDUSTRIES
NEED NEW APPROACHES
TO QUALIFY THEIR DEVICES INTO PRODUCTS?
“I do not need an everlasting pen. I do not intend to live forever”
I. Ilf and E. Petrov, “The Golden Calf” (in Russian)
“It is always better to be approximately right than precisely wrong”
Unknown Reliability Manager
56. Nobody and nothing is perfect:
probability of failure is never zero
It should be widely recognized that the probability of a failure is never zero, but could
be predicted and, if necessary, controlled and maintained at an acceptably low level.
One effective way to achieve this is to implement the existing methods and
approaches of PRM techniques and to develop adequate PDfR methodologies.
These methodologies should be based mostly on FOAT and on a widely employed
predictive modeling effort.
FOAT should be carried out in a relatively narrow but highly focused and time-
effective fashion for the most vulnerable elements of the design of interest.
If the QT has a solid basis in FOAT, PM and PDfR, then there is reason to believe that
the product of interest will be sufficiently robust in the field.
57. QT could be viewed as “quasi-FOAT”
The QT could be viewed as “quasi-FOAT,” a sort of “initial stage of FOAT” that
more or less adequately replicates the initial non-destructive, yet full-scale, stage of
FOAT.
We believe that such an approach to qualifying devices into products will enable industry
to specify, and the manufacturers to assure, a predicted and low enough probability
of failure for a device that passed the QT and will be operated in the field under the
given conditions for the given time.
We expect that the suggested approach to the DfR and QT will be accepted by the
engineering and manufacturing communities, implemented into the engineering
practice and be adequately reflected in the future editions of the QT specifications and
methodologies.
58. The PDfR-based QT will still be non-destructive
Such QTs could be designed, therefore, as a sort of mini-FOAT that, unlike the actual,
“full-scale” FOAT, is non-destructive and conducted on a limited scale.
The duration and conditions of such “mini-FOAT” QT should be established based on
the observed and recorded results of the actual FOAT, and should be limited to the
stage when no failures in the actual full-scale FOAT were observed.
Prognostics and health management (PHM) technologies (such as “canaries”) should
be concurrently tested to make sure that the safe limit is not exceeded.
59. What should be done differently
It is important to understand the reliability physics that underlies the mechanisms and
modes of failure in electronics and photonics components and devices.
FOAT should be thoroughly implemented, so that the QT is based on the FOAT
information and data.
The PDfR concept should be widely employed.
FOAT cannot do without predictive modeling: such modeling, both
computer-aided and analytical (“mathematical”), plays an essential role in making the
suggested new approach to product qualification practical and successful.
60. 10. CONCLUDING REMARKS
“Life is the art of drawing sufficient conclusions from insufficient premises”
Samuel Butler, British poet and satirist, “The Way of All Flesh”
61. Conclusions-1
Improvements in the existing QT, as well as in the existing best QT practices, are
indeed possible, provided that the Probabilistic Design for Reliability (PDfR) concept
is thoroughly developed and the corresponding methodologies are employed.
One effective way to improve the existing QT and specs is to
conduct, on a wide scale, Failure Oriented Accelerated Testing (FOAT) at the design
stage (DFOAT) and at the manufacturing stage (MFOAT), and, since DFOAT cannot do
without PM,
carry out, whenever and wherever possible, predictive modeling (PM) to understand
the physics of failure and to accumulate, when appropriate, failure statistics;
revisit, review and revise the existing QT and specs considering the DFOAT and, to a
lesser extent, MFOAT data for the most vulnerable elements of the device of interest;
develop and widely implement the PDfR methodologies having in mind that “nobody
and nothing is perfect”, that probability of failure is never zero, but could be predicted
and, if necessary, controlled and maintained during operation at an acceptably low
level.
62. Conclusions-2
We believe that our new approach to the qualification
of the electronic devices will enable industry to
specify and the manufacturers to assure a predicted
and low enough probability of failure for a device that
passed the qualification specifications and will be
operated under the given stress (not necessarily
mechanical) conditions for the given time.
We expect that eventually the suggested new
approaches to the DfR and QT will be accepted by the
engineering and manufacturing communities,
implemented in a timely fashion into the engineering
practice and be adequately reflected in the future
editions of the qualification specifications and
methodologies.