This document discusses predictive statistics and trend analysis techniques. It begins with an introduction to basic statistical concepts like linear regression analysis and examines how to calculate accuracy and the confidence interval of trend predictions. The document cautions that trend predictions assume the future will resemble the past and questions the assumptions behind trend analysis. It provides examples of how to perform linear regression calculations in Excel and SAS and discusses the impact of outliers on the regression fit.
Predictive Statistics (Trending) – A Tutorial
by Ray Wicks
561-236-5846
RayWicks@us.ibm.com
RayWicks@yahoo.com
8/5/2008
This session is sponsored by CMG Brazil.

Trade Marks, Copyrights & Stuff
This presentation is copyright Ray Wicks 2008. Many terms are trademarks of different companies and are owned by them. Some foils that appear in this presentation are not in the handout. This is to prevent you from looking ahead and spoiling my jokes and surprises.

IBM 2008
Abstract
Predictive Statistics (Trending) – A Tutorial
This session reviews some of the trending techniques which can be useful in capacity planning. The basic statistical concept of regression analysis will be introduced and examined. The simple linear regression analysis will be shown. This session is sponsored by CMG Brazil.

How Accurate Is It?
[Chart: a prediction curve extending forward from time t0.]
Starting from an initial point of maybe dubious accuracy, we apply a growth rate (also dubious) and then recommend actions costing lots of money.
Trending CMG Brazil (c) Ray Wicks 2008
Accuracy / How Accurate Is It?
[Charts: the same prediction curve from t0, shown once with the prediction at time t as a precise point p and once as a fuzzy patch.]
At time t, is the prediction a precise point p or a fuzzy patch? Accuracy is found in values that are close to the expected curve. This closeness implies an expected bound or variation in reality. So a thicker line makes sense.
Statistical Discourse – A Conversation
[Chart: standard normal density, =NORMDIST(x,0,1,0), plotted for x from -4 to 4; labeled Perceptual Structure vs. Conceptual Structure.]
You: The answer is 42.67.
Them: I measured it and the answer is 42.663!
You: Give me a break.
Them: I just want to be exact.
You: OK, the answer is around 42.67.
Them: How far around?
You: ????
Blah, blah, blah
Confidence Interval, or How Thick is the Line?
[Chart: a prediction line from t0 widening into a band; inset, the standard normal density =NORMDIST(x,0,1,0) for x from -4 to 4.]
P[μ - 2σ < X < μ + 2σ] = 0.954
P[μ - 1.96σ < X < μ + 1.96σ] = 0.95 or 95%

For the average of n observations, the 95% confidence interval is
[ μ - 1.96 σ/√n , μ + 1.96 σ/√n ]
and in general
[ μ - z(α/2) σ/√n , μ + z(α/2) σ/√n ]
Using a Standard Normal Probability table, 95% confidence (2 tail) is found by looking up the z score that leaves α/2 = 0.025 in each tail: z = 1.96.
[L,U] is called the 100(1-α)% confidence interval. 1-α is called the level of confidence associated with [L,U].
In Excel, =CONFIDENCE(α, σ, n) returns the half-width z(α/2)·σ/√n; for example, =CONFIDENCE(0.05, 1, 1) = 1.96.
IBM 2008
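The half-width returned by Excel's =CONFIDENCE can be sketched in Python using only the standard library; the function name `confidence` is mine, for illustration, not an Excel or Python built-in:

```python
from statistics import NormalDist
from math import sqrt

def confidence(alpha, sigma, n):
    """Half-width of the 100*(1-alpha)% confidence interval for a mean,
    mirroring Excel's =CONFIDENCE(alpha, sigma, n): z(alpha/2) * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    return z * sigma / sqrt(n)

half = confidence(0.05, 1, 100)
mu = 42.67
interval = (mu - half, mu + half)  # the "thick line" around the estimate
```

Note how the half-width shrinks as 1/√n: more observations make the line thinner, but it never becomes a point.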
Summary Statistics
Given a list of numbers X = {Xi}, i = 1 to n:

Term                       Formula                    Excel                    PS View
Count (number of items)    n                          =COUNT(X)                Number of points plotted
Average                    X̄ = Sum(X)/n               =AVERAGE(X)              Center of gravity
Median§                    X[ROUND DOWN 1+n*0.5]      =MEDIAN(X)               Middle number
Variance                   V = Σ(Xi-X̄)²/n             =VAR(X)                  Spread of data
Standard Deviation         s = SQRT(V)                =STDEV(X)                Spread of data
Coefficient of Variation   CV = s/X̄ (Std/Avg)                                  Spread of data around average
Minimum                    First in sorted list       =MIN(X)                  Bottom of plot
Maximum                    Last in sorted list        =MAX(X)                  Top of plot
Range                      [Minimum, Maximum]                                  Distance between top and bottom
90th percentile§           X[ROUND DOWN 1+n*0.9]      =PERCENTILE(X,0.9)       10% from the top
Confidence interval        Look in book               =CONFIDENCE(0.05,s,n)    Expected variability of average (a thick line)
§ Percentile formulae assume a sorted list, low to high.

Linear Regression (for Trending)
[Chart: MIPS Used vs. Week, fitted line y = 3.0504x + 385.42, R² = 0.7881.]
Obtain a useful fit of the data (y = mx + b) and then extend the values of X to obtain predicted values of Y. But remember, as Niels Bohr said: "Prediction is very hard to do. Especially about the future."
IBM 2008
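The Excel formulas in the table have direct standard-library equivalents; a minimal sketch (the data list is made up for illustration, and the variance here is the population form V = Σ(Xi-X̄)²/n used on the foil, not Excel's sample =VAR):

```python
import statistics
from math import sqrt

X = [4, 8, 15, 16, 23, 42]  # illustrative data, not from the foils

n = len(X)                                # =COUNT(X)
avg = sum(X) / n                          # =AVERAGE(X)
med = statistics.median(X)                # =MEDIAN(X)
var = sum((x - avg) ** 2 for x in X) / n  # V = sum((Xi - avg)^2)/n
s = sqrt(var)                             # standard deviation
cv = s / avg                              # coefficient of variation (Std/Avg)
lo, hi = min(X), max(X)                   # =MIN(X), =MAX(X)
# Foil's positional percentile rule: X[ROUND DOWN 1+n*0.9], 1-based, sorted list
p90 = sorted(X)[int(1 + n * 0.9) - 1]
```

The positional percentile rule above follows the foil's formula; Excel's =PERCENTILE interpolates between neighboring values, so the two can differ slightly.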
Trending Assumptions & Questions
- The future will be like the past. How much history is too much? You should look at Era segments.
- Shape and scale of the graph can be interesting.
- You may need more than numbers.... The business and technical environment?
- Be smart and lazy.... What questions are you answering?

Reality
[Charts: CPU% vs. Week over weeks 0-150, and MIPS Used vs. Week with fitted line y = 3.0504x + 385.42, R² = 0.7881.]
Linear regression's predictions assume that the future looks like the past.
Coding Implementation – The Butterfly Effect
Algorithm 1:
Xn+1 = s*Xn        if Xn < 0.5
Xn+1 = s*(1 - Xn)  otherwise
In Excel: cell Xn+1 is =IF(Xn<0.5, S*Xn, S*(1-Xn))
Algorithm 2:
Xn+1 = s*(0.5 - |Xn - 0.5|)
In Excel: cell Xn+1 is =S*(0.5-ABS(Xn-0.5))
Mathematically equal.
(Ref. Chaos Under Control, section on Butterfly Effect.)

Linear Fit for {Xi,Yi}
[Chart: points (Xi,Yi) around a fitted line Yi = B0 + B1*Xi, with the error e marked between a point and the line.]
On the line would be perfect. Next to that would be a line with minimum error (e). Actually, minimum e² is better.
Goodness of Fit: R² = Σ(Ŷi - Ȳ)² / Σ(Yi - Ȳ)²
IBM 2008
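The butterfly-effect point (two algebraically identical recurrences drifting apart in floating point) can be sketched in Python; s = 1.9999 and X0 = 0.1 are my illustrative choices, not values from the foil:

```python
def algorithm1(x, s):
    # Xn+1 = s*Xn if Xn < 0.5, else s*(1 - Xn)
    return s * x if x < 0.5 else s * (1 - x)

def algorithm2(x, s):
    # Xn+1 = s*(0.5 - |Xn - 0.5|), mathematically equal to algorithm1
    return s * (0.5 - abs(x - 0.5))

s, a, b = 1.9999, 0.1, 0.1
diffs = []
for _ in range(100):
    a, b = algorithm1(a, s), algorithm2(b, s)
    diffs.append(abs(a - b))
# The first step already differs by rounding error (~1e-17), and the map
# amplifies that difference each iteration until the two "equal"
# trajectories disagree completely.
```

This is exactly why two spreadsheets implementing the "same" formula differently can produce very different long-range projections.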
Excel Help
Search Excel Help for R Squared returns:
RSQ: Returns the square of the Pearson product moment correlation coefficient through data points in known_y's and known_x's. For more information, see PEARSON. The r-squared value can be interpreted as the proportion of the variance in y attributable to the variance in x.

Correlation
[Chart: DASD I/O Rate (0-7000) vs. CPU% (0-100).]
Correlation = COV(X,Y) / (σx σy)
            = σxy / (σx σy)
            = E[(x-μx)(y-μy)] / (σx σy)
Correlation lies in [-1,1].
=CORREL(CPU%,DASDIO) = 0.86
IBM 2008
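The covariance formula behind =CORREL can be computed directly; a sketch in plain Python (the cpu/dasd_io lists here are invented for illustration, not the data behind the foil's 0.86):

```python
from math import sqrt

def correl(X, Y):
    """Pearson correlation: COV(X,Y) / (sigma_x * sigma_y)."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in X) / n)
    sy = sqrt(sum((y - my) ** 2 for y in Y) / n)
    return cov / (sx * sy)

cpu = [20, 35, 40, 55, 70, 85]
dasd_io = [900, 1800, 2100, 3200, 4400, 6100]
r = correl(cpu, dasd_io)  # close to +1: strong linear relationship
```

A perfectly linear relationship gives exactly +1 (or -1 for a negative slope); real capacity data lands somewhere in between.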
Briefly: Correlation is not Causality
Cause → Effect (sufficient cause)
~Effect → ~Cause (necessary cause)
R² or CORR(C,E) may indicate a linear relationship without there being a causal connection.
In cities of various sizes:
C = number of TVs is highly correlated with E = number of murders.
C = religious events is highly correlated with E = number of suicides.

Causality & Correlation
Claim: Eating Cheerios will lower your cholesterol.
Cause: Eating Cheerios. Effect: Lower cholesterol. Test: what is the real cause? Look for an intervening variable.
[Diagram: Cheerios → lower cholesterol is crossed out; Bacon & Eggs → cholesterol appears as the intervening variable.]
There is a correlation between eating Cheerios and lower cholesterol, but is there a causal relationship?
Matrix Solution for Linear Fit
B = (Mᵀ * M)⁻¹ * Mᵀ * Y solves for Y = B0 + B1*X.

X     Y     YH      Sq(YH-YA)   Sq(Y-YA)
1.3   62.3  61.765  50.339025   43.0336
1.4   64.3  66.495   5.593225   20.7936
1.45  70.8  68.860   0.000000    3.7636
1.5   71.1  71.225   5.593225    5.0176
1.6   75.8  75.955  50.339025   48.1636
Avg Y (YA) = 68.86
R² = SUM(Sq(YH-YA)) / SUM(Sq(Y-YA)) = 0.9262   (in Excel: =(SUM(F3:F7)/SUM(G3:G7)))

Excel Solution
M is 5x2: a column of 1s beside the X column.
Mᵀ is 2x5:   1    1    1     1    1
           1.3  1.4  1.45  1.5  1.6
Mᵀ*M is 2x2:     5    7.25
              7.25  10.563
INV(MᵀM) is 2x2:  42.25  -29
                    -29   20
INV(MᵀM)*Mᵀ is 2x5:  4.55  1.65  0.2  -1.25  -4.15
                       -3    -1    0      1      3
INV(MᵀM)*Mᵀ*Y is 2x1:  0.275  (B0)
                       47.3   (B1)
In Excel, enter the matrix formulas as array formulas with ctl-shift-enter.
[Chart: CPU% vs. Units of Work, fitted line y = 47.3x + 0.275, R² = 0.9262.]
IBM 2008
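For a 2-column M, the matrix arithmetic on the foil collapses to a 2x2 system that can be inverted by hand; a plain-Python sketch that reproduces the foil's numbers:

```python
X = [1.3, 1.4, 1.45, 1.5, 1.6]
Y = [62.3, 64.3, 70.8, 71.1, 75.8]
n = len(X)

# For M = [1 | X], Mt*M is the 2x2 matrix [[n, sum(X)], [sum(X), sum(X^2)]]
sx = sum(X)
sxx = sum(x * x for x in X)
sy = sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))

# Invert the 2x2 via the determinant and multiply by Mt*Y = [sum(Y), sum(X*Y)]
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det   # intercept B0
b1 = (n * sxy - sx * sy) / det     # slope B1

yh = [b0 + b1 * x for x in X]      # fitted values YH
ya = sy / n                        # average YA
r2 = sum((v - ya) ** 2 for v in yh) / sum((y - ya) ** 2 for y in Y)
```

Running this recovers B0 = 0.275, B1 = 47.3, and R² = 0.9262, matching the spreadsheet.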
Impact of Outlier
[Chart: CPU% vs. Units of Work with one outlier; the fitted line collapses to y = -50.8x + 149.06, R² = 0.2358.]

A perfect fit is always possible
[Chart: CPU% vs. Units of Work fitted exactly by the 4th-degree polynomial y = 58111x⁴ - 338194x³ + 736689x² - 711801x + 257442, R² = 1.]
Albeit meaningless in this case.
IBM 2008
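The "perfect fit" is just interpolation: any 5 points with distinct X values admit a degree-4 polynomial through them exactly. A sketch using Lagrange interpolation (reusing the 5 points from the matrix-solution foil as an assumed example; the foil's own quartic was fitted to its chart's data):

```python
def lagrange(X, Y, x):
    """Evaluate the unique degree-(n-1) polynomial through the n points (Xi, Yi)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(X, Y)):
        term = yi
        for j, xj in enumerate(X):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

X = [1.3, 1.4, 1.45, 1.5, 1.6]
Y = [62.3, 64.3, 70.8, 71.1, 75.8]

fit = [lagrange(X, Y, x) for x in X]  # passes through every point: R^2 = 1
# Exact at the data, but wiggly between the points and wild beyond them,
# which is why a "perfect" R^2 can be meaningless for prediction.
```

Five parameters fitting five points leaves zero degrees of freedom: the model memorizes the data instead of describing a trend.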
Confidence of Fit
[Chart: CPU% vs. Units of Work with fitted line y = 47.3x + 0.275, R² = 0.9262, plus lower bound (LB) and upper bound (UB) confidence curves around the linear fit.]

SAS
Analyze -> Linear Regression Run

Root MSE          1.72313    R-Square   0.9262
Dependent Mean   68.86000    Adj R-Sq   0.9017
Coeff Var         2.50236

Parameter Estimates
Variable   Label      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept   1              0.27500         11.20033      0.02     0.9820
X          X           1             47.30000          7.70606      6.14     0.0087
IBM 2008
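The Root MSE, standard errors, and t values in the SAS table follow from textbook formulas; a plain-Python sketch that reproduces them from the 5 data points on the matrix-solution foil:

```python
from math import sqrt

X = [1.3, 1.4, 1.45, 1.5, 1.6]
Y = [62.3, 64.3, 70.8, 71.1, 75.8]
n = len(X)
b0, b1 = 0.275, 47.3                     # fit from the matrix-solution foil

resid = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
sse = sum(e * e for e in resid)
root_mse = sqrt(sse / (n - 2))           # SAS "Root MSE"

xbar = sum(X) / n
sxx = sum((x - xbar) ** 2 for x in X)
se_b1 = root_mse / sqrt(sxx)                       # standard error of the slope
se_b0 = root_mse * sqrt(1 / n + xbar ** 2 / sxx)   # standard error of the intercept
t_b1 = b1 / se_b1                        # t values; SAS gets Pr > |t| from t(n-2)
t_b0 = b0 / se_b0
```

Note what the table is saying: the slope is clearly nonzero (t = 6.14, p = 0.0087), but the intercept is indistinguishable from zero (t = 0.02, p = 0.9820).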
Residuals
For each Xi, plot the residual e = Yi - Ŷi.
[Chart: residuals vs. Units of Work (0-900), ranging roughly from -20 to 10.]
Look for a random distribution around 0.
IBM 2008
Regression other than Linear – Interesting Case
[Charts: CPU% vs. Blocks (0-800). Left: linear fit y = 0.0335x, R² = 0.8569. Right: exponential fit y = 1.234e^(0.0043x), R² = 0.9457.]
Notice the points are below the line until Blocks > 600. Typical of DB/DC. Does it mean the system is less efficient as the load increases? The residuals have a pattern; that usually means a second level effect. An exponential fit is useful when computing compound growth.
IBM 2008
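An exponential fit y = a·e^(bx) reduces to ordinary linear regression on ln(y), since ln y = ln a + b·x. A sketch (the data is synthetic, generated from the foil's fitted curve y = 1.234e^(0.0043x) so the recovery can be checked; real CPU% data would of course be noisy):

```python
from math import log, exp

def fit_exponential(X, Y):
    """Fit y = a * exp(b*x) by linear least squares on ln(y)."""
    n = len(X)
    L = [log(y) for y in Y]                 # ln y = ln a + b*x
    xbar, lbar = sum(X) / n, sum(L) / n
    b = sum((x - xbar) * (l - lbar) for x, l in zip(X, L)) / \
        sum((x - xbar) ** 2 for x in X)
    a = exp(lbar - b * xbar)
    return a, b

# Synthetic compound-growth data on the foil's curve y = 1.234 * e^(0.0043x)
X = list(range(0, 800, 50))
Y = [1.234 * exp(0.0043 * x) for x in X]
a, b = fit_exponential(X, Y)
growth_per_block = exp(b) - 1   # compound growth rate per block
```

The fitted b converts directly to a compound growth rate per unit of x via e^b - 1, which is what makes the exponential form convenient for growth projections.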