This webinar will provide an overview of the tools that SciPy and NumPy provide for regression analysis, including linear and non-linear least squares, with a brief look at handling other error metrics. We will also demonstrate simple GUI tools that can make some problems easier, and give a quick overview of the new scikits.statsmodels package, whose API is maturing in a separate package but which should be incorporated into SciPy in the future.
5. Data Model
Linear model:
    y = m x + b                  (m = 4.316, b = 2.763)

Logistic-type model:
    y = a / (b + c e^{-d x})     (a = 7.06, b = 2.52, c = 26.14, d = -5.57)
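As a minimal sketch (not from the webinar itself), these two models can be written directly as Python functions; the sample points below are an assumption for illustration, and the parameter values are the fitted ones quoted above:

import numpy as np

# Linear model: y = m*x + b
def linear(x, m, b):
    return m * x + b

# Logistic-type model: y = a / (b + c*exp(-d*x))
def logistic(x, a, b, c, d):
    return a / (b + c * np.exp(-d * x))

x = np.linspace(0, 1, 50)                     # hypothetical sample points
y1 = linear(x, 4.316, 2.763)                  # fitted linear parameters
y2 = logistic(x, 7.06, 2.52, 26.14, -5.57)    # fitted logistic parameters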
6. Curve Fitting or Regression?
Adrien-Marie Legendre, Carl Friedrich Gauss, Francis Galton, R.A. Fisher
7. or (my preferred) ... Bayesian Inference
The inference p(X|Y) combines the model p(Y|X) with the prior p(X):

    p(X|Y) = p(Y|X) p(X) / p(Y)
           = p(Y|X) p(X) / ∫ p(Y|X) p(X) dX

where X are the unknowns and Y is the data.

Bayes, Laplace, Harold Jeffreys, Richard T. Cox, Edwin T. Jaynes
8. More pedagogy
Machine Learning

  Curve Fitting / Parameter Estimation    Regression / Bayesian Inference
  ------------------------------------    ------------------------------------------
  Understated statistical model           Statistical model is more important
  Just want "best" fit to the data        Post-estimation analysis of error and fit
9. Pragmatic look at the methods
• Because the concept is really at the heart of science, many practical methods have been developed.
• SciPy contains the building blocks to implement basically any method.
• SciPy should get high-level interfaces to all the methods in common use.
10. Methods vary in...
• The model used:
  – parametric (specific model): y = f(x; θ)
  – non-parametric (many unknowns): y = Σ_i θ_i φ_i(x)
• The way the error ε is modeled: ŷ = y + ε
  – few assumptions (e.g. zero-mean, homoscedastic)
  – full probabilistic model
• What "best fit" means (i.e. what the distance is between the predicted and the measured):
  – traditional least-squares
  – robust methods (e.g. absolute difference), contrasted in the sketch below
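A minimal sketch of that last distinction, using scipy.optimize.fmin to minimize either a squared-error or an absolute-error cost on the same data; the straight-line data, outlier, and starting guess are made up for illustration:

import numpy as np
from scipy.optimize import fmin

# Hypothetical noisy straight-line data with one outlier.
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + 0.3 * np.random.randn(20)
y[5] += 10.0   # outlier

def model(beta, x):
    return beta[0] * x + beta[1]

# Traditional least squares: minimize the sum of squared residuals.
def sq_cost(beta):
    return np.sum((y - model(beta, x)) ** 2)

# Robust alternative: minimize the sum of absolute differences.
def abs_cost(beta):
    return np.sum(np.abs(y - model(beta, x)))

beta_ls = fmin(sq_cost, [1.0, 0.0], disp=False)   # pulled toward the outlier
beta_l1 = fmin(abs_cost, [1.0, 0.0], disp=False)  # less sensitive to it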
11. Parametric Least Squares
ŷ = [y_0, y_1, ..., y_{N-1}]^T
x = [x_0, x_1, ..., x_{N-1}]^T

ŷ = f(x; β) + ε
β = [β_0, β_1, ..., β_{K-1}]^T,   K < N

β̂ = argmin_β J(ŷ, x, β)
β̂ = argmin_β (ŷ - f(x; β))^T W (ŷ - f(x; β))
12. Linear Least Squares
ŷ = H(x) β + ε
β̂ = (H(x)^T W H(x))^{-1} H(x)^T W ŷ

Quadratic example:  y_i = a x_i^2 + b x_i + c

         [ x_0^2      x_0      1 ]
         [ x_1^2      x_1      1 ]   [ a ]
    ŷ =  [   ...      ...    ... ]   [ b ]  + ε
         [ x_{N-1}^2  x_{N-1}  1 ]   [ c ]

where the matrix is H(x) and [a, b, c]^T is β.
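A minimal sketch of the quadratic example with W = I, using numpy.linalg.lstsq, which solves the same normal equations more stably than forming the inverse explicitly; the data here are hypothetical:

import numpy as np

# Hypothetical data from y = a*x^2 + b*x + c plus noise.
x = np.linspace(-2, 2, 50)
y = 2.0 * x**2 - 1.0 * x + 0.5 + 0.1 * np.random.randn(50)

# Design matrix H(x) with columns x^2, x, 1.
H = np.column_stack([x**2, x, np.ones_like(x)])

# With W = I the estimate reduces to beta = (H^T H)^{-1} H^T y;
# lstsq solves the same least-squares problem without the explicit inverse.
beta, res, rank, sv = np.linalg.lstsq(H, y)
a, b, c = beta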
13. Non-linear least squares
β̂ = argmin_β J(ŷ, x, β)
β̂ = argmin_β (ŷ - f(x; β))^T W (ŷ - f(x; β))

Logistic example:
    y_i = a / (b + c e^{-d x_i})
Optimization Problem!!
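A minimal sketch of attacking this optimization problem with scipy.optimize.leastsq (Levenberg-Marquardt); the true parameters, noise level, and starting guess are assumptions for illustration:

import numpy as np
from scipy.optimize import leastsq

# Hypothetical data from the logistic model (true parameters assumed).
a, b, c, d = 7.0, 2.5, 26.0, 5.5
x = np.linspace(0, 4, 50)
y = a / (b + c * np.exp(-d * x)) + 0.05 * np.random.randn(50)

# Residual vector handed to Levenberg-Marquardt.
def residuals(beta, x, y):
    a, b, c, d = beta
    return y - a / (b + c * np.exp(-d * x))

beta0 = [1.0, 1.0, 1.0, 1.0]        # starting guess
beta_hat, ier = leastsq(residuals, beta0, args=(x, y))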
14. Tools in NumPy / SciPy
• polyfit (linear least squares)
• curve_fit (non-linear least squares)
• poly1d (polynomial object)
• numpy.random (random number generators)
• scipy.stats (distribution objects)
• scipy.optimize (unconstrained and constrained optimization)
15. Polynomials
• p = poly1d(<coefficient array>)
• p.roots (p.r) are the roots
• p.coefficients (p.c) are the coefficients
• p.order is the order
• p[n] is the coefficient of x^n
• p(val) evaluates the polynomial at val
• p.integ() integrates the polynomial
• p.deriv() differentiates the polynomial
• Basic numeric operations (+, -, /, *) work
• Acts like p.c when used as an array
• Fancy printing

>>> p = poly1d([1,-2,4])
>>> print p
   2
x - 2 x + 4
>>> g = p**3 + p*(3-2*p)
>>> print g
   6     5      4      3      2
x - 6 x + 25 x - 51 x + 81 x - 58 x + 44
>>> print g.deriv(m=2)
    4       3       2
30 x - 120 x + 300 x - 306 x + 162
>>> print p.integ(m=2, k=[2,1])
        4          3     2
0.08333 x - 0.3333 x + 2 x + 2 x + 1
>>> print p.roots
[ 1.+1.7321j  1.-1.7321j]
>>> print p.coeffs
[ 1 -2  4]
16. Statistics
scipy.stats — CONTINUOUS DISTRIBUTIONS
Over 80 continuous distributions!

METHODS: pdf, cdf, rvs, ppf, sf, isf, stats, fit, entropy, nnlf, moment, freeze
17. Using stats objects
DISTRIBUTIONS
>>> from numpy import linspace
>>> from scipy.stats import norm
# Sample the normal dist. 100 times.
>>> samp = norm.rvs(size=100)
>>> x = linspace(-5, 5, 100)
# Calculate the probability density.
>>> pdf = norm.pdf(x)
# Calculate the cumulative dist.
>>> cdf = norm.cdf(x)
# Calculate the percent point function
# (inverse CDF; takes probabilities in [0, 1]).
>>> q = linspace(0.01, 0.99, 100)
>>> ppf = norm.ppf(q)
18. Setting location and Scale
NORMAL DISTRIBUTION
>>> from numpy import linspace
>>> from scipy.stats import norm
# Normal dist. with mean=10 and std=2.
>>> dist = norm(loc=10, scale=2)
>>> x = linspace(-5, 15, 100)
# Calculate the probability density.
>>> pdf = dist.pdf(x)
# Calculate the cumulative dist.
>>> cdf = dist.cdf(x)
# Get 100 random samples from dist.
>>> samp = dist.rvs(size=100)
# Estimate parameters from data: .fit returns the best
# shape + (loc, scale) that explain the data.
>>> mu, sigma = norm.fit(samp)
>>> print "%4.2f, %4.2f" % (mu, sigma)
10.07, 1.95
19. Fitting Polynomials (NumPy)
POLYFIT(X, Y, DEGREE)
>>> from numpy import polyfit, poly1d, linspace, exp
>>> from scipy.stats import norm
# Create clean data.
>>> x = linspace(0, 4.0, 100)
>>> y = 1.5 * exp(-0.2 * x) + 0.3
# Add a bit of noise.
>>> noise = 0.1 * norm.rvs(size=100)
>>> noisy_y = y + noise
# Fit the noisy data with a linear model.
>>> linear_coef = polyfit(x, noisy_y, 1)
>>> linear_poly = poly1d(linear_coef)
>>> linear_y = linear_poly(x)
# Fit the noisy data with a quadratic model.
>>> quad_coef = polyfit(x, noisy_y, 2)
>>> quad_poly = poly1d(quad_coef)
>>> quad_y = quad_poly(x)
20. Optimization
scipy.optimize — Unconstrained Minimization and Root Finding
Unconstrained Optimization
• fmin (Nelder-Mead simplex)
• fmin_powell (Powell's method)
• fmin_bfgs (BFGS quasi-Newton method)
• fmin_ncg (Newton conjugate gradient)
• leastsq (Levenberg-Marquardt)
• anneal (simulated annealing global minimizer)
• brute (brute force global minimizer)
• brent (excellent 1-D minimizer)
• golden
• bracket

Constrained Optimization
• fmin_l_bfgs_b
• fmin_tnc (truncated Newton code)
• fmin_cobyla (constrained optimization by linear approximation)
• fminbound (interval-constrained 1-D minimizer)

Root Finding
• fsolve (using MINPACK)
• brentq
• brenth
• ridder
• newton
• bisect
• fixed_point (fixed point equation solver)
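Two of the routines above in a minimal sketch; the function being minimized and the bracketed root problem are arbitrary examples, not from the slides:

import numpy as np
from scipy.optimize import fmin, brentq

# Nelder-Mead simplex on a smooth 2-D bowl; the minimum is at (2, -1).
f = lambda p: (p[0] - 2.0) ** 2 + (p[1] + 1.0) ** 2
pmin = fmin(f, [0.0, 0.0], disp=False)

# Brent's method for a root of cos(x) - x, bracketed on [0, 1].
root = brentq(lambda x: np.cos(x) - x, 0.0, 1.0)   # approx 0.739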
21. Optimization: Data Fitting
NONLINEAR LEAST SQUARES CURVE FITTING
>>> from numpy import exp, sin, pi, linspace
>>> from numpy.random import randn
>>> from scipy.optimize import curve_fit
# Define the function to fit.
>>> def function(x, a, b, f, phi):
...     result = a * exp(-b * sin(f * x + phi))
...     return result
# Create a noisy data set.
>>> actual_params = [3, 2, 1, pi/4]
>>> x = linspace(0, 2*pi, 25)
>>> exact = function(x, *actual_params)
>>> noisy = exact + 0.3 * randn(len(x))
# Use curve_fit to estimate the function parameters from the noisy data.
>>> initial_guess = [1, 1, 1, 1]
>>> estimated_params, err_est = curve_fit(function, x, noisy, p0=initial_guess)
>>> estimated_params
array([3.1705, 1.9501, 1.0206, 0.7034])
# err_est is an estimate of the covariance matrix of the estimates
# (i.e. how good a fit it is).
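A common follow-up, not on the slide: the square roots of the diagonal of err_est give one-sigma uncertainty estimates for each parameter.

>>> from numpy import sqrt, diag
# One-sigma uncertainties of the estimated parameters.
>>> param_std = sqrt(diag(err_est))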
22. StatsModels
Josef Perktold (economist), Canada
Skipper Seabold (PhD student), American University, Washington, D.C.
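A hedged sketch of the statsmodels OLS workflow; the import path below is the modern one (at the time of the webinar the package lived under scikits.statsmodels, so the exact spelling may differ), and the data are hypothetical:

import numpy as np
import statsmodels.api as sm

# Hypothetical straight-line data.
x = np.linspace(0, 10, 50)
y = 4.3 * x + 2.8 + np.random.randn(50)

# Ordinary least squares with post-estimation statistics.
X = sm.add_constant(x)          # design matrix with a constant term
results = sm.OLS(y, X).fit()
print results.params            # fitted coefficients
print results.summary()         # standard errors, t-stats, R^2, ...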
24. GUI example: astropysics (with TraitsUI)
Erik J. Tollerud (PhD student), UC Irvine Center for Cosmology, Irvine, CA
http://www.physics.uci.edu/~etolleru/
25. Scientific Python Classes
http://www.enthought.com/training
Sept 21-25 Austin
Oct 19-22 Silicon Valley
Nov 9-12 Chicago
Dec 7-11 Austin