Day 20 Page 1:
Chapter 3 / Chapter 6
Model Building and Regression
A quick note about extrapolation (mentioned at the end of yesterday's lecture): You need to be careful when using extrapolation because in many cases you have no reason to believe that your mathematical model is correct beyond the range of the original data.
For instance, if you were modeling the force required to stretch a rubber band a certain distance and came up with data and a linear fit as seen below, you would be mistaken to extrapolate the graph to an applied force of 100 Newtons (the rubber band will have broken long before that!). Extrapolation is useful for making tentative predictions. They should be tested with more measurements however.

You can be a little more confident about interpolation (especially if you have a lot of data points that are close together). But if there aren't many data points, then even interpolation will not be accurate--you just don't know with enough certainty what is happening between the data points. The lesson here is that when doing experiments, take as much data as possible. Then you will have more confidence in the fitted equation and interpolation and extrapolation will be more useful as well.
Regression
There are two types of analysis of data invloving two variables (like force and distance). They are correlation analysis and regression. Correlation analysis deals with situations where both variables are random and is interested in the relationship between them. Correlation analysis is an advanced topic in a statistics course. We won't cover it in this class. We will focus on regression.
Regression has one of the variables arranged such that you can measure it without error or can give it whatever value you want. That is called the independent variable (or controlled variable). The other variable is a random variable. Regression analysis deals with the relationship between the two variables. An example would be the dependence of the temperature of the air outside (the random variable) on the time of day (the independent variable)
In the three examples of yesterday, we did regression analysis using the polyfit function for linear functions or functions that could be converted to linear funcitons. The polyfit function is based on a least squares method.
The least squares method for coming up with the the best fit line is a little complicated. See the figure below.

The line that gives the best fit is the one that minimizes J, the sum of the squares of the vertical distances (shown in blue) between the fit line and the data points. These distances are called residuals. Obviously, the closer the data points are to the fit line, the smaller the residuals (and the smaller J is) and the better the equation fits the data.
The polyfit function
assumes a polynomial fit equation (degree = 1 for a linear fit). For a polynomial
,
the polyfit function
provides the solution that minimizes the residuals. The equation for J for
m data points is

Again, polyfit finds the coefficients (the a's) that minimize J. As we said last class, the syntax for polyfit is
polyfit(x,y,n)
where n is the degree of the polynomial (linear has n = 1, quadratic has n = 2, cubic has n = 3...) The result is a row vector of length n+1 and contains the polynomial coefficients (the a's in the equation above), in order of decreasing powers.
Let's use the polyfit function to fit the data below to a linear, a quadratic, a cubic and a 4th degree polynomial (quartic?) and compare the fits by seeing how small J is. Remember; small J = good fit.
% Script to fit data with linear, quadratic, cubic and 4th degree
polynomial equations.
% This would be a lot easier with for loops. I wouldn't
have to repeat so much code.
%
Oh well. We'll cover that next week.
x = 1:9;
y = [5 6 10 20 28 33 34 36 42];
pLinear = polyfit(x,y,1);
subplot(2,2,1)
xLinear = 1:0.1:9;
yLinear = pLinear(1)*xLinear+pLinear(2);
plot(x,y,'*', xLinear,yLinear)
Title('Linear fit'), xlabel('x'),ylabel('y')
% There is a polyvar function that helps with calculating J but it is
% pretty complicated to explain so I'll do J the long way.
JLinear = sum((pLinear(1)*x+pLinear(2)-y).^2);
disp(['J for a linear fit is: '
num2str(JLinear)])
% Now a quadratic
pQuad = polyfit(x,y,2);
subplot(2,2,2)
xQuad = 1:0.1:9;
yQuad = pQuad(1)*xQuad.^2+pQuad(2)*xQuad+pQuad(3);
plot(x,y,'*', xQuad,yQuad)
Title('Quadratic fit'), xlabel('x'),ylabel('y')
% J the long way.
JQuad = sum((pQuad(1)*x.^2+pQuad(2)*x+pQuad(3)-y).^2);
disp(['J for a quadratic fit is: '
num2str(JQuad)])
% Now a cubic fit
pCubic = polyfit(x,y,3);
subplot(2,2,3)
xCubic = 1:0.1:9;
yCubic = pCubic(1)*xCubic.^3+pCubic(2)*xCubic.^2+pCubic(3)*xCubic+pCubic(4);
plot(x,y,'*', xCubic,yCubic)
Title('Cubic fit'), xlabel('x'),ylabel('y')
% J the long way.
JCubic = sum((pCubic(1)*x.^3+pCubic(2)*x.^2+pCubic(3)*x+pCubic(4)-y).^2);
disp(['J for a quadratic fit is: '
num2str(JCubic)])
% Now a 4th degree fit
pQuartic = polyfit(x,y,4);
subplot(2,2,4)
xQuartic = 1:0.1:9;
yQuartic = pQuartic(1)*xQuartic.^4+pQuartic(2)*xQuartic.^3+pQuartic(3)*xQuartic.^2+pQuartic(4)*xQuartic+pQuartic(5);
plot(x,y,'*', xQuartic,yQuartic)
Title('4th degree fit'), xlabel('x'),ylabel('y')
% J the long way.
JQuartic = sum(( pQuartic(1)*x.^4+pQuartic(2)*x.^3+pQuartic(3)*x.^2+pQuartic(4)*x+pQuartic(5)-y).^2);
disp(['J for a 4th degree fit is: '
num2str(JQuartic)])
The ouput is
J for a linear fit is: 71.5389
J for a quadratic fit is: 56.6727
J for a quadratic fit is: 41.8838
J for a 4th degree fit is: 4.6566

As you can see, the 4th degree polynomial provides the best fit (It has more bends in it and so can fit the data easier and has the lowest value of J). You can type polyfit(x,y,4) to get the coefficients and write the polynomial equation out. The coefficients are:
0.0795 -1.6928 11.9331 -25.9613 21.0556
That means the polynomial equation that best fits the data is:
![]()
You might be tempted to always use a high degree polynomial to fit your data. But you might run into problems if you do that. Firstly, high degree polynomials have a lot of bends in them. That could mean when though it might fit the data very well (and even go through the points exactly), there might be a big bend in between points that doesn't really represent what is going on. For example, try the following:
% Does
a 5th degree polynomial fit to the data
clf
x = 0:5;
y = [0 1 60 40 41 47]
;
pQuintic = polyfit(x,y,5);
xQuintic = 0:0.05:5;
yQuintic = pQuintic(1)*xQuintic.^5+pQuintic(2)*xQuintic.^4+pQuintic(3)*xQuintic.^3+pQuintic(4)*xQuintic.^2+pQuintic(5)*xQuintic+pQuintic(6);
plot(x,y,'*',xQuintic,yQuintic)
Title('5th degree fit'),xlabel('x'),ylabel('y')
% J the long way.
JQuintic = sum((pQuintic(1)*x.^5+pQuintic(2)*x.^4+pQuintic(3)*x.^3+pQuintic(4)*x.^2+pQuintic(5)*x+pQuintic(6)-y).^2);
disp(['J for a 4th degree fit is: '
num2str(JQuintic)])
The output is
J for a 4th degree fit is: 7.6074e-024

As you can see, the fit is fantastic--the curve goes through all the data points and J is essentially zero. All is not well, however. There are some pretty wild bends between the 1st and 2nd points and the 5 and 6th points. You really shouldn't trust the fit in these regions (interpolation would be approximate at best).
Secondly, another problem with high-degree polynomial fit equations is that they can give large errors if the coefficients are not given with enough significant figures. You should use format long to avoid this problem. There are also inaccuracies that come up in the routine used to execute the polyfit function for high-degree polynomials. This can lead to bad fits as well. Sometimes it is better to break the fit up into parts--fit the data with two or more functions.
How Good Is The Fit?
As we have seen, the smaller J is the better the fit. There is another measure of the fit quality that is more commonly used. It is called the coefficient of determination or r-squared. It is given by
where
S is given by
.
S is the sum of the squares of the deviation of the y values
from their mean (y with a bar over it is the mean value of y).
S is a measure of how much the data is spread around the mean.
For a perfect fit, J = 0 (there would be no distance from the data points to the fit curve). That means r-squared = 1 for a perfect fit. Good fits have r-squared very close to 1 (usually a tiny bit less, like 0.99888). Bad fits have r-squared much less than 1 or even negative.
Let's add a calculation of r-squared to the data we fitted with the four different functions earlier. (I'll copy the entire script here and add the new code:)
% Script
to fit data with linear, quadratic, cubic and 4th degree polynomial equations.
% This would be a lot easier with for loops. I wouldn't have to repeat
so much code.
% Oh well. We'll cover that next week.
% This script includes a calculation of r-squared.
x = 1:9;
y = [5 6 10 20 28 33 34 36 42];
pLinear = polyfit(x,y,1);
subplot(2,2,1)
xLinear = 1:0.1:9;
yLinear = pLinear(1)*xLinear+pLinear(2);
plot(x,y,'*', xLinear,yLinear)
Title('Linear fit'), xlabel('x'),ylabel('y')
% There is a polyvar function that helps with calculating J but it is
% pretty complicated to explain so I'll do J the long way.
JLinear = sum((pLinear(1)*x+pLinear(2)-y).^2);
disp(['J for a linear fit is: '
num2str(JLinear)])
yBar = mean(y);
S = sum((y-yBar).^2);
rSquaredLinear = 1-JLinear/S;
disp(['r-Squared for linear fit: ' num2str(rSquaredLinear)])
% Now a quadratic
pQuad = polyfit(x,y,2);
subplot(2,2,2)
xQuad = 1:0.1:9;
yQuad = pQuad(1)*xQuad.^2+pQuad(2)*xQuad+pQuad(3);
plot(x,y,'*', xQuad,yQuad)
Title('Quadratic fit'), xlabel('x'),ylabel('y')
% J the long way.
JQuad = sum((pQuad(1)*x.^2+pQuad(2)*x+pQuad(3)-y).^2);
disp(['J for a quadratic fit is: '
num2str(JQuad)])
rSquaredQuad = 1-JQuad/S;
disp(['r-Squared for Quadratic fit: ' num2str(rSquaredQuad)])
% Now a cubic fit
pCubic = polyfit(x,y,3);
subplot(2,2,3)
xCubic = 1:0.1:9;
yCubic = pCubic(1)*xCubic.^3+pCubic(2)*xCubic.^2+pCubic(3)*xCubic+pCubic(4);
plot(x,y,'*', xCubic,yCubic)
Title('Cubic fit'), xlabel('x'),ylabel('y')
% J the long way.
JCubic = sum((pCubic(1)*x.^3+pCubic(2)*x.^2+pCubic(3)*x+pCubic(4)-y).^2);
disp(['J for a quadratic fit is: '
num2str(JCubic)])
rSquaredCubic = 1-JCubic/S;
disp(['r-Squared for Cubic fit: ' num2str(rSquaredCubic)])
% Now a 4th degree fit
pQuartic = polyfit(x,y,4);
subplot(2,2,4)
xQuartic = 1:0.1:9;
yQuartic = pQuartic(1)*xQuartic.^4+pQuartic(2)*xQuartic.^3+pQuartic(3)*xQuartic.^2+pQuartic(4)*xQuartic+pQuartic(5);
plot(x,y,'*', xQuartic,yQuartic)
Title('4th degree fit'), xlabel('x'),ylabel('y')
% J the long way.
JQuartic = sum(( pQuartic(1)*x.^4+pQuartic(2)*x.^3+pQuartic(3)*x.^2+pQuartic(4)*x+pQuartic(5)-y).^2);
disp(['J for a 4th degree fit is: '
num2str(JQuartic)])
rSquaredQuartic = 1-JQuartic/S;
disp(['r-Squared for 4th degree fit: ' num2str(rSquaredQuartic)])
The ouput of this script is (I'm not going to show the plot again):
J for a linear fit is: 71.5389
r-Squared for linear fit: 0.95419
J for a quadratic fit is: 56.6727
r-Squared for Quadratic fit: 0.96371
J for a quadratic fit is: 41.8838
r-Squared for Cubic fit: 0.97318
J for a 4th degree fit is: 4.6566
r-Squared for 4th degree fit: 0.99702
As you can see r-squared is much closer to 1 for the 4th degree polynomial. So it is the best fit of the data.
As you can see, curve fitting is pretty complicated stuff (but is very important and is done all the time in by engineers and scientists.) This was intended as a breif introduction to curve fitting (function discovery). Matlab has an additional Curve Fitting Toolbox that has many of the quantities we have calculated today as built in funtions (plus an ability to fit with many other types of equations). You can buy it from mathworks. We just bought a copy for the school computers but it hasn't been installed yet.
You can find more information about Matlab's somewhat limited built-in curve fitting functions by searching for curve fitting from the Matlab help menu.
The Basic Fitting Interface
MATLAB also supports curve fitting through the Basic Fitting Interface available from the figure window under the menu Tools/Basic Fitting:

Using this interface, you can quickly perform many curve fitting tasks within the same easy-to-use environment. The interface is designed so that you can amoung other things, fit data using a polynomial up to degree 10, plot multiple fits simultaneously for a given data set, plot the fit residuals, examine the numerical results of a fit, evaluate (interpolate or extrapolate) a fit, annotate the plot with the numerical fit results and the norm of residuals and save the fit and evaluated results to the MATLAB workspace.
It looks like:

Some of the key things to note are (taken from Matlab help):
Plot fits - This panel allows you to visually explore one or more fits to the current data set: Check to display fits on figure - Select the fits you want to display for the current data set. The polynomial fits use the polyfit function. You can choose as many fits for a given data set as you want. If your data set has N points, then you should use polynomials with, at most, N coefficients. If your fit uses polynomials with more than N coefficients, the interface automatically sets a sufficient number of coefficients to 0 during the calculation so that the system is not underdetermined.
Show equations - If checked, the fit equation is displayed on the plot. You can also get the coefficients of the fit equation in the middle of the screen.
Significant digits - Select the significant digits associated with the equation display.
Plot residuals - If checked, the fit residuals are displayed. The fit residuals are defined as the difference between the ordinate data point and the resulting fit for each abscissa data point. You can display the residuals as a bar plot, as a scatter plot, or as a line plot in the same figure window as the data or in a separate figure window. If you use subplots to plot multiple data sets, then residuals can be plotted only in a separate figure window. Plots of residuals are sometimes made to see how good the fit is. If the residuals are large, the fit is no good. If the plot shows a regular pattern, the fit isn't so good either. The residual plot should look fairly random for a good fit. Type in Analyzing Residuals in Matlab's Help menu search tool for more info on this.
Show norm of residuals - If checked, the norm of residuals are displayed. The norm of residuals is a measure of the goodness of fit, where a smaller value indicates a better fit than a larger value.
The other useful thing is the Find Y = f(x) over on the right side of the screen. There you can interpolate or extrapolate. Just enter a value of x and it will quickly show you the value of Y as calculated by the fit function.
For an example, I want to let Matlab calculate the 4th degree polynomial fit equation we did earlier. So I select data 7 (the data in the 4th degree fit window):

You should see the coefficients are the same as we got from our script (see the 4th degree polynomial equation from earlier). The figure window should show the equation on the plot as well.
That's it. Look for some homework either later tonight or tommorrow. It probably won't be due until next Thursday so you will have plenty of time.
Next Week: Chapter 4 Loops!