Residual Sum of Squares

Suppose you thought that a dog's height could be predicted by its weight. How could you tell if there was really a relationship there? One way would be to choose a random sample of dogs, collect their weights and heights, and then graph your data. Surely if there is a relationship it would show up on a graph, right? But even if it looks like there is a linear relationship, how could you be sure? The principle of least-squares regression, which works by minimising the residual sum of squares, can help you tell just how good a dog's weight is at predicting its height.


Residual sum of squares and linear regression

Let's continue with the example of trying to use a dog's adult weight to predict its height. You have used random sampling and done your best to make sure your sample is representative of the overall adult dog population. The information you have gathered is in the table below, where the weight is in pounds and the height is in inches.

Table 1 - Dog Weights (in pounds) and Heights (in inches)

| Weight | Height | Weight | Height | Weight | Height |
|---|---|---|---|---|---|
| \(10\) | \(10\) | \(75\) | \(23\) | \(12\) | \(12\) |
| \(63\) | \(25\) | \(80\) | \(25\) | \(45\) | \(22\) |
| \(60\) | \(23\) | \(20\) | \(15\) | \(50\) | \(18\) |
| \(100\) | \(26\) | \(46\) | \(24\) | \(36\) | \(17\) |
| \(6\) | \(12\) | \(62\) | \(23\) | \(95\) | \(27\) |
| \(48\) | \(20\) | \(45\) | \(18\) | \(34\) | \(24\) |
| \(40\) | \(19\) | \(32\) | \(17\) | \(57\) | \(21\) |
| \(50\) | \(21\) | \(19\) | \(10\) | \(37\) | \(23\) |

The first thing to do is make a scatter plot.

Fig. 1 - Scatter plot of the data in the table of dog weights and heights.

Next, you would check for any unusual points in the data.

Unusual Data Points

Let's take a look at the kinds of unusual points you might see that would affect your linear regression analysis.

Outliers

Remember that an outlier is a data point that is an abnormal distance from the other points in the sample. In other words, the response variable (in this case the height of the dog) does not follow the general trend of the other data. Who gets to decide which points are outliers? The person looking at the data, of course! In the scatter plot of the data above you can see that there don't appear to be any real outliers in the data.

High Leverage Points

What makes a data point of your sample a high leverage point?

A high leverage point is one whose explanatory variable value (here, the dog's weight) is unusually far from the mean of the explanatory variable.

A high leverage point can be either far above or far below that mean. Points like this can have a large effect on linear regression.

Influential Points

Influence is a way to measure how much impact an outlier or a high leverage point has on your regression model.

A point is considered to be influential if it unduly influences any part of your regression analysis, like the line of best fit.

While outliers and high leverage points could be influential points, they are not always influential. To decide whether an outlier or a high leverage point is actually influential, you would need to remove it from the data set, recalculate the linear regression, and then see how much the results change. A good way to check is to see whether the \(R^2\) value changes noticeably.

For a reminder about the \(R^2\) value, see the articles Linear Regression and Residuals.

Geometric interpretation of the residual sum of squares

Once you have made a scatter plot of the data, you can check to see if it looks linear. In this case, it might be, but the question is how to draw the line. As you can see in the picture below, any of the three lines drawn look like they might fit the data pretty well.

Fig. 2 - Scatter plot showing three potential lines through the data.

So what makes a line the "best" line? You want a line that is as close to as many data points in the sample as possible. For that, you need to look at the deviation, also called the residual. The residual of a data point is simply how far away the data point is from the potential line of best fit.

Fig. 3 - Scatter plot showing the deviation of two of the data points, one above the line and one below.

A negative residual means the point is below the line, and a positive residual means the point is above the line. If a point lies exactly on the line the residual would be zero. Because the residual could be positive or negative, it is common to look at the square of the residual so things don't get accidentally cancelled out.

Definition of the residual sum of squares

Let's look at the actual definition of the residual sum of squares. You will notice that it can be defined for any line \(y=a+bx\), not just for the line of best fit.

For \(n\) data points,

\[(x_1, y_1), (x_2, y_2), \dots (x_n, y_n),\]

one way to measure the fit of a line \(y=a+bx\) to bivariate data is the sum of squared residuals using the formula

\[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]

The goal is to make the sum of squared residuals as small as possible.

For an explanation of why the residual sum of squares is the best way to go about things, see the article Minimising the Sum of Squares Residual.

You might see the residual at point \((x_i,y_i)\) written as \(\epsilon_i\).
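If you like to experiment, here is a minimal Python sketch of this formula. The function name `residual_sum_of_squares` and the three-point data set are made up for illustration; they are not part of the article's dog data.

```python
def residual_sum_of_squares(a, b, xs, ys):
    """Sum of the squared residuals of the line y = a + b*x over the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical three-point data set: the line y = 1 + 2x has residuals
# 0 at (1, 3), +1 at (2, 6), and -1 at (3, 6), so the RSS is 0 + 1 + 1 = 2.
print(residual_sum_of_squares(1, 2, [1, 2, 3], [3, 6, 6]))  # 2
```

Squaring each residual is what keeps the \(+1\) and \(-1\) in this tiny example from cancelling each other out.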

Formula for residual sum of squares

Now you can define the line of best fit, also known as the least-squares regression line.

The least-squares regression line is the line that minimises the sum of squared deviations to the sample data.

You still need a way to find the least-squares regression line! Thankfully other people have done all the math to find the slope and intercept of the line. The notation in the formulas is:

  • \(n\) is the number of sample points;

  • \(\bar{x}\) is the average of the \(x_i\) values; and

  • \(\bar{y}\) is the average of the \(y_i\) values.

The slope of the least-squares regression line is

\[ b = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 } = \frac{S_{xy}}{S_{xx}} ,\]

the \(y\)-intercept is

\[ a = \bar{y} - b\bar{x},\]

and the equation of the least-squares regression line is

\[ \hat{y} = a+bx,\]

where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.

\(S_{xx}\) and \(S_{xy}\) are called summary statistics, and you may see their formulas written out separately, depending on which learning tools you are using.
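As a quick illustration, here is a hedged Python sketch that computes \(S_{xy}\), \(S_{xx}\), \(b\), and \(a\) directly from the definitions above (the function name `least_squares_line` is mine, not a standard one). The sketch in the example below applies the same idea to the dog data and cross-checks it against Python's standard library.

```python
def least_squares_line(xs, ys):
    """Slope b and intercept a of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of the x values
    y_bar = sum(ys) / n  # mean of the y values
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b
```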

Let's look at an example.

Going back to the table with the dog weights and heights, the dependent variable is the height (these would be the \(y_i\) values), and the independent variable is the weight (these would be the \(x_i\) values). There are \(24\) data points in the table, so \(n=24\). You can calculate

  • \( \bar{x} = 46.75\) and
  • \(\bar{y} = 19.79\),

rounded to two decimal places. Generally, you will use a spreadsheet or calculator to find the values of \(b\) and \(a\), especially when there are lots of data points! Here

  • \( a =11.69\) and
  • \(b = 0.17\),

where both have been rounded to two decimal places. So the equation of the least-squares regression line is

\[ \hat{y} = 11.69 + 0.17x.\]

Fig. 4 - Scatter plot with the line of best fit, also known as the least-squares regression line.

Now that you have a formula for the line, you can find the residual sum of squares for this line. Using the formula,

\[\sum\limits_{i=1}^{24} (y_i - (a+bx_i))^2 \approx 160.58.\]

In fact, the \(R^2\) value, also known as the coefficient of determination, is about \(R^2 = 0.73\), or \(73\%\).
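You can reproduce all of these numbers with a short script. The sketch below is a minimal Python example (3.10+), typing in the dog data from Table 1 and using the standard-library function `statistics.linear_regression` as a cross-check of the hand-rolled formulas above. It computes \(R^2\) as \(1 - \text{RSS}/S_{yy}\), where \(S_{yy} = \sum\limits_{i=1}^n (y_i - \bar{y})^2\) is the total variability in \(y\); see the Linear Regression article for more on that formula.

```python
from statistics import linear_regression, mean

# Dog data from Table 1: weights in pounds, heights in inches.
weights = [10, 75, 12, 63, 80, 45, 60, 20, 50, 100, 46, 36,
           6, 62, 95, 48, 45, 34, 40, 32, 57, 50, 19, 37]
heights = [10, 23, 12, 25, 25, 22, 23, 15, 18, 26, 24, 17,
           12, 23, 27, 20, 18, 24, 19, 17, 21, 21, 10, 23]

b, a = linear_regression(weights, heights)  # returns (slope, intercept)
print(round(a, 2), round(b, 2))  # 11.69 0.17

rss = sum((y - (a + b * x)) ** 2 for x, y in zip(weights, heights))
y_bar = mean(heights)
s_yy = sum((y - y_bar) ** 2 for y in heights)  # total variability in y
print(round(rss, 2))             # about 160.58
print(round(1 - rss / s_yy, 2))  # R^2, about 0.73
```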

Now let's look for influential points.

Going back to the table of data, if you look at the squared deviation of each point in the sample, some points contribute quite a bit more than others to the residual sum of squares. One of them is \( (37, 23)\), with a squared deviation of almost \(24\): its weight is close to the mean, so it is not a high leverage point, but its height sits well above the general trend. That suggests \( (37, 23)\) may be an outlier, and you still need to show whether or not it is an influential point.

It might be the case that \( (37, 23)\) is an influential point. If you remove that point from the sample and then calculate the new \(R^2\) value, you get about \(0.77\), or \(77\%\), with a least-squares regression line of

\[\hat{y} = 11.31 + 0.18x,\]

and a residual sum of squares of \(135.36\).

Remember that the coefficient of determination, \(R^2\), is a measure of the variability in \(y\) that can be explained by a linear relationship between \(x\) and \(y\). The closer to \(1\) that \(R^2\) is, the closer to linear your sample data is. So by removing one point from the data set, you have changed the \(R^2\) value from \(73\%\) to \(77\%\), which is a big change! That means the data point \( (37, 23)\) is in fact an influential point.
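Continuing the previous sketch (it assumes the `weights` and `heights` lists and the imports defined there), here is one way you might carry out this remove-and-refit check in code:

```python
# Drop the suspected influential point (37, 23) and refit.
idx = weights.index(37)
w2 = weights[:idx] + weights[idx + 1:]
h2 = heights[:idx] + heights[idx + 1:]

b2, a2 = linear_regression(w2, h2)
print(round(a2, 2), round(b2, 2))  # 11.31 0.18

rss2 = sum((y - (a2 + b2 * x)) ** 2 for x, y in zip(w2, h2))
y_bar2 = mean(h2)
s_yy2 = sum((y - y_bar2) ** 2 for y in h2)
print(round(rss2, 2))              # about 135.36
print(round(1 - rss2 / s_yy2, 2))  # R^2 rises to about 0.77
```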

Remember that variability can be decreased by increasing the sample size. See Unbiased Point Estimates for more information.

Once you have the least-squares regression line, what can you do with it?

Examples of residual sum of squares

There are a couple of important things to consider when using the least-squares regression line to make a prediction.

  • The least-squares regression line is a predictor of the population, not an individual.

  • Using the least-squares regression line to make a prediction for a value outside the range of the collected data might not work very well.

Let's look at an example of the kinds of problems that can occur when these considerations are ignored.

Fig. 5 - Bulldogs are an example of why you can't necessarily make a prediction about an individual from a least-squares regression line.

Going back to the dog weight/height information, and using the least-squares regression line

\[\hat{y} = 11.31 + 0.18x,\]

what can you predict about the height of a bulldog that weighs \(65\) pounds?

Answer:

Simply plugging in the weight of the bulldog, you get

\[\hat{y} = 11.31 + 0.18(65) = 23.01,\]

so the least-squares regression line predicts that the bulldog would be \(23.01\) inches tall. However, a bulldog of this weight will actually be about \(15\) inches tall, which is quite a difference! This is an example of why you can use the least-squares regression line to make a prediction about dogs in general (i.e. the population of dogs) and not about specific dogs.

What about a dog that has a weight of more than \(100\) pounds?

Fig. 6 - Bull mastiff dogs are definitely one to a kid-sized wading pool!

A male bull mastiff dog can easily weigh \(130\) pounds. This is outside the range of the data collected in the table. When you use the least-squares regression line to make a prediction, you find that a bull mastiff dog should be

\[\hat{y} = 11.31 + 0.18(130) = 34.71\, \text{in},\]

tall. However, in general this dog won't be more than \(27\) inches tall, which is considerably less than what the least-squares regression line predicts! That is because the weight of the dog is quite far outside the range of the collected data, so the least-squares regression line isn't a very good predictor there.
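If you wanted code to guard against this kind of extrapolation, a sketch like the one below could flag predictions outside the observed weight range of \(6\) to \(100\) pounds from Table 1. The `predict_height` helper and its warning message are purely illustrative, not part of any standard library:

```python
# Observed weight range in Table 1 runs from 6 to 100 pounds.
def predict_height(weight, a=11.31, b=0.18, lo=6, hi=100):
    """Predict a dog's height, warning when the weight is an extrapolation."""
    if not lo <= weight <= hi:
        print(f"Warning: {weight} lb is outside the observed range "
              f"[{lo}, {hi}] lb, so this prediction may be unreliable.")
    return a + b * weight

print(round(predict_height(65), 2))   # 23.01, inside the data range
print(round(predict_height(130), 2))  # 34.71, but flagged as extrapolation
```

Of course, a warning doesn't fix the prediction; it just reminds you that the line was fit to data that never included a dog this heavy.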

Residual Sum of Squares - Key takeaways

  • The residual of a data point is how far away the data point is from the potential line of best fit. Residuals can be positive or negative.
  • For \(n\) data points,

    \[(x_1, y_1), (x_2, y_2), \dots (x_n, y_n),\]

    one way to measure the fit of a line \(y=a+bx\) to bivariate data is the residual sum of squares using the formula

    \[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]

  • The least-squares regression line is the line that minimises the residual sum of squares.
  • The slope of the least-squares regression line is

    \[ \begin{align} b &=\frac{S_{xy}}{S_{xx}} \\ & = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 }, \end{align}\]

    the \(y\)-intercept is

    \[ a = \bar{y} - b\bar{x},\]

    and the equation of the least-squares regression line is

    \[ \hat{y} = a+bx,\]

    where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.

Frequently Asked Questions about Residual Sum of Squares

How do you calculate the residual sum of squares?

Find the residual for each observation, and then square it. Add all of those together and you get the residual sum of squares.

What is the residual sum of squares?

It is a way to measure how far your line of best fit deviates from the observations.

What do RSS and ESS stand for?

RSS = residual sum of squares.

ESS = explained sum of squares.

What does the residual sum of squares measure?

It measures the level of variance in the residuals of a regression model.

What is an example of using the residual sum of squares?

An example of using the residual sum of squares is checking to see how well the observations in a data set fit the least-squares regression line. This can help you locate influential points.
