Residual Sum of Squares

Suppose you thought that a dog's height could be predicted by its weight. How could you tell if there was really a relationship there? One way would be to choose a random sample of dogs, collect their weights and heights, and then graph your data. Surely if there is a relationship it would show up on a graph, right? But even if it looks like there is a linear relationship, how could you be sure? The principle of least-squares regression, which works by minimising the residual sum of squares, can help you tell just how good a dog's weight is at predicting its height.


Residual sum of squares and linear regression

Let's continue with the example of trying to use a dog's adult weight to predict its height. You have used random sampling and done your best to make sure your sample is representative of the overall adult dog population. The information you have gathered is in the table below, where the weight is in pounds and the height is in inches.

Table 1 - Dog Weights (in pounds) and Heights (in inches)

| Weight | Height | Weight | Height | Weight | Height |
|---|---|---|---|---|---|
| \(10\) | \(10\) | \(75\) | \(23\) | \(12\) | \(12\) |
| \(63\) | \(25\) | \(80\) | \(25\) | \(45\) | \(22\) |
| \(60\) | \(23\) | \(20\) | \(15\) | \(50\) | \(18\) |
| \(100\) | \(26\) | \(46\) | \(24\) | \(36\) | \(17\) |
| \(6\) | \(12\) | \(62\) | \(23\) | \(95\) | \(27\) |
| \(48\) | \(20\) | \(45\) | \(18\) | \(34\) | \(24\) |
| \(40\) | \(19\) | \(32\) | \(17\) | \(57\) | \(21\) |
| \(50\) | \(21\) | \(19\) | \(10\) | \(37\) | \(23\) |

The first thing to do is make a scatter plot.

Fig. 1 - Scatter plot of the data in the table of dog weights and heights.

Next, you would check for any unusual points in the data.

Unusual Data Points

Let's take a look at the kinds of unusual points you might see that would affect your linear regression analysis.

Outliers

Remember that an outlier is a data point that is an abnormal distance from the other points in the sample. In other words, the response variable (in this case the height of the dog) does not follow the general trend of the other data. Who gets to decide which points are outliers? The person looking at the data, of course! In the scatter plot of the data above you can see that there don't appear to be any real outliers in the data.

High Leverage Points

What makes a data point of your sample a high leverage point?

A high leverage point is one whose explanatory variable value (here, the dog's weight) is unusually far from the mean of the explanatory variable.

A high leverage point can be either far above or far below that mean. Points like this can have a large effect on linear regression.

Influential Points

Influence is a way to measure how much impact an outlier or a high leverage point has on your regression model.

A point is considered to be influential if it unduly influences any part of your regression analysis, like the line of best fit.

While outliers and high leverage points could be influential points, they are not always influential. To decide whether an outlier or a high leverage point is actually influential, you would need to remove it from the data set, recalculate the linear regression, and then see how much the results change. A good way to check is to see whether the \(R^2\) value changes noticeably.

For a reminder about the \(R^2\) value, see the articles Linear Regression and Residuals.

Geometric interpretation of the residual sum of squares

Once you have made a scatter plot of the data, you can check to see if it looks linear. In this case, it might be, but the question is how to draw the line. As you can see in the picture below, any of the three lines drawn look like they might fit the data pretty well.

Fig. 2 - Scatter plot showing three potential lines through the data.

So what makes a line the "best" line? You want a line that is as close to as many data points in the sample as possible. For that, you need to look at the deviation, also called the residual. The residual of a data point is simply how far away the data point is from the potential line of best fit.

Fig. 3 - Scatter plot showing the deviation of two of the data points, one above the line and one below.

A negative residual means the point is below the line, and a positive residual means the point is above the line. If a point lies exactly on the line the residual would be zero. Because the residual could be positive or negative, it is common to look at the square of the residual so things don't get accidentally cancelled out.

Definition of the residual sum of squares

Let's look at the actual definition of the residual sum of squares. You will notice that it can be defined for any line \(y=a+bx\), not just for the line of best fit.

For \(n\) data points,

\[(x_1, y_1), (x_2, y_2), \dots (x_n, y_n),\]

one way to measure the fit of a line \(y=a+bx\) to bivariate data is the sum of squared residuals using the formula

\[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]

The goal is to make the sum of squared residuals as small as possible.

For an explanation of why the residual sum of squares is the best way to go about things, see the article Minimising the Sum of Squares Residual.

You might see the residual at point \((x_i,y_i)\) written as \(\epsilon_i\).
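If you like to experiment, here is a minimal Python sketch of this formula. The function name `residual_sum_of_squares` and the three-point data set are made up for illustration; they are not part of the article's dog data.

```python
def residual_sum_of_squares(a, b, xs, ys):
    """Sum of the squared residuals of the line y = a + b*x over the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical three-point data set: the line y = 1 + 2x has residuals
# 0 at (1, 3), +1 at (2, 6), and -1 at (3, 6), so the RSS is 0 + 1 + 1 = 2.
print(residual_sum_of_squares(1, 2, [1, 2, 3], [3, 6, 6]))  # 2
```

Squaring each residual is what keeps the \(+1\) and \(-1\) in this tiny example from cancelling each other out.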

Formula for residual sum of squares

Now you can define the line of best fit, also known as the least-squares regression line.

The least-squares regression line is the line that minimises the sum of squared deviations to the sample data.

You still need a way to find the least-squares regression line! Thankfully other people have done all the math to find the slope and intercept of the line. The notation in the formulas is:

  • \(n\) is the number of sample points;

  • \(\bar{x}\) is the average of the \(x_i\) values; and

  • \(\bar{y}\) is the average of the \(y_i\) values.

The slope of the least-squares regression line is

\[ b = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 } = \frac{S_{xy}}{S_{xx}} ,\]

the \(y\)-intercept is

\[ a = \bar{y} - b\bar{x},\]

and the equation of the least-squares regression line is

\[ \hat{y} = a+bx,\]

where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.

\(S_{xx}\) and \(S_{xy}\) are called summary statistics, and you may see their formulas written out separately, depending on which learning tools you are using.
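As a quick illustration, here is a hedged Python sketch that computes \(S_{xy}\), \(S_{xx}\), \(b\), and \(a\) directly from the definitions above (the function name `least_squares_line` is mine, not a standard one). The sketch in the example below applies the same idea to the dog data and cross-checks it against Python's standard library.

```python
def least_squares_line(xs, ys):
    """Slope b and intercept a of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of the x values
    y_bar = sum(ys) / n  # mean of the y values
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b
```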

Let's look at an example.

Going back to the table with the dog weights and heights, the dependent variable is the height (these would be the \(y_i\) values), and the independent variable is the weight (these would be the \(x_i\) values). There are \(24\) data points in the table, so \(n=24\). You can calculate

  • \( \bar{x} = 46.75\) and
  • \(\bar{y} = 19.79\),

rounded to two decimal places. Generally, you will use a spreadsheet or calculator to find the values of \(b\) and \(a\), especially when there are lots of data points! Here

  • \( a =11.69\) and
  • \(b = 0.17\),

where both have been rounded to two decimal places. So the equation of the least-squares regression line is

\[ \hat{y} = 11.69 + 0.17x.\]

Fig. 4 - Scatter plot with the line of best fit, also known as the least-squares regression line.

Now that you have a formula for the line, you can find the residual sum of squares for this line. Using the formula,

\[\sum\limits_{i=1}^{24} (y_i - (a+bx_i))^2 \approx 160.58.\]

In fact, the \(R^2\) value, also known as the coefficient of determination, is about \(R^2 = 0.73\), or \(73\%\).
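You can reproduce all of these numbers with a short script. The sketch below is a minimal Python example (3.10+), typing in the dog data from Table 1 and using the standard-library function `statistics.linear_regression` as a cross-check of the hand-rolled formulas above. It computes \(R^2\) as \(1 - \text{RSS}/S_{yy}\), where \(S_{yy} = \sum\limits_{i=1}^n (y_i - \bar{y})^2\) is the total variability in \(y\); see the Linear Regression article for more on that formula.

```python
from statistics import linear_regression, mean

# Dog data from Table 1: weights in pounds, heights in inches.
weights = [10, 75, 12, 63, 80, 45, 60, 20, 50, 100, 46, 36,
           6, 62, 95, 48, 45, 34, 40, 32, 57, 50, 19, 37]
heights = [10, 23, 12, 25, 25, 22, 23, 15, 18, 26, 24, 17,
           12, 23, 27, 20, 18, 24, 19, 17, 21, 21, 10, 23]

b, a = linear_regression(weights, heights)  # returns (slope, intercept)
print(round(a, 2), round(b, 2))  # 11.69 0.17

rss = sum((y - (a + b * x)) ** 2 for x, y in zip(weights, heights))
y_bar = mean(heights)
s_yy = sum((y - y_bar) ** 2 for y in heights)  # total variability in y
print(round(rss, 2))             # about 160.58
print(round(1 - rss / s_yy, 2))  # R^2, about 0.73
```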

Now let's look for influential points.

Going back to the table of data, if you look at the squared deviation of each point in the sample, some points contribute quite a bit more than others to the residual sum of squares. One of them is \( (37, 23)\), with a squared deviation of almost \(24\): its weight is close to the mean, so it is not a high leverage point, but its height sits well above the general trend. That suggests \( (37, 23)\) may be an outlier, and you still need to show whether or not it is an influential point.

It might be the case that \( (37, 23)\) is an influential point. If you remove that point from the sample and then calculate the new \(R^2\) value, you get about \(0.77\), or \(77\%\), with a least-squares regression line of

\[\hat{y} = 11.31 + 0.18x,\]

and a residual sum of squares of \(135.36\).

Remember that the coefficient of determination, \(R^2\), is a measure of the variability in \(y\) that can be explained by a linear relationship between \(x\) and \(y\). The closer to \(1\) that \(R^2\) is, the closer to linear your sample data is. So by removing one point from the data set, you have changed the \(R^2\) value from \(73\%\) to \(77\%\), which is a big change! That means the data point \( (37, 23)\) is in fact an influential point.
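Continuing the previous sketch (it assumes the `weights` and `heights` lists and the imports defined there), here is one way you might carry out this remove-and-refit check in code:

```python
# Drop the suspected influential point (37, 23) and refit.
idx = weights.index(37)
w2 = weights[:idx] + weights[idx + 1:]
h2 = heights[:idx] + heights[idx + 1:]

b2, a2 = linear_regression(w2, h2)
print(round(a2, 2), round(b2, 2))  # 11.31 0.18

rss2 = sum((y - (a2 + b2 * x)) ** 2 for x, y in zip(w2, h2))
y_bar2 = mean(h2)
s_yy2 = sum((y - y_bar2) ** 2 for y in h2)
print(round(rss2, 2))              # about 135.36
print(round(1 - rss2 / s_yy2, 2))  # R^2 rises to about 0.77
```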

Remember that variability can be decreased by increasing the sample size. See Unbiased Point Estimates for more information.

Once you have the least-squares regression line, what can you do with it?

Examples of residual sum of squares

There are a couple of important things to consider when using the least-squares regression line to make a prediction.

  • The least-squares regression line is a predictor of the population, not an individual.

  • Using the least-squares regression line to make a prediction for a value outside the range of the collected data might not work very well.

Let's look at an example of the kinds of problems that can occur when these considerations are ignored.

Fig. 5 - Bulldogs are an example of why you can't necessarily make a prediction about an individual from a least-squares regression line.

Going back to the dog weight/height information, and using the least-squares regression line

\[\hat{y} = 11.31 + 0.18x,\]

what can you predict about the height of a bulldog that weighs \(65\) pounds?

Answer:

Simply plugging in the weight of the bulldog, you get

\[\hat{y} = 11.31 + 0.18(65) = 23.01,\]

so the least-squares regression line predicts that the bulldog would be \(23.01\) inches tall. However, a bulldog of this weight will actually be about \(15\) inches tall, which is quite a difference! This is an example of why you can use the least-squares regression line to make a prediction about dogs in general (i.e. the population of dogs) and not about specific dogs.

What about a dog that has a weight of more than \(100\) pounds?

Fig. 6 - Bull mastiff dogs are definitely one to a kid-sized wading pool!

A male bull mastiff dog can easily weigh \(130\) pounds. This is outside the range of the data collected in the table. When you use the least-squares regression line to make a prediction, you find that a bull mastiff dog should be

\[\hat{y} = 11.31 + 0.18(130) = 34.71\, \text{in},\]

tall. However, in general this dog won't be more than \(27\) inches tall, which is considerably less than what the least-squares regression line predicts! That is because the weight of the dog is quite far outside the range of the collected data, so the least-squares regression line isn't a very good predictor there.
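If you wanted code to guard against this kind of extrapolation, a sketch like the one below could flag predictions outside the observed weight range of \(6\) to \(100\) pounds from Table 1. The `predict_height` helper and its warning message are purely illustrative, not part of any standard library:

```python
# Observed weight range in Table 1 runs from 6 to 100 pounds.
def predict_height(weight, a=11.31, b=0.18, lo=6, hi=100):
    """Predict a dog's height, warning when the weight is an extrapolation."""
    if not lo <= weight <= hi:
        print(f"Warning: {weight} lb is outside the observed range "
              f"[{lo}, {hi}] lb, so this prediction may be unreliable.")
    return a + b * weight

print(round(predict_height(65), 2))   # 23.01, inside the data range
print(round(predict_height(130), 2))  # 34.71, but flagged as extrapolation
```

Of course, a warning doesn't fix the prediction; it just reminds you that the line was fit to data that never included a dog this heavy.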

Residual Sum of Squares - Key takeaways

  • The residual of a data point is how far away the data point is from the potential line of best fit. Residuals can be positive or negative.
  • For \(n\) data points,

    \[(x_1, y_1), (x_2, y_2), \dots (x_n, y_n),\]

    one way to measure the fit of a line \(y=a+bx\) to bivariate data is the residual sum of squares using the formula

    \[\sum\limits_{i=1}^n (y_i - (a+bx_i))^2.\]

  • The least-squares regression line is the line that minimises the residual sum of squares.
  • The slope of the least-squares regression line is

    \[ \begin{align} b &=\frac{S_{xy}}{S_{xx}} \\ & = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{ \sum\limits_{i=1}^n(x_i - \bar{x})^2 }, \end{align}\]

    the \(y\)-intercept is

    \[ a = \bar{y} - b\bar{x},\]

    and the equation of the least-squares regression line is

    \[ \hat{y} = a+bx,\]

    where \(\hat{y}\) is the predicted value that results from substituting a given \(x\) into the equation.

Frequently Asked Questions about Residual Sum of Squares

How do you calculate the residual sum of squares?

Find the residual for each observation, and then square it. Add all of those together and you get the residual sum of squares.

What is the residual sum of squares?

It is a way to measure how far your line of best fit deviates from the observations.

What do RSS and ESS stand for?

RSS = residual sum of squares.

ESS = explained sum of squares.

What does the residual sum of squares measure?

It measures the level of variance in the residuals of a regression model.

What is an example of using the residual sum of squares?

An example of using the residual sum of squares is checking to see how well the observations in a data set fit the least-squares regression line. This can help you locate influential points.
