Least Squares Linear Regression

Imagine you've collected data from students on their exam marks and the number of hours they studied. Plotted on a scatter graph, the data appear to show a positive linear relationship between exam mark and number of hours studied.

Can you use this data to predict someone's grade based on the number of hours studied?

Using linear regression, it is actually possible to make a reasonable estimate based on past data. This article will show you how to find the Least Squares Linear Regression line in order to make predictions based on data already collected.

Least Squares Linear Regression explanation

When analysing bivariate data, you have two variables: the dependent or response variable, usually denoted by \(y\), and the independent or explanatory variable, usually denoted by \(x\).

When \(y\) is the dependent variable and \(x\) is the independent variable, you can say '\(y\) depends on \(x\)'.

Suppose you have collected data on two variables \(y\) and \(x\) where the result of \(y\) depends on \(x\). There also appears to be a linear relationship between the variables. How would you go about predicting a value of \(y\) for a given value of \(x\)?

At GCSE, you may have had to draw a line of best fit where you would use your own judgement to determine in which "direction" the data was going. The least squares regression line does this mathematically.

A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.

Residuals

If you've seen any bivariate data you'll know that very rarely do the data points fall exactly along a straight line, even if there is a confirmed linear 'relationship' between variables.

There could be a number of reasons for these deviations (e.g. other factors affecting the dependent variable, or inaccurate readings when collecting the data). There are so many possible causes that you can treat the deviations as entirely random.

In the image below, you can see a 'line of best fit' for the data points \((x_1,y_1)\), \((x_2,y_2)\), \((x_3,y_3)\) and \((x_4,y_4)\). Note that the line does not touch any of these points.

The vertical difference between these points and the line of best fit is labelled with \(\epsilon _1\), \(\epsilon _2\), \(\epsilon _3\) and \(\epsilon _4\). These are the residuals associated with each data point.

Figure: Least squares regression line with residuals. An upward-sloping line of best fit, with vertical dotted lines labelled \(\epsilon _1\) to \(\epsilon _4\) joining each data point to the line.

The difference between the observed value of the dependent variable (\(y_i\)) and the value predicted by the line at \(x_i\) is called the residual (\(\epsilon _i\)).

Although these residuals mean that the prediction is not 100% accurate, they are in fact crucial to how you find the least squares regression line: by minimising the squares of these residuals. Hence the name "least squares regression".

The least squares regression line of \(y\) on \(x\) is the line which minimises the sum of the squares of the residuals,

$$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$

where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).
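The quantity being minimised can be sketched in a few lines of Python. This is only an illustration: the function name, the parameter names `a` (gradient) and `b` (intercept) of a candidate line \(y=ax+b\), and the three data points are all made up for the example.

```python
def sum_of_squared_residuals(xs, ys, a, b):
    """Return the sum of squared residuals of the line y = a*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a * x + b)  # observed value minus predicted value
        total += residual ** 2
    return total


# Made-up illustrative data lying exactly on y = 2x + 1.
xs = [0, 1, 2]
ys = [1, 3, 5]

print(sum_of_squared_residuals(xs, ys, a=2, b=1))  # 0.0: the line passes through every point
print(sum_of_squared_residuals(xs, ys, a=2, b=0))  # 3.0: each residual equals 1
```

The least squares regression line is the choice of gradient and intercept that makes this total as small as possible for the given data.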

Least Squares Linear Regression method

The least squares linear regression method is used to find the regression line. The main objective of this method is to minimise the sum of the squares of the residuals of the data points in a data set.

Deriving the Least Squares Linear Regression line

Although this may sound complicated, actually finding the regression line is pretty straightforward.

As with finding any straight line in mathematics, you need two things: a \(y\)-intercept and a gradient. Luckily, there is a straightforward formula for finding these.
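For readers who have met partial differentiation, the following sketch indicates where the formulae for the gradient and intercept come from. It goes beyond what is needed to use the method, and the name \(R(a,b)\) is introduced here purely for illustration, with \(a\) and \(b\) the gradient and \(y\)-intercept of a line \(y=ax+b\) and \(n\) the number of data points. The sum of the squares of the residuals is

$$R(a,b)=\sum (y_i-ax_i-b)^2$$

Setting the derivative with respect to \(b\) equal to zero gives

$$\sum (y_i-ax_i-b)=0 \quad\Rightarrow\quad b=\bar{y}-a\bar{x}$$

Setting the derivative with respect to \(a\) equal to zero gives

$$\sum x_i(y_i-ax_i-b)=0$$

Substituting \(b=\bar{y}-a\bar{x}\) and using \(\sum x_i=n\bar{x}\) and \(\sum y_i=n\bar{y}\), this rearranges to

$$\sum x_iy_i-\dfrac{\sum x_i \sum y_i}{n}=a\left(\sum x_i^2-\dfrac{(\sum x_i)^2}{n}\right)$$

so the gradient \(a\) is the left-hand side divided by the bracket on the right. Since \(R\) is a sum of squares, this stationary point is a minimum, provided the \(x_i\) are not all equal.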

Least Squares Linear Regression formula

The regression line of \(y\) on \(x\) is

$$y=ax+b$$

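where \(a=\dfrac{S_{xy}}{S_{xx}}\) is the gradient and \(b=\bar{y}-a\bar{x}\) is the \(y\)-intercept. Here \(\bar{x}\) and \(\bar{y}\) are the means of the \(x\) and \(y\) values, \(n\) is the number of data points, and

$$S_{xy}=\sum x_iy_i - \dfrac{\sum x_i \sum y_i}{n}$$

$$S_{xx}=\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$$

$$S_{yy}=\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$$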

The summary statistics \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) may be given to you in an exam, or you may need to find them from the raw data using a calculator.
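As a rough sketch of how these formulae translate into a calculation from raw data, the Python function below computes \(S_{xy}\), \(S_{xx}\), the gradient and the intercept. The function name `least_squares_line` and the test data are assumptions made for illustration; in an exam you would use your calculator instead.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y = a*x + b of y on x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    if s_xx == 0:
        raise ValueError("all x values are equal, so the gradient is undefined")

    a = s_xy / s_xx                # gradient
    b = sum_y / n - a * sum_x / n  # intercept: y-bar minus a times x-bar
    return a, b


# Made-up data lying exactly on y = 2x + 1, so the fit should recover a = 2, b = 1.
a, b = least_squares_line([1, 2, 3], [3, 5, 7])
print(a, b)  # 2.0 1.0
```

The check for \(S_{xx}=0\) is a design choice of this sketch: if every \(x\) value is the same, the formula for the gradient would divide by zero.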

Least Squares Linear Regression solved example

You are now ready to apply this method to a possible exam question.

The number of hours students studied and their exam results are recorded in the table below.

Time studied in hours, \(x\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\)
Exam result (%), \(y\) | \(49\) | \(81\) | \(71\) | \(83\) | \(99\)

a. Calculate \(S_{xy}\) and \(S_{xx}\).
b. Find the regression line of \(y\) on \(x\).

c. Plot the data points and the regression line on the same graph.

d. Interpret the meaning of \(a=10.2\) and \(b=46\) in the context of the question.

e. Predict the grade for a student who studies for

i) \(2.5\) hours

ii) \(8\) hours.

f. Comment on your answers for part e).

Solution

a. Using your calculator, you can easily find the following results,

\(n=5\), \(\sum x=15\), \(\sum x^2=55\), \(\bar{x}=3\), \(\sum xy=1251\), \(\sum y=383\), \(\sum y^2=30693\), \(\bar{y}=76.6\).

Simply plug these results into the formulae detailed above to get the summary statistics.

\( \begin{align} S_{xx} &=\sum x^2 - \dfrac{(\sum x)^2}{n} \\&= 55 - \dfrac{15^2}{5} \\&= 10. \end{align}\)

\( \begin{align} S_{xy} &= \sum xy - \dfrac{\sum x \sum y}{n}\\&= 1251 - \dfrac{15 \times 383}{5} \\&= 102. \end{align}\)
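If you want to check the calculator work in part a, a short Python sketch such as the one below recomputes the sums from the table and then the two summary statistics. The variable names are chosen here for illustration only.

```python
# Data from the table in the question.
hours = [1, 2, 3, 4, 5]
results = [49, 81, 71, 83, 99]

n = len(hours)
sum_x = sum(hours)                                      # 15
sum_y = sum(results)                                    # 383
sum_x2 = sum(x * x for x in hours)                      # 55
sum_xy = sum(x * y for x, y in zip(hours, results))     # 1251

s_xx = sum_x2 - sum_x ** 2 / n      # 55 - 225/5
s_xy = sum_xy - sum_x * sum_y / n   # 1251 - 5745/5

print(s_xx, s_xy)  # 10.0 102.0
```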

b. Starting with \(a\), the gradient of the line,

\[a=\dfrac{S_{xy}}{S_{xx}}=\frac{102}{10}=10.2.\]

Then, the \(y\)-intercept is

\(b=\bar{y}-a\bar{x}=76.6-10.2 \times 3=46\).

Therefore, the regression line is \(y=10.2x+46\).
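One way to convince yourself that \(y=10.2x+46\) really is the least squares line is to compare its sum of squared residuals with that of slightly different lines. The sketch below does this for the data in the table; the helper name and the particular "nudged" lines are illustrative choices, not part of the question.

```python
# Data from the table in the question.
hours = [1, 2, 3, 4, 5]
results = [49, 81, 71, 83, 99]


def sum_of_squared_residuals(a, b):
    """Sum of squared residuals of y = a*x + b over the table's data."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(hours, results))


fitted = sum_of_squared_residuals(10.2, 46)
print(round(fitted, 1))  # 314.8

# Nudging the gradient or the intercept should only increase the total.
for a, b in [(10.0, 46), (10.4, 46), (10.2, 45), (10.2, 47)]:
    assert sum_of_squared_residuals(a, b) > fitted
```

Trying a handful of nearby lines is not a proof, but it is a quick sanity check on the values of \(a\) and \(b\).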

c. This is a great question for double-checking your working - it'll be pretty obvious if you've made any serious calculation errors!

Figure: Least squares regression line, example. An upward-sloping regression line through 5 data points.

d. Since \(a=10.2\), each additional hour of study is associated with a predicted increase of \(10.2\) marks in the exam result.

Since \(b=46\), if a student weren't to study at all, they would still (according to the regression line) receive 46 marks.

e. Simply input the above numbers for \(x\).

i) If \(x=2.5\), \(y=10.2\times 2.5+46=71.5\).

ii) If \(x=8\), \(y=10.2\times 8+46=127.6\).

f. There is a fundamental problem with the answer to part e ii): since the exams are graded in percentages, the grade \(127.6\) doesn't exist! The data contain no information about what happens to students' grades for study times longer than 5 hours.

While you could deduce that for any length of time above 5 hours, 100% would be a good prediction, this is beyond the scope of the data and the linear regression model.

Keep in mind that a regression line should only be used to predict values that fall within the range of the data from which it was derived. This is called interpolation.

Making predictions outside this range is called extrapolation, and it is less reliable since the data may behave differently there.
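The distinction can be made explicit in a small Python sketch that returns a prediction together with a flag saying whether it is an interpolation. The function name `predict` and its default arguments (the line \(y=10.2x+46\) and the observed range of 1 to 5 hours from the example) are assumptions made for illustration.

```python
def predict(x, a=10.2, b=46, x_min=1, x_max=5):
    """Return (predicted y, True if x lies within the observed range of x)."""
    is_interpolation = x_min <= x <= x_max
    return a * x + b, is_interpolation


for hours in (2.5, 8):
    value, ok = predict(hours)
    label = "interpolation" if ok else "extrapolation, treat with caution"
    print(f"{hours} hours -> {value:.1f} ({label})")
```

Returning the flag rather than refusing to predict is a design choice: the extrapolated value is still computed, but the caller is told that it lies outside the range of the data.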

The most difficult thing in this topic is making sure you enter the correct numbers into your calculator! Make sure you double-check your calculations in the exam so you don't lose easy marks.

Least Squares Linear Regression - Key takeaways

  • A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.
  • The difference between the observed dependent variable (\(y_i\)) and the predicted dependent variable is called the residual (\(\epsilon _i\)).
  • The least squares regression line of \(y\) on \(x\) is the line which minimises the sum of the squares of the residuals:

    $$\epsilon _1 ^2 +\epsilon _2 ^2 + \epsilon _3 ^2 + ...$$

    where \(\epsilon _i\) is the residual of data point \((x_i,y_i)\).

  • The regression line of \(y\) on \(x\) is

    $$y=ax+b$$

    where \(a=\dfrac{S_{xy}}{S_{xx}}\) and \(b=\bar{y}-a\bar{x}\).

  • The summary statistics are:

    \(S_{xy}=\sum xy - \dfrac{\sum x \sum y}{n}\)

    \(S_{xx}=\sum x^2 - \dfrac{(\sum x)^2}{n}\)

    \(S_{yy}=\sum y^2 - \dfrac{(\sum y)^2}{n}\)

Frequently Asked Questions about Least Squares Linear Regression

You can find the least squares regression line either from the raw data or from summary statistics.

The least squares method is a type of linear regression analysis.

The SSE is the sum of squares error and the SST is the sum of squares total. You do not need to know these at A-level.

Least squares regression is used for predicting a dependent variable given an independent variable using data you have collected.

A least squares regression line is used to predict the values of the dependent variable for a given independent variable when analysing bivariate data.

The least squares regression line is that which minimises the sum of the squares of the residuals.
