Chi Square Test for Independence

Say your city is trying to encourage its residents to recycle their household trash, so they come up with two methods for asking them to do so:


  1. mailing an educational pamphlet; and

  2. calling each resident.

Then, the city randomly selects \(200\) households and randomly assigns them to one of three categories:

  1. receiving the pamphlet;

  2. receiving a phone call;

  3. the control group (no form of intervention).

Finally, the city will use the results of this test to decide on the best way to encourage its residents to recycle more.

Can you guess which hypothesis test they will use to make this decision? A Chi-square test for independence!

Chi-Square Test of Independence Definition

Occasionally, you want to know if there is a relationship between two categorical variables.

Think of it this way:

If you know something about one variable, can you use that information to learn about the other variable?

You can use a Chi-square test of independence to do just that.

A Chi-square \( (\chi^{2}) \) test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.

If there is a relationship between the two categorical variables, then knowing the value of one variable tells you something about the value of the other variable.

If there is no relationship between the two categorical variables, then they are independent.

Assumptions for a Chi-Square Test of Independence

All the Pearson Chi-square tests, for independence, homogeneity, and goodness of fit, share the same basic assumptions. The main difference is how the assumptions apply in practice. To be able to use this test, the assumptions for a Chi-square test of independence are:

  • The two variables must be categorical.

    • This Chi-square test uses cross-tabulation, counting the observations that fall in each category (see the code sketch after this list).

  • Groups must be mutually exclusive; that is, each observation belongs to one and only one subgroup, and the sample is randomly selected.

    • Continuing from the introductory example, three months after the city's intervention methods are tested, they look at the outcome and put the data into a contingency table. The groups that must be mutually exclusive are the subgroups: (Recycles-Pamphlet), (Does Not Recycle-Control), etc.

Table 1. Contingency table, Chi-square test for independence.

| Intervention | Recycles | Does Not Recycle | Row Totals |
| --- | --- | --- | --- |
| Pamphlet | 46 | 18 | 56 |
| Phone Call | 47 | 19 | 77 |
| Control | 49 | 21 | 67 |
| Column Totals | 142 | 58 | \(n = 200\) |

  • Expected counts must be at least \(5\).

    • This means the sample size must be large enough, although how large is difficult to determine beforehand. In general, making sure every expected count is at least \(5\) should be fine.

  • Observations must be independent.

    • This is about how the data is collected. In the city recycling example, the researcher should not sample households that are near each other: neighboring households on the same street are more likely to influence each other's recycling habits than households chosen from different neighborhoods.
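
If your data starts out with one row per observation rather than as counts, the cross-tabulation mentioned above can be built directly in software. Below is a minimal Python sketch using pandas.crosstab; the data frame and the column names intervention and recycles are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical raw data: one row per household (made up for illustration).
data = pd.DataFrame({
    "intervention": ["Pamphlet", "Phone Call", "Control", "Pamphlet", "Control"],
    "recycles":     ["Yes", "No", "Yes", "Yes", "No"],
})

# Cross-tabulate the two categorical variables into a contingency table,
# with row and column totals added as margins.
contingency = pd.crosstab(
    data["intervention"], data["recycles"],
    margins=True, margins_name="Total",
)
print(contingency)
```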

Null Hypothesis and Alternative Hypothesis for a Chi-Square Test of Independence

When it comes to independence of variables, you almost always assume that the two variables are independent, then look for evidence that they aren’t.

  • The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]

  • The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]

Notice that the Chi-square test for independence makes no claims about the kind of relationship between the two categorical variables, only whether a relationship exists.

Replacing “Variable A” and “Variable B” with the variables in the city recycling example, you get:

Your population is all the households in your city.

  • Null Hypothesis \[ \begin{align}H_{0}: &\text{“if a household recycles” and} \\&\text{“the type of intervention received”} \\&\text{are not related.}\end{align} \]
  • Alternative Hypothesis \[ \begin{align}H_{a}: &\text{“if a household recycles” and} \\&\text{“the type of intervention received”} \\&\text{are related.}\end{align} \]

Expected Frequencies of a Chi-Square Test of Independence

As with other Chi-square tests, a Chi-square test of independence works by comparing your observed and expected frequencies. You calculate expected frequencies using the contingency table. So, the expected frequency for row \(r\) and column \(c\) is given by the formula:

\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]

where,

  • \(E_{r,c}\) is the expected frequency for population (or, row) \(r\) at level (or, column) \(c\) of the categorical variable,

  • \(r\) indexes the populations, which form the rows of the contingency table,

  • \(c\) indexes the levels of the categorical variable, which form the columns of the contingency table,

  • \(n_{r}\) is the number of observations from population (or, row) \(r\),

  • \(n_{c}\) is the number of observations from level (or, column) \(c\) of the categorical variable, and

  • \(n\) is the total sample size.

Continuing with the city recycling example:

Your city now calculates the expected frequencies using the formula above and the contingency table.

  • \(E_{1,1}=\frac{56 \cdot 142}{200} = 39.76\)
  • \(E_{1,2}=\frac{56 \cdot 58}{200} = 16.24\)
  • \(E_{2,1}=\frac{77 \cdot 142}{200} = 54.67\)
  • \(E_{2,2}=\frac{77 \cdot 58}{200} = 22.33\)
  • \(E_{3,1}=\frac{67 \cdot 142}{200} = 47.57\)
  • \(E_{3,2}=\frac{67 \cdot 58}{200} = 19.43\)

Table 2. Contingency table with observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Recycles | Does Not Recycle | Row Totals |
| --- | --- | --- | --- |
| Pamphlet | \(O_{1,1} = 46\), \(E_{1,1} = 39.76\) | \(O_{1,2} = 18\), \(E_{1,2} = 16.24\) | 56 |
| Phone Call | \(O_{2,1} = 47\), \(E_{2,1} = 54.67\) | \(O_{2,2} = 19\), \(E_{2,2} = 22.33\) | 77 |
| Control | \(O_{3,1} = 49\), \(E_{3,1} = 47.57\) | \(O_{3,2} = 21\), \(E_{3,2} = 19.43\) | 67 |
| Column Totals | 142 | 58 | \(n = 200\) |
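
The expected frequencies above are easy to check by machine: the formula \(E_{r,c} = \frac{n_{r} \cdot n_{c}}{n}\) is just the outer product of the row and column totals divided by the sample size. Here is a minimal NumPy sketch using the totals from Table 1, intended only as a verification of the hand calculation.

```python
import numpy as np

row_totals = np.array([56, 77, 67])   # Pamphlet, Phone Call, Control
col_totals = np.array([142, 58])      # Recycles, Does Not Recycle
n = 200                               # total sample size

# E[r, c] = (row total r) * (column total c) / n, for every cell at once
expected = np.outer(row_totals, col_totals) / n
print(expected)
# [[39.76 16.24]
#  [54.67 22.33]
#  [47.57 19.43]]
```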

Degrees of Freedom for a Chi-Square Test of Independence

Like in the Chi-square test for homogeneity, you are comparing two variables and need the contingency table to add up in both dimensions.

The formula for the degrees of freedom is the same in both the homogeneity and independence tests:

\[ k = (r - 1) (c - 1) \]

where,

  • \(k\) is the degrees of freedom,

  • \(r\) is the number of populations, which is also the number of rows in a contingency table, and

  • \(c\) is the number of levels of the categorical variable, which is also the number of columns in a contingency table.
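
Since the degrees of freedom depend only on the shape of the contingency table, they are a one-line computation. The short Python sketch below is just a convenience; the function name is arbitrary.

```python
# Degrees of freedom for an r-by-c contingency table: k = (r - 1) * (c - 1)
def chi_square_dof(num_rows: int, num_cols: int) -> int:
    return (num_rows - 1) * (num_cols - 1)

# City recycling example: 3 intervention groups, 2 outcomes
print(chi_square_dof(3, 2))  # 2
```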

Formula for a Chi-Square Test of Independence

The formula (also called a test statistic) for a Chi-square test of independence is:

\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]

where,

  • \(O_{r,c}\) is the observed frequency for population \(r\) at level \(c\), and

  • \(E_{r,c}\) is the expected frequency for population \(r\) at level \(c\).

The Chi-square test statistic measures how much your observed frequencies differ from the frequencies you would expect if the two variables were unrelated.

Steps to Calculate the Test Statistic for a Chi-Square Test of Independence

Step \(1\): Create a Table

Using your contingency table, create a table that separates your observed and expected values into two columns.

Table 3. Table of observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency |
| --- | --- | --- | --- |
| Pamphlet | Recycles | 46 | 39.76 |
| Pamphlet | Does Not Recycle | 18 | 16.24 |
| Phone Call | Recycles | 47 | 54.67 |
| Phone Call | Does Not Recycle | 19 | 22.33 |
| Control | Recycles | 49 | 47.57 |
| Control | Does Not Recycle | 21 | 19.43 |

Step \(2\): Subtract Expected Frequencies from Observed Frequencies

Add a new column to your table called “O – E”. In this column, put the result of subtracting the expected frequency from the observed frequency.

Table 4. Table of observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E |
| --- | --- | --- | --- | --- |
| Pamphlet | Recycles | 46 | 39.76 | 6.24 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 |
| Control | Recycles | 49 | 47.57 | 1.43 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 |

Decimals in this table are rounded to \(2\) digits.

Step \(3\): Square the Results from Step \(2\)

Add a new column to your table called “\((O - E)^2\)”. In this column, put the result of squaring the values from the previous column.

Table 5. Table of observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² |
| --- | --- | --- | --- | --- | --- |
| Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 |
| Control | Recycles | 49 | 47.57 | 1.43 | 2.04 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 |

Decimals in this table are rounded to \(2\) digits.

Step \(4\): Divide the Results from Step \(3\) by the Expected Frequencies

Add a new column to your table called “\((O - E)^2 / E\)”. In this column, put the result of dividing the values from the previous column by their expected frequencies.

Table 6. Table of observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
| --- | --- | --- | --- | --- | --- | --- |
| Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 | 0.98 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 | 0.19 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 | 1.08 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 | 0.50 |
| Control | Recycles | 49 | 47.57 | 1.43 | 2.04 | 0.04 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 | 0.13 |

Decimals in this table are rounded to \(2\) digits.

Step \(5\): Add the Results from Step \(4\) to get the Chi-Square Test Statistic

Finally, add up all the values in the last column of your table to calculate your Chi-square test statistic:

\[ \begin{align}\chi^{2} &= \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \\&= 0.9793 + 0.1907 + 1.0761 + 0.4966 + 0.04299 + 0.1269 \\&= 2.91259\end{align} \]

The formula here uses the non-rounded numbers from the tables above to get a more accurate answer.

The Chi-square test statistic for the Chi-square test of independence in the city recycling example is:

\[ \chi^{2} = 2.91259 \]
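
To double-check the hand calculation, you can sum the \((O - E)^2 / E\) terms in a few lines of code. The sketch below simply reuses the observed and expected frequencies from the tables above; it is a verification aid rather than part of the original example.

```python
import numpy as np

observed = np.array([[46, 18], [47, 19], [49, 21]], dtype=float)
expected = np.array([[39.76, 16.24], [54.67, 22.33], [47.57, 19.43]])

# Chi-square test statistic: sum of (O - E)^2 / E over every cell
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)  # about 2.91, matching the hand calculation above
```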

Steps to Perform a Chi-Square Test of Independence

If your calculated test statistic is large enough, then you can draw the conclusion that the observed frequencies are not what you would expect if the variables are indeed unrelated. But what is considered “large enough”?

To determine whether the test statistic is large enough to reject the null hypothesis, you compare the test statistic to a critical value from a Chi-square distribution table. This act of comparison is the heart of the Chi-square test of independence.

Follow the \(6\) steps below to perform a Chi-square test of independence.

Note that steps \(1, 2\) and \(3\) were outlined in detail above.

Step \(1\): State the Hypotheses

  • The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]

  • The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]

Step \(2\): Calculate the Expected Frequencies

Use your contingency table to calculate the expected frequencies using the formula:

\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]

Step \(3\): Calculate the Chi-Square Test Statistic

Use the formula for a Chi-square test of independence to calculate the Chi-square test statistic:

\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]

Step \(4\): Find the Critical Chi-Square Value

You have two options for finding the critical value:

  1. use a Chi-square distribution table, or

  2. use a critical value calculator.

Either way, there are two pieces of information you need to know to find the critical value:

  1. the degrees of freedom, \(k\), given by the formula:

    \[ k = (r - 1) (c - 1) \]

  2. and the significance level, \( \alpha \), which is usually \( 0.05 \).

Referring back to the city recycling example, find the critical Chi-square value.

  1. Calculate the degrees of freedom.
    • Using the contingency table for the city recycling example, recall that there are \(3\) intervention groups (the rows of the contingency table) and \(2\) outcome groups (the columns of the contingency table). So, the degrees of freedom are:\[ \begin{align} k &= (r - 1) (c - 1) \\&= (3 - 1) (2 - 1) \\&= 2 \text{ degrees of freedom}\end{align} \]
  2. Choose a significance level.
    • Typically, a significance level of \( 0.05 \) is used, so use that here.
  3. Using either a Chi-square distribution table or a critical value calculator, determine the critical value.
    • According to the Chi-square distribution table below, for \(k = 2\) and \( \alpha = 0.05 \), the critical value is:\[ \chi^{2} \text{critical value} = 5.99 \]

Table 7. Percentage points of the Chi-square distribution, Chi-square test for independence.

Percentage Points of the Chi-Square Distribution (each column heading is the probability of a larger value of \(\chi^2\), i.e., the significance level \(\alpha\))

| Degrees of Freedom (k) | 0.99 | 0.95 | 0.90 | 0.75 | 0.50 | 0.25 | 0.10 | 0.05 | 0.01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.000 | 0.004 | 0.016 | 0.102 | 0.455 | 1.32 | 2.71 | 3.84 | 6.63 |
| 2 | 0.020 | 0.103 | 0.211 | 0.575 | 1.386 | 2.77 | 4.61 | 5.99 | 9.21 |
| 3 | 0.115 | 0.352 | 0.584 | 1.212 | 2.366 | 4.11 | 6.25 | 7.81 | 11.34 |
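
Instead of reading the distribution table, you can also look the critical value up in code. Here is a small sketch using scipy.stats.chi2; it should agree with the table above.

```python
from scipy.stats import chi2

alpha = 0.05   # significance level
dof = 2        # degrees of freedom: (3 - 1) * (2 - 1)

# Critical value: the point with an upper-tail area of alpha under the chi-square curve
critical_value = chi2.ppf(1 - alpha, dof)
print(round(critical_value, 2))  # 5.99
```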

Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value

Now for the moment of truth! Is your test statistic large enough to reject the null hypothesis? Compare it to the critical value you just found to find out.

Again, continuing with the city recycling example, compare the test statistic to the critical value.

The Chi-square test statistic is: \( \chi^{2} = 2.91259 \)

The critical value is: \( 5.99 \)

The Chi-square test statistic is less than the critical value.

Step \(6\): Decide Whether to Reject the Null Hypothesis

Finally, decide whether to reject the null hypothesis.

  • If the Chi-square value is greater than the critical value, then the difference between the observed and expected frequencies is significant; \( (p < \alpha) \)

    • This means you reject the null hypothesis that the variables are unrelated, and you have support that the alternative hypothesis is true.

  • If the Chi-square value is less than the critical value, then the difference between the observed and expected frequencies is not significant; \( (p > \alpha) \)

    • This means you do not reject the null hypothesis, and you do not have support that the alternative hypothesis is true.

Decide whether to reject the null hypothesis for the city recycling example.

The Chi-square value is less than the critical value.

  • So, the city does not reject the null hypothesis that whether a household recycles and the type of intervention they receive are unrelated.
  • There is not a significant difference between the observed frequencies and the expected frequencies. This suggests that the proportion of households that recycle is the same for all interventions.

The city concludes that there is no evidence that its interventions have an effect on whether households choose to recycle.

Using the Critical Value vs. Using the P-Value

In the steps to perform a Chi-square test of independence, you calculated and used the critical value to decide whether to reject the null hypothesis.

A critical value of a Chi-square test of independence is a value that is compared to the value of the test statistic, so you can determine whether to reject the null hypothesis.

It is important to know, however, that there is another option you can use: the \(p\)-value.

The \(p\)-value of a Chi-square test of independence is associated with the calculated value of its test statistic. It is the area to the right of the test statistic under the Chi-square curve with \(k\) degrees of freedom.

The image below sums up the critical value approach vs. the \(p\)-value approach.

Figure 1. A diagram showing how you can use either a \(p\)-value or a critical value to determine whether to reject the null hypothesis.
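
For the city recycling example, the \(p\)-value route would look something like the sketch below, again using scipy.stats.chi2. The printed value is an approximation.

```python
from scipy.stats import chi2

test_statistic = 2.91259   # from the city recycling example
dof = 2                    # degrees of freedom

# p-value: area to the right of the test statistic under the chi-square curve
p_value = chi2.sf(test_statistic, dof)
print(round(p_value, 3))   # about 0.233, which is greater than alpha = 0.05
```

Since this \(p\)-value is larger than the significance level, you reach the same conclusion as with the critical value: do not reject the null hypothesis.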

Chi-Square Test for Independence – Example

Many jobseekers are applying via online job boards these days. Sites like Indeed, ZipRecruiter, and CareerBuilder have thousands of enticing posts inviting people to apply. It’s never been easier for fraudulent recruiters to lure in unsuspecting and vulnerable people.

Are fraudulent recruiters more prevalent in some industries than others?

The contingency table below contains real counts of fraudulent and non-fraudulent online job openings, by industry, for the \(10\) most common industries in the dataset. This is quite a large dataset, but it is a good representation of the kind of data statisticians work with in the real world.

Table 8. Contingency table, Chi-square test for independence.

| Industry | Real | Fraud | Row Totals |
| --- | --- | --- | --- |
| Information Technology | 1702 | 32 | 1734 |
| Computer Software | 1371 | 5 | 1376 |
| Internet | 1062 | 0 | 1062 |
| Marketing / Advertising | 783 | 45 | 828 |
| Education | 822 | 0 | 822 |
| Financial Services | 744 | 35 | 779 |
| Healthcare | 446 | 51 | 497 |
| Consumer Services | 334 | 24 | 358 |
| Telecom. | 316 | 26 | 342 |
| Oil / Energy | 178 | 109 | 287 |
| Column Totals | 7758 | 327 | \(n = 8085\) |

Solution:

Step \(1\): State the Hypotheses.

  • The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“if a job post is real” and “the job industry” are not related.} \]

  • The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“if a job post is real” and “the job industry” are related.} \]

Step \(2\): Calculate Expected Frequencies.
  • Using the contingency table above and the formula:\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n}, \]create a table that has your calculated expected frequencies.

Table 9. Table of expected frequencies, Chi-square test for independence.

| Industry | Real | Fraud | Row Totals |
| --- | --- | --- | --- |
| Information Technology | 1663.8679 | 70.1321 | 1734 |
| Computer Software | 1320.3473 | 55.6527 | 1376 |
| Internet | 1019.0471 | 42.9529 | 1062 |
| Marketing / Advertising | 794.5113 | 33.4887 | 828 |
| Education | 788.754 | 33.246 | 822 |
| Financial Services | 747.4931 | 31.5069 | 779 |
| Healthcare | 476.8987 | 20.1013 | 497 |
| Consumer Services | 343.5206 | 14.4794 | 358 |
| Telecom. | 328.1677 | 13.8323 | 342 |
| Oil / Energy | 275.3922 | 11.6078 | 287 |
| Column Totals | 7758 | 327 | \(n = 8085\) |

Step \(3\): Calculate the Chi-Square Test Statistic.

  • Create a table to hold your calculated values and use the formula:\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]to calculate your test statistic.

Table 10. Table used to calculate the Chi-square test statistic, Chi-square test for independence.

| Industry | Job Post Status | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
| --- | --- | --- | --- | --- | --- | --- |
| Information Technology | Real | 1702 | 1663.868 | 38.132 | 1454.057 | 0.874 |
| Information Technology | Fraud | 32 | 70.132 | -38.132 | 1454.057 | 20.733 |
| Computer Software | Real | 1371 | 1320.347 | 50.653 | 2565.696 | 1.943 |
| Computer Software | Fraud | 5 | 55.653 | -50.653 | 2565.696 | 46.102 |
| Internet | Real | 1062 | 1019.047 | 42.953 | 1844.952 | 1.811 |
| Internet | Fraud | 0 | 42.953 | -42.953 | 1844.952 | 42.953 |
| Marketing / Advertising | Real | 783 | 794.511 | -11.511 | 132.510 | 0.167 |
| Marketing / Advertising | Fraud | 45 | 33.489 | 11.511 | 132.510 | 3.957 |
| Education | Real | 822 | 788.754 | 33.246 | 1105.297 | 1.401 |
| Education | Fraud | 0 | 33.246 | -33.246 | 1105.297 | 33.246 |
| Financial Services | Real | 744 | 747.493 | -3.493 | 12.202 | 0.016 |
| Financial Services | Fraud | 35 | 31.507 | 3.493 | 12.202 | 0.387 |
| Healthcare | Real | 446 | 476.899 | -30.899 | 954.730 | 2.002 |
| Healthcare | Fraud | 51 | 20.101 | 30.899 | 954.730 | 47.496 |
| Consumer Services | Real | 334 | 343.521 | -9.521 | 90.642 | 0.264 |
| Consumer Services | Fraud | 24 | 14.479 | 9.521 | 90.642 | 6.260 |
| Telecom. | Real | 316 | 328.168 | -12.168 | 148.053 | 0.451 |
| Telecom. | Fraud | 26 | 13.832 | 12.168 | 148.053 | 10.703 |
| Oil / Energy | Real | 178 | 275.392 | -97.392 | 9485.241 | 34.443 |
| Oil / Energy | Fraud | 109 | 11.608 | 97.392 | 9485.241 | 817.144 |

Decimals in this table are rounded to \(3\) digits.

  • Add all the values in the last column of the table above to calculate the test statistic:\[ \begin{align}\chi^{2} &= 0.8739 + 20.7331 + 1.9432 + 46.1019 + 1.8105 \\&+ 42.9529 + 0.1668 + 3.9569 + 1.4013 + 33.246 \\&+ 0.0163 + 0.3873 + 2.0020 + 47.4959 + 0.2639 \\&+ 6.2601 + 0.4512 + 10.7034 + 34.4427 + 817.1437 \\&\approx 1072.353.\end{align} \]
  • The formula here uses the non-rounded numbers from the table above to get a more accurate answer.

  • The Chi-square test statistic is:\[ \chi^{2} \approx 1072.353 .\]

Step \(4\): Find the Critical Chi-Square Value and the \(P\)-Value.

In the real world, a statistician would likely be more interested in calculating the \(p\)-value than in simply reporting whether there was a significant result, because the \(p\)-value gives a more specific conclusion. Say you want to be really sure that there is a relationship before you report one, and choose a significance level of \(\alpha = 0.01\).

  • Calculate the degrees of freedom: \[ \begin{align}k &= (r - 1)(c - 1) \\&= (10 - 1) (2 - 1) \\&= 9 \cdot 1 \\&= 9 \text{ degrees of freedom}\end{align} \]
  • Using a Chi-square distribution table, look at the row for \(9\) degrees of freedom and the column for \(0.01\) significance to find the critical value of \(21.67\).
  • To use a \(p\)-value calculator, you need the test statistic and degrees of freedom.
    • Plugging the degrees of freedom and the test statistic into a \(p\)-value calculator, you get a \(p\)-value very close to \(0\).

Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value.

  • The test statistic of approximately \(1072.35\) is much, much larger than the critical value of \(21.67\), which means you have sufficient evidence to reject the null hypothesis.
  • The \(p\)-value is also very low, much less than the significance level, which would also let you reject the null hypothesis.

Step \(6\): Decide Whether to Reject the Null Hypothesis.

  • It looks like there is a strong relationship between the job industry and whether a job post is fraudulent.
  • Look at the table from step \(2\).
    • Here, you can see that the number of fraudulent jobs in the Oil industry is way higher than expected, and by itself contributes enough for you to conclude that industry and recruiter scams are not independent.

Therefore, you can confidently reject the null hypothesis.
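
If you have statistical software available, the whole example can be reproduced in a few lines. The sketch below passes the observed counts from the contingency table to scipy.stats.chi2_contingency, which computes the expected frequencies, the test statistic, the degrees of freedom, and the \(p\)-value in one call; for a table larger than \(2 \times 2\) it returns the same Pearson statistic calculated by hand above.

```python
from scipy.stats import chi2_contingency

# Observed counts: one row per industry, columns are [Real, Fraud]
observed = [
    [1702, 32],    # Information Technology
    [1371, 5],     # Computer Software
    [1062, 0],     # Internet
    [783, 45],     # Marketing / Advertising
    [822, 0],      # Education
    [744, 35],     # Financial Services
    [446, 51],     # Healthcare
    [334, 24],     # Consumer Services
    [316, 26],     # Telecom.
    [178, 109],    # Oil / Energy
]

chi_square, p_value, dof, expected = chi2_contingency(observed)
print(round(chi_square, 2))  # about 1072.35
print(dof)                   # 9
print(p_value)               # effectively 0, far below alpha = 0.01
```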

Chi-Square Test for Independence – Key takeaways

  • A Chi-square test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.
  • The following must be true in order to use a Chi-square test of independence:
    • The two variables must be categorical.
    • Groups must be mutually exclusive; that is, each observation belongs to one and only one group, and the sample is randomly selected.
    • Expected counts must be at least \(5\).
    • Observations must be independent.
  • The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.
  • The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.
  • The expected frequency for row \(r\) and column \(c\) of a Chi-square test of independence is given by the formula:

    \[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]

  • The degrees of freedom for a Chi-square test of independence are given by the formula:

    \[ k = (r - 1) (c - 1) \]

  • The formula (also called a test statistic) for a Chi-square test of independence is:

    \[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]

Frequently Asked Questions about Chi Square Test for Independence

What are the requirements for a Chi-square test for independence?

The following requirements must be met if you want to perform a Chi-square test for independence:

  • The variables must be categorical.
  • Groups must be mutually exclusive.
  • Expected counts must be at least 5.
  • Observations must be independent.

What is a Chi-square test of independence?

A Chi-square test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables are related to each other or not.

When do you use a Chi-square test for independence?

You use a Chi-square test for independence when you meet all of the following conditions:

  • You want to test a hypothesis about the relationship between two categorical variables.
  • The sample was randomly selected.
  • There is a minimum of 5 expected observations in each combined group.

How many variables does a Chi-square test of independence have?

A Chi-square test of independence has two categorical variables.

Is the Chi-square test of independence a non-parametric test?

Yes, along with all other Chi-square tests, the Chi-square test of independence is a non-parametric test.
