Say your city is trying to encourage its residents to recycle their household trash, so they come up with two methods for asking them to do so:
mailing an educational pamphlet; and
calling each resident.
Then, the city randomly selects \(200\) households and randomly assigns them to one of three categories:
receiving the pamphlet;
receiving a phone call;
the control group (no form of intervention).
Finally, the city will use the results of this test to decide what is the best way to ask their residents to recycle more.
Can you guess which hypothesis test they will use to make this decision? A Chi-square test for independence!
Occasionally, you want to know if there is a relationship between two categorical variables.
Think of it this way:
If you know something about one variable, can you use that information to learn about the other variable?
You can use a Chi-square test of independence to do just that.
A Chi-square \( (\chi^{2}) \) test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.
If there is a relationship between the two categorical variables, then knowing the value of one variable tells you something about the value of the other variable.
If there is no relationship between the two categorical variables, then they are independent.
All the Pearson Chi-square tests (for independence, homogeneity, and goodness of fit) share the same basic assumptions; the main difference is how those assumptions apply in practice. The assumptions for a Chi-square test of independence are:
The two variables must be categorical.
This Chi-square test uses cross-tabulation, counting observations that fall in each category.
Groups must be mutually exclusive; that is, each observation falls into exactly one category of each variable, and the sample is randomly selected.
Continuing from the introductory example, three months after the city's intervention methods are tested, they look at the outcome and put the data into a contingency table. The groups that must be mutually exclusive are the subgroups: (Recycles-Pamphlet), (Does Not Recycle-Control), etc.
Table 1. Contingency table, Chi-square test for independence.

| Intervention | Recycles | Does Not Recycle | Row Totals |
|---|---|---|---|
| Pamphlet | 46 | 18 | 56 |
| Phone Call | 47 | 19 | 77 |
| Control | 49 | 21 | 67 |
| Column Totals | 142 | 58 | \(n =\) 200 |
Expected counts must be at least \(5\).
This means the sample size must be large enough, though how large is difficult to determine beforehand. In general, making sure each expected count is at least \(5\) should be fine.
Observations must be independent.
This is about how the data are collected. In the city recycling example, the researcher should not sample houses that are near each other: neighboring households are likely to influence each other's behavior, so whether one household recycles would not be independent of whether the next one does.
When it comes to independence of variables, you almost always assume that two variables are independent, then try to prove that they aren’t.
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]
Notice that the Chi-square test for independence makes no claims about the kind of relationship between the two categorical variables, only whether a relationship exists.
Replacing “Variable A” and “Variable B” with the variables in the city recycling example, where the population is all the households in your city, you get:\[ H_{0}: \text{“intervention method” and “whether a household recycles” are not related.} \]\[ H_{a}: \text{“intervention method” and “whether a household recycles” are related.} \]
As with other Chi-square tests, a Chi-square test of independence works by comparing your observed and expected frequencies. You calculate expected frequencies using the contingency table. So, the expected frequency for row \(r\) and column \(c\) is given by the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
where,
\(E_{r,c}\) is the expected frequency for population (row) \(r\) at level (column) \(c\) of the categorical variable,
\(r\) is the row index, running over the populations (the rows of the contingency table),
\(c\) is the column index, running over the levels of the categorical variable (the columns of the contingency table),
\(n_{r}\) is the number of observations from population (row) \(r\),
\(n_{c}\) is the number of observations from level (column) \(c\) of the categorical variable, and
\(n\) is the total sample size.
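As a quick check of this formula, here is a short Python sketch that computes the expected counts for the recycling contingency table from its row and column totals:

```python
# Expected frequencies E[r][c] = (row total * column total) / n,
# using the totals from the recycling contingency table above.
row_totals = [56, 77, 67]   # Pamphlet, Phone Call, Control
col_totals = [142, 58]      # Recycles, Does Not Recycle
n = sum(row_totals)         # total sample size, 200

expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
# The first printed row is [39.76, 16.24], matching Table 2 below.
```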
Continuing with the city recycling example:
Your city now calculates the expected frequencies using the formula above and the contingency table.
Table 2. Contingency table with observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Recycles | Does Not Recycle | Row Totals |
|---|---|---|---|
| Pamphlet | \(O_{1,1} = 46\); \(E_{1,1} = 39.76\) | \(O_{1,2} = 18\); \(E_{1,2} = 16.24\) | 56 |
| Phone Call | \(O_{2,1} = 47\); \(E_{2,1} = 54.67\) | \(O_{2,2} = 19\); \(E_{2,2} = 22.33\) | 77 |
| Control | \(O_{3,1} = 49\); \(E_{3,1} = 47.57\) | \(O_{3,2} = 21\); \(E_{3,2} = 19.43\) | 67 |
| Column Totals | 142 | 58 | \(n =\) 200 |
As in the Chi-square test for homogeneity, you are comparing two variables, and the contingency table must add up in both dimensions.
The formula for the degrees of freedom is the same in both the homogeneity and independence tests:
\[ k = (r - 1) (c - 1) \]
where,
\(k\) is the degrees of freedom,
\(r\) is the number of populations, which is also the number of rows in a contingency table, and
\(c\) is the number of levels of the categorical variable, which is also the number of columns in a contingency table.
The test statistic for a Chi-square test of independence is:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
where,
\(O_{r,c}\) is the observed frequency for population \(r\) at level \(c\), and
\(E_{r,c}\) is the expected frequency for population \(r\) at level \(c\).
The Chi-square test statistic measures how much your observed frequencies differ from the frequencies you would expect if the two variables were unrelated.
Step \(1\): Create a Table
Using your contingency table, create a table that separates your observed and expected values into two columns.
Table 3. Table of observed frequencies and expected frequencies, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency |
|---|---|---|---|
| Pamphlet | Recycles | 46 | 39.76 |
| Pamphlet | Does Not Recycle | 18 | 16.24 |
| Phone Call | Recycles | 47 | 54.67 |
| Phone Call | Does Not Recycle | 19 | 22.33 |
| Control | Recycles | 49 | 47.57 |
| Control | Does Not Recycle | 21 | 19.43 |
Step \(2\): Subtract Expected Frequencies from Observed Frequencies
Add a new column to your table called “O – E”. In this column, put the result of subtracting the expected frequency from the observed frequency.
Table 4. Table of observed and expected frequencies with the O – E column, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E |
|---|---|---|---|---|
| Pamphlet | Recycles | 46 | 39.76 | 6.24 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 |
| Control | Recycles | 49 | 47.57 | 1.43 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 |
Decimals in this table are rounded to \(2\) digits.
Step \(3\): Square the Results from Step \(2\)
Add a new column to your table called “(O – E)²”. In this column, put the result of squaring the values in the previous column.
Table 5. Table of observed and expected frequencies with the O – E and (O – E)² columns, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² |
|---|---|---|---|---|---|
| Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 |
| Control | Recycles | 49 | 47.57 | 1.43 | 2.04 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 |
Decimals in this table are rounded to \(2\) digits.
Step \(4\): Divide the Results from Step \(3\) by the Expected Frequencies
Add a new column to your table called “(O – E)²/E”. In this column, put the result of dividing the values in the previous column by their expected frequencies.
Table 6. Table of observed and expected frequencies with the O – E, (O – E)², and (O – E)²/E columns, Chi-square test for independence.

| Intervention | Outcome | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
|---|---|---|---|---|---|---|
| Pamphlet | Recycles | 46 | 39.76 | 6.24 | 38.94 | 0.98 |
| Pamphlet | Does Not Recycle | 18 | 16.24 | 1.76 | 3.10 | 0.19 |
| Phone Call | Recycles | 47 | 54.67 | -7.67 | 58.83 | 1.08 |
| Phone Call | Does Not Recycle | 19 | 22.33 | -3.33 | 11.09 | 0.50 |
| Control | Recycles | 49 | 47.57 | 1.43 | 2.04 | 0.04 |
| Control | Does Not Recycle | 21 | 19.43 | 1.57 | 2.46 | 0.13 |
Decimals in this table are rounded to \(2\) digits.
Step \(5\): Add the Results from Step \(4\) to get the Chi-Square Test Statistic
Finally, add up all the values in the last column of your table to calculate your Chi-square test statistic:
\[ \begin{align}\chi^{2} &= \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \\&= 0.9793 + 0.1907 + 1.0761 + 0.4966 + 0.04299 + 0.1269 \\&= 2.91259\end{align} \]
The formula here uses the non-rounded numbers from the tables above to get a more accurate answer.
The Chi-square test statistic for the Chi-square test of independence in the city recycling example is:
\[ \chi^{2} = 2.91259 \]
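The calculation above can be reproduced in a few lines of Python (a sketch using the observed and expected frequencies from the tables; in practice, SciPy's `chi2_contingency` computes the expected counts, the statistic, and the p-value in one call):

```python
# Chi-square statistic: sum of (O - E)^2 / E over all cells,
# using the observed and expected frequencies from the recycling tables.
observed = [46, 18, 47, 19, 49, 21]
expected = [39.76, 16.24, 54.67, 22.33, 47.57, 19.43]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # about 2.9126, matching the worked example
```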
If your calculated test statistic is large enough, then you can draw the conclusion that the observed frequencies are not what you would expect if the variables are indeed unrelated. But what is considered “large enough”?
To determine whether the test statistic is large enough to reject the null hypothesis, you compare the test statistic to a critical value from a Chi-square distribution table. This act of comparison is the heart of the Chi-square test of independence.
Follow the \(6\) steps below to perform a Chi-square test of independence.
Note that steps \(1, 2\) and \(3\) were outlined in detail above.
Step \(1\): State the Hypotheses
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“Variable A” and “Variable B” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“Variable A” and “Variable B” are related.} \]
Step \(2\): Calculate the Expected Frequencies
Use your contingency table to calculate the expected frequencies using the formula:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
Step \(3\): Calculate the Chi-Square Test Statistic
Use the formula for a Chi-square test of independence to calculate the Chi-square test statistic:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
Step \(4\): Find the Critical Chi-Square Value
You have two options for finding the critical value:
use a Chi-square distribution table, or
use a critical value calculator.
Either way, there are two pieces of information you need to know to find the critical value:
the degrees of freedom, \(k\), given by the formula:
\[ k = (r - 1) (c - 1) \]
and the significance level, \( \alpha \), which is usually \( 0.05 \).
Referring back to the city recycling example, find the critical Chi-square value. The degrees of freedom are \(k = (3 - 1)(2 - 1) = 2\) and the significance level is \(\alpha = 0.05\), so look up the row \(k = 2\) and the column \(\alpha = 0.05\) in the table below.
Table 7. Percentage points of the Chi-square distribution, Chi-square test for independence. Column headings give the probability of a larger value of \(X^{2}\), i.e., the significance level \(\alpha\).

| Degrees of Freedom (k) | 0.99 | 0.95 | 0.90 | 0.75 | 0.50 | 0.25 | 0.10 | 0.05 | 0.01 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.004 | 0.016 | 0.102 | 0.455 | 1.32 | 2.71 | 3.84 | 6.63 |
| 2 | 0.020 | 0.103 | 0.211 | 0.575 | 1.386 | 2.77 | 4.61 | 5.99 | 9.21 |
| 3 | 0.115 | 0.352 | 0.584 | 1.212 | 2.366 | 4.11 | 6.25 | 7.81 | 11.34 |
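As a sketch of where the table entry comes from: for \(k = 2\) degrees of freedom the Chi-square tail probability has the closed form \(P(X > x) = e^{-x/2}\), so the critical value at level \(\alpha\) is \(-2 \ln \alpha\). (For other degrees of freedom there is no such simple form; a library routine such as SciPy's `chi2.ppf` is the usual tool.)

```python
import math

# For k = 2 degrees of freedom, P(X > x) = exp(-x/2),
# so the critical value at significance level alpha is -2 * ln(alpha).
alpha = 0.05
critical_value = -2 * math.log(alpha)
print(round(critical_value, 2))  # about 5.99, matching the k = 2 table row
```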
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value
Now for the moment of truth! Is your test statistic large enough to reject the null hypothesis? Compare it to the critical value you just found to find out.
Again, continuing with the city recycling example, compare the test statistic to the critical value.
The Chi-square test statistic is: \( \chi^{2} = 2.91259 \)
The critical value is: \( 5.99 \)
The Chi-square test statistic is less than the critical value.
Step \(6\): Decide Whether to Reject the Null Hypothesis
Finally, decide whether to reject the null hypothesis.
If the Chi-square value is greater than the critical value, then the difference between the observed and expected frequencies is significant; \( (p < \alpha) \)
This means you reject the null hypothesis that the variables are unrelated, and you have support that the alternative hypothesis is true.
If the Chi-square value is less than the critical value, then the difference between the observed and expected frequencies is not significant; \( (p > \alpha) \)
This means you do not reject the null hypothesis, and you do not have support that the alternative hypothesis is true.
Decide whether to reject the null hypothesis for the city recycling example.
The Chi-square value is less than the critical value.
The city concludes that there is no significant evidence that its interventions have an effect on whether households choose to recycle.
In the steps to perform a Chi-square test of independence, you calculated and used the critical value to decide whether to reject the null hypothesis.
A critical value of a Chi-square test of independence is a value that is compared to the value of the test statistic, so you can determine whether to reject the null hypothesis.
It is important to know, however, that there is another option you can use: the \(p\)-value.
The \(p\)-value of a Chi-square test of independence is associated with the calculated value of its test statistic. It is the area to the right of the test statistic under the Chi-square curve with \(k\) degrees of freedom.
To sum up: with the critical-value approach, you reject \(H_{0}\) when the test statistic exceeds the critical value; with the \(p\)-value approach, you reject \(H_{0}\) when the \(p\)-value is smaller than \(\alpha\). The two approaches always agree.
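As a sketch, the \(p\)-value for the recycling example can be computed directly, again using the \(k = 2\) closed form for the tail probability:

```python
import math

chi2_stat = 2.91259   # test statistic from the recycling example
alpha = 0.05

# For k = 2 degrees of freedom, the p-value (the area to the right of
# the statistic under the Chi-square curve) is exp(-chi2/2).
p_value = math.exp(-chi2_stat / 2)
print(round(p_value, 4))   # about 0.2331

# p > alpha, so we fail to reject the null hypothesis,
# agreeing with the critical-value approach.
print(p_value > alpha)     # True
```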
Many jobseekers are applying via online job boards these days. Sites like Indeed, ZipRecruiter, and CareerBuilder have thousands of enticing posts inviting people to apply. It’s never been easier for fraudulent recruiters to lure in unsuspecting and vulnerable people.
Are fraudulent recruiters more prevalent in some industries than others?
The contingency table below contains real counts of fraudulent and non-fraudulent online job postings in the \(10\) most common industries in the dataset. This is quite a big dataset, but it is a good representation of what statisticians do in the real world.
Table 8. Contingency table, Chi-square test for independence.

| Industry | Real | Fraud | Row Totals |
|---|---|---|---|
| Information Technology | 1702 | 32 | 1734 |
| Computer Software | 1371 | 5 | 1376 |
| Internet | 1062 | 0 | 1062 |
| Marketing / Advertising | 783 | 45 | 828 |
| Education | 822 | 0 | 822 |
| Financial Services | 744 | 35 | 779 |
| Healthcare | 446 | 51 | 497 |
| Consumer Services | 334 | 24 | 358 |
| Telecom. | 316 | 26 | 342 |
| Oil / Energy | 178 | 109 | 287 |
| Column Totals | 7758 | 327 | \(n=\) 8085 |
Solution:
Step \(1\): State the Hypotheses.
The null hypothesis is that the two categorical variables are independent, i.e., there is no association between them, they are not related.\[ H_{0}: \text{“if a job post is real” and “the job industry” are not related.} \]
The alternative hypothesis is that the two categorical variables are not independent, i.e., there is an association between them, they are related.\[ H_{a}: \text{“if a job post is real” and “the job industry” are related.} \]
Step \(2\): Calculate the Expected Frequencies.
Table 9. Table of expected frequencies, Chi-square test for independence.

| Industry | Real | Fraud | Row Totals |
|---|---|---|---|
| Information Technology | 1663.8679 | 70.1321 | 1734 |
| Computer Software | 1320.3473 | 55.6527 | 1376 |
| Internet | 1019.0471 | 42.9529 | 1062 |
| Marketing / Advertising | 794.5113 | 33.4887 | 828 |
| Education | 788.754 | 33.246 | 822 |
| Financial Services | 747.4931 | 31.5069 | 779 |
| Healthcare | 476.8987 | 20.1013 | 497 |
| Consumer Services | 343.5206 | 14.4794 | 358 |
| Telecom. | 328.1677 | 13.8323 | 342 |
| Oil / Energy | 275.3922 | 11.6078 | 287 |
| Column Totals | 7758 | 327 | \(n =\) 8085 |
Step \(3\): Calculate the Chi-Square Test Statistic.
Table 10. Using a table to calculate the Chi-square test statistic, Chi-square test for independence.

| Industry | Job Post Status | Observed Frequency | Expected Frequency | O – E | (O – E)² | (O – E)²/E |
|---|---|---|---|---|---|---|
| Information Technology | Real | 1702 | 1663.868 | 38.132 | 1454.057 | 0.874 |
| Information Technology | Fraud | 32 | 70.132 | -38.132 | 1454.057 | 20.733 |
| Computer Software | Real | 1371 | 1320.347 | 50.653 | 2565.696 | 1.943 |
| Computer Software | Fraud | 5 | 55.653 | -50.653 | 2565.696 | 46.102 |
| Internet | Real | 1062 | 1019.047 | 42.953 | 1844.952 | 1.811 |
| Internet | Fraud | 0 | 42.953 | -42.953 | 1844.952 | 42.953 |
| Marketing / Advertising | Real | 783 | 794.511 | -11.511 | 132.510 | 0.167 |
| Marketing / Advertising | Fraud | 45 | 33.489 | 11.511 | 132.510 | 3.957 |
| Education | Real | 822 | 788.754 | 33.246 | 1105.297 | 1.401 |
| Education | Fraud | 0 | 33.246 | -33.246 | 1105.297 | 33.246 |
| Financial Services | Real | 744 | 747.493 | -3.493 | 12.202 | 0.016 |
| Financial Services | Fraud | 35 | 31.507 | 3.493 | 12.202 | 0.387 |
| Healthcare | Real | 446 | 476.899 | -30.899 | 954.730 | 2.002 |
| Healthcare | Fraud | 51 | 20.101 | 30.899 | 954.730 | 47.496 |
| Consumer Services | Real | 334 | 343.521 | -9.521 | 90.642 | 0.264 |
| Consumer Services | Fraud | 24 | 14.479 | 9.521 | 90.642 | 6.260 |
| Telecom. | Real | 316 | 328.168 | -12.168 | 148.053 | 0.451 |
| Telecom. | Fraud | 26 | 13.832 | 12.168 | 148.053 | 10.703 |
| Oil / Energy | Real | 178 | 275.392 | -97.392 | 9485.241 | 34.443 |
| Oil / Energy | Fraud | 109 | 11.608 | 97.392 | 9485.241 | 817.144 |
Decimals in this table are rounded to \(3\) digits. Adding up the values in the last column, using the non-rounded numbers to get a more accurate answer, gives the Chi-square test statistic:
\[ \chi^{2} \approx 1072.4 \]
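The whole calculation can be reproduced in a short Python sketch from the raw counts alone (the observed frequencies are taken from the job-postings contingency table above):

```python
# Observed counts per industry as (real, fraud) pairs,
# from the job-postings contingency table.
observed = [
    (1702, 32), (1371, 5), (1062, 0), (783, 45), (822, 0),
    (744, 35), (446, 51), (334, 24), (316, 26), (178, 109),
]

n = sum(r + f for r, f in observed)        # total sample size, 8085
real_total = sum(r for r, _ in observed)   # 7758
fraud_total = sum(f for _, f in observed)  # 327

chi2 = 0.0
for real, fraud in observed:
    row_total = real + fraud
    e_real = row_total * real_total / n    # expected real count
    e_fraud = row_total * fraud_total / n  # expected fraud count
    chi2 += (real - e_real) ** 2 / e_real + (fraud - e_fraud) ** 2 / e_fraud

print(round(chi2, 1))  # roughly 1072, far beyond the 21.67 critical value
```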
Step \(4\): Find the Critical Chi-Square Value and the \(P\)-Value.
In the real world, a statistician would likely report the \(p\)-value rather than simply whether there was a significant result, since readers much prefer a more specific conclusion. Say you want to be really sure that there is a relationship before you report one, and choose a significance level of \(\alpha = 0.01\). With \(k = (10 - 1)(2 - 1) = 9\) degrees of freedom and \(\alpha = 0.01\), the critical Chi-square value is \(21.67\).
Step \(5\): Compare the Chi-Square Test Statistic to the Critical Chi-Square Value.
The test statistic is far greater than the critical value of \(21.67\).
Step \(6\): Decide Whether to Reject the Null Hypothesis.
Since the test statistic is greater than the critical value, you can confidently reject the null hypothesis: there is an association between a job posting's industry and whether it is fraudulent.
The expected frequency for row \(r\) and column \(c\) is:
\[ E_{r,c} = \frac{n_{r} \cdot n_{c}}{n} \]
The degrees of freedom are:
\[ k = (r - 1) (c - 1) \]
The test statistic for a Chi-square test of independence is:
\[ \chi^{2} = \sum \frac{(O_{r,c} - E_{r,c})^{2}}{E_{r,c}} \]
What is a Chi-square test of independence?
A Chi-square test of independence is a non-parametric Pearson Chi-square test that you can use to determine whether two categorical variables in a single population are related to each other or not.
True or False?
All the Pearson Chi-square tests, for independence, homogeneity, and goodness of fit, share the same basic assumptions.
True
When can you use a Chi-square test for independence?
You can use this test when its assumptions are met: the two variables are categorical, the expected count in each cell is at least \(5\), and the observations are independent.
How many variables does a Chi-square test of independence have?
A Chi-square test of independence has two categorical variables.
Is a Chi-square test of independence a non-parametric test?
Yes, along with all other Chi-square tests, the Chi-square test of independence is a non-parametric test.