Section 2G - The Central Limit Theorem

Repeatedly throughout the last several sections, we have reiterated the challenges presented when trying to infer information about a population from information we gather from a sample. For example, a recent news report claimed

"Nearly eight and a half percent of homeowners in the United States are in danger of foreclosure."

We recognize that the percentage stated was most likely not generated by gathering information from every single homeowner in the United States. Instead, some sample was selected and information was gathered from that sample. But, we are confronted with the fact that if the survey was conducted again, that is if we contacted a different sample of the same size, we would not expect to find that exactly eight and a half percent of that sample would also be facing foreclosure. So if we re-sample, we expect the resulting statistic will most likely be different. The following example highlights this issue.

Example In order to assess the damage to ponderosa pine trees due to bark beetles, U.S. Forest Service employees walk through specified patches of forest and count the number of trees which display bark beetle damage.

In an effort to ensure their statistics were representative of all of the trees in the Coconino National Forest (that is, representative of the overall population), Forest Service employees repeated their surveying procedure multiple times. The following table summarizes their findings.

Survey Number | # of Trees w/ Bark Beetle Damage | Total # of Trees Inspected | % of Sample w/ Bark Beetle Damage
1 | 206 | 380 | 54.2%
2 | 184 | 380 | 48.4%
3 | 209 | 380 | 55.0%
4 | 204 | 380 | 53.7%
5 | 214 | 380 | 56.3%
6 | 223 | 380 | 58.7%

What conclusions can be drawn from these results? Which sample proportion is most trustworthy?

Solution First, we note that each sample was of the same size, namely n = 380. But in spite of this consistency, a different percentage resulted from each sample. That is, for each new sample of size 380, a different sample proportion was found. We have no reason to believe any of the surveys were conducted irresponsibly (by using poor sampling methods, for example), so we have no way of distinguishing between the resulting statistics. Unfortunately, we can't identify one statistic which is more trustworthy than the others and, worse, we must concede that if we were to continue surveying 380 trees at a time, we would expect to continue finding different percentages of those samples which displayed bark beetle damage.

Recall that our goal in such a situation is to try to infer the actual percentage of all ponderosa pine trees in the Coconino National Forest which have bark beetle damage. But since we cannot efficiently access this entire population, we are left to draw information from samples of trees - but these samples produce conflicting results!

We need the more advanced tools and ways of reasoning developed in this section to confront a situation such as this.

The fact that different statistics resulted from different samples should not be surprising - this is simply a consequence of sampling variability. Before we discuss the mathematical theory which allows us to responsibly address issues such as those raised in the previous example, we need to define one important piece of terminology.

Definition If we could take all possible samples (of a specific size) from a population and calculate a specified statistic for each sample, the data set consisting of all the statistics we would calculate is called the sampling distribution of the statistic.

Referring to the previous example, we can imagine taking all possible samples of size 380 of ponderosa pine trees in the Coconino National Forest (which would be a huge number of samples!) and calculating the percentage of each sample which displayed bark beetle damage. The percentages provided in the table in the earlier example are just a few of the values we would expect to get. If we collected all of these possible percentages, we would have the sampling distribution for the sample proportion.

Usually, our population is too large (or too difficult to access in its entirety) for us to actually take all possible samples of a specific size. We would never actually consider trying to take all possible samples of 380 trees within the Coconino National Forest, for example.

If our population is very small, though, it is possible to take every single sample of a certain size and calculate a specified statistic for each sample. If we do, we will generate the entire sampling distribution for the statistic we calculate from the samples. The following problem provides one example of actually generating the entire sampling distribution for a certain statistic and reveals some important concepts from probability topics discussed earlier in this chapter which underlie the main theorem from this section.

Example Suppose we would like to analyze the population consisting of the heights of the 5 women who make up the starting line-up for NAU's women's basketball team. Their heights (in inches) are provided in the table below.

Player | Aubrey | Paige | Tyler | Chanel | Jasmine
Height (inches) | 75 | 68 | 70 | 73 | 64

What is the population mean height of all five players?

Solution In this example, the statistic we will be focusing on is the mean of the given heights. Our population just consists of 5 players, so we can easily find the population mean

\(\mu=\frac{75+68+70+73+64}{5}=\frac{350}{5}=70\)

Typically, we cannot access the entire population, so we cannot access the population mean, but in this example, our population is small enough for this calculation to be manageable, and we have found that this population has a mean height of \(\mu = 70 \) inches.

Taking samples of size 2, find the sample mean height of every possible sample.

Solution First note that we would typically not take every possible sample, but since our population is quite small, finding the mean of every possible sample will not be too challenging.

We will abbreviate the players' names, referring to each player by just the first letter of her name. So, when we work with the sample consisting of Aubrey and Tyler, for example, we will list this sample as A, T.

A, P   \(\bar{x}=\frac{75+68}{2}=71.5\)
A, T   \(\bar{x}=\frac{75+70}{2}=72.5\)
A, C   \(\bar{x}=\frac{75+73}{2}=74\)
A, J   \(\bar{x}=\frac{75+64}{2}=69.5\)
P, T   \(\bar{x}=\frac{68+70}{2}=69\)
P, C   \(\bar{x}=\frac{68+73}{2}=70.5\)
P, J   \(\bar{x}=\frac{68+64}{2}=66\)
T, C   \(\bar{x}=\frac{70+73}{2}=71.5\)
T, J   \(\bar{x}=\frac{70+64}{2}=67\)
C, J   \(\bar{x}=\frac{73+64}{2}=68.5\)

Since we have taken every possible sample of size 2 and calculated the mean of each sample, the data set 71.5, 72.5, 74, ..., 67, 68.5 consisting of all of the means is the entire sampling distribution for this statistic.

We can actually interpret this sampling distribution as a probability model, considering the sample means (71.5, 72.5, etc.) as the outcomes and assuming each sample is equally likely to occur. Thus, we can represent our sampling distribution in the same way we displayed many probability models in previous sections:
Outcome | 66 | 67 | 68.5 | 69 | 69.5 | 70.5 | 71.5 | 72.5 | 74
Probability | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{2}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\)
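
If you would like to verify this enumeration with a computer, the following short sketch (written in Python, which is not part of the original calculations) lists every sample of size 2, computes each sample mean, and tallies the results into the probability table above.

```python
# Enumerate every sample of size 2 from the five heights and tally the
# sample means into an outcome/probability table.
from itertools import combinations
from collections import Counter
from fractions import Fraction

heights = [75, 68, 70, 73, 64]   # Aubrey, Paige, Tyler, Chanel, Jasmine

# All 10 possible samples of size 2 and their sample means
sample_means = [sum(pair) / 2 for pair in combinations(heights, 2)]

counts = Counter(sample_means)
for outcome in sorted(counts):
    # The outcome 71.5 occurs in two samples, so its probability of 2/10
    # is printed in reduced form as 1/5.
    print(outcome, Fraction(counts[outcome], len(sample_means)))
```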

Interpreting the sampling distribution as a probability distribution, for a random sample of size 2, what is the probability the sample mean will equal the population mean? That is, what is \(P(\bar{x}=\mu)\)?

Solution Recall that we calculated our population mean in an earlier part of this problem and have \(\mu=70\). Referencing the table above, we see that none of our ten sample means equals exactly 70, so the probability of the sample mean equalling the population mean is 0. Even though every sample mean is reasonably close to 70, no sample of size 2 produces a mean of exactly 70.

For a random sample of size 2, what is the probability the sample mean is within 2 inches of the population mean?

Solution Since our population mean is \(\mu=70\), we are interested in sample means between 68 and 72 inches. From our probability table, we see that 6 of our 10 equally likely samples produce a mean in this range (the outcomes 68.5, 69, 69.5, 70.5, and 71.5, the last of which occurs twice). So the probability that the sample mean is within 2 inches of the population mean is \(\frac{6}{10}\).

Find and interpret the expected value of the probability distribution representing the sampling distribution.

Solution Recall that to calculate the expected value of a probability distribution, we multiply each outcome by its probability, adding all of the resulting products. So we have

\(E(x) = 66\cdot\frac{1}{10}+67\cdot\frac{1}{10}+68.5\cdot\frac{1}{10}+69\cdot\frac{1}{10}+69.5\cdot\frac{1}{10}+70.5\cdot\frac{1}{10}+71.5\cdot\frac{2}{10}+72.5\cdot\frac{1}{10}+74\cdot\frac{1}{10}\)

\(=\frac{66}{10}+\frac{67}{10}+\frac{68.5}{10}+\frac{69}{10}+\frac{69.5}{10}+\frac{70.5}{10}+\frac{143}{10}+\frac{72.5}{10}+\frac{74}{10}\)

\(=\frac{700}{10}=70\)


From a probability perspective, the expected value is the number we anticipate we will get if we "average" the outcomes of a large number of trials. In the context of this problem, the "trials" consist of taking samples of two players and finding their mean height, so calculating the expected value simulates taking many samples of size 2 and finding the mean of each sample - but we don't need to simulate taking samples in this case because we actually took every possible sample. Thus it is reasonable to anticipate the value we get from our expected value calculation should be the actual mean of the underlying population. And this is precisely what our calculations have just verified.
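
For readers who want to double-check this arithmetic by machine, here is a minimal Python sketch (the outcomes and probabilities are copied directly from the table above) that computes the expected value.

```python
# Expected value of the sampling distribution, computed directly from the
# outcome/probability table above.
outcomes      = [66, 67, 68.5, 69, 69.5, 70.5, 71.5, 72.5, 74]
probabilities = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1]

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(round(expected_value, 6))   # 70.0 -- matches the population mean
```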

It is important to note that we rarely - if ever! - deal with populations that are small enough to allow us to find every possible sample of a specific size. Therefore, we need tools to be able to describe the sampling distribution without actually calculating it in its entirety.

The following theorem, which is one of the most important theorems in mathematical statistics, provides powerful conclusions about the sampling distribution of a statistic.

Theorem THE CENTRAL LIMIT THEOREM

- If the sample size is sufficiently large, the sampling distribution of a statistic is approximately normal - even if the underlying population is not normally distributed.

- The mean of the sampling distribution for a statistic is the related population parameter value and the standard deviation of the sampling distribution is the population standard deviation divided by the square root of the sample size.

The statement of the Central Limit Theorem may seem technical and difficult to decipher, but the fundamental conclusions we can draw from it are vitally important.

The first conclusion the Central Limit Theorem provides is that the sampling distribution of a statistic is guaranteed to be approximately a normal distribution (provided our sample size is sufficiently large). This is extremely powerful because, if we would like to draw conclusions about our sampling distribution, we can use all of the tools we have already developed for normal distributions - z-scores, percentiles, the 68-95-99.7 Rule, etc.

This conclusion is not surprising if the population we are sampling from is itself normally distributed (if you take repeated samples from a normal distribution, you'd clearly expect the statistics you generate to themselves be normally distributed). But the real power of the first part of the Central Limit Theorem is that we can be confident that the sampling distribution will be normally distributed even if the underlying population is not normally distributed. So no matter how "bad" our underlying population is (skewed, terrible outliers, several distinct "peaks," etc.), we know that if we were to take samples from that population over and over again, the resulting statistics would be approximately normal.

Thus, no matter how "non-normal" the underlying population may be, we can always apply the tools we have developed for normal distributions to the sampling distribution.
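
One way to convince yourself of this claim is with a small simulation. The sketch below (Python with the NumPy library; the particular skewed population and sample size are arbitrary choices for illustration, not data from this section) draws thousands of samples from a heavily skewed population and examines the resulting sample means.

```python
# Draw many samples from a strongly right-skewed population and look at
# the distribution of the resulting sample means.
import numpy as np

rng = np.random.default_rng(seed=0)

# A decidedly non-normal "population": exponential, heavily skewed right
population = rng.exponential(scale=10, size=100_000)

n = 50   # sample size
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(5_000)])

# The sample means cluster symmetrically around the population mean with
# spread close to (population sd) / sqrt(n), just as the theorem predicts.
print("population mean:        ", population.mean())
print("mean of sample means:   ", sample_means.mean())
print("population sd / sqrt(n):", population.std() / np.sqrt(n))
print("sd of sample means:     ", sample_means.std())
```

A histogram of these 5,000 sample means would display the familiar bell shape, even though a histogram of the population itself would not.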

Recall that all conclusions about a normal distribution are derived from two defining properties of that distribution - its mean and its standard deviation. So if we would like to work with the sampling distribution of a statistic, we simply need to know its mean and standard deviation.

The second half of the Central Limit Theorem defines these values for a sampling distribution, as demonstrated by the following examples.

Example Suppose the outcomes of a random phenomenon are normally distributed with a mean of \(\mu = 4.6\) and a standard deviation of \(\sigma=0.8\). Suppose a sample of size 12 is taken from this population and the mean of this sample, \(\bar{x}\), is calculated.

According to the Central Limit Theorem, what is the mean of the sampling distribution for \(\bar{x}\)?

Solution The second part of the Central Limit Theorem states that the mean of the sampling distribution for a statistic is the related population parameter value. In this example, the statistic we are calculating is a sample mean, so the related population parameter value is the population mean. Thus, the mean of the sampling distribution is 4.6.

This implies that if we were to repeatedly take samples from this population and calculate the mean of each sample, the data set consisting of those means would be centered at the actual mean value for the entire population, namely 4.6.

According to the Central Limit Theorem, what is the standard deviation of the sampling distribution for \(\bar{x}\)?

Solution The second part of the Central Limit Theorem states that the standard deviation of the sampling distribution for a statistic is the population standard deviation divided by the square root of the sample size. As indicated in this example, the standard deviation for the entire population is often represented by the variable \(\sigma\).

The conclusion of the second part of the Central Limit Theorem is, then, that the standard deviation for the sampling distribution is given by \(\frac{\sigma}{\sqrt{n}}\).

So if we take samples of size 12 (so \(n = 12\)), we get a standard deviation for our sampling distribution of

\(\frac{\sigma}{\sqrt{n}}=\frac{0.8}{\sqrt{12}} \approx \frac{0.8}{3.464} \approx 0.231\)

Note that the standard deviation for our sampling distribution is smaller than the standard deviation for our original "parent" population. Since we calculate the standard deviation of the sampling distribution by dividing the population's standard deviation by \(\sqrt{n}\), which is greater than 1 for any sample of more than one observation, the result will always be smaller than the population's standard deviation.
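
The calculation above can also be checked numerically. The following sketch (Python with NumPy; the simulation portion is an optional check, not part of the example itself) computes \(\sigma/\sqrt{n}\) for this example and compares it to the spread of simulated sample means.

```python
# Standard deviation of the sampling distribution for the example above
# (mu = 4.6, sigma = 0.8, n = 12), plus an optional simulation check.
import numpy as np

mu, sigma, n = 4.6, 0.8, 12
print("sigma / sqrt(n):", sigma / np.sqrt(n))   # approximately 0.231

rng = np.random.default_rng(seed=1)
simulated_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
print("sd of simulated sample means:", simulated_means.std())  # also ~0.23
```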

It is important to note that the previous example is somewhat unrealistic because we are provided with parameter values from the population. That is, we are given the population mean, \(\mu\), and the population standard deviation, \(\sigma\), even though we typically do not know the actual mean and standard deviation of the entire population. The next example highlights these issues in the context of the data discussed previously regarding bark beetle damage in the Coconino National Forest.

Example Utilizing the Central Limit Theorem, identify the mean and standard deviation of the sampling distribution of the sample proportions discussed in the earlier example regarding bark beetle damage in the Coconino National Forest.

Solution Recall that the sampling distribution would be generated by taking every possible sample of 380 ponderosa pine trees in the Coconino National Forest and calculating the percentage of each and every sample which have bark beetle damage. By the first part of the Central Limit Theorem, we know this sampling distribution will be approximately a normal distribution. The second part of the Central Limit Theorem defines the key properties of this sampling distribution - its mean and standard deviation.

By the second part of the Central Limit Theorem, we can state that the sampling distribution's mean is the related population proportion. So the sampling distribution's mean is the actual proportion (or percentage) of all ponderosa pine trees in the Coconino National Forest which have bark beetle damage - but we don't know this overall population percentage (if we did, we wouldn't bother taking a sample!).

The second part of the Central Limit Theorem also defines the standard deviation of the sampling distribution, again in terms of a population value: we would need the overall population's standard deviation, which we would then divide by the square root of the sample size. But we are not given (and we cannot calculate from a sample) the population standard deviation.

Thus it does not appear that we are given enough information to discern the mean and standard deviation of the sampling distribution.

This example is more realistic - and more frustrating. In "real life," we don't know any parameter values from the original parent population, so the conclusions stated in the second part of the Central Limit Theorem do not provide us with any usable information about our sampling distribution.

Fortunately, there is sophisticated mathematical reasoning which provides rigorous justification for us to take the following extremely practical step: we use sample statistics in place of the required population parameter values. That is, we can simply use the numbers we generate from our sample in our calculations instead of the population values required by the Central Limit Theorem.

For the remainder of this section, we will focus our attention on one specific type of statistic - a sample proportion. As most of the examples from this chapter reveal, it is very often our goal to approximate the percentage of the entire population which has a certain characteristic. We approximate this percentage (or proportion) by measuring what percentage of a sample has the characteristic, thus generating a sample proportion.

Within this context, the values we will use to draw conclusions about our sampling distribution are given in the table below.

Mean of Sampling Distribution | \(\bar{p}\) = sample proportion (the percentage of the sample which has the stated characteristic, taken from a single sample)
Standard Deviation of Sampling Distribution | \(\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\), where \(\bar{p}\) is the sample proportion and \(n\) is the sample size
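
For convenience, these two quantities can be packaged into a small helper function. The sketch below is written in Python, and the function name is our own choice for illustration rather than standard notation.

```python
# Mean and standard deviation of the sampling distribution for a sample
# proportion, as given in the table above.
from math import sqrt

def sampling_distribution_summary(p_bar, n):
    """Return (mean, standard deviation) of the sampling distribution
    for a sample proportion p_bar computed from a sample of size n."""
    return p_bar, sqrt(p_bar * (1 - p_bar) / n)

# Example usage (values from the poll discussed next): p_bar = 0.76, n = 185
print(sampling_distribution_summary(0.76, 185))   # (0.76, approximately 0.031)
```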

The following example demonstrates how to calculate the values given above as well as how they can be used to draw conclusions about a sampling distribution.

Example A 2012 Zogby poll concluded that "76% of adults say that texting is diminishing the writing skills of most people." The details of the report indicate that this claim is based on a survey of 185 adults.

Find the mean and standard deviation of the sampling distribution for this statistic. Discuss conclusions which can be drawn about this sampling distribution based on its properties as a normal distribution.

Solution First, recall that the sampling distribution would consist of all of the sample proportions which would hypothetically be generated if we took every possible sample of 185 adults. We know, by the Central Limit Theorem, that this sampling distribution would be approximately a normal distribution. In addition, we can specify its mean and standard deviation based on the information given in the table above.

The mean of the sampling distribution will be the sample proportion, or 76% in this example. When plugged into calculations/formulas, we'll write this percentage as the decimal 0.76.

The standard deviation of the sampling distribution will be calculated by the formula

\(s.d.=\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\)

Plugging in the specific values from this problem, we get

\(s.d.=\sqrt{\frac{0.76(1-0.76)}{185}}\)

\(=\sqrt{\frac{0.76(0.24)}{185}}\)

\(=\sqrt{\frac{0.1824}{185}}\)

\(\approx\sqrt{0.000986}\)

\(\approx 0.03\)

Therefore our sampling distribution is approximately a normal distribution with a mean of 0.76 and a standard deviation of 0.03.

Knowing this, we can apply all of the tools developed earlier in the chapter for normal distributions - z-scores, percentiles, etc. Usually, we will not need to analyze this sampling distribution with the precision provided by z-scores and percentiles, but will instead simply approximate important ranges of values based on the 68-95-99.7 Rule.

The following graph is labeled with the information provided by the 68-95-99.7 Rule for this particular problem:

So we can conclude that (approximately) the middle 68% of our sampling distribution falls between 0.73 and 0.79, or between 73% and 79%. Similarly, we see that the middle 95% of our sampling distribution (or "almost all" of the sampling distribution) falls between 0.70 and 0.82 (between 70% and 82%), while the middle 99.7% of our sampling distribution falls between 0.67 and 0.85.

Recall that we interpret our sampling distribution as deriving from all possible samples of 185 adults (for this example), so the conclusions in the previous paragraph describe the percentages of those possible samples which fall in the given ranges. Typically, we modify this conclusion slightly and re-word our claim to indicate how confident we are that the stated interval contains the actual population parameter value. For this particular scenario, we would claim, for example, that we are 68% confident that the actual percentage of all adults who feel texting is diminishing writing skills is between 73% and 79%. Also, we would claim that we are 95% confident that the actual percentage is between 70% and 82%. Finally, we would claim that we are 99.7% confident that the actual percentage of adults who feel texting is diminishing writing skills is between 67% and 85%.
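
These three intervals can be reproduced with a few lines of code. The Python sketch below uses the unrounded standard deviation, and the rounded endpoints agree with the values stated above.

```python
# The 68-95-99.7 intervals for the texting poll (p_bar = 0.76, n = 185),
# using the unrounded standard deviation.
from math import sqrt

p_bar, n = 0.76, 185
sd = sqrt(p_bar * (1 - p_bar) / n)   # approximately 0.031

for k, level in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = p_bar - k * sd, p_bar + k * sd
    print(f"middle {level} of sample proportions: {low:.2f} to {high:.2f}")
```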

The example above is more detailed than we typically need. Usually, we simply want a quick way of determining a range of values which captures "almost all" of the sample proportions that could be generated. Once we have constructed such a range of values, or interval, we will conclude that we are 95% confident that the true population value lies somewhere within this interval. For this reason, the range of values we construct is called a 95% Confidence Interval.

Example Find and interpret a 95% Confidence Interval for the following claim from the media:

A survey of 1110 registered voters in Utah found that 445 plan to vote for Republican candidate Mia Love in the upcoming congressional election.


Solution First, we need to find our sample proportion - the percent of our sample which has the characteristic we are interested in. This scenario focuses on support for candidate Mia Love, so we find that \(\frac{445}{1110} \approx 0.40\) of the sample have our desired characteristic. So \(\bar{p}=0.40\).

We can easily identify our sample size (\(n=1110\)), so we have all of the necessary values to plug into the formula for calculating the standard deviation of our sampling distribution.

\(s.d.=\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\)

Plugging in the specific values from this problem, we get

\(s.d.=\sqrt{\frac{0.40(1-0.40)}{1110}}\)

\(=\sqrt{\frac{0.40(0.60)}{1110}}\)

\(=\sqrt{\frac{0.24}{1110}}\)

\(\approx\sqrt{0.000216}\)

\(\approx 0.015\)

So our sampling distribution has a mean of \(\bar{p}=0.40\) and a standard deviation of \(s.d. = 0.015\).

In the previous example, we built all of the intervals provided by the 68-95-99.7 Rule, but in this example, we will only construct the interval determined by the "95" part of the rule. So we need a range of values which are within two standard deviations of the mean. Note that two standard deviations equals 0.03, so by subtracting this from the mean, we have a lower boundary of 0.37. Adding 0.03 to the mean yields an upper boundary of 0.43, creating an overall interval ranging from 0.37 to 0.43.

Thus we can conclude that we are 95% confident that the actual percentage of Utah voters who support Love is between 37% and 43%.
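
As before, this interval is easy to reproduce with a short computation. Here is a Python sketch; the endpoints round to the same 37% and 43% found above.

```python
# 95% confidence interval for the Mia Love poll (445 of 1110 voters).
from math import sqrt

p_bar = 445 / 1110                  # approximately 0.40
n = 1110
sd = sqrt(p_bar * (1 - p_bar) / n)  # approximately 0.015

low, high = p_bar - 2 * sd, p_bar + 2 * sd
print(f"95% confident the true proportion is between {low:.2f} and {high:.2f}")
```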

So the Central Limit Theorem provides the framework for drawing conclusions about the sampling distribution of a statistic. These conclusions enable us to construct an interval of values around our statistic so that we do not have to put our full confidence in the value calculated from a single sample. Thus we can infer (with a specific level of confidence) information about a value from the population based solely on information we gather from a sample.


