Throughout the last several sections, we have returned to the challenges that arise when trying to infer information about a *population* from information we gather from a *sample*. For example, a recent news report claimed

"Nearly eight and a half percent of homeowners in the United States are in danger of foreclosure."

We recognize that the percentage stated was most likely *not* generated by gathering information from *every single* homeowner in the United States. Instead, some *sample* was selected and information was gathered from that sample. But we are confronted with the fact that if the survey were conducted again - that is, if we contacted a *different sample* of the same size - we would not expect to find that exactly eight and a half percent of *that* sample was also facing foreclosure. So if we re-sample, we expect the resulting statistic will most likely be different. The following example highlights this issue.

**Example** In order to assess the damage to ponderosa pine trees due to bark beetles, U.S. Forest Service employees walk through specified patches of forest and count the number of trees which display bark beetle damage.

In an effort to ensure their statistics were representative of all of the trees in the Coconino National Forest (that is, representative of the overall *population*), Forest Service employees repeated their surveying procedure multiple times. The following table summarizes their findings.

Survey Number | # of Trees w/ Bark Beetle Damage | Total # of Trees Inspected | % of Sample w/ Bark Beetle Damage
---|---|---|---
1 | 206 | 380 | 54.2%
2 | 184 | 380 | 48.4%
3 | 209 | 380 | 55.0%
4 | 204 | 380 | 53.7%
5 | 214 | 380 | 56.3%
6 | 223 | 380 | 58.7%

What conclusions can be drawn from these results? Which sample proportion is most trustworthy?

Recall that our goal in such a situation is to try to infer the value of the *population* parameter - here, the percentage of *all* trees in the Coconino National Forest which display bark beetle damage - from the statistics calculated from our samples. We need the more advanced tools and ways of reasoning developed in this section to confront a situation such as this.

The fact that different statistics resulted from different samples should not be surprising - this is simply a consequence of *sampling variability*. Before we discuss the mathematical theory which allows us to responsibly address issues such as those raised in the previous example, we need to define one important piece of terminology.

Referring to the previous example, we can imagine taking *all possible* samples of size 380 of ponderosa pine trees in the Coconino National Forest (which would be a huge number of samples!) and calculating the percentage of each sample which displayed bark beetle damage. The percentages provided in the table in the earlier example are just a few of the values we would expect to get. If we collected all of these possible percentages, we would have the *sampling distribution* for the sample proportion.

Usually, our population is too large (or too difficult to access in its entirety) for us to actually take *all possible* samples of a specific size. We would never actually consider trying to take *all possible* samples of 380 trees within the Coconino National Forest, for example.

If our population is very small, though, it *is* possible to take *every single* sample of a certain size and calculate a specified statistic for each sample. If we do, we will generate the entire *sampling distribution* for the statistic we calculate from the samples. The following problem provides one example of actually generating the entire sampling distribution for a certain statistic and reveals some important concepts from probability topics discussed earlier in this chapter which underlie the main theorem from this section.

**Example** Suppose we would like to analyze the population consisting of the heights of the 5 women who make up the starting line-up for NAU's women's basketball team. Their heights (in inches) are provided in the table below.

Player | Aubrey | Paige | Tyler | Chanel | Jasmine
---|---|---|---|---|---
Height | 75 | 68 | 70 | 73 | 64

What is the population mean height of all five players?

\(\mu=\frac{75+68+70+73+64}{5}=\frac{350}{5}=70\)

Typically, we cannot access the entire population, so we cannot calculate \(\mu\) directly - instead, we must work with samples. Taking samples of size 2, find the *sample* mean height of *every possible* sample.

We will abbreviate the players' names, referring to each by just the first letter. So when we work with the sample consisting of Aubrey and Tyler, for example, we will list this sample as A, T.

Sample | Sample Mean
---|---
A, P | \(\bar{x}=\frac{75+68}{2}=71.5\)
A, T | \(\bar{x}=\frac{75+70}{2}=72.5\)
A, C | \(\bar{x}=\frac{75+73}{2}=74\)
A, J | \(\bar{x}=\frac{75+64}{2}=69.5\)
P, T | \(\bar{x}=\frac{68+70}{2}=69\)
P, C | \(\bar{x}=\frac{68+73}{2}=70.5\)
P, J | \(\bar{x}=\frac{68+64}{2}=66\)
T, C | \(\bar{x}=\frac{70+73}{2}=71.5\)
T, J | \(\bar{x}=\frac{70+64}{2}=67\)
C, J | \(\bar{x}=\frac{73+64}{2}=68.5\)

We can actually interpret this sampling distribution as a *probability distribution*: each of the 10 possible samples is equally likely to be selected, so the probability of each outcome is its frequency out of 10.

Outcome | 66 | 67 | 68.5 | 69 | 69.5 | 70.5 | 71.5 | 72.5 | 74
---|---|---|---|---|---|---|---|---|---
Probability | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\) | \(\frac{2}{10}\) | \(\frac{1}{10}\) | \(\frac{1}{10}\)
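Because this population has only five members, the entire sampling distribution can be checked by brute force. The following Python sketch (our own illustrative code, not part of the original example) enumerates every sample of size 2 and tallies the sample means:

```python
from itertools import combinations
from collections import Counter

# Population: heights (in inches) of the five starting players
heights = {"A": 75, "P": 68, "T": 70, "C": 73, "J": 64}

# Enumerate every possible sample of size 2 and compute its mean
sample_means = [(a + b) / 2 for a, b in combinations(heights.values(), 2)]

# Tally the sampling distribution of the sample mean
distribution = Counter(sample_means)
```

Each of the 10 equally likely samples contributes probability 1/10, with the outcome 71.5 occurring twice.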

Interpreting the sampling distribution as a probability distribution, for a random sample of size 2, what is the probability the sample mean will equal the population mean? That is, what is \(P(\bar{x}=\mu)\)?

For a random sample of size 2, what is the probability the sample mean is within 2 inches of the population mean?
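Reading these probabilities off the list of sample means computed above: none of the ten samples has a mean of exactly 70, while six of them (means 68.5, 69, 69.5, 70.5, and 71.5 twice) fall within 2 inches of 70. So

\(P(\bar{x}=\mu)=P(\bar{x}=70)=0\)

\(P(68\le\bar{x}\le 72)=\frac{6}{10}=0.6\)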

Find and interpret the expected value of the probability distribution representing the sampling distribution.

\(E(\bar{x}) = 66\cdot\frac{1}{10}+67\cdot\frac{1}{10}+68.5\cdot\frac{1}{10}+69\cdot\frac{1}{10}+69.5\cdot\frac{1}{10}+70.5\cdot\frac{1}{10}+71.5\cdot\frac{2}{10}+72.5\cdot\frac{1}{10}+74\cdot\frac{1}{10}\)

\(=\frac{66}{10}+\frac{67}{10}+\frac{68.5}{10}+\frac{69}{10}+\frac{69.5}{10}+\frac{70.5}{10}+\frac{143}{10}+\frac{72.5}{10}+\frac{74}{10}\)

\(=\frac{700}{10}=70\)

From a probability perspective, the expected value is the number we anticipate we will get if we "average" the outcomes of a large number of trials. In the context of this problem, the "trials" consist of taking samples of two players and finding their mean height, so calculating the expected value simulates taking many samples of size 2 and finding the mean of each sample - but we don't need to actually carry out those trials. The expected value tells us that, on average, the sample means will equal the *population* mean of 70.

It is important to note that we *rarely* - if ever! - deal with populations that are small enough to allow us to find *every possible* sample of a specific size. Therefore, we need tools to be able to describe the sampling distribution without actually calculating it in its entirety.

The following theorem, the Central Limit Theorem, is one of the most important theorems in mathematical statistics; it provides powerful conclusions about the sampling distribution of a statistic.

- If the sample size is sufficiently large, the sampling distribution of a statistic is approximately normal - even if the underlying population is not normally distributed.

- The mean of the sampling distribution for a statistic is the related population parameter value, and the standard deviation of the sampling distribution is the population standard deviation divided by the square root of the sample size.

The statement of the Central Limit Theorem may seem technical and difficult to decipher, but the fundamental conclusions we can draw from it are vitally important.

The first conclusion the Central Limit Theorem provides is that the sampling distribution of a statistic is *guaranteed to be approximately a normal distribution* (provided our sample size is sufficiently large). This is extremely powerful because, if we would like to draw conclusions about our sampling distribution, we can use all of the tools we have already developed for normal distributions - z-scores, percentiles, the 68-95-99.7 Rule, etc.

This conclusion is not surprising if the population we are sampling from is itself normally distributed (if you take repeated samples from a normal distribution, you'd clearly expect the statistics you generate to themselves be normally distributed). But the real power of the first part of the Central Limit Theorem is that we can be confident that the sampling distribution will be normally distributed *even if the underlying population is not normally distributed*. So no matter how "bad" our underlying population is (skewed, terrible outliers, several distinct "peaks," etc.), we know that if we were to take samples from that population over and over again, the resulting statistics would be approximately normal.

Thus, no matter how "non-normal" the underlying population may be, we can always apply the tools we have developed for normal distributions to the sampling distribution.
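This claim is easy to see in a simulation. The sketch below (illustrative code of our own, with an exponential distribution standing in for a "badly" skewed population) draws many samples and examines the resulting sample means:

```python
import random
import statistics

random.seed(1)  # reproducible illustration

n = 50          # sample size
trials = 2000   # number of repeated samples

# A heavily skewed population: exponential with mean 1 (and sd 1)
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# The sample means cluster around the population mean (1.0) ...
center = statistics.fmean(sample_means)
# ... with spread close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.141,
# and a histogram of sample_means would look approximately normal
spread = statistics.stdev(sample_means)
```

Even though individual exponential outcomes are badly skewed, the collection of sample means is symmetric and bell-shaped.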

Recall that all conclusions about a normal distribution are derived from *two* defining properties of that distribution - its *mean* and its *standard deviation*. So if we would like to work with the sampling distribution of a statistic, we simply need to know its mean and standard deviation.

The second half of the Central Limit Theorem defines these values for a sampling distribution, as demonstrated by the following examples.

**Example** Suppose the outcomes of a random phenomenon are normally distributed with a mean of \(\mu = 4.6\) and a standard deviation of \(\sigma=0.8\). Suppose a sample of size 12 is taken from this population and the mean of this sample, \(\bar{x}\), is calculated.

According to the Central Limit Theorem, what is the mean of the *sampling distribution* for \(\bar{x}\)? By the theorem, the mean of the sampling distribution equals the population mean, so it is 4.6.

This implies that if we were to repeatedly take samples from this population and calculate the mean of each sample, the data set consisting of those means would be centered at the actual mean value for the entire population, namely 4.6.

According to the Central Limit Theorem, what is the standard deviation of the *sampling distribution* for \(\bar{x}\)?

The conclusion of the second part of the Central Limit Theorem is, then, that the standard deviation for the sampling distribution is the population standard deviation divided by the square root of the sample size, \(\frac{\sigma}{\sqrt{n}}\).

So if we take samples of size 12 (so \(n = 12\)), we get a standard deviation for our sampling distribution of

\(\frac{\sigma}{\sqrt{n}}=\frac{0.8}{\sqrt{12}} \approx \frac{0.8}{3.464} \approx 0.231\)
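We can verify this value with a quick simulation (a sketch of our own, not part of the original example): repeatedly draw samples of size 12 from a normal population with \(\mu = 4.6\) and \(\sigma = 0.8\), and measure the spread of the resulting sample means.

```python
import math
import random
import statistics

random.seed(2)  # reproducible illustration

mu, sigma, n = 4.6, 0.8, 12
predicted_sd = sigma / math.sqrt(n)  # ≈ 0.231, as computed above

# Draw many samples of size 12 and record each sample mean
means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(5000)
]

# The empirical sd of the sample means lands close to sigma / sqrt(n)
empirical_sd = statistics.stdev(means)
```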

Note that the standard deviation for our sampling distribution (about 0.231) is much smaller than the standard deviation of the underlying population (0.8) - averaging a sample dampens the variability of individual outcomes.

It is important to note that the previous example is somewhat *unrealistic* because we are provided with parameter values from the population. That is, we are given the *population* mean, \(\mu\), and the *population* standard deviation, \(\sigma\) - even though we typically do not know the *actual* mean and standard deviation of the *entire* population. The next example highlights these issues in the context of the data discussed previously regarding bark beetle damage in the Coconino National Forest.

**Example** Utilizing the Central Limit Theorem, identify the *mean* and *standard deviation* of the sampling distribution of the sample proportions discussed in the earlier example regarding bark beetle damage in the Coconino National Forest.

By the second part of the Central Limit Theorem, we can state that the sampling distribution's mean is the related population proportion. So the sampling distribution's mean is the percentage of *all* ponderosa pine trees in the Coconino National Forest which display bark beetle damage - but that is precisely the value we do not know!

The second part of the Central Limit Theorem also defines the standard deviation of the sampling distribution - again in terms of a *population* parameter value, which we likewise do not know.

Thus it does not appear that we are given enough information to discern the mean and standard deviation of the sampling distribution.

This example is more realistic - and more *frustrating*. In "real life," we don't know any parameter values from the original parent population, so the conclusions stated in the second part of the Central Limit Theorem do not provide us with any usable information about our sampling distribution.

Fortunately, there is sophisticated mathematical reasoning which provides rigorous justification for us to take the following *extremely* practical step: *we use sample statistics in place of the required population parameter values*. That is, we can simply use the numbers we generate from our sample in our calculations instead of the population values required by the Central Limit Theorem.

For the remainder of this section, we will focus our attention on one specific type of statistic - a sample proportion. As most of the examples from this chapter reveal, it is very often our goal to approximate the *percentage* of the entire population which has a certain characteristic. We approximate this percentage (or *proportion*) by measuring what percentage of a *sample* has the characteristic, thus generating a *sample proportion*.

Within this context, the values we will use to draw conclusions about our sampling distribution are given in the table below.

Mean of Sampling Distribution | \(\bar{p}\) = sample proportion (percentage of the sample which has the stated characteristic, taken from a single sample)
---|---
Standard Deviation of Sampling Distribution | \(\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\), where \(\bar{p}\) is the sample proportion and \(n\) is the sample size

The following example demonstrates how to calculate the values given above as well as how they can be used to draw conclusions about a sampling distribution.

**Example** A 2012 Zogby poll concluded that "76% of adults say that texting is diminishing the writing skills of most people." The details of the report indicate that this claim is based on a survey of 185 adults.

Find the mean and standard deviation of the sampling distribution for this statistic. Discuss conclusions which can be drawn about this sampling distribution based on its properties as a normal distribution.

The mean of the sampling distribution will be the sample proportion, or 76% in this example. When plugged into calculations/formulas, we'll write this percentage as the decimal 0.76.

The standard deviation of the sampling distribution will be calculated by the formula

\(s.d.=\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\)

Plugging in the specific values from this problem, we get

\(s.d.=\sqrt{\frac{0.76(1-0.76)}{185}}\)

\(=\sqrt{\frac{0.76(0.24)}{185}}\)

\(=\sqrt{\frac{0.1824}{185}}\)

\(\approx\sqrt{0.001}\)

\(\approx 0.03\)

Therefore our sampling distribution is approximately a normal distribution with a mean of 0.76 and a standard deviation of 0.03. Knowing this, we can apply all of the tools developed earlier in the chapter for normal distributions - z-scores, percentiles, etc. Usually, we will rely on the 68-95-99.7 Rule.

The following graph is labeled with the information provided by the 68-95-99.7 Rule for this particular problem:

So we can conclude that (approximately) the middle 68% of our sampling distribution falls between 0.73 and 0.79, or between 73% and 79%. Similarly, we see that the middle 95% of our sampling distribution falls between 0.70 and 0.82 (between 70% and 82%), while the middle 99.7% of our sampling distribution (or "almost all" of the sampling distribution) falls between 0.67 and 0.85.
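These boundary values are easy to verify in a few lines of Python (a sketch of our own; the helper name is our choice, not from the original text):

```python
import math

def prop_sampling_sd(p_bar: float, n: int) -> float:
    """sd of the sampling distribution of a sample proportion."""
    return math.sqrt(p_bar * (1 - p_bar) / n)

sd = prop_sampling_sd(0.76, 185)  # ≈ 0.031, which rounds to the 0.03 used above

# 68-95-99.7 Rule boundaries around the mean of 0.76
middle_68 = (round(0.76 - sd, 2), round(0.76 + sd, 2))          # (0.73, 0.79)
middle_95 = (round(0.76 - 2 * sd, 2), round(0.76 + 2 * sd, 2))  # (0.70, 0.82)
middle_997 = (round(0.76 - 3 * sd, 2), round(0.76 + 3 * sd, 2)) # (0.67, 0.85)
```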

Recall that we interpret our sampling distribution as deriving from all possible samples of 185 adults (for this example), so the conclusions in the previous paragraph refer to the percentages of those possible samples which fall in the given ranges. Typically, we modify this conclusion slightly and re-word our claim to indicate how *confident* we are that the true population proportion lies within a given range.

The example above is *too* detailed. Usually, we simply want a quick way of determining a *range of values* which captures "almost all" of the sample proportions that could be generated. Once we have constructed such a range of values, or *interval*, we will conclude that we are 95% confident that the true population value lies somewhere within this interval. For this reason, the range of values we construct is called a 95% Confidence Interval.

**Example** Find and interpret a 95% Confidence Interval for the following claim from the media:

A survey of 1110 registered voters in Utah found that 445 plan to vote for Republican candidate Mia Love in the upcoming congressional election.

From the survey, our sample proportion is \(\bar{p}=\frac{445}{1110}\approx 0.40\). We can also easily identify our sample size (\(n=1110\)), so we have all of the necessary values to plug into the formula for calculating the standard deviation of our sampling distribution.

\(s.d.=\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\)

Plugging in the specific values from this problem, we get

\(s.d.=\sqrt{\frac{0.40(1-0.40)}{1110}}\)

\(=\sqrt{\frac{0.40(0.60)}{1110}}\)

\(=\sqrt{\frac{0.24}{1110}}\)

\(\approx\sqrt{0.0002}\)

\(\approx 0.015\)

So our sampling distribution has a mean of \(\bar{p}=0.40\) and a standard deviation of \(s.d. = 0.015\).

In the previous example, we built all of the intervals provided by the 68-95-99.7 Rule, but in this example, we will only construct the interval determined by the "95" part of the rule. So we need a range of values which are within two standard deviations of the mean. Note that two standard deviations equals 0.03, so by subtracting this from the mean, we have a lower boundary of 0.37. Adding 0.03 to the mean yields an upper boundary of 0.43, creating an overall interval ranging from 0.37 to 0.43.

Thus we can conclude that we are 95% confident that the actual percentage of Utah voters who support Love is between 37% and 43%.
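The construction above can be sketched as a small function (a hypothetical helper of our own, shown only to mirror the arithmetic):

```python
import math

def confidence_interval_95(p_bar: float, n: int) -> tuple[float, float]:
    """95% confidence interval: sample proportion +/- 2 sampling-distribution sds."""
    sd = math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - 2 * sd, p_bar + 2 * sd

low, high = confidence_interval_95(0.40, 1110)
# low and high round to 0.37 and 0.43, the interval found above
```

Note that, following the plug-in step described earlier, the function uses the *sample* proportion in place of the unknown population proportion.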

So the Central Limit Theorem provides the framework for drawing conclusions about the sampling distribution of a statistic. These conclusions enable us to construct an interval of values around our statistic so that we do not have to put our full confidence in the value calculated from a single sample. Thus we can infer (with a specific level of confidence) information about a value from the *population* using only the information gathered from a single *sample*.
