# Module 1.6 - Sampling Distributions

## A - Introduction

A sampling distribution is the distribution of a given statistic computed over every possible random sample of a given size from a population. The sampling distribution of means is a commonly used type, composed of the mean scores of random samples from a population. It is an important distribution to consider whenever you want to understand the variability in your sample data and how your sample compares to the population: you want to determine how far the sample value is from the population value, which is crucial when making inferences about the population parameter. Sampling error in the sample means is normally distributed, and that allows us to say a lot about the proportions and probabilities of events.

Sampling distributions rest on two main ideas: the Distribution of Sample Means and the Logic of Hypothesis Testing. The gist is that a single sample can be used to make an inference about the population, because the distribution of sample means underlies hypothesis testing and because sampling error is normally distributed.

The following outline covers the importance of sampling distributions, sampling error, the normal distribution, and the Central Limit Theorem, using JMP throughout to build conceptual understanding.

## B - Seeing Sample Variability with JMP

### 1 - Random Squares

From the Module 1-6 JMP journal, you will find a link that reads Random Squares, a website Julian created that uses PHP to simulate sampling scenarios.

For the first demo, we looked at something simple: a single coin flip. The Total refers to the total number of samples in your experiment (n). The Mean will depend on the size of n. The Expected Value, or the probability of getting a heads (or a tails), is always 0.5 regardless of the size of n. Each click of the button below "Expected Value" regenerates a new random value, because one click represents one coin flip. Because we are dealing with a sample size of n = 1, the expected value does not change.

### 2 - Binomial Sampling

#### n = 1

Right click the column labeled "Mean n = 1", go to the Formula section, choose Random, and finally Random Binomial. This draws random values from a binomial distribution, and because it is a binomial process, you will input two arguments.

The n variable represents how many flips you want. Here, n = 1 because you have one coin, so the function will return either a 0 or a 1.

The p variable represents the probability that the Random Binomial process will generate a 1 (as opposed to a 0).

You divide by 1 because that is your total sample size.

Now you want to see how often you get a 0 or a 1 by drawing random samples of this size 100,000 times. Since we are only dealing with n = 1, we flip one coin 100,000 times.

In JMP, you'll add 100,000 rows by doing the following: Right click the column of interest (Mean n = 1)>Rows>Add Rows...>100,000.

The distribution shows the frequency of the two possible outcomes you can get: 0 or 1 (tails or heads).

Notice that the mean does not exactly equal 0.50. Out of the 100,000 samples you've taken, you won't have an exactly even split of heads and tails. However, with a sample that large, it gets close.
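Outside of JMP, the same n = 1 demo can be sketched in Python with NumPy (a stand-in for JMP here; the seed and the name `means_n1` are illustrative choices, not part of the course materials):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

# 100,000 draws of a single fair coin flip (n = 1, p = 0.5);
# dividing by n = 1 turns each draw into the mean of its sample
means_n1 = rng.binomial(n=1, p=0.5, size=100_000) / 1

print(means_n1.mean())  # close to, but not exactly, 0.50
```

With 100,000 draws, the observed mean typically lands within a few thousandths of 0.50.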

#### n = 4

From the PHP simulator, click the +Row link to add another row and +Col to add another column. Now n = 4 because you added extra coins to the sample. One click of the button now represents flipping four coins once.

Because you are now averaging the outcomes of four coins, the mean will vary from click to click.

Remember that an outcome of 0 refers to tails, and an outcome of 1 refers to heads. Suppose 2 coins land on tails and 2 coins land on heads; then your sample mean is 0.50. If 3 coins land on tails and 1 coin lands on heads, then your sample mean is 0.25.

We have a sample size of 4, but we want to find out how the sample mean behaves if we flip all 4 coins 100,000 times. That is, we want to take 100,000 random samples of size 4 and create a sampling distribution.

The distribution forms a triangular shape. The bars show the frequency of each possible sample mean from flipping your 4 coins: 0, 0.25, 0.50, 0.75, or 1.

The mean is also close to 0.50, but not exactly. The standard deviation of this sampling distribution is smaller than the standard deviation of the distribution of sampling means when n = 1.
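The n = 4 version looks nearly identical as a Python/NumPy sketch (again an illustrative stand-in for the JMP demo, not the course's own code):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Flip 4 fair coins, record the proportion of heads, repeat 100,000 times;
# each sample mean is one of 0, 0.25, 0.50, 0.75, or 1
means_n4 = rng.binomial(n=4, p=0.5, size=100_000) / 4

print(means_n4.mean())  # close to 0.50
print(means_n4.std())   # about 0.25, half the n = 1 value of 0.5
```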

#### n = 100

We could already see the shape of the distribution change just by increasing the sample size from 1 to 4 in the previous example. With larger sample sizes, the distribution of sample means starts to form a bell-like shape. This is what's called a normal distribution.

Because our n = 100 is a large sample size, we will also see a further decrease in the value of the standard deviation.

For each of the three distributions we've looked at (n = 1, 4, and 100), we notice this pattern: the mean gets closer to its true value and the standard deviation decreases. This is the essence of a sampling distribution. Because each point is an average of n observations, extreme values get averaged out, so the sample means cluster more tightly around the population mean.
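One way to see the whole pattern at once is to loop over the three sample sizes. This hypothetical NumPy sketch compares each simulated standard deviation to the theoretical value of 0.5 / √n (0.5 being the SD of a single fair coin flip):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# The SD of the sampling distribution shrinks like 0.5 / sqrt(n)
for n in (1, 4, 100):
    sample_means = rng.binomial(n=n, p=0.5, size=100_000) / n
    print(n, round(sample_means.std(), 3), round(0.5 / np.sqrt(n), 3))
```

The two printed values agree closely at every n, which is the pattern described above.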

## C - Sampling Distribution of Means

### 1 - Definition

The sampling distribution of means is a distribution composed of the mean scores of all the possible random samples of a particular size (n) from a given population. It is found by repeatedly taking samples from a population, recording the mean of each sample, and plotting these means in a distribution. Because it is based upon scores from the original population (it is just the averages of different subsets of these scores), the mean of a sampling distribution of means is always the same as that of the population. However, because it is a distribution of means as opposed to individual observations, extreme scores in either direction are unlikely, and the sampling distribution of means will always have a smaller standard deviation than the parent population.

### 2 - Importance

Although sampling distributions are theoretical and we would not actually collect data in this way, we can use computer simulations to create sampling distributions that provide us with important information. By looking at the distribution of sampling means, we can determine how much variability from the population mean to expect in any given sample. In a research setting, sampling distributions provide guidelines for determining whether an observed difference between our sample mean and the population mean is likely due to chance, or likely due to an actual experimental effect. A sampling distribution of means allows us to calculate the precise likelihood of any given result being due to chance by looking at the proportion of means that fall under the distributional curve above or below any given sample mean.

In JMP, the concept of using the proportion of a normal distribution to determine the probability of observing any given score can be explored using the Distribution and Probability Calculator *(Add-Ins>Teaching Modules>Distribution Calculator)*.
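The same proportion-under-the-curve calculation can be sketched with Python's standard library. The IQ-style numbers below are hypothetical, chosen only to make the arithmetic clean:

```python
from statistics import NormalDist

# Hypothetical population: mean 100, SD 15; samples of size 25
mu, sigma, n = 100, 15, 25
se = sigma / n ** 0.5            # standard error = 15 / 5 = 3

# Probability of observing a sample mean of 106 or higher by chance alone:
# the proportion of the sampling distribution at or above 106
z = (106 - mu) / se              # z = 2.0
p = 1 - NormalDist().cdf(z)

print(round(p, 4))  # about 0.0228
```

A result this far above the population mean would occur by chance only about 2% of the time, which is the kind of likelihood statement the section above describes.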

## D - Working With Sampling Distributions

### 1 - Constructing a Real Sampling Distribution

1) Choose a population

2) Enumerate every possible sample of size n

3) Calculate the mean of each sample of size n

4) Plot a histogram of the means calculated
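For a tiny population, these four steps can be carried out exactly. A hedged Python sketch, assuming sampling with replacement (the population values are invented for illustration):

```python
import itertools
import numpy as np

# Tiny hypothetical population so every sample of size n = 2 can be enumerated
population = [2, 4, 6, 8]          # population mean = 5.0

# Step 2: enumerate every possible sample of size 2 (with replacement)
samples = list(itertools.product(population, repeat=2))

# Step 3: calculate the mean of each sample
sample_means = [sum(s) / len(s) for s in samples]

# Step 4: tally the means (the histogram's bar heights)
values, counts = np.unique(sample_means, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))

# The mean of the sampling distribution equals the population mean
print(np.mean(sample_means))  # 5.0
```

Even in this toy case the tallies form the familiar peaked shape, and the mean of the 16 sample means is exactly the population mean.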

### 2 - Approximating a Sampling Distribution

It would take an impossibly large number of samples to create a real sampling distribution of means, so instead we approximate it. To recreate the demo, where we take 100,000 samples of size n, follow these steps:

1) Choose a population

2) Take a random sample of size n

3) Calculate the mean of that particular sample

4) Record the mean to a table

5) Go back to #2 and repeat 100,000 times

Luckily for us, we have JMP to help us repeat the samples 100k times. Phew! *(Add-Ins>Teaching Modules>Distribution of Sample Means)*
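The five steps above translate directly into a short simulation loop. A Python/NumPy sketch, using a fair die as a hypothetical population:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Step 1 - hypothetical population: the faces of a fair die (mean 3.5)
population = np.arange(1, 7)
n = 10

# Steps 2-5: draw a random sample of size n, record its mean, repeat 100,000x
recorded_means = np.empty(100_000)
for i in range(100_000):
    sample = rng.choice(population, size=n)   # step 2: random sample
    recorded_means[i] = sample.mean()         # steps 3-4: compute and record

print(recorded_means.mean())  # close to the population mean of 3.5
```

A histogram of `recorded_means` is the approximate sampling distribution; its mean sits right on the population mean.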

### 3 - Characteristics

The sampling distribution **always** has a mean equal to the population mean.

As the sample size n increases, the standard deviation of the sampling distribution decreases in proportion to the square root of the sample size. This occurs because in large samples, any outliers are averaged out by the many other individuals within the sample, and so the mean is not dramatically changed as it might be by outliers within a very small sample.

Another term for the Standard Deviation of the Sampling Distribution of Sample Means is the **Standard Error**. By looking at how spread out a sampling distribution is, we can predict how much error we are likely to encounter. A sampling distribution that varies a lot, and therefore has a large Standard Error, tells us that our sample is less likely to have a mean that is close to the population mean. In contrast, a sampling distribution with a small Standard Error tells us that our sample means do not vary a lot from the population mean, and so we can expect a single sample to contain a smaller amount of error.
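The σ/√n relationship behind the Standard Error can be verified by simulation. This NumPy sketch uses invented numbers (mean 50, SD 12, n = 36) so the theoretical value comes out to exactly 2:

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Hypothetical population: normal with mean 50 and SD 12; samples of size 36
mu, sigma, n = 50, 12.0, 36

# Theoretical standard error: sigma / sqrt(n) = 12 / 6 = 2
theoretical_se = sigma / np.sqrt(n)

# Empirical check: the SD of 100,000 simulated sample means
sample_means = rng.normal(loc=mu, scale=sigma, size=(100_000, n)).mean(axis=1)
print(theoretical_se)        # 2.0
print(sample_means.std())    # close to 2.0
```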

A sampling distribution is normal under the following conditions:

1) The parent population is normally distributed. Even with a sample size of 1, the sampling distribution will be normally distributed if the population is normally distributed.

2) The parent population is non-normal, *but* we have a sample size larger than ~30. Even if the population is skewed or uniform, with a large enough sample size, the sampling distribution will be normal.

Even non-normal populations result in a normal sampling distribution when the sample size is large enough:
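This can also be checked numerically rather than visually. A NumPy sketch using a strongly right-skewed (exponential) parent population; the `skewness` helper is an illustrative third-moment calculation, not a JMP feature:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

def skewness(x):
    # Sample skewness: third standardized moment (0 for a normal distribution)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Heavily right-skewed parent population (exponential, skewness near 2)
population = rng.exponential(scale=1.0, size=1_000_000)

# Sampling distribution of means for samples of size n = 50
sample_means = rng.exponential(scale=1.0, size=(100_000, 50)).mean(axis=1)

print(skewness(population))    # far from 0: clearly non-normal
print(skewness(sample_means))  # much nearer 0: nearly bell-shaped
```

The parent population is severely skewed, yet the distribution of sample means is already close to symmetric at n = 50.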

## E - The Central Limit Theorem

### 1 - Definition

The Central Limit Theorem was proposed by Abraham de Moivre in the early 18th century. The theorem states that the distribution of the sum of a large number of independent random variables approaches a normal distribution.

### 2 - Importance

The Central Limit Theorem shows that even if the raw values for the population do not form a normal distribution, once you create a sampling distribution with large enough samples, the sampling distribution itself will begin to take the shape of a normal distribution.

This phenomenon is of great importance to statistics because it shows that for every type of population there will, eventually, be a normal sampling distribution that would allow one to make predictions as to the probability of any of its individual component events occurring simply due to chance.

- __Normally distributed populations__ will *always* result in normal sampling distributions
- __Non-normally distributed populations__ will result in normal sampling distributions *when the sample sizes are relatively large*

Once we can determine the probability of an event occurring simply due to chance, we can compare it to results reached through experiments to see if our manipulations were likely the cause of any differences.

### 3 - Notable People

- __Abraham de Moivre__ - Wrote *The Doctrine of Chances* (1718), which mostly concerned gambling, but featured an early description of the normal distribution
- __Pierre-Simon Laplace__ - Wrote *Théorie Analytique des Probabilités* (1820), or The Analytical Theories of Probability, which formally connected the Central Limit Theorem to normal distributions
- __Aleksandr Lyapunov__ - Provided mathematical proof for the Central Limit Theorem
- __Sir Francis Galton__ - Wrote *Natural Inheritance* (1889), which includes the following quote:

> I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

### 4 - Video

To watch an amazing auto-tuned video about Galton's quote on the Central Limit Theorem, featuring kinetic typography, click here.

## F - New JMP Tools

### 1 - Random Squares

To open the Random Squares simulator, open the 1.6 JMP Journal and click the arrow by "Seeing Sampling Variability", which will open the simulator in an internet browser window. Each square represents a randomly drawn value. As you click the gray square below "Expected Value", the squares redraw random values. This is most noticeable when the values change; however, a square may redraw the same value, in which case the change is not visible.

- __+/- Rows__ - adds/removes entire rows
- __+/- Columns__ - adds/removes entire columns
- __+/- Size__ - increases/decreases the size of each square box

- __Binomial__ - randomly draws a value of either 0 or 1
- __Normal 0.5__ - randomly draws a value from a normal distribution centered at 0.5
- __Normal 100__ - randomly draws a value from a normal distribution centered at 100

- __Total__ = sum of all values
- __Mean__ = mean of all values
- __Expected Value__ = average expected value

- The small text boxes at the bottom correspond to the squares. The text can be edited, resulting in the ability to customize the expected value for each square.

### 2 - Random Binomials

#### Sample Size n=1

Similar to the Random Squares page created by Julian, JMP has a way of drawing random values directly in a data table.

To do so, open a JMP data table and create a column. Right click the column's title and click Formula. Click Random, then Random Binomial.

- __n__ - refers to the number of random draws you want to perform
- __p__ - refers to the probability of the event (drawing a 1) occurring

Double click each section to input the desired values, then click Apply or OK.

To generate the output, double click any number of empty rows on the right, or click Rows (in the JMP window), then Add Rows, and input the desired number of rows.

#### Sample Size n>1

When you want a sample size larger than 1, simply update the Formula for the column and input the desired sample size as the n-value. Then, highlight the entire Random Binomial box, click the division symbol, and double click the box in the denominator to input the same n-value. This will allow JMP to display the average value for the entire sample.

To generate the output, double click any number of empty rows on the right, or click Rows (in the JMP window), then Add Rows, and input the desired number of rows.

### 3 - Random Normals

Similar to Random Binomials, JMP can also draw random Z-scores and allow you to customize the distribution they represent.

#### Z-Scores

To draw random z-scores, open a JMP data table and create a column. Right click the column's title and click Formula. Click Random, then Random Normal.

- If you click Apply or OK at this point, you will now have a single z-score in each row.

To generate the output, double click any number of empty rows on the right, or click Rows (in the JMP window), then Add Rows, and input the desired number of rows.

#### Customize

To customize your z-scores in a way that allows you to obtain their equivalent values on a different scale, highlight the entire formula and click the multiplication symbol, then double click to input the value of the standard deviation on your desired scale. Highlight the new entire formula, then click the addition button and add the value of the mean on your desired scale.

In essence, this is reversing the formula used to obtain a z-score in the first place, by instead finding the z-score first, then fitting it to the desired distribution mean and standard deviation.
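The multiply-then-add recipe above is easy to verify in a Python/NumPy sketch (the IQ-style scale of mean 100, SD 15 is a hypothetical example):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Random z-scores: a standard normal distribution (mean 0, SD 1)
z = rng.standard_normal(100_000)

# Rescale to the desired distribution: multiply by the SD, then add the mean
scores = z * 15 + 100

print(scores.mean(), scores.std())  # close to 100 and 15
```

This is the reverse of z = (x - mean) / SD: solving that formula for x gives x = z * SD + mean, which is exactly the transformation applied here.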