Module 1:3 - Quantifying Distributions
- 1 Central Tendency
- 2 Variability
- 3 Deciding on an estimator for Variance
- 4 Sample Standard Deviation
- 5 Simulating Variance Estimates
- 2 methods of Quantifying Distributions: Measuring the Center and Measuring the Variance
- Central Tendency: a number that represents the middle of a distribution (ex: mean, median, mode)
- Mean: the sum of a set of scores divided by the number of scores in the set (the "average")
- appropriate for interval or ratio scale data, but not for ordinal or nominal data
- Different symbols used: μ (mu) for the population mean and X̄ (X bar) for the sample mean
Calculating the Mean
- Population Formula: μ = Sum of the Xi / N, the size of the population
- Xi indicates each individual score/observation
- Sample Formula: X̄ = Sum of the Xi / n, the size of the sample
- We are still taking the sum of the scores divided by the number of scores we have. The only difference is the notation that distinguishes the sample formula from the population formula.
- The different notation helps us keep track of whether we are calculating a population parameter or a sample statistic. This is important, as it reminds us whether we are using a sample to make an inference about a population or using the population itself.
- Sample statistics are subject to error, whereas a population parameter is the true value for the population.
- Set size difference: "N" means a population size; "n" means a sample size. The lowercase letter reminds us that a sample is a subset of the population and is usually much smaller.
Visualizing the Mean
- Using the balance beam example provided in the video, we look for the mean of the scores 2, 2, 6, and 10
- The mean represents the center of mass. On the balance beam it is the point where the beam balances (here, 5), so it is also known as the center of balance
- When we take the signed distances (deviations) from these points to the mean and sum them up, the sum will always be equal to 0.
- The deviations on the two sides of the center have to sum to 0 because the mean is the arithmetic center of these points. Moving any of the scores/points will move the mean, as the mean is sensitive to changes in every score. That is to say, the mean uses all pieces of information in the data set to find where the center of mass is.
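The balance-beam idea can be checked with a few lines of Python. This is only a small sketch using the four scores from the video example; the variable names and the "moved" data set (10 changed to 14) are made up for illustration.

```python
# Scores from the balance beam example
scores = [2, 2, 6, 10]

mean = sum(scores) / len(scores)          # 20 / 4 = 5.0
deviations = [x - mean for x in scores]   # signed distances to the mean

print(mean)             # 5.0
print(deviations)       # [-3.0, -3.0, 1.0, 5.0]
print(sum(deviations))  # 0.0 -> the beam balances at the mean

# Hypothetical change: move the score 10 out to 14 and the mean shifts too
moved = [2, 2, 6, 14]
moved_mean = sum(moved) / len(moved)      # 24 / 4 = 6.0
print(moved_mean)       # 6.0
```

The negative deviations (-3 and -3) exactly cancel the positive ones (1 and 5), which is why the sum is 0 for any data set.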
Variability
- Variability is the extent to which scores are spread out in a distribution (wider or narrower)
- It is important because it accounts for the differences between scores
- Variability is used to assess how well the mean and individual scores represent the distribution
- If the variability is low, the scores cluster near the mean, so the mean is a more reliable and consistent summary of them
- Standard deviation is the typical amount by which scores differ from the mean
- It is roughly the average distance of the scores from the center
- Sigma (σ) is the notation for the population standard deviation
- Little s is the notation for the sample standard deviation
Deciding on an estimator for Variance
- We want the sample statistic we use to predict the population value to have the least error possible
- Deviations from the mean sum to 0, so to get rid of the negative signs we take the absolute value of each deviation or square it
- Sampling error: The discrepancy (or amount of error) between a sample statistic and the corresponding population parameter
Properties of Estimators
- Statistical estimators: mathematical procedures we do on samples in the service of knowing something about a population (ex: sample mean)
- Consistency: the estimate gets closer to the population value when we have more data
- Relative Efficiency: "Does it err less than other estimators?"
- Sufficiency: whether or not it uses all the data (ex: The mean is sufficient because it uses all values, but the median is not because it only uses the central value)
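Two of these properties can be illustrated with a short simulation. This is a rough sketch, not part of the course material: it assumes a normal population with mean 100 and standard deviation 15, and the helper name `mean_abs_error` and the changed data set are invented for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed population for illustration: normal, mean 100, standard deviation 15
MU, SIGMA = 100, 15

def mean_abs_error(n, reps=500):
    """Average distance between the sample mean and MU over `reps` samples of size n."""
    errors = []
    for _ in range(reps):
        sample = [random.gauss(MU, SIGMA) for _ in range(n)]
        errors.append(abs(statistics.fmean(sample) - MU))
    return statistics.fmean(errors)

# Consistency: the sample mean errs less when each sample holds more data
error_small = mean_abs_error(5)
error_large = mean_abs_error(500)
print(error_small, error_large)

# Sufficiency: changing an extreme score moves the mean but not the median
original = [2, 2, 6, 10]
changed = [2, 2, 6, 100]
print(statistics.fmean(original), statistics.fmean(changed))    # 5.0 27.5
print(statistics.median(original), statistics.median(changed))  # 4.0 4.0
```

The median ignores how far out the largest score sits, which is the sense in which it does not use all the information in the data.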
The Squared deviation
- Often referred to as squares
- Squares: an individual's deviation from the mean, squared
- The Sum of Squared Deviations (aka Sum of Squares or SS): the squares of all individuals added together
- Mean Square Deviation = SS/N
This is how much spread there is, and it is also known as the Population Variance.
- It is the square of the standard deviation
Calculating the Standard deviation
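As a hedged worked example, the steps from squares to standard deviation can be sketched as follows. It treats the four balance-beam scores (2, 2, 6, 10) as if they were an entire population, which is an assumption made only for illustration.

```python
import math
import statistics

# Assumption: these four scores are the whole population
scores = [2, 2, 6, 10]
N = len(scores)

mu = sum(scores) / N                       # population mean: 5.0
squares = [(x - mu) ** 2 for x in scores]  # squared deviations
ss = sum(squares)                          # sum of squares (SS)
variance = ss / N                          # mean square deviation = population variance
sigma = math.sqrt(variance)                # population standard deviation

print(squares)   # [9.0, 9.0, 1.0, 25.0]
print(ss)        # 44.0
print(variance)  # 11.0
print(sigma)     # about 3.317

# Cross-check against the standard library's population formulas
assert math.isclose(variance, statistics.pvariance(scores))
assert math.isclose(sigma, statistics.pstdev(scores))
```

Taking the square root at the end puts the measure of spread back into the original units of the scores.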
Sample Standard Deviation
- Sample standard deviation is an estimate of the population standard deviation
- Estimators are judged on the properties of consistency, efficiency, sufficiency, and bias
- Bias is over- or underestimating the true value on average
- It needs to be corrected so that the value is, on average, neither too low nor too high
Calculating standard deviation
- To calculate the sum of squares we use X bar instead of mu because we only have a sample
- To calculate s we divide the sum of squares by the sample size minus one (n-1) before taking the square root
- Decreasing the denominator inflates the estimate of s
- The sum of squares taken around X bar is too small and underestimates the deviation, so inflating s corrects it
- Video example: The sum of squares is smaller because it is based on the sample mean, which minimizes the squared differences. When the standard deviation is calculated, it therefore needs to be inflated so that it becomes a better estimate of the population standard deviation
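A small sketch of the calculation, assuming the scores 2, 2, 6, 10 are now a sample rather than a population. The population mean of 6 used for comparison is hypothetical; it is there only to show that the sum of squares around X bar is smaller than the sum of squares around any other value.

```python
import math
import statistics

sample = [2, 2, 6, 10]
n = len(sample)

x_bar = sum(sample) / n                        # sample mean: 5.0
ss = sum((x - x_bar) ** 2 for x in sample)     # SS around x_bar: 44.0

# Hypothetical population mean, for comparison only
mu = 6
ss_about_mu = sum((x - mu) ** 2 for x in sample)  # 16 + 16 + 0 + 16 = 48

s_n = math.sqrt(ss / n)         # population formula applied to a sample (too small on average)
s = math.sqrt(ss / (n - 1))     # sample formula with the n-1 correction

print(ss, ss_about_mu)  # 44.0 48
print(s_n, s)           # about 3.317 and about 3.830

# The standard library's stdev() also divides by n-1
assert math.isclose(s, statistics.stdev(sample))
```

Dividing the same SS by 3 instead of 4 gives the larger value of s, which is the inflation described above.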
Simulating Variance Estimates
We will be using JMP to simulate the process of taking repeated samples to visualize the kind of estimates we get.
Variance Simulation
From the JMP Journal of this module, click on the Variance Simulation link to bring up the dialog for the program.
The dialog prompts us to specify the population parameters (its mean and its standard deviation), as well as the sample size and the number of samples to draw.
Using the example shown in the video (see attached image), JMP will run 10,000 repetitions, each time taking a sample of size 5 from a population that is centered at 100 with a standard deviation of 15.
JMP will then produce a table of the Variance and Standard Deviation computed with "n" and with "n-1".
Some of the sample estimates are really small, at almost 0; that is, the differences between the 5 individuals of the sample are very small because they are close to each other. Notice that when we are using "n" (that is, n of 5), we got an average estimate of slightly less than 200, which is an underestimation of our population variance of 225. With "n-1" (which gives us 4 in the denominator), we are able to inflate our estimate enough to be correct on average. That is, we inflate each of our estimates a little, and the bias of the statistic is corrected such that the average of our estimates will equal the population value. "On average" means across all the samples we take: any single sample may over- or underestimate the population value, but across many samples the estimates average out to the correct value.
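For readers without JMP, the same kind of experiment can be approximated in Python. This is a sketch under stated assumptions, not the JMP script itself: it assumes the population is normal, uses the settings from the video (mean 100, standard deviation 15, sample size 5, 10,000 samples), and the variable names are invented.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA = 100, 15        # population mean and standard deviation
N_SAMPLES, n = 10_000, 5   # number of samples and size of each sample

var_n, var_n_minus_1 = [], []
for _ in range(N_SAMPLES):
    # Assumption: the population is normally distributed
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    var_n.append(ss / n)                # population formula
    var_n_minus_1.append(ss / (n - 1))  # sample formula

mean_var_n = statistics.fmean(var_n)
mean_var_n_minus_1 = statistics.fmean(var_n_minus_1)

print("population variance:", SIGMA ** 2)        # 225
print("average estimate with n:  ", mean_var_n)  # noticeably below 225
print("average estimate with n-1:", mean_var_n_minus_1)  # close to 225
print("smallest single estimate: ", min(var_n_minus_1))  # can be far below 225
```

Individual estimates vary widely from sample to sample; only the average across all 10,000 samples lands near the population variance when dividing by n-1.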
Comparing Variance with Different Sample Size
When we keep everything else the same (that is, the same population mean and standard deviation and the same number of samples to draw) and change the sample size to 60, the difference in variance between using n=60 and n-1=59 is much smaller than the difference between using n=5 and n-1=4.
- As we use larger sample sizes, the bias in our statistic is less pronounced.
The mean is a consistent statistic; that is, with a larger sample size, the mean we get should be closer to the population mean.
- When we get a sample mean that is different from the population mean, deviations taken from the sample mean are smaller than deviations taken from the population mean, so we will underestimate the sum of squares.
- To the degree that we are now estimating the mean more precisely, the amount of correction we need to apply is smaller.
===With sample size of 60===
- Though n-1 is still more accurate in estimating our population value, the difference between using n and n-1 is not as great with a larger sample size.
- Our underestimation with the larger sample size is much smaller.
===With sample size of 2===
- The difference between using n and n-1 is considerable
- The population formula (dividing by n) underestimates the population variance by about half
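The pattern across sample sizes follows from a standard result: for independent draws, the average value of SS/n is (n-1)/n times the population variance. The sketch below just evaluates that expression for the sample sizes used in these notes, with the population variance of 225 from the simulation; the dictionary name is arbitrary.

```python
SIGMA_SQ = 15 ** 2  # population variance from the simulation settings: 225

# Expected (long-run average) value of the "divide by n" variance estimate
expected_with_n = {}
for n in (2, 5, 60, 100):
    expected_with_n[n] = SIGMA_SQ * (n - 1) / n
    shortfall = SIGMA_SQ - expected_with_n[n]
    print(f"n={n:>3}: expected SS/n = {expected_with_n[n]:.2f}, "
          f"underestimates 225 by {shortfall:.2f}")

# n=  2: expected SS/n = 112.50, underestimates 225 by 112.50
# n=  5: expected SS/n = 180.00, underestimates 225 by 45.00
# n= 60: expected SS/n = 221.25, underestimates 225 by 3.75
# n=100: expected SS/n = 222.75, underestimates 225 by 2.25
```

With n = 2 the expected estimate is exactly half of the population variance, while with n = 60 or more the shortfall is small relative to 225.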
Standard Deviation of Sample
We will keep everything else the same and use a sample size of 100 instead.
- The difference between the estimates using "n" and "n-1" is much smaller
Run the Compare Standard Deviation script to look at the distribution of the sample Standard Deviation using "n" and "n-1".
- Unlike the variance estimate, the sample Standard Deviation is not an unbiased estimator, because expectation does not commute with powers: the expected value of a square root is not the square root of the expected value.
- We expect the variance estimate (using n-1) to be correct on average, but the same cannot be said for the standard deviation.
- This is seen when we look at the standard deviation estimate for n-1: the estimates are still systematically off. Therefore, the standard deviation of the sample is not an unbiased estimator of the population standard deviation.
- In a large sample the underestimation is trivial, but there will always be some underestimation of the population standard deviation: though the sample variance is an unbiased estimator of the population variance, its square root is not an unbiased estimator of the population standard deviation.
- That is also to say that s² is an unbiased estimator of σ², and X̄ is an unbiased estimator of μ.
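The remaining bias in the standard deviation can be seen in a simulation along the same lines as the variance one. Again this is only a sketch: it assumes a normal population with mean 100 and standard deviation 15, a sample size of 5, and 10,000 samples, and it is not the JMP script.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

MU, SIGMA = 100, 15
N_SAMPLES, n = 10_000, 5

variances, sds = [], []
for _ in range(N_SAMPLES):
    # Assumption: the population is normally distributed
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    s_squared = ss / (n - 1)           # sample variance with the n-1 correction
    variances.append(s_squared)
    sds.append(math.sqrt(s_squared))   # sample standard deviation

mean_s_squared = statistics.fmean(variances)
mean_s = statistics.fmean(sds)

print("average s^2:", mean_s_squared, "vs population variance", SIGMA ** 2)
print("average s:  ", mean_s, "vs population standard deviation", SIGMA)
```

The average of s² comes out close to 225, but the average of s stays below 15, because taking the square root of each estimate before averaging pulls the average down.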