Module 1:8 - Statistical Inference II


Hypothesis Testing Decisions[edit]

Nature of Errors in Statistical Decision Theory[edit]

All statistical inference involves some degree of sampling error; a perfectly representative sample is practically impossible. As a result, the conclusion we reach from sample data will sometimes fail to match what is actually true in the world.

A researcher attempts to distinguish between two possible truths, which are mutually exclusive and exhaustive: exactly one of them holds, and the researcher must conclude in favor of one. However, the researcher will not always make the right decision based on the sample data; several factors affect the likelihood of reaching the correct conclusion, so incorrect decisions are possible.

Two Possible Truths

1. There is no effect. Therefore, H0 (the null hypothesis) is true.

2. There is an effect. Therefore, H1 (the alternative hypothesis) is true.


[Figure: table of the four hypothesis-testing outcomes by true state of the world]

There are four possible decisions, falling into two categories: correct and incorrect. Which decisions, and which errors, are possible depends on the true state of the world.

Correct Decisions[edit]


1. Specificity

Specificity occurs when one fails to reject the null hypothesis (H0) and there is indeed no effect in the real world. Specificity has a probability of 1 - alpha: it is the correct outcome of a sample value landing outside the critical region determined by the alpha level. Because specificity depends only on alpha, which the researcher controls, researchers are comfortable reaching this conclusion.

2. Statistical Power

Statistical power occurs when one rejects the null hypothesis (H0) and there really is an effect in the world. This case has a probability of 1 - beta; beta is usually unknown and depends, among other things, on the value of alpha.

Incorrect Decisions (Types of Errors)[edit]

1. Type I Error

If there is no effect in the real world and one nevertheless rejects the null hypothesis, a Type I error occurs. This error is also called a "false alarm" or "alpha error." We reject the null hypothesis only when a sample value lands within the critical region; in a Type I error the value actually belongs to the null distribution rather than to the alternative. Since the critical region is determined by alpha, the probability of this error always depends on alpha, which is why a Type I error is also known as an "alpha error." Because researchers set alpha themselves, the Type I error rate is always under our control, so we can choose a false alarm rate we are comfortable with. Therefore:

If the null hypothesis is true, the probability of a Type I error equals the alpha level, since that is the proportion of sample values that fall within the critical region.

If the alternative hypothesis is true, the probability of a Type I error is 0: when a real effect exists, there is no such thing as a false alarm.
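The claim that the false alarm rate equals alpha can be checked by simulation. The sketch below (all parameter values are illustrative assumptions, not from the text) draws many samples from a world where H0 is true and counts how often a two-tailed z-test at alpha = 0.05 falsely rejects:

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

# Hypothetical population and test parameters (illustrative assumptions).
MU0, SIGMA, N = 100, 15, 25
ALPHA = 0.05
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # about +/- 1.96 for alpha = 0.05

def one_experiment():
    """Run one study in a world where H0 is true; report a false alarm."""
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (statistics.fmean(sample) - MU0) / (SIGMA / N ** 0.5)
    return abs(z) > z_crit                     # True means a Type I error

n_trials = 10_000
false_alarm_rate = sum(one_experiment() for _ in range(n_trials)) / n_trials
# false_alarm_rate comes out close to ALPHA, as the text predicts
```

Whatever population parameters are used, the long-run false alarm rate stays near alpha, because the critical region is defined to contain exactly that proportion of the null distribution.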

2. Type II Error

If an effect exists in the real world and one fails to reject the null hypothesis, a Type II error occurs. This error is also called a "miss" or "beta error"; it is a mismatch between what is true in the world and what we claim from the sample data. Like all errors, its probability depends on the state of the world. If the null hypothesis is true, there is zero chance of a Type II error, since there is no effect to miss. If the alternative hypothesis is true, the probability of a Type II error is the beta level. Because this error depends on beta, which is generally unknown, we are uncomfortable drawing conclusions from a failure to reject the null hypothesis.
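Misses can be simulated the same way as false alarms, except that the samples are now drawn from a world where H1 is true. The effect size, population sd, and sample size below are illustrative assumptions:

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)

# Hypothetical true effect: treatment shifts the mean from 100 to 110.
MU0, MU1, SIGMA, N = 100, 110, 15, 25
ALPHA = 0.05
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

def one_experiment():
    """Run one study in a world where H1 is true; report a miss."""
    sample = [random.gauss(MU1, SIGMA) for _ in range(N)]   # drawn from H1
    z = (statistics.fmean(sample) - MU0) / (SIGMA / N ** 0.5)
    return abs(z) <= z_crit                # True means we failed to reject

n_trials = 10_000
miss_rate = sum(one_experiment() for _ in range(n_trials)) / n_trials
# miss_rate estimates beta; 1 - miss_rate estimates the statistical power
```

Note that the miss rate depends on the assumed true effect, which a real researcher does not know; this is exactly why beta is "generally unknown" in practice.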

Never Accept the Null Hypothesis[edit]

"We NEVER accept or prove the Null Hypothesis on the basis of sample data."

Even when we fail to reject the null hypothesis, that is not a strong claim. Sample data merely allow us to infer, by induction, what might be true of the population.

A useful example: suppose we observe a sample of 20 people, of whom 12 have black hair and 8 have brown hair. We cannot conclude from this that people only ever have black or brown hair; a sample of 20 cannot support such a claim. We can, however, state that, based on our sample, there are people in the world with black hair and people with brown hair.

This example relates directly to experiments. If a treatment shows no effect on any of 20 people in a sample, we can state that we failed to reject the null hypothesis. But to declare the null hypothesis true would be just as wrong as concluding from the previous example that only black-haired and brown-haired people exist. To establish that the null hypothesis is true, one would have to observe every single member of the population.

"A lack of evidence for an effect in a sample is not good evidence for a lack of an effect in the population." OR "Absence of evidence is not evidence of absence" (Altman and Bland 1995).

Visualizing Statistical Power[edit]

[Figure: overlapping H0 and H1 distributions illustrating statistical power]

Statistical Power[edit]

The power of a test is the probability that the test will reject a false null hypothesis. Power is the probability of detecting a real effect.

Demonstration Using the Ginseng Experiment[edit]

First, remember the difference between the Null Hypothesis and the Alternative Hypothesis:

Null Hypothesis (H0) - µtreatment = µwithout treatment

Alternative Hypothesis (H1) - µtreatment ≠ µwithout treatment

Null Hypothesis Distribution[edit]

(1) This would be what a population distribution would be like if the Null Hypothesis was true.

[Figure: population IQ distribution under H0, centered at 100]

Notice that in the case of the Null Hypothesis, the mean IQ of the population is at 100 (the average IQ). This would ultimately represent that there is no effect of the ginseng treatment.

Alternative Hypothesis Distribution[edit]

(2) This is what the population distribution would be like if the Alternative Hypothesis was true (in comparison to H0)

[Figure: population IQ distribution under H1 (mean 120) overlaid on the H0 distribution]

In this example with the ginseng, we are imagining that H1 is true and shifts the population IQ mean to 120. Note that we are talking about populations and this is what we are "observing" after treating an entire population with ginseng.

If you observe the two distributions, you'll notice an overlap between the H0 and H1 distributions. When we take samples from a population in which H1 is true, some sample means from the H1 distribution will still fall below the z-critical value of the H0 distribution. On average, though, our sample means should be around 120 (not exactly, due to sampling error) when H1 is true.

As experimenters in real life, we can only know what the H0 distribution (blue one) looks like. In our examples, we only know what the H1 distribution (red one) looks like because we are essentially playing make believe.

Highlighting the Statistical Power[edit]

Remember that we chose to have an alpha level of 0.05. This set our critical points = +/- 1.96. Therefore, if we have a sample mean with a z-score that is greater than +1.96 or less than -1.96, we can confidently reject the Null Hypothesis.

The figure above can be used as a reference for the following explanation.

In the figure above, the dark red shaded area is the proportion of the make-believe H1 distribution in which we would correctly detect the alternative hypothesis and reject the null hypothesis. This dark red area (where the sample mean's z-score is greater than z-critical) is what we consider 1 - β, the statistical power. The unshaded red area (where the sample mean's z-score is less than z-critical) is β, essentially the "miss" rate. A miss will occasionally happen due to sampling error even when H1 is true.

If there truly is an effect, we could simply look at the H1 (red) distribution. As real scientists, however, we must use the H0 (blue) distribution to decide whether the treatment has a statistically significant effect, since we will not actually know the H1 distribution (if we did, statistics would not be needed).
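The shaded areas above can be computed directly for the ginseng example. H0 puts the mean at 100 and H1 at 120, with alpha = 0.05 as in the text; the population sd (15) and sample size (9) below are illustrative assumptions not given in the text:

```python
from statistics import NormalDist

MU0, MU1 = 100, 120          # means under H0 and H1 (from the example)
SIGMA, N = 15, 9             # assumed population sd and sample size
ALPHA = 0.05

sem = SIGMA / N ** 0.5                       # sd of the sampling distribution
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2) # about 1.96

# The critical cutoffs live on the H0 distribution...
upper_cut = MU0 + z_crit * sem
lower_cut = MU0 - z_crit * sem

# ...but power is the share of the H1 distribution lying past them.
h1 = NormalDist(MU1, sem)
power = (1 - h1.cdf(upper_cut)) + h1.cdf(lower_cut)   # 1 - beta (dark red)
beta = 1 - power                                      # the "miss" (unshaded red)
```

With these assumed numbers the H1 distribution sits well past the upper cutoff, so power is high and beta is small; shrinking the assumed effect or sample size would shift that balance.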

[Figure: H1 distribution shifted 20 points below H0, with power and beta unchanged]

If the ginseng instead decreased the average population IQ by 20 points, the power (1 - β) and the miss rate (β) would be the same; only the direction in which the H1 distribution shifts differs. As long as the effect size stays the same, direction does not affect statistical power.

Factors that Affect Power[edit]

Alpha Level[edit]

Because the researcher sets the alpha level directly, this is the only factor affecting power that can be fixed before the study even begins. Increasing alpha, that is, allowing more sample outcomes to count as evidence for rejecting the null hypothesis, increases power; decreasing alpha decreases power. Note the trade-off between power and the false alarm rate: raising alpha to boost power also raises the false alarm rate (recall that the false alarm rate always equals alpha), and lowering alpha to protect against false alarms also lowers statistical power.

[Figures: power with a larger vs. smaller alpha level]

Effect Size[edit]

The size of the effect our treatment has on the population is expressed as the sample mean minus the population mean; you may recall that this is the numerator of our z-score equation. It estimates the mean under the alternative hypothesis minus the mean under the null hypothesis, that is, how far apart the two distributions sit. Anything that increases this difference increases power, because a larger numerator yields a larger test statistic and thus a greater likelihood of rejecting the null hypothesis (the correct decision whenever a real effect exists).

Put another way: increasing effect size decreases the beta region. Beta is the portion of the alternative distribution that falls short of the null distribution's critical value, i.e., inside the non-rejection region. If we move the alternative distribution away from the null distribution, less of its tail overlaps that region, and beta shrinks. Since power equals one minus beta, anything that decreases beta increases power.
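This relationship can be shown numerically by sliding an assumed H1 mean away from H0 while holding everything else fixed. All parameter values here are illustrative assumptions:

```python
from statistics import NormalDist

def power_for_shift(shift, sigma=15.0, n=25, alpha=0.05):
    """Two-tailed power when the true mean sits `shift` units above H0 (at 0)."""
    sem = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    h1 = NormalDist(shift, sem)          # H1 distribution, H0 centered at 0
    return (1 - h1.cdf(z_crit * sem)) + h1.cdf(-z_crit * sem)

# Power for progressively larger effects, alpha and n held constant.
powers = [power_for_shift(s) for s in (2, 5, 10, 20)]
# The list is strictly increasing: a bigger shift means less overlap,
# a smaller beta region, and therefore more power.
```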

[Figures: power with a smaller vs. larger effect size]


Variability[edit]

Decreasing variability increases power. Highly variable populations have a large amount of 'spread': their distributions are wide, and wide distributions are more likely to overlap. Recall that beta is the portion of the alternative distribution that falls on the null side of the critical value, and that decreasing beta increases power. If we reduce the variability of the populations, their distributions pull in tighter around their means, decreasing the spread and reducing the overlap between the two. Less overlap means less noise, making it easier to see the difference between the null and alternative hypotheses.

[Figures: power with higher vs. lower variability]

Sample Size[edit]

One way to decrease variability in the sampling distribution is to increase sample size. Increasing the size of the samples we draw from a population decreases the variability of that population's sampling distribution, since the standard error is σ/√n. Decreasing variability in this way has the same effect as reducing variability in the population as a whole (see Variability above).
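The effect of n works through the σ/√n term: more data tightens the sampling distribution and pulls the H0 and H1 curves apart in standard-error units. The effect size and sd in this sketch are illustrative assumptions:

```python
from statistics import NormalDist

def power_at_n(n, mu0=100.0, mu1=106.0, sigma=15.0, alpha=0.05):
    """Two-tailed z-test power for a fixed assumed effect at sample size n."""
    sem = sigma / n ** 0.5               # standard error shrinks as n grows
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    h1 = NormalDist(mu1, sem)
    return (1 - h1.cdf(mu0 + z_crit * sem)) + h1.cdf(mu0 - z_crit * sem)

power_small = power_at_n(10)    # few observations: distributions overlap a lot
power_large = power_at_n(100)   # same effect, more data: far less overlap
# power_large greatly exceeds power_small even though the effect is unchanged
```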

Directional Hypotheses[edit]

Using a directional hypothesis allows the researcher to increase statistical power without changing the false alarm rate. The key difference between a directional hypothesis (one-tailed test) and a non-directional hypothesis (two-tailed test) is that a directional hypothesis predicts the direction in which the effect will shift the population mean. In doing so, we allocate all of alpha to one tail, enlarging the critical region in that tail. This reduces beta and, as discussed above, anything that reduces beta increases power.
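At alpha = 0.05 the one-tailed cutoff drops from about 1.96 to about 1.645 standard errors, which is the entire source of the power gain. The sketch below compares the two tests for one assumed effect (all parameter values are illustrative assumptions):

```python
from statistics import NormalDist

# Hypothetical setup: true mean 108 vs. H0 mean of 100 (assumed values).
MU0, MU1, SIGMA, N, ALPHA = 100.0, 108.0, 15.0, 25, 0.05
sem = SIGMA / N ** 0.5
h1 = NormalDist(MU1, sem)

# Two-tailed test: alpha is split across both tails.
z_two = NormalDist().inv_cdf(1 - ALPHA / 2)          # about 1.96
power_two = (1 - h1.cdf(MU0 + z_two * sem)) + h1.cdf(MU0 - z_two * sem)

# One-tailed test predicting an increase: all of alpha in the upper tail.
z_one = NormalDist().inv_cdf(1 - ALPHA)              # about 1.645
power_one = 1 - h1.cdf(MU0 + z_one * sem)
# Same alpha, more power, but only when the effect lies in the
# predicted direction; an effect in the other direction is never detected.
```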

Using a Directional Hypothesis[edit]

Directional hypotheses are properly used only if they are specified before the study begins and there is reason to believe that effects in the opposite direction are impossible or extremely uninteresting. To ensure that the experimental hypotheses remain mutually exclusive and exhaustive, the alternative hypothesis must be changed from the two-sided form (a ≠ b) to a one-sided strict inequality (a > b or a < b), and the null hypothesis to the complementary form (a ≤ b or a ≥ b).

Arguments Against Using Directional Hypotheses[edit]

Even though using a directional hypothesis increases power, there are two very good reasons not to use them.

(1) Adopting a directional hypothesis without having specified it before starting the research changes the probability of a false alarm and thus exposes the scientific community to misleading information.

(2) The potential for abuse makes any conclusion reached using a directional test extremely suspect to fellow experts. Abuse may include the following:

  • Switching to a one-tailed test after the fact so that a nearly statistically significant result falls into the critical region.
  • Switching back to a two-tailed test if the effect turns out to be in the opposite direction than predicted.