The Data Professor's Guide to Crossing the Thresholds in Basic Statistics
Sean G. Carver, Ph.D.

Preface for Instructors

What does it mean to “cross the thresholds” in basic statistics? The education literature describes a notion of a “threshold concept” in the following way:

A threshold concept can be considered as akin to a portal, opening up a new and previously inaccessible way of thinking about something. It represents a transformed way of understanding, or interpreting, or viewing something without which the learner cannot progress. As a consequence of comprehending a threshold concept there may thus be a transformed internal view of subject matter, subject landscape, or even world view. This transformation may be sudden or it may be protracted over a considerable period of time, with the transition to understanding proving troublesome. Such a transformed view or landscape may represent how people ‘think’ in a particular discipline, or how they perceive, apprehend, or experience particular phenomena within that discipline (or more generally).1

Meyer and Land continue: threshold concepts “may be transformative (occasioning a significant shift in the perception of a subject), irreversible (unlikely to be forgotten…), and integrative (exposing the previously hidden interrelatedness of something) or may be troublesome and/or they may lead to troublesome knowledge for a variety of reasons.”2 They argue that threshold concepts are the “jewels in the curriculum”3 and instructors should build their courses around these topics.

The Data Professor’s Guide to Crossing the Thresholds in Basic Statistics, still in preparation, will progress through the following topics that we identify as threshold concepts. We take a recursive approach rather than a linear one, revisiting each topic repeatedly as students put the pieces together. The literature on threshold concepts recommends this approach.4

Planned Chapters:

  1. Preface for Instructors
  2. An Introduction to p-Values
  3. Data Sets, Variables, and Distributions
  4. Regression
  5. Probability Models and Random Variables
  6. Sampling, Sampling Distributions, and the Central Limit Theorem
  7. Tests of Significance in the Context of Sampling
  8. Statistical Power and Type I and II Errors
  9. Confidence Intervals

While we identify many threshold concepts in Basic Statistics, we consider the paradigms of inference the “crown jewels,” which we revisit several times in the list above. Usually, inference comes at the end of a Basic Statistics course; however, we introduce tests of significance in the first chapter, with only basic high-school-level mathematics as a prerequisite.

An Introduction to p-Values

An inquiry5 of potential interest to students and professors alike: can one recover from an all-nighter by sleeping in for the next two days? You may have an opinion, but a statistician would demand that you back up your intuition with hard evidence. In fact, a statistician would demand even more: that you quantify just how much evidence you have.

Researchers investigated the sleep-deprivation-and-recovery question with college-aged volunteers (called subjects), randomly divided into two groups. Members of one group were allowed unrestricted sleep throughout the duration of the study; members of the other were compelled to remain awake all night, but then allowed to enjoy unrestricted sleep on the subsequent two nights. All subjects were tested on a cognitive task, both before and after the experience, and the improvement in their performance was scored. In this experiment, these improvement scores ranged between \(-14.6\) and \(45.6\). Some improvement scores fell below zero because if a subject performed worse on the task after the ordeal (which happened in a few cases for subjects of both groups), their negative improvement was so noted.

The group that enjoyed unrestricted sleep throughout the experiment improved more, on average6, than the group that suffered sleep-deprivation three nights before the second test. How much more? The “Unrestricted” group improved, on average, by a whopping 19.82 points, whereas the “Deprived” group improved, on average, only by a measly 3.9 points. Compelling? Not to a statistician—at least, not yet!

Pictures invariably help. In the graph below, each circular dot represents a single subject—with horizontal separation (and also color separation) between the two groups. The vertical position of each dot indicates the corresponding subject’s improvement, positive or negative. The 11 red dots indicate the “Deprived” subjects’ scores and the 10 blue dots indicate the “Unrestricted” subjects’ scores. We added a small amount of random jitter to each subject’s horizontal position to separate the dots for subjects with nearly equal improvement scores. Indeed, two of the “Deprived” subjects had the same score: \(-10.7\). Beyond individual scores, we display the average for each group with a cross of the corresponding color.

Compelling, now? To a statistician, still not yet. Why? We want to say that sleep deprivation impairs cognitive performance. But first we must consider all the ways this statement could be wrong and still produce a graph like the one shown above.

First of all, let’s suppose the experimenters are evil. Suppose, unlike what the authors described in their published paper, they first administer a survey to the subjects with questions like: “How often do you stay up all night?” And: “Would you like to stay up all night as part of this experiment?” Suppose this survey successfully divides subjects into “Night-Owls” and “Morning-Larks,” according to a measure of their natural capacity to stay up all night. Here comes the evil part: they assign the Morning-Larks, expected to be more impaired by sleep deprivation, to the “Deprived” group, and reward the Night-Owls with membership in the “Unrestricted” group. If all subjects are relatively impaired by sleep deprivation, a conclusion to this effect still stands. But if the Deprived group (the Morning-Larks) is more impaired on average than the Night-Owls would have been, had they been deprived of sleep, then the result will seem stronger than it should, simply because of the choice of grouping.

As bad as evil experimenters sound, I can imagine a worse scenario. Suppose the experimenters first run a battery of tests on the subjects to determine their aptitude for learning the cognitive task that measures their performance. After all, the figure above shows large differences between subjects, even within each group. So suppose the researchers find a creative way to predict whether a subject will improve a lot, or only improve just a little, or even get worse on the task. Now they turn conniving: they assign the “cognitively-impaired” subjects to the “Deprived” group and the “cognitively-precocious” subjects to the “Unrestricted” group. This choice may be even more dishonest than the Night-Owl/Morning-Lark grouping because it may go beyond exaggerating the result—it may create an illusion of a result where perhaps one was not there to begin with.

If these examples seem a bit contrived, they are indeed: the researchers could not have published their study without an assurance that they assigned groups randomly, and the machinations described above would be grossly unethical. Statisticians consider randomized groupings a “best practice” in experimental design, but it is important to understand all of the possible consequences of this choice.

Usually random group assignment is done with software, but an entirely acceptable alternative would be to write each subject’s name on a playing card, then shuffle the deck (in this case, a deck of 21 cards, one for each subject), then finally deal 11 cards into a “Deprived” pile and 10 cards into an “Unrestricted” pile. The sizes of these piles match the (perhaps arbitrary) choices made in the experiment. If the shuffling were done properly—so that each arrangement of cards remained equally likely—this alternative procedure would be just as good as using software to assign groups.
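
For readers who like to see such procedures in code, here is a minimal sketch of a random group assignment in Python. The subject labels are hypothetical placeholders (the study’s subjects were anonymous to us), but the group sizes match the experiment.

```python
import random

# Hypothetical labels standing in for the 21 names written on the cards.
subjects = [f"Subject {i}" for i in range(1, 22)]

random.shuffle(subjects)       # "shuffle the deck": every ordering is equally likely
deprived = subjects[:11]       # deal 11 cards into the "Deprived" pile
unrestricted = subjects[11:]   # deal the remaining 10 into the "Unrestricted" pile

print("Deprived:", deprived)
print("Unrestricted:", unrestricted)
```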

We have already discussed the fact that there are differences between subjects. Even if the experimenters made no effort to categorize the subjects in this way, some will invariably be more Night-Owl, others more Morning-Lark. Some will be more cognitively-precocious, and others more cognitively-impaired.

Here is a key point: if we randomize our selection of groups in the way we should, there is still no guarantee that the 10 most cognitively-precocious don’t all end up in the “Unrestricted” group, simply by luck of the deal. Hang out in a casino long enough and you may see a Royal-Flush at the poker table, even if nobody cheats.

The following somber conclusion would be entirely on the mark: experiments like this one don’t prove anything because no matter how compelling the result seems, the supposed evidence could be a misleading consequence of an unlucky randomized grouping. That insight is the first of three I want you to come away with.

But the second insight follows this somber conclusion: if the sleep pattern really had no effect on performance, we can control just how much bad luck we would need to randomly shuffle subjects into a grouping that compels an erroneous conclusion. The conclusion we care about, in this context, is the assertion that sleep deprivation impairs performance, which would indeed be erroneous if the sleep pattern really had no effect on performance. The lever in our control is the amount of evidence we demand before making the conclusion we care about. We will demand enough evidence so that, if the sleep pattern really had no effect on performance, the chance of making the conclusion in error is acceptably small.

Finally, we add our third insight: we can quantify just how unlikely the evidence we found supporting the statement that sleep deprivation impairs performance would be, if in fact the sleep pattern had no effect on performance. This quantification can give us confidence in our conclusion—and the ability to convey the exact degree of this confidence to others.

These insights will take us a while to develop. Along the way, we will visit the major paradigms of tests of significance (also known as hypothesis testing), arguably the cornerstones of a Basic Statistics course. Let me warn the reader that I will deviate from canonical statistical practice. The experimenters did what most statisticians would do: they performed something called a “two-sample t-test” on the data. Although I agree that a “two-sample t-test” is appropriate here, I am going to present a different test of significance7—different from a t-test—simply for pedagogical purposes. A t-test requires background that I feel would obscure, at first reading, the paradigms I am trying to illuminate. A solid understanding of these paradigms will leave the reader much better prepared to understand the t-test when the time comes.

What’s the conclusion we want to reach? My answer to this question, for now, is already a deviation from standard statistical practice. The desired conclusion of a traditional t-test involves a statement about a larger population of potential volunteers, perhaps all college-age individuals in the United States. On the contrary, I am going to stick with just the 21 subjects chosen for the experiment and say that our desired conclusion is that “even though we have divided subjects into two groups, and imposed different sleep patterns, if all 21 subjects could belong to both groups, and be tested in both scenarios, subjects’ scores would improve more (on average) with unrestricted sleep than with sleep deprivation.”

Of course, we could not actually do this experiment properly because we would end up training the subjects while testing the first condition, which would affect their performance while testing the second. However, we do not need to run this experiment in this way to successfully perform a test of significance: we just need the data we already have.

Now we articulate a statement that contradicts our desired conclusion: “even though we have divided subjects into two groups, and imposed different sleep patterns, if all 21 subjects could belong to both groups, and be tested in both scenarios, all subjects’ improvement scores would remain exactly the same, regardless of what group and what sleep pattern they experience.” In the language of tests of significance, we call this statement our null hypothesis.

To support our desired conclusion, we seek evidence against the null hypothesis. But for the reasons discussed above, we cannot directly test the null hypothesis, with all subjects belonging to both groups. So instead, to build a case against the null hypothesis, we use the evidence from the experiment, but we quantify how unlikely this evidence would be if the null hypothesis were true. If we can demonstrate that our evidence would be very unlikely if the null hypothesis were true, then that calculation would support our desired statement that the null hypothesis was false.

Looking at the graph above, most statisticians would by now have a strong suspicion that the null hypothesis is false. However, a statistician’s next step would be to calculate a number called a p-value, which quantifies how unlikely the evidence from the experiment would be if the null hypothesis were true.

In the context of the experiment, we think (and will try to argue, though we will never know for sure) that the null hypothesis is false. However, in the context of the p-value calculation, the null hypothesis is true by assumption: we impose the truth of the null hypothesis to make a point. It is important to keep these two contexts separate.

How do we calculate the p-value? We repeat the experiment, virtually, while imposing the truth of null hypothesis. To make such a repetition, we need to randomly reassign the groups, without changing each subject’s improvement score.

Why is it that easy? The null hypothesis posits that the subjects’ scores do not change with a different grouping or sleep pattern. Of course, we also must assume that when we repeat the experiment under exactly the same conditions (even with a change in grouping), subjects’ scores still do not change. In practice, we would not expect these coincidences. Indeed, if we could actually repeat the experiment, with the same real—instead of virtual—subjects, we should expect some variability. For example, from night to night the quality of each subject’s sleep varies, even if it’s unrestricted. Subjects may have different amounts of coffee and/or sugar at breakfast from day to day, and they may be more or less distracted by other things going on in their lives. Nevertheless, the assumption that their scores do not change upon a virtual repetition makes it possible to conduct this virtual experiment. (How else could we virtually assign their scores?) The extra assumption in no way invalidates the results, because we seek to answer the following question: how unusual would the grouping in the original experiment have been if the null hypothesis had been true in this context?

The graph below shows the original data, on the left, together with four random reshufflings on the right. The trials (original and reshuffled) are now separated on the horizontal axis, but the groups, still separated by color, are no longer separated horizontally. Notice that the vertical positions of the dots do not change from trial to trial (improvement scores don’t depend on the reshuffling and don’t change). Only the colors of the dots change (indicating how the reshuffling regroups the subjects into different but, in this case, irrelevant, sleep patterns). All reshufflings have 11 “Deprived” (red) dots and 10 “Unrestricted” (blue) dots. As before, the group averages for each trial are noted as red or blue crosses. Despite the fact that the scores don’t change, the positions of these crosses do change, because which scores are included in each average changes.

To quantify “the degree of bad luck getting evidence challenging the null hypothesis as strongly as the evidence seen in the original experiment,” we first need to quantify “evidence challenging the null hypothesis as strongly as evidence seen in the original experiment.” This task is easy. In the original experiment, the average of the “Unrestricted” group’s improvement scores is 19.82. The average of the “Deprived” group’s improvement scores is 3.9. The difference between these averages is 15.92. We quantify the strength of the evidence with this difference. We call this quantity our test statistic8. If we virtually repeat the experiment and find the test statistic (i.e. difference between group averages) greater than or equal to 15.92, we say we have found “evidence challenging the null hypothesis as strongly as the evidence seen in the original experiment.”
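
As a sketch of the arithmetic (not the study’s actual code, and with made-up numbers rather than the study data), computing the test statistic is just a difference of two group means:

```python
def test_statistic(unrestricted_scores, deprived_scores):
    """Difference between group means: 'Unrestricted' average minus 'Deprived' average."""
    mean_unrestricted = sum(unrestricted_scores) / len(unrestricted_scores)
    mean_deprived = sum(deprived_scores) / len(deprived_scores)
    return mean_unrestricted - mean_deprived

# Tiny made-up example (not the study data): three scores per group.
print(test_statistic([20.0, 15.0, 25.0], [5.0, 0.0, 7.0]))  # prints 16.0
```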

The test statistic will have a positive sign if the blue cross lies above the red cross, so that the “Unrestricted” group performs better, on average, than the “Deprived” group—as expected in the original experiment. The test statistic will have negative sign if positions of the crosses are reversed, indicating a reversal of the expectation from the experimental context. The distance between the two crosses equals the magnitude of the (either positive or negative) test statistic.

What do we see with our four reshufflings? As we should have anticipated, the test statistic changes each time, shown here: 9.3718182 (Reshuffle 1), -5.0227273 (Reshuffle 2), 3.3009091 (Reshuffle 3), and -16.4009091 (Reshuffle 4). For the first and third reshuffling, the test statistic is positive, indicating that the average improvement for the “Unrestricted” reshuffled group is greater than the improvement for the “Deprived” reshuffled group. If these values had been seen in an experiment, the data would “suggest” our desired conclusion: that sleep deprivation impairs performance. But if these suggestions seem uncompelling here, we should not be surprised: our desired conclusion is false in the reshuffled context (i.e. for calculating the p-value). Moreover, the results of trials 2 and 4 would “suggest” the opposite conclusion—that the “Deprived” group performs better than the “Unrestricted” group. Likewise, this reversal should not alarm us: in the context of the p-value calculation, we have no reason to expect that the “Unrestricted” group will perform better than the “Deprived” group—in fact, when the null hypothesis is true, about half the time it does and about half the time it doesn’t.

All values of the test statistic shown for the four reshufflings remain less than the value seen in the original experiment. That bodes well for the conclusion we desire to make. In other words, in the four trials shown above, when we knew the truth of the null hypothesis, we did not succeed in finding evidence challenging the null hypothesis as strongly as seen in the original experiment. This failure suggests that perhaps the reason we did find such evidence in the experiment was that the null hypothesis was false when we originally collected the data, as we hoped to conclude. We find this suggestion encouraging, but we still need to make a more careful accounting of the likelihood of finding such evidence, under the assumption of a true null hypothesis.

Incidentally, the magnitude of the test statistic for the fourth reshuffling was greater than the magnitude of the test statistic for the experiment, although the sign was reversed. (Specifically, the sign was negative, whereas in the experiment, it was positive.) Had we collected the data from this single trial in an experiment, the large negative value for the test statistic might have suggested that a couple of good nights’ sleep actually impairs you, rather than helps you. But would statisticians consider such a result as evidence against the null hypothesis? They might, in some experiments. However, in this particular experiment, the researchers might be certain that any large negative value for the test statistic was a group-assignment fluke, as we know it to be in our reshuffled trial. Indeed, experimenters might have good reasons to think that sleep should never impair subjects. On the other hand, no one seeing this data in an experiment would have absolutely any basis for concluding what we wanted to say in the first place: that sleep deprivation impairs subjects. For this reason, statisticians may interpret a large negative value for the test statistic as a “failure to reject the null hypothesis,” by which they would really mean a “failure to find evidence for the desired conclusion.”

A decision to count only large positive (as opposed to both large positive and large negative) test statistics as evidence against the null hypothesis is called posing a one-sided alternative rather than a two-sided alternative. (In other experiments, a one-sided alternative might mean that only large negative values count.) The word “alternative” implies “alternative to the null hypothesis,” which is what we have so far referred to as the “desired conclusion.” This desired conclusion is more traditionally called the alternative hypothesis. Our one-sided alternative is that “sleep deprivation impairs performance,” whereas a two-sided alternative would be that “sleep deprivation changes (either impairs or improves) performance.” Statisticians consider it important to make the choice between a one- or two-sided alternative before any data analysis to prevent the data from influencing this decision.

To exactly calculate the odds of finding sufficiently compelling evidence when the null hypothesis is true, the next piece of the picture involves realizing that there is only a finite number of ways of shuffling a deck of 21 cards. This number is called “21 factorial,” written 21!, which equals \(21 \times 20 \times 19 \times \dots \times 3 \times 2 \times 1\) (see footnote9); written out, it is 51,090,942,171,709,440,000. That is a big number, but fortunately the number that truly interests us remains quite a bit smaller: it is the number of ways of dividing a deck of 21 cards into two piles of sizes 11 and 10, where the order of the cards in each pile doesn’t matter. This number is called “21 choose 11,” and it equals 352,716 (see footnote10).
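
Both counts are easy to verify with Python’s standard library, should the reader want to check the arithmetic:

```python
import math

print(math.factorial(21))  # 51090942171709440000: ways to order the whole deck
print(math.comb(21, 11))   # 352716: ways to split the deck into piles of 11 and 10
```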

The number 352,716 is small enough to work with. Specifically, we can easily repeat this experiment—virtually—once for each of the 352,716 different possible group assignments. The null hypothesis determines the scores—making the virtual experiment possible. The program varies the group assignment—allowing computation of the test statistic for each grouping. All repetitions together take only a few seconds on a modern computer.

Programming a computer to run through all the possible reshufflings is not especially hard for someone trained in coding, but it is not something that is usually expected of a student in a basic statistics class. Therefore, we omit the details of how we accomplished this feat, but we will expound upon the details of the results.
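
For the curious, here is one way such a computation might be organized in Python. This is a minimal sketch, not the code used for the figures in this chapter, and the scores passed in at the bottom are hypothetical placeholders rather than the study data.

```python
from itertools import combinations

def exact_permutation_pvalue(deprived_scores, unrestricted_scores):
    """Enumerate every way of splitting the pooled scores into groups of the
    original sizes, compute the test statistic ('Unrestricted' mean minus
    'Deprived' mean) for each split, and return the fraction of splits whose
    statistic meets or exceeds the one observed in the original grouping."""
    pooled = deprived_scores + unrestricted_scores
    n_deprived = len(deprived_scores)
    n_unrestricted = len(unrestricted_scores)
    total = sum(pooled)

    def statistic(deprived_group):
        mean_deprived = sum(deprived_group) / n_deprived
        mean_unrestricted = (total - sum(deprived_group)) / n_unrestricted
        return mean_unrestricted - mean_deprived

    observed = statistic(deprived_scores)
    at_least_as_extreme = 0
    n_splits = 0
    for group in combinations(pooled, n_deprived):  # all "n choose k" splits
        n_splits += 1
        if statistic(group) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_splits

# Hypothetical scores (NOT the study data), just to show how the function is called:
print(exact_permutation_pvalue([3.0, -10.7, 1.2], [19.0, 22.5, 18.1, 20.0]))
```

With the full data set, this loop would run over all 352,716 splits, exactly as described above.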

The 352,716 values of the reshuffled test statistic varied between a minimum value of -23.2927273 and a maximum value of 23.27. The unique minimum (negative) value occurred when the 11 highest performers belonged to the “Deprived” group; the unique maximum value occurred when the 10 highest performers belonged to the “Unrestricted” group. (Note that the “Deprived” group always had 11 members and the “Unrestricted” group always had 10.)

The value of the test statistic seen in the experiment, 15.92, was not unique, but it was quite rare: it occurred only 23 times out of 352,716 total trials—fewer than 1 in 10,000 reshufflings. However, statisticians do not normally care about the number of times the test statistic takes on the exact value seen in the experiment. It could easily have happened that all 352,716 trials had unique values, particularly had more significant digits been recorded in the data. Indeed, equal test statistics occur only by coincidence—either because two or more data points coincide, or because different combinations of scores average to the same value.

What statisticians care about is the number of test statistics, out of all test statistics possible, that fall in the appropriate range. The range of interest is the range where the strength of the evidence against the null hypothesis implied by the test statistic meets or exceeds the strength of the evidence seen in the original data. This statement implies that we care about reshuffled test statistics greater than or equal to 15.92. (On the other hand, if we had chosen a two-sided alternative, we would have cared about reshuffled test statistics that were either greater than or equal to 15.92 or less than or equal to -15.92.)

Again, a picture can help. Below, we plot all 352,716 reshuffled test statistics with the value of each test statistic shown as the vertical position of a translucent blue dot. Each dot corresponds to a different reshuffled trial. We randomly “jittered” the horizontal position of the dots to avoid having them all fall on the same vertical line. The vertical position of the black horizontal line displays the value of the test statistic seen in the original experiment, 15.92. Within about 10 points of zero (plus or minus), the individual dots become a dense blue smear, because so many values for the test statistic fall in that range. However, the values start to thin out before they reach the black line.

We caution the reader that we drew this figure assuming the truth of the null hypothesis. If the null hypothesis were false, the blue dots would appear at different vertical levels, because the sleep patterns determined by the grouping would now change the scores. Specifically, if our desired conclusion were true, we would see values of the test statistic shifted upwards from their corresponding positions that we see here. To elaborate, if sleep deprivation impairs performance, then subjects reshuffled from the “Deprived” to the “Unrestricted” group would tend to have higher scores, whereas those reshuffled from “Unrestricted” to “Deprived” would tend to have lower scores. Both shifts would imply that the test statistic (the “Unrestricted” group average minus the “Deprived” group average) would increase from its value under the null hypothesis (which imposes that scores stay the same upon reshuffling).

With a true null hypothesis, we find exactly 2533 blue dots on or above the black line—a little less than 1% of the 352,716 total. Each of these unusual values corresponds to a reshuffled trial that produces misleading evidence against the null hypothesis—that is, evidence meeting or exceeding the strength of the evidence seen in the original experiment.

The p-value equals 0.0071814 and is given by this equation11:

$$ \mbox{p-value} = \frac{\mbox{total number of dots on or above the black line}}{\mbox{total number of dots overall}}. $$

The p-value is related to the percentage of dots on or above the black line, as mentioned above. To calculate this percentage, we simply need to multiply the p-value by 100%: the result is 0.7181415%. This number indeed remains less than 1%, as advertised. We interpret this result as follows: if the null hypothesis were correct (making our desired conclusion false), then our chance of seeing data that provides evidence at least this strong against this true null hypothesis, simply from the random group assignment, is only 0.7181415%.
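
For readers who want to check the conversion, the arithmetic is a one-liner (using the counts quoted above):

```python
p_value = 2533 / 352716           # dots on or above the black line, over all dots
print(round(p_value, 7))          # 0.0071814  (the p-value, as a proportion)
print(round(100 * p_value, 7))    # 0.7181415  (the same quantity, as a percentage)
```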

It should be pointed out that, for the quotient shown above to accurately express the p-value, it is essential that each possible group assignment remain equally likely. If the experimenter stacks the deck in an evil way, or in any way other than giving it a proper shuffling, then all bets are off, so to speak.

Note that p-values are written in terms of proportions (fractions of 1), not percentages (fractions of 100%). To reiterate, proportions fall between 0 and 1, whereas percentages fall between 0% and 100%. One should have no difficulty converting (as we do above), but nevertheless, by convention, one should always report p-values as proportions. Statisticians have a good reason for this restriction: it prevents ambiguity. Does a p-value of 0.7181415 mean 0.7181415% or 71.81415%? The difference could not be more stark. Answer: a p-value of 0.7181415 always means a proportion—equivalent to 71.81415%, one hundred times what we found. The presence or absence of a %-sign does resolve this ambiguity, but it is too easy to carelessly leave a %-sign off. Without restricting authors to proportions, in some cases, as above, a reader could never know what an author intends.

Let’s consider a different scenario. What if all (100%) of the blue dots were on or above the black line? This result would imply that the test statistic takes on its unique minimum (negative) value—unique among the 352,716 possibilities—with the 11 “Deprived” subjects having the 11 highest scores. More succinctly, those with sleep deprivation would have the best performances. If this occurrence happened, the numerator and denominator of the above equation would have the same value (352,716), making the p-value exactly 1.0. In this case, we would have the weakest possible evidence that sleep deprivation impairs subjects.

A p-value of 1.0 in a hypothetical experiment reveals there is absolutely nothing in the data that supports the desired conclusion. In this scenario, if the null hypothesis were true, we would not be at all surprised to find evidence challenging the null hypothesis as strongly as that seen in this hypothetical experiment. In fact, we would be certain to find such evidence—giving us absolutely no reason to believe that our null hypothesis wasn’t true when we collected our maximally weak data.

On the other hand, the smaller the p-value, the more surprised we would be to find matching evidence (i.e. of equal or greater strength) against the null hypothesis if the null hypothesis were true. Smaller p-values allow us to make a case that the null hypothesis was not true in the experiment, thus smaller p-values bode well for our desired conclusion.

The smallest possible p-value, at least in this problem, would occur if the test statistic took on its unique maximum value. This result would imply that the 11 “Deprived” subjects had the 11 lowest scores. In this case, only one blue dot would lie on the black line, whereas the rest would fall below. This p-value would be 1/352,716 or 0.0000028. Numbers this small remain hard to read, so most statisticians would simply report such a p-value with the inequality: “<0.0001.”

How small does the p-value need to be? An experimenter should decide on a threshold called the level of significance during the design of the experiment. Like the decision between a one- or two-sided alternative, experimenters should assign the level of significance before analyzing any data, because statisticians consider it important that the data do not influence these choices.

After designing the experiment, then collecting and analyzing the data, the experimenter computes a p-value and applies the following criterion: only if the p-value is less than the previously chosen level of significance, do they deem the p-value small enough to reject the null hypothesis and accept the desired conclusion. With this decision, the experimenter finds significance. On the other hand, if the p-value is not smaller than the level of significance, the experimenter fails to find significance and thus fails to reject the null hypothesis and fails to make the desired conclusion.

In no way should finding or failing to find significance be thought of as “determining the truth.” The reader should already understand that bad luck with random grouping can introduce an error in the conclusion, no matter what this conclusion should be. A statistician’s goal is to control and quantify this luck. Specifically, the p-value assesses the strength of the evidence against the null hypothesis.

How does one decide on a level of significance? Most studies follow the tradition: \(\alpha = 0.05\). To clarify, statisticians use the Greek letter “alpha” (written \(\alpha\)) to denote the level of significance. What this means for our study is that fewer than 5% of the reshuffled test statistics should lie on or above the black line \((\alpha \times 100\% = 5\%)\) to allow us to proceed to make the desired conclusion. Like p-values, levels of significance are written as proportions, but for clarity with small numbers, here we also write percentages, which may be more familiar to some readers at this stage. Remember, we said that less than 1% lie on or above the black line, which, of course, also means that less than 5% do. Because the p-value remains less than the level of significance (0.0071814 < 0.05), we succeed, as expected, in finding significance. We can now “safely” reject our null hypothesis and establish our desired conclusion.

Finally, we have done the work to satisfy a statistician: sleep deprivation impairs performance, even after one night of recovery. But a statistician would be careful to remind us that our “safety” in drawing this conclusion is only as high as our p-value is low. Finding significance means that the degree of our safety in making this conclusion has crossed into the territory of “safe enough” (as per the criteria we set forth when we designed the study and/or the traditional criteria). We caution the reader that unless our p-value actually equals 0 (which cannot happen in this problem, as with many other problems) then “safe enough” must fall short of certainty.

For comparison, we can draw a red horizontal line to indicate the level of significance. Specifically, the red line on the plot below is drawn so that 5% of the reshuffled test statistics lie above it. We see that the red line is below the black line. This property follows from the fact that we found significance: exactly 5% of the blue dots lie above the red line, whereas less than 1% lie above the black line. Had we discovered the black line below the red line, we would have failed to find significance. The red line corresponds to a test statistic of 11.1472727, called the critical value of the test statistic. Because the test statistic of the experiment was greater than its critical value, we found significance; if it had been less, we would not have.
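
In code, the critical value can be read off from the sorted list of reshuffled test statistics. The sketch below is hypothetical: it assumes a list null_statistics of all reshuffled values has already been computed (for example, by the enumeration sketched earlier), and because the distribution is discrete, the fraction lying above the returned cutoff is only approximately \(\alpha\).

```python
import math

def critical_value(null_statistics, alpha=0.05):
    """Return a cutoff such that roughly a fraction alpha of the reshuffled
    test statistics lie above it (ties and discreteness make this approximate)."""
    ordered = sorted(null_statistics)
    index = math.ceil((1 - alpha) * len(ordered)) - 1
    return ordered[index]

# Usage with a small hypothetical null distribution (NOT the 352,716 real values):
print(critical_value([-3.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0], alpha=0.1))
# prints 3.0; exactly 1 of the 10 values (10%) lies above this cutoff
```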

What if the black and red lines had actually coincided? That is, what if the p-value had actually equaled the level of significance? Equivalently, what if the test statistic from the experiment had actually equaled its critical value? This coincidence happens so infrequently in statistics that it bears no consistent convention. In a study, you should always report the p-value, regardless of its value, not just whether you find significance. If your p-value equals the level of significance, you should note this fact, but then realize you are free to make your own convention as to how to determine significance under this contingency. The safest choice, if you need to pass peer review to publish your study, would simply be to say your “result was on the boundary of significance.”

When would you not use the traditional level of significance? To decide on a non-standard \(\alpha\), consider the following principle: if the null hypothesis actually holds, the level of significance gives the fraction of reshuffled trials that will lead to a false positive result—a random group assignment that, by chance, leaves the test statistic above the red line and leads to an inappropriate finding of significance. Thus, with the traditional \(\alpha\), if a true null hypothesis makes the desired conclusion false, we would have a 5% chance of erroneously accepting the desired conclusion. If you want this error rate to be higher or lower, you would choose a different \(\alpha\).

Why wouldn’t you want the chance of a false positive to be as low as possible? All things being equal, you would—but things are never equal. When the chance of a false positive goes down (indicated by a lower \(\alpha\)), the chance of a false negative goes up, which means it will be harder to find significance, even when finding it would be appropriate (i.e., when the conclusion you want to make is true). Many statisticians find that \(\alpha = 0.05\) provides a reasonable trade-off in most circumstances and consider it a good choice for many problems. Indeed, for many studies, \(\alpha = 0.05\) will not raise the eyebrows of skeptical readers, whereas any higher \(\alpha\) may well attract scrutiny. But other trade-offs may be appropriate in some situations, for example, if false positives are especially costly compared to false negatives.

An analysis of false negatives is much more subtle than the analysis of false positives and will bring us to the study of something called statistical power in a future lesson.

Notes

1Jan H.F. Meyer and Ray Land. “Threshold concepts and troublesome knowledge (1): linkages to ways of thinking and practicing within disciplines.” In Improving Student Learning: Ten Years On. C. Rust (ed.) Oxford Centre for Staff and Learning Development, Oxford, 2003.

2Jan H.F. Meyer and Ray Land. “Threshold concepts and troublesome knowledge (2): Epistemological considerations and a conceptual framework for teaching and learning.” Higher Education 49:373–388. Springer, 2005.

3Ray Land, Glynis Cousin, Jan H.F. Meyer, and Peter Davies. “Threshold concepts and troublesome knowledge (3): implications for course design and evaluation.” In Improving Student Learning: Diversity and Inclusivity. C. Rust (ed.) Oxford Centre for Staff and Learning Development, Oxford, 2005.

4Sinead Breen and Ann O’Shea. “Threshold Concepts and Undergraduate Mathematics Teaching.” In The Best Writing on Mathematics, M. Pitici (ed.), Princeton University Press, 2017.

5Beth L. Chance and Allan J. Rossman. Investigating Statistical Concepts, Applications, and Methods. Third Edition. http://www.rossmanchance.com/iscam3/

6As some of you may know, there exists more than one way of taking an average: for now, we use the mean—the method most people understand from the word “average.” If you don’t know the word “mean,” but you do know “average,” in this context, you almost certainly understand correctly.

7Statisticians know the test I present by the name nonparametric permutation test, and consider it entirely appropriate for the problem under discussion. Indeed, a t-test would be contraindicated for samples this small, if the data were deemed too poorly fit by a Normal model. As it turns out (see footnote below), the p-values for the two tests are nearly equal, so both tests can be used here with the same degree of success.

8The test statistic for a two-sample t-test adjusts this difference between the means by dividing by the standard error. The notion of standard error does not carry over to this context, where we do not perform sampling in the traditional sense.

9Since the deck has 21 cards, there are 21 choices for the first card in the deck. Once the first card has been dealt, there are only 20 choices for the second card, and so on to the last card. Combining these possibilities requires a product of these 21 numbers.

10Once the deck has been shuffled, only the assignment to each group matters—the order within each pile is irrelevant. Each division of the deck into an 11-subject “Deprived” pile and a 10-subject “Unrestricted” pile corresponds to \(11! \times 10!\) different orderings of the full deck: the \(11!\) ways of ordering the “Deprived” pile times the \(10!\) ways of ordering the “Unrestricted” pile. Therefore, the number of divisions, called “21 choose 11,” is commonly written \( {21 \choose 11} = \frac{21!}{11! \times 10!}.\)

11Interestingly, the p-value for a Welch two-sample t-test, on this same data, with a one-sided alternative, is remarkably similar: 0.0076742. The two results are computed in entirely different ways.

12David S. Moore, George P. McCabe, and Bruce A. Craig. Introduction to the Practice of Statistics. Ninth Edition. W. H. Freeman, 2017.