The P-value

Definition. The P-value is the probability of observing a test statistic (i.e., a summary of the data) that is as extreme as or more extreme than the currently observed test statistic, under a statistical model that assumes, among other things, that the hypothesis being tested is true. This can be expressed as Pr(data as or more extreme than observed | H0), often abbreviated Pr(data|H0), where "Pr" is read "the probability of" and "|" is read as "given" or "conditional upon." The P-value should not be interpreted as the probability of H0 being true, which would be Pr(H0|data).
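The definition above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a binomial model, a hypothetical result of 15 heads in 20 coin tosses, and a null hypothesis that the coin is fair. The function name binomial_tail_p is not from the original text.

```python
from math import comb

def binomial_tail_p(n, k_obs, p0=0.5):
    """Upper-tail P-value: Pr(X >= k_obs | H0: success probability is p0).

    Sums the binomial probabilities of the observed count and of every
    more extreme (larger) count, all computed under the null hypothesis.
    """
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(k_obs, n + 1))

# Hypothetical data (an assumption for illustration): 15 heads in 20 tosses.
p_one_sided = binomial_tail_p(20, 15)
print(round(p_one_sided, 4))  # 0.0207
```

The result is the probability of data at least this extreme if the coin is fair. It is not the probability that the coin is fair.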

Interpretation: two competing frameworks. P-values can be used in multiple ways. This has caused a great deal of confusion, because there are two competing and sometimes contradictory philosophical frameworks used to derive the P-value. The first framework was formally developed and popularized by R. A. Fisher (Fisher, 1925). Fisher's framework is called significance testing. The second framework was developed by Jerzy Neyman and Egon Pearson (Neyman & Pearson, 1928, 1933). Neyman & Pearson's framework is called hypothesis testing. When we interpret the P-value by borrowing some concepts from Fisher's framework and some from Neyman & Pearson's framework, incoherent interpretations may result. It is therefore important to understand the objectives and basis of each framework.

Fisher's significance testing. P-values are to be used flexibly in this framework, with the P-value interpreted as "a rational and well-defined measure of reluctance to accept the hypotheses they test" (Fisher, 1973, page 47). Although many have mistakenly suggested a single threshold for determining "statistical significance" (myself included, mea culpa!), Fisher noted "no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (Fisher 1973). Nonetheless, the smaller the P-value, the stronger the evidence against the null hypothesis. Fisher intended the P-value to be combined with other sources of information from within and outside the study, often based on background knowledge. Thus, the researcher is not to place sole reliance on the P-value as a means of reaching a conclusion. Note that there is no alternative hypothesis in Fisher's significance testing framework, and that failure to reject the null hypothesis provides no evidence in its support.

Neyman-Pearson (N-P) hypothesis testing. The N-P hypothesis testing procedure is suited for decision-making, and less so for scientific inference. In this framework, we set acceptable rates for type I errors (false rejection of null hypotheses) and type II errors (false retention of null hypotheses) before the experiment is begun (i.e., preexperimentally). The acceptable type I error rate is referred to as "alpha." The acceptable type II error rate is referred to as "beta." Preexperimental error rates are not based on the data from the study. After the experiment is completed, we may calculate a P-value from the data and compare it to the preexperimental alpha level. If p < alpha, the null hypothesis is rejected. Note that in N-P hypothesis testing, the conclusion of the test is not intended to verify or falsify the specific hypotheses tested. Instead, it provides "rules for behavior" that are intended to limit the number of type I and type II errors in a long run of similar experiments. The N-P hypothesis testing procedure has been criticized as non-scientific because it is incapable of interpreting the results from a single scientific study.
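The "long run" meaning of alpha can be illustrated by simulation. The sketch below is not part of the N-P procedure itself. It assumes a two-sided z-test with known standard deviation, normally distributed data, and a null hypothesis that is true in every simulated experiment; the function names, sample size, number of experiments, and seed are all arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def two_sided_z_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided P-value for H0: mean = mu0, with sigma assumed known."""
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def long_run_rejection_rate(alpha=0.05, n=20, experiments=5000, seed=1):
    """Fraction of simulated experiments that reject H0 when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        # Every sample is drawn under H0 (true mean 0), so each rejection
        # is a type I error by construction.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_z_p(sample) < alpha:  # the preset "rule of behavior"
            rejections += 1
    return rejections / experiments

rate = long_run_rejection_rate()
print(rate)  # expected to be close to alpha = 0.05
```

Under these assumptions the rejection rate settles near alpha across many experiments, which is the only guarantee the rule offers. It says nothing about whether any single rejection is correct.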

Comparison of significance testing and hypothesis testing and additional notes. 

 

| | Fisher Significance Testing | Neyman-Pearson Hypothesis Testing |
|---|---|---|
| Logical basis | Inductive reasoning. | Rules of behavior based on a quasi-deductive model. |
| Hypotheses tested | The null hypothesis.[1] There is no alternative hypothesis in this system. | Null and alternative hypotheses tailored to the situation. |
| Objective | The P-value is used as an informal measure of evidence to reflect upon the credibility of the null hypothesis. | Alpha and beta levels are provided pre-experimentally to limit the number of type I and type II errors in the long run.[2] |
| p = .04 vs. p = .06 | These results provide approximately the same level of evidence against the null hypothesis. | Assuming a pre-experiment alpha level set to .05, p = .04 provides a significant finding while p = .06 provides a nonsignificant finding. |
| p = .04 vs. p = .001 | p = .001 provides much stronger evidence against the null hypothesis than does p = .04. | Assuming a pre-experiment alpha level set to .05, both studies provide significant evidence to reject the null hypothesis, and the two P-values are treated equally. |
| Conclusion | The conclusions of the experiment should not be based on the P-value alone.[3] | Decisions should adhere to rejection and acceptance regions based on alpha and beta set up before the study. |
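The N-P side of the comparison above reduces to a fixed decision rule, which can be sketched in a few lines. The function name np_decision is hypothetical, and alpha = .05 is the pre-experiment level assumed in the comparison. Fisher's reading is deliberately not coded, since it weighs the P-value informally against background knowledge rather than applying a mechanical rule.

```python
def np_decision(p, alpha=0.05):
    """N-P rule of behavior: compare p with the alpha fixed before the study."""
    return "reject H0" if p < alpha else "do not reject H0"

# The three P-values discussed in the comparison.
for p in (0.06, 0.04, 0.001):
    print(p, np_decision(p))
# 0.06 do not reject H0
# 0.04 reject H0
# 0.001 reject H0
```

The rule separates .06 from .04 even though they are close, and treats .04 and .001 identically even though they are far apart. That contrast with the evidential reading is the point of the comparison.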

• In Fisher's significance testing framework, the P-value is an inductive measure that assigns a number to the credibility of the hypothesis being tested.

• The P-value is not a direct measure of inductive statistical evidence. Inductive statistical evidence is defined as the relative inductive support given to two hypotheses by the data. Fisher's P-value addresses only one hypothesis: the null hypothesis.[4]

• The alpha level in the N-P framework is akin to, but not identical to, the P-value. Both the alpha level and the P-value are based on unobserved data in the tail region of the probability model defined by the null hypothesis. However, the P-value is postexperimental, while the alpha level is preexperimental. It is a mistake to view the postexperimental P-value as the smallest level of alpha at which the experimenter would reject the null hypothesis (Goodman 1993, Greenland 1991).

• Significance tests and hypothesis tests are both forms of frequentist inference. Other forms of statistical inference include Bayesian methods[5] and standardized likelihoods.



[1] The null hypothesis is the hypothesis to be nullified and is not necessarily restricted to a statement of "no association."

[2] A postexperiment P-value can be slotted into the hypothesis testing procedure by comparing it to the preexperimental alpha level.

[3] Fisher intended the P-value to be used informally, as a flexible inductive measure with inferences depending on background knowledge about the phenomenon under investigation.

[4] Goodman 1993 cites the book Probability and the Weighing of Evidence by I. J. Good (New York: Charles Griffin & Co, 1950).

[5] Bayesian methods were known as "inverse probability" until the middle of the twentieth century (Fienberg, 2006).