If You Must Test, Do It One-Sided, Perhaps with a Touch of Bayes.
A p-value is most informative when its underlying null hypothesis was tailored to tightly address the research question. This optimizes the chance that the result will be correctly understood, interpreted, and communicated. I argue here that when traditional (frequentist) statistical inference is called for, most research questions are best handled using one-sided tests. Doing so also allows us to add a "touch of Bayes" in order to turn the indirect inference provided by p-values into direct inference.
"Wide-angled" hypothesis testing is ineffective. "Overall," "omnibus," and "simultaneous" tests are often taught, but they are usually uninformative when used.
Bilder and Loughin (2015, Section 3.2.3) present counts for a 4 x 4 contingency table in which the rows are the type of fiber in a new fiber-rich cracker: none, bran, gum, or both bran and gum. The subjects ate several crackers of one type, then consumed a meal, and later rated how much bloating they experienced: none, low, medium, or high. The general research question appears to be: Do these fiber-rich crackers increase the incidence of bloating?
> fiber <- c("none","bran","gum","both")
While the warning "Chi-squared approximation may be incorrect" indicates that this sample size is generally too small to provide trustworthy p-values, that is beside the point here. If the counts were instead tripled (giving a total sample size of 144), no such message appears and p < 0.000001. Nevertheless, even this strong result has three distinct problems.
- While this rating scale is clearly ordinal, the analysis treats it only as nominal and thus does not address the research question. In fact, scrambling the order of the rating levels does not change the results. For example, the following statements produce the same results as above.
> bloat2 <- bloat
> levels(bloat2) <- c("high","low","none","medium")
> (c.table2 <- xtabs(count ~ fiber + bloat2))
- This test only addresses whether the four fiber types have the same ratings profile; the single p-value does not relate, say, to comparing the three fiber types to the control ("none").
- If we bore down into the distributional theory underlying this test, we could partition this chi-squared test statistic with 9 degrees of freedom (DF) into 9 statistically independent effect-size components (each one being a central or non-central chi-square random variable). To maintain statistical power, each component must provide adequate effect size (non-centrality).
To illustrate this theory, suppose subjects in this study cannot reliably distinguish between "none" and "low" bloating, nor between "medium" and "high" bloating. If so, the 4 x 4 table could be collapsed down to a 4 x 2 table, thus winnowing out 6 DF worth of little or no true effect size. Furthermore, if only gum fiber significantly increases bloating, the fiber variable could be reduced to "none/bran" versus "gum/both," producing a 2 x 2 table that winnows out another 2 DF worth of marginal effect size. The resulting 2 x 2 table (shown below) gives a chi-squared value of 9.38 with one degree of freedom (p = 0.0022). Let this be the first of the 9 effect-size components. 9.38 is 55% of 16.94, the overall chi-squared value with all 9 DF worth of components for the 4 x 4 table. Thus, (16.94 - 9.38)/8 = 7.56/8 = 0.94 is the average chi-squared value for the remaining 8 DF worth of effect-size components. Under the null hypothesis, the expected value of each effect-size component is 1.0, so the winnowed 8 DF worth of effect size is remarkably consistent with being null.
Hypothesis tailoring supports tighter analyses and greater statistical power, a concept poorly emphasized in courses, books, and articles on design and analysis. Of course, for any p-values and confidence levels to be valid, this tailoring must be done in the statistical planning stage and written into the study protocol.
> # Collapse 4 x 4 table to 2 x 2.
> bloat.2x2 <- as.character(bloat)
> bloat.2x2[bloat %in% c("none", "low")] <- "none/low"
> bloat.2x2[bloat %in% c("medium", "high")] <- "medium/high"
> bloat.2x2 <- factor(bloat.2x2, levels = c("none/low", "medium/high"))
> fiber.2x2 <- as.character(fiber)
> fiber.2x2[fiber %in% c("none", "bran")] <- "none/bran"
> fiber.2x2[fiber %in% c("gum", "both")] <- "gum/both"
> fiber.2x2 <- factor(fiber.2x2, levels = c("none/bran", "gum/both"))
> (c.table.2x2 <- xtabs(count ~ fiber.2x2 + bloat.2x2))
bloat.2x2
fiber.2x2 none/low medium/high
none/bran 21 3
gum/both 11 13
> summary(c.table.2x2)
Call: xtabs(formula = count ~ fiber.2x2 + bloat.2x2)
Number of cases in table: 48
Number of factors: 2
Test for independence of all factors:
Chisq = 9.375, df = 1, p-value = 0.0022
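As a numerical cross-check of the partitioning arithmetic above, here is a short sketch (in Python rather than R, purely for illustration) that recomputes the 2 x 2 chi-squared statistic from the collapsed counts and the average of the 8 remaining effect-size components:

```python
import math

# Collapsed 2 x 2 table from the text: rows = fiber (none/bran, gum/both),
# columns = bloating (none/low, medium/high).
table = [[21, 3],
         [11, 13]]

n = sum(sum(row) for row in table)            # 48 cases
row_tot = [sum(row) for row in table]         # [24, 24]
col_tot = [sum(col) for col in zip(*table)]   # [32, 16]

# Pearson chi-squared statistic (no continuity correction).
chisq = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
            / (row_tot[i] * col_tot[j] / n)
            for i in range(2) for j in range(2))

# For 1 DF, the chi-squared upper-tail probability is erfc(sqrt(x/2)).
p = math.erfc(math.sqrt(chisq / 2))

# Partition arithmetic: this 1-DF component versus the overall 9-DF
# statistic of 16.94 for the full 4 x 4 table.
share = chisq / 16.94          # share of the overall statistic (about 55%)
avg_rest = (16.94 - chisq) / 8 # average of the other 8 components, near 1.0

print(round(chisq, 3), round(p, 4), round(share, 2), round(avg_rest, 2))
```

This reproduces Chisq = 9.375 and p = 0.0022 from the `summary(c.table.2x2)` output above.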
As exemplified by this study of high-fiber crackers and bloating, the common overall test of an R x C contingency table is often too general to handle well-posed research questions. Even if the bloating rating were handled as ordinal in scale, say using the Kruskal-Wallis K-group generalization of the classic Wilcoxon-Mann-Whitney 2-group test, a critically small p-value would still only indicate that these four groups differ in some way. Such a wide-angled look begs for a close-up lens, sharply focused. See the strategy based on WMW() summarized below and covered in detail as Example X.
Other such tests share the same problems:
- The overall ANOVA tests the equality of K > 2 group means. A critically low p-value indicates only that the means are different in some way, a bland result that rarely helps investigators address their actual research questions. When several groups are involved and some have similar true means, statistical power suffers.
- The omnibus tests in statistical modeling compare the fit of a model with K predictors (X variables) plus an intercept term to the intercept-only model. This fails to focus on the individual beta*X terms, each of which should have been formed to address a particular research question or serve in a supporting role as covariates. When some beta coefficients are null or nearly null, power for the omnibus test suffers.
- The primary test in a two-group discriminant analysis (MANOVA) compares a pair of centroids with respect to K > 1 variables. A critically low p-value indicates that the groups differ with respect to an optimal linear combination of those variables—the discriminant function—but interpreting this is unlikely to relate to the research questions, which usually link to the individual variables. In practice, greater K usually reduces power for this test. These problems only worsen when more than two groups are involved.
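The power loss from wide-angled testing is easy to demonstrate by simulation. The sketch below (a made-up illustration, not from any study discussed here) compares the omnibus one-way ANOVA with a tailored single-DF contrast when only one of four group means truly differs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20240501)
n, sims, alpha = 20, 1000, 0.05
means = [0.0, 0.0, 0.0, 0.8]   # only the fourth group truly differs

reject_omnibus = reject_contrast = 0
for _ in range(sims):
    groups = [rng.normal(m, 1.0, n) for m in means]

    # Omnibus one-way ANOVA across all four groups.
    _, p_f = stats.f_oneway(*groups)
    reject_omnibus += p_f < alpha

    # Tailored 1-DF contrast: group 4 versus the average of the other three.
    c = np.array([-1/3, -1/3, -1/3, 1.0])
    ybar = np.array([g.mean() for g in groups])
    s2 = np.mean([g.var(ddof=1) for g in groups])  # pooled variance estimate
    t = (c @ ybar) / np.sqrt(s2 * np.sum(c**2) / n)
    p_t = 2 * stats.t.sf(abs(t), df=4 * (n - 1))
    reject_contrast += p_t < alpha

print(reject_omnibus / sims, reject_contrast / sims)
```

With these (assumed) settings, the tailored contrast rejects more often than the omnibus F-test, because the three null groups contribute noise but no effect size to the omnibus statistic.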
Focus on single parameters. I have long advocated that the most powerful and cogent statistical approaches focus on single parameters, each tailored to a specific research question. Obtaining estimates, confidence intervals, and (perhaps) p-values for tailored hypothesis tests gives us information we can understand, interpret, and communicate effectively. If two or more such parameters are needed to address a single research question, then I use Bonferroni familywise Type I error protection, always trying to minimize that family size. See below for how this applies to the study of fiber-rich crackers and bloating.
Two-sided testing usually makes little sense! Even when analyzing a single tailored parameter, should we use one-sided or two-sided confidence intervals and hypotheses? Here I discuss the inherent problems with testing the classical null hypothesis for the two-sided test of H0: WMWprob = 0.50 or, equivalently, H0: WMWodds = 1.00. The discussion applies to any test on a single parameter, including an odds ratio, a correlation coefficient, a single parameter in a statistical model, a difference between two means, or a single-degree-of-freedom contrast on K means.
If H0: WMWprob = 0.50 is true, we posit that the obtained p-value, P, behaves like a standard uniform random variable, U. For a perfect replication of the study, Pr[U ≤ P | H0 true] = P. Lower values of P create greater suspicion that H0 may not be true, leaving us to give greater credence to H1: WMWprob ≠ 0.50. This indirect reasoning confuses most people. Even those who know better subconsciously drift into behaving as if P is the probability that H0 is true, i.e., that P equals Pr[H0 true | P]. Moreover, the classical two-sided p-value has a fundamental weakness. Being a point null hypothesis, H0: WMWprob = 0.50 is rarely true prima facie, so how can we assume that it is true in order to compute a p-value? Examples teach best:
Suppose subjects with a given disease were randomized to receive drugs A or B, both known to be active agents in the disease's biological mechanism, but they affect it in different ways. Let (Y1, Y2) be a random pair of observations from groups A and B. What is the chance that Pr[Y1 > Y2] could be exactly the same as Pr[Y1 < Y2]?
If that chance is minuscule or nil, the p-value for H0: WMWprob = 0.50 has little or no probative value.
Consider these other comparisons. For each, ask: For some ordinal Y of interest, what is the chance Pr[Y1 > Y2] is exactly Pr[Y1 < Y2]?
When confronted with this argument, some people counter by saying that the true WMWprob might be near enough to 0.50 to be "essentially" 0.50, thus making classical two-sided p-values admissible. Formally, they are saying that testing H0: WMWprob = 0.50 versus H1: WMWprob ≠ 0.50 is an acceptable surrogate for testing H0(delta): abs(WMWprob - 0.50) ≤ delta versus H1(delta): abs(WMWprob - 0.50) > delta, where delta is some small value quantifying "near enough" to 0.50.
However, if this two-sided "delta-interval" hypothesis is to be tested, why not do it? Let P1 and P2 be the p-values obtained from testing these one-sided delta hypotheses:
- H0(delta): WMWprob ≤ 0.50 + delta versus H1: WMWprob > 0.50 + delta
- H0(delta): WMWprob ≥ 0.50 - delta versus H1: WMWprob < 0.50 - delta
Define P = min(2*min(P1, P2), 1.00). If the estimated WMWprob exceeds 0.50 + delta, then 0 < P1 < 0.50, yielding 0 < P < 1.00. If it is less than 0.50 - delta, then 0 < P2 < 0.50, again yielding 0 < P < 1.00. If 0.50 - delta ≤ WMWprob ≤ 0.50 + delta, both P1 and P2 exceed 0.50, so P = 1.00. Note that for delta = 0, this scheme equates to performing the classic two-sided test.
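This scheme can be sketched in a few lines. The sketch below is an illustration only (Python, with a normal approximation for a generic estimator; the estimates and standard error are made-up numbers), not the WMW() implementation:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def delta_interval_p(est, se, delta, null=0.50):
    """Two-sided p-value for H0(delta): |parameter - null| <= delta,
    combined from two one-sided tests as a Bonferroni pair."""
    p1 = 1 - norm_cdf((est - (null + delta)) / se)  # H1: parameter > null + delta
    p2 = norm_cdf((est - (null - delta)) / se)      # H1: parameter < null - delta
    return min(2 * min(p1, p2), 1.00)

# Estimate well above 0.50 + delta -> small P;
# estimate inside the delta interval -> both P1 and P2 exceed 0.50, so P = 1.
print(delta_interval_p(0.70, 0.05, 0.05),
      delta_interval_p(0.51, 0.05, 0.05))
```

Setting delta = 0 in this function reproduces the classic two-sided p-value, matching the note above.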
Even though interval null hypotheses can be tested, estimates and confidence intervals deal with the matter directly. Consider these examples using WMWodds estimates and 95% CIs:
In my experience, the vast majority of research questions are inappropriately addressed using p-values from classic two-sided testing. However, within the frequentist school, some point null hypotheses are viable and thus their associated p-values are useful, which is why WMW() will compute them if requested. Here is one example:
Subjects are either cases, who have the disease, or controls, who do not. Cases and controls are compared on K variables, Y.1, Y.2, ..., Y.K, where K could be quite large, so that no particular Y is in focus. In such screening studies, the research hypothesis is that very few, if any, of these Ys truly differ between cases and controls. The Ys yielding p-values below some conservative threshold, perhaps 0.05/K, are merely being selected for further study.
Thinking in terms of two one-sided hypotheses. The previous section noted that the two-sided test is equivalent to conducting two one-sided tests as a Bonferroni family. Consider the following pair of general one-sided hypotheses:
(L) H0: WMWprob ≤ H0.WMWprob versus H1: WMWprob > H0.WMWprob
(U) H0: WMWprob ≥ H0.WMWprob versus H1: WMWprob < H0.WMWprob
where 0 < H0.WMWprob < 1 is the null value of WMWprob to be tested; commonly, H0.WMWprob = 0.50. Hypothesis pair L is congruent with the CI.type = "L" one-sided confidence interval, [LCL, 1.00]. Hypothesis pair U is congruent with the CI.type = "U" one-sided confidence interval, [0.00, UCL]. Let P.L and P.U be the standard one-sided p-values for these tests. We know that P.L = 1 - P.U.
Consider the four cases illustrated here. For Case A, P.L = 0.023 and P.U = 0.977, denoted 0.023 : 0.977, is associated with a 95% CI that is just a little above the specified null value, H0.WMWprob. Its standard two-sided p-value is P = 2*min(P.L, P.U) = 0.046, echoing the result for delta = 0 in the previous section. Using the common 0.05 Type I error rate with Bonferroni adjustment, the result 0.024 : 0.976 would be "significant," rejecting H0: WMWprob ≤ H0.WMWprob in favor of H1: WMWprob > H0.WMWprob, because 0.024 < 0.05/2 = 0.025; the 95% CI would fall just to the right of H0.WMWprob. On the other hand, 0.026 : 0.974 would fall short of 0.05 significance.
In Example 1b, 51 ratings for truly "abnormal" scans (Y1) are compared to 58 observations for truly "normal" scans (Y2). The one-sided test compares H0: WMWprob ≤ 0.80 versus H1: WMWprob > 0.80. Because H0 and H1 are interval (not point) hypotheses, neither can be said to be false. The p = 0.010 was computed assuming that the true WMWprob is exactly 0.80, because that yields the maximum (most conservative) p-value over the entire H0: WMWprob ≤ 0.80 interval. Reversing the polarities gives H0': WMWprob ≥ 0.80 versus H1': WMWprob < 0.80, which gives p' = 1 - p = 0.990. If we admit there is virtually no chance that the true WMWprob is exactly 0.80, then H1 and H0' are functionally identical. Thus, for one-sided tests, p-values let us directly compare H0 with H1, and it makes no difference how they are polarized.
Adding a "touch of Bayesianism." For the one-sided test of H0 versus H1, p = Pr[P ≤ p | H0 true]. Reversing the polarities as above, Pr[P' ≤ p' | H0' true] = 1 - p = 0.990. Each of these is the maximal (most conservative) probability over its composite hypothesis. Let the prior probabilities be Pr[H0 true] and Pr[H1 true] = 1 - Pr[H0 true]. Using Bayes' Theorem with these maximal probabilities, we have the surrogate posterior probability,
Pr[H1 true | P] = {(1 - P)*Pr[H1 true]}/{P*Pr[H0 true] + (1 - P)*Pr[H1 true]}.
For P = 0.010 in Example 1b, agnostically setting Pr[H0 true] = Pr[H1 true] = 0.50 gives Pr[H1 true | P] = 1 - P = 0.990. Using this logic, 1 - P becomes a proxy for what researchers usually seek, P[H1 true | P]. However, we could skeptically set Pr[H1 true] = 0.10, which gives Pr[H1 true | P] = 0.917. Very skeptically, setting Pr[H1 true] = 0.01 gives Pr[H1 true | P] = 0.50.
Suppose two independent replications of the same study are conducted and analyzed with the one-sided hypothesis scheme H0: WMWprob ≤ c versus H1: WMWprob > c, 0 < c < 1. Beginning very skeptically, set Pr[H1 true] = 0.01. If Study A gives P = 0.041, the posterior probability is Pr[H1 true | P = 0.041] = 0.191. Suppose Study B yields P = 0.023. Using Pr[H1 true | P = 0.041] = 0.191 as the prior probability for Study B, Pr[H1 true | P = 0.041 & P = 0.023] = 0.909. As every Bayesian will tell you gleefully, this two-study posterior probability is also 0.909 if the computational sequence is P = 0.023 then P = 0.041. This order indifference holds for any set of K p-values.
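The surrogate posterior formula and this sequential updating can be sketched in a few lines (Python, purely for illustration):

```python
def posterior_h1(p, prior_h1):
    """Surrogate posterior Pr[H1 true | P] from a one-sided p-value P
    and a prior probability Pr[H1 true], per the formula above."""
    prior_h0 = 1 - prior_h1
    return ((1 - p) * prior_h1) / (p * prior_h0 + (1 - p) * prior_h1)

# Example 1b: P = 0.010 with an agnostic 50:50 prior gives 1 - P.
print(posterior_h1(0.010, 0.50))   # 0.99

# Two replicated studies, starting very skeptically at Pr[H1 true] = 0.01:
# each study's posterior becomes the next study's prior.
a_then_b = posterior_h1(0.023, posterior_h1(0.041, 0.01))
b_then_a = posterior_h1(0.041, posterior_h1(0.023, 0.01))
print(round(a_then_b, 3), round(b_then_a, 3))   # same either way: 0.909
```

Because each update simply multiplies the prior odds of H1 by (1 - P)/P, the final posterior is indifferent to the order in which the p-values are processed.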
The logic in this section is not specific to WMW analyses, but rather applies to any one-sided test on some parameter, theta. Is this a general way to attach a touch of Bayesianism to frequentist testing in order to obtain what most investigators want to discern from their studies and how they typically interpret p-values?