Balancing false discovery and missed discovery
I believe that another important limitation of multiple comparison adjustments is that they cannot take subject-matter knowledge into account. An important case where this matters is when multiple results all fit a coherent pattern. For example, suppose treatment A appears superior to treatment B for survival, quality of life, health care costs, toxicity, and days of work missed. If each of these five comparisons had P=0.015, then a naive Bonferroni adjustment would give them all P=0.075, which could be misinterpreted as evidence against superiority. In many studies, the ensemble of results may reinforce one another, instead of detracting from one another as automatically assumed by multiple comparison adjustment methods. I've described some other examples in some lecture notes posted at CTSpedia.
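For illustration, here is a minimal sketch (in Python, using the endpoint names and P-values from the hypothetical example above) of what that naive adjustment does:

# Naive Bonferroni adjustment of the five hypothetical endpoint P-values above.
# The adjustment treats the coherent pattern of results as if the five
# comparisons were unrelated, which is exactly the limitation described.
from statsmodels.stats.multitest import multipletests

endpoints = ["survival", "quality of life", "health care costs",
             "toxicity", "days of work missed"]
raw_p = [0.015] * 5  # each comparison favors treatment A with P = 0.015

reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for name, p in zip(endpoints, adj_p):
    print(f"{name}: adjusted P = {p:.3f}")  # each becomes 0.075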
I had thought that such considerations were usually not relevant in extremely high-multiplicity genetic studies, but a trainee recently told me about a case where subject-matter knowledge was important. The smallest P-value was just larger than the conventional cutoff (10^-7.5, I think), but that SNP happened to be one of the few (out of the more than a million tested) known to relate to a plausible mechanism of the disease being studied. Requiring the conventional P-value cutoff for such a special finding does not make sense, but her group was having difficulty getting the study published.
I agree that the "essential disconnect" that Jeffrey writes about is very important, and I expect that this will indeed come up in the next one (or two) journal club discussions. I've discussed this as one of the three fundamental flaws in current sample size conventions; see article reprinted at CTSpedia.
The 4 to 1 ratio of Type II to Type I error was proposed by Jacob Cohen in a 1965 book chapter. He argued (with some caveats) that Type I errors were often about 4 times worse than Type II errors, so with alpha fixed by convention at 0.05, beta should be set to 0.20. I think this is poor reasoning, even if the valuation of the errors follows the 4 to 1 ratio. I don't think that optimization of the tradeoff between alpha and beta given a fixed N will generally result in alpha and beta following that ratio, as implicitly assumed. More importantly, in sample size planning, we are not trading off alpha and beta with a fixed N. We are instead trading off reduction in beta versus increasing cost and burdening more participants. Optimization would therefore require considering the relevant quantities: costs and burdens. In fact, some colleagues and I have argued in detail that planning sample size based only on cost makes more sense than the supposed ideal of planning it without any consideration of cost. Properly considering the tradeoffs also has ethical implications, as we pointed out in a pending letter to the editor regarding the article being discussed.
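To make that tradeoff concrete, here is a minimal sketch (Python; the effect size, alpha, and per-participant cost are assumed for illustration, not taken from any particular study) showing beta falling and cost rising as N grows:

# Sketch of the beta-versus-cost tradeoff for a two-arm trial.
# Assumed inputs: standardized effect size 0.4, alpha = 0.05, and a
# hypothetical cost of $1,000 per enrolled participant.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.4
cost_per_participant = 1000  # hypothetical

for n_per_arm in (50, 100, 150, 200, 300):
    power = analysis.power(effect_size=effect_size, nobs1=n_per_arm, alpha=0.05)
    beta = 1 - power
    total_cost = 2 * n_per_arm * cost_per_participant
    print(f"n per arm = {n_per_arm}: beta = {beta:.2f}, total cost = ${total_cost:,}")
# Nothing in this table singles out beta = 0.20 as the right stopping point;
# that choice depends on how the remaining beta is valued against the added
# cost and participant burden.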
Re both Peter and Knut’s comments on 3.9: I suspect these lines of thought will continue naturally into our next meeting. Let’s set aside the complications of multiplicity for a moment: Both Peter & Knut observe that the relative cost of inferential errors is context-dependent, yet the scientific community has become quite rigid with respect to alpha (demanding 0.05) and is becoming so with respect to beta (0.20 becoming standard). Why should this 4-to-1 ratio be universally imposed? It contributes to meaningless power calculations where the error rates and the feasible N are all constrained, so that the target effect size is forced to be unrealistically large (not the “minimum clinically meaningful value” or even a good prior guess at the actual value). The only remaining option in such cases is to abandon the study. The latter action may sometimes be the correct ethical decision, and the statistician may need to advocate for it to an investigator who really wants to do the study. Unless statisticians can influence journals, funding agencies, etc., to accept well-reasoned alternative choices for error rates, or move to a different inferential framework, we will continually face these difficult situations.
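As an illustration of that constrained calculation, here is a minimal sketch (Python; the feasible sample sizes, alpha = 0.05, and power = 0.80 are assumed) of how fixing the error rates and N forces the "detectable" effect size:

# Sketch: with alpha, power, and the feasible N all fixed, the only free
# quantity left is the effect size the study is "powered" to detect.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_arm in (15, 25, 50):
    d = analysis.solve_power(effect_size=None, nobs1=n_per_arm,
                             alpha=0.05, power=0.80)
    print(f"n per arm = {n_per_arm}: detectable standardized effect size d = {d:.2f}")
# With only 15 per arm the "detectable" d is just over 1, often far larger
# than any minimum clinically meaningful difference or realistic prior guess
# at the true effect.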
The essential disconnect between Neyman-Pearson hypothesis testing and the goals of most scientific studies is highlighted by the related point on dichotomizing results and quantification of evidence. The simple-versus-simple hypothesis testing model used for power calculations requires an alternative point hypothesis, which may be unrealistic (as above) and, even if realistic, will play no special role in the actual analysis, where it is just another point in the space of alternative hypotheses that are really under consideration. At that point, focus falls on the p-value as a measure of strength of evidence, a role for which it is ill-suited. Why the shift in focus? Because choosing between point hypotheses is so rarely what we want to do – we want to summarize evidence, most often over a range of hypotheses. Someone may sometimes have to make actual decisions based on that evidence, such as which set of genes to pursue, but the need to express the evidence is present regardless of whether a decision must follow. Mary asks what measure of evidence should be used – I find the likelihood ratio the most convincing (see work by Jeffrey Blume, such as his 2002 tutorial in Statistics in Medicine, or the book by his advisor Richard Royall, "Statistical Evidence").
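As a purely illustrative sketch of a likelihood-ratio summary of evidence, here is a minimal Python example comparing two candidate means for simulated normal data with known sigma; this shows only the basic idea, not a reproduction of Blume's or Royall's specific methods:

# Sketch: likelihood ratio comparing two candidate means for normal data.
# All numbers (true mean, sigma, sample size, candidate hypotheses) are assumed.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(loc=0.4, scale=sigma, size=30)  # simulated observations

mu0, mu1 = 0.0, 0.5  # two candidate hypotheses about the mean
def loglik(mu):
    return norm.logpdf(data, loc=mu, scale=sigma).sum()

lr = np.exp(loglik(mu1) - loglik(mu0))
print(f"Likelihood ratio L(mu = {mu1}) / L(mu = {mu0}) = {lr:.1f}")
# Royall suggests benchmarks of 8 ("fairly strong") and 32 ("strong") for
# interpreting likelihood ratios as graded strength of evidence.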
Re 3.9, I agree with Peter and would like to expand on this a bit. Sometimes the problem is not the lack, but the abuse, of adjustment for multiplicity. Most "genomic studies," such as genome-wide association and expression studies, are better seen as selection procedures (Bechhofer 1954) than as confirmatory tests. Because the aim in selection procedures is to balance power against the size of the selected set, rather than to control the significance level, "adjusting" P-values for the number of tests is irrelevant to the set being selected, and the choice of an essentially arbitrary cut-off (e.g., 10^-7.5, irrespective of chip density) is misleading. In particular, a larger set (equivalent to a higher P-value cut-off) should be chosen in a smaller study to ensure sufficient power. Statisticians should be as vigilant in preventing the abuse of "correction for multiplicity" in selection studies as they are in enforcing its appropriate use in confirmatory statistics.
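A small simulation sketch (Python; the number of SNPs, number of true associations, effect size, and study sizes are all assumed, and the z-statistic is a stylized approximation that grows with sqrt(n)) of the difference between a fixed genome-wide cut-off and selecting a ranked set of fixed size:

# Sketch: fixed P-value cut-off versus selecting a ranked set of fixed size.
# Stylized model: null SNPs have z ~ N(0, 1); truly associated SNPs have their
# z shifted by effect * sqrt(n).  All numbers are assumed for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, m_true = 100_000, 20   # SNPs tested; truly associated SNPs
effect = 0.15             # assumed per-SNP standardized effect

def simulate_pvalues(n):
    z = rng.normal(size=m)
    z[:m_true] += effect * np.sqrt(n)   # the first m_true SNPs carry the signal
    return 2 * norm.sf(np.abs(z))

for n in (500, 5000):
    p = simulate_pvalues(n)
    cutoff_hits = np.sum(p[:m_true] < 5e-8)       # fixed genome-wide cut-off
    top_set = np.argsort(p)[:200]                 # selected set of fixed size
    selected_hits = np.sum(top_set < m_true)
    print(f"n = {n}: true SNPs below 5e-8 = {cutoff_hits}, "
          f"true SNPs in the selected top-200 set = {selected_hits}")
# In the smaller study the fixed cut-off finds essentially nothing, while the
# selected set of modest size still carries many of the true signals forward
# (and choosing a larger set in the smaller study, as argued above, would
# recover even more).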
Re 3.5, the focus on the "assumption [of a] Gaussian distribution" may also be misleading. First, least-squares methods are highly robust against deviations of the empirical distribution of residuals from the Gaussian distribution (Scheffe 1959). Second, lack of a "significant" result in a test for deviation does not prove the null hypothesis (of a Gaussian distribution). Hence, requiring "empirical verification" of assumptions could create the very problems it is supposed to address. Finally, the focus on the Gaussian distribution may cause other assumptions, such as the adequacy of the measure of central tendency being used (arithmetic mean, geometric mean, median, ...), to be overlooked. Which measure of central tendency to choose can rarely be decided (or verified) from the data; instead, subject-matter knowledge needs to be applied to select this measure. In particular, an approximate answer to the correct question may be better than an exact answer to the wrong question.
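A small sketch (Python; the simulated lognormal "cost-like" data are an assumed example) of how the arithmetic mean, geometric mean, and median answer different questions on skewed data, none of which a normality test can settle:

# Sketch: three measures of central tendency on skewed (lognormal) data.
# The data are simulated purely for illustration.
import numpy as np
from scipy.stats import gmean

rng = np.random.default_rng(2)
x = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # e.g. costs or concentrations

print(f"arithmetic mean: {x.mean():7.1f}")      # relevant for totals and budgets
print(f"geometric mean:  {gmean(x):7.1f}")      # relevant for multiplicative scales
print(f"median:          {np.median(x):7.1f}")  # relevant for the 'typical' value
# The three summaries differ substantially here; which one is "correct" is a
# subject-matter question about what is being estimated, not something a
# normality test on the residuals can decide.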
MaryBanach - 24 May 2012 - 12:12
I like Peter's point about thinking about the "amounts of evidence," but how do we code this? Laura Lee Johnson and I have had a few conversations about how to do this type of scoring or coding. What do people suggest that we do?