A challenge to BERD

PeterBacchetti - 14 Aug 2012 - 18:25

I’m sorry to see only one response so far to this challenge. This may reflect what a difficult task this is to take on. I would say that the key problem is the widespread and severe misinterpretation of statistical results, notably the problem Frank Harrell mentioned on the call of concluding “no difference” based only on P>0.05. My own view is that a wholesale switch to likelihood or Bayesian methods would be neither necessary nor sufficient to solve the problems. I think that use of estimates, confidence intervals, and even P-values would be OK if done well, and it would be just as easy for researchers to reach faulty conclusions of “no difference” based on a likelihood ratio >0.10 as it is now based on P>0.05. Switching might provide a fresh start, but preventing the same problems from arising in new forms would likely be a daunting task, particularly because the sociology of how this happened the first time does not seem to be well understood. The alternative to advocating a switch is to promote proper use of the standard ideas, which I have been trying to do for a long time, notably here, here, and here.
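To make the “no difference” problem concrete, here is a minimal sketch using made-up counts (12/40 responders versus 18/40; these numbers are purely illustrative and come from no real study). It assumes a simple pooled two-proportion z-test and a Wald confidence interval, which are only one of several reasonable choices. The point is that P>0.05 can coexist with an interval that includes large differences.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts, chosen only for illustration.
events_a, n_a = 12, 40   # e.g. control arm
events_b, n_b = 18, 40   # e.g. treatment arm

p_a, p_b = events_a / n_a, events_b / n_b
diff = p_b - p_a

# Two-sided P-value from a pooled two-proportion z-test.
pooled = (events_a + events_b) / (n_a + n_b)
se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% Wald confidence interval for the difference (unpooled standard error).
se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * se_unpooled, diff + z_crit * se_unpooled

print(f"difference = {diff:.2f}, P = {p_value:.3f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

With these counts the P-value is above 0.05, yet the interval runs from a small difference in one direction to a difference of more than 30 percentage points in the other, so the data are far from showing “no difference”.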

As Paul and Jeff note, flaws in current practice and better alternatives have been shown clearly by many writers, but my impression is that poor practices are only getting more entrenched. I think that the idea of statistical “hypothesis testing” is the core driver of the problems: it leads people to expect individual randomized trials (and even other studies) to provide a definitive conclusion one way or the other. If the Neyman-Pearson framework had instead been called “automatic decision making”, that might have helped. Automatic decision making is fine in private situations such as industrial quality control, where the experimenter also has full ability to take the decided course of action. The purpose of statistical analysis is entirely different when it is to be published in order to convey information to others; in those cases, elucidating and clearly conveying the evidence provided is what matters, and the automatic decision framework is not helpful. This mismatch between the real purpose of clinical and translational research and the dominant conceptual framework may warrant further emphasis in a paper from BERD. It has implications for both teaching and peer review, and it could be a somewhat fresh theme. Also fairly novel might be empirical documentation of misinterpretations of P>0.05 in leading journals, although I think this would require careful thought and could be a lot of work.

A paper could also set the stage for ongoing monitoring and publicizing of abuses. Right now, egregious misinterpretations may be so common that they are rarely challenged (hence Frank’s remark). Focusing initially on a single journal, with a letter submitted to challenge each misinterpretation, could help to make further progress, although this might be too much effort or might be stonewalled by editors. I think this would require participation from many BERD statisticians, but it might have a noticeable impact.

I don’t know much about the FDA, but I’m pretty sure that they do not make automatic decisions based on pre-specified criteria. Don’t they weigh estimated effects and attained P-values (rather than just noting whether or not P<alpha)? So even there, the statistical “hypothesis testing” framework is off target.

JeffreyWelge - 07 Jul 2012 - 09:22

Thanks for the comments, Paul. I think that in the "simple vs. simple" case the likelihood methods are so far superior to classical p-values and confidence intervals that it is no contest. In more complex situations (composite hypotheses, nuisance parameters), whether there is an advantage over going Bayesian is not clear to me. Of course adopting full Bayes has its own barriers. Blume and Royall acknowledge that there is no completely satisfactory solution, but either their methods or full Bayes seem much better to me than the classical approaches. I would strongly support a KFC initiative to promote these as alternatives. We would probably not agree that p-values should be dropped altogether, but among the many statisticians who have expressed an opinion over the decades, the p-value seems to have few defenders as a primary measure of evidence. Some say that p-values still have their place along with other tools but need to be de-emphasized, and that does seem to have happened somewhat. If there is a real "silent majority" who want to retain p-values as the primary measure of evidence, I suspect they are subject matter experts rather than statisticians, and the issue is inertia more than any real belief that p-values are a particularly good thing.

I think it is true that we cannot get away from putting continuous things into some small number of categories for the purpose of drawing conclusions about the results of a study. It would be nice if in the future the publication model included tools that allow anyone to interact directly with possibly continuous-valued evidence: likelihood functions or even, since we may disagree about the choice of likelihood function, the raw data. But we will always expect investigators to offer their own interpretation of the results, and that means some type of classification. The classification need not be so coarse as “accept/reject”, but some small number of grades of evidence might become standard (we kind of have that with the “asterisk” system of p-values, but the p-value is not a good measure of evidence). If you provide a likelihood ratio between two hypotheses, you may wish to put it in context by stating that it represents “strong evidence favoring the null hypothesis over the alternative”; it would then be good for us to have some guidelines, which cannot help but be somewhat arbitrary, about what ratios constitute “strong”. Jeffreys offered such guidelines for the Bayes factor, and Royall has offered a nice thought experiment that yields intuition about the magnitude of likelihood ratios.

I am not an insider on FDA’s decision process, but it is only the final decision that needs to be “thumbs up/thumbs down”. There is no reason why the statistical analysis of individual studies has to include a binary decision, and I doubt that there are many cases where approval hung directly on getting p<alpha in one trial.

PaulWakim - 29 Jun 2012 - 11:09

First, thank you for the BERD Online Journal Club and for the opportunity to post comments to this forum.

Two somewhat unrelated comments:

1. Yesterday’s general discussion on the “problems” with presenting results as p-values, confidence intervals and standard errors is one that I have heard and read many times before over many years. But what are we (biostatisticians) doing about it? Bayesian statisticians would say that they already have the answer. Do we agree that they have a good point? And if so, why don’t we all push for Bayesian statistics?

Another way of thinking: In May 2011, I attended a workshop at the SCT Annual Meeting in Vancouver, titled “Likelihood Methods for Measuring Statistical Evidence”, by Jeffrey Blume and Gregory Ayers, from the Department of Biostatistics at Vanderbilt University School of Medicine. It was a very interesting workshop. They discussed the strength of evidence measured by the likelihood ratio of two hypotheses. That measure is comparable between studies, regardless of what subject it is. The speakers presented this approach (which is not new either) as an alternative to p-values. Even though I know almost nothing about the technical pros and cons, it made a lot of intuitive sense to me. Is this line of thinking worth pursuing?
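As a rough illustration of the likelihood-ratio idea described above, here is a minimal sketch for the simplest case: binomial data and two simple hypotheses about a response rate. The data (14 responders out of 30) and the two hypothesized rates (0.5 and 0.3) are invented for illustration, and the function names are my own; this is not taken from the workshop materials.

```python
from math import comb

def binomial_likelihood(p, successes, n):
    """Probability of observing `successes` out of `n` if the true rate is p."""
    return comb(n, successes) * p**successes * (1 - p) ** (n - successes)

def likelihood_ratio(p1, p0, successes, n):
    """How many times more probable the observed data are under p1 than under p0."""
    return binomial_likelihood(p1, successes, n) / binomial_likelihood(p0, successes, n)

# Hypothetical data: 14 responders out of 30 patients.
successes, n = 14, 30

# Compare H1: rate = 0.5 against H0: rate = 0.3 (both values are assumptions).
lr = likelihood_ratio(0.5, 0.3, successes, n)
print(f"Likelihood ratio (0.5 vs 0.3): {lr:.2f}")
```

A ratio above 1 means the observed data are more probable under the first hypothesis than under the second, and the size of the ratio is the measure of strength of evidence. The binomial coefficient cancels in the ratio, so only the two hypothesized rates and the observed counts matter.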

So here’s the bottom line: Would the BERD KFC be interested in addressing and recommending an alternative (or alternatives) to p-values (and confidence intervals and standard errors) for presenting clinical trial results? I realize it is a formidable task, but it is one that could make a tremendous contribution to the field of biostatistics. And what better group to address this issue than this group of biostatisticians from the best universities in the country, which is charged with making a difference in the general field of clinical and translational research?

When subject matter experts and applied biostatisticians (like me) hear the same story from different prominent biostatisticians involved in statistical methods research, they accept it and apply it. But when they hear contradictory stories, they throw up their arms and give up. In this case, do we have a solution that we would agree on?

2. I agree with the point made yesterday on the call that we (biostatisticians) should try to move away from dichotomous decisions (reject or do not reject the null hypothesis). We at NIH can certainly try to do that. For example, instead of focusing the primary hypothesis on one time point, we can look at the results from a longitudinal model over the course of the whole trial and describe the primary result as the “story” that the longitudinal model tells us. For instance: compared to placebo, the new treatment seems to have its biggest effect after 4 weeks, but then its effect fades away by Week 12. But, and this is my point, the FDA has no choice. At the end of the day, it needs to have a “thumbs up/thumbs down” decision, with the decision rule clearly specified upfront. And so the biostatistics field, particularly as it pertains to the FDA, cannot do away with such dichotomous decisions.