
Title: BERD - Pain Presentation



FrankHarrell - 18 Aug 2011 - 10:10

A few random comments -

- This is a lot of work in SAS compared to the ordinal package in R (handles random effects)

- Sparseness does not have to do with non-proportional odds except in a strange way: the SAS PROC LOGISTIC test for proportional odds (which doesn't reference my student Bercedis Peterson's paper which invented the test and pointed out its shortcomings) will strongly reject H0:proportional odds even when there is perfect proportional odds, when cells are sparse.

- I suspect that your observations about standard errors when things become more sparse are related to the above. For proportional odds assessment I rely on partial residual plots.

Regards, Frank

Michael Berbaum - 17 Aug 2011 - 16:08

Greetings, I can offer a couple more examples of pain analyses. Both designs are 3-arm RCTs examining pain control during interventional radiology procedures. Thus, at baseline, before the procedure begins, average pain level should be nearly equivalent in the three groups. Patients are repeatedly asked their pain (and anxiety) levels at regular intervals (repeated measures). One key feature is that as patients' procedures are completed, they "drop out" so that observations end when the patient with the longest duration completes. With increasing sparsity, the standard errors around the groups' curves grow substantially! In study #1 we used a normal mixed model (SAS PROC MIXED or BMDP 5V); in study #2 we used a proportional odds model with random intercepts (SAS PROC NLMIXED). The proportional odds assumption failed owing to the sparsity of data at later times and at higher pain levels. We collapsed levels 9 and 10 into level 8 and then we were OK. We struggled to find an understandable graph of results and ended up showing binary "splits" at various thresholds. I've appended some SAS code for analyses in #2, prepared mostly by Ms. Xinyu Li, M.S., co-author on the second paper. If I had it to do again, I would think more about controlling baseline covariates and the MAR assumption we relied on. I hope this is helpful, and I'd welcome any comments on the approach we took.

#1 Lang, Elvira V., Benotsch, Eric G., Fick, Lauri J., Lutgendorf, Susan, Berbaum, Michael L., Berbaum, Kevin S., Logan, Henrietta, and Spiegel, David (2000). Adjunctive non-pharmacological analgesia for invasive medical procedures: a randomised trial. The Lancet, vol. 355, issue 9214, pages 1486-1490, April 2000. doi:10.1016/S0140-6736(00)02162-0

#2 Lang EV, Berbaum KS, Faintuch S, Hatsiopoulou O, Halsey N, Li X, Berbaum ML, Laser E, Baum J (2006). Adjunctive self-hypnotic relaxation for outpatient medical procedures: A prospective randomized trial with women undergoing large core breast biopsy. Pain, 126(1-3): 155-164, Dec 15, 2006. PMID: 16959427.

Best regards, --Mike

-- Michael L. Berbaum, Ph.D., Director Methodology Research Core Institute for Health Research and Policy (MC 275) University of Illinois at Chicago 1747 West Roosevelt, Room 558 Chicago, Illinois 60608 Tel: (312) 413-0476 Fax: (312) 996-2703 Email: IHRP web site:

FrankHarrell - 15 Aug 2011

I hear ya.

Thanks Frank

KnutWittkowski - 15 Aug 2011

The fundamental difference from many other methods is that ambiguities are allowed. We don't need to make strong assumptions (proportionality, linearity, independence) to ensure that the pairwise ordering among all subjects can be decided.


FrankHarrell - 15 Aug 2011

Hi Knut,

I'm in England, with apologies for the slow reply. Yes your example isn't controversial. Other combinations are harder to figure, which is why I like to adjust for baseline as a covariate instead.

KnutWittkowski - 15 Aug 2011


I agree that this discussion is proving more interesting than I had expected, including two of your recent remarks.

In fact, I wonder whether we should edit this discussion and make it available in some form.

First, I fully agree that "making the subject its own control" is heavily overrated among clinicians. As we have seen, it is less than trivial to formalize the concept of "change", be it difference, ratio, sign, ... Still, there may be cases where baseline values should be incorporated.

Which brings us to your second comment.

U-scores for multivariate data (Hoeffding 1948) are based on the assumption that - everything else being the same - more in any of the pain characteristics is worse. No linearity, proportionality, or independence being assumed. For instance, if subject A is coming in with a lower baseline and a higher outcome VAS than subject B, then A had less of a response than B. I would not expect too much of a controversy here.

The fundamental difference from many other methods is that ambiguities are allowed. We don't need to make strong assumptions (proportionality, linearity, independence) to ensure that the pairwise ordering among all subjects can be decided.


KnutWittkowski - 15 Aug 2011

Hi Ron,

Thanks for the clarification!

Of course, knowing the scale does not automatically lead to the method, but it restricts the methods one can use. Chi-square only for nominal, u-test for ordinal, u- or t-test for ratio/interval only.

It is also important to remember that rank tests are no panacea (see Scheffe, 1959, chapter 10) and that the (apparent, assumed, ...) distribution of the data is not very helpful in choosing between tests that are both asymptotically distribution-free (like the t- and the u-test).

Still, I'd rather use a test that is approximately right, than one that is exactly wrong. If all we were interested in is alpha, we might simply toss 17 coins and if we get fewer than five of either heads or tails, we have a test for the 5% level, no need to even gather data ;-) Hence, we need to also consider what alternatives the tests are sensitive against, like deviations from the arithmetic mean being zero for the t-test vs deviations of the tendency among paired comparisons from 50:50 (not the median!) for the u-test.
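As a rough check of the coin-toss remark, the rejection probability can be computed directly from the binomial distribution. This is only a sketch; the function name and its defaults are illustrative and assume fair coins.

```python
from math import comb

def coin_toss_rejection_probability(n_coins: int = 17, cutoff: int = 5) -> float:
    """P(fewer than `cutoff` heads or fewer than `cutoff` tails) for fair coins."""
    # "Fewer than cutoff heads" means 0..cutoff-1 heads. By symmetry the
    # "fewer than cutoff tails" event has the same probability, and the two
    # events cannot both occur when 2 * cutoff <= n_coins.
    one_tail = sum(comb(n_coins, k) for k in range(cutoff)) / 2 ** n_coins
    return 2 * one_tail

alpha = coin_toss_rejection_probability()
print(f"rejection probability = {alpha:.4f}")  # 6428 / 131072, just under 0.05
```

The "test" has the right size and no power at all, which is the point of the remark: size alone says nothing about the alternatives a test can detect.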

What's missing in many of these cook book rules is that we cannot choose a test by looking at the characteristics of the data (distribution) and the variables (scale), we also have to formalize the question (type of alternative) of interest.


FrankHarrell - 14 Aug 2011

Same here! Great discussion! Frank

RonaldThisted - 14 Aug 2011

(k) Enough ramblings for now. Thanks to all for the excuse to avoid doing other stuff on my plate.

Cheers, Ron Thisted

FrankHarrell - 14 Aug 2011

It is easy to convert the odds ratio and other parameters in the model to the mean or median pain score. Also, exceedance probabilities come straight out of the model and are easily interpreted by clinicians.
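A minimal sketch of that conversion, assuming the parameterization logit P(Y >= j) = intercept_j + linear predictor and pain levels scored 0..K. The intercepts and the log odds ratio below are made-up numbers for illustration, not estimates from any study.

```python
from math import exp

def expit(z: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + exp(-z))

def exceedance_probs(intercepts, linear_predictor):
    """P(Y >= j) for j = 1..K under a proportional odds model.

    intercepts[j - 1] is the intercept for the event Y >= j, so the
    intercepts must be decreasing in j.
    """
    return [expit(a + linear_predictor) for a in intercepts]

def mean_score(intercepts, linear_predictor):
    """E[Y] for levels scored 0..K: the sum of the exceedance probabilities."""
    return sum(exceedance_probs(intercepts, linear_predictor))

# Hypothetical values, for illustration only.
intercepts = [2.0, 1.0, 0.0, -1.0, -2.0]  # pain levels 0..5
log_odds_ratio = -0.7                     # treated vs control

for label, lp in [("control", 0.0), ("treated", log_odds_ratio)]:
    probs = exceedance_probs(intercepts, lp)
    print(f"{label}: P(Y>=3) = {probs[2]:.3f}, mean = {mean_score(intercepts, lp):.3f}")
```

The mean follows from the identity E[Y] = sum over j of P(Y >= j) for a variable taking values 0..K, so one odds ratio translates into a shift on the original pain scale.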

RonaldThisted - 14 Aug 2011

(h) Changes in pain (or in other symptoms that also have a subjective or self-report element) may make sense in some contexts and not in others. For instance, pain after surgery eventually gets better. The focus may be on how rapidly this occurs. On the other hand, if one is studying chronic pain (that is, pain that one would not expect to improve in the natural course of things), then the focus is definitely on how much improvement in pain can be achieved, and in what fraction of patients.

(i) Consider the situation in which patients are randomized to two treatments, and two hours later, a VAS pain measurement is taken. If the point of an exercise is testing the null hypothesis of no difference between groups, lots of sensible and familiar tests will work just fine, in the sense that they will be valid tests of H0 and will have (approximately) the right size. For this purpose, the difference between a t-test and a proportional-odds regression based measure (taking each unique observed VAS score as a "cutpoint") will depend upon the alternatives against which one wants to have greatest power.

(j) If the point is to estimate the size of the treatment effect, then one has to have some sense as to what differences on some scale mean. In the context of anesthesia for certain particular operative procedures, for instance, VAS pain measurements of 30 or below are considered adequately low scores, and pre-treatment scores of 50 or 60 are typical. In this context the mean and SD of VAS scores have a clinical interpretation (which may not extend to other contexts). The odds ratio (from a proportional odds-based analysis) would not be easily understood or communicated, and it would be hard to relate to what clinicians already understand about how the VAS works in this particular context--even if it were the basis for a more sensitive test of the differences between the VAS distributions under the two treatments.

FrankHarrell - 14 Aug 2011

I haven't experienced that problem. You can model baseline using dummies or using quadratic or spline functions.

RonaldThisted - 14 Aug 2011

(g) I agree with Knut that changes in pain can (and often should) be analyzed using methods other than simply taking the numerical difference in pain scales. Transition models (with a small number of defined ordered categories) are often successful at doing this. Proportional odds models, while incredibly useful for comparing groups at a single point in time (such as the completion of a randomized clinical trial), are less easily used when one wants to make inferences conditional on, say, a baseline variable that itself is measured as an ordered category (for instance, baseline pain assessment).

FrankHarrell - 14 Aug 2011

I don't think that follows. I agree that clinicians think this is more interpretable but I think they are largely fooling themselves, mainly because of floor and ceiling effects. An unbiased estimate of current status is going to be quite useful, and can be calibrated in the sense you are saying, by including baseline level (or a spline function of it) as a covariate.

RonaldThisted - 14 Aug 2011

(f) Changes in pain scores (as opposed to changes in other kinds of scores) can be particularly important, since within-subject pain scores are likely to be much better calibrated than between-patient scores. So from an interpretability standpoint, clinicians and others often find changes to be more compelling than raw scores. And as we know, if the within-subject correlation exceeds 0.5, there are efficiency gains to the use of difference scores.
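The 0.5 threshold can be seen from a short variance calculation. It assumes that baseline Y_0 and follow-up Y_1 have the same variance sigma^2 and correlation rho, and it compares the change score only with the raw follow-up score.

```latex
\operatorname{Var}(Y_1 - Y_0)
  = \sigma^2 + \sigma^2 - 2\rho\sigma^2
  = 2\sigma^2(1 - \rho),
\qquad
2\sigma^2(1 - \rho) < \sigma^2 \iff \rho > \tfrac{1}{2}.
```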

FrankHarrell - 14 Aug 2011

Often I see major non-proportionality yet the PO model fit better than all the other models I was entertaining.

RonaldThisted - 14 Aug 2011

(e) I agree with Frank that the proportional odds and related models are not known (or used) widely enough. As with almost all models, the assumptions under which they work best (constant proportionality of odds between groups) always hold only approximately. Conditional on actually using a proportional odds model, examining the extent to which proportionality holds, and critically assessing the extent to which it really matters whether it holds, are also not done widely enough.


The real problem with the central limit theorem is that for a given dataset we don't know if it applies (this is more true for highly skewed Y).

RonaldThisted - 14 Aug 2011

(d) The utility of a particular analysis depends more on the study design and the substantive question than on the scale of measurement. The central limit theorem works wonders in many situations. (For instance, if one applies the two-sample t-test to binary data--ordinal scale at most--the test is essentially equivalent to the chi-squared test for comparing proportions.)
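The equivalence in the parenthetical can be checked numerically. The sketch below uses hypothetical 0/1 data and compares the squared pooled-variance t statistic with the Pearson chi-squared statistic (no continuity correction); for 0/1 data the two are linked exactly by t^2 = chi2 * (n - 2) / (n - chi2), so they differ little once n is moderately large.

```python
from math import sqrt

def pooled_t_statistic(x, y):
    """Two-sample pooled-variance t statistic."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def pearson_chi2_2x2(x, y):
    """Pearson chi-squared (no continuity correction) for two binary samples."""
    n1, n2 = len(x), len(y)
    p1, p2 = sum(x) / n1, sum(y) / n2
    p = (sum(x) + sum(y)) / (n1 + n2)  # pooled proportion
    return (p1 - p2) ** 2 / (p * (1 - p) * (1 / n1 + 1 / n2))

# Hypothetical binary outcomes: 30/100 responders vs 45/100 responders.
group_a = [1] * 30 + [0] * 70
group_b = [1] * 45 + [0] * 55

t = pooled_t_statistic(group_a, group_b)
chi2 = pearson_chi2_2x2(group_a, group_b)
n = len(group_a) + len(group_b)
print(f"t^2 = {t * t:.4f}, chi2 = {chi2:.4f}")
# Exact algebraic link between the two statistics for 0/1 data.
print(f"chi2 * (n - 2) / (n - chi2) = {chi2 * (n - 2) / (n - chi2):.4f}")
```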

FrankHarrell - 14 Aug 2011

Very nice discussion Ron. It should be noted that the Wilcoxon test almost always tests a stochastic ordering hypothesis that is relevant. We tend to get ourselves in trouble when we use the t- or normal approximation for getting P-values with the Wilcoxon. If you have scale differences (or other simple translation differences) you can get very accurate P-values using the general U-statistic standard error, as implemented in the R Hmisc package's rcorr.cens function.

RonaldThisted - 14 Aug 2011

Dear Laura Lee, Frank, Knut, Greg, et al:

A few random thoughts on pain stimulated by the (less random) notes of others:

(a) Regarding Laura Lee's original request, Thomas Permutt at FDA has done some very thoughtful work on analyzing pain outcomes in the context of clinical trials. I am not sure if his work has been published, but it has been influential in the design and analysis of Phase III studies of drugs intended to affect pain. A key reference is the IMMPACT recommendations (2005, "Core outcome measures for chronic pain clinical trials: IMMPACT recommendations," Pain 113: 9-19).

(b) The emphasis on scale of measurement (ordinal, interval, ratio, etc) has the potential to side-track us from the most important questions of design, analysis, and inference. As often as not, focusing on scale of measurement can be misleading. It is particularly pernicious when it leads to automatic choices of the "correct" statistical analysis based on measurement characteristics and not consideration of the study design, distributional characteristics of the measurements, subject-matter knowledge, and identification of the question that really needs to be answered. The outstanding paper by Velleman and Wilkinson makes a convincing case. [Velleman, P. F. & Wilkinson L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. American Statistician, 47, 65-72.]

(c) The identification of a particular statistical test with a scale of measurement often gets things badly wrong. For instance, it is commonly stated that the t-test assumes an interval scale, while the Wilcoxon (Mann-Whitney) test assumes only an ordinal scale. That is not correct. In fact, the two-sample Wilcoxon procedure relies on the assumption that the two distributions differ only in location and not in shape. In particular, it assumes that variances and skewness are identical in the two distributions, and that one is simply a shifted version of the other. Simply having an ordinal scale of measurement is not sufficient to justify the validity of the Wilcoxon test. Indeed, if the two groups are normally distributed and have the same mean, but one standard deviation is twice the size of the other, the size of a nominal 0.05 Wilcoxon test is actually 0.074 (JW Pratt, JASA 1964, 59: 665-80).
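The effect of unequal spread on the size of the Wilcoxon test can be explored with a small simulation. This is a sketch only, not a reproduction of Pratt's calculation: the sample sizes, seed, and number of replicates are arbitrary choices, and the P-value uses the usual normal approximation with the equal-distribution null variance.

```python
import random
from math import erf, sqrt

def wilcoxon_two_sided_p(x, y):
    """Two-sided Wilcoxon rank-sum P-value, normal approximation, no ties assumed."""
    n1, n2 = len(x), len(y)
    # Rank the pooled sample; with continuous data ties are essentially impossible.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = sum(rank for rank, (_, g) in enumerate(pooled, start=1) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # variance assumed under H0
    z = (u - mean_u) / sd_u
    return 1 - erf(abs(z) / sqrt(2))  # two-sided standard normal tail area

def simulated_size(n1, n2, sd1, sd2, n_sim=2000, alpha=0.05, seed=1):
    """Fraction of rejections when both groups are normal with mean zero."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if wilcoxon_two_sided_p(x, y) < alpha:
            rejections += 1
    return rejections / n_sim

# Arbitrary illustrative configurations: (n1, n2, sd1, sd2).
configs = [(30, 30, 1.0, 1.0), (30, 30, 1.0, 2.0), (15, 45, 2.0, 1.0)]
sizes = {cfg: simulated_size(*cfg) for cfg in configs}
for cfg, size in sizes.items():
    print(cfg, f"estimated size = {size:.3f}")
```

How far the size drifts from the nominal 0.05 depends on the sample sizes as well as the spread; it is larger when the smaller group has the larger standard deviation.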

Ronghui (Lily) Xu - 14 Aug 2011

Hi Laura,

If it adds to it, our group have also worked on brain imaging and meta-analysis aspects of pain research:

Leung A, Duann J, McGreevy K, Li E, Xu R, Donohue M, et al. The supraspinal pain pathway of the thermal grill illusion. NeuroImage, 2009; 47(Supplement 1): S61-S61.

Leung AY, Donohue M, Xu R, Lee R, Lefaucheur J, Khedr E, Saitoh Y, Andre-Obadia N, Rollnik J, Wallace M, Chen R. rTMS in suppressing neuropathic pain: a meta-analysis. The Journal of Pain, 2009; 10(12): 1205-16.

Thanks, Ronghui (Lily) Xu Professor Division of Biostatistics and Bioinformatics Department of Family and Preventive Medicine and Department of Mathematics Director, CTRI Design and Biostatistics University of California, San Diego 9500 Gilman Drive, Mail Code 0112 La Jolla, CA 92093-0112

FrankHarrell - 14 Aug 2011

Hi Knut,

Good discussion. I think the score you've specified will make even more assumptions than the proportional odds assumption though.

I don't think that change will do better adjusting for baseline differences, because of floor and ceiling effects.

Best, Frank

KnutWittkowski - 14 Aug 2011

A deterministic reply to a random comment: analyzing changes in pain could potentially adjust for differences in baseline pain perception without the need of making assumptions about proportionality. Of course, "analyzing changes" does not necessarily mean "computing differences of scores". For instance, one could score a particular subject's response (outcome vs baseline) as

- the number of subjects with a larger-or-equal baseline and a smaller-or-equal outcome (smaller effect) minus
- the number of subjects with a smaller-or-equal baseline and a larger-or-equal outcome (larger effect).

These 'u-scores' would score changes on one (or several) ordinal outcomes without computing differences.
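The scoring rule above can be sketched in a few lines. The function follows the wording literally; the function name and the five subjects are hypothetical, and this is not the MuStat implementation.

```python
def u_scores(baseline, outcome):
    """Score each subject's (baseline, outcome) pair against all subjects.

    For subject i, count the subjects with a larger-or-equal baseline and a
    smaller-or-equal outcome, then subtract the count with a smaller-or-equal
    baseline and a larger-or-equal outcome. Pairs that cannot be ordered on
    both components contribute nothing to either count.
    """
    n = len(baseline)
    scores = []
    for i in range(n):
        first = sum(
            1 for j in range(n)
            if baseline[j] >= baseline[i] and outcome[j] <= outcome[i]
        )
        second = sum(
            1 for j in range(n)
            if baseline[j] <= baseline[i] and outcome[j] >= outcome[i]
        )
        # A subject identical to i (including i itself) is counted in both
        # sums and therefore cancels out.
        scores.append(first - second)
    return scores

# Hypothetical pain scores (0-10) for five subjects.
baseline = [7, 5, 8, 6, 5]
outcome = [3, 5, 2, 6, 4]
print(u_scores(baseline, outcome))  # [-2, 3, -4, 2, 1]
```

With this literal sign convention, the subject going from 8 to 2 gets the most negative score, so lower scores mark larger responses. The subjects going 6 to 6 and 5 to 5 cannot be ordered against each other and simply do not count in each other's scores.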


FrankHarrell - 14 Aug 2011

A random comment: I think it is a mistake to analyze change in pain status. The difference in two ordinal scales is not ordinal. There are many reasons to have the final pain severity as the outcome, adjusted for initial severity as a baseline covariate.

A nice feature of the proportional odds model is that you can have as many categories as you have unique Y values.


Kathryn Chaloner - 14 Aug 2011

Hi all

Like John Connett I was involved in the Shlay (1998) study. The more we looked and analyzed the Gracely continuous scale, I think it is fair to say as the study went on, the less we believed that it measured something real. We also had a "Global Pain Relief Scale" that was ordered "Complete, A lot, ... none, Pain got worse" that was more believable in interpretation and in analysis. Fortunately results were consistent.

The rationale for the Gracely scale was that a previous study in diabetic peripheral neuropathy had used the scale - and so for the HIV and acupuncture design there was data - and there was a lot of support for using the same scale from clinicians. In retrospect not a great idea to perpetuate a bad endpoint.

With hindsight, the simple global pain relief scale made a much better endpoint and analysis and was much more interpretable. We used ordinal response models.


Kathryn Chaloner, PhD

319 384 5029

KnutWittkowski - 14 Aug 2011

Hi Laura Lee,

I find the empirical "validation" for using methods based on the linear model for individual VAS scales less convincing, but the real problem lies in the complexity of measuring complex phenomena, such as pain, on a variety of scales.

Shlay (1998): Patients rated their pain in a diary once daily, choosing from the Gracely scale of 13 words that describe the intensity. The words had been assigned magnitudes on the basis of ratio-scaling procedures that demonstrated internal consistency, reliability, and objectivity. --- Comparison of treatment groups for the primary end point of change in pain, as measured by the pain diary, used a linear model with baseline characteristics, clinical unit, and option (factorial or single factor) as covariates.

Griffith (2008): The primary outcome measure was the mean difference in the subjects' self-reported pain scores before and after the administration of the initial medication treatment. A pain score reduction of 3 or more points after the initial treatment was considered clinically effective and used as a cutoff point to dichotomize the primary outcome measure for multivariate statistical analyses.

I agree with Frank that ordinary regression may not be appropriate to generate valid comprehensive scores. Trying to avoid the problem by dichotomizing the outcomes at an arbitrary cutoff point may also not be a good solution.

Under WebServices/MuStat, CTSPedia offers biostatistical tools (spreadsheets, R package, and Web server) that help to resolve some of these problems by creating scores/metrics that are intrinsically valid, because fewer assumptions need to be made and, thus, empirically "validated".

BTW, in a collaboration with the NINR we are currently using the same WebServices/MuStat to screen for genetic risk factors of fibromyalgia, yet another way of addressing the many open questions in pain research using the novel methods and tools developed by BERD.

Here are the references:
Morales (2008): (complex phenotypes, such as pain)
Wittkowski (2010) (comprehensive overview in a book with many CTSA contributions)
Rubio (2011): (on the crossfertilization of BERD developing metrics applied both by and to BERD practitioners)


John Connett - 13 Aug 2011


An example of pain measurement and analysis in an acupuncture study:

Shlay, Chaloner et al. (1998) "Acupuncture and amitriptyline for pain due to HIV-related peripheral neuropathy," JAMA 280: 1590-1595.

John C.

FrankHarrell - 13 Aug 2011


I'm amazed that I still see people analyzing ordinal scales using ordinary regression. The proportional odds model and its cousins are still not known to vast areas of research.


GregoryStoddard - 13 Aug 2011


Pain research usually involves a visual analog scale (VAS) measurement of pain. There is confusion, however, about whether these can be analyzed as a continuous variable (interval scale), or whether they should be considered an ordered categorical variable (ordinal scale). That is, there is inconsistency in how these scales are analyzed. You could clear up this confusion in your talk.

I introduce this topic on page 3 of Ch 2-6 of my course manual that is available in CTSpedia, the educational materials section. Another website for it is given in the footnote on the first page. In that chapter, I provided citations and justification of why it can be analyzed as an interval scale. I have attached two papers I cited.

On p 41 of Ch 2-1, I give the taxonomy of levels of measurement, which you might use as background material.



Demetrios Kyriacou - 13 Aug 2011

Dear Laura Lee,

Attached is a study that was published in Journal of Pain and conducted at Northwestern University Department of Emergency Medicine by one of our senior residents, a junior faculty person, and me. We conducted a retrospective cohort study to compare "Metoclopramide Versus Hydromorphone for the Emergency Department Treatment of Migraine Headache."

I use this study in my Intermediate Epidemiology course to illustrate the different types of confounding by indication. While we adjusted for potential confounding by severity of the migraine headache in the adjusted relative risk comparison of reduction in migraine pain, we did not adjust for the potential confounding by indication for nausea or vomiting that is frequently associated with more severe migraine headaches and is often treated with metoclopramide, which is an anti-emetic medication. Thus, there is still potential for confounding in the study.

Let me know if this is useful to you and if you have any questions.

Demetrios N. Kyriacou (Jim)


Title BERD - Pain Presentation
Description - Problem to be explored I have been asked to give a 20 minute synopsis of "Resources available through the CTSA Biostatistics, Epidemiology and Research Design (BERD) Key Function Committee" on Wednesday to the CTSA Pain research interest group.

I plan to ask the group what research design issues are currently at the forefront for developments in pain research and will report back on what needs and questions I am asked. That said, can folks on our BERD list email me with ideas or examples I can use in my talk? I plan to go through the BERD watch series looking for ideas but hoped you all might have additional thoughts. I will also talk about CTSpedia, and say hey, each of your CTSAs have folks very interested in talking with you early and often! Plus, local BERD people can use the KFC to find expertise that may not be available locally.

Contributor/Email Laura Lee Johnson
Disclaimer The views expressed within CTSpedia are those of the author and must not be taken to represent policy or guidance on the behalf of any organization or institution with which the author is affiliated.
Topic revision: 09 May 2013, MaryBanach