Measurement and its ambiguities

Measuring blood pressure

Many elements may be involved in a sick person getting better, and it's not always clear what they are. Consider an elaborate Australian study of the treatment of moderate hypertension: tens of thousands of Australians were screened for high blood pressure. Only those with systolic blood pressure (SBP) greater than 200 or diastolic blood pressure (DBP) greater than 90 millimeters of mercury (mm Hg) were entered in the study. Various drugs or placebos were given to different sub-groups, and the blood pressure in the participants generally dropped, as shown in measurements made every four months. The study wisely included a relatively small group of 237 untreated people with moderate hypertension - people who received no medication and no placebos of any kind - whose blood pressure was also measured every four months. Their blood pressure dropped, too. The mean DBP in this group dropped from 101.5 mm Hg (mildly elevated) to about 80 mm Hg (normal) in thirty-two months, and then stabilized at that level (plus or minus 1 mm Hg) for the next two years (MCATT 1982).

Why did the blood pressure of these people go down? For those treated with various blood pressure reducing drugs, we might be tempted to explain the decline as a result of the specific medical effectiveness of these drugs. But patients who got placebos also showed blood pressure declines. And, most interesting of all, patients who got no placebos and no medications also showed this decline. One explanation for the return of blood pressure to normal in this group is "regression to the mean," which we discussed briefly earlier. This principle states that, if you select a group of people based on the fact that they share an extreme characteristic (high blood pressure, for example), they will in time revert to a more normal condition as the result of ordinary human homeostatic processes. Similarly, one can show that tall men tend to have tall sons, but not sons as tall as they are; and short men tend to have short sons, but not as short as they are. Stature tends to "regress to the mean." Note that if this were all that were going on, in some number of generations, there would no longer be any tall, or short, men. That's clearly nonsense, since it is also possible for men of average stature to have sons appreciably taller, or shorter, than they are. And that may be analogous to what happened to the people in the blood pressure study.

There are, however, other explanations. It may be that these individuals were, when first enrolled in the study, responding nervously to having their blood pressure taken, which gave them higher blood pressure; "white coat hypertension" is a well-recognized phenomenon (Landray and Lip 1999; O'Brien 1999). Having their blood pressure taken every four months may have gradually desensitized them to this event, and their blood pressure then no longer increased with the approach of the cuff. This would be a case of a distinct "measurement effect," where the measurement created the object of study, the elevated blood pressure, at least for a while.

There is another possibility: this may be an example of the meaning response. There is ample evidence to indicate that the use of various medical instruments and machines can have significant healing effects. We will look more closely at this possibility later. For the moment, we can imagine a modification to the study as I have described it which might let us make a more informed judgment about what was going on here. Suppose that the experimenters had included another group of people with high blood pressure, but the people in this group did not have their blood pressure taken every four months. Instead, their blood pressure was taken only at the end of the study, after three years. It's unlikely that this group would have gotten used to the blood pressure experience as the repeat measurement group might have. So, if at the end of three years we found that this ignored (no treatment at all!) group also now had normal blood pressure, we could probably attribute the change to "regression to the mean." If that group still had high blood pressure, we could attribute the change in the multiple measurement group to the placebo effect of the blood pressure cuff. Unfortunately, the researchers didn't do this, and so we will probably never know.

The point is that it is very difficult to know, even under the most stringent conditions, and in the simplest and most clear cut cases, just why a particular group of people "got better." Autonomous (or homeostatic) responses, drug responses, and meaningful responses are much easier to keep separated conceptually than they are in practice.

Diagnosis is treatment

Let's pursue this thought experiment - our addition to the Australian blood pressure study - a little bit further. Suppose we did as I suggested and identified several hundred people with moderate hypertension, and then did nothing to them for three years. What might we tell them about this experiment? Unless we could measure their blood pressure without them knowing it, we would have to tell them something. And we have found that their blood pressure is higher than normal. Suppose we tell a little lie; we say that they are fine, there is nothing to worry about, nothing to be done, go home and don't think about it any more. Earlier, I said that this was to be a group that got no treatment at all. But our little lie here doesn't seem to me to be "no treatment at all." Consider another possibility: suppose we told these people the "truth," that they had high blood pressure, that this was a risk factor for stroke and heart attack, and that they had a medical condition for which several thousand other people in this study would be treated with powerful medications. But they wouldn't get any. Of course it's unlikely that we would do either of these things; and indeed, the researchers didn't do either. But it seems quite plausible to me that if we did one or the other of these things, the outcome might have been quite different. The group to which we told lies might have been quite comforted by our charade and, as a result, their blood pressure might have gone down. The group with which we were brutally frank, but not very caring, might be expected to be scared and disturbed, and we might imagine that their blood pressure could go up, or at least not go down as it might in the group to which we lied.

What this means is that the very fact of diagnosing a person with some sort of medical condition is a form of medical treatment which can be expected to have an effect. This process was noticed many years ago by Howard Brody and David Waters and described in a fascinating article titled "Diagnosis is Treatment." They describe several interesting cases where it is quite clear that the shaping of the diagnosis can make a significant difference in the outcome of the illness. In one case, a 52-year-old man with long-term hypertension and recent symptoms of ulcers was quite testy with his doctor when asked about changes in his family situation. The doctor was trying to find areas of increased stress and anxiety; the patient was unwilling to describe any. But, with some persistence, the doctor learned that the man's wife had recently returned to work and was enjoying her "new life"; the patient, however, explained that he was feeling abandoned, and was very unhappy about it. The doctor suggested he discuss this with his wife. "The physician asks if he had felt more tense or sad since she returned to work. The man considers the idea and says he could not say but would think about it. He returns two weeks later to say he had discussed the conversation with his wife, who had not realized how deserted he was feeling. He is feeling much closer to her and more relaxed. He also reports a decrease in gastric pain." The authors continue by saying that in this (and another) case, "the diagnosis in itself exercised a therapeutic effect for the patient inasmuch as it provided an understandable, acceptable explanation for his behavior" (Brody and Waters 1980).

Untreated control groups

This leads us to another curious and complex issue. Occasionally, studies are designed to have a "no-treatment group," sometimes called a "natural history group," as did the Australian blood pressure study. The idea is that, in this group which is getting no treatment, we can see the "natural course of the disease." I would counter that, except under the most extraordinary circumstances, it is logically and conceptually impossible to have a no-treatment group. In order to do a trial, people have to be recruited and diagnosed for the condition under study; they receive some sort of examination, maybe an invasive and dramatic one. They give informed consent, perhaps after reading a long and complex document describing the study, the various treatments under review, and so on. They are then randomly assigned to (in this case) three conditions: drug treatment, placebo treatment, or no treatment. It's not clear what one will tell the group getting "no treatment." Certainly, their participation can't be "blind" to them; they know they aren't getting any drugs or placebos; a reasonable inference might be that they are healthy enough not to need any. And there has to be a follow-up, an assessment of the condition of the subjects after some period of time, or a diary of symptoms has to be kept, or something similar. While these people have not had pills, they have had a good deal more than "nothing."

The only way to proceed would be to diagnose illness surreptitiously, secretly, so that the individuals didn't know they were being observed; the follow-up would also have to be secret. No lab tests would be possible. Think "medicine by binoculars."

I know of only one experiment which approximates a genuine no treatment group, the Tuskegee Syphilis Study, in which the US Public Health Service enrolled 399 poor, rural, African-American men with syphilis into a forty-year-long observational study (Jones and Tuskegee Institute 1981). The idea was, in part, to see what happened to people who had syphilis which was not treated at all. Begun in 1932, the study went on until 1972; these men were deprived of all treatment. There were moderately effective treatments with salvarsan and other drugs in the 1930s. Treatment with penicillin, which was available for treatment of syphilis in the mid-1940s, was also denied to them. In 1997, President Clinton apologized to the few survivors of the experiment and their families, and to the nation, for this egregious ethical and moral catastrophe. It is, then, possible to have an untreated group, but only if you are prepared to go to incredibly extreme lengths.

Clinical trials

Given this complexity, how does one ever figure anything out about medical treatment? One of the great benefits of science is that the essential methodology is simply to forge ahead anyway, regardless of the complexities, by simplifying (this, of course, is also one of the great problems of science!). The main procedure that researchers use in clinical research[4] is the "randomized controlled trial" or "RCT." An RCT is designed to determine the efficacy of a drug for people with a particular medical condition. A simple study design would go like this.

Researchers first accumulate a number of people with a certain condition. If they have lots of patients with this condition coming to their own clinic, they might just ask their own patients if they were interested in participating in the study. Or they might ask other physicians to refer appropriate patients to them. They might advertise (residents of communities with university medical centers are accustomed to seeing ads in the local newspaper saying things like "Are you a man between 25 and 40 years of age and suffering hair loss [or psoriasis, or migraine headache, or any of hundreds of other such conditions]? Call 1-888-234-5678 to participate in a study of ... "). Recently, large research organizations like the various Institutes of the National Institutes of Health and others have been seeking participants for studies over the World Wide Web.[5] Today, many large trials are carried out simultaneously at a number of sites, from two or three to a hundred or more. Whenever such studies have any federal funding, and most other times, too, they must be approved by a research committee - often called an Institutional Review Board (IRB) - which sees that they are ethically acceptable, that the rights of the patients who volunteer for the trials are protected, and, particularly, that all patients are appropriately asked for their "informed consent." Patients are then matched up against the entrance requirements for the study. This is a very important point in the process. Sometimes, it's relatively simple: "patients with active ulceration of the duodenum as seen on endoscopy" is pretty straightforward; "patients with significant late-luteal phase dysphoric disorder (i.e., PMS)," or, remembering our earlier discussion, "patients with mild to moderate hypertension" are more problematic. Let's take the simpler case: ulcers.

After being selected for the study, patients are given some sort of treatment for the condition. In some studies, there will be three or four treatment groups with different amounts of medication (10 mg, 20 mg, 30 mg, etc.). In the technical argot of medicine, these are sometimes called "verum" groups; verum means "true" or "truly." And then there is also a "control group." The control group may be given an existing standard drug or it may be given an inert treatment, a "placebo." The central necessities at this stage are that patients must be allocated to the different treatment groups at random, and that neither the researchers nor the patients can know who is getting which treatment; hence they are called "Randomized Controlled Trials."
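Random allocation itself is mechanically simple. Here is a minimal sketch in Python (hypothetical patient IDs; a plain shuffle-and-deal rather than the stratified or blocked randomization schemes real trials often use):

```python
import random

def randomize(patient_ids, groups=("drug", "placebo"), seed=None):
    """Balanced random allocation: shuffle the patients, then deal
    them round-robin into the treatment groups."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return {pid: groups[i % len(groups)] for i, pid in enumerate(ids)}

# 100 hypothetical patients split 50/50, with neither the order of
# enrollment nor any patient trait influencing the assignment.
assignment = randomize(range(100), seed=7)
```

Blinding is a separate step: the allocation table would be held by a third party, with patients and clinicians seeing only coded labels.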

The patients are treated with their medications or placebos for an appropriate length of time - in the case of ulcers, it might be for about a month - and then they are checked again to see what has happened; they might receive a second endoscopic examination, for example. At this point, the study is "unblinded," and the outcome in the treatment groups is compared. For the sake of simplicity, let's assume that there were two groups, one receiving active drug treatment and one receiving placebo treatment. And suppose we find that, in the drug treatment group, 60% of the patients are better, while in the placebo treatment group, 40% are better. Sounds pretty good.

But suppose that we had only 10 people in each group. In the drug group, 6 of 10 were better, while in the placebo group, 4 of 10 were better. There are only 2 more in the drug group that got better than in the placebo group. It seems pretty likely that this could have occurred simply by chance; we might very well have had this outcome if we had given placebos (or active drugs) to both groups. Indeed, this example is rather like the outcome of an experiment of flipping coins. Flip a coin 10 times, and you have an excellent chance of getting 6 heads one time, and 4 the next. If, however, you flip the same coin 1,000 times, it is extremely unlikely that you will get 600 heads the first time round and 400 heads the second. If you had a crooked coin, you might get heads 606 once and 594 once, but not 600 and then 400. So, sample size is important.
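The coin-flip argument is easy to verify by simulation. A small sketch in Python (the seed and trial counts are arbitrary, chosen only for illustration):

```python
import random

def heads(n_flips, rng):
    """Count heads in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips))

rng = random.Random(0)

# Repeat the 10-flip experiment: 6 or more heads happens routinely,
# roughly 38% of the time.
small = sum(heads(10, rng) >= 6 for _ in range(10_000)) / 10_000

# Repeat the 1,000-flip experiment: 600 or more heads essentially never happens.
large = sum(heads(1000, rng) >= 600 for _ in range(2_000)) / 2_000

print(small, large)
```

With a fair coin, the 600-of-1,000 outcome has a probability on the order of one in ten billion, which is why the large trial is so much more convincing than the small one.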

Suppose in our ulcer treatment study we had enrolled 2,000 patients, and we had 60% healed in the drug group (600 of 1,000), and 40% in the placebo group (400 of 1,000). It seems extremely likely that, if we repeated this experiment, we would not get reversed results the second time, just like the coin. It is quite clear that we can conclude now that the drug is an effective one for healing ulcer disease. But we have had to do an awful lot of work to prove it, studying 2,000 patients. This is why the use of statistics is essential in doing clinical research. Using fairly straightforward statistics, one can determine what the probability is that the outcome of a particular experiment is due to chance. For example, when you flip a coin 10 times, you have a 37% probability of getting 6 or more heads; such an outcome is likely about 1 time in 3. No one is likely to conclude from this experiment that we have a biased coin (or an effective drug).
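That 37% figure can be computed exactly from the binomial distribution; a short check using only Python's standard library (37% is the chance of at least 6 heads; exactly 6 heads alone has a probability of about 21%):

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    heads in n flips of a coin that lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(6, 10))      # 0.376953125 -> about 37%, or 1 time in 3
print(prob_at_least(600, 1000))  # on the order of 1e-10: effectively never
```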

If, however, we have 50 patients in each of two groups, one getting an active drug and one getting a placebo, and 60% of the drug patients are better (30 of 50) at the end of the trial, and 40% of the placebo patients are better (20 of 50), there is only a 4.5% chance that this "biasing" of the outcome is due to chance. In such a case, it is common to say that there is less than one chance in 20 (5%) that the outcome is due to chance; people also say that the result is "statistically significant at the .05 level." Now notice that just because something is "statistically significant" doesn't necessarily mean that it is particularly important (or "significant"). If we have a new drug that we test on 10,000 people (two groups of 5,000 each), and people in the drug group are better at the end of the trial 51% of the time (2,550 of 5,000), and the placebo group patients are better 49% of the time (2,450 of 5,000), this is a statistically significant difference which is exactly the same as in the previous case (this outcome could happen simply due to chance only 4.5% of the time). But even though the difference is statistically significant, it doesn't seem very significant (unless the drug were really, really cheap!).
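Both of those 4.5% figures come from the same calculation, a two-proportion z-test, which gives z = 2 in each case. A sketch of the arithmetic using the standard library's error function (the function name here is mine, not a library routine):

```python
from math import sqrt, erf

def two_sided_p(better_a, n_a, better_b, n_b):
    """Two-proportion z-test: the probability that a difference in
    success rates at least this large arises purely by chance."""
    p_a, p_b = better_a / n_a, better_b / n_b
    pooled = (better_a + better_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal tail

print(two_sided_p(30, 50, 20, 50))          # ~0.045: 30/50 vs 20/50
print(two_sided_p(2550, 5000, 2450, 5000))  # ~0.045: same z, same p-value
```

The p-value is identical in the two cases even though the first difference (20 percentage points) matters clinically and the second (2 points) may not, which is exactly the text's point.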

Another way to look at this is to use the concept of the Number Needed to Treat (NNT). The NNT tells you how many people have to receive some treatment in order for one person to benefit from it. To calculate the NNT, you determine the proportion of benefit the treatment gives, and divide it into 100. In our case with 50 patients in each group, where 60% of drug patients got better, and 40% of control patients got better, the proportion of benefit is 60% - 40% which equals 20%, which we divide into 100% giving an NNT of 5. We need to treat 5 patients with the new drug in order for one to benefit. All 5 have to pay for the drug, and all 5 have to tolerate its undesirable effects, and one will benefit. In our case with 10,000 patients, the proportion of benefit is 51% - 49%, or 2%, which divided into 100% gives an NNT of 50. We have to treat 50 people with this new drug to have one person benefit. Just because a difference is "statistically significant" doesn't mean it is "significant" for real medical practice.
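The NNT arithmetic reduces to a single division: NNT = 1 / (absolute difference in improvement rates). A minimal sketch, using fractions rather than percentages:

```python
def nnt(rate_drug, rate_control):
    """Number Needed to Treat: 1 / absolute risk reduction.
    Rates are fractions, e.g. 0.60 for 60% of patients improved."""
    benefit = rate_drug - rate_control
    if benefit <= 0:
        raise ValueError("the treatment shows no net benefit")
    return 1 / benefit

print(nnt(0.60, 0.40))  # ~5: treat 5 patients for 1 to benefit
print(nnt(0.51, 0.49))  # ~50: treat 50 patients for 1 to benefit
```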

Note that I have been assuming that the differences between the drug group and the control group in these studies were due to the fact that one group got the drug and the other didn't. Are there any other possibilities? One of the biggest problems in doing RCTs is being certain that the individuals were assigned to the different groups in a truly random fashion. Suppose, for example, that the researchers decided to simplify, and arranged for all the men to be in one group and all the women to be in the other. This would hardly be a random distribution. Indeed, it is common enough for researchers to restrict study subjects to one gender or the other (it has traditionally been men) just so this couldn't arise. At the end of the study, when the results are "unblinded," the researchers compare the two groups on a variety of demographic measures, hoping that there are no differences - in gender, age, illness severity, economic status, etc. - between them. If the groups are the same on these measures, it is taken as evidence that the two groups are "the same," and therefore any differences between them are due to the presence or absence of the drug being tested.[6]

There is another factor to consider. It is often alleged that, for a variety of reasons, RCTs aren't really "blind." In particular, it is said that people can figure out who is taking the drug and who is taking the placebo by noticing "side effects," or the like. This may be the case. Insofar as it is, and insofar as doctors or researchers convince the people they believe to be taking the drug that they will do better than others, the results of the trial will probably show a deflated "placebo effectiveness" rate and an inflated "verum effectiveness" rate. Later, if the drug is approved for use, practicing physicians, convinced by these biased studies that the drug is highly effective, will convey that enthusiasm (or bias) to their patients and may heal a lot of patients with the meaning response. There's nothing wrong with this, of course, but it's likely that, if some skeptic comes along later with a better research design and tests the drug again, the drug will disappear from the pharmacies fairly quickly.

Conventional medicine's strong reliance on the randomized controlled trial, often referred to as the "gold standard" of medicine, rests on the fact that people get better when they take inert medications.
[4] There is a longstanding distinction in conventional medicine between two types of research: laboratory research is work on chemistry or biology which might involve testing various substances on tissues grown in petri dishes, or on animals, and perhaps on the occasional "human guinea pig," or the like. Clinical research involves testing drugs or procedures on human beings in hospitals, doctors' offices, or clinics.

[6] There is a downside to this. If you pick your patients so they are all white men between 40 and 45 with "stage II illness," who each make between $37,000 and $40,000 per year in middle management, and don't wear glasses, and then randomly assign them to different groups, at the end you will be able to show that the two groups were "the same." But you won't have any real idea of whether your new drug will be of any value for rich 20-year-old black women, or 70-year-old nearsighted Hispanics. This is a very common problem in medicine; in particular, most drugs have, over the years, not been tested in women or children.

Excerpted and adapted from: Moerman, Daniel E. Meaning, Medicine and the 'Placebo Effect'. West Nyack, NY, USA: Cambridge University Press, 2002. Chapter 3.