A double-blind test is a control group test where neither the evaluator nor the subject knows which items are controls. A randomized test is one that randomly assigns items to the control and the experimental groups. Whenever possible, a control group study should randomly assign members to the control and experimental groups. This reduces the chance of biasing the study.
The purpose of controls, double-blind, and randomized testing is to reduce error, self-deception, and bias. An example should clarify the necessity of these safeguards.
The DKL LifeGuard Model 2 from DielectroKinetic Laboratories can allegedly detect a living human being by receiving a signal from the heartbeat at distances of up to 20 meters through any material. So say the manufacturers of the device. Sandia Labs tested the LifeGuard 2 using a double-blind, randomized method of testing. Sandia is a national security laboratory operated for the U.S. Department of Energy by the Sandia Corporation, a Lockheed Martin Co. The causal hypothesis they tested could be worded as follows: the human heartbeat causes a directional signal to activate in the LifeGuard, thereby allowing the user of the LifeGuard to find a hidden human being (the target) up to 20 meters away, regardless of what objects might be between the LifeGuard and the target.
The testing procedure was quite simple: five large plastic packing crates were set up in a line at 30-foot intervals. The test operator, using the DKL LifeGuard Model 2, tried to detect in which of the five crates a human being was hiding. Whether a crate would be empty or contain a person for each of the twenty-five trials was determined by random assignment. This is to avoid using a pattern that might be detected by the subject.
Tests showed that the device performed no better than expected from random chance. The test operator was a DKL representative. The only time the test operator did well in detecting his targets was when he had prior knowledge of the target's location. The LifeGuard was successful ten out of ten times when the operator knew where the target was. It may seem ludicrous to test the device by telling the operator where the targets are, but it establishes a baseline and affirms that the device is working. Only when the operator agrees that his device is working should the test proceed to the second stage, the double-blind test. The operator will not be as likely to come up with an ad hoc hypothesis to explain away any failures in a double-blind test if he has agreed beforehand that the device is working properly.
If the device could perform as claimed, the operator should have received no signals from the empty crates and signals from each of the crates with a person within. In the main test of the LifeGuard, when neither the test operator nor the investigator keeping track of the operator's results knew which of five possible locations contained the target, the operator performed poorly (6 correct out of 25) and took about four times longer than when he knew the target's location. With one target among five possible locations, guessing alone would be right about one time in five, or about 5 times in 25. If human heartbeats cause the device to activate, one would expect a significantly better performance than 6 of 25, which is about what would be expected by chance.
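The chance expectation for the blind test can be checked with a little arithmetic. The sketch below assumes that each of the 25 trials had exactly one target among five crates and that a guesser picks a crate at random. The function and variable names are illustrative and are not taken from the Sandia report.

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_trials = 25
p_chance = 1 / 5  # assumption: one target among five crates on every trial

expected_hits = n_trials * p_chance               # average score of a pure guesser
p_six_or_more = binomial_tail(6, n_trials, p_chance)

print(f"Expected hits by guessing: {expected_hits:.1f} of {n_trials}")
print(f"Chance of 6 or more hits by guessing: {p_six_or_more:.3f}")
```

Under these assumptions a pure guesser averages 5 hits in 25 and scores 6 or more over a third of the time, so a score of 6 of 25 is unremarkable.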
The different performances—10 correct out of 10 tries versus 6 correct out of 25 tries—vividly illustrate the need for keeping the subject blind to the controls: it is needed to eliminate self-deception and subjective validation. The evaluator is kept blind to the controls to prevent him or her from subtly tipping off the subject, either knowingly or unknowingly. If the evaluator knew which crates were empty and which had persons, he or she might give a visual signal to the subject by looking only at the crates with persons. To eliminate the possibility of cheating or evaluator bias, the evaluator is kept in the dark regarding the controls.
The lack of testing under controlled conditions explains why many psychics, graphologists, astrologers, dowsers, New Age therapists, and the like, believe in their abilities. To test a dowser it is not enough to have the dowser and his friends tell you that it works by pointing out all the wells that have been dug on the dowser's advice. One should perform a random, double-blind test, such as the one done by Ray Hyman with an experienced dowser on the PBS program Frontiers of Science (Nov. 19, 1997). The dowser claimed he could find buried metal objects as well as water. He agreed to a test that involved randomly selecting numbers that corresponded to buckets placed upside down in a field. The numbers determined which buckets a metal object would be placed under. The one doing the placing of the objects was not the same person who went around with the dowser as he tried to find the objects. The exact odds of finding a metal object by chance could be calculated. For example, if there are 100 buckets and 10 of them have a metal object, then getting 10% correct would be predicted by chance. That is, over a large number of attempts, getting about 10% correct would be expected of anyone, with or without a dowsing rod. On the other hand, if someone consistently got 80% or 90% correct, and we were sure he or she was not cheating, that would confirm the dowser's powers.
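The chance baseline in the bucket example can be made exact. The sketch below uses the illustrative figures from the paragraph above (100 buckets, 10 of them hiding a metal object) and adds one assumption that does not come from the televised test: the dowser picks exactly 10 buckets. The function name is hypothetical.

```python
from math import comb

def hypergeom_pmf(hits, buckets, with_metal, picks):
    """Probability of exactly `hits` correct picks when `picks` buckets are
    chosen at random, without repeats, from `buckets` buckets of which
    `with_metal` hide an object."""
    return (comb(with_metal, hits)
            * comb(buckets - with_metal, picks - hits)
            / comb(buckets, picks))

buckets, with_metal, picks = 100, 10, 10  # picks = 10 is an assumption

expected_hits = picks * with_metal / buckets
p_zero_hits = hypergeom_pmf(0, buckets, with_metal, picks)
p_eight_or_more = sum(hypergeom_pmf(k, buckets, with_metal, picks)
                      for k in range(8, picks + 1))

print(f"Expected hits by chance: {expected_hits:.1f} of {picks}")
print(f"Chance of finding nothing: {p_zero_hits:.3f}")
print(f"Chance of 8 or more hits: {p_eight_or_more:.2e}")
```

Under these assumptions a guesser averages one hit in ten picks and comes up empty roughly a third of the time, while eight or more hits would almost never happen by luck.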
The dowser walked up and down the lines of buckets with his rod but said he couldn't get any strong readings. When he selected a bucket he qualified his selection with something to the effect that he didn't think he'd be right. He was right about never being right! He didn't find a single metal object despite several attempts. His performance is typical of dowsers tested under controlled conditions. His response was also typical: he was genuinely surprised. Like most of us, the dowser is not aware of the many factors that can hinder us from doing a proper evaluation of events: self-deception, wishful thinking, suggestion, unconscious bias, selective thinking, subjective validation, communal reinforcement, and the like.
Many control group studies use a placebo in control groups to keep the subjects in the dark as to whether they are being given the causal agent that is being tested. For example, both the control and experimental groups will be given identical looking pills in a study testing the effectiveness of a new drug. Only one pill will contain the agent being tested; the other pill will be a placebo. In a double-blind study, the evaluator of the results would not know which subjects got the placebo until his or her evaluation of observed results was completed. This is to avoid evaluator bias from influencing observations and measurements.
The first use of control groups in medicine is attributed to Dr. James Lind (1716-1794) who discovered a relationship between citrus fruit and scurvy, a disease that killed many more sailors than died of battle wounds in the 18th century. Lind compared six treatments on sailors with scurvy. Those given lemons and oranges were almost symptom free within a week. The other sailors in the study didn't fare so well, though those given cider improved slightly. For more on the history of the randomized control study see Trick or Treatment: The Undeniable Facts about Alternative Medicine (2008) by Edzard Ernst and Simon Singh.
Of course, Lind did not know that vitamin C was the necessary nutrient in the citrus fruit that was preventing scurvy. In fact, he believed that the cause of scurvy was "incompletely digested food building up toxins within the body" (Bryson 2010). Lind's controlled experiment showed that there was something vital in oranges and lemons that prevented scurvy. His view of what caused scurvy indicates that he still adhered to the belief that disease is caused by internal toxins that needed to be expelled, a popular belief among medical experts from antiquity through the 19th century. Only quacks still maintain the belief that toxins in the body cause disease and the only cure is to expel them.
The long road from Lind's experiment to a complete understanding of the role of ascorbic acid in nutrition involved the work of many scientists over many years. It would not have been possible to conceive that food itself contains nutrients necessary to avoid specific diseases when one believed that all disease is due to internal bad humors or toxins that need to be expelled. Had Lind lived in a later age (but maintained his belief in the internal toxin theory of disease) where it would have been possible to determine the level of toxins in scurvy victims, he might have thought his belief validated if he found toxins in scurvy victims. However, if there were such toxins, they could have been the effect of scurvy, or the effect of something altogether unrelated to the scurvy.
As late as the early 20th century, the leading medical textbook of the day attributed scurvy to "insanitary surroundings, overwork, mental depression and exposure to cold and damp" (Bryson 2010). The medical textbook reflects what is called the miasma theory of disease, which was also very popular in the 19th century.
In 1917, E. V. McCollum, who coined the terms 'vitamin A' and 'vitamin B', declared that scurvy was caused by constipation (Bryson 2010). McCollum, who was one of the leading nutritionists of his day, seems to have adhered to the toxic buildup theory, the one that led to so much death and destruction over several centuries in the form of bloodletting. Still, McCollum represents an advancement. Who wouldn't prefer a laxative to bloodletting?
Dr. Alan Hirsch claims to be "The World Expert In Smell & Taste." He is an M.D.—a psychiatrist, in fact—who developed some magical crystals that will "help you reduce your appetite and food cravings." You can read all about his crystals, which he calls SprinkleThin™, on his website (which has been taken down, but you can see what it looked like at http://web.archive.org/web/20080225043445/http://www.scienceofsmell.com/). On July 25, 2005, I found the following testimonial on that website.
“What Dr. Hirsch discovered might surprise you. (Certain smells) seem to control appetite. Dr. Hirsch studied 2,700 people over six months, like the six people we met. They tried just about every diet imaginable. Dr. Hirsch brought along with him these special, non-caloric, scented crystals and asked the six to sprinkle it on their food.
All the participants kept a video diary for Dateline to prove they were using the product. At the end of three months when we checked in on them, they were all losing weight.”
What is wrong with Dateline's investigation? Among other things, Dateline did not have a control group. Dr. Hirsch says he has been studying eating behavior and weight loss for 25 years. He says he has done many studies, but if his studies were like Dateline's study they are not of much scientific value.
A well-designed study on the diet crystals would use a control group. Using a control group wouldn’t eliminate all problems with a study on weight loss, but it would reduce them. Weight loss is affected by many factors (motivation, eating behavior, amount of activity—especially exercise—overall health, metabolism, stress, and so on) and experimenters can't lock up humans in cages to make sure they do what they're supposed to do for the study. But, at the very least, a well-designed scientific study should use a control group and try to match the members of that group to those in the experimental group for factors that might have a significant effect on the outcome. For example, if you were doing a study that was testing whether prayer has an effect on the longevity of patients dying of AIDS, you should make sure that the ages of the subjects in both groups match up. It would not be a fair study to have 60-year-olds in one group and twenty-somethings in the other group.
Without a control group, a scientist can't be sure that the diet crystals contributed significantly to the weight loss or, if they did, in what way. The placebo effect may be at work here: dieters may believe these crystals really affect their sense of taste and smell to such a degree that their appetites are suppressed. They may be deceiving themselves, but the crystals help them anyway. However, powdered beetle dung might have had the same effect. The diet scientist doesn't just want to help people lose weight. If a product works, she wants to know why it works.
Dateline (and Dr. Hirsch) should not just give the crystals to dieters and observe whether they lose weight. They should have a group of similar people who want to lose weight and give them a placebo, a substance that looks like the diet crystals and is ingested in exactly the same way, but which is inert. They should agree to study the two groups for a set length of time, long enough for any diet to show results (several weeks, at least). At the end of the study they would compare the weight loss of the two groups. If the experimental group shows a significantly greater weight loss than the control group, then the scientists have good evidence that the crystals might be effective.
Having a control group is necessary but it is not sufficient for having a well-designed control group study. The study must use an adequate number of participants. Six people would not be adequate for a control group study. Several hundred would be a better number. Why? With only six people, all it takes is one participant to do really well to elevate the average of the group significantly above the average of the other group. But this one person's success might be a fluke. By having a larger sample, the researcher reduces the chances that a few fluky individuals have skewed the results.
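The effect of sample size on fluky results can be seen in a small simulation. The sketch below invents its data: it assumes the crystals do nothing at all and that individual weight change varies with a standard deviation of 10 pounds, a made-up figure. It then asks how far apart the two group averages drift by luck alone. The function name and all parameters are hypothetical.

```python
import random
import statistics

def null_difference_spread(group_size, n_studies=1000, sd=10.0, seed=0):
    """Simulate many studies in which the treatment has no effect and return
    the standard deviation of the difference between the two group means."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_studies):
        treated = [rng.gauss(0.0, sd) for _ in range(group_size)]
        control = [rng.gauss(0.0, sd) for _ in range(group_size)]
        diff = sum(treated) / group_size - sum(control) / group_size
        differences.append(diff)
    return statistics.stdev(differences)

spread_small = null_difference_spread(6)     # six people per group
spread_large = null_difference_spread(300)   # several hundred per group

print(f"Typical chance gap between group means, 6 per group:   {spread_small:.2f} lb")
print(f"Typical chance gap between group means, 300 per group: {spread_large:.2f} lb")
```

With six people per group, the averages differ by several pounds through chance alone. With several hundred per group, the chance gap shrinks to well under a pound, so a real effect has much less noise to hide behind.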
Another way to reduce the chances of fluky results is to randomly assign subjects to the control and experimental groups. Randomization is very important to reduce the chances of biasing the samples. If highly motivated folks are placed in the diet crystal group and a bunch of lazy couch potatoes are in the control group, the results of the study would be biased. It is important that a method of true randomization be used, such as a random number table. You might think that assigning all the dark-haired subjects to one group and the light-haired subjects to the other would be sufficient to avoid having biased groups, but you cannot be sure that there is not something about hair color that is related to a person's weight. It is unlikely, but a scientist should not go with hunches in matters such as randomization.
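Random assignment is easy to sketch in code. The example below uses Python's pseudorandom number generator rather than a printed random number table, which is a substitution made for convenience. The function name and the subject IDs are hypothetical.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal
    size. Returns (experimental_group, control_group)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Twenty hypothetical subject IDs; a fixed seed makes the split reproducible
# so that the assignment record can be audited later.
experimental, control = randomly_assign(range(1, 21), seed=42)
print("Experimental group:", sorted(experimental))
print("Control group:     ", sorted(control))
```

Because the shuffle ignores every trait of the subjects, including motivation and hair color, no such trait can systematically favor one group except by chance.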
It is also important that the subjects in this study not know whether they have been given the magic crystals or the placebo. There is much controversy regarding the ethics of deceiving subjects, but from a scientific point of view it might be better if the subjects didn't even know that the study is about weight loss. If they think, for example, that the study is testing the effectiveness of a new blood pressure medicine, you would eliminate such things as motivation to lose weight or belief that the crystals are appetite suppressants as possible causes of any weight loss achieved. However, many, if not most, scientists argue that it is unethical to deceive participants in scientific studies. The subjects in a study don't need to be told which group they are in, but they should be told that they have been randomly assigned to their group and that at the end of the study they will be told which group they were in. (In some studies, participants will know which group they're in by obvious facts, e.g., the control group folks would know they're in the control group of a study testing various methods to reduce blood pressure if they're told to do nothing special and just come in for regular blood pressure measurements.)
The kind of control group study described above is known as a parallel group study. However, as Dr. Gerard Dallal (2000) writes: "It takes little experience with parallel group studies to recognize the potential for great gains in efficiency if each subject could receive both treatments. The comparison of treatments would no longer be contaminated by the variability between subjects since the comparison is carried out within each individual." Such studies are known as crossover studies. They are highly recommended. In a crossover study, at the midway point in the study, members of the control group would now be given the active item being tested (i.e., they would now become the experimental group) and the members of the experimental group would now be given the placebo.
Had Dr. Hirsch done a double-blind study, an assistant might have randomly assigned the subjects to their groups and kept a record of who is in which group. Dr. Hirsch or another assistant might have weighed all the subjects and kept weight records for each participant. After all the data had been collected, Dr. Hirsch would "unblind" the study and the data for the two groups would be compared.
The final step in a well-designed study is the analysis of the data. You might think that the scientists should be able to look at the results and see right away whether the crystals did any good. This would only be true if, say, there were hundreds in each group and the experimental group lost 50 pounds each on average, while the control group gained 2 pounds. If the study had been designed properly, such results would be extremely unlikely to be a fluke. But what if the experimental group lost 2% more weight than the control group? Would that be statistically significant? To answer that question, scientists resort to statistical formulae. By some formula, a 2% weight loss might be statistically significant. If, however, a 2% weight loss meant 4 ounces over six weeks, most of us would say that even if this is statistically significant it is not important and not worth the money or the risk to use these crystals. The crystals might have some wicked side effect that hasn't yet been discovered.
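The gap between statistical significance and practical importance can be illustrated with one common formula, a two-sample z test. This is only a sketch. The 4-ounce (0.25-pound) difference comes from the paragraph above, but the 3-pound standard deviation and the group sizes are invented for illustration, and the function name is hypothetical.

```python
from math import erfc, sqrt

def two_sided_p_value(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two group means,
    using a normal (z) approximation with equal group sizes and a common sd."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    return erfc(z / sqrt(2.0))  # equals 2 * (1 - Phi(z))

diff_lb = 0.25  # 4 ounces, the difference discussed in the text
sd_lb = 3.0     # assumed spread of individual weight change

p_small_study = two_sided_p_value(diff_lb, sd_lb, n_per_group=300)
p_large_study = two_sided_p_value(diff_lb, sd_lb, n_per_group=2000)

print(f"300 per group:  p = {p_small_study:.3f}")
print(f"2000 per group: p = {p_large_study:.4f}")
```

Under these assumptions the same 4-ounce difference fails the conventional 0.05 threshold with 300 people per group and passes it with 2,000 per group. In both cases the difference is still only 4 ounces, so whether it matters is a separate question from whether it is statistically significant.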
The moral of this story is that while testimonials of six people who use crystals and lose weight might have a powerful effect on a television audience, a critical thinker should recognize that without a well-designed control group study, such testimonials do not have much scientific value.
A critical thinker also knows that information should be put in the proper context, which requires a certain amount of background knowledge. For example, you should know that many well-designed scientific studies get significant results that cannot be replicated at all or in a consistent fashion. If there is a causal relationship between diet crystals and losing weight, it should not work sporadically but consistently, unless, of course, there are so many factors that affect body weight as to make it nearly impossible to isolate the true effectiveness of a single item. In any case, a single study, no matter how well designed or how significant the results, rarely justifies drawing strong conclusions about causal relationships.
Finally, as mentioned above, there might be some deleterious side effect of these crystals that has not yet been discovered. SprinkleThin™ might help you lose weight but if it kills you in the process, what have you gained?
Bryson, Bill. 2010. At Home: A Short History of Private Life. Doubleday.
Dallal, Gerard E. 2000. The Computer-Aided Analysis of Crossover Studies. http://www.jerrydallal.com/LHSP/crossovr.htm (accessed 12/20/2012).
Ernst, Edzard and Simon Singh. 2008. Trick or Treatment: The Undeniable Facts about Alternative Medicine. W. W. Norton & Company.
Giere, Ronald. 1998. Understanding Scientific Reasoning, 4th ed. Holt, Rinehart and Winston.
Kourany, Janet A. 1998. Scientific Knowledge: Basic Issues in the Philosophy of Science, 2nd ed. Wadsworth Publishing Co.
Sagan, Carl. 1995. The Demon-Haunted World: Science as a Candle in the Dark. Random House.