The Problem of Self-Selection

Phone-in polls are an example of self-selection. People select themselves for the poll by deciding to call in. Such a poll would be biased toward people with strong opinions, and the opinions may differ depending on the program or the network. A phone-in poll on CNN commonly produces very different results from a phone-in poll on Fox News. If the target population is "all voters," the results of neither poll can be considered representative. Neither is based on a random sample of all voters.

Natural groups as self-selected populations

What is self-selection? Does it require a conscious decision?

The term self-selection does not necessarily mean people volunteer to be in a group, as in a phone-in poll. It does not imply that people have "decided" to be in a group, or that they are conscious of their group membership. A self-selected group is simply a naturally occurring group. Here are examples of self-selected groups.

  1. 16-year-olds
  2. People who buy Coca-Cola
  3. Left-handed people
  4. Mothers of twins
  5. Citizens of Italy
  6. Heroin addicts
  7. Redheads
  8. Yankees fans
  9. Sophomores at a particular college

Notice how each of these groups is defined by one characteristic, even though people in general have many characteristics. For each of the groups above (and any other naturally formed group) you could probably come up with a long list of characteristics they tend to have in common. A citizen of Italy might tend to be a brunette, gesture with the hands while talking, enjoy foods with garlic, use the metric system, and any number of other characteristics that together make up a stereotype of what it means to be Italian (or European, or human). Like all stereotypes, it would not fit every Italian, but there would be some truth to it. Observational research could add many other features of Italians to the lineup.

So far, so good. You could take a random sample of Italians, and that would be meaningful. Statistics could tell you how accurate your sample should be, so you could present your results with the sampling error included, for example, "85% of Italians are naturally brunette, plus or minus 3%" (meaning there is a 90% likelihood the true figure is between 82% and 88%). As long as your conclusions are purely descriptive, there is no problem. But if you infer that being Italian causes somebody to be brunette, you are clearly making a logical blunder. The mere fact that having dark hair is highly correlated with being Italian does not mean a new Italian (either natural born or immigrant) will necessarily be brunette.

Here's where critical thinking comes in. When people interpret observational research, they tend to focus on the characteristic used to label a group. People tend not to consider all the other characteristics that fit the group. If somebody draws a conclusion from the research based only on that label, they ignore all the other correlates of group membership. Indeed, the same person could belong to several of the groups above: a person could be 16, left-handed, and a Yankees fan.
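The "85%, plus or minus 3%" statement in this passage can be made concrete with a little arithmetic. The sketch below uses the standard normal approximation for a sample proportion. The text does not say how many people were sampled, so the sample size here is worked out as an assumption, and the function names are made up for this illustration.

```python
import math

# z = 1.645 is the usual multiplier for a 90% confidence level.
Z_90 = 1.645

def margin_of_error(p_hat, n, z=Z_90):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_needed(p_hat, margin, z=Z_90):
    """Smallest random sample that keeps the margin of error at or below `margin`."""
    return math.ceil((z / margin) ** 2 * p_hat * (1 - p_hat))

# Hypothetical case from the text: 85% brunette, plus or minus 3 points.
n = sample_size_needed(0.85, 0.03)
moe = margin_of_error(0.85, n)
print(f"sample size: {n}, margin of error: {moe:.3f}")
```

Under these assumptions, a few hundred randomly chosen people would be enough to support the descriptive claim. Note that the calculation only works for a random sample; it says nothing about why Italians tend to be brunette.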

People in naturally occurring groups, like people who answer opinion polls at CNN, are "selected" by their behavior, their ancestry, their life histories, or other circumstances. Such groups are not random samples and should not be considered representative of larger populations. When researchers gather data from a naturally occurring group (singers, politicians, people who live in mountains, people who eat healthy foods, people who take a particular medicine, whatever) the result is a self-selected sample. Any variable that correlates with group membership could explain observed characteristics of such a group. Here are some examples.

When a group is described by observational data, what sorts of variables might explain the patterns?

Example #1: Defining a group based on what they consume. Suppose you are a student majoring in public health. For a class project, you decide to examine the effects of a common herbal remedy for joint pain: a combination of glucosamine and chondroitin. You devise a questionnaire asking people (selected randomly from a telephone book) about vitamins and supplements they take daily. Sure enough, 13% of your sample takes glucosamine/chondroitin supplements.

You examine the data from this group and your heart jumps, because you see some alarming patterns. This group is less healthy than average! Before you jump to any conclusions, however, you remember your introductory psychology class and the concept of self-selection, so you think about what factors may correlate with the decision to take this health supplement. Then you realize several things...

  1. This group elected to take nutritional supplements; they may have felt that something was wrong with them. The group may also include more health-conscious hypochondriacs who tend to complain about bodily symptoms.
  2. This group selected glucosamine/chondroitin, which has a reputation for helping joint pain. They are likely to have joint pain already, so more people in this group might have arthritis.
  3. Older people tend to be more health-conscious, so a group of people taking glucosamine/chondroitin might be older than a random sample of the population.
  4. This group purchased a supplement that has been "debunked" or disproven several times by reputable medical studies. Maybe they are less educated or less informed than most people. [And we could go on...]

In fact, we could go on, and on, and on. The number of distinctive characteristics of this group is limited only by your imagination. Any "alarming patterns" you see in your data might not be due to glucosamine and chondroitin themselves; they might be due to one of the factors above or some other factor you did not consider.

Good effects as well as harmful effects can be erroneously inferred from self-selected samples. During the 1990s and 2000s, many different vitamin and food supplements were described as beneficial. These include antioxidants, megadoses of certain vitamins, human growth hormone, and a variety of herbal remedies. Typically a survey showed positive results in people taking a supplement.

However, when people are assigned at random into two groups, only one of which takes a genuine supplement, the supplements never seem to produce a beneficial effect. Why do surveys show positive effects, but controlled experiments show negative or non-existent effects? My guess is that people who seek these supplements are more attentive to health issues in general, so they take better care of themselves. They attribute their good health (wrongly) to some of the pills they are taking. When more careful controls are employed, the effect disappears, so this is an example of diminishing returns with repeated replications.
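The contrast between surveys and randomized experiments can be illustrated with a small simulation. This is only a sketch under invented assumptions: the supplement has no effect at all, health-conscious people are healthier, and health-conscious people are more likely to choose the supplement. All numbers and names are hypothetical and do not come from any real study.

```python
import random

def simulate(n=20000, seed=0):
    """Compare a self-selected survey with random assignment for a useless supplement."""
    rng = random.Random(seed)
    survey = {True: [], False: []}  # people who chose / did not choose the supplement
    trial = {True: [], False: []}   # people randomly assigned / not assigned to it
    for _ in range(n):
        health_conscious = rng.random() < 0.5
        # Health depends only on being health-conscious; the supplement does nothing.
        health = 50 + (10 if health_conscious else 0) + rng.gauss(0, 5)
        # Survey: health-conscious people are more likely to choose the supplement.
        chooses = rng.random() < (0.7 if health_conscious else 0.2)
        survey[chooses].append(health)
        # Experiment: a coin flip decides who gets the supplement.
        assigned = rng.random() < 0.5
        trial[assigned].append(health)

    def mean(values):
        return sum(values) / len(values)

    survey_gap = mean(survey[True]) - mean(survey[False])
    trial_gap = mean(trial[True]) - mean(trial[False])
    return survey_gap, trial_gap

survey_gap, trial_gap = simulate()
print(f"survey gap: {survey_gap:.2f}  randomized gap: {trial_gap:.2f}")
```

In the simulated survey, supplement takers look clearly healthier even though the pill is inert, because the takers are mostly the health-conscious people. With random assignment the two groups contain the same mix of people, so the apparent benefit shrinks to roughly zero.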

Example #2: Defining a group based on lifestyle. One of your relatives is sleeping with her newborn baby. Her physician points out that studies have shown a greater risk of infant death in such populations. Putting on your critical thinking hat, you start generating factors (variables) that might correlate with the decision to sleep in the same bed as an infant. You quickly realize that a woman who sleeps with her baby is more likely to be...

  1. Poor (somebody in a crowded living situation or a very small apartment or house is less likely to have a spare room to use as a nursery)
  2. Single or divorced (a husband in the bed might make it more likely a baby would be put elsewhere)
  3. A substance abuser (somebody whose life is disorganized or chaotic may sleep with a baby simply because they haven't made other plans)

Other possibilities might be generated with a little imagination. However, even these three could explain why more babies who sleep with their mothers are likely to die. Another unpleasant but realistic possibility is that criminality or intentional infanticide might occur at higher rates in populations like this. In fact, one study did suggest that babies sleeping with alcoholic parents are more likely to be smothered (by accident or otherwise...people always claim it is an accident).

None of these possibilities can be ruled out. Therefore a finding such as, "Babies who sleep with their parents are more likely to die" may actually show that poor education, low income, criminality, or alcoholism are risk factors for babies.

Such issues are raised whenever a naturally occurring group is studied using observational research. The temptation—always—is to attribute interesting patterns in the data to group membership. However, the true explanation for any pattern found could be anything that correlates with group membership.

How can researchers deal with the problem of self-selected samples, in situations where a sample will always be self-selected? Self-selection will always be an issue when researchers study diet, living habits, ethnicity, or other characteristics that cannot be randomly assigned in an experiment. The best solution, in most cases, is to measure all the variables thought to be important, then do a multivariate analysis (i.e. look at the effects of each variable separately and together).

This might reveal, for example, that when families are matched in marital status, income, and education, having a baby sleep with parents poses no risk. Although a discussion of multivariate research is beyond the scope of an introductory psychology book, the basic idea is simple enough: (1) spend some time thinking about potentially important variables that might correlate with the self-selection process, and (2) measure them all, if possible. Then a skilled statistician can help determine whether one variable or another is more strongly associated with interesting patterns in the data.
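The idea of comparing matched families can be sketched with a tiny table of counts. The numbers below are invented purely for illustration (they are not from any study), and income stands in for the whole set of variables a real analysis would measure. The sketch compares the crude risk ratio with the risk ratio inside each income group.

```python
# (income group, sleeping arrangement) -> (number of infants, number of deaths)
# All counts are made up for illustration only.
counts = {
    ("lower income", "co-sleeping"): (8000, 16),
    ("lower income", "separate"): (2000, 4),
    ("higher income", "co-sleeping"): (2000, 1),
    ("higher income", "separate"): (8000, 4),
}

def risk(cells):
    """Deaths per infant across a list of (infants, deaths) cells."""
    infants = sum(n for n, _ in cells)
    deaths = sum(d for _, d in cells)
    return deaths / infants

def risk_ratio(income=None):
    """Risk for co-sleeping infants divided by risk for infants sleeping separately.

    With income=None all families are pooled (the crude comparison);
    otherwise only families in the named income group are compared.
    """
    def pick(arrangement):
        return [cell for (inc, arr), cell in counts.items()
                if arr == arrangement and income in (None, inc)]
    return risk(pick("co-sleeping")) / risk(pick("separate"))

crude = risk_ratio()
lower = risk_ratio("lower income")
higher = risk_ratio("higher income")
print(f"crude: {crude:.2f}  lower income: {lower:.2f}  higher income: {higher:.2f}")
```

With these made-up counts, co-sleeping looks about twice as risky when all families are pooled, yet within each income group the risk is identical. The crude pattern comes entirely from the fact that, in this invented table, lower-income families are more likely to co-sleep. Real multivariate analysis handles many variables at once, but the logic is the same.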

Multivariate research is much more detailed than locating a correlation between two variables. That is a good thing. When we consider only two variables (like "sleeping with infant" and "infant mortality") we are likely to put on blinders and feel as if we are analyzing a cause-effect relationship. As these examples show, that is not the case, particularly when samples are self-selected. By looking at more details, researchers are far more likely to uncover useful information. For example, research that asks about many different variables (income, marital status, details of sleeping arrangements) might uncover highly specific factors, such as the width of slats on a headboard of an adult bed, that pose a genuine risk to babies but could easily be addressed by attentive parents.

Copyright © 2007 Russ Dewey