Wednesday, January 13, 2016

How not to match items, or: On treating continuous variables as dichotomies, or: Why p > 0.05 does not imply that H0 is true

TL;DR: Don’t dichotomise continuous variables.

So about a year ago, I designed an experiment. It is a simple reading-aloud experiment: participants see one word at a time and need to read it aloud as quickly and accurately as possible. I manipulated word frequency (a continuous variable, with log frequencies between 0 and 2) and predictability of the pronunciation (a binary variable I made up). I matched the items on several psycholinguistic variables, so that any differences associated with the manipulation could not be attributed to a correlated variable. I did this by comparing the means across the four conditions and making sure that the differences between conditions were not significant (to be ultra-conservative, taking 0.1 as the cut-off). For example, I matched on orthographic N (for a given word, the number of words that can be created by exchanging a single letter; a continuous variable). The average orthographic N values (with SDs in brackets) across the four conditions are listed below:


                 High frequency (>1)   Low frequency (<1)
Predictable      6.3 (3.9)             6.1 (4.5)
Unpredictable    5.7 (3.8)             5.1 (4.9)

So far, so good – pretty standard practice, I believe. But here comes the complicating factor: in the analyses, we did not treat frequency as a dichotomised variable, but rather as a continuum. I decided to make a descriptives table for the manuscript which reflects this. Instead of presenting four columns for the different conditions and separate rows for each potential confound variable, I decided to perform, for each covariate, a separate linear model (LM) analysis, where the covariate is the dependent variable, predicted by the manipulated variables (frequency, predictability, and their interaction). Here is what I got:

Potential covariate   Main effect of frequency   Main effect of predictability   Interaction of predictability and frequency   Overall average and standard deviation
Orthographic N        t = 2.02, p = 0.04 *       t = 1.04, p = 0.30              t = -0.23, p = 0.82                           5.82 (4.35)

It looks like my beautifully matched item set isn’t beautifully matched, after all. Far out. (It would have been far more convenient to discover this prior to data collection, too.)
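For readers who want to run this kind of check on their own item set, here is a minimal Python sketch of the LM analysis described above, with the covariate as the dependent variable. The items are simulated, the link between frequency and orthographic N is invented for illustration, and the centring and ±0.5 coding are assumptions rather than the settings used for the table. `ols_t_values` is a hypothetical helper written with the standard library only; in practice one would use R's `lm` or a similar routine.

```python
import math
import random


def ols_t_values(X, y):
    """Ordinary least squares; returns (coefficients, t-values).

    X is a list of rows, each starting with 1.0 for the intercept.
    """
    n, k = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, augmented with an identity matrix so
    # that Gauss-Jordan elimination also yields (X'X)^-1.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, [b / s for b, s in zip(beta, se)]


# Simulated item set (all values hypothetical).
rng = random.Random(1)
rows, ortho_n = [], []
for _ in range(200):
    freq = rng.uniform(0.0, 2.0) - 1.0   # log frequency, centred at 1 (assumption)
    pred = rng.choice([-0.5, 0.5])       # predictability, coded +/-0.5 (assumption)
    # Invented confound: orthographic N rises with frequency.
    ortho_n.append(5.0 + 1.5 * freq + rng.gauss(0.0, 4.0))
    rows.append([1.0, freq, pred, freq * pred])

beta, t = ols_t_values(rows, ortho_n)
names = ["intercept", "frequency", "predictability", "interaction"]
for name, b, tv in zip(names, beta, t):
    print(f"{name:15s} b = {b:6.2f}, t = {tv:6.2f}")
```

As a rough guide, with this many items a |t| above about 2 corresponds to p < 0.05, so a large t for frequency would flag a covariate that is not matched along the frequency continuum.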

So, what happened? The critical difference is that the pairwise t-tests treated frequency as a dichotomy; in the LM analysis, it was treated as a continuum. As it turns out, dichotomising naturally continuous variables decreases statistical power: on average, fewer analyses will yield statistically significant p-values when there is a true difference. In other words, the non-significant differences between the four conditions did not show that the items were matched; p > 0.05 does not imply that H0 is true.
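The loss of information from a median split can be seen in a small simulation. The sketch below is not based on my items: it assumes a normally distributed predictor and an arbitrary true slope of 0.5, and compares the correlation with the outcome before and after the split. For a normal predictor, the expected ratio of the two correlations is sqrt(2/π), roughly 0.80.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2016)
n = 20000
# Hypothetical data: normal predictor, linear effect plus noise.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Median split, coded 0.5 (high) and -0.5 (low).
median_x = statistics.median(x)
x_split = [0.5 if xi > median_x else -0.5 for xi in x]

r_continuous = pearson_r(x, y)
r_dichotomised = pearson_r(x_split, y)
ratio = r_dichotomised / r_continuous
print(f"r continuous   = {r_continuous:.3f}")
print(f"r dichotomised = {r_dichotomised:.3f}")
print(f"ratio          = {ratio:.3f} (theory for a normal predictor: {math.sqrt(2 / math.pi):.3f})")
```

A smaller correlation with the same number of items means a smaller test statistic, which is why the dichotomised comparison can miss a relationship that the continuous analysis picks up.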

I would like to think that this was just my rookie error, but it seems that it is not generally known that dichotomising continuous variables decreases power. For example, my meta-analysis of my eight studies failing to replicate an effect was rejected, partly because an anonymous reviewer argued that I was reducing the power to find an effect by treating the critical variable as a continuum rather than a dichotomy. The implications of treating continuous variables as continua therefore go beyond mishaps in matching items: increasing experimental power is a central issue that is being discussed as a possible solution to the replication crisis. For psycholinguists, it should be relatively simple, in many cases, to increase power by using slightly more complex models (more complex compared to the traditional 2x2 ANOVA, that is) and treating continuous variables as, well, continuous variables.

To provide an illustration that dichotomising variables really does decrease power, I used the British Lexicon Project (Keuleers, Lacey, Rastle, & Brysbaert, 2012), which has reading aloud latencies for over 2,000 words and information on their linguistic characteristics. I took 1,000 repeated samples each of 20, 40, 60, …, 280, and 300 words. For each of these samples, I created LMs to test for the main effects of frequency, length (number of letters), orthographic N, and bigram frequency. It is pretty well established that frequency and length effects are real, although orthographic N and bigram frequency effects are still somewhat elusive. The question is how many of these analyses yield p-values smaller than 0.05, and how this compares between the continuous and the dichotomised models. Here is what happens when all variables are treated as continua: the x-axis shows the number of words in the sample, and the y-axis the number of analyses with p < 0.05 out of 1,000. The red dashed line indicates 80% power, which is considered to be good.


Next, I dichotomised frequency: for each sample, frequency was coded as high (0.5) if its value was above the median for that particular sample, and as low (-0.5) if it was below the median. The other predictors and all other aspects of the analyses were identical to the models above. The analyses showed that dichotomising frequency decreases power: power was 55.6% at N = 20, 88.8% at N = 40, and 98.6% at N = 80, and only at N = 100 did it reach 100%. In comparison, power with the continuous frequency measure was 73.1%, 96.9%, and 100% for N = 20, 40, and 60, respectively (see Figure).

An inspection of the average slopes showed that, for each set size, the slope estimates for the effect of frequency were steeper for the dichotomous than for the continuous measure. For example, at N = 300, the average slope was β = -55.05 for the dichotomous measure and β = -35.1 for the continuous measure. However, the standard deviations of the slopes also differ across the two measures, with a consistently higher standard deviation for the dichotomous than for the continuous measure (e.g., for N = 300, the standard deviations are 5.65 and 2.80, respectively). Relative to its variability, the dichotomous slope is therefore the smaller of the two (at N = 300, 55.05/5.65 ≈ 9.7, compared with 35.1/2.80 ≈ 12.5). Thus, while dichotomising a continuous variable, on average, increases the raw effect size, it also increases the variability of the effect size estimate, resulting in lower power.
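The logic of this comparison can be sketched in a few lines of Python. This is a simplified stand-in, not the analysis reported above: it uses simulated data instead of the British Lexicon Project, a single predictor instead of four, and made-up values for the slope (-35), the noise (SD 30), and the frequency distribution. The critical values in `T_CRIT` are the standard two-sided 5% values of Student's t for df = N - 2.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for df = n - 2.
T_CRIT = {20: 2.101, 40: 2.024, 80: 1.991}


def slope_t(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    sse = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(sse / (n - 2) / sxx)


def estimate_power(n, n_sims=1000, seed=1):
    """Share of simulated samples with a significant frequency effect,
    for the continuous and the median-split predictor."""
    rng = random.Random(seed)
    hits_cont = hits_dich = 0
    for _ in range(n_sims):
        # Hypothetical population: log frequency around 1, true slope -35 ms.
        freq = [rng.gauss(1.0, 0.5) for _ in range(n)]
        rt = [600.0 - 35.0 * f + rng.gauss(0.0, 30.0) for f in freq]
        # Median split within this sample, coded 0.5 / -0.5 as in the text.
        med = statistics.median(freq)
        split = [0.5 if f > med else -0.5 for f in freq]
        hits_cont += abs(slope_t(freq, rt)) > T_CRIT[n]
        hits_dich += abs(slope_t(split, rt)) > T_CRIT[n]
    return hits_cont / n_sims, hits_dich / n_sims


power = {n: estimate_power(n) for n in T_CRIT}
for n, (p_cont, p_dich) in power.items():
    print(f"N = {n:3d}: continuous {p_cont:.1%}, dichotomised {p_dich:.1%}")
```

Both analyses see exactly the same simulated samples, so any difference in the proportion of significant results comes from the coding of frequency alone.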

In summary, psycholinguists can take a step towards increasing their experimental power and thus creating a more replicable science by not dichotomising continuous variables. This potential change in analysis methods has no drawbacks and only gains.

Reference

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287-304. doi:10.3758/S13428-011-0118-4

*******************************************
Edit 1/3/15: Fixed a typo.