Monday, June 19, 2017

Should we increase our sample sizes, or keep them the same? We need to make up our minds

Amidst the outcries and discussions about the replication crisis, there is one point on which there is a general consensus: very often, studies in psychology are underpowered. An underpowered study is one that, assuming the effect is real, runs a high risk of failing to detect it at the significance threshold. The word that we need to run bigger studies has seeped through the layer of replication bullies to the general scientific population. Papers are increasingly often being rejected for having small sample sizes. If nothing else, that should be reason enough to care about this issue.

Despite the general consensus about the importance of properly-powered studies, there is no real consensus about what we should actually do about it, in practice. Of course, the solution, in theory, is simple – we need to run bigger studies. But this solution is only simple if you have the resources to do so. In practice, as I will discuss below, there are many issues that remain unaddressed. I argue that, despite the upwards trend in psychological science, drastic measures need to be taken to enable scientists (regardless of their background) to produce good science.

For those who believe that underpowered studies are not a problem
Meehl, Cohen, Schmidt, Gelman – they all explain the problem of underpowered studies much better than I ever could. The notion that underpowered studies give you misleading results is not an opinion – it’s a mathematical fact. But seeing is believing, and if you still believe that you can get useful information with small or medium-sized effects and 20 participants, the best way to convince you otherwise is to show you some simulations. If you haven’t tinkered around with simulating data, download R, copy-and-paste the code below, and see what happens. Do it. Now. It doesn't matter if you're an undergraduate student, professor, or lay person who somehow stumbled across this blog post. I’ll wait. 

*elevator music*

# Simulating the populations
# This gives us a true effect in the population of Cohen's d = 0.4.
# Sampling 20 participants from the population
# Calculating the means for the two samples
# Note how the means vary each time we run the simulation.
# Note how many of the results give you a “significant” p-value.
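
The code that belongs under these comments is not reproduced above. As a minimal sketch of what they describe, and not necessarily the original script, the following R code assumes simulated populations of 100,000 people each and uses hypothetical names (pop1, pop2, run_study):

```r
# Sketch only: object names and the population size (100,000) are assumptions.

# Simulating the populations
# This gives us a true effect in the population of Cohen's d = (106 - 100) / 15 = 0.4.
pop1 <- rnorm(100000, mean = 100, sd = 15)
pop2 <- rnorm(100000, mean = 106, sd = 15)

run_study <- function(n = 20) {
  # Sampling n participants from each population
  sample1 <- sample(pop1, n)
  sample2 <- sample(pop2, n)
  # Calculating the means for the two samples, plus the t-test p-value
  c(mean1 = mean(sample1),
    mean2 = mean(sample2),
    p = t.test(sample1, sample2)$p.value)
}

# Run this line repeatedly: the means and the p-value change every time.
run_study()

# Run 2000 simulated studies and compute the share with p < .05.
results <- replicate(2000, run_study())
prop_sig <- mean(results["p", ] < 0.05)
prop_sig

# Analytic power for the same design, for comparison with prop_sig.
power.t.test(n = 20, delta = 6, sd = 15, sig.level = 0.05)$power
```

Changing the default n in run_study() is a quick way to see how much larger the samples need to be before most simulated studies detect the effect.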

The populations that we are simulating have a mean (e.g., IQ) of 100 and 106, respectively, and a standard deviation of 15. The difference can be summarised as a Cohen’s d effect size of 0.4, which falls between Cohen’s benchmarks for a small (0.2) and a medium (0.5) effect. One may get an intuitive feeling for how strong an experimental manipulation would need to be to cause a true difference of 6 IQ points. The power (i.e., probability of obtaining a significant result, given that we know that the alternative hypothesis is true and we have an effect of Cohen’s d = 0.4 in the population) is 23% with 20 participants per cell (i.e., 40 altogether). You should see the observed means jumping around quite a lot, suggesting that if you care about quantifying the size of the effect you will get very unstable results. You should also see a large number of simulations returning non-significant effects, despite the fact that we know that there is an effect in the population, suggesting that if you want to make reject/accept H0 decisions based on a single study you will be wrong most of the time.

For the professors who forgot what it’s like to be young
So, we need to increase our sample sizes if we study small-to-medium effects. What’s the problem? The problems are practical in nature. Maybe you are lucky enough to have gone through all stages of your career at a department that has a very active participant pool, unlimited resources for paying participants, and maybe even an army of bored research assistants just waiting to be assigned the task of going out and finding hundreds of participants. In this case, you can count yourself incredibly lucky. My PhD experience was similar to this. With a pool of keen undergraduates, enough funds to pay a practically unlimited number of participants, and modern booth labs where I could test up to 8 people in parallel, I once managed to collect enough data for a four-experiment paper within a month. I list the following subsequent experiences to respectfully remind colleagues that things aren’t always this easy. These are my experiences, of course – I don’t know how many people have similar stories. My guess is that I’m not alone. Early-career researchers and scientists from non-first-world countries, where giving funding to social sciences is not really a thing yet, probably have similar experiences. Or maybe I’m wrong about that, and I’m just unlucky. Either way, I would be interested to hear about others’ experiences in the comments.

-       Working in a small, stuffy lab with no windows and only one computer that takes about as long to start as it takes you to run a participant.
-       Relying on bachelor students to collect data. They have no resources for this. They can ask their friends and families, stop people in the corridor, and only their genuine interest and curiosity in the research question stops them from just sitting in the lab for ten hours and testing themselves over and over again, or learning how to write code for a random number generator to produce the data that is expected of them.
-       Paying for participants from your own pocket.
-       Commuting for two hours (one way) to a place with participants, with a 39-degree fever, then trying hard not to cough while the participants do tasks involving voice recording.
-       Pre-registering your study, then having your contract run out before you have managed to collect the number of participants you’d promised.
-       Trying to find free spots on the psychology department notice boards or toilet doors to plaster the flyer for your study between an abundance of other recruitment posters, and getting, on average, less than one participant per week, despite incessant spamming.
-       Raising the issue of participant recruitment with senior colleagues, but not being able to come up with a practically feasible way to recruit participants more efficiently.
-       Trying to find collaborators to help you with data collection, but learning that while people are happy to help, they rarely have spare resources they could use to recruit and test participants for you.
-       Writing to lecturers to ask if you can advertise your study in their lectures. Being told that so many students ask the same question that allowing everyone to present their study in class is just not feasible anymore.

I can consider myself lucky in the sense that I’m doing mostly behavioural studies with unselected samples of adults. If you are conducting imaging studies, the price of a single participant cannot be covered from your own pocket if the university decides not to pay. If you are studying a special population, such as people with a rare disease, finding seven participants in the entire country during your whole PhD or post-doc contract could already be an achievement. If you are conducting experiments with children, bureaucratic hurdles may prevent you from directly approaching your target population.

So, can we keep it small?
It’s all well and good, some people say, to make theoretical claims about the sample sizes that we need. But there are practical hurdles that make it impossible in many cases. So, can we ignore the armchair theoreticians’ hysteria about power and use practical feasibility to guide our sample sizes?

Well, in theory we can. But in order to allow science to progress, we, as a field, need to make some concessions:

-       Every study should be published, i.e., there should be no publication bias.
-       Every study should provide full data in a freely accessible online repository.
-       Every couple of years, someone needs to do a meta-analysis to synthesise the results from the existing small studies.
-       Replications (including direct replications) are not frowned upon.
-       We cannot, ever, draw conclusions from a single study. 

At this stage, none of these premises are satisfied. Therefore, if we continue to conduct small studies in the current system, those that show non-significant results will likely disappear in a file drawer. Ironically, the increased awareness of power amongst reviewers is increasing publication bias at the same time: reviewers who recommend rejection based on small sample sizes have good intentions, but this leads to an even larger amount of data that never see the light of day. In addition, studies that have marginally significant effects will be p-hacked beyond recognition. For meta-analyses, the published literature will then give us a completely skewed view of the world. And in the end, we’ve wasted a lot of resources and learned nothing.
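
To make the "skewed view" concrete, here is a minimal sketch in R (not part of the original post). It assumes the same true effect as in the simulation above (Cohen's d = 0.4) with 20 participants per group, and compares the average observed effect across all simulated studies with the average across only those that reached p < .05. The names one_study, d_all and d_published are hypothetical, and "published" is a deliberately crude stand-in for publication bias:

```r
# Sketch only: assumes a true effect of d = 0.4 and n = 20 per group.
true_d <- 0.4
n <- 20

one_study <- function() {
  control   <- rnorm(n, mean = 0, sd = 1)
  treatment <- rnorm(n, mean = true_d, sd = 1)
  # Observed Cohen's d, using the pooled standard deviation (equal group sizes)
  pooled_sd <- sqrt((var(control) + var(treatment)) / 2)
  c(d = (mean(treatment) - mean(control)) / pooled_sd,
    p = t.test(treatment, control, var.equal = TRUE)$p.value)
}

studies <- replicate(5000, one_study())

# Average effect size if every study is available to the meta-analyst
d_all <- mean(studies["d", ])

# Average effect size if only significant studies make it into the literature
d_published <- mean(studies["d", studies["p", ] < 0.05])

c(all_studies = d_all, significant_only = d_published)
```

With samples this small, a study can only reach significance when its observed effect is well above the true one, so the average over the significant subset overshoots. A meta-analysis that sees only that subset inherits the overestimate.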

So, increasing sample size it is?
Unless we, as a field, tackle the issues described in the previous section, we will need to increase our sample sizes. There is no way around it. This solution will work, under a single premise:

-       Research is not for everyone: Publishable studies will be conducted by a handful of labs in elite universities, who have the funding to recruit hundreds of participants within weeks or months. These will be the labs that will produce high-quality research at a fast pace, which will result in them winning more grants and producing even more high-quality research. And those who don’t have the resources to conduct large studies from the beginning? Well, fuck ‘em. 

This is a valid viewpoint, as a world where this is the norm would not have any of the problems associated with the small-study world described above. And yet, I would say that such a world would be very bad. First, it would be bad for individuals such as me (of course, I have some personal interest in writing this blog post), who spend months and months lugging the testing laptop around through trains and different departments in search of participants, while other researchers snap their fingers and get their research assistant to run the same study in a matter of weeks. Second, it disadvantages populations of researchers who may have systematically different views. As mentioned above, populations with fewer resources probably include younger researchers and those from non-first-world countries. Reducing the opportunity for these researchers to contribute to their field of expertise will create a monotonous field, where scientific theories are based, to a large extent, on the musings of old white men. By this process, the field would lose an overwhelming amount of potential by locking out a majority of scholars.

In short, I argue that publishing only well-powered studies without consideration of practical issues that some researchers face will be bad for individual researchers, as well as the whole field. So, how can we increase power without creating a Matthew Effect, where the rich get richer and the poor get poorer? 

-       Collaborate more, as I’ve argued here.
-       Routinely use StudySwap to look for collaborators who help you to get the sample size you need, but also to collect data for other researchers if you happen to have some bored research assistants or lots of keen undergrads.
-       For the latter part of the last point, “rich” researchers will need to start sacrificing their own resources, which they could well use for a study of their own, that would have a chance of getting them another first-author publication instead of ending up as fifth out of seven authors on someone else’s paper.
-       As a logical consequence of the last point, researchers need to change their mindset, such that they prefer to publish fewer first-author papers and to spend more time collecting data, both for their own pet projects and for others'.
-       And why are we so obsessed with first-author publications in the first place? It’s our incentive system, of course. We, as a field, should stop giving scholarships, jobs, grants, and promotions to researchers with the most first-author publications.

And where to now?
Perhaps an ideal world would consist of large-scale studies, and small studies and meta-analyses, as it kind of does already. But in order to allow for the build-up of knowledge in such a system, and to be able to separate true effects from crap in candy wrappers, we, as a field, need to fix all of the issues above.

And in the meantime, there are more questions than answers for individual researchers. Do I conduct a large study? Do I bank all of my resources on a single experiment, with a chance that, for whatever reason, it may not work out, and I will finish my contract without a single publication? Do I risk looking, in front of a prospective longish-term employer, like a dreamer, one who promises the moon but in the end fails to recruit enough participants? Or do I conduct small studies during my short-term contract? Do I risk that journals will reject all of my papers because they are underpowered? Do I run a small study, knowing that, most likely, the results will be uninterpretable? Knowing that I may face pressure to p-hack to get publishable results, from journals, collaborators, or the shrewd little devil sitting on my shoulder, reminding me that I won’t have a job if I don’t get publications?

Wednesday, April 19, 2017

How much statistics do psychological scientists need to know? Also, a reading list

TL;DR: As much as possible.

The question of how much statistics psychological scientists should know has been discussed numerous times on Twitter and in psychology methods groups on Facebook. The consensus seems to be that psychologists need to know some stats, but they don’t need to be statisticians. When it comes to specifics, though, there does not seem to be any consensus: some argue that knowing the basics of the tests that are useful for one’s specific field is enough, while others argue that a thorough understanding of the concepts is important.

Here, I argue, based on my own experience, that a thorough understanding of statistics substantially enhances the quality of one’s work. The reason why I think statistical knowledge is really important is that the amount of knowledge you have constrains the experiments you can conduct: If one’s only tool is ANOVA, there is only a limited set of possible experiments that fit within the mould of this statistical test. *

First, a little bit about my stats background. I don’t remember much from high school maths: I think I had a lot of motivation to repress any memories about it. One of the biggest mysteries of my undergraduate course is how I even passed my statistics courses. I guess they had to scale everyone’s marks to avoid failing too many students. After these experiences, though, I have spent a lot of time learning about the tools that are the cornerstone of making sense of my experiments. As my supervisor told me: the best way to learn about statistical analyses is when you have some data that you care about. When I started my PhD, I dreaded the day when I would be asked to do anything more complex than a correlation matrix. But during the PhD, I learned, through trial and error and with a lot of guidance from experienced colleagues, to analyse data in R with linear mixed-effects models and Bayes factors. When I started my post-doc, driven by my curiosity about how it is possible that we can get two identical experiments with completely different results (i.e., with p-values on different sides of the significance threshold), I decided to learn more about how this stats thing actually works. My interest was further sparked by several papers I read on this topic, and a one-day workshop given by Daniël Lakens in Rovereto, which I happened to hear about via Twitter. It culminated with my signing up for a part-time distance course, a graduate certificate in statistics, which I’m due to finish in June.

This learning process has taken a lot of time. Cynically speaking, I would not recommend it to early career researchers, who would probably maximise their chances of success in academia by focussing on publishing lots of papers (quantity is more important than quality, right?). If you have only a short-term contract (anywhere between 6 months and 2 years), you probably won’t have time to do both. Besides, you will never again want to do N=20 studies, and unless your department is rich, conducting a high-powered experiment might not be feasible during a short-term post-doc contract. Ideologically speaking, I would recommend this learning process to every social scientist who feels that they don’t know enough. In my experience, it’s worth putting one’s research on hold to learn about stats: moving from following a set of arbitrary conventions** to understanding why these conventions make sense is a liberating experience, not to mention the increase of the quality of your work, and the ability to design studies that maximise the chance of getting meaningful results.

Useful resources
For anyone who is reading this blog post because they would like to learn more about statistics, I have compiled a list of resources that I found useful. They contain both statistics-oriented material, and material which is more about philosophy of science. I see them as two sides of the same coin, so I don’t make a distinction between them below.

First of all: If you haven’t already done so, sign up on Twitter, and follow people who tweet about stats. Read their blogs. I am pretty sure that I learned more about stats this way than I did during my undergraduate degree. Some people I’ve learned from (it’s not a comprehensive list, but they’re all interconnected: if you follow some, you’ll find others through their discussions): Daniël Lakens (@lakens), Dorothy Bishop (@deevybee), Andrew Gelman (@StatModeling), Hilda Bastian (@hildabast), Alexander Etz (@AlxEtz), Richard Morey (@richardmorey), Deborah Mayo (@learnfromerror) and Uli Schimmack (@R_Index). If you’re on Facebook, join some psychological methods groups. I frequently lurk on PsycMAP and the Psychological Methods Discussion Group.

Below are some papers (again, a non-comprehensive and somewhat sporadic list) that I found useful. I tried to sort them in order of difficulty, but I didn’t do it in a very systematic way (also, some of those papers I read a long time ago, so I don’t remember how difficult they were). I think most papers should be readable to most people with some experience with statistical tests. As an aside: even those who don’t know much about statistics may be aware of frequent discussions and disagreements among experts about how to do stats. The readings below contain a mixture of views, some of which I agree with more than with others. However, all of them have been useful for me in the sense that they helped me understand some new concepts.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.

Savalei, V., & Dunn, E. (2015). Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology, 6, 245.

Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.

Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.

Lakens, D., & Evers, E. R. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278-292.

Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., ... & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.

Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195-244.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.

Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., ... & Morey, R. D. (2015). A power fallacy. Behavior Research Methods, 47(4), 913-917.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103-123.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Kline, R. B. (2004). What's Wrong With Statistical Tests--And Where We Go From Here. (Chapter 3 from Beyond Significance Testing. Reforming data analysis methods in behavioural research. Washington, DC: APA Books.)

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315.

Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PloS One, 11(3), e0152719.

Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behavioral Ecology and Sociobiology, 65(1), 47-55.

In terms of books, I recommend Dienes’ “Understanding Psychology as a Science” and McElreath’s “Statistical Rethinking”.

Then there are some online courses and videos. First, there is Daniël Lakens’ Coursera course in statistical inferences. From a more theoretical perspective, I like this MIT Probability course by John Tsitsiklis, and Meehl’s Lectures. For something more serious, you could also try a university course, like a distance education graduate certificate in statistics. Here is a very positive review of the Sheffield University course which I am currently doing. However, at least if your maths skills are as bad as mine, I would not recommend doing it on top of a full-time job.

Learning stats is a long and never-ending road, but if you are interested in designing strong and informative studies and being flexible with what you can do with data, it is a worthwhile investment. There is always more to learn, and no matter how much I learn I continue to feel like I know less than I should. However, it’s a steep learning curve, so even investing a little bit of time and effort can already have beneficial effects. This is possible, even if you have only fifteen minutes to spare each day, through the resources that I tried to do justice to in my list above.

I should conclude, I think, by thanking all those who make these resources available, be it via published papers, lectures, blog posts, or discussions on social media.  

* To provide an example, I will do some shameless self-advertising: in a paper that came out of my PhD, we got data which seemed uninterpretable at first. However, thanks to collaboration with a colleague with a mathematics background, Serje Robidoux, we could make sense of the data with an optimisation procedure. While an ANOVA would not have given us anything useful, the optimisation procedure allowed us to conclude that readers use different sources of information when they read aloud unfamiliar words, and that there is individual variation in the relative degree to which they rely on these different sources of information. This is one of my favourite papers I’ve published so far, but it’s only been cited by myself to date. (*He-hem!*) Here is the reference:
Schmalz, X., Marinus, E., Robidoux, S., Palethorpe, S., Castles, A., & Coltheart, M. (2014). Quantifying the reliance on different sublexical correspondences in German and English. Journal of Cognitive Psychology, 26(8), 831-852.

** Conventions such as:
“Control for multiple comparisons.”
“Don’t interpret non-significant p-values as evidence for the null hypothesis.”
“If you have a marginally significant p-value, don’t collect more data to see if the p-value drops below the threshold.”
“The p-value relates to the probability of the data, not the hypothesis.”
“For Meehl’s sake, don’t mess up the exact wording of the definition of a confidence interval!”
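
To see why the convention about collecting more data is more than an arbitrary rule, here is a minimal sketch in R (not from the original post). It simulates a world in which the null hypothesis is true and compares a fixed sample of 20 per group with a researcher who tests, adds 10 more participants per group whenever the result is not yet significant, and tests again, up to 60 per group. This is a broader version of the "marginally significant" scenario. The function name and the batch sizes are assumptions chosen for illustration:

```r
# Sketch only: both groups come from the same population, so any
# "significant" result is a false positive.
optional_stopping <- function(n_start = 20, n_step = 10, n_max = 60) {
  x <- rnorm(n_start)
  y <- rnorm(n_start)
  repeat {
    p <- t.test(x, y)$p.value
    # Stop as soon as p < .05, or when the maximum sample size is reached
    if (p < 0.05 || length(x) >= n_max) return(p)
    # Otherwise collect another batch of participants and test again
    x <- c(x, rnorm(n_step))
    y <- c(y, rnorm(n_step))
  }
}

p_peeking <- replicate(5000, optional_stopping())
p_fixed   <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)

# Share of false positives under each strategy
rate_peeking <- mean(p_peeking < 0.05)
rate_fixed   <- mean(p_fixed < 0.05)
c(peeking = rate_peeking, fixed_n = rate_fixed)
```

The fixed-sample rate should hover around the nominal 5%, while the peeking strategy gets several chances to cross the threshold and therefore produces false positives noticeably more often.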

Tuesday, April 11, 2017

Selective blindness to null results

A while ago, to pass time on a rainy Saturday afternoon, I decided to try out some publication bias detection techniques. I picked the question of gender differences in multitasking. After all, could there be a better question for this purpose than this universally known ‘fact’? I was surprised, however, to find not two, not one, but zero peer-reviewed studies that found that women were better at multitasking than men. The next surprise came when I started sharing my discovery with friends and colleagues. In response to my “Did you know that this women-are-better-at-multitasking-thing is a myth?” I would start getting detailed explanations about possible causes of a gender difference in multitasking.

Here is another anecdote: last year, I did a conference talk where I presented a null result. A quick explanation of the experiment: a common technique in visual word recognition research is masked priming, where participants are asked to respond to a target word, which is preceded by a very briefly presented prime. The duration of the prime is such that participants don’t consciously perceive it, but the degree and type of overlap between the prime and the target affects the response times to the target. For example, you can swap the order of letters in the prime (jugde – JUDGE), or replace them with unrelated letters (julme – JUDGE). I wanted to see if it matters whether the transposed letters in the prime create a letter pair that does not exist in the orthography. As it turns out, it doesn’t. But despite my having presented a clear null result (with Bayes factors), several people came up to me after my talk, and asked me if I thought this effect may be a confounding variable for existing studies using this paradigm!

Though I picked only two examples, such selective blindness (or deafness) to being told that an effect is not there seems to be prevalent in academia. I’m not just talking about instances of papers citing those articles which support their hypothesis, and conveniently forgetting that a handful of studies failed to find evidence for it (or citing them as providing evidence for it even when they don’t). In this case, my guess would be that there are numerous factors at play, including confirmation bias and deliberate strategies. In addition to this, however, we seem to have some mechanism to preferentially perceive positive results over null-results. This seems to go beyond the common knowledge that non-significant p-values cannot be interpreted as evidence for the null, or the (in many cases well-justified) argument that a null-result may simply reflect low power or incorrect auxiliary hypotheses. The lower-level blindness that I’m talking about could reflect our expectations: surely, if someone writes a paper or does a conference presentation, they will have some positive results to report? Or perhaps we are naturally tuned to understand the concept of something being there more readily than the concept of something not being there.

I’ve argued previously that we should take null results more seriously. It does happen that null results are uninterpretable or uninformative, but a strong bias towards positive results at any stage of the scientific discourse will provide a skewed view of the world. If selective blindness to null results exists, we should become aware of it: we can only evaluate the evidence if we have a full picture of it.