Replication in behavioral research

Via boingboing, I found an article in the Chronicle of Higher Education on some (thus far failed) efforts to publish a failure to replicate the (in)famous findings “supporting” the existence of “psi” (i.e., ESP, or pre-cognition, or some such). I put “supporting” in scare quotes because, from what I’ve read, there are real methodological issues with the Bem study that started the most recent furor over this issue.

These are beside the point with respect to this blog post, and there is a lot written about them online, so I'm not feeling inclined to retrace the steps I took so many months ago to come to this conclusion. The short version is that I remember being convinced that the sample sizes in the experiments reported by Bem suggested that the data had been probed repeatedly and that the experiments had been stopped once statistically significant results were found. That is, it seems reasonably likely to me that the effects found by Bem are due, at least in part, to the statistical significance filter.
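To see why that kind of stopping rule matters, here's a minimal sketch (mine, not Bem's actual procedure; the number of looks, batch size, and alpha are just illustrative assumptions) of how peeking at the data and stopping at the first significant result inflates the false-positive rate even when there is no effect at all:

```python
# Sketch: how "stop when p < .05" inflates false positives under a true null.
# All numbers (looks, batch size, alpha) are illustrative, not Bem's design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 5000
batch = 10          # subjects added between looks at the data
max_looks = 10      # peek after every batch, up to 100 subjects total

fixed_hits = 0      # significant when tested once, at the final N
peeking_hits = 0    # significant at *any* interim look (optional stopping)

for _ in range(n_experiments):
    data = rng.normal(0.0, 1.0, batch * max_looks)   # true effect is exactly zero
    # test once at the final sample size
    if stats.ttest_1samp(data, 0.0).pvalue < alpha:
        fixed_hits += 1
    # test after every batch and stop at the first "significant" result
    for look in range(1, max_looks + 1):
        if stats.ttest_1samp(data[: look * batch], 0.0).pvalue < alpha:
            peeking_hits += 1
            break

print(f"false-positive rate, single test at final N: {fixed_hits / n_experiments:.3f}")
print(f"false-positive rate, stop at first p < .05:  {peeking_hits / n_experiments:.3f}")
```

The single-test rate hovers around the nominal .05, while the stop-at-first-significance rate is several times higher, which is the basic worry behind the significance filter.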

Okay, so the point of this blog post is to comment on a quote from the Chronicle article. Some researchers tried to replicate some of Bem’s findings, wrote up their negative results, and then said this in an email to the author of the Chronicle article: “Here’s the story: we sent the paper to the journal that Bem published his paper in, and they said ‘no, we don’t ever accept straight replication attempts’.” To the credit of the Chronicle writer, he points out the file drawer effect, quoting a New York Times piece about the Bem work:

And, of course, the unwritten rule that failed studies — the ones that find no effects — are far less likely to be published than positive ones. What are the odds, for instance, that the journal would have published Dr. Bem’s study if it had come to the ho-hum conclusion that ESP still does not exist?
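To put a rough number on the file drawer, here's another small sketch (mine, with made-up parameters for the true effect, sample size, and number of studies): if only the studies that happen to cross p < .05 get written up, the "published" effect sizes look impressive even when the true effect is tiny.

```python
# Sketch of the file drawer: with a small true effect and modest samples,
# keeping only the "significant" studies inflates the apparent effect size.
# Effect size, sample size, and number of studies are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.1      # in standard-deviation units (a small effect)
n_per_study = 30
n_studies = 2000

all_effects, published_effects = [], []
for _ in range(n_studies):
    sample = rng.normal(true_effect, 1.0, n_per_study)
    observed = sample.mean()                      # observed effect in SD units
    all_effects.append(observed)
    if stats.ttest_1samp(sample, 0.0).pvalue < 0.05:
        published_effects.append(observed)        # only "positive" studies get published

print(f"true effect:                     {true_effect:.2f}")
print(f"mean effect across all studies:  {np.mean(all_effects):.2f}")
print(f"mean effect in 'published' ones: {np.mean(published_effects):.2f} "
      f"({len(published_effects)} of {n_studies} studies)")
```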

It’s bothered me for some time that replication is not a regular occurrence in behavioral research, but it’s particularly distressing to see an anti-replication ethic stated so plainly. Don’t get me wrong – I am pretty sure I understand why replications are not a part of behavioral research.

We could take a very cynical tack and say that replication isn’t common because behavioral effects are small, unreliable, and, more likely than we’d like to admit, statistical artifacts (i.e., false). Now, this may be the case some of the time, but my guess is that it is not the case all that often. I think replication is uncommon – and occasionally explicitly frowned on – because behavioral research is very difficult, and because, when we study behavior, we are studying extremely complex systems. There are a huge number of factors that can influence behavioral effects, many of which are unknown, ignored, or otherwise not taken into account when designing, conducting, and reporting behavioral studies.

There’s the old joke about how psychology is the science of college sophomores, but there’s an important insight in the humor. The characteristics of college sophomores differ from the characteristics of non-college-sophomores in any number of possibly relevant ways. Of course, this basic point applies, in principle, to just about any subset of the human species that might be available for your particular behavioral study. We simply don’t know which of the factors we haven’t accounted for might be relevant in any given project. We have a serious case of meta-ignorance on top of our ignorance, with unknowns and unknown unknowns and all that.

I’ve probably overstated the case a bit, since it seems likely to me that lots of behavioral effects would, in fact, be fairly easy to replicate, if there were any incentive for people to carry out replications. The difficulty of doing a good replication may well account for a cultural disinclination to do replications, since a failure to replicate a behavioral finding could be due, of course, to the original finding being bunk, but it could also be due to any number of unknown differences between the original study and the replication, differences that no one thought to consider in the first (or second) place. A long-term cultural disinclination could then lead to explicit policies to reject straight replications without review. Sad, but understandable (and also maybe wrong – the cynical view, or some other view I haven’t brought up here, could provide a better explanation, after all).

So, what’s to be done? I’ve thought about this some, but the best I’ve come up with so far is to assign, when and where appropriate, replications as term projects. At the beginning of graduate school, I found it frustrating to be expected to formulate reasonable research questions, design appropriate experiments, and collect and analyze data while simultaneously learning the content of a course. When you only know what’s been covered in the first few weeks of a class, it limits the range of possible term project topics pretty severely. As a teacher, I’ve thought about this from the other side, which is to say that I’ve pondered at length the organization of a course’s content and how it influences students’ abilities to do good, and interesting, term projects. Of course, my experiences with this are tied very closely to linguistics and second language studies. My indirect experience with psychology (i.e., what I heard from the other lab members when I was a linguistics and cog-sci student in a psych lab) indicates that research projects are less tied to ‘content courses’.

Anyway, there’s an obvious educational benefit to assigning replications as term projects, since students are likely to learn quite a bit from trying to replicate what someone with a lot more research experience did. The scientific value of this is, um, let’s call it less obvious. Less experience doing research leaves a lot of extra room for unknown but possibly relevant differences to creep in and mess things up.

There’s also the scientifically more promising psychfiledrawer.org, which seems to be a good-faith (and well-funded) effort to make replication attempts available to behavioral researchers. Just because journals won’t publish replications doesn’t mean they shouldn’t be published.

(A quick, and amusing, aside: When looking for links about the Bem study, I found a page on Bem’s website from 1994 (1994!) that says “Recent laboratory research suggests that parapsychologists might finally have cornered their elusive quarry: Reproducible evidence for psychic functioning.” Or maybe not.)


2 Responses to Replication in behavioral research

  1. Mike Gasser does (or at least used to do) this with his Introduction to NLP class. You pick a research paper in NLP and try to replicate it – and he’s happy to supply suggestions if you can’t choose one on your own. At some point about three weeks from the end of class, we all presented our projects and got feedback from the rest of the class, leaving us enough time to overcome any remaining hurdles. The majority of participants failed to replicate the results reported in the paper, usually due to vagueness in the original paper about implementation details and the original authors’ failure to respond to email requests in requisite detail. I heard a rumor that he had dropped the practice due to the high rate of failure to replicate, but I can neither confirm nor deny it, and I’m not honestly sure if he even still teaches intro to NLP. CSCI has undergone a lot of restructuring since forever ago when I started grad school. But I did think it was a really good idea from a pedagogical standpoint.

  2. The idea of having intro grad students do replications as course projects is very interesting. But I think to be useful for science, as opposed to pedagogy or apprenticeship, the replications would have to be very true to the original studies in all their procedural care and detail. Replicating the Bem studies, for example, would be very challenging for a class project because there are strong constraints on the random number generators, the types of stimuli, etc.

    When I taught the capstone course in cognitive science, I had students program an experiment (of their choice), analyze the data, and fit the data with a cognitive model. Even trying to replicate well-established phenomena, such as search times in visual search, was a big challenge for a course assignment, because there are so many conditions in any realistic experiment and so few experiment subjects to provide data. I think it was a great exercise for the students, but the data were more-or-less unusable for science at large.

    One other detail: For course projects with minimal risk, there’s no need to get IRB approval. But for data that might be published (even in an online database), you must have IRB approval.

    But the more I think about this, the more I like the idea. For scientific replication, great care must be taken in the procedural details. This could be done in a course, but it would be a major undertaking and, I think, it would have to be the focus of the course. That is, it would be a course specifically designed for the systematic replication of a particular experiment/phenomenon. There could be sub-teams of students, each team working on their own replication, all informed by course-wide discussion of exact procedural details of the original experiment. Results from this sort of careful replication might be of high enough caliber to submit to online archives. And it could be a useful exercise for students at the same time.
