Rationally Speaking #155 - Uri Simonsohn on "Detecting fraud in social science"
Rationally Speaking Podcast
- 20 Mar 2016
He's been called a "Data vigilante." In this episode, Prof. Uri Simonsohn describes how he detects fraudulent work in psychology and economics -- what clues tip him off? How big of a problem is fraud relative to other issues like P-hacking? And what solutions are there?
Rationally Speaking is a presentation of New York City Skeptics, dedicated to promoting critical thinking, skeptical inquiry and science education. For more information, please visit us at nycskeptics.org.
Welcome to Rationally Speaking, the podcast where we explore the borderlands between reason and nonsense. I'm your host, Julia Galef, and with me is today's guest, Professor Uri Simonsohn. Uri is an associate professor at the Wharton School of Business at the University of Pennsylvania, and he blogs at Data Colada, which is a favorite blog among my data nerd friends and my social science friends.
Uri's research, I like to think of it as breaking down into two categories. On one level, you research factors that influence human decision making, like, for example, the weather. So I think of that as the object level. And then on the meta level, you research the scientific method and problems in the scientific method that lead studies to be less reliable than we would ideally like, especially in social science and, for example, psychology. So that's the part we're going to focus on in today's episode.
For example, Uri has been called the Fraud Vigilante for his work uncovering and explaining instances of fraud in social science. So we'll discuss that. But we'll also put it in the broader context of issues with the scientific method, and maybe touch on the current replication crisis, the crisis of faith that's affecting psychology and social science in general, which has been discussed in the news in recent weeks. So Uri, welcome to the show.
Hi, thanks for inviting me.
You know, one thing that you said last time we spoke was that there's a connection between those two levels that I termed the object and meta level of your research, in which on the object level you're investigating biases in human judgment and decision making, or factors that can unconsciously influence our judgment. And then on the meta level, you're applying some of those principles, some of those findings, to science itself, to the way that scientists think. Can you expand on that?
Yes, so I came to realize that after I had been doing work in methodology, but I saw the connection between those two areas that you describe. The connection is that statistics textbooks tend to think of researchers the way economists used to think of people 50 years ago, as purely rational maximizers and objective processors of information. And economics went through a major revolution, first just incorporating limited processing of information and then becoming more and more psychologically realistic.
And statistics hasn't really gone through that. And so once you start thinking about researchers as being motivated thinkers, who want to be successful, want to keep their careers, but also just believe that they're right, and that's why they're doing a study, so they have a very clear expectation, and about how that will tint how they interpret the ambiguity that's present in any data analysis, you can really gain insight into how things go wrong and how to prevent that from going wrong.
So I think of the problems with the scientific method as falling into a few categories, where one of the biggest categories is something sometimes called p-hacking, which we've talked about before on this show. It basically entails ways that researchers can end up, not necessarily intentionally, getting results that fit their hypothesis more than they would have if that p-hacking weren't happening. And then on a broader level, an institutional level, there are ways that the process by which studies get published can bias the field of published studies towards positive or exciting results.
And we've talked about this as the file drawer effect or the publication bias effect on the show before.
And so both of those categories seem like they fall under the umbrella of unconscious biases affecting the way that we do statistics and thereby what we conclude. And then the third category is outright fraud, where researchers are literally just fabricating data. And so I'm guessing that doesn't fall. Is there a connection between fraud and the kinds of judgment and bias research that you were referring to earlier, or would you put that in its own category?
I would put... most of the ways in which we relax our assumptions about how humans behave don't go all the way to criminal behavior or just unambiguously wrong behavior. And so I don't think the training that I have or the research that I've read or created gives me much insight into the mind of somebody who would go through an entire career creating false claims that she or he knows are false, going to conferences, presenting them, writing papers, editing, responding to reviewers and all that, for a big fiction. I think that's more psychopathic. So maybe a clinical psychologist would be better equipped to think about that.
Right.
And can you compare these categories of problem to each other in terms of the magnitude of the different problems? Like, how big of a problem is fraud relative to p-hacking or other unconscious influences on an individual researcher's work, as compared to, say, publication bias in the way it affects which studies get published?
Right. So I think there are two main ways you could measure them: how prevalent they are and how much impact they have. And so I think it's hard to tell. People will say fraud is not very common, but what they really are saying is they hope it's not very common, because they don't have any meaningful way of assessing it. It's very hard to do that. And I suspect it's more common than most people describe it to be.
So I would ballpark it, basically... I think most people would ballpark it as trivially small, and I would say a substantial small share, let's say. And of course, it's necessarily a number out of thin air. But just based on my experience, I would go closer to five percent of studies being fraudulent.
Yeah, wow.
Yeah. That contain fraudulent data; that would be my rough estimate. And of course, there's a lot of guesswork in there. But most people would say, oh, maybe a handful of researchers are doing it. That's what I think would be the modal answer. Where I do agree with the majority, and especially with influential people who talk about this, is that I suspect very few ideas that are well known are based on fraudulent data.
So I think most fraud tends to happen in the margins, in less important journals, by less important people, because that's where the incentives are. Right? If you can't be successful the other way, and you can fake data in a journal few people will ever read, then you can get away with it for a long time. And so in that sense, it would be much more common, but less influential.
And the ways in which fraud does have more of a burden on the rest of the scientific community, there are two main ways. I think one is through meta-analysis. So meta-analysis is this idea that when a literature is mature and you want to summarize it quantitatively, you go out and you try to find every study, published and not published, and try to aggregate it. And then because fake data is not constrained by reality, it will often have just gigantic, enormous effects, and it can dramatically bias the overall literature.
So even if you faked data and nobody ever read your paper, you could still have an impact on the overall understanding of the literature, because somebody is going to include your extreme data point in an overall average and you're going to have disproportionate weight. For a recent paper, we were trying to improve on a technique that we have to correct for p-hacking, for selective reporting of analyses. And we simulated what would happen if you throw in the studies by one of the people that I caught committing fraud. And just one study by that person would make... if I remember correctly, if you take 20 studies that are all false positives, so none of them are real effects, they just came about by chance, and you throw in just one fake study, it makes it almost certain that you will conclude the effect is real, because it's such an outlandishly large effect.
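To make the "disproportionate weight" point concrete, here is a minimal sketch. It does not reproduce the correction technique or the simulation mentioned above; it swaps in a plain inverse-variance (fixed-effect) average, and every number in it (20 studies of a true zero effect, their standard errors, the size and precision of the fake study) is a hypothetical chosen for illustration.

```python
import math
import random

def fixed_effect_summary(effects, std_errors):
    """Inverse-variance weighted mean effect and its z statistic."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled / pooled_se

rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Hypothetical: 20 studies of an effect that is truly zero, each with SE 0.25.
se = 0.25
null_effects = [rng.gauss(0.0, se) for _ in range(20)]
null_ses = [se] * 20

# Hypothetical fabricated study: a huge effect reported with high precision.
fake_effect, fake_se = 2.0, 0.10

pooled_clean, z_clean = fixed_effect_summary(null_effects, null_ses)
pooled_tainted, z_tainted = fixed_effect_summary(
    null_effects + [fake_effect], null_ses + [fake_se]
)

print(f"without fake study: pooled effect = {pooled_clean:.2f}, z = {z_clean:.2f}")
print(f"with fake study:    pooled effect = {pooled_tainted:.2f}, z = {z_tainted:.2f}")
```

Because the fake study is both extreme and precise, it carries about a quarter of the total weight on its own in this setup, which is enough to pull the pooled estimate well away from zero.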
So in that sense, I think it is potentially more consequential. I don't know that that answers the question, if it's the angle that you were interested in.
Although I'm a little alarmed to learn that meta-analyses don't generally adjust for outliers. I would hope that they would.
Yeah, there's a lot of variation in how to deal with it, and outliers are a really tricky problem from the statistics standpoint, because you never know, when you have an outlier, if that's an anomaly or that's really an extreme value. And so there are different ways to deal with it. One way, for example, that some meta-analyses do it is they sort by the effect size of the study and they'll show it.
So you will very easily see if there is an outlier value. But it's not clear what to do with it. Is it that the person really understands the phenomenon, and that's why they get a large effect? So you're right that it is possible to manage the problem in principle. I think in practice people will be reluctant to say, we decided to exclude this result merely because it was large. It could easily be construed as you being biased against finding evidence for such phenomena, for example.
Did you say there was another way to think about the size of the fraud problem?
Yeah, so the other way, I think, is in terms of sort of... So even if you're somebody doing work that's not very influential at all, you're taking a position, a faculty position or a research position, that somebody else could take, and they would do good work, influential work. So if you think of people in the margins, then you think of the impact any one of them could have on their students, and so that's what becomes more influential.
So for example, I happened to run across, at a conference, a person who applied for the same job at Michigan that the fraudster took. And for that person, the cost of fraud was very real: they didn't have a job at Michigan because somebody who was faking data had it. And so that's the human angle.
Right, so there are human victims, and then there's the general integrity of science as a victim, which is maybe a more indirect but pretty substantial consequence. Is this just, generally speaking, what motivated you to go after the fraud problem, or was there some more specific impetus?
We were working on something else and we just stumbled on a paper where the effects were so incredible that we were trying to figure out what was happening. And then I think, like most people who want to do any sort of science, intrinsic curiosity is primarily what drove it at the time. I was just curious: if it were fake, how could you prove it? How can you provide evidence that it's fake?
And then at that point, I shared the prior that this was very, very rare, so I thought, if you come across something so rare and so terrible, you should do something about it. If it happened today, I'm not sure I would react the same way, because I no longer think it's such a rare thing.
Do you think this is a common reaction on the part of other social scientists, that they see surprising or surprisingly large effects and they think to themselves on some level, that can't really be real, but you're one of the few people who really tried to find out if it was real?
I know there's been a few cases. I think some people perhaps are too quick to... I mean, so after the papers where I describe the fraud and how it was discovered and so on were published, I would receive a ton of emails from people, from researchers, and even data from political campaigns, electoral data, financial statements and so on.
And you can tell that some people are very quick to judge something as fake because it strikes them as surprising. So, I don't know. I think I should say that I'm an outlier in terms of how common I think fraud is. I know of two other people who have such negative views on this, and almost everybody, 99.9 percent of people I know, thinks it's much less of a severe problem than I do.
So, well, one thing that I know you've talked about in the past is that it's troubling for social science that we can't just assume that a surprising result that we read in the literature is true and therefore interesting. We sort of have to have this prior that a surprising result, one that contradicts our expectations, has a large probability of being, if not fraudulent, then, you know, flawed research for other reasons. And it sort of limits our ability to make updates from the research.
Right. But my view on that has evolved a bit. So some people, especially the more Bayesian-oriented researchers or people who are looking for Bayesian methods, would say you really have to bring in your prior about the phenomenon before accepting it.
Yeah, and I think the risk with that is that you end up being too skeptical of the most interesting work, and so you end up in a way creating an incentive to do obvious and boring research. And so I have a bit of a twist on that, in practice maybe not sufficiently different, but I do think it is different, which is that I think we should bring in the priors and our general skepticism towards evaluating the methodology, almost blind to the question or the hypothesis.
So let's say you tell me you ran an experiment about how preferences for political candidates shift. Then I should bring to the table how easy it is to shift political preferences in general, how noisy those measures are and so on, and not put too much weight on how crazy I think it is that you tell me you're changing everything by showing an apple below awareness. Because my intuition about how big the impact of apples below awareness on people is, it's not a very scientific prior.
It's a gut feeling. And so then if I start judging your scientific evidence with my gut feeling, that doesn't seem right. But I do have a lot of experience, especially from a political scientist, from an experimentalist, on how easy it is to move the dependent variable under different circumstances.
So if you tell me, for example, even the Likert scales, this one to seven, how strongly do you agree with this or that, they sound kind of flaky. But when you work with them, you know that moving a Likert scale like that more than, say, one and a half or two points is really, really hard.
And just to clarify, by moving the scale two points, you mean trying an intervention that will cause people's average response on the one-to-seven scale to move an average of two points?
Absolutely right. So some people, we show them this and they say four on average, and other people, we show them that and they say six, a difference of two points on average.
So if I see a paper that shows an effect of three, say, my prior is you don't get that unless it's an incredibly obvious thing, like what's taller, a building or a house. And so unless you're asking an incredibly obvious question, I do allow myself to be skeptical, but not because the manipulation necessarily doesn't resonate with my intuition. So I don't think the distinction is clear, but one is my prior about the specific intervention you're claiming. There, I try not to trust my intuition. And the other one is what I know about the reliability of the measures, how easy it is to move the dependent variable, and there I do, because in the latter case it's based on data and the other one is just my gut feeling. And if you want to be surprised by science and change your mind, then it's interesting when somebody comes up and shows something that you wouldn't expect, if it's reliable and replicable.
Well, to push this logic to the extreme, if we take a study that purports to show evidence for psi phenomena, right, people being able to predict the future, for example...
Mm hmm.
Surely my intuition about the prior implausibility of that being real should factor in somehow, right? How could it not?
Yes, I think that's one of the examples that's more of a gray area, in terms of you could argue the hypothesis is so out there that your intuition about what moves things... It's not clear where the psychology starts and where the methodology starts, in terms of: is precognition really a psychological phenomenon, or does it challenge your understanding of how the world operates?
So I agree that your example challenges my description, which is something of a heuristic. Right. But say you tell me about how much people are willing to pay for shoes. I may have an understanding of that, and I know how much it moves one way or the other, and that's what's constraining me. Then whether it's about an ad or about experience, that's the thing I don't have a lot on beyond my intuition. Whereas precognition is about what changes dependent variables, and from everything we understand, only things that happened in the past can change dependent variables, no matter what it is. So imagine you thought precognition was possible. Then, whether it has to be arousing stimuli, or only for men or for women, that part, I wouldn't trust my intuition very much.
I see. So a lot of the frustration with psychology and social science in general has been directed at these kind of frivolous or sexy studies that will get reported in science news about, you know... I'm going to pull an example out of my lived experience in this very moment. So there's a window behind me and I can feel the sun on my back. So I could imagine a study of this kind showing that, oh, if there's a heater or the sun on someone's back, that will cause them to respond to surveys saying that their better days are behind them, because they have this feeling of warmth and positivity being behind them, you know?
Right.
And there's so many other studies that... I mean, the framework that I'm thinking of here is basically that our behavior and our choices and our view of the world can be strongly influenced by all sorts of random factors, like women are more likely to wear red on days six through 14 of their menstrual cycle, that kind of thing. And I do feel like my prior on that not being the case is certainly a weaker prior than my prior on precognition not being real. But I do have a noticeable prior that that's not really how the world works. And so...
Right.
But let's go with the window behind you example, OK? Where would we have an informed... I mean, what part of our experience would give us feedback as to whether windows behind us do or do not influence our perception? It's hard, right? It's not impossible, but it's just our gut reaction to it. A lot of really interesting findings in science and social science seem crazy when we are presented with them.
So I'm not trying to critique your instinct. I have the same instinct. They do seem far-fetched to me. But what I try to do is almost like exercise control and say, OK, I'm going to suspend that prior and instead bring in this other prior, which is to ask how big your sample is. And so, for example, those menstrual cycle studies, if they had a really large sample, right, and they had a very carefully designed control group, and then you saw that they got the effect, and then the authors shared the very natural skepticism that any such small cause would have a detectable effect, and so they went out and carried out a very similar replication that addressed a natural concern you may have. If they did that, and so they addressed my prior methodology concerns, I would say there are good reasons to update my beliefs, because I don't have a well-founded belief about how different parts of the menstrual cycle influence female preferences for clothing.
Where would that come from? It's just my gut reaction. It doesn't resonate with me, but it's not very well informed. You see what I'm saying?
Yeah. In fact, now that I'm introspecting, I think that some of my skeptical prior about studies like that comes from the fact that I don't trust a lot of methods in social science. And so basically I have this model where there are different reasons a study can become known to the public: either it can be really well conducted and therefore published in a journal and discussed by scientists and therefore more likely to filter into the public.
Or it can be a sort of fun, sexy result, and that alone can be enough to propel it to public awareness. So finding out that a study is sexy will make me somewhat less confident that it's true. I think it's sort of like, you know, an actor can either achieve fame by being incredibly talented or by being incredibly attractive. And, you know, if you know that an actor is famous and attractive, that might cause you to downgrade your expectation of them being incredibly talented, that kind of thing.
So I think it's reasonable. I think it's reasonable to say, if all I tell you is that I have a very sexy finding and I don't give you any other details, it is perfectly rational to assume the methods are weak based on that information. But then you can evaluate that when you open the paper and read it, right? At that point, potentially, I would challenge the findings. Right.
So I think this may be a good example of the contrast between the two. So a few years ago, I did research evaluating claims that our names really impact what we end up doing. So, for example, if your name is Smith, you're more likely to marry somebody else whose name is Smith. And if your name is Dennis, you become a dentist, and so on. And my main reaction to it was... so they show that people tend to marry others with similar last names.
OK, and when I read that claim, I don't have a strong intuition about how we choose our partners and to what extent last names influence our perception of others. I don't have well-informed priors. But I do have them on how difficult it is to show an effect is causal. And so I thought, I can't imagine a way to make this case compellingly. So even if it were true, how would you ever prove it compellingly?
So I was skeptical of the study for that reason. It just seemed like a very hard-to-document claim. And I ended up showing, actually, that all the evidence for it was spurious. But at least at that time, I wasn't necessarily a skeptic of the main hypothesis, I was just skeptical that you can easily document it.
I think that makes sense. I think that captures my reaction to it.
Right, good. So let's dive into the question of how to react when data are sort of too good to be true.
So this was one of my intuitions about how to detect either fraudulent data or data that are just a serious victim of p-hacking or other kinds of cherry picking: that if the results are much stronger than we would expect, even if that effect were real, then that's a strong red flag. Does that seem fair to you?
Yeah.
Yes. So so far, definitely. The fraud that has been detected, and that's an important caveat, has the flavor that it is usually too good to be true. But the caveat is we only know the fraud we detect, and we only catch the very slow thieves, kind of thing. And so it could be that there's very smart fraud out there that's hard to detect and that's not too good to be true, and we don't detect it. With p-hacking, with the selective reporting of analyses, it's often the opposite, where effects are just credible enough, at least in terms of statistical significance, just below the threshold that we impose. So those are a little easier to spot, where if you're trying multiple things just to get... if you exclude some variables or some observations to get your effect, or if you try one versus the other measure to get the effect, you will tend to land just south of .05. So that will make it easier to spot.
And can you tell the difference between data that's fraudulent and data that's maybe just the result of unconscious massaging on the part of the researchers? Is that detectable objectively?
So, I mean, the problem with fraud is that made-up data doesn't follow any mathematical law. It's just whatever you come up with. And so that's what makes detecting fraud tricky: there are infinite ways to fake data.
And so there are literally five standard approaches to identifying data as problematic.
And those are all very different from selective reporting of analyses. But you could very easily fake data so that it looks like it's been p-hacked instead of faked. So, in fact, Diederik Stapel, this famous psychologist from the Netherlands who got caught a few years ago, and I was not involved in that case, described that when he would fake data, he would try to make it not look too good. He would consciously do that. He wasn't very good at it, because his data did look too good.
But at least he was trying to make it not be too good to be true.
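The earlier remark that selectively reported results "tend to land just south of .05" suggests a crude check, sketched below. It is not the correction technique discussed in this episode, only a bunching count: when there is no true effect, p-values are uniformly distributed, so about 20% of the significant ones should fall between .04 and .05, and with a real effect the share should be smaller still. The list of reported p-values is hypothetical.

```python
from math import comb

def count_just_below_threshold(p_values, low=0.04, high=0.05):
    """Count significant p-values (p < high) and those bunched in [low, high)."""
    significant = [p for p in p_values if p < high]
    near = [p for p in significant if p >= low]
    return len(near), len(significant)

def binomial_upper_tail(k, n, prob):
    """P(X >= k) for X ~ Binomial(n, prob), computed exactly."""
    return sum(comb(n, i) * prob ** i * (1 - prob) ** (n - i) for i in range(k, n + 1))

# Hypothetical p-values reported across one set of studies.
reported = [0.049, 0.041, 0.032, 0.047, 0.044, 0.012, 0.048, 0.043, 0.046, 0.039]

near, total = count_just_below_threshold(reported)
# Chance of seeing this many (or more) in [.04, .05) if the true share were 20%.
tail = binomial_upper_tail(near, total, 0.2)
print(f"{near} of {total} significant p-values fall in [.04, .05); tail probability = {tail:.4f}")
```

A small tail probability says the bunching is hard to attribute to chance. On its own it cannot distinguish selective reporting from fabrication, which matches the point above that faked data can be made to look p-hacked.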
Well, I know that I've heard about methods exploiting the fact that humans don't really know what randomness looks like, for example. So you can sometimes look at, you know, the second digit after the decimal point to see if the frequency of different digits in that position is what you would expect to be generated by random chance alone, that kind of thing. Do you do anything like that, or if not, how do you actually prove that something was fraudulent?
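The digit check described in that question can be sketched as a chi-square comparison against equal digit frequencies. This is an illustration only: it assumes values are reported to two decimals and that the trailing digit should be roughly uniform, which is not guaranteed for every kind of measurement, and both data sets are constructed by hand.

```python
from collections import Counter

def second_decimal_digit(x):
    """Second digit after the decimal point, e.g. 3.47 -> 7 (assumes 2-decimal data)."""
    return int(round(abs(x) * 100)) % 10

def chi_square_uniform_digits(values):
    """Chi-square statistic comparing trailing-digit counts with a uniform spread."""
    counts = Counter(second_decimal_digit(v) for v in values)
    expected = len(values) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

# Hypothetical data: one set with every trailing digit equally often,
# one where the digits 3 and 7 are heavily overused.
even = [1 + i / 100 for i in range(100)]
lopsided = [1.03] * 30 + [1.07] * 30 + [1 + d / 100 for d in range(10)] * 4

# With 9 degrees of freedom, a statistic above about 16.92 is significant at the 5% level.
print(chi_square_uniform_digits(even))
print(chi_square_uniform_digits(lopsided))
```

A large statistic only says the digits are unusual. As the answer that follows makes clear, an anomaly like this is a prompt for closer inspection, not proof of fraud.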
Other than, I mean, it could fail to replicate, but that's not proof of fraud, right?
Right. In fact, it could successfully replicate even though it was fraudulent.
Yes. So in the work that I did, I did something that's similar to what Fisher did when analyzing Mendel's experiments, which I am guessing a lot of your listeners may be familiar with, but I'll briefly recap anyway.
So Fisher, who's one of the founding fathers of statistics, frequentist statistics, p-values and so on, noticed a pattern in Mendel's genetic studies that was troubling, and it was that Mendel's predictions were coming true too well in the sample. So you can say, even if Mendel were exactly right about the proportions of each trait that should be observed, the samples should have random error and they should deviate from those predictions. And they were systematically too similar to the predictions, to the point where Fisher computed, OK, let's imagine the theory was right.
How likely would you be to get evidence this good or better? It's like the opposite of the p-value, which asks how likely it is, if your theory is wrong, for the data to be this far from the theory. Fisher was asking how likely it is for the data to be this close to the theory.
And he concluded that Mendel's data were impossibly similar to the predictions. And there's been debate; at least four or five years ago there were still papers in statistics where they're debating if it was outright fraud or if it was selective reporting in this case. So some of the analyses I did were of that flavor, where I found that the samples in this study just didn't have enough variation. So that's data not fulfilling expected mathematical properties.
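The "how likely is evidence this good or better" question can be sketched with a small simulation. This is not Fisher's actual calculation or data: the counts are hypothetical, the 3:1 ratio simply stands in for a theory's prediction, and the function name is made up for this example. The idea is to draw many samples from the theory itself and see how often they match it at least as closely as the observed sample does.

```python
import random

def chi_square(observed, expected):
    """Squared deviations from the expected counts, scaled by the expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def prob_fit_this_good(observed_counts, theory_probs, n_sims=2000, seed=0):
    """Share of samples simulated from the theory that fit it at least as
    closely as the observed counts do (a left-tail 'too good' probability)."""
    rng = random.Random(seed)
    n = sum(observed_counts)
    expected = [n * p for p in theory_probs]
    observed_stat = chi_square(observed_counts, expected)
    categories = list(range(len(theory_probs)))
    as_close = 0
    for _ in range(n_sims):
        draws = rng.choices(categories, weights=theory_probs, k=n)
        counts = [draws.count(c) for c in categories]
        if chi_square(counts, expected) <= observed_stat:
            as_close += 1
    return as_close / n_sims

theory = [0.75, 0.25]  # hypothetical 3:1 prediction
p_perfect = prob_fit_this_good([300, 100], theory)  # sample matches the theory exactly
p_typical = prob_fit_this_good([285, 115], theory)  # sample with ordinary sampling error
print(p_perfect, p_typical)
```

One perfect match is not damning by itself. The argument gains force when many independent samples all sit this close to the prediction, because the small probabilities multiply.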
But they can also deviate in terms of more conceptual or psychological properties. So if you know a domain, you know that data behave in certain ways. So, for example, if you ask people how much they are willing to pay for things, which people in my field do a lot as a way to capture how much people value things, how much they like them, or how their liking or interest changes with different shocks...
If you ask how much they'd pay for things, they tend to answer in multiples of five or 10. And if you were to fake data and you aren't aware of that, you may say people valued a t-shirt and they said they would pay 17 dollars, 18 dollars, 12 dollars, and not have that very sharp tendency. And so in one of these cases, I collected valuation data from 20 different studies and all of them had very, very pronounced spikes at the multiples of five.
And his had zero. And then I ran a replication of his study. And not only did I not get the effect he got, but I did get spikes at multiples of five.
Right. And so that was additional evidence that the data were not... I think this is a euphemism that people use, but the data were not collected as described in the article.
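The multiples-of-five check lends itself to a very small sketch. All three data sets below are hypothetical: one imitates typical stated valuations, one has none of the round-number spikes, and one comes from a naive normal random-number generator rounded to whole dollars, which should land on a multiple of five only about one time in five.

```python
import random

def share_multiples_of_five(valuations):
    """Fraction of whole-dollar valuations that are multiples of five."""
    return sum(1 for v in valuations if v % 5 == 0) / len(valuations)

# Hypothetical whole-dollar answers to "how much would you pay for this t-shirt?"
typical = [5, 10, 10, 15, 20, 12, 25, 10, 5, 20]
no_spikes = [17, 18, 12, 9, 14, 21, 16, 11, 13, 19]

# Hypothetical naive fabrication: normal draws rounded to whole dollars.
rng = random.Random(0)
generated = [round(rng.gauss(15, 5)) for _ in range(1000)]

print(share_multiples_of_five(typical))    # mostly round numbers
print(share_multiples_of_five(no_spikes))  # no round numbers at all
print(share_multiples_of_five(generated))  # close to the one-in-five baseline
```

The comparison that matters is against what real respondents in the same domain do, which is why the account above relies on valuation data from other studies and on a replication.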
That is... wow. I was going to say that was the smoking gun, but that's also a way to describe it. I have to say, listening to this is making me feel the way I feel when I watch, like, a crime procedural and the detectives come up with all sorts of clever ways to, you know, finger the culprit, even though he thinks he's covered his tracks. And my reaction is always, God, I'm never going to commit a crime, because there's no way that I would know everything I had to do in order to get away with it.
There's always something I'm going to forget or slip up on. And as you talk, I'm thinking, man, I'm just never going to make up data, because there's no way I'd get away with it.
Especially when I was working on this, when it happened, for a couple of years, when I met a statistician, they would say, oh, I'm certain that it would be so easy to fake data undetectably. They were going to describe something, and I knew they would go on to describe something that would be totally detectable.
So, for example, they would say, oh, you just generate random data, even in Excel, from the normal distribution, and you get random data. And I'd say, yeah, but if you used it, for example, for valuations, you would immediately get caught, because valuations don't follow the normal distribution; they have the bump at five. So I do think it's possible to fake undetectably. I think it's very hard to do it on your first attempt, and if you don't have feedback, right, if you don't have somebody saying, oh, this is how I would catch this. I think it's actually way harder than it seems.
So let's talk briefly about solutions. One solution category that has come up before is the idea of preregistering, so getting researchers to state ahead of time what effect they're looking for and what methods they're going to use to test for that effect. And the pessimism that I've heard from people, and that I kind of feel as well about that solution, is just that the incentives aren't really there.
You know, as long as journals are going to keep publishing non-preregistered studies, what benefit is there to researchers in tying their hands ahead of time when they could, you know, otherwise leave themselves free to kind of data mine and cherry pick until they get something publishable? What do you think?
Right. Yeah. So one big distinction is that preregistration is all about selective reporting of analyses, p-hacking, but it won't do anything about fakery. Right? If you're faking and you preregister, you'll get the result no matter what.
So my view has evolved on preregistration. I used to be quite skeptical, but now I'm enough of a supporter that I co-created a website for it called AsPredicted.org. And when we created that, part of the reason was that we did see a selfish incentive for preregistering your studies. And it is that once the readers of your work have become more skeptical, and justifiably so, more educated about how much selective reporting matters, they are also looking for signs that you selectively reported.
And so if you don't selectively report, if you are transparent, then you need a way to signal that you are transparent, and preregistration becomes a little bit like the organic label on the organic farmer's apple.
So if I want to grow an organic apple, it's harder to produce and more expensive. How do I get credit for it? Well, I label it. So actually, I was just completing a paper where we have a preregistered study, and we collected three variables, three alternative measures. And on one of them, I actually disagreed with my coauthors: I thought it wouldn't work and they thought it would.
And so if you're aware that people are worried about p-hacking, you may choose not to collect that variable to avoid any suspicion. Instead, we preregistered that we really care about two variables, and that the third variable was exploratory and we were not going to include it in our analysis. And so preregistration bought us the freedom to include things in our study that we were not planning on reporting in the paper, but that we wanted to use to inform future research.
So I think it signals that you're not selectively reporting, and it allows you to collect additional information that you wouldn't normally collect. And so we launched AsPredicted.org on December 1st, and it already has over 100 authors who have contributed a preregistration. We're very excited about that.
Oh, that's wonderful. Are you actively trying to increase its adoption in the field?
It's growing fast enough that we haven't done anything.
We basically had a blog post about it and tweeted about it. And there's an aspect of the design that's kind of viral. We didn't build it in on purpose, but we got lucky with it, which is that all authors have to approve any given preregistration. And so if I'm a co-author with somebody and I preregister, they all get emails and they find out about it. And so I think it's spreading through that, I don't know, the way Hotmail spread back in the day, when you would get those emails from people saying they came from Hotmail.
And so I think it's having that effect not by design, but by chance.
Or like, you know, playing Candy Crush on Facebook: would you like to show your friends your wonderful success at Candy Crush, yes or no? Right.
Right.
And instead of evil. And in retrospect, once a few people started coming out with that, we worked really hard to make it incredibly simple to use and to enforce. There are other options for people to preregister their studies, but it's very costly as a reader to check if things were done as preregistered, because it can be like a 40-page document with an asterisk. We have a single-page document. So the idea is that every reader should, within a minute, be able to compare the study that was published with the study that was preregistered.
So at this point, I'd like to dive into the current replication crisis in psychology. Obviously, the problems with social science have been discussed for years now. But the recent context is that there was a paper published in Science a few months ago titled "Estimating the Reproducibility of Psychological Science," in which the authors took 100 papers from the psychology literature and tried to replicate them. And they found that only 40 percent of those papers actually replicated, meaning that in only 40 of the cases were they able to find the same effect that the original study found.
So this has caused much gnashing of teeth and wailing and rending of garments in the social science world over the last few months. But the most recent update to the debate was a commentary published a few weeks ago by a couple of social scientists saying, you know, this actually isn't that bad. There are a bunch of reasons why studies can fail to replicate even if the effect is real. So, you know, this 40 percent figure shouldn't actually be that troubling.
There's no real replication crisis. And I was hoping you could speak to this particular issue and talk about whether, in fact, you think this 40 percent figure is troubling or not, and then maybe more broadly about the process of replicating in general. How concerning is it when you try to replicate a study and fail to find the same effect? What do we conclude from that?
So, starting with the specifics of the original paper and the critique. I was recently talking with somebody and we came up with a good analogy for it, which is this: the original paper said 40 percent of studies replicated. That would be as if I tell you that in the soccer playoffs, this team won 40 percent of its games. You'll be forgiven for assuming that, oh, they must have lost the other 60 percent, because you can't tie in the playoffs.
But it turns out that in soccer playoffs, because there are two games, you can tie in a given game. And so then if I tell you, well, they won 40 percent, they tied 30 percent, and they lost 30 percent, you'll be surprised by that. And even though there was no ill intent on my part, I just didn't realize you didn't know the soccer rules.
So I think the original paper said 40 percent of studies replicated. They didn't say that 60 percent failed to replicate, but a lot of people interpreted it that way. And that's not right. That's not a justified read of the evidence. A justified read of the evidence is that 40 percent replicated, 30 percent didn't, and for 30 percent we really can't tell one way or the other. And so I think that's part of the discussion.
Some people have used a Bayesian approach to think about the problem. I've used a different approach. It doesn't really matter, as long as you're willing to accept that sometimes a study is inconclusive, as opposed to supporting or not supporting a conclusion.
You will conclude that, because so many of the studies had small samples, they were just inconclusive. So if we take the evidence as 40 percent success, 30 percent failure, 30 percent unknown, that seems to me better than I would have expected. So I'm not really pessimistic about this. I am usually the pessimist here, but I think, I don't know.
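The idea that a replication can come out as a success, a failure, or simply inconclusive can be sketched with a confidence interval. The following is a simplified Python illustration and not the specific approach referred to in the conversation. It assumes the original effect was positive, and the `small_effect` benchmark, the function name, and the example numbers are all hypothetical.

```python
from statistics import NormalDist

def classify_replication(estimate, std_error, small_effect, level=0.95):
    """Three-way verdict for a replication of a positive original effect.

    replicated:   the interval lies entirely above zero
    failed:       the interval lies entirely below a benchmark "small" effect
    inconclusive: the data are too noisy to say either way
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    if lower > 0:
        return "replicated"
    if upper < small_effect:
        return "failed"
    return "inconclusive"

# Hypothetical replications: (effect estimate, standard error)
print(classify_replication(0.40, 0.10, small_effect=0.20))
print(classify_replication(0.02, 0.05, small_effect=0.20))
print(classify_replication(0.10, 0.20, small_effect=0.20))
```

The third case is the one that matters for the 40/30/30 reading: a small sample gives a wide interval that is compatible both with no effect and with a meaningful one, so it is neither a success nor a failure.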
I think people who are in the business of trying to improve how science gets reported, I think... Part of the reason is that it's really hard, and that's part of the critique: it's hard to really replicate a study. One of the things about social science is that effects are affected by context and also by measurement. So you can have a study, and then you run it very similarly, but you run across a problem that was just not present in the original.
So, for example, you can get a floor effect, which is a term used in psychology a lot and a little bit in economics, where all the responses are so low that you just can't get any lower than that. And so you fail to replicate it, right? You just can't detect anything.
And because it's social science as opposed to hard science, the floor is going to be dependent on the sample. So maybe if you ask Americans today, the floor is in one place, and if you ask Swedish people two years ago, the floor is in a completely different place. And so that doesn't really falsify the psychological hypothesis. It just means you have to adjust your measures, your sample size, or who your sample is.
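The floor effect described above can be made concrete with a small calculation. This is a sketch under stated assumptions: responses are modeled as a normally distributed latent score that cannot be recorded below a floor, and the treatment lowers the latent mean by a fixed amount. The function names and numbers are hypothetical. It uses the standard formula for the mean of a normal variable censored from below.

```python
from statistics import NormalDist

def expected_observed(mean, sd=1.0, floor=0.0):
    """Expected value of max(floor, X) for X ~ Normal(mean, sd)."""
    z = (mean - floor) / sd
    std = NormalDist()
    return floor + (mean - floor) * std.cdf(z) + sd * std.pdf(z)

def observed_effect(control_mean, true_effect, sd=1.0, floor=0.0):
    """Observed control-minus-treatment gap when the treatment
    lowers the latent mean by true_effect."""
    control = expected_observed(control_mean, sd, floor)
    treated = expected_observed(control_mean - true_effect, sd, floor)
    return control - treated

# Same true effect (0.5), two hypothetical samples with different baselines.
far_from_floor = observed_effect(control_mean=3.0, true_effect=0.5)
near_floor = observed_effect(control_mean=-2.0, true_effect=0.5)
print(f"far from floor: {far_from_floor:.3f}, near floor: {near_floor:.3f}")
```

With the same underlying effect, the sample whose baseline sits near the floor shows almost no observed difference, which is how a replication in a different population can fail without the hypothesis being false.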
And so even if it were true, just these factors... And this is the biggest debate, right?
Whenever somebody publishes a failure to replicate in psychology, the original authors typically will say, well, it's because there's this big factor that changed between my sample and your sample, and it's easy to get into unfalsifiable explanations that way.
But on the other hand, we don't know when that's true, but we know that it's true sometimes. And so 30 percent failure is not terrible for that reason; I wouldn't say that's my estimate of how many of the hypotheses were wrong.
Hmm, it's maybe an upper bound, although I can imagine other factors that should make us more pessimistic about that 30 percent figure instead of more optimistic. Like, tell me if I'm wrong here.
But it seems to me that there might be just a regression-to-the-mean effect, where if there's random variation in how strong the effect seems when you do a study, and you end up publishing when the effect is unusually or abnormally strong, then the next time you look for that effect, chances are it's going to seem weaker, because you published the strong one.
Yeah, and people do raise that response, which is to say: it's true that things change in social science, but why do they always change for the worse?
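The regression-to-the-mean point raised above can be illustrated with a small simulation. This is a minimal sketch, assuming a hypothetical true effect and standard error, and a simplified publication rule in which an original study is published only if it is statistically significant. None of these numbers come from the conversation.

```python
import random
from statistics import mean

rng = random.Random(1)

TRUE_EFFECT = 0.2   # hypothetical true effect size
STD_ERROR = 0.3     # hypothetical standard error of each study's estimate
CRITICAL_Z = 1.96   # publish only if the original is "significant"

published, replications = [], []
for _ in range(20000):
    original = rng.gauss(TRUE_EFFECT, STD_ERROR)
    if original / STD_ERROR > CRITICAL_Z:
        published.append(original)
        # An independent replication with the same design and sample size.
        replications.append(rng.gauss(TRUE_EFFECT, STD_ERROR))

published_mean = mean(published)
replication_mean = mean(replications)
print(f"published originals: {published_mean:.2f}, "
      f"their replications: {replication_mean:.2f}")
```

Nothing about the phenomenon changes between the original and the replication here. The published estimates are inflated only because they were selected for being large, so the replications look weaker on average.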
Why can't it be that two years ago in Sweden it was really hard to get the effect, and today in the US it's very easy to get the effect? Yeah, and that's fair. But even if that were the case, out of that 30 percent, right... So when the effect gets stronger, you detected it before and you continue to detect it. OK. Maybe, I'm thinking as I speak: when the original author gets unlucky, because of publication bias, we don't have a record of it.
So we're not compensating with hypotheses that we thought were false and then realized were true. So to some extent, it's still only working against you, right, in the sense that among those failures to replicate, some of them must be explained by the effect having gotten weaker. Now, I'm not trying to understate it; I think it's a serious problem. And I think the solution is disclosure, preregistration, and replication.
And so I'm completely on board with all of those things. But we don't have to convince people that only 40 percent of studies replicate to make the case. Even if 80 percent of them replicate, we want to know which ones are more likely to replicate.
And the best way to do it is to report studies with as little selectivity of reporting as possible.
Another bias that occurs to me that might exist here is a kind of status quo bias. I think it was actually Andy Gelman who made this point that I might be stealing here. He says, well, you know, if we have a study that gets published and shows an effect, and then we try to replicate it and the result is inconclusive, we still sort of assume, well, it's probably true, because we didn't disprove it. But if we imagine the order of those studies being reversed, where the replication, so to speak, was actually the first study and it found no effect, and then we did a second study that found an effect, wouldn't we be differently anchored? Like, our default assumption would not be that this effect is there. Right. We just think, yeah, I think those studies.
So I like the point and I like the way of building the argument, and I think it's true. And in particular, I think if somebody has a study and it fails to replicate, I don't think it should be enough to say, oh, these three other things changed. I think, unless it is just blatantly obvious that those things really matter, or, of course, that the replication was incompetently performed, unless you can really make that strong argument, you should go out and test it. As a person who wants to continue believing in the effect, you should go out and show that that moderator, that other variable that changed, really is important. So I agree with that. But there is a way to turn the argument on its head and argue the other way, which is: if the failure to replicate had come first.
Right. Almost surely the author who ran it would have run a follow-up study. So when I run a study and it fails, I don't abandon the project. I look at the data and see whether things worked correctly. Maybe I should design a stronger manipulation, maybe use a larger sample, and so on. But the replicator doesn't do that. When the replicator fails to obtain a result, that's where the project ends for the most part. And so if we are going to treat replications as if they had come first, we should look closer at things like: do you replicate, for example, the range of values of the dependent variable?
This is the way it should be.
Any sort of quality check that you apply to data: is that quality check met in the replication as it is in the original? Or forget the original: is the quality sufficiently high for us to trust it? So I think the argument is a good one, and it partly argues against original authors being so defensive about replications and dismissing them so quickly. But part of it also argues that original authors try hard to get effects, and not only in the bad sense of the word, but in the sense of really understanding what's happening.
And replicators are not intrinsically interested in the subject, typically. And so when they don't get it, they think it's case closed.
Right. Well, what I want to use my last question for is a general takeaway for our listeners about how much to trust different levels of evidence in social science. So, you know, for a long time I've been skeptical of single studies. For the most part, I'm not going to trust a single study, you know, in isolation, without knowing the context and what other studies have investigated the same phenomenon, unless it's a really, really well done study.
And then for a lesser amount of time, I've been skeptical of meta-analyses, which, as you noted earlier in the episode, can be influenced by things like outliers, which may or may not be fraudulent, and they have other problems as well. But then more recently, I would have said, well, you know, maybe we can't trust individual studies, maybe we can't trust a meta-analysis, but surely we can trust a consensus that's been around for over two decades, in which study after study after study and multiple meta-analyses show again and again that this phenomenon exists.
But one of the consensuses in social science that has been prominent in this crisis of faith in psychology has been the idea of ego depletion, which is that your willpower can be sort of used up locally, like over a short period of time. And so you want to sort of conserve willpower by, you know, not trying to stick to your diet when you also have to stick to a really hard task at work or something like that.
So that was a consensus. And now it's been cast into doubt by attempts to replicate it that have failed. And so I'm wondering if you have any general heuristics. Should the takeaway from the problems of social science that we've been talking about be that you just have to retain a high level of skepticism about everything? Or are there some kinds of research or some levels of evidence that we can be pretty confident in?
So I think...
Whenever, and it's a high bar, I guess, to some extent, but whenever a skeptic replicates the effect, that's a good sign.
So whenever somebody has all the psychological reasons to not find it, and they find it... I think it's just a small share, but the rational thing to do is definitely to update on that evidence, and that can be attained. And maybe that's the standard we should aspire for most findings to have, which is: don't take them too seriously until somebody who doesn't really have an intrinsic interest shows the effect.
I've heard of this as adversarial, adversarial research... I forget the name, but basically...
Adversarial collaboration.
Yes. Thank you. Two labs, one of which believes the effect is real and one of which believes it isn't, collaborate and sort of agree ahead of time on a protocol, a set of research methods, and then report the results. That seems promising. Is that common?
It's rare, and I've spoken to some people who've done it, and they don't like it as much as people who haven't done it. And I think part of the reason is, and this goes back to the point about how replicators and original authors react differently when their studies don't work: it is hard to understand things.
And so when you're running a study and it doesn't come out as expected, you immediately see problems that you hadn't seen before. And part of it is self-deception, and it's bad. But part of it is that when you really need to figure something out, that's when you figure it out. And I'm sure if you talk to any creator of a successful idea, if you ask them if they got it right right away, they'll say almost invariably not, that they had a lot of failed attempts until they figured it out.
And so I think one problem with adversarial collaboration is that it assumes that after one study, your beliefs will immediately update. And usually, the moment the data are collected, whichever side didn't get what they expected will see a problem they hadn't seen before, and that doesn't need to be disingenuous or bad. It is just a natural process of updating our understanding. And so, I don't know, social science doesn't usually deal with urgent matters. So I think it's fine for us to wait.
I mean, I don't really understand why the newspapers have to cover studies the moment they come out. I think if they wait for years until somebody on an opposing side shows the effect too, I think nothing bad will happen.
Yeah, I mean, I think that's partly a coordination problem, right, that they are trying to beat everybody.
Yeah, exactly.
Although I think there's also some weird psychological quirk where people are more interested in something that's new, even if, you know, there's all this other news, all this other science that they haven't heard of before, that isn't new. It's just been sitting around in textbooks for decades. And it's not quite clear why they should be more excited about the thing that was just discovered than the thing that was discovered ages ago that they'd never heard of before.
And I suspect, I mean, this is a falsifiable prediction, but I suspect that if you ran the experiment, there'd be very little benefit of reporting on a recent study versus an old study.
And so I suspect that, in page views or readership, if The New York Times tomorrow reported on a very sexy finding from five years ago, I don't think it would have any fewer readers than if it were a new study. In fact, maybe more, because nobody else will be reporting on it, and because noticing would require readers to be so sophisticated that they remember all these different findings and the subtleties of how they're different, and they don't. And so, in fact, because I was involved in the debunking of those name studies, I have a Google alert on that.
And periodically, at least once a year, somebody writes a story in a major outlet about them, and they are 14 years old. But if you say people choose whom to marry based on their name, your first reaction is not, "Wait, is that a recent finding?" Right?
That's true. That's interesting. I'd just taken it as a given that people want to read new stuff, but maybe this is an assumption on the part of the media that isn't fully warranted. I'll have to think about that.
Yeah, I mean, it'll be interesting to test.
Isn't there... I can't quite recall, but wasn't there a newspaper that ran the same story, the same editorial cartoon, many, many times in a row to see if people would notice?
I know about that. Well, that's a little different. That's about people actually reading it. But I'm very curious, and it'll be worth testing it.
Well, we are actually quite over time. I gave in to temptation to continue the conversation, but I'm going to force myself to wrap things up now, and we'll move on to the Rationally Speaking pick.
Welcome back. Every episode of Rationally Speaking, we invite our guest to introduce the pick of the episode: a book or website or movie that has influenced their thinking in some way. What's your pick for today's episode?
So my pick is by Paul Meehl, who was a psychologist at Minnesota. He gave his last seminar in 1989, and somebody videotaped it, and they put all the videotapes, all of the recordings, online for download. It's a seminar in the philosophy of science. And what makes it really interesting is that he puts psychology within the bigger picture of science in a way that I don't think anybody is doing anymore.
So he approaches understanding psychological phenomena from the perspective of understanding scientific phenomena more generally. And to find it, your listeners can go to the University of Minnesota site and search for Paul Meehl, which is spelled M-E-E-H-L. I've also made a quick URL with the same files, reformatted for people who just want to listen to it. And the URL is tinyurl dot com slash Simonsohn pick: my last name, then pick.
Excellent. Well, Uri, thank you so much for joining us. This was a fascinating discussion, and I'll link to both Data Colada and also your pick on the Rationally Speaking website.
Great. Thanks a lot. Appreciate it.
This concludes another episode of Rationally Speaking. Join us next time for more explorations on the borderlands between reason and nonsense. The Rationally Speaking podcast is presented by New York City Skeptics. For program notes, links, and to get involved in an online conversation about this and other episodes, please visit rationallyspeakingpodcast.org. This podcast is produced by Benny Pollack and recorded in the heart of Greenwich Village, New York. Our theme, "Truth" by Todd Rundgren, is used by permission.
Thank you for listening.