
Why is it so hard to find trustworthy studies of the impact of AI?
Let’s start with the scandal that unfolded last week. In November 2024, a PhD student at MIT released a pre-print of a major study titled “Artificial Intelligence, Scientific Discovery, and Product Innovation.” The headline finding was that, at a major unnamed US manufacturing firm, researchers who used AI “discovered 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation.” The study drew major attention from prominent media outlets—including The Wall Street Journal, The Atlantic, and Nature—and was hailed as rigorous, empirical evidence that scientists using AI could dramatically accelerate the discovery of new materials.
A huge win for co-intelligence, if you will.
Just one problem: It was bullshit. Pure academic fraud, it seems. In one of the harshest press releases from a major research university you’ll ever read, MIT stated that it has “contacted arXiv to formally request that the paper be withdrawn…MIT has no confidence in the provenance, reliability or validity of the data and has no confidence in the veracity of the research contained in the paper…The author is no longer at MIT.” They concluded, “We award him no points, and may God have mercy on his soul.” (Just kidding.)1
Meanwhile, as this controversy was breaking, a separate mini-scandal developed on an ed-tech email listserv that I’m on, where an academic pointed to a new meta-analysis of ChatGPT use in education published in Humanities and Social Sciences Communications, a Springer Nature journal. The headline findings:
This study aimed to assess the effectiveness of ChatGPT in improving students’ learning performance, learning perception, and higher-order thinking through a meta-analysis of 51 research studies published between November 2022 and February 2025. The results indicate that ChatGPT has a large positive impact on improving learning performance (g = 0.867) and a moderately positive impact on enhancing learning perception (g = 0.456) and fostering higher-order thinking (g = 0.457).
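For those who don’t spend their days with effect sizes, a quick refresher (mine, not the study authors’): Hedges’ g is a standardized mean difference, i.e., the gap between the treatment and control group means expressed in pooled standard deviation units, with a small-sample correction factor:

$$
g = J \cdot \frac{\bar{X}_T - \bar{X}_C}{s_p},
\qquad
s_p = \sqrt{\frac{(n_T - 1)s_T^2 + (n_C - 1)s_C^2}{n_T + n_C - 2}},
\qquad
J \approx 1 - \frac{3}{4(n_T + n_C) - 9}
$$

Taken at face value, g = 0.867 would mean the average student in the ChatGPT condition scored higher than roughly 80% of the control group (assuming roughly normal score distributions).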
These are stunning results for an education intervention—trust me when I say an effect size of .867 on student learning is massive. So much so that George Siemens, who writes a newsletter titled “Sensemaking, AI and Learning (SAIL),” immediately argued the findings are “so substantive…that universities need to figure out what cognitive escalation looks like in the future and how metacognition will be taught and emerging literacies, outside of AI literacies, will be addressed.”
Perhaps you can guess where this is going. Shortly after this meta-analysis was shared, Steve Ritter, the founder and chief scientist at Carnegie Learning, mercifully did a spot check of the underlying studies and found, um, issues. With his permission, I now share his email to the listserv in full (my emphasis added):
I’d like to believe this result, and I was curious about what a meta-analysis of GPT usage would look like, given that there are so many possible usage models. The reported effect sizes are really large.
The first thing I noticed in the paper (Table 3) is that most studies have Ns that look suspiciously like “one class for the experimental condition and one for the control.”
So I picked one of the included papers to examine – Karaman & Goksu (2024) – sorta at random. The meta-analysis classifies Karaman & Goksu as using secondary students (one reason I picked it) and using GPT as an “intelligent learning tool” for “tutoring.”
None of this is correct.
The paper looked at the impact of using GPT to generate lesson plans that were used to teach third-graders (not secondary students). It compared a section taught by one of the authors using these GPT-generated lesson plans to a randomly-selected class taught by a different teacher. They used students as the unit of analysis and found no significant difference. The meta-analysis reports an effect size of 0.07, but I’m not sure where they got that (they gave pre- and posttests, but the main analysis just looked at posttest scores).
I guess this paper technically meets the inclusion criteria for the meta-analysis, but that just shows that their inclusion criteria need to be revisited. Given that one paper I looked at is so clearly misclassified, I’m not sure I trust anything in the meta-analysis.
Another researcher on the listserv then chimed in with the suggestion that “people glance through a few of the other studies that were included here. It's a perfect example of garbage in garbage out.”
So much for cognitive escalation.
This keeps happening. Back in January, an economist at the World Bank wrote a blog post about a pilot study of AI-assisted after-school tutoring in Nigeria that reportedly produced “striking” learning improvements, “the equivalent to nearly two years of typical learning in just six weeks.” Again, gains of that size should on their face raise major red flags, but what’s worse is that the study involved human teachers acting “as ‘orchestra conductors’…who guided students in using the LLM, mentoring them, and offering guidance and additional prompting.”
Oh you don’t say! Unsurprisingly, the kids who received this additional “orchestrated support” from teachers performed better on tests than the kids who received nothing at all, but who knows what role AI played. Of course, if we had a formal write-up perhaps we could tease this out, but despite promises that one would be forthcoming, nothing appears to have been published.2
Unfortunately, our story doesn’t stop here.
One study that has garnered a great deal of attention in the AI-in-education space was published on arXiv by researchers at Stanford in October 2024, titled “Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise.” This effort involved a randomized controlled trial in which some human tutors were given access to AI support while tutoring K-12 students in math (via text-based chat), whereas other tutors were not. More specifically:
This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities. Following a preregistered analysis plan, we find that students working with tutors who have access to Tutor CoPilot are 4 percentage points (p.p.) more likely to master topics (p<0.01). Notably, students of lower-rated tutors experienced the greatest benefit, improving mastery by 9 p.p. We find that Tutor CoPilot costs only $20 per-tutor annually…Altogether, our study of Tutor CoPilot demonstrates how Human-AI systems can scale expertise in real-world domains, bridge gaps in skills and create a future where high-quality education is accessible to all students.
This study received a lot of attention, with write-ups in MIT Technology Review, Education Week, The 74, K-12 Dive, and MarkTechPost, among others. I was even quoted in The 74 story praising the study design and observing that, if its findings held up, it offered promise for future tutoring efforts.
But after closer review, aided by other education researchers, I no longer see that promise.
Before continuing, I want to be crystal clear that there is zero evidence, and I am not suggesting, that any of the researchers involved in this study committed any form of academic fraud along the lines of what the MIT PhD student appears to have done. Nor am I contending that they undertook a shoddy empirical investigation à la the ChatGPT meta-analysis or the World Bank pilot.
However. What I am prepared to say is that the authors of this study presented their findings in a way that deliberately plays up the positive impact of AI, while minimizing the evidence that calls into question the value of using AI to enhance tutoring. And I find this troubling. To understand why requires diving a bit into the research weeds, so please bear with me.
Let’s start with the core finding outlined above—that students who worked with tutors who had access to AI-enhanced tutoring (aka Tutor CoPilot) were 4 percentage points more likely to master topics overall, and that students with lower-rated human tutors improved mastery by 9 percentage points. These are modest but non-trivial gains, calculated through the use of “exit tickets,” essentially mini-tests given at the end of tutoring sessions.
When this study first crossed my desk back in October, I found it slightly puzzling that this was the outcome measure the researchers used, given that they also collected data on how students did on their end-of-year summative tests called MAP—more on this momentarily. Nonetheless, I assumed they had reasons for making that research design choice, and hey, it’s their study, they get to choose their measure.
But that is not exactly what happened. As is good practice, the researchers here pre-registered the primary outcome measures they intended to use, which included a healthy dose of survey measures of tutors and students, analysis of language use within the tutoring sessions, and—ahem—the students’ spring MAP scores.
So how did that go? Although this was a math tutoring intervention, students who received AI-enhanced tutoring did worse on the end-of-year MAP math test, though the results were not statistically significant. (Oddly enough, students did slightly better on the MAP reading test, but here too the results were not statistically significant.)
In other words, there was no measurable impact of AI-enhanced tutoring on the primary pre-registered outcome measure of MAP test results. Yet this seemingly important finding is reported on pages 31 and 32 of the 33-page study. In an appendix.
Now, to be completely fair, in their pre-registration the researchers did include a very long list of additional data they planned to collect beyond their primary outcome measures:
Blink and you might miss it, but exit tickets are in there—yet plainly they are not what the study was designed around. In my view, featuring the exit-ticket data after the fact because it provides modest evidence of a positive AI effect contravenes the spirit, if not the letter, of the pre-registration process, whose whole purpose is to prevent researchers from hacking their data to produce novel results.
As a sanity check, I asked two education researchers to independently review this study to see if my concerns were warranted. Here is what the first said (my emphasis added):
If I was reviewing this for a journal, I would say revise and resubmit, and would not accept for publication unless they centered the MAP score as the outcome of interest. That said, I think it's very publishable! But from a policy / practice perspective, I probably wouldn't use Tutor Co-Pilot, given the disappointing MAP results.
My biggest concern has to do with using the exit ticket as the outcome….There are several reasons why the exit tickets are not the appropriate outcome to focus on:
Post-session mini-tests (exit tickets) are not mentioned in the pre-registration plan (Spring MAP is in the pre-registration plan).3
Unless I missed it, the authors provide no evidence of reliability of exit tickets.
Exit tickets are (I think?) a researcher-developed outcome measure. Potentially meaningful differences in effect sizes can be obtained from measures created by study authors, and these measures may not be as informative to policymakers and practitioners as independent measures (from WWC V5 updates summary).
If exit tickets are the outcome, it’s not really an Intent to Treat study (causal estimate of impact) because students don’t have exit tickets if they don’t do the tutoring. You can do a real ITT estimate with the MAP data because that would be available for all students in the study, regardless of whether they engaged in tutoring.
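That last point is worth making concrete. Here is a minimal sketch (mine, not the reviewer’s or the study authors’; the data file and column names are hypothetical) of the difference between a true intent-to-treat estimate on MAP scores and an exit-ticket analysis that conditions on participation:

```python
# Illustrative sketch only -- not the study's actual data or analysis code.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per randomized student.
df = pd.read_csv("students.csv")

# Intent-to-treat: compare everyone *assigned* to a CoPilot-equipped tutor
# against everyone assigned to a regular tutor, using an outcome (spring MAP)
# observed for all randomized students.
# (Standard errors should be clustered at the tutor level; omitted for brevity.)
itt = smf.ols("map_spring ~ assigned_copilot + map_fall", data=df).fit()

# Exit-ticket analysis: only students who actually attended sessions produce
# exit tickets, so the comparison is conditioned on participation.
attended = df[df["sessions_attended"] > 0]
per_protocol = smf.ols("exit_ticket_pass_rate ~ assigned_copilot", data=attended).fit()

print(itt.params["assigned_copilot"], per_protocol.params["assigned_copilot"])
```

The first regression keeps every randomized student in the comparison; the second silently drops anyone who never showed up, which is exactly the reviewer’s objection.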
And the second (again with my emphasis):
Is the overall pattern of results encouraging? Not obviously, even if taken at face value. On one side of the ledger, positive and statistically significant results for exit tickets. On the less-good side, insignificant negative end-of-year effects, and insignificant, mixed-direction results for participation and surveys. Everything hinges on what you think the exit tickets are telling us…
What they really need is an explanation for the divergence between the exit ticket results and the [end of year] MAP results and that's probably hand-waved away a little too easily….I think it's actually pretty easy to come up with a story where 1) exit tickets are valid measures of math skill in general but 2) the AI component helps students complete exit tickets without learning anything (making the exit tickets a less valid measure of learning).
In fact, their lit review notes some possibilities here (e.g., “[Large-Language Models] often generate bad pedagogical responses, such as giving away answers".)
There are other yellow-to-red flags too. For one thing, the researchers claim their study involved approximately 1,800 students—but that refers to the number of kids who were offered tutoring, not the number who actually received it. As it turns out, there was a 38% attrition rate over the seven-week intervention, and many students participated in only one or two tutoring sessions. The question of why participation rates were so low is barely addressed. Perhaps kids, being humans, do not like to learn by typing back and forth on a computer?
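A quick back-of-the-envelope calculation (mine, and assuming the reported 38% figure applies to the full randomized sample):

$$
1{,}800 \times (1 - 0.38) \approx 1{,}100 \text{ students still participating by the end of the intervention}
$$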
For another, the study reports the cost of delivering AI-enhanced tutoring as only $20 per tutor annually, but that figure appears to exclude the extensive training provided to the human tutors to prepare them to use the AI. As one of my anonymous reviewers noted, “the costs should include the tutor training (and the trainer’s time, and any materials or facilities needed for training).” Perhaps not coincidentally, not long after this report was published, the company that provided the AI-enhanced tutoring went out of business.
Ok, so. Is all this just academic sniping? I think not, and here’s why:
Just two weeks ago, I was invited to a conference at Stanford regarding the future of tutoring, one that involved some of the researchers involved in this very study. During one session, I listened to a program officer from the Overdeck Family Foundation, a major funder of education research and other efforts related to tutoring, make clear that as federal funding for tutoring dries up, his employer will soon be pushing for tutoring companies to become “more cost efficient” using technology (read: AI). The Overdecks get to choose how to spend their money, of course, but what they do not get to do is claim that there is robust empirical evidence suggesting that AI-enhanced human tutoring is effective. There isn’t.
Research is a public good, and one that’s vital to producing knowledge we can use to inform policy and practice. With Silicon Valley in full AI hype mode, we need independent studies of AI in education that we can trust to fairly evaluate this technology’s impact, positive or negative.
We’re not getting that right now and it’s a problem.
Addendum: I highly recommend this essay by Nick McGreivy, who recently received his PhD in physics from Princeton, on the limits of AI for science. As he aptly notes, “be cautious about taking AI research at face value. Most scientists aren’t trying to mislead anyone, but because they face strong incentives to present favorable results, there’s still a risk that you’ll be misled. Moving forward, I would have to be more skeptical, even (or perhaps especially) of high-impact papers with impressive results.”
Addendum #2: After publishing this essay, Mike Kentz pointed me to this impressively comprehensive investigation into AI-in-education studies conducted by Wess Trabelsi, who reports being psychologically scarred by the “incompetence, lack of integrity, and downright confirmed fraud” in this space. It’s bad!
1. For more background on why this study should have raised red flags from the outset, check out this essay from the appropriately named BS Detector. Paul Bruno also shared a thoughtful thread exploring the challenges of sniffing out academic fraud via peer review.
2. Did these obvious methodological flaws stop Ethan Mollick from excitedly sharing the results with his (checks notes) 275,000 followers on LinkedIn? Of course not. And it’s far from the first time Mollick has promoted questionable AI research, as I’ve covered previously.
3. This statement is correct as to the primary outcome measures, but, as noted, exit tickets were technically included in the pre-registration.