On your mark, AI is set. Go?
Some inconclusive musings about using generative AI to assess student writing
When it comes to using AI to grade student essays, I’m of two minds. Maybe three. It’s complicated.
On the one hand…no. Obviously, no! Why do we write? We write to express our thoughts to other humans. Just on first principles, and leaving aside (for the moment) any question of efficacy, we should resist any effort that degrades that fundamental purpose by outsourcing the “reading” and assessing of student writing to a mindless algorithmic process. Veteran classroom teacher Peter Greene argues that this is all just taking us one step further down the road to automated education hell:
What happens to a student's writing process when they know that their "audience" is computer software? What does it mean when we undo the fundamental function of writing, which is to communicate our thoughts and feelings to other human beings? If your piece of writing is not going to have a human audience, what's the point? Practice? No, because if you practice stringing words together for a computer, you aren't practicing writing, you're practicing some other kind of performative nonsense.
I’m nodding, I’m nodding. But on the other hand…maybe? Because here’s the thing: Many teachers loathe grading essays. Or perhaps more accurately, many teachers want to provide meaningful feedback to their students on their writing, but doing so is a huge time suck that often happens “after hours,” meaning outside the formal school day. I’m always interested to find ways to make teachers’ lives easier, so despite the very real risk of fostering performative nonsense…when it comes to essay grading, I’m at least curious about whether AI might help shoulder the burden that teachers currently bear.
And lo, we have some new research on that front, nicely summarized in this story by Jill Barshay of the Hechinger Report. The overall headline is that when assessing a swath of high school essays, ChatGPT performs reasonably well compared to “expert” human essay graders (I’ll explain the scare quotes momentarily). According to Tamara Tate, the researcher at the University of California, Irvine who led the study, ChatGPT is “roughly speaking, probably as good as an average busy teacher [and] certainly as good as an overburdened below-average teacher.”
Now about those scare quotes. The “expert” human graders in this study were teachers and PhDs who were given a whopping three hours of training in how to assess essays. In email correspondence this week, Barshay shared that one of her readers suggested this training bestowed Caitlin Clark-level ability to evaluate student writing. That seems a stretch, to say the least, but it does make me wonder what the training entailed, and whether it’s something all teachers would benefit from undertaking.
Another finding of note: This study compared AI to human graders on essays written by English-language learners (ELLs) and native English speakers. In keeping with the theme of this essay, here too the results were mixed. Overall, humans and ChatGPT were more likely to agree on the score for an essay written by a native English speaker than for one written by an ELL, but within specific sample sets of essays, the AI grading was more likely to align with human assessment for ELLs than for native English speakers. Thus, “we conclude that there is no specific harm to English learners in using AI scoring and, in fact, it is worth additional research to consider whether AI scoring is actually more reliable for English learners in other corpora and with improved AI models.”
On the additional research point, yes, keep it coming. Fun fact: the transformer architecture that underlies large language models largely emerged from a Google research team that was working to improve language translation, using a model they trained on English, French, and German (seminal paper here). Language translation, in other words, is what these models were originally designed to do, and I think there are potential use cases here for AI in education, especially for students learning English (in the American education system). My friend and teacher extraordinaire Michael Pershan has even suggested AI could revolutionize language learning, and he might be right. I’ll be writing more about this soon-ish.
But back to essay grading. Across the pond, English educator Daisy Christodoulou and psychometrician Chris Wheadon have been promoting a third-way solution to assessing student writing through the use of “comparative judgment.” The idea is simple: Rather than evaluate essays using rubrics (the edu-jargon for checklists), why not simply compare two essays and decide which is better? The theory is that this comparative, relative approach can be as reliable as – maybe even more reliable than – having humans grade against an absolute standard, and can be done much faster. And perhaps we could use generative AI to make the comparative judgments for us?
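If you’re curious what “AI as comparative judge” might look like in practice, here’s a minimal sketch – my own illustration, not the setup Daisy and Chris actually used. A (stubbed-out) model call picks a winner for each pairing, and a simple Bradley-Terry-style fit turns those wins into a ranking. The `llm_prefers_first` function and the sample essays are placeholders you’d replace with a real chat-model prompt and real student work.

```python
# Sketch of LLM-based comparative judgment: round-robin pairwise comparisons,
# then an iterative Bradley-Terry-style fit to turn wins into a ranking.
from itertools import combinations


def llm_prefers_first(essay_a: str, essay_b: str) -> bool:
    """Placeholder for the judgment call.

    In a real run this would prompt a chat model with both essays and ask
    which is the better piece of writing. Here we fake it with length so
    the script runs end to end.
    """
    return len(essay_a) >= len(essay_b)  # stand-in heuristic, not a real judge


def rank_essays(essays: dict[str, str], rounds: int = 50) -> list[tuple[str, float]]:
    """Compare every pairing once, then estimate a relative strength per essay."""
    ids = list(essays)
    wins = {i: 0 for i in ids}          # how many comparisons each essay won
    opponents = {i: [] for i in ids}    # who each essay was compared against
    for a, b in combinations(ids, 2):
        winner = a if llm_prefers_first(essays[a], essays[b]) else b
        wins[winner] += 1
        opponents[a].append(b)
        opponents[b].append(a)

    # Iterative Bradley-Terry-style update, with a small prior (+0.5) so
    # zero-win essays keep a positive strength and the math stays stable.
    strength = {i: 1.0 for i in ids}
    for _ in range(rounds):
        new = {}
        for i in ids:
            denom = sum(1.0 / (strength[i] + strength[j]) for j in opponents[i])
            new[i] = (wins[i] + 0.5) / denom if denom else strength[i]
        total = sum(new.values())
        strength = {i: v / total for i, v in new.items()}  # normalize each round
    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "essay_1": "A short, thin response.",
        "essay_2": "A longer response that develops its argument in more detail.",
        "essay_3": "A middling answer with some development.",
    }
    for essay_id, score in rank_essays(sample):
        print(f"{essay_id}: {score:.3f}")
```

The appeal of this approach is that the model never has to apply an absolute standard – it only ever answers “which of these two is better?” – and the ranking falls out of many such small decisions.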
Well, Daisy and Chris spent months testing that very proposition, and I’m sad to report the results were, in their words, “underwhelming.” They found “poor agreement between human judging and GPT marks” and “poor predictive validity for GPT marks,” so that’s a bummer. What’s more, and surprisingly (at least to me), performance degraded with newer iterations of ChatGPT – in their experiments, GPT-4 actually did worse than GPT-3. I spoke with Daisy recently, and suffice it to say she sounded pretty burned out by the whole experience of trying to make AI do something it is not yet capable of doing well.
Yet, yet. I’m trying to foster cognitive resonance around these parts, but my dissonance here is real. Because I can imagine a future where generative AI based on LLMs is capable of making reliable, predictive judgments about the quality of student essays. They are large language models, after all, with an emphasis on language. And if it’s really true that today’s models are “laughably bad” compared to what might be coming, well, maybe AI will prove to be genuinely useful in helping to automate some – not all! – of the task of evaluating writing.
Or maybe Greene is right and we’re drifting into a Kafkaesque future of students using bots to write essays evaluated by bots. Somewhere, Jane Rosenzweig shudders. I don’t know, it’s complicated!