Wonderfully written and nourishing as always — particularly enjoyable somehow when I happen to be doing emails and I can read while the ink is still wet, justifying the distraction as one step closer to the near-mythical inbox zero.
However.
"here in the real world, humans can’t memorize all the modern-day classics, or even one classic, for that matter. We remember themes and ideas from books, not the order of each and every word. This is just fundamentally different from how LLMs operate"
Surely not so! Do you contest that, having read every Murakami back to back over a year, the style of my own writing might not fundamentally change? By choice, perhaps, but certainly not by producing the words by rote. I might change my punctuation usage and mood; employ passive voice and increasingly abstract oddities in my turns of phrase. How is this different? The signal that drives these changes comes from assimilating Murakami's semantic maps through his writing, in a fashion very similar to how language models do it. I can't replicate his texts verbatim, as is often argued in the 'reproduction half' of this IP debate; our brains are so much more than just the 'semantic assimilation function' that these LLMs represent. No? What am I missing here?
(I'm not drawing a line in the sand on the IP infringement, btw; that feels deeply thorny and I can see both sides. Unlike breaking the book bindings and burning the content. That feels deeply repulsive, with little to nothing on the other side of the coin.)
Thanks for the kind words, Dominic, but I'm afraid I must disagree that we disagree!
My claim here is simple---neither you nor any other human memorizes word-for-word the texts of the books we read. You've read every Murakami novel over the past year (fun year), but I guarantee you that if I ask you to reproduce the text of any of them, you'd fail miserably. This would be true even if I gave you the entire first chapter of any of his books; if I asked you to predict what words should follow, you couldn't do it. Even Murakami himself can't do it! What you've picked up instead is, well, what you describe---certain writing methods and grammatical flourishes that you associate with his style. That's a "theme" under my broad use of the word.
In contrast, large-language models are trained to look at text and predict the text that should follow. If I were to build a Large-Murakami Model based on his oeuvre, then feed it, say, the first chapter of The Wind-Up Bird Chronicle, and then ask my LMM to continue the text... it likely would produce exactly what Murakami has written, word-for-word, for the next sentence, the next paragraph, maybe even whole chapters before eventually going off the rails. The New York Times lawsuit against OpenAI cited numerous examples of taking the first paragraph of NYT news stories, plugging them into ChatGPT, and getting the rest of the story regurgitated.
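If it helps to see the mechanism, here's a minimal sketch of that "continue the text" loop, assuming the Hugging Face transformers library. Plain GPT-2 and a Dickens prompt are just stand-ins (no Large-Murakami Model exists, alas); any causal language model works the same way:

```python
# A minimal sketch of next-token prediction with greedy decoding.
# Assumes the Hugging Face "transformers" library (pip install transformers torch);
# "gpt2" is a stand-in for the hypothetical Large-Murakami Model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Feed the model the opening of a text it has almost certainly seen in training.
prompt = "It was the best of times, it was the worst of times,"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: at each step, emit the single most probable next token
# given everything before it. For passages that recur in the training data,
# that most probable continuation is often the original, word for word.
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Run this and you'll likely get more Dickens back, simply because that opening appears countless times in web-scraped training data and greedy decoding surfaces the memorized continuation. That's the same effect, at paragraph scale, that the NYT lawsuit documented.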
Of course, in part because of these lawsuits, these models are now adjusted ("fine-tuned") to stop this from happening. But the fact that this can happen at all illustrates, to me at least, a fundamental difference between how humans read and interpret texts and how generative AI models are trained.
There’s a Pierre Menard joke in there somewhere—Claude Opus, author of the Quixote?
Dammit, I wish I'd thought of that. How did Borges predict everything that's come to pass?!?
The idea of the totality of his work as a kind of template for what will happen is itself like a Borges story.
[Insert Always Sunny Conspiracy Meme]
The irony here is not lost on me. LLMs were trained on some of the best human writing (and some of the worst, haha), and now it seems that their language operation mimics how humans write. But it doesn't. We fed LLMs a vast number of *products* of human writing — novels, poems, stories, etc. They mimic linguistic patterns from that data and mindlessly churn out new text to complete a set of instructions (a prompt). But they cannot reproduce all the other faculties of human writing, which made those texts possible in the first place. We write with much less "data" than an LLM. Our brains do very different cognitive work when we learn to use language and then also learn to think, to reason, to make inferences, to imagine hypotheticals, and so on. In other words, I want to focus on the process, on the value of learning to write and think, not just the end product.
Thanks for the thoughtful comment. And irony is one thing I strongly suspect LLMs will always lack! https://buildcognitiveresonance.substack.com/p/there-is-no-artificial-irony
Thanks for linking that other essay. I love this phrasing: "Yes, they can remix and recycle our knowledge in interesting ways. But that’s all they can do. They are trapped in the vocabularies we’ve encoded in our data and trained them upon. Large-language models are dead-metaphor machines." The Italian scholar Matteo Pasquinelli writes about this same point in his amazing article, "How a Machine Learns and Fails--A Grammar of Errors for Artificial Intelligence." In the section he calls "The Unprediction of the New," Pasquinelli notes, "Machine learning will never be able to detect or generate Rimbaud's famous line 'I is another' after running a statistical analysis of a million newspapers. Machine learning never invents codes or worlds, rather draws vector spaces that reproduce statistical frequencies of old data. In computational statistics, a new metaphor is a new vector with no similarities of frequency with old vectors--something that would easily disappear in the following computational passage" (p. 16). The passage continues, but I don't want to make this reply too long. The point is that something truly new, something that has not been computed and categorized by a machine from its training data, cannot be generated. So, it's more accurate to say that LLMs generate unique outputs, rather than new ones. https://mediarep.org/server/api/core/bitstreams/2cea80ce-1538-4792-8fbf-97ff4e2c0af0/content
“If the premise were true, perhaps Alsup’s conclusion would follow—but here in the real world, humans can’t memorize all the modern-day classics, or even one classic, for that matter. We remember themes and ideas from books, not the order of each and every word. This is just fundamentally different from how LLMs operate, and I hope lawyers that represent authors and other creatives hammer this point home in future litigation.”
Herman Goldstine once wrote of the mathematician and scientist John von Neumann, “One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes.”
If the reason that LLMs violate copyright is that they remember everything perfectly, does that mean that John von Neumann (or anyone else with an eidetic memory) should not be allowed to read books and use that information to get better at things, as an LLM might?
Well, to answer your question, if von Neumann or anyone else were to *re-create* the entirety of the works they've allegedly memorized, then yes, they would be violating copyright law.
As for his incredible genius, stories abound, but hard data is lacking. There's a famous math puzzle involving von Neumann that goes like this (I'm quoting from https://gcher.com/posts/2024-04-10-von-neumann-fly/):
Two bicycles are traveling toward each other at the same speed until they collide; meanwhile, a fly is traveling back and forth between them, also at a constant speed. The bicycles start 20 miles apart and travel at 10 miles per hour, and the fly travels at 15 miles per hour. How far does the fly travel in total?
Answer: The trick is to notice that the bicycles will collide after exactly one hour, and since the fly travels at a constant speed of 15 miles per hour, it will have traveled exactly 15 miles.
According to the story, when John von Neumann was told of this puzzle by Max Born, he immediately gave the correct answer. Max Born then remarked that most other people he told the riddle didn’t see the trick and instead tried to sum all the segments of the fly’s path. Von Neumann then replied that he also didn’t see the trick and in fact did compute the infinite sum.
***
People cite this as evidence of VN's genius, but we might note how inefficient his approach to solving the problem was (if it's really true that he calculated the infinite sum). The capacity to generalize and abstract is a powerful thing.
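For what it's worth, the infinite sum is easy to set up and does land on the same answer. Here's a quick sketch, writing $d_n$ for the gap between the bicycles at the start of the fly's $n$-th leg (so $d_1 = 20$). The fly and the oncoming bicycle close at $15 + 10 = 25$ mph, while the two bicycles close at $10 + 10 = 20$ mph, so leg $n$ lasts

$$t_n = \frac{d_n}{25}, \qquad\text{and}\qquad d_{n+1} = d_n - 20\,t_n = \frac{d_n}{5}.$$

The fly covers $15\,t_n = \tfrac{3}{5}\,d_n$ on each leg, so the total distance is a geometric series:

$$\sum_{n=1}^{\infty} \tfrac{3}{5}\,d_n = \frac{3}{5}\cdot\frac{20}{1 - \frac{1}{5}} = \frac{3}{5}\cdot 25 = 15 \text{ miles},$$

the same 15 miles the one-hour trick gives immediately.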
Okay, but it’s not as if LLMs are constantly recreating, perfectly and in their entirety, the works they’ve consumed. It’s fair to say that that would be a violation of copyright law, but it’s not what they’re doing. While you could argue that an LLM *could* do that and is therefore bad, John von Neumann could do the same, so you would have to conclude that he was bad for the same reason.
That story is pretty funny, but I don’t really understand the relevance of you telling it.