What is the allure of LLM AI chatbots in the search for empirical evidence of ordinary meaning? Judge Newsom’s two concurring opinions channel recent scholarship in developing four main selling points. And they advance those points as grounds for endorsing LLM AIs over tools of corpus linguistics.
Our draft article presents our opposing view that corpus tools have the advantage notwithstanding—or even because of—the purported features of LLM AIs. We outline some key points below.
LLM AIs are enormous
The first claim is that LLMs “train on” a “mind-bogglingly enormous” dataset (400-500 billion words in GPT-3.5 turbo)—language “run[ning] the gamut from … Hemingway novels and PhD dissertations to gossip rags and comment threads.” The focus is on the size and the breadth of LLMs. The assertion is that those features assure that the LLMs’ training “data … reflect and capture how individuals use language in their everyday lives.”
Corpus size can be an advantage. But size alone is no guarantee of representativeness. A corpus is representative only if it “permits accurate generalizations about the quantitative linguistic patterns that are typical” in a given speech community. Representativeness is often “more strongly influenced by the quality of the sample than by its size.” And we have no basis for concluding that an LLM like ChatGPT is representative. At most, OpenAI tells us that ChatGPT uses information that is “publicly available on the internet,” “licensed from third parties,” and provided by “users or human trainers.” This tells us nothing about the real-world language population the creators were targeting or how successfully they represented it. And it certainly doesn’t tell us what sources are drawn upon in answering a given query. In fact, “it’s next to impossible to pinpoint exactly what training data an LLM draws on when answering a particular question.”
Even if LLM AIs were representative, that would not answer the ordinary meaning question in Snell and related cases. Representativeness speaks only to the speech community dimension of ordinary meaning (how “landscaping” is used by the general public). It elides the core empirical question (whether and to what extent the general public uses “landscaping” to encompass non-botanical, functional improvements).
LLM AIs produce human-sounding answers by “mathy” and “sciency” means
The second claim nods at the empirical question. It notes that “researchers powering the AI revolution have created” “sophisticated” “cutting edge” “ways to convert language … into math that computers can understand” through an “attention mechanism” that assigns “an initial vector to each word in a sentence, which is then enriched by information about its position in the sentence.” And it draws on an appeal to the masses—that “anyone who has used them can attest” that LLM AIs produce responses that are “often [so] sensible” that they “border on the creepy.”
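For readers who want a concrete sense of what the quoted “attention mechanism” language is gesturing at, here is a minimal sketch in Python (ours, not the court’s, and not OpenAI’s actual code). It assumes a toy three-word input and a tiny embedding size, and it omits the learned projection matrices and the many stacked layers that real models use. It shows only the bare idea: each word gets a vector, the vector is enriched with positional information, and “attention” weights blend each word’s vector with the others’.

```python
import numpy as np

# Toy illustration of self-attention; all values are made up for this sketch.
np.random.seed(0)
d = 8                                    # embedding size (real models use thousands)
sentence = ["no", "vehicles", "park"]

# Step 1: assign an initial vector to each word.
X = np.stack([np.random.randn(d) for _ in sentence])

# Step 2: enrich each vector with information about its position (sinusoidal encoding).
pos = np.arange(len(sentence))[:, None]
dim = np.arange(d)[None, :]
angle = pos / np.power(10000, (2 * (dim // 2)) / d)
X = X + np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

# Step 3: "attention" -- each word scores every word (itself included) and takes
# a weighted blend, producing context-enriched vectors. Real transformers insert
# learned projection matrices here; this sketch skips them.
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
contextualized = weights @ X

print(np.round(weights, 2))  # how strongly each word "attends" to the others
```

None of this tells a court whether “landscaping” covers an in-ground trampoline. It only shows the kind of arithmetic behind the “sensible”-sounding output.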
All of this tells us that LLM AIs have a “mathy” and “sciency” feel. It says that a federal judge believes that lots of people find their use of language “sensible.” But it tells us nothing about the reliability of LLM AI responses in reflecting ordinary meaning.
As one of Judge Newsom’s sources concedes, “working with LLMs … requires a leap of faith, a realization that no better explanation is forthcoming than long inscrutable matrices that produce predictions”—matrices “whose meaning is wickedly hard to decipher, and whose organization is alien.”
The social science world has an answer to this problem—in an established academic discipline (corpus linguistics). This discipline follows peer-reviewed, transparent methods for developing falsifiable evidence of ordinary meaning. Actual math and science are involved. No leap of faith is required.
LLM AIs produce accurate “probabilistic maps”
The third claim ups the ante on the assertion of empirical accuracy. It contends that LLM AIs are “high-octane language-prediction machines capable of probabilistically mapping … how ordinary people use words and phrases in context.”
The basis for this claim is not clearly stated. The Snell opinion presents no empirical evidence or data to support its views—and certainly not any probabilistic maps.
To the extent Snell cites a source for this claim, it seems to be the Engel & McAdams article on Asking GPT for Ordinary Meaning. This piece is framed around an inquiry into whether various objects (buses, bicycles, roller skates, etc.) should be viewed as falling within the ordinary meaning of “no vehicles in the park.” The authors present AI results in the seemingly empirical terms of “probability” but concede that this is only an assessment of “the probability … that GPT would give a positive response” to a given query, and not a direct measure of how the word “vehicle” would ordinarily be understood in a given context.
With this in mind, the authors pivot to a quest for an “accepted benchmark” for “assess[ing] the performance” of the AI in judging the ordinary meaning of “vehicles.” They settle on the results of Kevin Tobia’s human subject surveys on whether the studied objects fall within the ordinary meaning of “vehicle.” And they workshop a series of GPT queries in search of one that might align with the results of these surveys.
Survey results are not “ground truth” on the ordinary meaning of “vehicle,” for reasons we have noted. But even assuming that they were, Asking GPT for Ordinary Meaning didn’t develop a generalizable query framework that consistently produced results that align with surveys. It studied only “vehicles in the park.” And even there, the best the authors could do—after a long trial-and-error series of attempts—was to formulate a prompt that produced a low correlation with survey results on some of the studied objects (wheelchair, skateboard, and toy car) and a “high correlation” with results on others (like truck, car, bus, moped).
This shows that researchers can tinker with a query until they get results that fit part of one dataset on one research question. It doesn’t tell us that the chosen query formulation (a “belief” prompt + a Likert query) generally produces evidence of ground truth about empirical linguistic facts. The authors offer no basis for believing that this query will fit another dataset or apply to another question.
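To make concrete what a “correlation with survey results” measures in this setting, here is a hedged sketch of the arithmetic, using invented numbers rather than Engel & McAdams’s or Tobia’s actual data. One estimates, for each object, the share of repeated GPT runs that answer “yes, it’s a vehicle” under a fixed prompt, then computes a Pearson correlation against the share of survey respondents who said the same.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Invented rates, for illustration only -- not the study's actual data.
objects = ["car", "truck", "bus", "bicycle", "skateboard", "wheelchair", "toy car"]

# Share of repeated GPT runs answering "yes, this is a vehicle" under one prompt.
gpt_positive_rate = [0.99, 0.98, 0.97, 0.60, 0.30, 0.25, 0.10]

# Share of human survey respondents saying the same object is a "vehicle."
survey_yes_rate = [0.98, 0.97, 0.95, 0.55, 0.35, 0.40, 0.20]

r = correlation(gpt_positive_rate, survey_yes_rate)
print(f"Pearson r across these {len(objects)} objects: {r:.2f}")

# A strong r here says only that the two series track each other for THESE
# objects and THIS prompt wording; it says nothing about other words,
# other datasets, or other prompt formulations.
```

Reformulate the prompt or swap in a different set of objects, and nothing in that single coefficient guarantees the alignment survives.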
LLM AIs are accessible, simple, and seductive
The final pitch is that LLM AIs are widely available, inexpensive, and easy to use. “[T]oday’s LLMs,” on this view, are “[s]o convenient” and their “outputs” are “so seductive” “that it would be … surprising if judges were not using them to resolve questions” of interpretation.
All of this may be true. But inevitability is not the same thing as advisability. And for all the reasons we have been discussing, a judge who takes empirical textualism seriously should hesitate before jumping on the AI train.
We are all in favor of accessible, simple tools. Tomorrow we will contemplate a future in which the accessibility and ease of AI may be leveraged in a way that aligns with the empirical foundations of corpus linguistics. For now, we note that corpus tools themselves are open and accessible. Judges are already endorsing them. And they should continue to do so if they are serious about the empirical premises of the ordinary meaning inquiry.