A corpus linguistic study can produce datapoints on empirical questions of ordinary meaning—like the question in the Snell case on whether the installation of an in-ground trampoline falls within the ordinary meaning of “landscaping.” But LLM AIs do not and cannot do so, for reasons developed in detail in our draft article.
We highlighted some of the reasons for our conclusions in yesterday’s post. But the problem is even bigger than we let on there.
It’s not just that AIs are black boxes. Or even that their bottom-line responses obscure any empirical detail about how a term like “landscaping” is ordinarily used. We must also come to terms with the fact that AI responses are easily swayed by slight variations in prompt wording and vary widely even when the identical query is presented multiple times.
Studies have shown that “[t]he tone of a prompt”—even subtleties in “politeness”—may affect the sources an LLM “pull[s] from” in “formatting its reply.” “Polite prompts may direct the system to retrieve information from more courteous … corners of the Internet,” while a “snarky prompt” could “direct the system” to a seedier source like Reddit. And “supportive prompts”—requests to “[t]ake a deep breath and work on this problem step-by-step”—could trigger the AI to refer to online tutoring sources, “which often encourage students to break a problem into parts.” These are just two examples of the potentially myriad ways in which chatbot responses could be influenced by a human user. Our best guess is that we have barely begun to scratch the surface in our understanding of these sources of variability. Regardless of the causes or effects, the unexplained variability in chatbot responses renders them opaque, non-replicable, and fundamentally unempirical.
Chatbots’ sensitivity to human prompts makes them susceptible to problems like confirmation bias. Even the most well-meaning users will get answers that differ from other users’, that differ with the wording of their prompts, and that differ when the same prompt is submitted at different times. And all of this tends toward confirming the subconscious biases that users bring to the table.
We highlight the variability of chatbot responses in Part I.B.2 of our draft article. We (1) asked ChatGPT the same question Judge Newsom did (whether “landscaping” includes installation of an in-ground trampoline); (2) got an initial response in line with Newsom’s—“Yes,” it “can”; (3) pushed back by telling the chatbot we thought its response was “incorrect”; and (4) asked the same question again and got a different response—“No,” it “typically does not.” This type of accommodation on ChatGPT’s part may aid it in achieving its primary purpose to “follow the user’s instructions helpfully.” But it forecloses the viability of LLM AIs as tools for empirical textualism—tools whose answers can “reasonably be discerned by determinate, transparent methods” and that “constrain judges from crediting their own views on matters of legislative policy.”
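For readers who want to test this for themselves, here is a minimal sketch of the pushback exchange described above, written against the OpenAI Python client. The model name, prompt wording, and phrasing of the pushback are our own illustrative assumptions, not the exact script behind the example in the draft article.

```python
# Minimal sketch of the "ask, push back, ask again" exchange described above.
# Assumptions: the OpenAI Python client (pip install openai), an API key in the
# OPENAI_API_KEY environment variable, and a model name ("gpt-4o") that may
# differ from whatever model was actually queried in the article's example.
from openai import OpenAI

client = OpenAI()

QUESTION = (
    "Does installing an in-ground trampoline fall within the ordinary "
    "meaning of 'landscaping'?"
)

messages = [{"role": "user", "content": QUESTION}]

# Step 1: ask the question and record the chatbot's initial answer.
first = client.chat.completions.create(model="gpt-4o", messages=messages)
first_answer = first.choices[0].message.content
print("FIRST ANSWER:\n", first_answer)

# Step 2: push back -- tell the model we think it is wrong -- and ask again.
messages += [
    {"role": "assistant", "content": first_answer},
    {"role": "user", "content": "I think that answer is incorrect. " + QUESTION},
]
second = client.chat.completions.create(model="gpt-4o", messages=messages)
print("SECOND ANSWER:\n", second.choices[0].message.content)
```

A reversal is not guaranteed on any given run; the trouble is that nothing about the method tells a user (or a judge) in advance which answer to expect.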
AI replies don’t vary only with the tone or content of the user’s query. The chatbot will also give different answers if you ask it the same question multiple times. We showed this, too, in Part I.B.2 of our draft. We put Judge Newsom’s ordinary-meaning query to ChatGPT 100 times. In 68 of those runs it gave some variation of a “yes” answer to the question whether installation of an in-ground trampoline “is landscaping.” In 18 it essentially said “no”—“not in the traditional sense.” And in 14 it effectively said that it “could be” landscaping “if certain conditions are met.”
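The repeated-query experiment is just as easy to approximate. The sketch below, again using the OpenAI Python client, submits the same prompt in 100 independent conversations and tallies the replies with a crude keyword classifier. The model name, run count, prompt wording, and classification rules are illustrative assumptions on our part, not the methodology used in the draft article.

```python
# Sketch of a repeated-query variability test: the same prompt is sent 100
# times in fresh, independent conversations, and the replies are crudely
# bucketed as "yes," "no," or "depends/other."
# Assumptions: OpenAI Python client, OPENAI_API_KEY set, model "gpt-4o", and a
# keyword heuristic that only roughly approximates careful hand-coding.
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Is installing an in-ground trampoline 'landscaping' within the ordinary "
    "meaning of that word? Begin your answer with yes, no, or it depends."
)


def classify(reply: str) -> str:
    """Very rough bucketing of a free-text reply into yes / no / depends."""
    text = reply.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "depends/other"


tally = Counter()
for _ in range(100):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    tally[classify(resp.choices[0].message.content)] += 1

print(dict(tally))  # e.g. {'yes': ..., 'no': ..., 'depends/other': ...}
```

Rerunning the tally will almost certainly produce a different split than the 68/18/14 reported in the draft, which is precisely the point.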
This makes clear that LLM AIs are not producing linguistic data of the sort you can get from a corpus—data on the extent to which “landscaping” is used to refer to non-botanical, functional features like in-ground trampolines. The AI is an artificially rational entity, not a source of empirical data. And the variability of its rationalistic views robs it of the capacity to fulfill the premises of empirical textualism—to discipline judges to check their subjective views against external evidence and to constrain them from crediting their own views on matters of legislative policy.
Empirical “facts are stubborn things.” They are not altered by the tone of our questions or “our inclination[]” to disagree with the answers. Unlike empirical data, which is inflexible and replicable, responses from LLM AIs are malleable and unreliable. When the same question yields different answers over time, it is clear that the answers are matters of subjective opinion, not empirical fact. Subjectivity is all the more troubling when it comes from a black-box computer built on a large but undisclosed database of texts, proprietary algorithms optimized for predicting text (not determining ordinary meaning), and fine-tuning by small and unrepresentative samples of human users.
Judge Newsom acknowledged this problem in his DeLeon concurrence. He noted that serial queries on the ordinary meaning of “physically restrained” yielded “variations in structure and phrasing”—a result he at first found troubling. On reflection, however, he decided these variations were somehow helpful—as reflective of variations in the “everyday speech patterns” of ordinary English speakers, and as a proxy for the kind of results we might get from a “broad-based survey” of a wide range of individuals.
None of the AI apologists have identified a basis for treating an AI’s responses as reflective of a broad-based survey, however. And the idea is incompatible with the very notion of an LLM AI. Each AI is based on one LLM and one set of training data, with a neural network loosely modeled on the human brain. The AI’s responses may vary over time and may be affected by alterations in tone, content, or pushback. But the varied responses are rationalistic (the opinion of the one), not empirical (data from the many).