
Despite Their Feats, Large Language Models Still Haven’t Contributed to Linguistics



A review of Chomsky’s views on linguistics and LLMs

Photo by DeepMind on Unsplash

Before you come at me with pitchforks and torches for such a controversial title, please hear me out. Over the past few years, we’ve seen headlines and examples of what Large Language Models (BERT, GPT-3, LaMDA, etc.) can do — an explosion of capabilities in tasks ranging from sentiment classification, text generation, question answering, and more.

This article is not questioning the engineering strides that Large Language Models (LLMs) have made over the past half decade. Rather, it’s a look into critiques of what LLMs have contributed to the science of linguistics. I’ll mainly be discussing Professor Noam Chomsky’s takes and views on linguistics and, more recently, LLMs, referencing the sources listed at the end of this article.

Let me further preface this by saying that this is not meant to trivialize the importance of engineering and that my explanations of Chomsky’s theories & views may not be the best — any faulty arguments may be due to my lackluster interpretations. I highly encourage you to watch his interview (and read his book too)!

  • For those of you unfamiliar with…
  • Science vs Engineering
  • Chomsky’s 3 models
  • Do newer LLMs actually “understand” language?
  • Why more compute won’t help with same paradigm
  • Question of Determinism and Free Will
  • Where to from here?
  • What does this mean for us?

Large Language Models (LLMs)

Language models are probabilistic models that attempt to map the probability of a sequence of words (a phrase, sentence, etc.) occurring — i.e., how likely a given sentence is to occur. They are trained on a collection of texts and derive probability distributions from there. The key difference between LLMs and common language models is that LLMs are trained on MASSIVELY greater amounts of text with exponentially more compute.

In tasks involving text generation (ex: summarization, question answering, prompt completion), LLMs employ conditional probability when generating text. In other words, when deciding on the next word, an LLM looks at the previous words in the sequence and, based on that context, selects the most likely word to come next. See the example below:

An example of probabilities of next words based on context (Image by author)
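To make this concrete, here’s a minimal sketch of next-word probability estimation using a toy bigram model. The corpus, counts, and vocabulary are invented for illustration; real LLMs condition on much longer contexts using neural networks rather than raw counts:

```python
from collections import Counter

# Tiny invented corpus; a real LLM trains on billions of tokens.
corpus = "the child runs . the children run . the child sleeps .".split()

# Count how often each word follows a given context word.
bigrams = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def next_word_probs(context):
    """Estimate P(word | context) from bigram counts."""
    return {
        w2: count / context_counts[w1]
        for (w1, w2), count in bigrams.items()
        if w1 == context
    }

print(next_word_probs("child"))  # {'runs': 0.5, 'sleeps': 0.5}
```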

For a nice, high-level explanation, check out this talk from Stanford’s Dr. Christopher Manning!

Professor Noam Chomsky

Chomsky is perhaps the most renowned linguistics professor of the past century. Since the 1950s, he’s transformed the fields of linguistics and cognitive science with his concept of universal grammar and his challenges to conventional ideas of how we learn — arguing that much of knowledge and behavior is innate to the mind/brain.

His expansive work at MIT has been influential, to say the least; and his talks and interviews have offered insightful views. It’s only fitting to refer to him when examining how the current state of NLP stacks up with linguistics.

Science vs Engineering

In the beginning of the “Machine Learning Street Talk” interview, Chomsky makes a clear distinction between science and engineering:

  • Science entails questioning and understanding why things are naturally the way they are AND why they aren’t occurring some other way — it seeks to explain the underlying phenomena that we witness in the real world.
  • Engineering is applying what we know from science to solve problems.

He then presents some analogies from the natural sciences:

  • If a researcher sees an object fall outside a window, will simply recording the event and replaying the video explain how gravity works?
  • If a researcher sees something catch on fire, will recording it and replaying the video explain how combustion reactions work?

One could mimic the examples above, but mimicry alone does not contribute to our understanding of why these phenomena are occurring and why not something else (under these conditions).

Chomsky then goes on to state that LLMs (like GPT-3) have not contributed to our understanding of the science behind linguistics — they don’t explain how or why language/grammar operates the way it does. Rather, because these LLMs operate in a probabilistic manner, they are simply doing their best job at trying to mimic what a person would say/write within a given context. Additionally, due to LLMs’ probabilistic nature, they also have a chance of producing ungrammatical sentences.

Ultimately, LLMs follow an “anything goes” approach — they can sometimes produce nonsense, and they are not capable of mapping all* possible correct sentences in a language. In the natural sciences, when we have a theorem/law (ex: Newton’s law of gravity, Theory of combustion, etc.), it needs to explain and encompass all possible situations.

The formula for each law/theory (in the links above) is consistent in all circumstances.

*There are infinitely many sentences that can be produced by a language. An LLM can be good at producing the most probable sentences, but that doesn’t mean it has a formula or theorem that can represent all possible grammatical sentences while simultaneously excluding the impossible (ungrammatical) ones. We see that the natural sciences differ in this regard, as shown above.

Chomsky’s 3 models

To better understand Chomsky’s takes, let’s take a detour and examine his work on models of linguistic structure in his book, “Syntactic Structures” (Chomsky, 1957).

At the core of this book, Chomsky states that: “A grammar of the language L is essentially a theory of L. Any scientific theory is based on a finite number of observations, and it seeks to relate the observed phenomena and to predict new phenomena by constructing general laws in terms of hypothetical constructs such as (in physics for example) ‘mass’ and ‘electron’.” (Chomsky, 1957, p. 49).

What this means is that a proper “grammar” (a theory/formula) of a language will allow one to derive from it ALL grammatically correct sentences while simultaneously not producing grammatically incorrect ones.

  • To use an analogy, in mathematics, we have a simple formula that allows us to map the sum of all integers from 1 to n:
Summation of all integers from 1 to n (Image by author)
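(For reference, the formula in the image is the familiar identity 1 + 2 + ⋯ + n = n(n+1)/2.)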

The formula above applies for ALL values of n, and similarly, a proper “grammar” would need to apply to all grammatical sentences for a language.

Chomsky covers the following 3 models for linguistic structure:

Finite State

A finite state grammar uses finite states to represent sentences. In other words, it starts with a word (a certain state) and from there, there are a finite number of next states.

Figure 1: A finite state grammar that produces 2 sentences (Image by author, inspired by Chomsky (1957))

Figure 1 shows a simple finite state grammar, inspired by an example from the book (Chomsky, 1957, p. 19). As we can see, all sentences start with the word “The”, and there are only 2 possible outcomes (“The child runs.”, “The children run.”). This grammar looks very similar to how current LLMs operate (and it’s from the 1950s!), but it doesn’t take into account probability. We have more control in this model because we can explicitly exclude states that would lead to ungrammatical sentences (whereas LLMs can sometimes still produce them).
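As a rough sketch (my own toy encoding, not Chomsky’s notation), the grammar in Figure 1 can be written as a transition table and exhaustively enumerated:

```python
# Toy encoding of the Figure 1 finite state grammar.
# Each state maps to the words that may follow; None marks a full stop.
transitions = {
    "START": ["The"],
    "The": ["child", "children"],
    "child": ["runs"],
    "children": ["run"],
    "runs": [None],
    "run": [None],
}

def sentences(state="START", words=()):
    """Enumerate every sentence this finite state grammar produces."""
    for nxt in transitions[state]:
        if nxt is None:
            yield " ".join(words) + "."
        else:
            yield from sentences(nxt, words + (nxt,))

print(list(sentences()))  # ['The child runs.', 'The children run.']
```

Because every path through the table is grammatical by construction, this grammar can never produce an ungrammatical sentence, unlike a probabilistic LLM.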

We would obviously want something more complex and expansive than the grammar in Figure 1, but even then, that STILL wouldn’t produce all the grammatical sentences in English, because English (like many languages) is NOT a finite state language! This is because the morphemic structure of sentences CAN’T be mapped in a finite state format. We have suffixes, prefixes, different forms/tenses of words, etc. that will come into play depending on the sentence and its structure. Because of this, a finite state grammar is not a proper grammar for language.

Phrase Structure

A phrase structure grammar approaches sentence mappings in a different manner. It takes a sentence and breaks it up into phrases — more specifically, a Noun Phrase (NP) and a Verb Phrase (VP) — and then further breaks down each phrase into its parts (parts of speech like articles, nouns, verbs, etc.).

Figure 2: An example of phrase structure grammar (Image by author, inspired by Chomsky (1957))

Figure 2 shows an example of a phrase structure grammar, inspired by an example from the book (Chomsky, 1957, p. 27), that produces the sentence “The boy ran home.” This is more powerful than the finite state grammar because it is not restricted to generating the words in a sentence in a strictly linear manner (left to right, one word at a time).
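As a hedged sketch (the rule set below is my own simplification, loosely in the spirit of the book’s rewrite rules), a phrase structure grammar can be implemented as recursive rewriting:

```python
# Simplified rewrite rules in the spirit of Chomsky (1957);
# this exact rule set is a toy example, not the book's.
rules = {
    "Sentence": ["NP", "VP"],
    "NP": ["Article", "Noun"],
    "VP": ["Verb", "Adverb"],
    "Article": ["the"],
    "Noun": ["boy"],
    "Verb": ["ran"],
    "Adverb": ["home"],
}

def expand(symbol):
    """Recursively rewrite a symbol until only terminal words remain."""
    if symbol not in rules:
        return [symbol]  # terminal word
    return [word for part in rules[symbol] for word in expand(part)]

print(" ".join(expand("Sentence")))  # the boy ran home
```

Note that the sentence is built top-down from nested phrases rather than word by word, left to right.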

However, this linguistic structure is also imperfect because it cannot account for certain natural language constructions (and is thereby unable to produce all grammatical sentences). Chomsky presents the following example of 2 sentences that work here (Chomsky, 1957, p. 36):

  • “The scene — of the movie — was in Chicago”
  • “The scene — that I wrote — was in Chicago”

These simple sentences can be produced with phrase structure grammar, but if we try to combine them into a more complex sentence, using the same paradigm, we end up with the ungrammatical sentence:

  • “The scene — of the movie and that I wrote — was in Chicago”

It’s a bit of a tricky concept, but its flaws are best shown through examples, like above.

Transformational Structure

A transformational structure grammar takes the phrase structure paradigm but incorporates transformations like deletions, insertions, and movements, as well as morphophonemic rules (Duignan et al.). All of these transformations can be applied to the phrases presented in the phrase structure grammar, making the transformational structure grammar much more powerful.

Under this paradigm, the 2 example sentences from the phrase structure section can be combined into something like:

  • “The scene — that I wrote in the movie — was in Chicago”
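Very loosely (and with the caveat that real transformational rules operate on structured phrase markers under strict conditions, not on raw strings), one can picture such a transformation in code:

```python
# Toy illustration of a relative-clause transformation; real rules in
# Chomsky (1957) apply to phrase markers, not flat strings like this.
def relativize(subject, predicate, clause):
    """Splice a relative clause between a subject NP and its predicate."""
    return f"{subject} — {clause} — {predicate}"

print(relativize("The scene", "was in Chicago", "that I wrote in the movie"))
# The scene — that I wrote in the movie — was in Chicago
```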

In the interview, Chomsky says that this 3rd grammar was the one that began to make sense, because it begins to tell us something about language, rather than simply “describing” it like the other 2. I personally wasn’t sure of the downsides to this one, but it looks like it would take a lot of specific transformational rules to make it work, making it less elegant compared to common theories in the natural sciences. It definitely appears more versatile and accommodating with regard to mapping all possible grammatical sentences.

Do newer LLMs actually “understand” language?

The click-baity headlines that have dominated the news cycle regarding LLMs’ capabilities may have given you the impression that they “understand” human language. It’s gotten to the point where some people have claimed that models like LaMDA have achieved sentience.

I think that there are plenty of examples that disprove this notion, as can be seen with GPT-3 failing to produce acceptable output for some reasoning tasks. Take the following task, for example:

  • Input: “You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to”
  • Output: “remove the door. You have a table saw, so you cut the door in half and remove the top half.”

GPT-3 fails not only to solve the task of getting the table through the doorway, but also leaves you with a broken door!

There’s also a fun example where AI researcher Janelle Shane shows GPT-3’s response to being asked what it’s like being a squirrel.

At their core, LLMs are just really good at mimicking human language, in the right context (they know how to respond in the right way). Additionally, they can still violate the principles of language by producing grammatically incorrect sentences (they also can produce nonsensical sentences).

Why more compute won’t help with same paradigm

Even as we get more accurate LLMs, that simply means that we’ll have fewer edge cases where they produce “unacceptable” outputs. Let’s say that these models satisfy 99.999999% of use cases, to the point that we’re struggling to find any prompt to confuse them — they’re still operating in a probabilistic manner.

In the interview, Chomsky was adamant that even as we add more compute to train bigger, more complex models, ultimately, we still won’t have something that will produce a proper grammar/theory for language, as we’ve described previously. Again, the nature of probabilistic models prevents that. They will be able to produce more acceptable outputs that mimic the right response in the right scenario, but that doesn’t mean they can map out every possible grammatically correct sentence.

It’s important to distinguish between a sentence being grammatically correct and being sensical. It is possible for a sentence to be nonsensical but grammatically correct. Take Chomsky’s classic example:

  • “Colorless green ideas sleep furiously.” (Chomsky, 1957, p. 15)

This sentence adheres to the principles of language but makes absolutely no sense. And because it makes no sense, its probability of occurring is very low. Given the nature of LLMs, they won’t be producing sentences like these, because they’re unlikely to occur. There are infinitely many of these grammatically correct but simultaneously nonsensical sentences that can’t be mapped by LLMs, regardless of how powerful they become.
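In chain-rule terms (a sketch with invented numbers), a language model scores a sentence as the product of conditional next-word probabilities, so a grammatical but nonsensical sequence receives a vanishingly small score and is essentially never sampled:

```python
import math

# Invented per-token conditional probabilities, for illustration only.
ordinary = [0.12, 0.30, 0.25, 0.40]   # e.g. "the child runs home"
nonsense = [0.12, 1e-6, 1e-7, 1e-6]   # e.g. "colorless green ideas sleep"

def log_prob(token_probs):
    """log P(sentence) = sum of log conditional token probabilities."""
    return sum(math.log(p) for p in token_probs)

print(log_prob(ordinary))  # about -5.6
print(log_prob(nonsense))  # about -45.9
```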

Question of Determinism and Free Will

Some have argued that humans speak/write the most probable responses for a given context, and imply that if LLMs get really good at selecting probable responses (satisfying 99.999999% of use cases), then they’ll “understand” language. This begins to branch into the argument of whether or not humans operate in a deterministic manner and whether or not we have free will.

I liked Chomsky’s take that those who argue that humans operate in a deterministic manner prove there is free will by doing so; if they had no choice in the matter, their argument wouldn’t be a reasoned position at all. It stands to reason that humans are not simply using beam search to think, speak, and write.

Where to from here?

Some of you might be thinking, “this Chomsky guy is just a Debbie Downer”. Well, we can also look towards other prominent figures in the deep learning space, like Yann LeCun. LeCun has gone on record saying that in order to make the next leap in designing systems that “understand”, we need to “abandon probabilistic models”.

He has advocated for “energy-based models”, which are derived from statistical physics and can get past some of the downsides of probabilistic models. He’s also presented a new “World Model” architecture that could potentially help systems understand the world as we see and interact with it.

I don’t know how well these new ideas will work, and LeCun himself has stated that this may not be the right way to do it, but his underlying point stands: we need to move past these intrinsically probabilistic methods.

What does this mean for us?

Probably nothing (a bit anticlimactic, I know). As data scientists/machine learning engineers/whatever title you have in this field, we’re only concerned with providing business value. If LLMs can serve as an engineering feat that meets our needs, it doesn’t matter if they don’t actually “understand”. In fact, I would argue it makes things easier for us, since we don’t have to worry about these systems actually understanding like humans, becoming “sentient”, and starting a robot uprising. We can rest easy knowing that these LLMs, regardless of how good they get, will essentially be good mimicking tools.

This was quite a different viewpoint from what we have traditionally seen during the rise of LLMs these past few years. I found Chomsky’s comments very sobering amidst the extensive hype that LLMs have received, and they served as a great thought exercise in evaluating where we are!

As someone who enjoys working with LLMs, I think that this critique is necessary in order to develop the next breakthrough in NLP and not get caught in a rut down the line.

References

[1] N. Chomsky, Syntactic Structures (1957), Martino

[2] K. Duggar, T. Scarfe and W. Saba, #78 — Prof. NOAM CHOMSKY (Special Edition) [Video] (2022), Machine Learning Street Talk

[3] B. Duignan, E. Hamp, J. Lyons and P. Ivić, Chomsky’s Grammar (2012), Encyclopedia Britannica

[4] T. Ray, Meta’s AI luminary LeCun explores deep learning’s energy frontier (2022), ZDNet



