What are Large Language Models?

A look at LLMs and their popularity

Advances in natural language processing (NLP) have been in the news lately, with special attention paid to large language models (LLMs) like OpenAI’s GPT-3. There have been some bold claims in the media — could models like this soon replace search engines or even master language?

But what exactly are these large language models, and why are they suddenly so popular?

What’s a language model?

As humans, we’re pretty good at reading a passage and knowing where the author might be heading. Of course, we can’t predict exactly what the author will write next (there are far too many options for that), but we notice abrupt changes or out-of-place words, and can take a stab at filling in the endings of sentences. We intuitively know that a message saying “I’ll give you a call, how about” is likely to end with “tomorrow” or “Thursday”, and not “yesterday” or “green”.

This task of predicting what might come next is exactly what a language model (LM) does. From some starting text, the language model predicts words that are likely to follow. Do this repeatedly, and the language model can generate longer fragments of text. For all the recent interest, language models have been around for a long time. They’re built (or trained) by analysing a bunch of text documents to figure out which words, and sequences of words, are more likely to occur than others.
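
To make that concrete, here’s a minimal sketch of the generate-by-repeated-prediction loop in Python. The hand-written probability table is an assumption for illustration; a real language model learns these probabilities from data and covers vastly more contexts.

```python
import random

# Toy next-word probability table. This is a made-up stand-in for a
# trained model, which would cover vastly more contexts.
next_word_probs = {
    ("how", "about"): {"tomorrow": 0.5, "Thursday": 0.4, "tonight": 0.1},
    ("about", "tomorrow"): {"?": 0.9, "evening": 0.1},
}

def generate(prompt, steps=2):
    words = prompt.split()
    for _ in range(steps):
        context = tuple(words[-2:])        # use the last two words as context
        probs = next_word_probs.get(context)
        if probs is None:                  # unseen context: stop generating
            break
        choices, weights = zip(*probs.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("I'll give you a call, how about"))
# e.g. "I'll give you a call, how about tomorrow ?"
```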

One of the oldest methods of building LMs is the n-gram model. These models are quick and easy to build, so people have trained them on different kinds of text. Examples include text generated from Shakespeare: “King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;” and from Alice in Wonderland: “Alice was going to begin with,’ the mock turtle said with some surprise that the was”. Train one of these models on something else, like articles from the Financial Times, and the model will predict an entirely different style of text.
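
As a rough illustration of how such a model is built, here’s a sketch that counts bigrams (adjacent word pairs) in a tiny made-up corpus; the resulting probabilities are exactly the kind of thing that could fill the table in the previous sketch. Shakespeare or Alice in Wonderland would work the same way, just with far more text.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, assumed purely for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug ."
counts = defaultdict(Counter)

words = corpus.split()
for prev, nxt in zip(words, words[1:]):   # adjacent pairs = bigrams
    counts[prev][nxt] += 1

# Turn the counts for one context into probabilities.
total = sum(counts["the"].values())
print({word: c / total for word, c in counts["the"].items()})
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```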

N-gram models aren’t good at predicting text that’s coherent beyond a few words. There’s no intent or agency behind what they’re saying; they create sequences of words that might seem sensible at first glance, but not when you read them closely. They’re simply regurgitating patterns in the training data, not saying anything new or interesting. These models have mostly been used in applications like autocorrect, machine translation, and speech recognition, where they supply knowledge about likely sequences of words as one part of a bigger task.

The emergence of large language models

There’s always been a drive to use more and more data for training AI models, and LMs are no exception; in the past decade, this has only accelerated. Training a model on more text gives it the potential to learn more about the patterns in language. More data is one part of the ‘large’ in ‘large language models’.

The second part of ‘large’ comes from the size of the models themselves. The past 15 years have seen neural networks become the popular choice of model, and they’ve grown larger and larger in terms of their number of parameters.

GPT-3, for example, has 175 billion parameters and was trained on around 500 billion tokens. Tokens are words, or pieces of words. Most of that text data was scraped from the web, though some comes from books. The combination of lots of data and large models makes LLMs expensive to train, so only a handful of organisations have been able to do so. In return, these models capture much longer sequences of words, and the text they generate is more fluent than that generated by earlier LMs. For example, given an initial text prompt to write an article about creativity, GPT-3 generated the following as a continuation:

The word creativity is used and abused so much that it is beginning to lose its meaning. Every time I hear the word creativity I cannot but think of a quote from the movie, “The night they drove old dixie down”. “Can you tell me where I can find a man who is creative?” “You don’t have to find him, he’s right here.” “Oh, thank god. I thought I was going to have to go all over town.”

This is far more readable and fluent than the earlier examples, but it’s worth noting that “The Night They Drove Old Dixie Down” is a song, not a movie, and it has no lyrics or lines about a man who is creative. These “facts” are hallucinated by the model simply because the sequences of words are probable. As readers, we naturally try to infer the author’s meaning in this passage, but the computer has no agency; it really wasn’t trying to say anything when it generated the passage.
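
As an aside on the ‘tokens’ mentioned above: a hands-on way to see what “words, or pieces of words” means is to run some text through the open GPT-2 tokenizer from the Hugging Face transformers library. This is a hedged illustration; GPT-3 uses a similar byte-pair-encoding scheme, though its exact token boundaries may differ.

```python
# Tokenising text with the GPT-2 tokenizer (Hugging Face transformers).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("The banquet was served"))
# Common words stay whole; rarer words split into pieces. The 'Ġ'
# marks a token that begins with a space, e.g. something like:
# ['The', 'Ġban', 'quet', 'Ġwas', 'Ġserved']
```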

How do language models relate to other NLP technology?

NLP is a broad field; language modelling is just one NLP task, and there are many other things you might want to do with text. Some examples include translating text from one language to another, identifying entities like names and locations in your text, or classifying text by topic.

To build models for these other NLP tasks, you can’t just analyse a bunch of documents as you can for language modelling. Instead, you need labelled data: text that is tagged with the entities or topics you’re interested in, or, in the case of machine translation, text that means the same thing in two languages. Labelling data is time-consuming and expensive, and a barrier to building good NLP models.

Why all the hype about LLMs?

The bold claims about large language models are inspired by some of their interesting emergent behaviour.

The first is that these models can be used as a type of interactive chatbot. By learning likely continuations of a user’s text input, they can generate appropriate responses in a conversation. The current generation of chatbots are hand-crafted systems with carefully designed conversation flows, and they take a lot of effort to create. LLMs offer the possibility of chatbots that are simpler to build and maintain.

The second is that because these models have been trained on so much data, they can generate a huge variety of texts, including some that are unexpected. Give GPT-3 a text input, or prompt, asking it to translate text into another language, and it has seen enough multilingual text to have a good go at the translation. That’s without ever being explicitly trained to do translation! The ability to recast NLP tasks as text generation and use LLMs to perform them is powerful.
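
As a sketch of that recasting, here’s a prompt-based ‘translation’ using the transformers text-generation pipeline. The small open GPT-2 model is assumed here only so the example is runnable; it’s a poor translator, whereas a model of GPT-3’s scale handles prompts like this far more reliably.

```python
# Recasting translation as text generation via a prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Translate English to French.\nEnglish: Where is the station?\nFrench:"
result = generator(prompt, max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```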

A third ability is that LLMs can be fine-tuned for other NLP tasks. An LLM has learned a lot about language during its training, and that knowledge is useful for all NLP tasks. It’s possible to make some small changes to the structure of the LLM so that it classifies topics rather than predicts next words, while still retaining most of what it’s learned about patterns in language. It’s then easy to fine-tune on a small amount of topic-labelled data and build a topic classifier, as in the sketch below. Building NLP models by first training an LLM on a large dataset (or, more realistically, using one that a large company has built and released) and then fine-tuning it on a specific task is a relatively new approach. It needs far less labelled data than building the same model from scratch, and is cheaper and faster. For this reason, LLMs have been dubbed ‘Foundation Models’.
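
A minimal sketch of that fine-tuning recipe, assuming the Hugging Face transformers library, a small pretrained model (DistilBERT, standing in for a much larger LLM), and a toy two-example labelled dataset invented for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"  # small pretrained LM, assumed for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Swap the language-modelling head for a fresh 2-way classification
# head; the pretrained body (and what it learned about language) is kept.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# A toy labelled dataset (0 = sport, 1 = finance), invented for illustration.
texts = ["The striker scored twice", "Shares fell sharply today"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One illustrative gradient step; real fine-tuning loops over many
# batches for a few epochs, still far less data than training from scratch.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```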

But what are the downsides?

LLMs have some interesting behaviours, and many state-of-the-art NLP models are now based on LLMs. But, as they say, there is no free lunch! There are some downsides to these models that need to be taken into account.

One of the biggest issues is the data these models are trained on. As with the Shakespeare and Alice in Wonderland examples, LLMs generate text in a similar style to their training data; that’s obvious in those two examples because of the distinct styles. Even when LLMs are trained on a wide variety of internet text, the output remains heavily dependent on the training data, even if that dependence is less immediately obvious in the text they generate.

This is especially problematic when the training data contains controversial or offensive opinions and views, and there are many examples of LLMs generating offensive text. It’s not feasible to construct a neutral training set (which raises the question: ‘neutral’ according to whose values?). Most text contains its author’s views to some extent, or some perspective (bias) from the time and place it was written. Those biases and values inevitably make their way through to the model output.

As in the creativity example above, LLMs can hallucinate facts and generate text which is just wrong. Because of their ease of use and the superficial fluency of the text they generate, they can be used to quickly create large amounts of text containing errors and misinformation.

The impact of these downsides is exacerbated by the fact that just a handful of LLMs are fine-tuned and deployed across many different applications, reproducing the same issues again and again.

In summary, large language models are large neural networks trained on lots of data. They have the ability to generate text that’s far more fluent and coherent than previous language models, and they can also be used as a strong foundation for other NLP tasks. Yet, as with all machine learning models, they have several downsides that are still being figured out.

