What Is ChatGPT Doing ... and Why Does It Work?


Summary of "What Is ChatGPT Doing ... and Why Does It Work?" by Stephen Wolfram

Stephen Wolfram's book "What Is ChatGPT Doing ... and Why Does It Work?" delves into the inner workings of ChatGPT. Wolfram explores the architecture, training, and implications of ChatGPT, providing clear examples and explanations along the way.

The fact that this is freely available on the Wolfram website is amazing. At 105 pages, it is a step up from a typical blog post, without the overhead that makes some AI books quite difficult to read.

In a previous post, we covered the Wolfram website, which has a whole bunch of articles on AI, physics, and mathematics.

Key topics that the book covers

Introduction to ChatGPT: Explaining the purpose of ChatGPT and its capabilities.
Architecture and Training: The book delves into the architecture of ChatGPT and how it is trained to generate human-like text.
Implications of ChatGPT: Wolfram discusses the implications of ChatGPT for AI, language understanding, and human-computer interaction.
Future Directions: The book concludes with a discussion of future directions for ChatGPT and AI in general.

We don't know why this works!

It has to be emphasized again that (at least so far as we know) there’s no “ultimate theoretical reason” why anything like this should work.

Stephen Wolfram

That quote led me down a bit of a rabbit hole, but it's true. We don't know why ChatGPT works. It's a bit like the universe. We don't know why it works, but we know it does.

I found this interesting quote by Sam Bowman:

So there’s two connected big concerning unknowns. The first is that we don’t really know what they’re doing in any deep sense. If we open up ChatGPT or a system like it and look inside, you just see millions of numbers flipping around a few hundred times a second, and we just have no idea what any of it means. With only the tiniest of exceptions, we can’t look inside these things and say, “Oh, here’s what concepts it’s using, here’s what kind of rules of reasoning it’s using. Here’s what it does and doesn’t know in any deep way.” We just don’t understand what’s going on here. We built it, we trained it, but we don’t know what it’s doing.

*Reference: Sam Bowman Interview

Anyway, that isn't the main takeaway of the book! We are here to understand what ChatGPT is doing and why it works.

One word at a time

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”

Stephen Wolfram

I enjoyed this initial description of how ChatGPT works:

And the remarkable thing is that when ChatGPT does something like write an essay what it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word.

Stephen Wolfram

Simple, right? He used an example showing that if you just sample how often the different letters occur in English text, and then just pick letters at random according to these frequencies, you get something that looks a bit like English text.
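
To make the letter example concrete, here is a minimal sketch in Python. The sample sentence, the function names, and the choice to treat the space as just another symbol are my own assumptions for illustration, not code from the book.

```python
import random
from collections import Counter

def letter_frequencies(text):
    """Return the relative frequency of each letter (and the space) in `text`."""
    symbols = [ch for ch in text.lower() if ch.isalpha() or ch == " "]
    counts = Counter(symbols)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def sample_letters(freqs, length, seed=0):
    """Pick `length` symbols independently, weighted by their frequencies."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))

# A tiny stand-in for "a large sample of English text" (hypothetical).
sample_text = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
freqs = letter_frequencies(sample_text)
print(sample_letters(freqs, 40))
```

The output is gibberish, but with roughly the right mix of common and rare letters, which is the starting point of the argument.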

The temperature of words

Your intuition might be to always take the highest-probability next word, but it turns out that if you do that, you get a very flat and boring essay that starts to repeat itself. This is where the concept of temperature comes in: it controls how often lower-ranked words get picked, which keeps the text from becoming repetitive.

Wolfram notes that a temperature of 0.8 seems to work best for essay generation. You will also see this parameter exposed as a setting when you use AI models.
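
One common way to implement temperature is to raise each probability to the power 1/T and renormalise. The sketch below assumes that formulation; the word list and its probabilities are made up for illustration and are not the values from the book.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a {word: probability} dict. Low temperature sharpens the
    distribution, high temperature flattens it, and 1.0 leaves it unchanged."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {w: math.exp(math.log(p) / temperature)
              for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {w: v / total for w, v in scaled.items()}

def sample_word(probs, temperature, rng):
    """Draw one word from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    words = list(adjusted)
    return rng.choices(words, weights=[adjusted[w] for w in words], k=1)[0]

# Hypothetical next-word probabilities for some prompt.
next_word_probs = {"learn": 0.45, "predict": 0.35, "make": 0.15, "understand": 0.05}

for t in (0.2, 0.8, 1.5):
    adjusted = apply_temperature(next_word_probs, t)
    print(t, {w: round(p, 3) for w, p in adjusted.items()})

print(sample_word(next_word_probs, 0.8, random.Random(0)))
```

At a low temperature almost all of the probability piles onto the top word, which is why the output turns repetitive. Higher values give the lower-ranked words a real chance.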

Here is an example with three different temperature settings for an email.

Example 1 - Low Temperature (0.2 - 0.5)

Professional follow-up on the status of an invoice

Dear [Client Name],
I hope this email finds you well.
I am writing to follow up on the status of Invoice #12345, which was issued on [Date]. As of today, we have not yet received payment for this invoice. We kindly request that you review this matter and let us know if there are any issues or if additional information is needed to process the payment.
If the payment has already been made, please disregard this message. Otherwise, we would appreciate your prompt attention to this matter.
Thank you for your cooperation.
Best regards,

Example 2 - Moderate Temperature (0.6 - 0.7)

Friendly yet professional follow-up on the status of an invoice

Hi [Client Name],
I hope you're doing well!
I'm reaching out to check on the status of Invoice #12345, which we sent over on [Date]. We haven't seen the payment come through yet and just wanted to make sure everything is in order on your end. If there's any issue or if you need additional information from us, please let me know.
If the payment is already in process, thank you, and please ignore this email. Otherwise, we would appreciate your assistance in settling this invoice at your earliest convenience.
Thanks so much for your attention to this. Looking forward to hearing from you soon.
Best,

Example 3 - High Temperature (0.8 - 1.0)

Casual follow-up on the status of an invoice

Hey [Client Name],
Hope all's well on your end!
Just wanted to touch base regarding Invoice #12345 that we sent on [Date]. Haven't seen the payment come through yet and wanted to make sure everything is good to go on your side. If there's any issue or if you need more info, just let me know.
If the payment is already on its way, awesome, and you can ignore this email. Otherwise, we'd appreciate it if you could take a look when you get a chance.
Thanks a bunch and talk soon!

ChatGPT is awesome at estimating probabilities of words

There are about 40,000 reasonably commonly used words in English. And by looking at a large corpus of English text, we can get an estimate of how common each word is. Using this we can start generating “sentences”, in which each word is independently picked at random, with the same probability that it appears in the corpus.

Stephen Wolfram

So the first step repeats the same process as the letter example, but with words: estimate the probability of each word from a large corpus of text, then generate sentences by picking words at random according to those probabilities. Picking each word independently does not produce sensible text on its own, though. To get closer to human-like text, the probabilities have to take the preceding words into account.

The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.

Stephen Wolfram
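
Before getting to a full language model, the simplest way to take context into account is to count which word follows which. The sketch below is a plain bigram counter, not what ChatGPT does, and the tiny corpus and function names are assumptions for illustration. It also shows the limitation the quote points at: a pure counting approach can only ever produce word pairs it has already seen.

```python
import random
from collections import Counter, defaultdict

def bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, length, seed=0):
    """Generate up to `length` words, each sampled from what followed the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:  # the last word never had a successor in the corpus
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out

# Hypothetical toy corpus.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
table = bigram_counts(corpus)
print(" ".join(generate(table, "the", 8)))
```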

Neural Networks

How is this done? Through a neural network, which is trained on a large corpus of text to estimate the probabilities of sequences of words. The book goes into a lot of detail on how this is achieved.

But what makes neural nets so useful (presumably also in brains) is that not only can they in principle do all sorts of tasks, but they can be incrementally “trained from examples” to do those tasks. When we make a neural net to distinguish cats from dogs we don’t effectively have to write a program that (say) explicitly finds whiskers; instead we just show lots of examples of what’s a cat and what’s a dog, and then have the network “machine learn” from these how to distinguish them.

Stephen Wolfram
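
Here is a minimal sketch of what "trained from examples" means, using a single artificial neuron rather than a real network. The two-number inputs, the labels, and the learning rate are all made up for illustration. Nothing in the code says how to tell the two classes apart; the weights are just nudged until the examples come out right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=2000, lr=0.5):
    """Fit two weights and a bias by gradient descent on the log loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = pred - label  # gradient of the log loss w.r.t. the weighted sum
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b

def predict(w, b, point):
    """Return 1 if the neuron's output is at least 0.5, else 0."""
    return 1 if sigmoid(w[0] * point[0] + w[1] * point[1] + b) >= 0.5 else 0

# Hypothetical labelled examples: (two numeric features), class 0 or 1.
examples = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
    ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.8), 1),
]
w, b = train(examples)
print(predict(w, b, (0.15, 0.15)), predict(w, b, (0.85, 0.85)))
```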

Figure: a visualisation of ChatGPT words in their embedding space.

But it turns out that even with many more weights (ChatGPT uses 175 billion) it’s still possible to do the minimization, at least to some level of approximation. And in fact the big breakthrough in “deep learning” that occurred around 2011 was associated with the discovery that in some sense it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.

Stephen Wolfram

The learning and training of ChatGPT

How about something like ChatGPT? Well, it has the nice feature that it can do “unsupervised learning”, making it much easier to get it examples to train from. Recall that the basic task for ChatGPT is to figure out how to continue a piece of text that it’s been given. So to get it “training examples” all one has to do is get a piece of text, and mask out the end of it, and then use this as the “input to train from”—with the “output” being the complete, unmasked piece of text. We’ll discuss this more later, but the main point is that—unlike, say, for learning what’s in images—there’s no “explicit tagging” needed; ChatGPT can in effect just learn directly from whatever examples of text it’s given.

Stephen Wolfram
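
A rough sketch of how training examples fall out of raw text with no labelling. I'm assuming a simplified setup where the text is split on spaces and the target is just the next word. Real systems work on tokens, and the function name is hypothetical.

```python
def make_training_pairs(text, min_context=1):
    """Turn a piece of text into (context, next_word) pairs by masking the end
    at every possible cut point."""
    words = text.split()
    pairs = []
    for cut in range(min_context, len(words)):
        context = words[:cut]  # what the model is shown
        target = words[cut]    # the masked-out word it should predict
        pairs.append((context, target))
    return pairs

for context, target in make_training_pairs("the cat sat on the mat"):
    print(" ".join(context), "->", target)
```

Every sentence yields several examples for free, which is why plain text is such a convenient training source.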

The Fundamental Tension of Learnability and Computational Irreducibility

I liked this concept of a tension between what can be learned and what is computationally irreducible.

And in the end there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.

Stephen Wolfram

An example of this

Weather forecasting is one case. Meteorologists use patterns and historical data to predict the weather, but the atmosphere's behavior can be seen as computationally irreducible. Despite advances in technology and data analysis, long-term weather predictions remain challenging because of the atmosphere's chaotic nature.

The fundamental tension between learnability and computational irreducibility lies in the contrast between the ease of extracting patterns from data and the inherent complexity of certain processes. Learning, at its core, seeks to simplify and generalise data by identifying and leveraging regularities, making it possible to predict and understand phenomena with reduced effort. However, computational irreducibility asserts that some systems or processes are so complex that they cannot be simplified or predicted without fully simulating each step.

Other examples where this tension plays out

Financial Markets: Traders and analysts use historical data and patterns to predict stock prices and market trends. However, the market is influenced by countless variables and human behaviors, making it computationally irreducible. Sudden events or shifts in sentiment can render predictions inaccurate, demonstrating the limits of learnability in financial forecasting.

Biological Systems: Scientists study genetic sequences and biological pathways to understand diseases and develop treatments. While certain regularities and patterns can be identified, the sheer complexity of biological interactions and the influence of countless factors make precise predictions challenging. This complexity showcases the tension between the desire to learn and simplify versus the irreducibility of biological processes.

Urban Planning: Urban planners use models and historical data to design efficient cities and predict future growth. However, the myriad of human behaviors, economic factors, and environmental changes make urban development highly unpredictable. This unpredictability highlights the computational irreducibility of urban systems, despite efforts to learn and model them effectively.

Or put another way, there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.

Stephen Wolfram

Have computers become vastly more powerful?

Well, it's interesting, because he argues that they haven't. Tasks like writing essays, which we assumed were too hard for computers, turn out to be computationally easier than we thought. That is why a model like ChatGPT can do them, not because computers can now do something fundamentally beyond what they could do before.

And now that we see them done by the likes of ChatGPT we tend to suddenly think that computers must have become vastly more powerful—in particular surpassing things they were already basically able to do (like progressively computing the behavior of computational systems like cellular automata). But this isn’t the right conclusion to draw. Computationally irreducible processes are still computationally irreducible, and are still fundamentally hard for computers—even if computers can readily compute their individual steps.

Stephen Wolfram
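
The cellular automata mentioned in the quote are easy to try out. Below is a small sketch of an elementary cellular automaton using rule 30, a standard example in Wolfram's work. The wrap-around edges and the function names are my own simplifications. Each step is trivial to compute, but to see what row 1,000 looks like you have to compute all the rows before it.

```python
def step(cells, rule=30):
    """Apply one step of an elementary cellular automaton (edges wrap around)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood as a 3-bit number
        out.append((rule >> index) & 1)              # look up that bit of the rule
    return out

def run(width, steps, rule=30):
    """Start from a single black cell in the middle and return every row."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history

for row in run(31, 15):
    print("".join("#" if c else "." for c in row))
```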

Quotes on tokens and words

How do we set this problem up for a neural net? Ultimately we have to formulate everything in terms of numbers. And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, “the” might be 914, and “cat” (with a space before it) might be 3542. (And these are the actual numbers used by GPT-2.) So for the “the ___ cat” problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible “fill-in” words.

Stephen Wolfram

Strictly, ChatGPT does not deal with words, but rather with “tokens”—convenient linguistic units that might be whole words, or might just be pieces like “pre” or “ing” or “ized”. Working with tokens makes it easier for ChatGPT to handle rare, compound and non-English words, and, sometimes, for better or worse, to invent new words.

Stephen Wolfram
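
To get a feel for tokens, here is a toy sketch that splits text into pieces from a small vocabulary and maps each piece to a number. It uses a simple greedy longest-match rule rather than the byte-pair encoding that GPT models actually use, and the vocabulary and its ID numbers are invented for illustration. They are not the real GPT-2 numbers.

```python
def tokenize(text, vocab):
    """Greedily split `text` into the longest pieces that appear in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# Hypothetical vocabulary: token -> ID.
toy_vocab = {"pre": 0, "train": 1, "ing": 2, "token": 3,
             "ized": 4, " the": 5, " cat": 6, "the": 7}

for word in ("pretraining", "tokenized", "the cat"):
    pieces = tokenize(word, toy_vocab)
    print(pieces, [toy_vocab[p] for p in pieces])
```

Note how " cat" carries its leading space, in the same spirit as the "with a space before it" remark in the quote above.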

How ChatGPT works

OK, so we’re finally ready to discuss what’s inside ChatGPT. And, yes, ultimately, it’s a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights. In many ways this is a neural net very much like the other ones we’ve discussed. But it’s a neural net that’s particularly set up for dealing with language.

Stephen Wolfram

The book explains how ChatGPT operates in three basic stages:

Token Sequence to Embedding: ChatGPT takes the sequence of tokens corresponding to the text and finds an embedding that represents these tokens.
Neural Network Operation: The embedding is processed through a neural network, with values rippling through successive layers to produce a new embedding.
Token Probability Generation: The last part of the new embedding generates an array of about 50,000 values that represent probabilities for different possible next tokens.
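
The three stages above can be sketched in a few lines. This is a toy with random, untrained weights, a five-word vocabulary, and a single layer, all of which are assumptions for illustration. It also leaves out attention, so positions never exchange information, which the real network very much does. The point is only the shape of the pipeline: tokens in, probabilities out.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]  # stands in for ~50,000 tokens
DIM = 4                                      # size of each embedding vector
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

embedding = random_matrix(len(VOCAB), DIM)  # one vector per token
layer = random_matrix(DIM, DIM)             # a single "layer" of weights
unembed = random_matrix(len(VOCAB), DIM)    # maps a vector to one score per token

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(token_ids):
    # Stage 1: token sequence -> embedding vectors.
    vectors = [embedding[t] for t in token_ids]
    # Stage 2: values ripple through the layer (applied per position here).
    hidden = [[math.tanh(x) for x in matvec(layer, v)] for v in vectors]
    # Stage 3: the last position is turned into probabilities over the vocabulary.
    return softmax(matvec(unembed, hidden[-1]))

probs = next_token_probs([0, 1, 2])  # "the cat sat"
for word, p in zip(VOCAB, probs):
    print(f"{word}: {p:.3f}")
```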

The computational cost of generating text

I found it interesting just how much computation is required to generate text!

"But for each token that’s produced, there still have to be 175 billion calculations done (and in the end a bit more)—so that, yes, it’s not surprising that it can take a while to generate a long piece of text with ChatGPT."

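As a back-of-the-envelope check on that quote, here is the arithmetic for a longer piece of text. The 175 billion figure is the one quoted from the book. The 500-token length is just an assumed example, and the count ignores the "bit more" the quote mentions.

```python
WEIGHTS = 175_000_000_000  # calculations per token, as quoted from the book

def calculations_for(num_tokens, weights=WEIGHTS):
    """Rough number of calculations needed to generate `num_tokens` tokens."""
    return num_tokens * weights

print(f"{calculations_for(500):.2e}")  # prints 8.75e+13
```
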
The simplicity of language

So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems.

Stephen Wolfram

The future of ChatGPT & laws of thought

I really enjoyed the conclusion of the book, where he talks about the future of ChatGPT and the implications of its success. Essentially, there might be other laws of thought out there that we can discover.

The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.

Stephen Wolfram

What laws of thought might we discover?

Well, for that, I did the main thing that we should do in an article like this... I asked ChatGPT!

Prompt: What are three example of potential laws of thought that could be out there?

Law of Conceptual Association: This potential law might explain how different concepts are linked together in human cognition. For instance, it could outline the principles by which the brain connects ideas like "apple" and "fruit" or "vehicle" and "car." Understanding this law could enhance our ability to design AI systems that mimic human thought processes more closely, allowing them to draw connections between seemingly disparate ideas with greater accuracy and relevance.

Law of Cognitive Economy: This law could describe how the human brain prioritizes information processing to optimize mental efficiency. It might detail how we allocate cognitive resources to tasks based on their perceived importance or relevance, and how we filter out unnecessary information. Discovering and formalizing this law could lead to AI systems that manage resources and prioritize tasks in a manner similar to human thinking, improving their ability to handle complex, multitasking environments.

Law of Semantic Generalization: This law could define how humans generalize from specific experiences to broader concepts. For example, it might explain how learning about a specific dog leads to understanding the general concept of "dogs." By making this process explicit, we could develop AI that learns and generalizes from data more effectively, allowing it to apply knowledge gained from specific instances to a wider range of situations, thereby improving its adaptability and problem-solving capabilities.

Thanks for reading! If you found this helpful, please share this article with 1 friend!