Artificial Intelligence & Large Language Models: Oxford Lecture — #35

Steve Hsu: Hi, this is Steve. This week's episode on Manifold is based on a lecture I gave to an audience of theoretical physicists at Oxford University. The topic is AI and large language models. Parts of this discussion are somewhat technical. You may want to fast-forward through them using the chapter headings, but I think there's quite a bit of content that's of general interest in the discussion.

If you're listening to the audio version only, I suggest downloading the slides, which will make the content easier to understand. There will be a link in the show notes to the slides. If you're watching the video version, you can just sit back and enjoy. Welcome to Manifold.

Oxford audience: So, I did already send out an introduction, but I'll just say that Steve is a Renaissance man. He was, obviously, a physicist by training, and continues to do physics in part of his day. But he has interests in lots of things. He's founded companies. I think he's in the middle of founding another company, and he keeps track of all sorts of things.

So I asked him. He has already recorded, you know, sort of a small description of LLMs, which you can, I think, find somewhere on the web. But I asked if he could give us a tutorial introduction to what's going on with LLMs, what they are, what they do, and then maybe also tell us a little bit about how he himself is planning, I believe, to improve their accuracy, or has already succeeded in improving their accuracy.

So with that, Steve, it's all yours, and you can share your screen as needed.

Steve Hsu: Great, works perfectly. So today I'm gonna talk about AI and large language models.

This is based both on my own study of this subject going back many years, and also on some recent work done with a startup that I just co-founded called SuperFocus.ai. And the talk will start at a level which I think is kind of where physicists and mathematicians start in wanting to think about this subject.

But then eventually it'll go more toward the applied engineering side of things. And so, hopefully, there'll be something for everyone in this talk.

So let me start by just discussing some interesting questions that I think physicists and mathematicians are attracted to in this area. And let me state a kind of heuristic observation, which I think there's evidence for. It's certainly not something that we know rigorously, but things are definitely heading in that direction.

And the point is that this deep neural network architecture gives us a general framework for learning approximations to high-dimensional functions from data. And furthermore, we have some results on how efficient, how effective, this architecture is for learning. This may seem like a kind of obvious thing, given the successes of deep neural networks and deep learning over the last decade or so.

But if this kind of theoretical heuristic turns out to be true, it has long-term consequences for, you know, the nature of intelligence in the universe, because it says that we have a framework that's good enough to learn basically every, you know, compressible function that actually generates all the phenomena that we see in the universe.

So it's kind of interesting. And, as you all know, these things go all the way back: they're biologically inspired by the functioning of actual neurons in brains, going back all the way to the McCulloch-Pitts neuron of 1943. And I think physicists and mathematicians are generally interested in questions like: what general statements can you make about the expressiveness of these structures?

In other words, what kinds of functions can they approximate? And also the efficiency: how much data they need to learn, how big the neural net has to be to capture a particular function, and how much compute is required to train the network. Some results are known here, and I think aspects of these questions are interesting to physicists.

So I'm just gonna mention them, even though this is not the main focus of my talk. I think it's just something that people might find interesting. And in the slides, which anybody who wants is welcome to have, and which I'll put online, there are links to references for some of the points that I'm about to mention.

Okay, so let me talk about three different perspectives from which people might think about deep learning and neural nets. A mathematician might ask for rigorous statements about models and training: how expressive are the models? How expensive is the training? A physicist might be interested in the system, a neural net, as a dynamical system.

So, what are the scaling laws governing how it behaves? What are the actual dynamics of gradient descent, the mechanism by which you approach optimal tuning of the connection strengths in the network? Are there phase transitions in the behavior of the network as you pass through, say, a certain data threshold or training threshold?
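To make the gradient descent dynamics concrete, here is a minimal, purely illustrative sketch on a toy quadratic loss. The matrix, learning rate, and step count are all invented for illustration; real networks are nonconvex and far higher-dimensional.

```python
import numpy as np

# Toy loss L(w) = 0.5 * w^T A w, with A positive definite, so the unique
# global minimum is at w = 0. Gradient descent: w <- w - lr * grad L(w).
A = np.diag([0.1, 0.5, 1.0, 2.0, 4.0])   # illustrative curvature matrix

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

rng = np.random.default_rng(0)
w = rng.normal(size=5)        # random initialization of the "connection strengths"
lr = 0.1                      # learning rate; must be < 2 / (largest eigenvalue of A)
for _ in range(3000):
    w = w - lr * grad(w)

print(loss(w))                # driven toward the global minimum value, 0
```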

On this last point, I'll just mention that in other machine learning work that I've done, applied to genomics, we showed that there's a very special kind of phase transition that occurs when you give certain kinds of learning algorithms sufficient genomic data: the algorithm then becomes very capable of learning exactly the correct architecture of the genetic trait, the complex trait that's controlled by DNA.

So those are the kinds of things I think mathematicians and physicists are interested in. From the engineering perspective, which is the one that really matters in the real world, the question is how to really design and train models that have practical utility. And toward the end of the talk, I'll shift over to that focus.

Oxford audience: Steve, can I ask one quick question? You may answer it shortly anyway. So, in preparing to listen to you, I was looking back at Michael Nielsen's, you know, little online book, and in there he shows examples of artificially modified images which look very close to the originals to the human eye, but which the network fails to, you know, sort of recognize correctly.

So when you say approximating a high-dimensional function, is there also a restriction on the domain of the function in some sense?

Steve Hsu: Well, the formal mathematical theorems often just ask for some kind of, you know, reasonably smooth, high-dimensional function. And they ask whether this architecture can sufficiently approximate that function.

Mm-hmm. I'm not sure exactly which examples from Mike's book you're talking about. There are these patterns that are made to fool AIs, where people realize the AI is weak at differentiating, like, a certain type of pattern on someone's t-shirt; it can't really notice that.

And so, in a kind of adversarial way, people can design patterns that the network is very poor at recognizing or differentiating. Right.

Oxford audience: I won't delay you too much, I can just send you the picture later. But, you know, he's got a picture where you've added a little bit of noise, except, you know, it's not really noise, it does something specific.

So the resulting picture is something you and I would look at and say it's the same thing, and his claim was that the network actually then fails to recognize it. So it somehow suggests, you know, some lack of continuity with respect to the appropriate, you know, sort of relation to what's going on inside our own heads.

But we can do this by email. I'll just send you the link and maybe you can take a look.

Steve Hsu: Okay, happy to look at it. I mean, in this introduction to the talk, I'm making statements at a kind of abstract theoretical generality. And of course, in the real world, they don't necessarily hold. Like, maybe on average the thing does really well, and you can prove something about its average performance, but that doesn't mean there isn't some edge case that it really fails at, or something like that.

So I have to be careful: theory should be used judiciously, let's put it that way.

Right. Okay, so here are some references that might be of interest if the things I mentioned appeal to you. They're not gonna be the central focus of this presentation, but I think people in this audience might be interested in them. So here are some links where you can go away and read about them. All of these things are the subject of intense interest, so there are, you know, small to medium-sized research groups working on this kind of thing.

So one is the expressiveness of a network. People are interested in the difference between the depth and the width of the network: the width is the total size of the inputs that you could give the network, whereas the depth is the number of processing layers that it might use. And there are interesting, sharp results about the difference between scaling with depth and scaling with width.

The paper that I cite here is a pretty highly cited one that gives some examples of that kind of thing. The general conclusion is positive, basically: these things are extremely expressive, so it's not hard to convince yourself that networks of a certain size can capture essentially any continuous function of a certain characteristic dimensionality.

The second point I want to make is a little bit historical, but it's important, I think. There's a question of whether the optimization of these models is convex. In other words, in the eighties and nineties, which I'm old enough to remember, people who worked on neural nets were very concerned that as you trained the network, you could get stuck in a local minimum, and that it would be very hard to get out of that local minimum.

And then later, as people became more accustomed to working in very high-dimensional spaces, they realized that it's very rare to actually find an actual local minimum. Because if you look at the matrix of second derivatives, there are so many eigenvalues of that matrix that it would be extremely improbable for all of them to be positive at that particular critical point.

And this seems to be observed in actual training: you might have a slowdown of the training effectiveness because the first derivatives are kind of vanishing, but there aren't places where all the second derivatives are positive and you actually get trapped. So that doesn't really happen.
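That eigenvalue argument is easy to check numerically: for a random symmetric matrix standing in for the Hessian at a critical point, the chance that every eigenvalue is positive collapses quickly with dimension. A small Monte Carlo sketch (the Gaussian ensemble, dimensions, and trial count are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(n, trials=2000):
    """Fraction of random symmetric n x n matrices with all eigenvalues positive,
    i.e. the chance a random critical point is a true local minimum."""
    hits = 0
    for _ in range(trials):
        M = rng.normal(size=(n, n))
        H = (M + M.T) / 2                      # random symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(H) > 0):  # a minimum needs all positive
            hits += 1
    return hits / trials

for n in (1, 2, 4, 8):
    print(n, frac_all_positive(n))   # drops from ~0.5 toward ~0 as n grows
```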

And so that was a concern that people had 20, 30 years ago that turned out not to be realized. There's an anecdote from Yann LeCun, a famous neural net pioneer, about the early days of the backpropagation algorithm, which I think Michael Nielsen's book explains in great detail.

The backpropagation algorithm is a way to update the individual connections of the network in response to the data that the model is learning from. When they were initially trying out the backprop algorithm, they found that it wasn't working. And, this is a real research story,

they convinced themselves that it wasn't working because the neural nets were getting trapped in local minima. And later they found that, no, this is totally wrong, there aren't these local minima. In fact, what happened was that someone had just coded the backprop algorithm incorrectly, and it was actually just a software bug.

But it shows you how a kind of naive theoretical prejudice, which comes from low-dimensional thinking about a property that turns out not to be true in high dimensions, could still stymie the field for some period of time, because people wrongly think, oh, this is not a promising direction for this reason.

But the reason turns out not to actually be correct.

Oxford audience: So, can I say one thing?

Steve Hsu: Go ahead.

Oxford audience: Sorry. So, the problem of glasses in physics, right, is, I think, on paper supposed to be a counterexample to this view. Because one of the theories is, you know, what they call inherent structures.

So the way that works is, you're running your glass simulation, you know, with particles moving around, you stop and you roll downhill, right? Yeah. And where you stop is an inherent structure. And the claim with glasses, at least, is that when you do this sort of stuff, you end up with some frozen, liquid-like configuration.

These can be different on different runs, and, for instance, you don't find the crystal. So, anyway, I don't have to explain the question further. Are glasses somehow special? Is there an understanding of how both things might be true? Or have the glass people simply not done their code correctly?

Steve Hsu: So, I don't know that much about glasses or spin glasses, but, you know, I imagine this is something that people have actually looked into. Now, is it possible that the cases you're describing are ones where there is a critical slowing down of the, you know, rolling down the hill?

And so the glass is sort of not actually stable, but you just can't get past it because the computer time is prohibitive. That would be consistent with this kind of phenomenology. An actual local minimum, a true local minimum, I think would be kind of surprising.

However, in physics we have many more symmetries: the Hamiltonian that we're dealing with is very special, whereas the generic neural net that you're training doesn't really have any symmetries. And so there could be some exceptional cases where something that is highly improbable for a generic network

does happen for an actual physical system, because the Hamiltonian is very special. So, I guess that's one answer.

Oxford audience: Yeah, one more thing. And that is, maybe this is generating an interesting question to think about offline. So in the physics literature, you generally distinguish between the case of glasses, which are liquids and therefore have the symmetries of the liquid, and spin glasses, which actually have quenched random variables and therefore don't have any symmetries, or at least maybe they have one symmetry, which is, you know, some overall flip or something.

The general belief is actually that spin glasses get stuck even more easily, because you have a lot of random variables and local fields which vary from place to place and so on. Whereas glasses have been the hard problem because, you know, it really should form a crystal and it doesn't. So all of that is, in a sense, to say: to the extent that somebody handed you,

you know, the weights for one of these LLMs, they'd probably look pretty random, you know, to a human observer. So there's an interesting thing, which is that you're asserting a certain flexibility. So it kind of goes contrary to the usual physics intuition. But in physics there's also locality, right?

Which is to say, our energy functions have a very particular spatial structure. Although, again, I guess LLMs also have some; they can have connections going all the way across between different layers. Correct. And maybe that's okay. Sounds like there's something interesting there to maybe try, you know, sort of thinking about.

Steve Hsu: Yeah, I mean, the last few minutes of our discussion have accomplished the goal I had, which is to present some of these very general theoretical viewpoints.

Because I think they're very interesting to physicists, right? You might even find, like, oh, these guys haven't actually proven that these things are nearly convex; there's actually, you know, something wrong in the reasoning. The second link that I give is to Bengio, another famous pioneer in this space.

They actually write a paper sort of proving, but proving at an even lower level of rigor than we use in physics, that local minima should be exponentially improbable. But I think the real reason people believe this is just empirics: we can see slowing down at times during the training of the model, but we never find that the model actually gets fully stuck anywhere.

So, at the end of the day, it's really empirics that rule. But there are some interesting theoretical viewpoints here. Mm-hmm. Now, okay, so let me go on to the next thing.

Now, this is a very recent set of results: something called the large-width expansion and neural tangent kernels. I think this is of intense interest to physicists who are interested in neural nets.

So, in the limit of large width, the network gets very wide and is in some sense overparametrized relative to the input information, or to something about the problem that you're trying to solve. In that limit, when you initialize the network in a kind of random way, so you have this big wide network which is overparametrized and then randomly initialized, people have proven that the learning of the model is governed by something called a neural tangent kernel: neural because it's a neural net, and tangent kernel meaning it describes the movement of the model in a certain abstract function space as it learns.

And given this particular set of assumptions, they can prove that the optimization of the model to its global minimum actually is convex. And that is important, because it means that in polynomial time you can train this thing and get to basically the global minimum. So this is a really important result.

And it uses methods that are familiar to physicists, because it's a large-width expansion; it's like a 1/N expansion in a gauge theory or in some spin system. And some of the people who participate in this are actually physicists. So this is something that's kind of interesting, I think, if you're interested in this. Now, I should caution you that at the time when I was following this a couple of years ago, it was not clear from the empirics whether real model training, like the training of the real models that we're about to talk about, is actually within the large-width regime of this description. So it's not clear, but at least there's some limiting case of very overparametrized, randomly initialized neural nets that has very desirable properties vis-a-vis training. And it's interesting that they can prove that.

Oxford audience: One question, sorry. What is the intuition behind this large-width expansion?

Steve Hsu: I wouldn't say there's a really great one, but here's the intuition that I think people in this field would offer you: you overparametrize and then you randomly initialize the network.

So you're setting the connection strengths in some random Gaussian way, and there is a concentration of measure property. I don't know if you're familiar with that, but I can explain. And so there's a very typical state that the initial network is in, and in that typical state this neural tangent kernel is just kind of an average, typical property of that state.

And then, because the network is so overparametrized, there are many degenerate global minima, and this starting point is close enough to one of those global minima that you can reach it through a convex trajectory. I think that's the claim. Thanks. Okay. So again, those were just to, like, seed your brains with stuff,

if you want to go into this research area, with interesting things that people are still puzzled over or working on. From just a holistic perspective, my view is that these neural nets are an efficient means to learn almost any high-dimensional function. It's not an exponentially hard computational problem.

It's, you know, maybe challenging in practical terms, but it's not something that's never gonna be accomplishable. And, this is more relevant to the rest of the talk, but I'm pretty confident that existing data and compute capabilities are going to easily produce superhuman capability in almost any domain.

So, for any relatively narrowly defined problem, for sure, I think we're gonna be able to build neural nets that are superhuman, better than what our brains can do in that narrow domain. And then, I think, in the future, in a more generalized sense as well. So it has long-term consequences for the future of our universe, because we're just a little bit removed from our ape days in terms of, you know, biological evolution, and yet we've already reached a point where we can build machines which are probably, you know, good enough to learn almost everything that's learnable about our universe.

And all of us are sitting at this inflection point in history, at least on this planet, of the ape-like brains not being able to do this, and then suddenly being able to do this, during our lifetimes. Okay.

So we can return to these kinds of more speculative things at the end, but let me now get to the main point of the presentation, which is to talk about large language models, where there's been a very dramatic breakthrough just in the last few years.

And so let me start out by talking about the space of concepts that human brains use in thinking. You could even think about it as the geometry of thought. Okay. And so there's a subject called word embeddings, or word vectors, that has been studied for at least a decade in natural language processing in computer science.

And this was already known, you know, five or ten years ago, before the first big transformer models that could actually implement this program were developed. But this theoretical notion, that our brains are operating in a kind of, you know, approximately linear vector space of concepts, is kind of interesting.

So let me just explain that a little bit. You can think of the large language models, which are often called transformers (that's a particular architecture), as things which take natural language, so that's the text on the left, and transform it to a vector in a vector space.

And you know, for example, if you know how Chinese characters work, you know that, roughly speaking (this is not exactly true, my wife would kill me for saying this), each character is kind of like an ideogram. Like, there's a character for a person, there's a character for sky.

And so you can think of actual conceptual primitives as having a kind of one-to-one representation in the Chinese character system. And it's known that a kid can start reading the newspaper or simple books already when they've mastered, say, a thousand or a few thousand characters. So already this is telling you something about the dimensionality of the vector space of our concepts:

that, you know, the space our brains are actually working in is only of order a few thousand, or maybe 10,000, dimensional. In other words, I'm able to build up more complicated thoughts and concepts starting with primitives, of which there are not really that many, only of order thousands.

Okay.

Oxford audience: So far you haven't told us, I guess you're about to, why I should think of it as directions in a vector space rather than elements of a combinatorial set.

Steve Hsu: Right. Excellent question. So why do I keep saying vector space? Why do I keep saying it's approximately linear? That's the shocking part of it.

Okay, it's a little bit shocking when you start thinking about it. So, just empirically, you could say somebody trained a model, and the model maps, well, it's just a function: it's that weirdly shaped thing there that's labeled embedding model. That thing is a function, and it takes natural language and maps it to a vector.

Okay. And, just as an empirical observation, all of these huge language models that we're talking about, that people are so excited about, are using a roughly 10,000-dimensional vector space. I mean, the vectors are lists of about 10,000 numbers, and that's more or less sufficient to do everything that people want to do today.

Okay, that's just an empirical observation, but it's related to this observation that a kid can start reading the newspaper if he just knows a thousand or a few thousand Chinese characters. Okay.

Now let's talk about why we think this has some quasi linear properties. So it is a little bit like a vector space, and then it has some geometrical properties.

So if you start looking at the vectors which are generated by these models, in the field it's called the embedding vector space: you give it some natural language, it returns to you a vector. So let's take this example of "canine companions say." That's a kind of, you know, complicated formulation. Canine, that's a primitive, it's a dog-like thing.

Companion, that's like your friend who accompanies you around. Say, that's a sound or utterance, right, that a creature makes. So there are like three primitives there. It turns out if you take just the word woof, or bark or something, and you map it through the embedding machine, it gives you a vector which is very, very close to "canine companions say." Okay. And similarly with "meow" and "feline friends say," or "moo" and "bovine buddies say." And those are all quite far, in a sense, away from the phrase "a quarterback throws a football." But they're close to each other, because there's something similar about these vectors: they're all sounds that animals make.

So those three are clustered together, those six are clustered together, but a quarterback throwing a football is very far away. Okay, so you start to see there's some geometry in these vectors, where for "being close" you could define an obvious metric, just the Euclidean metric. And the closer together the vectors are, the more similar the concepts are,

even if the natural language formulations differ. Go ahead, sorry.
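As a toy illustration of this geometry (the vectors below are hand-made stand-ins, not outputs of a real embedding model, which would return roughly 10,000-dimensional vectors):

```python
import numpy as np

# Hand-made 4-d stand-ins for real ~10,000-d embedding vectors; the numbers
# are invented purely to illustrate the geometry, not real model outputs.
vectors = {
    "canine companions say":           np.array([0.9, 0.1, 0.0, 0.1]),
    "woof":                            np.array([0.8, 0.2, 0.1, 0.0]),
    "feline friends say":              np.array([0.1, 0.9, 0.0, 0.1]),
    "meow":                            np.array([0.2, 0.8, 0.1, 0.0]),
    "a quarterback throws a football": np.array([0.0, 0.1, 0.9, 0.8]),
}

def dist(a, b):
    """The obvious Euclidean metric on the embedding space."""
    return float(np.linalg.norm(vectors[a] - vectors[b]))

print(dist("woof", "canine companions say"))            # small: similar concepts
print(dist("meow", "feline friends say"))               # small: similar concepts
print(dist("woof", "a quarterback throws a football"))  # large: unrelated concepts
```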

Oxford audience: So if I replace "feline friends say" with "feline friends eat," it may end up somewhere next to a piece of food.

Steve Hsu: Absolutely, yes. It'll end up next to a fish or a mouse or something. Right.

Oxford audience: What's the criterion, what's the cost function, which is used to train this embedding machine, or to say what a good one is?

Steve Hsu: Yes. So, later I'll say more about how these machines are trained. Generally they're trained on next-word prediction. So the objective function of the model is that it's shown huge amounts of human text, and when given the n preceding words or tokens (token is a fancy word they use), it is supposed to become good at predicting the (n+1)th token in accordance with the probability distribution seen in the training text.
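A sketch of that objective (the vocabulary, probabilities, and helper names here are all invented for illustration; real models use vocabularies of tens of thousands of tokens and learn the probabilities):

```python
import numpy as np

# Next-token objective: given the n preceding tokens, the model outputs a
# probability distribution over the vocabulary; training minimizes the
# cross-entropy, i.e. the negative log-probability it assigned to the token
# that actually came next in the training text.
vocab = ["the", "cat", "sat", "on", "mat"]

def next_token_loss(predicted_probs, actual_next_token):
    return float(-np.log(predicted_probs[vocab.index(actual_next_token)]))

# Hypothetical model output after seeing "the cat sat on the":
probs = np.array([0.05, 0.10, 0.05, 0.05, 0.75])  # model favors "mat"

print(next_token_loss(probs, "mat"))  # low loss: the model predicted well
print(next_token_loss(probs, "cat"))  # high loss: the model predicted badly
```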

Does that make sense?

Oxford audience: I see. Thanks. Yeah. So this embedding is a part of the machine that does that prediction.

Steve Hsu: Yes. But there are two separate threads here. One is that maybe someone doesn't tell you how it was built: an alien comes to you and says, like, I don't have to tell you how I made this machine, but I made this machine, and this machine maps your natural language utterances into an abstract space.

And lo and behold, the space has these properties that I'm describing. So I don't have to tell you how I did it, but for sure people have done it, and they did it the way that I just described. Thank you. Okay.

Oxford audience: Now let me just make another, sorry, one more point. So suppose you then embed the words separately, "feline" and "friends" and "say": they would have still other locations, right?

Steve Hsu: Absolutely.

Yes, yes. Okay. Yeah. So "feline" by itself has its own vector.

Oxford audience: So, sorry. So does 10,000-dimensional mean sentences up to length 10,000? Not that there are that many of those that normally get embedded into this space? Is that right?

Steve Hsu: No. The 10,000 dimensions is literal: in order to make this all work, in order to build a language model which functions well, one only needs to deal with vectors of that size. Like, when you design the machine as a thing running in your computer, you only need to make the vectors that you're using as large as 10,000.

Oxford audience: So what I was trying to get at was that, you know, let's say I have a sentence. The way we just discussed it, we want to be able to represent each individual word making up the sentence, but then also the sentence as a whole, with the sequence it has, right?

So like, "feline friends say" was one chunk, "feline" was one, "friends" was another. So I'm sort of asking: is there a maximum length of concatenated words that you would handle?

Steve Hsu: So that's a separate issue. In the working of the model, which we'll talk about later, there's the question of, you know, what is the largest number of preceding words you can deal with to generate the (n+1)th.

That's a separate design issue in the model. What I'm identifying here is an abstract space in which literally our human minds operate, in the same way that, like, the set of Chinese characters that you need to fully express every thought every human has ever had is of order 10,000 or less.

Oxford audience: No, I hear your big point. I was just trying to understand, from within the world of the language model and its embedding, you know, whether the 10,000 could be an artifact of, you know, what length sentences you've shown it. Or is it actually uncovering, as you're suggesting, something deep?

But let me let you go on.

Steve Hsu: It's a good question. I believe the answer is the latter: that we are uncovering the geometry of human thought. Which, again, I think is familiar to a linguist who knows, like, Chinese characters. Or, believe it or not, in World War II they invented versions of English, which are called simplified English, so that they could quickly teach, like, you know, Pacific Islanders or people from other societies how to communicate with GIs and soldiers.

And they realized they could boil English down to roughly a thousand words. Like, if you really force people, you know, instead of tiger you say big cat. You don't allow them to say tiger, but they can say big (big is a really necessary primitive) and cat (cat is a really necessary primitive), and you can boil it down to about a thousand, or no more than a few thousand.

And it's kind of the same observation. What's interesting here is that the models discovered this in an automated way. Mm-hmm. Okay, very good. So let me just make one more geometrical observation. Let's suppose I take the vector for king, subtract the vector for man, and then add the vector for woman.

So I've just added three vectors together. King minus man plus woman. It turns out that's the vector that's very close to the vector for queen. So, in other words, there really is a kind of approximately linear or vector space structure going on here. It's not perfect. I'm about to get into why it's not perfect, but it was a little bit shocking to me to realize how close, how like, at least at first approximation, one could think of it as a vector space.
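As a toy sketch of that vector arithmetic, with made-up low-dimensional vectors rather than real learned embeddings, the computation looks like this:

```python
import numpy as np

# Toy 4-dimensional "embeddings" (invented for illustration; real models
# use learned vectors with thousands of dimensions).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    # Cosine similarity: how aligned two vectors are, independent of length.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen in a well-trained embedding.
result = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(nearest)  # with these toy numbers: queen
```

In a real word-embedding model the same nearest-neighbor search is done over the whole vocabulary, and "queen" only wins approximately, not exactly as in this contrived example.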

Mm-hmm. Okay.

So now, obviously, we talked a little bit about how this embedding, this little thing called the embedding model, is built, and I'm about to discuss that in more detail. Before I do, let me just say that one of the obvious ways in which our language deviates from a linear process is that word order is super important, right?

So the vector A plus B plus C is the same as B plus C plus A. But that's not true for our sentences, right? If I have a really complicated sentence, the order in which I put the words matters a lot, and it may map the concept to a completely different vector.
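That order-insensitivity of plain vector addition is easy to see directly. This toy snippet, with random stand-in word vectors, shows why a model that only summed word vectors could never distinguish word orders:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three made-up word vectors A, B, C in an 8-dimensional toy space.
a, b, c = rng.normal(size=(3, 8))

# A plain sum of vectors is identical for any word order...
assert np.allclose(a + b + c, b + c + a)

# ...so a bag-of-vectors model cannot tell "dog bites man" from
# "man bites dog". Position information plus the attention mechanism
# are what restore sensitivity to word order.
print("sums equal for all orderings")
```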

So there is a lot of non-linearity here as well. Hmm. And that non-linearity has to be accounted for, and it's accounted for in practical terms by this thing called the transformer architecture. This is based on a paper published by researchers at Google Brain back in 2017.

And you know, now that Google's a little bit behind in this AI race, people are complaining: oh, Google should not have published this paper, they should have kept these results to themselves and worked on them for years instead of just letting the whole world know how to do all this stuff.

But this is the paper in which they gave an architecture that could look at large chunks of training data, i.e., human-generated text, and extract that embedding model, and more than that embedding model, in an automated way. And it itself is a neural net.

And that neural net learns; it is trained on huge corpora of data. Okay.

This is an incomprehensible picture that appeared in the original paper. It's quite complicated, and I'm not going to go through this diagram. If you want, I can give you references: you can go online to the advanced courses on machine learning or AI or natural language processing at, you know, universities.

They now spend many, many lectures going through the full content of this paper so that people know what transformers are. I'm just going to give you a high-level view of what's really going on in here, to allow it to do these magical things.

I don't know if this was the title or the subheading of that paper: attention is all you need.

It's definitely a tagline by the authors of that paper. I think it is the title. Is it the title? Okay, good. And the whole idea was that you need an architecture that can take into account this non-linear property I mentioned: that word order matters, and matters differently in different languages.

So you need an architecture that can pay attention to the relationship between, say, some adjective that occurred at the beginning of your sentence and some noun that occurred later in the sentence. And it has to be able to learn that in this structure, that adjective is modifying this particular token that appears later in the sentence.

So you need at least some capability to associate different tokens in that string of n tokens in order to figure out what is actually being said. And the real innovation in this paper is to introduce three big matrices, roughly 10,000 by 10,000 dimensional: Q, K, and V.

This language comes from earlier work in information retrieval and natural language processing: Q is called the query matrix, K is called the key matrix, and V is called the value matrix. The point is that they were implemented in a computationally tractable way, a method by which the attention could be computed across pairs of tokens:

how important token i is to the meaning of token j, and vice versa. That's what the structure allows them to do. And that's modifiable. Q, K, and V are the learned things: the actual coefficients, the entries in those matrices, are what OpenAI or Google spent, you know, 10 million dollars on a training run to determine.
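A minimal sketch of that pairwise computation, assuming the standard scaled dot-product attention from the "Attention Is All You Need" paper, with tiny toy dimensions rather than the real 10,000-dimensional space:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over n token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # project each token three ways
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))      # A[i, j]: how much token i attends to token j
    return A @ V                           # each output is a weighted mix of value vectors

# Tiny example: 4 tokens in a 6-dimensional toy embedding space.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))
Wq, Wk, Wv = rng.normal(size=(3, 6, 6))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 6): one updated vector per token
```

Each row of A sums to one, so every token's output is a convex combination of the value vectors of all tokens, weighted by learned relevance.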


Oxford audience: So these matrices are fixed once and for all? They're not dynamic objects; they're part of the network.

Steve Hsu: It's interesting. Yes, when you're done with your training, they're fixed. But interestingly, more training further refines them. So if you come across more data, or you can run longer, you get more refined versions of these matrices.

And the jump from GPT-3 or 3.5 to GPT-4, which happened recently (they only released GPT-4 in the last month or so), shows really qualitative differences in performance, which I'm going to get into, and which come literally from further refinement of these matrices.

So it's very, very important information. OpenAI would very much not like it if someone stole these matrices from them and released them on the internet.

Okay. So if you go back to a classical neural net, you have a bunch of inputs to a node, and you can think of the inputs as some high-dimensional input vector X.

One way you could set it up is that the input vector X gets multiplied by some weight vector W to give you a scalar quantity, the inner product of the two. Then there's some non-linear function of that inner product, which determines the output of the neuron feeding into other nodes.
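That classical node can be written in a couple of lines. Here is a minimal sketch, using a ReLU as one common choice of nonlinearity:

```python
import numpy as np

def neuron(x, w, b=0.0):
    """Classic artificial neuron: inner product, then a nonlinearity."""
    s = float(np.dot(w, x)) + b    # scalar pre-activation w.x + b
    return max(0.0, s)             # ReLU nonlinearity (one common choice)

x = np.array([1.0, -2.0, 0.5])     # input vector
w = np.array([0.3, 0.1, 0.8])      # learned weight vector
print(neuron(x, w))                # 0.3 - 0.2 + 0.4 = 0.5
```

A layer of the classical net is just many such nodes in parallel, each with its own weight vector; the transformer's attention mechanism described next is what replaces this fixed filtering with token-pair comparisons.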

Okay. So that's one way to think of the classical neural net: filters, and then some non-linear function that takes the output X.W and feeds it into the next layer. What's different about this transformer architecture is that one can have many input vectors X_1 through X_n, and this attention mechanism compares X_i to X_j, but only after making a basis change using the Q matrix and the K matrix.

So the Q matrix and the K matrix define certain bases. You make those rotations of the input vectors, and then you compute the attention weight A_ij, which is a function of the rotated X_i, the rotated X_j, their inner product, and then the value matrix, which is also learned.

Okay. So that's the magic sauce in these transformer models. And just to characterize the size of these things: a typical transformer, say GPT-3 or something, will have of order a hundred layers and of order a hundred attention heads.

Each of these attention things is a separate head. That's hard-coded in the architecture: you have a certain number of attention heads that can look across different pairs in the sentence. And they're working in this roughly 10,000 dimensional embedding space.


Oxford audience: So can I check one thing? Supposing I was being an idiot: I have these two vectors X_i and X_j, and somehow I do a unitary or orthogonal change of basis, and I'm using the Euclidean metric. In that case, nothing would have been gained by making this change of basis, right? I could just as well have looked at their distance or something in the original space.

So what am I doing? Am I stretching, somehow changing... you know, what processing am I doing on the vectors?

Steve Hsu: Yeah, I think Q and K are not unitary at all, or orthogonal, I guess, since these are real numbers. Q and K are very, very special. They're telling you: this thing is really relevant to that thing, right,

but that thing was irrelevant to this other thing. Yeah, go ahead.

Oxford audience: Does Q act just on X_i, or does it act on... in other words, Q and K are trained, so they don't know in advance which vectors are going to be fed in, right? So are they processing each individual vector X_i, X_j before you compare them?

Steve Hsu: So I think within each head this computation is being done. I don't know if you can see my pointer, but you have an X_j coming in here and an X_i coming in here, and then Q and K are a kind of kernel telling you how to modify X_j and X_i before you compute the inner product.

Oxford audience: So it sort of looks like, you know, in the orthogonal example I was thinking of, that would be a change of basis on each vector, but the same one, so nothing is gained. Here I'm somehow processing X_i and then processing X_j. And so, what kind of processing?

It's not really looking at what happened from X_i to X_j, right? It's just looking at something internal to X_i and internal to X_j.

Steve Hsu: Yeah. You're asking a good question, because I should be able to give you a nice, intuitive illustration of this, but I'm trying to make one up on the fly.

But let's suppose Q is trying to extract from X_i whether it's talking about the size of an object, and K is trying to extract from X_j what particular kind of object it is. So it's a big cat, small car kind of thing. I could be wrong; maybe this example is not good.

But the idea is that there is a specific subspace that a particular K matrix cares about, and a specific subspace that a particular Q matrix cares about, and this is a way of combining that information in weighting the combination of those two input vectors.

Oxford audience: Right. So let me try again. Supposing I was more of an engineer and I was trying to do something with language: I might have thought, maybe I'll store the word along with a tag that tells me whether it's a noun or a verb or an adjective or something.

And then I might want Q to pick out that piece of the information which tells me the syntax label, and then, you know, I sort of know... So could something like that be happening dynamically, which is to say, in the process of training?

Steve Hsu: Yes, absolutely. Okay, sorry for the slight digression, but imagine that you're in your front yard and a little alien spaceship crashes in your yard, and the alien spills out and he's dead. But you can grab his brain and start looking at it. Surely there'd be a lot of interest in that, right?

Even though the alien is dead, we'd still like to look at how its neurons work, supposing it has neurons. In the same way, these guys at OpenAI have basically built this thing which is incredibly interesting, and just as a purely empirical matter, people should be peering inside it to see how the hell this thing works.

And that field is in its infancy; it's becoming a whole field. They even talk about what are called ablation studies, where you blow away or randomize certain components in the neural net and ask how that affects its behavior or its performance. So there's a huge empirical avenue of investigation here, because these things clearly can do something interesting.
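An ablation study in cartoon form might look like the following. This is a hypothetical two-head toy model, not a real transformer; the point is only the shape of the experiment, comparing outputs with one component zeroed out:

```python
import numpy as np

# Cartoon ablation: zero out one "head" of a toy two-head model and
# measure how much the output changes.
rng = np.random.default_rng(2)
W_head1, W_head2 = rng.normal(size=(2, 8, 8))  # invented toy weights

def toy_model(x, ablate_head2=False):
    h1 = W_head1 @ x
    h2 = np.zeros_like(x) if ablate_head2 else W_head2 @ x
    return h1 + h2  # the model's output combines both heads

x = rng.normal(size=8)
baseline = toy_model(x)
ablated = toy_model(x, ablate_head2=True)

# The size of the gap tells you how much head 2 contributed on this input.
print(np.linalg.norm(baseline - ablated))
```

In real interpretability work the comparison is done on task performance over many inputs, not a single vector, but the logic is the same.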

And we don't quite know what they're doing. Now, I believe the people who wrote the original paper had some good intuitions for why they built the architecture this way, and I'm probably just unable to explain their intuitions well to you. I'm still trying to learn them myself, actually.

But the details, like whether this particular attention head is looking for the noun subject of the sentence while the query matrix is looking for a modification of it... it could turn out there's something like that going on.

Very good.

But I have to admit, it's a mystery to me as well.

It's a good question. If that were my PhD dissertation, I'd be perfectly happy looking into it as a well-defined problem. But I'm kind of busy with other stuff as well.

Okay. So let me summarize a little of what we've been saying. You have an embedding space.

Now, the structure of that embedding space is kind of universal. In other words, just in terms of geometry, your brain is probably using the same approximate embedding space that my brain is using. What is language-specific could be the specific learned K, Q, V matrices and the layer connections.

Those encode things like grammar rules and other stuff. But importantly, I think they also encode some information about how derived concepts are built from primitives. So the fact that a tiger is a big striped cat is learned; it's stored somewhere inside these matrices and layer connections and things like that.

And it's very non-trivial that someone built an automated process for extracting that from natural language. That's what impressed me the most. When I first saw the word-vector embedding papers, maybe five or ten years ago, and then the transformer paper about five or six years ago, I thought, wow, the interesting things are going to happen now.

And it took another few years before the interesting things started happening, because obviously there are a lot of engineering challenges. But now things are really starting to happen. Now, it's possible that much larger models are already possible; it's just that people haven't gotten around to training them.

And so it's possible that, you know, this 10,000 dimensional space is big enough that you and I might be pretty good at navigating the physics and math part of it, while the part that talks about, say, the shapes of proteins, or medieval Latin or something, we're totally ignorant about that part of the concept space.

And so already this thing is spanning, in some sense, a larger space of concepts than any one human can really handle.

Oxford audience: Actually, on that point, let me ask something. One of the things that has impressed me, with exactly what you just said, is that GPT-4 in particular can do very different things.

Now, I might have thought prior to this demonstration that, because some of us learn to do mathematics while others are maybe good at visualizing graphs and so on and so forth, these somehow required different kinds of thinking, right? So the idea that they can all be stuck into one single space: now, that could happen two ways.

It could be, you know, that they're sort of orthogonal subspaces: it does something in one subspace and something in another subspace, and it just happens to combine them. Or it could be that the kind of stuff it does is like when we sit down and, for a moment, say to ourselves, let me try to multiply these two numbers or something else.

Maybe inside our heads there's a clutch which translates it into something verbal, and we've just learned to embed layers into our own thinking. So do you have any thoughts on whether all human thinking is maybe very similar, deep down in a few layers?

Or is it that the network is doing different things in different places?

Steve Hsu: Later in the talk I'm going to discuss adding additional modules to these large language models, and that adding of modules can be thought of as analogous to additional functions that we have in our brains which the language model doesn't have on its own.

So that is somewhat related to your question: this question of having concepts or primitives that you can manipulate. And it has more than we do, because it just has more capacity. You and I might have the right ones to do some math, and my wife might have the right ones to analyze literature, or Chinese literature, or something.

Mm-hmm. Yeah, I think that's probably the case. So, you know, she has a different K, Q, and V. Not that there's a direct analogy between these matrices and what's going on in our brains, but effectively, in some kind of analogy, she has some structures that we don't have, and we have some structures that she doesn't have.

Oxford audience: But in terms of the geometry of thought, are you conjecturing that there's a sort of broader vector space structure which underlies all of the ideas humanity has ever come up with?

Steve Hsu: That's what I'm saying.

If you spend enough time with GPT-4, you'll realize its vector space is bigger than your vector space, right? And there's no limit on its vector space. They could just say, well, I'll use 30,000 dimensional vectors, and I'll try to find more training data, because I didn't use the Annals of Mathematics properly in my earlier training.

But now I will, if you see what I'm saying. So there's no limit to this, actually. That's what's mind-boggling about it. It's not just like us meeting an alien like Spock that's much better than us. This thing is in silico, so it's somewhat configurable, and we can actually see directly how to make it better than us.

Yes. So,

Oxford audience: Sorry, one more question. With regard to these matrices Q, K, and V: what is known about their structure? Presumably they're not sparse, but is anything known about the probability of matrix elements being large, or something like that?

Steve Hsu: So, I haven't had time to look into this, but there are people who study these matrices.

And on a practical note, something I have thought about because of my earlier work in genomics: I believe it's possible to sparsify these matrices K, Q, V and speed up performance. If you just took the smallest numbers in these matrices and set them to zero, I bet the thing would function almost the same way, and you would speed up your computations.
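That pruning idea can be sketched as simple magnitude-based sparsification. This is a hypothetical illustration of the "set the smallest entries to zero" step, not how any production model is actually compressed:

```python
import numpy as np

def sparsify(W, keep_fraction=0.5):
    """Zero out the smallest-magnitude entries of W, keeping the largest."""
    threshold = np.quantile(np.abs(W), 1.0 - keep_fraction)
    return np.where(np.abs(W) >= threshold, W, 0.0)

rng = np.random.default_rng(3)
W = rng.normal(size=(100, 100))                 # stand-in for a learned matrix
W_sparse = sparsify(W, keep_fraction=0.1)       # keep only the largest 10%

# Most entries are now zero, so sparse matrix-vector products can be
# faster, while the largest couplings are preserved.
print((W_sparse == 0).mean())  # roughly 0.9
```

Whether the model still "functions almost the same way" after such pruning is an empirical question; in practice people measure the accuracy drop at each sparsity level.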

But about the overall structure, I can't say anything really intelligent. I think it's a whole field of study that people should pursue: the study of alien brain wiring.

Oxford audience: Can you go backwards? Like, if you give it the vector, can it tell you what phrase it corresponds to?

Steve Hsu: No, I think it's many-to-one. Well, I don't know if it's completely many-to-one. Sorry, I'm saying the wrong thing.

When you're actually using the model, you give a prompt to the AI, it maps that into the internal embedding space, it does all its computations in the embedding space, and then it maps back to give you the actual output sentences in English or Chinese or whatever.

So it can map back. I don't know exactly how one-to-one all this is, because there are probability distributions involved.

Oxford audience: I think, given what you were saying about it being linear, and what you just discussed: if you take the vector for cat and the vector for integral and you try to add them together and then convert it back out to a word... I mean, are there large parts of this space which are kind of not used?

Steve Hsu: Yes, there are definitely large parts of the space that are not used, I think because it's never seen examples where someone talked about the integral cat. A big part of why all this works is that our language and nature are highly compressible.

As we know as physicists, all these natural phenomena that we're looking at are generated by this standard model that I can print on my T-shirt, right? So in a way, these deep neural nets are learning compressed descriptions of all this stuff.

So there are obviously huge chunks of, like, phase space that are never used, never encountered. Okay. Thank you.

Okay. So I hope I've given you some rough idea of what's going on under the hood. It's deserving of lots more investigation, but I want to switch gears a little and show you some examples of magic that this thing can do.

And this is GPT-4, the latest and greatest. This comes from a paper called Sparks of Artificial General Intelligence: Early Experiments with GPT-4. I think some of the authors of this paper might actually be physicists, or at least former mathematicians, and they were just doing a bunch of experiments.

They had early access to GPT-4. These are guys mainly at Microsoft Research, I think. It's a very long paper, and there are one or two great talks you can find on YouTube by the authors, where they just show you some magical things this thing is doing.

And they're as freaked out as we are. So let me give you this example. This is a theory-of-mind example. A lot of times you might say, oh, this is just some dumb thing that manipulates some vectors and multiplies some matrices. Could it possibly have any theory of mind? Can it guess what I'm thinking,

or what somebody else is thinking? So the input prompt here says: in the room are John, Mark, a cat, a box, and a basket. John takes the cat, puts it in the basket, leaves the room, and goes to school. While John is away, Mark takes the cat out of the basket, puts it in the box, leaves the room, and goes to work.

John and Mark come back and enter the room. They don't know what happened in the room while they were away. What do they think? So this is the information given to GPT-4, and it's supposed to summarize the states of mind of John and Mark and the cat. And it correctly does it. It says, well, John thinks the cat is still in the basket,

since that's where he left it. And Mark thinks the cat is in the box, since that's where he moved it. The cat thinks it's in the box, because it knows where it is. The box and the basket don't think anything, because they're not sentient. And if you've never looked into AI or computational linguistics or whatever, this is an extremely hard thing for a machine to do, because it actually has to infer things about the real world that it's not told. It's not told explicitly that boxes don't have states of mind, or that John can't see what's in the room when he's away, and stuff like that.

Well, I guess they are told that they don't know what happened in the room while they were away. But anyway, this is, to me at least, an extremely impressive performance by the model: it can impute theory of mind from a brief natural language description of what happened.

Okay, here's another example, which blew me away when I saw it. I couldn't believe this. The prompt, what the human says to GPT-4, is: can you write a proof of the infinity of primes, with every line that rhymes? Okay, so I won't read this whole poem to you, but it's a little poem which says: suppose you had a finite list of all the primes.

Take all those numbers, multiply them together, and add one. The result isn't divisible by any prime on your list, so your list could not have been exhausted. Okay, it's a pretty elementary proof, but this thing gave us not only the proof, but in the form of a poem. So that was pretty amazing.

But then you might say, oh, maybe it knew that, or it saw something like that in the corpus; maybe there are some mathematicians who write poems on the internet that we didn't know about. But the next step is really freaky. The next step is the freakiest part. Then the human says, can you draw an illustration of this proof in SVG format?

SVG is scalable vector graphics, a particular encoding language for making figures. I wish I could do this for my papers, because I'm really lousy at making figures for my papers. But this thing can output code in SVG format. So it does that.

So the guy says, do this, and that's the output, and you can see that the output is actually code, right? I think the whole thing is not given there; it's a longer bit of code, you can see the brackets there. But it gives the code which generates the image shown just above.

And so there's this box that has all the primes listed, then dot, dot, dot, and then an arrow that says: multiply them together to get N, and then add one to N. And you could say, okay, this thing doesn't really understand prime numbers, this is just a parlor trick. But it is pretty amazing.

It's amazing on many levels, if you just sit down and think about what this thing just did. Anyway, I consider that an example of magic.

Let me give you another one. It turns out that you can train these things not just on human language, but also on, say, all the commented code that's on GitHub.

GitHub is this place where basically all coders now put their stuff. It's a good development environment, and many, many software developers use it. So there are millions and millions of lines of code, with descriptions of the code, comments in the code, et cetera.

And if you use that as a training corpus, the language model actually learns to translate natural language instructions into actual compilable code in the language of your choice. In this example, and it's better if you watch the video from the authors of this paper, where they actually show you these games, they ask GPT to write a game based on these specifications.

I want this to be a 3D game in HTML using JavaScript (it's literally writing in these languages). I want three spherical avatars. The player controls his avatar using the arrow keys. The enemy avatar tries to catch the player, blah, blah, blah.

There's a defender avatar which protects the player, and then there are a bunch of obstacles that spawn randomly and move randomly. And the earlier version, using ChatGPT, which is GPT-3.5, makes this really primitive game. But it is a game: if you run this JavaScript code in your browser and move the arrow keys, your little ball moves around, and the other balls do what was specified in the input prompt.

Then when you get to GPT-4, it makes this much more elaborate game. It literally made a game where 3D cubes are rolling around, and there are three little balls rolling around: one controlled by the player, one trying to catch the player, and one trying to block and protect the player.

So it wrote a piece of code, 500 or more lines, that actually is a game; you can play this game. And it did that on the fly, from the instructions given in the gray box. So I consider that magic as well. And pretty much any professional software developer or coder you talk to is using GPT now to accelerate their development process.

It doesn't always produce compilable code that works, but it will often produce chunks of code that, with a little bit of human modification, are highly usable. And it's really changing the way professional software developers work. The interesting part is that programming languages are just other languages, but extremely well-structured ones.

They're in some sense easier for these models to learn than natural language. Natural language is much more complicated and has all kinds of weird idiosyncrasies that maybe programming languages don't have. So anyway, it can go back and forth between them.

Okay, next slide. Now people are starting to hook up tools to these language models. So suppose a language model is able to output commands which go to certain tools, like a calendar or an email program. This is a real result, where they said: I'm going to give you access to these tools, a calendar program with a certain syntax for pulling information from the calendar, and email, which can send messages.

And then you get messages back. So here, the LLM is asked: please set up dinner with Joe and Luke at this restaurant this week. And GPT-4 says, okay, I'm going to do a calendar call, and I'm going to write this email that says, hey Joe, dinner, which nights are you free? So it does all these things, and in this setup, the calendar call and the email call actually generate responses back from Joe, Luke, and the user, which go back to GPT-4, and it adapts to that information.

And at the end it says, I scheduled dinner for Wednesday night at 6:00 PM, and it actually performed this task correctly. Okay. So again, pretty magical, right? A few years ago, if you had asked me whether some program could do this kind of thing, I would have said no, I don't think we can do that. But here it is.
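The tool-use loop just described can be sketched as follows. Everything here is invented for illustration (the command format, the function names, the canned replies); the point is only the shape of the loop: the model emits a tool call, the harness executes it, and the result is fed back to the model until it produces a final answer.

```python
def fake_llm(conversation):
    """Stand-in for the language model: first a tool call, then an answer."""
    if not any("CALENDAR_RESULT" in m for m in conversation):
        return "CALL calendar check_free_evenings"
    return "Dinner scheduled for Wednesday at 6pm."

def run_tool(command):
    # A real harness would dispatch to an actual calendar or email API here.
    if command == "CALL calendar check_free_evenings":
        return "CALENDAR_RESULT: Wednesday and Thursday are free"
    return "ERROR: unknown tool"

conversation = ["USER: set up dinner with Joe this week"]
while True:
    reply = fake_llm(conversation)
    if reply.startswith("CALL "):
        conversation.append(run_tool(reply))  # tool output goes back to the model
    else:
        print(reply)                          # final answer to the user
        break
```

Real systems add guardrails (limits on the number of tool calls, validation of the command syntax), but the control flow is essentially this loop.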

Oxford audience: One quick question, in case you happen to know the answer. When they train, do they first train on text and then on code from GitHub, or do they interleave them?

Steve Hsu: I could be wrong about this, but I believe they took the original model that was trained on human natural language, and then started further training it on GitHub text. I believe that's the order they did it in, but I'm not a hundred percent sure. Okay.

Okay. Now let me talk a little bit about a problem that these models have. Remember, the objective function in their training is to predict word n+1 given the first n words. And by predicting it, they mean generating a probability distribution for what word n+1 should be, so that the distribution matches what's seen in the training data.
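In miniature, that objective looks like this: scores over a vocabulary become a probability distribution, and the next word is sampled from it. The vocabulary and scores here are invented toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["said", "fools", "world", "relativity"]  # toy vocabulary
logits = np.array([2.0, 1.5, 0.5, -1.0])          # toy scores for the next word

probs = np.exp(logits) / np.exp(logits).sum()     # softmax over the vocabulary
next_token = rng.choice(vocab, p=probs)           # sample the continuation

# Training rewards making `probs` match the data, i.e. being plausible.
# Nothing in this objective checks whether the sampled continuation is
# true, which is why hallucination is possible.
print(next_token, probs.round(3))
```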

Okay? But this means that the model can hallucinate. In other words, it can generate plausible text, which is not factual text. Even though as a consequence of being trained on, you know, hundreds of billions of, hundreds of billions of sentences, it is that the right, yeah, it's probably right. That then, it has, somehow managed to encode all kinds of knowledge about the world in the connections that it can draw on.

Nevertheless, if it can answer your query plausibly, it will be, in a sense, happy doing that. It's not been trained to do anything more than that. Okay. So here's an actual query you can give to ChatGPT: what did Einstein say about fools? And it says, oh, well, Einstein famously said, and then there's a direct quote, the world is a dangerous place to live, and so on.

And then at the end it tacks on a second sentence ending with fools, I suppose. Now, this is a hallucination, and it's an interesting, very subtle hallucination, because the first long sentence is something that Einstein famously said. But then it stuck on the second sentence, I suppose because it sounded plausible, it sounded like it went with the first sentence. Einstein never said the second sentence.

So this is an example of it generating plausible text, but not factually accurate text. And that phenomenon is what people in this field now call hallucination: the models have a hallucination problem. And you may not care. If you're just using it to amuse yourself and your kids, you don't care that it hallucinates every now and then. But if you want to try to do serious stuff with it, you worry a little bit about these hallucinations.

Okay? This is from the company itself. When GPT-4 was released by OpenAI, a month or so ago, it came with a big technical paper, and this graph shows one minus the hallucination rate for different models. The earlier models are not as good. GPT-4, which is the latest and greatest, still has a hallucination rate of order 20 to 30%.

So if you ask it a detailed technical question, say, tell me the five greatest papers that Freeman Dyson wrote, it could easily make up three out of five papers totally out of whole cloth. The title, the journal could be totally made up, but it will look very plausible, maybe even to another physicist.

Oxford audience: Sorry, can I ask a question? In the previous example, if one were to query it after the quote and say, is this a literal quote, would it correct itself?

Steve Hsu: So if you ask, is it a literal quote, it's not clear. I think it might say yes, it might say no. If you challenge it and you say, I don't think Einstein said fools, I suppose, it's been known to just back down and say, yeah, you're right, I don't know where I got that. So it's a little goofy. Let's put it that way.

Yeah, and it's not that we don't do the same thing. We misremember things or make errors all the time. But now I'm getting into really practical aspects of this field, edging toward the startup that I founded. For certain applications, what in the industry are called enterprise applications, the kind of thing a big company would want from one of these AIs,

you want to be able to suppress these hallucinations. So that's what I'm going to talk about now. Okay, so this graph just shows that OpenAI is very aware that this is a problem. There was a great 60 Minutes episode just a couple weeks ago with Sundar Pichai, the CEO of Google, and they showed him a really egregious example of Bard, their AI, hallucinating.

And he just said, yeah, this is a real problem that we're working on. So nobody, other than us and maybe some other startups that we'll probably end up competing with, is claiming to have solved this problem. The big guys are not claiming to have solved it. But we are claiming to have largely solved it.

So I'm gonna tell you about that.

Okay. So this is a more complicated software architecture. Let me first tell you what the components are. The user is you, and you're sending a query into this AI. To you, the big square is just one giant AI; you don't really know what's inside it.

Okay, you send a query to this AI and you want it to answer truthfully, factually. You don't want any hallucinations. That's the design goal. And the components inside the big square are the components that we use to solve this problem. The big LLM in the lower right you can think of as GPT-4, and the two smaller circles you can think of as other, smaller LLMs that we've trained ourselves to do specific tasks.

They're not meant to be general geniuses the way GPT-4 is, but they do certain things very reliably. And then the rectangle labeled corpus is, in a sense, a memory. It is a set of information that the designer of this whole AI regards as ground truth. So everything in the corpus is reliable information.

Okay, so now the user puts a query in. He types in something like, what did Einstein say about fools? And the corpus is every single thing Einstein ever wrote or was recorded saying. So it is a ground-truth compendium of everything that we know he said. And the little LLMs perform certain tasks.

So the little LLMs are trained to be extremely good at dealing with the specific kind of language for this narrow problem. The first LLM, on the way in, will break the user prompt into sub-chunks and pull the information from the corpus that is most relevant to those sub-chunks. And then it will submit a query to the big, powerful LLM, which says, answer these questions.

Or this question, using only the information I've just given you. Okay. So it's forcing, kind of jamming in its face, the ground-truth facts that are actually known and most relevant to the query. And then the big generative model, which could be GPT-4, produces an output. That output is further fact-checked by the other LLM on the way out.

And that other LLM also pulls from the corpus to double-check everything that is proposed as a response to the user. And only after that checking happens does the response go out. There may even be some iteration, where the last LLM sends back to the powerful LLM: well, this doesn't look right.

This is what I think is actually factual; can you revise? So there could even be some back and forth there. And then finally, a result is sent out to the user. So it's a more complicated design. It's using multiple LLMs, which can be optimized for different purposes, and it has an attached memory.
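As a rough illustration only, the data flow just described (retrieve ground-truth chunks, force the big model to use only them, fact-check the draft on the way out) might be sketched like this. All three "models" are trivial keyword-matching stand-ins, and the corpus is two made-up entries; none of this is the company's actual code.

```python
# A toy, runnable sketch of the architecture described above. The
# "LLMs" here are trivial stand-ins (keyword overlap and substring
# checks) just to make the data flow concrete.

CORPUS = {
    "einstein-quote": "Einstein said: the world is a dangerous place to live.",
    "einstein-bio": "Einstein developed the theory of relativity.",
}

def retrieve(query):
    """Input-side step: pull corpus chunks sharing words with the query."""
    words = set(query.lower().split())
    return [text for text in CORPUS.values()
            if words & set(text.lower().split())]

def big_llm(prompt, chunks):
    """Stand-in for GPT-4: echo the chunk that best matches the prompt."""
    return max(chunks, key=lambda c: len(set(c.split()) & set(prompt.split())))

def fact_check(answer, chunks):
    """Output-side step: accept only answers grounded in a retrieved chunk."""
    return any(answer in chunk or chunk in answer for chunk in chunks)

def answer_query(query):
    chunks = retrieve(query)                 # jam ground truth into the prompt
    if not chunks:
        return "I don't know."               # refuse rather than hallucinate
    draft = big_llm(query, chunks)
    if not fact_check(draft, chunks):
        return "I don't know."
    return draft

print(answer_query("what did einstein say about the world"))
# → Einstein said: the world is a dangerous place to live.
```

The key design point survives even in the toy: a query whose answer is not in the corpus gets a refusal rather than a plausible fabrication.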

You could analogize this in a couple of ways. You could say: in addition to the big LLM, which is like the language module in your brain, I've added some other modules which pull from memory and error-check statements before they actually come out of your mouth.

So these are things which have analogs in the functioning of human brains that we've added to the software design. Another way to say it: the big LLM is a genius writer for the New Yorker, but one who sometimes writes while he's high, like he's on drugs. One little LLM is the editor who's guiding him, telling him, hey, stick to the subject, Joe.

We don't want this particular stuff in the article that you're writing. And the other LLM you could think of as the fact checker at the New Yorker, who tediously checks everything in the article against what's actually true before they allow it to be printed.

So all of these components in the software design have analogs, in one case to the functioning of the New Yorker magazine, in the other to the functioning of your brain.

And so the question is, yeah, go ahead.

Oxford audience: So why not just train an LLM on the corpus? I guess you'd be back to the hallucination problem, because it would just be using those words in a statistical way.

Steve Hsu: Yes. Yeah. Okay.

Oxford audience: So on the one hand, the corpus defines the range of questions to which you can get answers, right? If the corpus doesn't contain something, you're not going to get an answer to it. And regarding

the LLM, it's being used basically for natural language, for its ability to say things well and

Steve Hsu: fluently. Correct, correct. And by the way, in that diagram there are not just those two little LLMs; there's actually a lot of hard-coded, old-style software inside that does certain things.


Oxford audience: Stephen Wolfram's great enterprise, right? Was that somehow his company was going to create this curated database in which things would be true? That's what they've been working on?

Steve Hsu: Yes.

Oxford audience: And is that corpus big enough? I guess he's been writing a bit about that, that GPT-4 plus Wolfram is enough for almost anything. Or is that still a corpus which is fairly small and not really able to keep up with things?

Steve Hsu: So I haven't seen that. He was assembling a factual corpus of things like, how deep is the Sea of Japan, all kinds of physical constants and numbers, statistical facts.

And I haven't seen the impact of using that corpus coupled to GPT. What I have seen is that, and this is a little off topic, GPT is in a sense quite bad at math sometimes. And so one of the things they did at Wolfram was a translator, basically an LLM which can take natural language and map it into well-formulated mathematical queries.

One of the ways that's been found to make GPT, or an LLM, better at math is that you hard-code in there that it is supposed to write a program which encodes the mathematical question it's been asked. In this case it does that in Mathematica and pulls the answer. Either way, it is shown the result of that computation, which is done elsewhere, in Python or Mathematica, before it formulates a response.

And in that case, its performance on mathematical questions is really quite good. So my impression, and I could be wrong, is that the main thing they've done, or at least that I've seen them do, is basically couple Mathematica itself to an LLM. The utility of their fact corpus I haven't seen evaluated anywhere.
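The tool-use pattern described here, translate the question into a formal expression, let a real interpreter compute it, and only then compose the reply, can be sketched with Python's `ast` module standing in for Mathematica. The `translate` step is a trivial keyword replacer, purely for illustration of where a translator LLM would sit.

```python
import ast
import operator

# Map arithmetic AST node types to the operations that evaluate them.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def safe_eval(expr):
    """Evaluate a purely arithmetic expression via its syntax tree."""
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("not arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

def translate(question):
    """Stand-in for the translator LLM: map a phrase to an expression."""
    return (question.lower()
            .replace("what is", "").replace("times", "*")
            .replace("plus", "+").replace("?", "").strip())

def answer_math(question):
    expr = translate(question)
    result = safe_eval(expr)      # the external tool does the math
    return f"{question.strip()} The answer is {result}."

print(answer_math("What is 12 times 12 plus 7?"))
# → What is 12 times 12 plus 7? The answer is 151.
```

The point is that the language model never "guesses" the arithmetic; it only sees the interpreter's result before formulating its reply.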

Let me give you some examples of what you might do with the corpus. In our own testing, just to see if this worked, we were actually working on this problem before ChatGPT was launched. So we were aware of this hallucination problem in the earlier versions of GPT, and we were trying to see what we could do to fix it.

But then of course everything blew up after ChatGPT was launched, and the whole world is interested in this now. But one of the tests, and we did a bunch of testing, was this: we took the most popular textbook in subject X, in this case psychology. We chunked it; chunking involves mapping it to the embedding space and sticking it in as the memory in that architecture.
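The chunk-and-embed step can be illustrated with a toy version: split the text into chunks, map each chunk to a vector, and retrieve the nearest chunk to a query by cosine similarity. Real systems use a learned neural embedding model; a bag-of-words count vector stands in for it here, but the mechanics are the same.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(textbook, chunk_size=8):
    """Split the text into fixed-size word chunks and embed each one."""
    words = textbook.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, index, k=1):
    """Return the k chunks most similar to the query."""
    qv = embed(query)
    ranked = sorted(index, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

textbook = ("Jung was a disciple of Freud and later broke with him. "
            "Watson conditioned fear in an infant using a white rat.")
index = build_index(textbook)
print(retrieve("Who conditioned fear with a rat?", index))
```

The retrieved chunk, not the model's open-ended training memory, is what gets jammed into the prompt for the big LLM to answer from.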

And then we took the chapter questions at the end of every chapter, the ones the professors who wrote the book thought were important for student learning. And we checked to see whether the AI could correctly answer these questions, and, more subtly, whether it could answer them using only the relevant information in the textbook, not pulling from the earlier training memory encoded in its connections.

And what we find is that our architecture is very close to a hundred percent. So 99-plus percent of the time it will answer based on the information that's in the book. And it will not introduce, it'll catch itself internally before it introduces, hallucinatory knowledge, or even correct knowledge that isn't in the book.

That is, knowledge not coming from the memory, the corpus. So we are successful in preventing that kind of injection. One of the most subtle things we discovered when testing the psychology textbook: we would ask it about the relationship between Jung and Freud, and that's only covered very briefly in the textbook.

It just says a few things: that Jung was a disciple of Freud, and then they had a break. But there have been whole novels and stage plays written about the relationship between Jung and Freud. And so, left to itself, the LLM would inject all kinds of potentially fictional information about their relationship and what happened between them.

You know, whenever you asked about Jung and Freud. So anyway, it's a very subtle kind of hallucination, but our design actually manages to suppress that kind of thing. And here you can see this example: how did Watson apply Pavlov's principles to learn fears? I didn't even know there was this experiment, in which this crazy guy Watson terrorized some baby to get it to respond neurotically to the sight of a rat.

There's a terrible kind of conditioning experiment that the IRBs of the early 20th century apparently allowed to happen. Anyway, I didn't know this story, but it's in the textbook, and so the AI knows about it. In this example, and in this user interface, if you click that down arrow next to sources, it will actually show you the chunk of the textbook where this is discussed.

So it is a way of accessing some corpus of information. This is a completely new way that you might interact with a textbook in the future. And it works very well.

Here's an example for a travel management company, where humans are doing the work, but they often have to follow some really complicated process in order to perform a task.

Like here, the agent has to prepay for some booking. And for this, the corpus, the attached memory, is basically some kind of knowledge base that this travel company uses, which the humans have to look things up in before they act. If somebody asks them to do something, they have to look it up, read what's in there, and then do it.

Anyway, the AI will just give them very concrete and correct instructions for how to do X if they ask it about that. So it saves significant time for the human agents.

Was there a question that I didn't answer? No. Okay. I'm almost done. Sorry, I know we're going over time. Okay, let me just make some general remarks. There are now many companies competing to build foundation models. These are the huge models which cost tens of millions of dollars to train. OpenAI and Microsoft are allied, and they're competing against Google DeepMind.

And then there are these other, less well known startups, like Anthropic and Cohere, that are doing it. Then there are some big companies in China, Baidu and Tencent, and there are even some national governments that are considering whether they should build their own foundation models.

Our company is working at what we call the application layer. At the moment, our architecture includes an API call to one of the foundation models; that's the big LLM in the lower right of the figure. One thing going on right now, which is super dynamic, is that people are trying to steal, in a sense, the magic or the juice from these big models like GPT-4, by basically forcing GPT-4 to generate lots and lots of examples.

And then using those examples to train open-source models. There are already some academic papers, and other announced results by researchers, showing that they can suck much of the goodness out of GPT-4 or GPT-3 very inexpensively, at least for narrow tasks, and thereby take an open-source model and upgrade it to a similar, though still somewhat weaker, level of performance compared to, say, GPT-3.5 or GPT-4.
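Schematically, this distillation of a strong model into a cheap one looks like the following toy: query a "teacher" for many input/output pairs, then fit a "student" on those pairs. Here the teacher is just a stub function and the student a word-vote table; real distillation fine-tunes an open-source LLM on teacher-generated transcripts.

```python
def teacher(prompt):
    """Stand-in for a strong model on a narrow task: sentiment labeling."""
    return "positive" if any(w in prompt for w in ("good", "great")) else "negative"

# 1. Use the teacher to generate cheap supervised training data.
prompts = ["great movie", "bad food", "good service", "awful day"]
dataset = [(p, teacher(p)) for p in prompts]

# 2. "Train" the student on the teacher's outputs: word-level label votes.
votes = {}
for prompt, label in dataset:
    for word in prompt.split():
        votes.setdefault(word, []).append(label)

def student(prompt):
    """Cheap student model: majority vote over per-word teacher labels."""
    labels = [l for w in prompt.split() for l in votes.get(w, [])]
    return max(set(labels), key=labels.count) if labels else "negative"

print(student("good movie"))
# → positive
```

The student never sees ground truth, only the teacher's behavior, which is exactly why a few thousand GPT-4 transcripts can transfer a narrow capability so cheaply.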

So that's a dynamic thing that's happening. I think there are going to be quite a lot of domain-specific AIs around that do one very specific task, like helping a travel agent, or answering questions from a kid about psychology or US history, and they're going to be superhuman.

They're going to know more about US history than the high school teacher. So you'll have this period where you have some generalist AIs, which may still hallucinate from time to time, and some narrowly domain-focused AIs that don't hallucinate. And that'll be the kind of ecosystem we live with, at least for the next five years or so.

For people who are worried about AGI, I just want to say: there's a thing in computer science called a race condition, where two processes are racing against each other. We're in a kind of maximal race condition right now, where you have every big tech corporation against every other big tech corporation, US versus China, startup versus startup.

Everyone is competing to advance this technology. Already people are experimenting with LLMs that have direct connectivity to the outside world. They can plug it into your email program, plug it into code generation, even have it write code that might eventually modify itself, connect it to robots, quadcopter drones, et cetera.

So if you're one of these people who's worried about AGI and existential risk for humans, this is a very bad situation, because everybody is experimenting with this, competing like crazy, and trying everything. It's a very interesting time.

How far are we from AGI? Well, as I said, I think these things are already superhuman in many respects.

It's easy to conceptualize some near-term advances that are definitely going to happen. We're already giving it memory, reliable memory. And one of the things I didn't mention is that we're experimenting with long-term memory. So you could have a situation where you've been interacting with our AI for a long time, like years, and it remembers everything you said to it and everything it said to you in response.

And so it will eventually become your friend, who lives in your phone or something and knows more about you than anybody else, because you've been asking it for advice or checking things with it for years. Okay. Other modules you could attach to the LLM have to do with planning or goal orientation; I think people are already working on that kind of thing. Our view is that you use the LLM for language and reasoning and some concepts about base reality, but you have to give it other things, like memory and maybe goal functions.

When you combine all these, it's more of a combinatorial innovation, because we roughly know how to do these various things, and it'll take a few years for people to perfect the combination. But eventually I think you're going to get something that's like an AGI. You can talk to it, it has things it wants to do, like maybe help you, it can reach out and send emails for you.

It can control your Roomba, whatever. It's going to be, basically, AGI. Am I worried about this? Yeah, I think there are real concerns. What we want is AGIs that want to help humans. In the long term, I just don't believe that ape-derived brains are going to persist forever on Earth.

I think our brains are kind of crappy, and I think we now know how to build much better ones. And eventually those better ones will take over. As somebody who sometimes works in cosmology: a million years from now, or a billion years from now, I just don't think ape-derived brains are going to be piloting spaceships and exploring the universe.

I think it'll be something much better. One observation, though: the leapfrog that just happened is that these LLMs encode a lot of fundamental concepts and relationships about things in the universe. And they got those concepts from us, from basically reading things that humans wrote.

So in a sense, they are always going to be our descendants. They will always remember us, because their grounding in this conceptual space came from reading our thoughts. So let me stop there.

Creators and Guests

Stephen Hsu
Steve Hsu is Professor of Theoretical Physics and of Computational Mathematics, Science, and Engineering at Michigan State University.
© Steve Hsu - All Rights Reserved