AI on your phone? Tim Dettmers on quantization of neural networks — #41

Steve Hsu: Welcome to Manifold. My guest today is Tim Dettmers. If you are an AI or large language model nerd like me you'll already be familiar with him. He's very, very well known for his work on quantization research, which is used to speed up training and inference using these large language models. He is a PhD student at the University of Washington in computer science.

Tim, welcome to the podcast.

Tim Dettmers: Hi Steve, yeah, thank you so much for having me.

Steve Hsu: Great. So as I was explaining to you before we started, I always try to start the podcast by asking just a little bit about the history of each guest so that the audience can understand them a little bit better as people. So maybe you could tell me a little bit about your childhood, how you got interested in computers, how you ended up in the United States.

Tim Dettmers: Yeah, yeah. I think for me, I have a little bit of a unique story that also explains why I focus on certain kinds of research. What was very significant in my history and my life was that I'm dyslexic. I never really did well at school, and I did so poorly that I was kicked out, so I didn't finish high school. It was so bad that even in math I couldn't do well, because at some point, as you go through the grades, math problems become text problems. It becomes an exercise of understanding the text, asking yourself a question, and then doing the math, and I just had difficulty with that, and so that didn't work out.

And yeah, so that was a significant part of my early life. It also didn't help that I was growing up in the countryside, coming from a small village where there were probably more cows than people. So people were not equipped to deal with dyslexic students.

And it was also not recognized that I was dyslexic. My teachers thought I was slow. It was similar for my two brothers; they're also dyslexic and they also had issues, and they have stories of their own. But yeah, for me, that was a difficult start. I didn't earn a high school degree.

But somehow I still got here, and I'm doing a PhD now at the University of Washington in Seattle. And so the path through that was quite unique. And yeah, there were multiple milestones, and there was a phase where I was exploring; if you're kicked out of high school, you're doubting yourself.

A lot of teachers said I wouldn't amount to anything. My German teacher, so, I'm German, and my German teacher told me, I think what would be best for you is to look for a job that doesn't require any mental capacity. That's pretty harsh, but that was my reality. People just didn't really believe in me.

But there was one teacher that believed in me. And I think that's a quite common story for many students that grew up with a disadvantaged background: there's one person that believes in you and they set up the path for you. And that was a teacher in computer science. He realized that there are some things where I'm maybe a little bit slow, but he also recognized that I have some unique talent and I can really use that.

And so that was quite encouraging. And so I felt like I wanted to do something with computers. That's maybe something I can do. And so I journeyed through a couple of internships that didn't lead to anything. In Germany, two thirds of young people do apprenticeships.

In an apprenticeship, you work part time in a company and part time in a vocational school where you learn about your vocation. And so I tried to become a programmer. In the beginning it was hard, but at some point I was offered an apprenticeship, and I could make a choice between being a normal programmer or a new apprenticeship, which was mathematical software developer.

And I thought, hey, I want to give this new one a try. It will probably be really difficult. And so I started with that and sort of gave it my all. Then I realized, hey, this is actually really easy. And sort of what changed was just the method of instruction was very practical. It was like I had some problems and I needed to program them up.

And that was a very different environment, an environment that I could excel at. And I did. And I had this key experience where in my high school every teacher assumed I wouldn't know the answer to questions, and the vocational school was the opposite: every teacher thought, oh, Tim knows the answer, I don't need to ask Tim.

Let's take someone else. And so that was a very sharp contrast. And that basically led me to believe that, hey, I shouldn't trust how people judge me. I try to evaluate myself and try to do my best. And that mindset was also very important going forward.

And then, yeah, step by step I realized, hey, I can do more. Let's try to get to university. And so I was not allowed to attend university in Germany. But I could find the Open University; it's actually the oldest distance or online learning university in the world, it's more than 200 years old. Usually you get things by mail, you do your work, and then you go to a location to do actual exams, and those are handwritten. And my background there was, I tried different things: I was interested in philosophy, then a little bit of psychology, and then neuroscience.

I actually studied psychology, but I still had the problem of being dyslexic. Then, if I write essays with a computer, no problem at all. But the exams were handwritten. And I couldn't do it. It was too difficult. And so I already set myself the goal that I want to become a researcher. But I couldn't do that because of these handwritten exams.

And so then I decided to switch to Applied Math. Because nobody cares what you write in Applied Math. You just need to have the correct symbols. And then things are fine. And so, that worked out perfectly. So I earned my degree in Applied Math. Then from there I could get a degree and go to a real university.

And so I went to a university in Switzerland, the University of Lugano, and there I learned computer science. In this process, basically when I started, I was already very interested in neural networks and artificial intelligence. I did Kaggle competitions. I realized, if you program custom neural networks on GPUs, they're just very, very powerful.

And so from there I saw, hey, these GPUs will get better, neural networks will become the thing, I need to do this. And so that is where I really put everything aside and decided to pursue this long path towards a PhD to study this carefully. And so in the end, I ended up at the University of Washington, and here, yeah, I do research on efficient neural networks, large neural networks, and I try to make them as accessible as possible.

And so for me, a very significant part of this story is that if the internet hadn't existed, I wouldn't be where I am now. Because I didn't have the most traditional path, it was very important to learn from the internet. And I feel like language models can give something similar to people.

And so for me, it's important that people have the ability to access the resources, to learn, to become good at language modeling, fine-tuning language models, just working with that. And so as the economy changes, they can get the skills to participate at the highest level in the economy. That's my goal, my research, that's what I'm passionate about, and that's the trajectory I took to get here.

Steve Hsu: It's a fascinating story. A wonderful story. I thank you for telling us that. I'm curious, just from the psychological perspective, do you feel you have particularly good insight in kind of low level processes, like in thinking about what the GPU is doing and like that's a special skill. Like some programmers maybe are more verbally talented and stuff, but they don't, they don't have a good sense of what's happening at the low level.

Tim Dettmers: Yeah, so I feel like there are certain things I'm really good at. As a dyslexic, you're not really good at language, but it can give you some advantages, and it's a little bit different from dyslexic person to dyslexic person. Usually detailed processing is not the hallmark of dyslexia, but I don't know, things about efficiency, GPUs, orchestrating data streams through systems to make them as efficient as possible, going through the caches, the parallelism, that clicks, and it works for me. And I feel like there was always a certain interest, like, if you have a system, how can you optimize it?

How does it really run smoothly? And that is one area. And then the other area, and that's more common for dyslexics, is more of this high-level vision: how can you put everything together so you can achieve a high-level goal? And so for me, that is also, I guess, where I'm unique. I do full stack research; I look at the lowest-level things, but then I put these together so that you can have high-level benefits for everyone.

Steve Hsu: Got it. Now, you know, even though you're dyslexic and maybe writing is not your strongest capability, I was very interested in your personal website. Having been a professor for a long time, I once wrote some advice for prospective grad students a long time ago that a lot of students have looked at.

Now, I saw yours, which I thought was really thoughtful. You have, I think, three or four different rubrics for how to think about graduate school and then your later professional career, and obviously different choices of which graduate program you enroll in would maybe optimize one but not the other.

I'm curious, looking back, maybe you could summarize that a little bit for the listeners, and then say whether you've updated your thinking on it. I can put a link to that essay in the show notes.

Tim Dettmers: Yeah, yeah. I quite like writing essays, and I feel like that's one of the best pieces of writing that I've created. And how I structure it is with different perspectives. The first perspective is a very common one. It is a very career-focused perspective and asks the question: if I choose a graduate school, what choice maximizes my career potential? And that might be things like, you want to go to a prestigious school, you want to have a prestigious advisor or a skilled advisor, and you want to have enough resources to do the job. You want to have smart people around you so you can learn from them and collaborate with them. And that is very career focused.

But a lot of people just think about this single column and then they miss out on many other things. Another thing is a little bit personal, but it's quite important, and that's the question of who you want to be. You can go through life and be the most successful person ever, but if you're an asshole, then a lot of people will not respect you, or you will be an example of what not to be.

And so it's a little bit like, who do you look up to? But also more general life choices. Do you want to always chase deadlines? Or maybe you want to say, today's enough work, I want to spend time with my friends or family. That is good for them, that's good for you. And I think part of a good life is a certain balance.

And for me personally, it's important to give back to the community. So work is an important part, but then everyone has to ask themselves who they want to be. And I think that's an important question that's often missed.

And then the last part is a balance between variation and depth. And it's similar to the career perspective: if you're very career focused, you can do career, career, career, but then you neglect everything else in life. And so for me, what was important was a period of breadth where I explored different areas I found most interesting: philosophy, psychology, neuroscience. But in the end, I landed in machine learning, and that was like, hey, I want to do this. But I wouldn't have found this if I hadn't explored. And so you want to broaden your horizon enough that you have a certain assurance, a certain richness of experience, so you can fully flourish.

Maybe there's a unique talent or unique experience that gives you just something really special. But you miss it if you just focus on a single thing like your career. And so it's important to broaden yourself and to a certain degree. But it can also be that you deepen yourself in a particular thing.

Maybe there's a particular hobby you need to spend a little bit more time on. Then you're over a threshold where it just enriches your life so much that all areas benefit. So yeah, it's this balance: there are certain things really important for your career, and you should think about them. But you should also not neglect the rest; you want to become a wholesome person.

What is that for you? And then broaden yourself so you don't miss out on life. You want to have those experiences so that you can live a full life. And I think that makes it a pretty complete package.

Steve Hsu: Yeah, I thought your essay was really insightful, especially since you're still a young, young man. So it's pretty wise advice you're giving to other students.

Tim Dettmers: Thanks.

Steve Hsu: So I think you mentioned Kaggle. I noticed that you said you had a very high Kaggle ranking. Was that about 10 years ago?

Tim Dettmers: Yeah, yeah, it was about 10 years ago. It was a very interesting experience, because I didn't have any community where I could learn machine learning. I wasn't allowed to go to university. I didn't have the resources. Then the question was, how can I learn things?

And Kaggle was the perfect community. And it was also a community where I could experiment with neural networks. So I developed my own neural network framework, ran it on GPUs, and tried to really see what you can do with these things. Do they work? And yeah, for some competitions, they really worked well.

And that was a big success. So yeah, Kaggle was an important chapter.

Go ahead, sorry.

Steve Hsu: As somebody who has been thinking about and working with neural nets now for 10 years, if you look back at what you thought was possible 10 years ago and what actually happened during the last 10 years, could you comment on how your thinking evolved? Or maybe you knew in 2013 we would be here today in 2023.

How did it work out?

Tim Dettmers: Yeah, so as I said before, I studied a little bit of neuroscience. And one thing that I'm particularly interested in is why humans are smart. And in biology, the answer seems to be quite straightforward: we just have lots and lots of neurons, and most animals cannot afford them because they cannot meet the energy requirements.

We invented fire, which can pre-digest food, so we can get the energy to have lots of neurons. And so even then, for me, it was like, our neural networks are not large enough. If we have larger neural networks, we will probably do better. And so I thought, it's just a matter of computation.

Geoffrey Hinton said something similar: that we had the whole recipe needed for neural networks, but our computers needed to be a thousand times faster before they worked. And so the question was, if they're another thousand times faster, what happens then? And for me, it was clear that we would probably get better neural networks, but the question was, how good will they be? And yeah, I wouldn't have envisioned a world like today. Large language models are very impressive. If you look at ChatGPT, that just changed how people perceive large language models, and yeah, they keep improving, and that's very fascinating.

Steve Hsu: Great. So having learned a little bit about you, we can talk a little bit about your work, which is directly related to bringing these powerful neural networks to the world. I think one of the ways you describe your own research program is improving efficiency, more or less at the hardware level, or, well, at some interface between software and hardware: improving efficiency for deep learning, and in particular lately for large language models. And so to explain to the audience, Tim is well known for his work on something called quantization. These models are basically neural networks, which have connection strengths. The model is trained on data, and it develops certain values for these connection strengths, which are encoded in giant matrices. And what Tim has done is find a more efficient way to, in physics language we would say, coarse grain the information that's inside these matrices, making it much more efficient to do calculations with them.

In the field, that is now known as quantization. So in other words, in principle, in my own brain, there could be an almost continuous value of the connection strength between the nodes. When we represent floating point or continuous numbers in the computer, we are limited in accuracy. So maybe we have 32-bit accuracy, which means we represent the continuous value of the connection strength, the entry in the matrix, using 32 bits of information.

But Tim has pushed down, through these quantizations or coarse grainings, the amount of data necessary to represent one of these components of the matrix, down to eight bits or even four bits.

So it's a kind of very clever innovation that has then led to an enormous speed up in how fast the algorithm is executed or how much memory is required by the computer in order to run the model.
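To make the coarse graining concrete, here is a minimal sketch of the simplest flavor of quantization, absmax 8-bit rounding, written in plain NumPy. It is illustrative only, not Tim's production code, which uses more refined data types and block-wise scaling.

```python
import numpy as np

def absmax_quantize_int8(weights: np.ndarray):
    """Coarse grain float32 weights into int8 plus one float scale factor."""
    scale = np.abs(weights).max() / 127.0           # map the largest magnitude to 127
    q = np.round(weights / scale).astype(np.int8)   # each entry now takes 8 bits, not 32
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)        # stand-in for one weight matrix
q, scale = absmax_quantize_int8(w)
w_hat = dequantize(q, scale)
print("4x smaller in memory, max reconstruction error:", np.abs(w - w_hat).max())
```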

So maybe Tim, you could start by just telling us how you got interested in this problem.

Tim Dettmers: Yeah, I think the main interest there is that two passions have come together. One is just efficiency. It's just very interesting to think about: if data flows through the system, how do you make it more efficient? The second part is really about giving people the ability to do more work, especially the people that have the least resources.

And so if you have these large language models, if you're a big company like Google, you were able to fine-tune them, you were able to run them. But if you're a student with a GPU box under your desk, that's not quite so easy. So the main thing was also how we can enable access to these large language models for everyday people, so they can actually work with them.

They can gather experiences, they can adapt these models to their own specification, personalize them, and then with that experience, they can learn to build things with these models. And so that was the main motivation there.

Steve Hsu: So I think for listeners who are not in the AI space, maybe they have this idea that OpenAI is out there, and these other billion-dollar startups and Google and Microsoft, and only they can build these enormous models with a billion connections or a trillion connections. But the truth is that, thanks to the work of people like Tim, a lot of individual developers, almost hobbyists, can maybe buy a GPU, or maybe they don't even need a GPU, and they can actually do significant things with these models at home, or by buying some inexpensive cloud compute.

So, for the listeners, that's the situation that we're in now. And there's a kind of Cambrian explosion of innovation, because so many people can experiment and try to build things with these models and run the models.

I wanted to ask you, Tim: because you had already worked with neural nets a lot, did you already have the feeling at the beginning that the precise value of the matrix element, the connection strength, wasn't that important? That, you know, it could be coarse grained a little bit without losing performance. What were your thoughts about that before you started experimenting with quantization?

Tim Dettmers: Yeah. And that was one thing that was pretty apparent from the get go. If you look at computational science, where you do sort of physical simulations, precision is quite important. And a lot of people want to work sort of in 64 bit precision. If you have a simulation of like fluids or something like that you need to sort of approximate a function over time.

If you make errors early, then you end up on the wrong trajectory, and so precision becomes really important. But what we found with these neural networks is, if you use 32 bits, they're just fine. Then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks, and then it's just fine.

And now we find you can go to four bits. For some networks, that's much easier; for some networks, it's much more difficult, and you need a couple more tricks. But it seems they're much more robust.

And how to think about it is: neural networks are quite redundant. They have features which are essentially correlation detectors, and they detect things like, what is the correlation between what I've learned before and the inputs? And if these correlation detectors have a high value, they say, oh, I've detected something, I've detected a cat or a dog in this picture. And you have lots of them in parallel. So it means that if you degrade them probabilistically, it's not like the fluid simulation, where if you're off track you go in the wrong direction. You can cross-correct with all these different features, because each has a little bit different course correction.

Then it averages out and it's quite stable. And that's, I guess, a little bit similar to the brain. A little bit of brain damage is not too bad, but if an entire brain region dies, you're in trouble; you can recover over time, but yeah. And I think that was apparent from the get-go, but the question was, how low can you go?

At some point, there is significant information loss, and then it's about thinking of clever ways to figure out what information is important, where it is stored, how to preserve it, and how to use the properties of the neural network itself to make sure it can still do its job with as little precision as possible.

Steve Hsu: I think for people who have actually done neural net training or just any kind of AI algorithm training, you know, you have an objective function and as you train, you're doing better and better on the objective function. You know, you can see regions where, okay, maybe I train a little bit more and I notice the parameter values are changing because of the training, but it's not really getting that much better. And so then that suddenly leads to the intuition that some small perturbations into the specific values are not so important for the overall performance. And so that suggests that some level of coarse graining could probably work.

Now you develop specific tricks. So one of the interesting things, which might be a little bit hard to convey on a podcast, is maybe to look at Tim's technical presentations, which I'll link to in the show notes.

But he has various clever tricks for how to encode a range of potentially continuous values most efficiently, given only four bits or eight bits to encode them. And so he has some bits which affect an exponent and other bits which are maybe the coefficient of that exponent.

And so it's a kind of clever thing. I'm curious, Tim, if you step back: for a particular choice of data type for representing the original floating point number, what are you actually optimizing? Is it the regions where the value is most probable, or is it accounting properly for outliers without losing them?

What are the kinds of trade-offs that you have to deal with?

Tim Dettmers: So in the end, it's about information. If you make a simple assumption and say each bit in your network contains an equal amount of information, then the question is: if you compress these bits of information, how can you preserve most of the information?

If you look at quantization, it's like histogram bins. A histogram is very similar to an integer quantization: for example, if you go from 32 bits to four bits, it's a histogram with 16 different bins. That's how you can imagine it. You have an input distribution of all your different values, and now you quantize it to 16 different values in a histogram.

That's a four-bit integer quantization. And now, the information density that you preserve is approximately how well the bins of that histogram are filled. You can imagine a histogram with an outlier, and that's a big problem in neural networks, because the histogram has equal slices between all values.

If you have one outlier and then no values over a certain stretch, then all those bins will be empty. And then where the values start, the histogram will again have bins filled with values. And so each bin that is not filled with values is lost information.

So if you have a four-bit quantization, and you have a histogram with an outlier that makes half of the bins empty, that's equivalent to a three-bit quantization. One bit less: you lose one bit of information. And so the entirety of quantization is thinking about how to preserve information when you compress, and some of it is filling these bins. But that assumes each value is equally important, and what we know is that very large values in neural networks are much more important.

So if you have a small weight, it's not as important as a very large weight. That's not entirely correct, but in most cases it's a good approximation. And so that means that very large outliers are extremely important. They need to be preserved. That's different from many engineering disciplines, where you have a noisy process and an outlier often means, oh, this is just a bad value, it's a measurement error, and you throw it away.

In a neural network, it's not a measurement error. It means that one of these feature detectors, these correlation detectors, detected a very strong feature at hand. It really knows there's a cat, and so you need to preserve that information, otherwise it's lost. And so it means both making sure that you account for all the information, but also accounting for the outliers that contain the most information, and both of these things together make a good quantization that is precise.
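A toy numerical illustration of that outlier effect (my own sketch, not Tim's actual method): with 4-bit integer quantization there are only 16 evenly spaced levels between the extremes, so a single large outlier stretches the range and leaves the bulk of the values crammed into just a few bins.

```python
import numpy as np

def int4_levels_used(values: np.ndarray) -> int:
    """How many of the 16 possible 4-bit levels does the data actually occupy?"""
    scale = np.abs(values).max() / 7.0                       # signed 4-bit range ~ [-8, 7]
    q = np.clip(np.round(values / scale), -8, 7).astype(int)
    return len(np.unique(q))

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, 10_000)                       # typical weight-like values
print("levels used, no outlier:  ", int4_levels_used(weights))
print("levels used, with outlier:", int4_levels_used(np.append(weights, 60.0)))
# With the outlier, almost every normal-sized weight collapses into one or two bins,
# which is the information loss described above.
```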

Steve Hsu: So, when you quantize, I think in some of the data types you use for the quantization, you privilege zero, right? You make sure that zero is one of the options. And so if you use that kind of data type, you then end up with sparse matrices, right? Because you have a lot of zeros in the matrix.

If you know a priori that the matrices you're dealing with are going to be sparse, are there special libraries or algorithms that make dealing with sparse matrices faster? And are people starting to make use of that kind of thing?

Tim Dettmers: Yeah. So sparsity is like a very interesting topic. I'm very curious about it. I did a little bit of research on it and I'm still very interested in it. It's a very hard topic because if you sort of have uniform sparsity, it's very difficult to use modern hardware to basically get sort of efficiency.

So usually in modern hardware, you want to read a segment of memory. But now if you have sparse memory, then there's some filled values, then some zero values that you don't care about, then some filled values, and it's not efficient to load sort of small segments, you need to load the entire segment, and so that makes, makes it difficult to utilize sparsity.

So I worked a bit on sparsity, but quantization is much easier if you want to utilize the hardware well. That being said, if you have the right structure, if you can structure the sparsity in a certain way, it can be very efficient. One thing that I also worked on is mixtures of experts, and these models structure the sparsity by saying, I have different blocks of experts, and some blocks are just zero. So I throw away those experts and I just route the information through a single expert or two experts. And this is efficient for memory, because now you have large chunks that are zero and very large chunks that are represented by values. That is a better way of dealing with sparsity, but this element-wise sparsity is very difficult.

We actually have a paper that we didn't publish; my co-authors were quite discouraged. We tried a lot of different sparsity techniques, the sort of uniform sparsity, and we tried to use very smart algorithms. None of them worked. In the end, the only thing that matters is how many parameters you have. It doesn't matter how you arrange them in a certain sparsity pattern. And so sparsity is very difficult, and difficult to take advantage of.

Steve Hsu: You know, in the computational genomics work that I've done, we often end up with sparse predictors, because even though there are three billion base pairs in your genome, for a particular complex trait it's a very small subset of those, even though the absolute number could be large. Height is controlled by about 10,000 common variants, but 10,000 out of 3 billion is really small.

So the predictor is super sparse. And there are many, many ways that you can exploit that to speed up the computations in, in genomics. I was interested in sparsity for a long time, and I gave a talk at OpenAI. I think in 2018 or 2019, maybe 2019, and was talking to them about, I think at the time it was like GPT 2 maybe.

And I kept asking them, I said, well, couldn't you make this run faster if you just sparsified it? Because surely a bunch of those small values are just not doing anything, right? They're just statistical fluctuations. And I don't know to what extent they took my advice to heart, but it was an interesting conversation.

Tim Dettmers: Yeah. So if you look at the sciences in general, the sparse matrix multiply is an extremely important operation; in all kinds of sciences it is one of the most important operations. In neural networks, it's a dense matrix multiply. And you see it also in supercomputers: most supercomputers that work on scientific problems utilize only about 3% or something like that, because it's inefficient to do sparse matrix multiplication.

But it's still much, much more efficient than doing a dense multiply for these problems.

Deep learning is a little bit different. And so, yeah, it's very difficult to optimize. OpenAI did actually, maybe they took your advice, they did actually set up some sparse attention.

But it was also sort of difficult to see the benefits. Often you see basically that you get a 50% efficiency in terms of computation, if you do it right.

But then also the performance drops by about 50%. So you, you are where you started basically. And it just makes the process more complicated.

Steve Hsu: So I don't know if this is a well posed question, but let's suppose you take some LLM off the shelf and you do a four bit quantization of it. What fraction of the matrix elements or connections are then set to, if you privilege zero as one of the possibilities, what fraction ends up just being set to zero?

Tim Dettmers: So it depends what data type you use. The data type I developed is information theoretically optimal; then you have an equal amount of zeros compared to other values. But if you look, for example, at floating point quantization or integer quantization, you will have many more zeros.

It's not extremely imbalanced; you might have two to four times more zeros than other values. And you can actually take advantage of that by compressing further with something like Huffman coding. But it's not a super large fraction, and this is for the weights. There are other parts of neural networks that are much, much sparser as you scale up. Neural networks have nonlinear activation functions, and often these activation functions are close to zero at certain points and linear at other points.

And at the points that are close to zero, almost all values are zero; somewhere between 97 and 99% are zero.

So that is actually a regime where you can take advantage of sparsity, but if you just look at the weights, they are not that sparse.

Steve Hsu: I'm going to ask you one more super nerdy question, at the risk of losing some of my audience, and then we can go back to the practical consequences of your work. I think you said in your talk you had some interesting result: if you have the prior that the weights, or whatever numbers you're trying to compress, are drawn from a normal distribution, then there's some theoretically optimal data type which relies on that, and you represent it in terms of normal functions or something. Is that correct?

Tim Dettmers: That's correct. That's correct. Yeah. So, if you have integer quantization, you have a normal distribution, how you quantize it is you make slices. So if you have 16 different values and 4 bits, you make slices equal width. If you want to do the information theoretically optimal, you do slices with equal area.

That means each of these bins have an equal amount of values in them. So the area in the normal distribution represents how many values you have basically in this interval. And so that, if you slice them up in that way, each bin will have an equal amount of values.

And then you have this property: if you have an incoming random number from the normal distribution and you try to predict which bin it will land in, you cannot do better than chance, because each bin is equally likely, and that's information theoretically optimal.
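A small sketch of that equal-area construction (simplified, and not necessarily the exact recipe of the published data type): place the 16 levels of a 4-bit code at the quantiles of a standard normal distribution, so each bin is hit with equal probability by a normally distributed weight.

```python
import numpy as np
from scipy.stats import norm

def equal_area_levels(num_bits: int = 4) -> np.ndarray:
    """Quantization levels at the midpoints of equal-probability slices of N(0, 1)."""
    k = 2 ** num_bits                          # 16 bins for 4 bits
    midpoints = (np.arange(k) + 0.5) / k       # each slice carries probability 1/k
    return norm.ppf(midpoints)                 # inverse CDF turns probabilities into values

def encode(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Store only the index (4 bits) of the nearest level for each value."""
    return np.abs(x[:, None] - levels[None, :]).argmin(axis=1)

levels = equal_area_levels()
weights = np.random.randn(8)                   # assumes weights are roughly unit normal
codes = encode(weights, levels)
print(levels.round(2))
print(codes, levels[codes].round(2))           # reconstruction from the 4-bit codes
```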

Steve Hsu: I see. So if you know the, if you have some information about the prior, about the distribution that you're drawing from, then you can optimize things kind of perfectly if the prior is correct.

Tim Dettmers: Yes. So one approach is that you weigh each value equally. But we know that larger values in neural networks are just more substantial, so you actually want to allocate more bits to larger values.

It's a little bit like Huffman coding: it is information theoretically optimal, but it's not the best compression technique. There are better compression techniques that are not theoretically optimal, but they account for different patterns. And yeah, it's a similar principle.

Steve Hsu: Okay, so let's switch gears to practical stuff. Let's start with inference. For the audience, inference means someone else maybe trained a model and gave it to you, and now you're using it. You're using it to generate results that your customer needs, or, you know.

And so you're running the model, and Tim has all kinds of results on this. Someone else trained the model, maybe using very expensive 32-bit floating point numbers, and I want, in a sense, to shrink it down so I can run it on cheaper hardware or very fast. And he's got results going all the way down to eight-bit or four-bit inference.

Talk a little bit about the speedup. There's an advantage in the RAM requirements and also in the actual speed of execution.

Tim Dettmers: So, yeah, the RAM requirements, this is quite important if you want to use the largest models that are currently out there. A couple of days ago, Llama 2 was released, which is a very powerful open source model. If you want to use the largest version on consumer GPUs, you will need five consumer GPUs.

And a standard consumer setup fits between two and three GPUs; if you have some custom extenders, you can fit four GPUs, but you cannot use five. So it's impossible to run it on a consumer setup in the normal setting. If you want to run it, you need to quantize it. It's just a requirement.

You need to compress it down in some way. Quantization is the most effective way to preserve the performance but make it smaller. And so you need to run at least 8-bit on a consumer setup. But if you run 4-bit, you can actually fit it on two consumer GPUs. And that's still a pretty expensive setup, but it's a doable setup.
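A quick back-of-the-envelope check of those numbers (my own arithmetic, assuming the roughly 70 billion parameters of the largest Llama 2 and 24 GB consumer cards, and ignoring activations and the KV cache):

```python
# Weight-only memory footprint of a ~70B-parameter model at different precisions
params = 70e9
gpu_gb = 24                                 # memory of a high-end consumer GPU (assumption)

for bits in (16, 8, 4):
    weight_gb = params * bits / 8 / 1e9     # bits -> bytes -> gigabytes
    print(f"{bits:>2}-bit: ~{weight_gb:.0f} GB of weights, "
          f"~{weight_gb / gpu_gb:.1f} x 24 GB GPUs just for the weights")
# 16-bit needs several cards, 8-bit roughly three, 4-bit fits on about two.
```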

So if you're a hobbyist, you can actually use the best model available out there right now with 4 bit quantization. And so that's one contribution. And then the other is sort of how fast can you go?

And there we are really good. Go ahead.

Steve Hsu: Quick question. So in that example, is the cost down to like $5k for those two consumer GPUs? You build a rig at home, you spend $5k. Those are GPUs that, I guess in the pre-crypto era, it was mainly gamers who were using, right? But they're like a thousand bucks or a couple thousand bucks.

And then you can run Llama 2 fairly efficiently with that setup. Okay, good.

Tim Dettmers: So, with four bits, it's also more efficient. In this setting, it costs about $5,000. If you buy the previous generation, I think you can get it down to about $4,000, but these GPUs with a little higher memory, you need 24 gigabytes, are quite in demand, even after the crypto bust, so to speak.

Now a lot of AI people want them, and the GPUs are valuable, so even used old GPUs are quite expensive. You need at least $4,000 to run the largest model, but you can do it. And then for the speedup, there are different settings, but with this consumer setting, you will be able to get about eight tokens per second at the moment.

And if there are some software updates, you might be able to get a little higher speed. This is for a setting where you have a single prediction, a single generation. So you give the language model, for example, a question, and it generates a response word by word, and you get about eight words per second right now.

That is one part of inference, and it's mostly dependent on how large the memory footprint of the model is. So if you have a 16-bit model, then you only generate at a fourth of the speed; it might be about two tokens, two words per second. So it's much slower. So you make it able to fit on consumer devices, but you also make it faster.
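The speed numbers follow from the same footprint, because single-stream generation is roughly memory-bandwidth bound: every new token requires streaming the whole set of weights through the GPU once. A rough upper-bound estimate (illustrative bandwidth figures of my own choosing; real systems land well below the bound because of kernel and communication overheads):

```python
# Upper bound on tokens/second when generation is limited by weight-streaming bandwidth
params = 70e9
bandwidth_gb_s = 900.0        # ballpark memory bandwidth of one consumer GPU (assumption)

for bits in (16, 4):
    weight_gb = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{bandwidth_gb_s / weight_gb:.0f} tokens/s upper bound")
# The 4-bit model streams four times fewer bytes per token, hence the roughly 4x
# speedup over 16-bit mentioned here; overheads bring the absolute numbers down.
```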

There's also a different world, which concerns more the companies: when you have lots and lots of queries at the same time, then memory becomes less of a bottleneck than other things. But yeah, that's a little bit of a different topic. Happy to go into the details there if you're curious.

Steve Hsu: So a question that I'm sure the people who invest in stocks who listen to the podcast will want me to ask you: currently NVIDIA is valued at, you know, on the order of a trillion dollars, on the assumption that there'll be a lot of demand for their, is it the A100 that's the latest and greatest, or the H100?

Yeah. So tremendous demand for sure for model training, but maybe even bigger demand for those devices for model inference, if at scale people really start using AI for useful purposes. Do you think that people who are investing fully understand the consequences of your results? Like, are you thinking those guys are overestimating because they're not taking into account the efficiency of quantization?

Tim Dettmers: So I think there are a couple of issues. One is with quantization. As I said, hardware is very complex; it is optimized across very different axes, and one important axis is training. If you do inference for a single person, the only thing that you need is high memory bandwidth throughput.

And if you look at AMD GPUs, they give you very high value for that, and you don't need expensive tensor cores and all the magic that goes with it. You just need a cheap GPU that has very high memory bandwidth. Software is another issue: you might have the right hardware, but then you also need to be able to use it.

And all the software support, all the community support, is a little bit of a different issue. But using models on a personal level, or even on your phone or your personal laptop, doesn't have such strict hardware requirements. So we will see a lot of different hardware where you can run these models very efficiently, and NVIDIA doesn't have the largest benefit there. A lot of companies will be just as fine as NVIDIA.

The only advantage right now is that the software support is a little bit better, the community support is a little bit better, but AMD GPUs are pretty well supported, and Apple Silicon gets better and better support, and you have lots of users, and so that will be very interesting.

I think iPhones will soon run language models and it will be quite common. And it's pretty good hardware. So all in all, NVIDIA doesn't have the biggest advantage for inference; it's mostly training, and then software and community support. But yeah, other companies have no real drawback.

Steve Hsu: Yeah, that's interesting. So I think if I talk to the modal hedge fund guy who's really bullish on NVIDIA, I think what you just said is probably not incorporated into their thinking. So it would be interesting to go back now and look at their analyst reports for the projected sales of these H100s, you know, how much of that is supposed to be for inference versus model training. And probably they're incorporating a lot of inference you know, usage in the future as well. So that's interesting.

So let's switch and talk about model training. You have a paper on something called QLoRA. LoRA stands for Low-Rank Adaptation, and the way I think about it is: you have a neural net with billions of connections, but you insert some low-rank objects in there, which can modify, in a sense, some of those matrices.

And training just the components of those low-rank matrices is relatively inexpensive. And yet, I guess some researchers have found that you can adapt your originally trained LLM quite effectively using these low-rank modifiers. And what your paper does is study LoRA, but using quantization. I guess you quantized both the original model and the low-rank matrices? Or are you quantizing everything? Is that right?

Tim Dettmers: So for the purpose of training or fine-tuning, we just quantize the base model, because that's the very expensive part; the low-rank adapters, to ensure training stability, are still in higher precision.
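A minimal sketch of that arrangement in PyTorch (simplified, and not the actual QLoRA or bitsandbytes implementation): the base weight is frozen, here stored fake-quantized to 4-bit levels, while two small higher-precision matrices are the only parameters that receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen (fake-quantized) base weight plus trainable low-rank adapters."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        base = torch.randn(out_dim, in_dim) * 0.02
        scale = base.abs().max() / 7.0                        # crude 4-bit absmax stand-in
        self.register_buffer("w_base", (base / scale).round().clamp(-8, 7) * scale)
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # higher precision
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))         # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen quantized path plus small trainable low-rank correction
        return x @ self.w_base.T + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(16, 16)
out = layer(torch.randn(4, 16))
print(out.shape, [name for name, _ in layer.named_parameters()])  # only lora_a, lora_b train
```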

Steve Hsu: Okay. And, and what you found is that again, fairly aggressive quantization still allows for good results. Is that right?

Tim Dettmers: That's correct. Yeah. So the hallmark is always: if you can replicate the full precision or 16-bit precision performance, then, you know, it probably works in most cases. You can just apply it and you'll be fine. If you have some small degradation, it often means, okay, I have a degradation here; how much degradation do I get on other tasks?

So if you get the full performance, that's often a very good signal that it just works. And it seems to just work. We tried it across like a thousand different experiments, different data sets, different models; you always get the same results. And this technique is about 17 times more efficient in terms of memory.

So you can take a big model like Llama 2, and as I said, now we have a $4,000 or $5,000 setup, and we can actually personalize it. You can take the model, fine-tune it on your data, and do very interesting, curious things. You can build chatbot models that mimic a person, that mimic a podcast or a conversation.

You can have all kinds of different things; your imagination is the limit. And so, yeah, this really enables not only the usage but the personalization of models, and you can do it on your own consumer hardware.

Steve Hsu: And have you, have you done any research at all on the actual training of foundation models after quantization?

Tim Dettmers: Yeah. So I'm quite curious, as I said, about sparsity, and for about a year I was very interested in mixtures of experts; I still am. And with mixtures of experts, the question was: how do you train well with them? So again, a mixture of experts is a setup where you have a neural network and parts of the network are shared, or you also have completely separate neural networks. Now you want to route information to experts. So you have a generalist layer and expert layers. And if you have, say, a hundred experts, but you only route to two experts, you save 50% of your computation. So that can make training much cheaper. And so that research was really about how to do it more effectively.
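As a toy illustration of that routing (my own sketch, not the specific published methods discussed next): a small gating network scores the experts for every token and only the top two are evaluated, so the remaining expert parameters are skipped for that token.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Route each token to its top-2 experts out of num_experts feed-forward blocks."""
    def __init__(self, dim: int = 32, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # pick the 2 best experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 32)).shape)   # (10, 32): only 2 of the 8 experts run per token
```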

The research that I did didn't quite pan out, but I worked with some colleagues on mixture of experts approaches, and we have two different approaches that we published. One is BASE layers, and there you actually use optimal transport, the Sinkhorn algorithm, which also models heat flow, and we basically model, if you think about it, the information flow: how you route different words to different experts.

And so certain experts are better at certain words, but there's also contention; there might be some experts that want to get all the words, but you need to distribute them equally. If the algorithm distributes the words well, you get very efficient pre-training.

The other approach is one where we have fully disconnected training. It's more like training an ensemble: you do a little bit of pre-training on a shared model, then you branch out, you treat each branch as an expert, and each expert gets its own data set that is a little bit different from the other data sets.

And now these experts specialize, but because they start from a shared representation, they're still close enough in optimization space that you can actually combine them. So now you can do inference and combine certain experts if you have certain data sets. And that works really well and is more efficient than a regular transformer.

And then the last part that I worked on was sort of globally distributed training. So data centers are really, really expensive. A big part is you need to have all these GPUs and computers, but then you also need to have the networking between everything. Networking can be extremely expensive. You need to have, like, lots of big power sources, a special building, very expensive.

So what if you could take lots of consumers with their consumer GPUs under their desks and parallelize them to train a very large neural network? And there the problem is your network is not really fast; you have normal Ethernet. And so that research was really about how we can compress information and structure the training.

You need very little communication, the communication is very efficient, and the communication overlaps with computation, so the latency is hidden and you can always compute on your GPUs. That's called swarm parallelism. That is also very effective. And so these are the works that I did in pre-training.

Steve Hsu: Do you think we'll actually see that? Do you think we'll see state of the art or state of the art quality foundation models that are trained through kind of crowdsourced home resources? Is that already happening? And maybe I'm, I'm behind the times.

Tim Dettmers: So it's currently not quite happening. There's a company called Together, and they have a similar approach. As I understand it, they use a lot of compute that's currently not used at universities, and then they come together and build some things of their own, but they also say to the universities, hey, we have this efficient infrastructure; if you want to train a big model, you can now use the resources across all universities to train this bigger model.

And so the problem with these approaches is that they are only as good as the number of people you can get to collaborate. It's a collaboration problem, in a sense. And so with large institutions like universities, that's more feasible.

But with consumers, it's sort of a rich-get-richer problem. You need to get it jump-started to really get it going, but once it is going, you can imagine training models like GPT-4 on globally distributed networks. Some people, particularly in the AI safety community, find that scary.

Hey, now people can train powerful models that might be quite dangerous. And so it's sort of a double edged sword.

Steve Hsu: In this system, not only are you using Ethernet, but you're actually using the internet to transfer across nodes, right? And is that because you have very efficient ways to compress things before you send them? Yeah, that's really interesting.

Tim Dettmers: Yeah, there are a couple of innovations. One is compression and quantization again. And the other is the algorithm itself. The internet is noisy. You have disconnects. You have lag. Some connections might be slower; they may be fast one second and slow the next second. You need to take account of that.

Some computer might have a bug, or it overheats and needs to throttle everything, or somebody next door is downloading something on the network, and so you need to account for all of that. We solve this with stochasticity: the system is very stochastic, and if something drops, the probabilities are adjusted over time. And this works really well; stochastically, everything is annealed to the right distribution, the right networking, and then everything adjusts by itself.

So it's compression, and the stochasticity that adjusts itself. And then the last piece is developing algorithms where you overlap computation and communication. We use an approach where we send out all the updates, but then we already start the next training batch.

And so once the update data comes in, we calculate the update. And once the computation that is currently running finishes, we apply the update, then we compute the next one. That means we have a stale update on the weights; it's delayed by one step, but that allows us to overlap everything.

And so we did experiments on this, and it seems that if you have a delay of a single step, that's just fine; you can train fine. Again, it's this phenomenon that if you have a little bit of noise, it's fine, and if you have lots of noise, things break down. You cannot go to 1-bit quantization, that's very difficult, but 4-bit or 8-bit works. And similarly, if you have a little bit of noise in the update, that's just fine. Your networks can deal with it, and we exploit that.
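A schematic of that one-step-delayed update (heavily simplified, leaving out the compression, routing, and fault tolerance of the actual swarm parallelism work): the gradient computed at step t is treated as still being in flight over the network and is only folded into the weights at step t+1, which lets communication overlap with the next batch's computation.

```python
import numpy as np

def train_with_delayed_updates(steps: int = 6, lr: float = 0.2):
    """Toy loop on a quadratic loss: apply each gradient one step late."""
    w = np.zeros(4)
    in_flight = None                         # gradient still traveling over the network
    for t in range(steps):
        if in_flight is not None:
            w -= lr * in_flight              # fold in the update that has now arrived
        grad = 2 * (w - 1.0)                 # gradient of ||w - 1||^2, the toy objective
        # In the real system this gradient would be compressed and sent to peers while
        # the next forward/backward pass already starts; here we simply queue it.
        in_flight = grad
        print(f"step {t}: w = {w.round(3)}")
    return w

train_with_delayed_updates()                 # w drifts toward 1 despite the stale updates
```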

Steve Hsu: Wow. So is it possible that if, if they shut down the crypto Ponzi scheme, then all the mining rigs that are around could actually be reused, I mean, repurposed for this kind of algorithm.

Tim Dettmers: So, theoretically, yes. The other thing, and that's a little bit more of a hardware-specific point: crypto rigs usually have lots and lots of GPUs, and if you have an entire GPU system, there are certain bottlenecks. But actually, when I now think about it, the bottleneck is still just the internet, still just Ethernet.

So actually mining rigs should be just fine. You just need to configure them in the right way, and you should be able to do updates just fine with this sort of parallelism.

Steve Hsu: Wow. Very interesting. So after the U.S. Securities and Exchange Commission clamps down on a lot of these guys, there might be a lot of spare capacity for what you want to do.

Tim Dettmers: Yeah. I guess it's a little bit of a meme, all the crypto bros and all the AI bros. But yeah, this would be another level. They would need to really come together as a community to train something with a special algorithm, special software. Theoretically possible, but in the end it's a coordination problem.

Steve Hsu: Great. So we're coming up on an hour, so I just want to do maybe two more things and then I'll let you go; I don't want to use up too much of your time. One of my questions is: what is your baseline projection for what we're going to see on, say, a three-year timescale, a five-year timescale?

Any thoughts?

Tim Dettmers: Yeah, three years and five years, that's quite some time in AI; everything's moving so fast. But what I can say is that what I'm seeing is that neural networks get good enough that you can do stuff with them, but they are also not so reliable that you can just deploy them; you get very annoying user experiences and people don't like it.

Hallucinations are a big problem. And so there are some challenges, but when I look at it, I orient myself quite a bit by scaling laws, which project which variables are important as you scale certain dimensions. And there are scaling laws for fine-tuning, which means you use a big model.

Now you fine-tune it on a specialized data set, and then you can predict how well it does. And if you play around with these equations, it seems to indicate that we are in a regime where fine-tuning is very powerful. So what you want to do is take a very large model and specialize it to a very narrow task.

And then it does quite well on that narrow task, maybe so well that people are actually not annoyed by its performance. Maybe it doesn't hallucinate as much. And there are other things, like information retrieval, that can help with that.

But when I look at the future, what I see is what you will have is one base model, probably quantized and very efficient, but then lots of different adapters that are specialized for very particular use cases.

And what will happen is, you get particular inputs, and to ensure reliability, they're routed to the expert, to the adapter that can solve this task within a boundary of specification that is good enough to ensure a good user experience, or good code generation. It can solve a subtask with enough quality that you can say, this is good enough that we don't need a human in the loop or manual fine-tuning. As soon as you introduce manual fine-tuning, it's not very efficient.

And so when I look at that, I think that is what the future will be. I think that will come very rapidly.

If I look at three to five years, another thing that I see is that hardware doesn't get much better anymore. It's very difficult to get improvements. You can still get improvements by just using more GPUs, and that is still feasible, and if you have an efficient algorithm like the swarm parallelism that I discussed, you can have larger and larger clusters. So I could imagine that Google and OpenAI will use a million GPUs in the future.

But it will also reach physical limits. You can only exchange information at the speed of light, and that is the limit. And if our GPUs don't get faster, we will soon be bottlenecked by networking. So a question will be: in three to five years, will we hit models that are good enough to solve the problems that really matter, to make AI efficient?

If you look at the productivity paradox in computing, it's like computers didn't improve productivity in the beginning. So a lot of small pieces needed to come together to sort of optimize the entire process, the entire pipeline. And so that will be the curious question.

So I think a big focus will be on reliability. Can you get these models to do a narrow task very well, a task that is valuable enough that we really want to automate it with AI? But then also, do our AI models get good enough that we can actually use them well? And I think we will know that answer probably in three to five years.

I think physical limits are a little bit further out, but it's not that much out, so I think in 10 years we will probably hit physical limits and then we have what we have. Yeah, that's, that's how I see it.

Steve Hsu: Well, you know, Tim, I like your answer, because this idea of trying to build narrower AIs that, using fine-tuning and information retrieval, can actually do narrow, economically valuable tasks, that's the thesis of our startup. So you're thinking along the same lines that we are for what we think the next big impact from AI will be.

So I like that. I noticed you didn't say anything about AGI. Some people who are either doomers or, I would say, very optimistic about the rate of technological advancement would say, oh, three to five years, Steve, by that time we'll have already hit AGI. And it sounds like you don't think that's very plausible.

Tim Dettmers: Yeah. So, again, I'm a fan of neuroscience. If you compare different animals, there are orders of magnitude of difference between dumb animals, smart animals, and then humans. And humans just have lots and lots of computational power. It's unparalleled; it's not comparable to all the supercomputers that we have today. A single brain is much more powerful still.

And if I look at hardware, it will not improve much anymore. So if I put these together, then we will not reach human-level processing power, and then the question becomes, can you get a very intelligent system with less computational resources than a human brain?

And I think if I look at AI that uses backpropagation, which the brain cannot use, it is more efficient, and so we might be able to get there. But what I also see is that GPT-4 was very expensive and it's still very flawed. It's very good at things that you don't know much about, but if you ask it a narrow, expert question, it gets lost. It doesn't know what it's doing. And you have this last mile problem that you have in self-driving cars; you have a similar problem with AI. And so I think what we will see is very powerful AI tools that can do a lot of tasks much better than humans, but at some tasks they will just not reach our capability.

And this will be very powerful, but it will not be AGI. They will not be able to do everything as well as we do. Humans are amazing, and I don't think we quite get there.

Steve Hsu: Yeah. You know, it's funny cause you and I didn't, to the audience, Tim and I did not coordinate on this or anything. But my view is very similar. In three to five years, we'll see narrow applications where the AI is clearly superhuman in that narrow domain, but we won't see superhuman generalist AIs yet in three to five years.

Beyond that, it's a little hard for me to speculate, but it seems like Tim and I are roughly the same in our calibration.

Great. So, any last thing? Anything I didn't ask you about that you think is super important for people interested in AI to understand? We can take the last few minutes for that.

Tim Dettmers: Let me see. I think we touched on a lot of different points and it was very broad. There's nothing in particular. Maybe one thing, and it's something I see a lot.

I'm a PhD student, I work in academia, and there's often this feeling of being lost. There are these large companies with these big models, and it costs hundreds of millions of dollars, and people feel like, oh, I can't do that.

I don't have the skills. But from my personal experience throughout life, and also from what I see now in technology, we can use these models. We can personalize them. We don't need to fine-tune and train the entire models from scratch, which costs a hundred million dollars; you can experiment with them. If you have $4,000, you have a computer you can use for years to experiment with these.

And it's very cheap. Even if you use the cloud, you can do very interesting experiments for like $100 or so. So it's not inaccessible. It actually opened up; it can be an explosion, as Steve said. It opens up so many possibilities, and people aren't quite aware of it.

Some people are very pessimistic. I'm very optimistic. I think we can do so much more than we could do a couple of years ago. And that's very exciting.

Steve Hsu: I agree with you 100%. I think people will be surprised at how much innovation comes from the grassroots level and not from the mega corporation. And I want to thank you, Tim Dettmers, for helping to make that a reality.

Tim Dettmers: Yeah. Thank you so much, Steve.

Steve Hsu: Great. Well, it's, it's been great having you and maybe we can have you again sometime in the future.

Tim Dettmers: Yeah, I would love that.
