Polygenics and Machine SuperIntelligence; Billionaires, Philo-semitism, and Chosen Embryos – #102
Steve Hsu: And so for example, there was a super wealthy Chinese guy, Xo Bo, who was profiled in this Wall Street Journal article. Now he reportedly has something like a hundred or more sons that have been born through surrogacy and IVF.
Furthermore, it was reported that he prefers eggs from Jewish women. So he is a philo-semite, not an anti-Semite. He is a philo-semite; he wants eggs from Jewish women. And so this news got a lot of attention. If you can see my screen, you can see a little capture from a video. I think in that video there are about a dozen sons. They're little kids, little toddlers.
Very cute. We'll see whether he really has a hundred. I'm not sure that's really true, but that was what was reported in the Wall Street Journal. This is a picture of Xo Bo from the Wall Street Journal article. Here's another snapshot, if you can see my screen of some of his sons.
They're all very cute. They all seem to be pretty similar in age. And no doubt he used donor eggs and surrogates and, who knows, possibly polygenic prediction, polygenic selection of embryos, to produce these sons. So I think 2025 marks a number of important changes in the landscape for polygenic selection of embryos and advanced reproductive technologies.
Steve Hsu: Welcome to the year end 2025 episode of Manifold. Today I'm going to discuss two scientific topics. One is polygenic prediction and embryo selection, and the other one is AI or generative AI. Okay. And I'm gonna cover some interesting and dramatic changes that have happened in these two areas in 2025.
I'm gonna break this into two components and try to aim for about 30 minutes for each one. I will start with genomics. I'm going to have some slides on the screen, but as usual, I will try to make the episode as understandable as possible for people who are only getting the audio and can't see what I have on my screen.
So let's dive right in and talk genomics. I haven't actually covered this very much on the podcast in a while. Largely because the situation has not really been changing very rapidly. The field has sort of reached a fairly mature level of sophistication, which is largely data limited.
So our ability to predict phenotype from genotype is primarily data-limited. We have good algorithms, we have plenty of compute, and it's just a matter of accumulating enough data to push things forward. But some interesting things happened in 2025
that I want to go over.
Now on the screen, what I have is an illustration from the Journal of the American Medical Association, JAMA. This is a paper about a very big study, almost 30,000 women, and from the perspective of polygenic prediction or polygenic risk scores, one could view this as a validation of the predictive power of polygenic scores, specifically for breast cancer.
So in this study, an RCT-type study, they assigned women into a group for whom screening was allocated based on the estimated risk that individual woman had for breast cancer, and the biggest driver of that is the polygenic score for that woman. So they have access to the DNA of an individual, and then they compute the polygenic risk score, which depends on about a hundred different loci, individual SNPs, on the genome. And they then put that woman into a high risk, medium risk, or low risk category. Based on that, some of the women were allocated more resources, things like mammograms and biopsies, and the people who were low risk were allocated fewer resources in terms of additional screening.
And then there was a control group for whom, rather than risk-allocated screening, they just did annual screening. So the second group is treated more or less the way women are treated currently under the standard of care, and the other population is a group in which the way those women are treated is DNA-informed. So that's the study that was conducted. We're not that interested in the study itself; it's just an example of a situation where they're starting to incorporate polygenic scores into adult care. In this case, they're using the polygenic risk predictor for breast cancer.
Most women who are high risk for breast cancer, and typically those women would have a family history of breast cancer, are high risk because of the aggregate effect of many SNPs in the polygenic predictor. So they have many different genetic loci in their DNA that increase their predisposition for breast cancer. That is distinct from the fraction of a percent of women who are carriers of a rare mutation like BRCA.
So those women are also high risk for breast cancer, but typically because of a single gene mutation. And that is the minority of women who are at high risk for breast cancer. There are about 10 times more women who are high risk for breast cancer than there are actual BRCA carriers in the population.
So in this study, what got a woman placed into the high risk category was typically a high polygenic score and not her BRCA status. And again, I'm going over this particular paper because it gives you an example of the different types of validation of polygenic scores that are going on in preparation for the rollout of the use of polygenic scores in adult medicine and adult healthcare.
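To make concrete what that computation looks like, here's a toy sketch of a polygenic risk score: a weighted sum of risk-allele counts at each SNP, followed by assignment to a screening tier. The SNP IDs, weights, and cutoffs below are made up for illustration; a real breast cancer predictor uses on the order of a hundred loci with weights estimated from large association studies.

```python
# Toy sketch of a polygenic risk score (PRS): a weighted sum of risk-allele
# counts (0, 1, or 2) at each SNP. The SNP IDs, weights, and cutoffs are
# invented for illustration only.
snp_weights = {"rs0001": 0.12, "rs0002": -0.08, "rs0003": 0.21}  # ~100 entries in a real predictor

def polygenic_risk_score(genotype):
    """genotype maps SNP id -> number of risk alleles carried (0, 1, or 2)."""
    return sum(w * genotype.get(snp, 0) for snp, w in snp_weights.items())

def risk_category(score, low_cut, high_cut):
    """Assign a screening tier from assumed population-percentile cutoffs."""
    if score >= high_cut:
        return "high risk: earlier / more frequent screening"
    if score <= low_cut:
        return "low risk: less frequent screening"
    return "average risk: standard schedule"

print(risk_category(polygenic_risk_score({"rs0001": 2, "rs0003": 1}),
                    low_cut=-0.1, high_cut=0.3))
```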
And as a side effect of the study, the graph that I have on the screen shows the cumulative incidence of breast cancer for each different risk category of women. So there's this bluish curve in which there is a much higher incidence, several times higher than for an average-risk woman.
That is the set of women that were classified as high risk, primarily because of their polygenic score. And you can see there's meaningful differentiation between women that are high risk, medium risk, and low risk, based entirely on a computation that depends on the state of about a hundred SNPs, about a hundred loci, in their genome.
And I just point to this as an example of a very well conducted study, large n, large study population, which showed the efficacy or validity of polygenic risk prediction. And I belabor this point because when we talk about polygenic risk scores for adults, generally there isn't a kind of visceral or emotional reaction. But when we get into using this kind of technology for reproduction, for IVF, for embryo selection, then there is typically, or quite often, an ideological, emotional response from some subset of people who really should know better but are just ideologically opposed to the use of this technology in reproduction, for the selection of embryos.
They will often claim that PGS doesn't work, that the scientific status of our ability to predict disease risk or to predict a phenotype like height or intelligence is not well established. And in fact, it's quite the opposite. It's so well established that it's about to be integrated into healthcare for millions of Americans.
I should add that in this study, the polygenic predictor for breast cancer that was used is one produced by an academic consortium, but the leading company that provides breast cancer genetic screens for millions of women worldwide, that company's called Myriad, also uses polygenic risk prediction in its latest screen, which I think is called My Risk. And so in that case, you already have millions of women who have a My Risk score, which depends on polygenic risk prediction for breast cancer. So, to the people who are critical of embryo selection on the basis of science, who want to claim that the science is not mature or the science doesn't work:
those people generally don't want to let on, or let you know, that the efficacy of polygenic prediction of phenotype or disease risk has now been validated in hundreds of research papers, in very large studies like the one that I've just described to you. Okay. So if you're at all unsure about the scientific status of polygenic risk prediction, well, the best way to find out is to actually look through the scientific literature and look at the studies that have been done.
But as I said, there have been a huge number of studies, typically establishing the validity and the predictive power of polygenic scores. So I don't consider this an open question of science anymore. Obviously there are some people resisting that conclusion, but mostly they're resisting it for ideological reasons, and mostly they're resisting it in the context of embryo selection or reproduction. And typically they're not focusing at all on applications in the adult context. Okay, so let me go on to the next screen now. One of the big steps forward that happened in 2025 was a large study in which my research collaborators in my lab were participants.
The main source of data was the Taiwan Biobank and the Taiwan Precision Medicine Initiative. The paper, which was published in Nature in October 2025, is called Population-Specific Polygenic Risk Scores for People of Han Chinese Ancestry. And this is the first study that builds polygenic predictors of quality equal to the ones that already existed for people of European ancestry, but this time for a non-European ancestry group, in this case East Asians. So it's now the case that we have very strong polygenic predictors for the most important disease conditions and phenotypes which actually work for people of East Asian heritage.
And so that was a big step forward in 2025, and it will open the door for more aggressive use of embryo selection in East Asian and other Asian populations. So I just wanted to bring that to your attention. Here is a slide from my talk that I gave at the Berkeley Genomics Project Reproductive Frontiers meeting.
I think that was during the summer of 2025. These are my slides, as I said, and there I discussed the preprint version of this paper. The preprint appeared in 2024; the paper finally appeared in Nature in October of 2025. Here on the slide you can see that we analyzed over half a million genomes, so that's the largest non-European aggregation of genomes that's been studied to build polygenic risk scores. And on this slide it says, under review at Nature Genetics. That was still the case when I gave this talk, but now the paper has been published.
This shows a related paper which is specifically about height prediction. You can see on the slide it says few-centimeter height prediction accuracy. So now for East Asians, we can easily identify outliers for short stature and outliers who are gonna be much taller than average. And if you can see this slide, you can see there's a pretty high degree of accuracy in these predictors.
The JAMA study and this Nature paper are examples of just continuing progress in the field of polygenic prediction. Most of the progress tends to be, as I mentioned at the beginning, due to improvements in access to data, so larger data sets or better data sets. And in terms of algorithms, we are already at a point where we have pretty mature methods to build predictors once we have access to data: data in which the genome of the individual and that individual's disease status or phenotype, the quantitative trait value, are known. So how tall is the person, or what is their cognitive score? Okay, so let me skip ahead another couple slides in this presentation.
This presentation was recorded as part of this Reproductive Frontiers meeting, which was held at Lighthaven in Berkeley over the summer. I hope I have the date correct; maybe I'm off and it was longer ago than last summer. But anyway, it wasn't that long ago. In any case, my talk, which is about an hour long, is available online. I'll try to remember to put a link in the show notes so that if you want to listen to the whole talk, you can.
So the slide that I have up now is about cognitive ability, and on the right is a visualization of actual data from this old Cold War-era study called Project Talent, in which they took psychometric measurements of a hundred thousand ninth graders. I think at the time there were maybe 2 million ninth graders in the entire United States, so this is something like measuring 5% of all ninth graders in the United States.
This is the kind of serious thing one could get done in the Cold War which, you know, today wouldn't necessarily be done. If you look at the plot, it's a scatter plot and the three different coordinates are in standard deviation units: the spatial ability score, the mathematical ability score, and the verbal ability score.
Nowadays it's very hard to find measurements of spatial ability. It's almost some kind of thought crime to even talk about something that sounds as strange to the woke mind as spatial ability. Now, what's interesting about this plot is that the data forms a kind of ellipsoid which exhibits a sort of positive correlation between each of these cognitive scores. So, conditional on being, say, high on the math score or high on the spatial score or high on the verbal score, the probability that you're also above average on the other two scores is increased, and that's what this ellipsoid structure represents. And you could think of the major axis of this ellipsoid as something like the general factor of intelligence. So when you see representations of data like this, it makes it apparent that, just due to the correlation structure of cognitive ability in the population, there is some kind of g score, some kind of single value that you could report as someone's overall cognitive ability. You can think of that as a single-number way of compressing these three different scores: spatial, mathematical, and verbal. So that's just a figure that I like, and I put it on the right side of this slide.
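As a small illustration of the idea that the major axis of that ellipsoid behaves like a single general-factor score, here is a sketch on simulated data; the factor loadings and sample size are made-up illustrative values, not Project Talent's actual numbers, and the first principal component is used as a simple stand-in for g.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate three cognitive scores (spatial, math, verbal) that share a common
# latent factor. Loadings and sample size are illustrative assumptions.
n = 10_000
g = rng.normal(size=n)                        # latent general factor
loadings = np.array([0.7, 0.8, 0.75])         # assumed loadings on g
specific = rng.normal(size=(n, 3)) * np.sqrt(1 - loadings**2)
scores = g[:, None] * loadings + specific     # scores in SD units

# The first principal component of the correlation matrix plays the role of the
# ellipsoid's major axis, i.e. a single "general factor" summary score.
corr = np.corrcoef(scores, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
pc1 = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue
if pc1.sum() < 0:                             # orient so higher means higher ability
    pc1 = -pc1
g_estimate = scores @ pc1

print(np.round(corr, 2))
print("corr(PC1 score, latent g):", round(np.corrcoef(g_estimate, g)[0, 1], 2))
```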
On the left, the slide says: best SNP predictor correlates about 0.5 with actual IQ. Okay, so that's a new advance that happened in 2025. This is a claim from the company Herasite, which is doing embryo selection. We collaborate with Herasite, so in some cases they're using the data that comes from Genomic Prediction's genotyping of the embryo in order to compute their embryo scores.
So they're the ones who have claimed to have constructed a new, better IQ predictor, or g factor predictor, which has a correlation of about 0.5 with the actual underlying cognitive ability. This result has not been replicated by other groups; it appears only in a white paper from Herasite. I believe what Herasite claims to have done is that they used the UK Biobank data and constructed a kind of synthetic estimator of the fluid intelligence score, which is built out of other variables like the income or socioeconomic status or education level of an individual.
And they use that to crudely predict the fluid intelligence score. I think there are about a hundred thousand or maybe 150,000 individuals in the UK Biobank dataset who have at least a crude fluid intelligence test score, which is, I think, only like 12 or 13 questions. So again, pretty noisy data, but Herasite claims that in out-of-sample validation, the predictor correlates about 0.5 with actual IQ.
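For flavor only, here is one simple way such a proxy phenotype could be constructed: regress the noisy test score on its correlates in the subset who took the test, then apply the fitted model to everyone. This is my guess at the general shape of the approach, run on simulated stand-in data; it is not Herasite's actual method or data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated stand-in data: three standardized covariates (education, income,
# occupation) that correlate with a latent cognitive trait. All numbers are
# made up for illustration.
covariates = rng.normal(size=(n, 3))
true_g = covariates @ np.array([0.4, 0.2, 0.2]) + rng.normal(scale=0.8, size=n)
has_test = rng.random(n) < 0.3                        # only ~30% took the short fluid test
noisy_test = true_g + rng.normal(scale=0.7, size=n)   # a 12-13 item test is a noisy measurement

# Fit the proxy on the tested subset, then apply it to the whole cohort.
X = np.column_stack([np.ones(has_test.sum()), covariates[has_test]])
beta, *_ = np.linalg.lstsq(X, noisy_test[has_test], rcond=None)
synthetic_score = np.column_stack([np.ones(n), covariates]) @ beta

print("corr(proxy, latent trait):", round(np.corrcoef(synthetic_score, true_g)[0, 1], 2))
```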
I think people who are skeptical about IQ in general, or the g factor, and furthermore people who are skeptical about the heritability of that construct, are surprised that one could build such a strong predictor. I would guess that with even better data, higher quality data than what Herasite used, one could do even better; one could possibly get to something like a correlation of 0.7 between the actual underlying cognitive score and the predicted value. Now, already in simulations, Herasite claims that if you have something like 10 embryos to choose from, and you pick the one with the highest polygenic score for cognitive ability, you could increase the expected score of that embryo, the top polygenic score embryo, relative to the average embryo or a randomly selected embryo from the 10. You could get an increase of something like five to 10 IQ points, which is getting to the point where it's pretty significant. With height, if you were selecting the embryo predicted to be tallest, say, among 10 male embryos, you could get something like a few inches, maybe two or three inches, of increase on average in the height of your child.
So it's getting to the point where the gain from selection on a reasonable number of embryos is at a level where people would care about it. You know, a few inches of height, five or 10 points of IQ: that's getting to the point where, for a family that's already going through IVF, it sort of becomes a no-brainer to do this kind of screening.
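To see where gains of that size come from, here is a minimal Monte Carlo sketch of picking the top-scoring embryo out of ten. The within-family standard deviation and the score-trait correlation used here are illustrative assumptions, not Herasite's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_gain(n_embryos, r, sd_trait, n_trials=100_000):
    """Monte Carlo estimate of the average trait gain from picking the embryo
    with the highest predicted score among n_embryos siblings.

    r        : assumed correlation between polygenic score and trait among siblings
    sd_trait : assumed within-family standard deviation of the trait
    """
    # Latent trait values for each embryo (within-family variation only).
    trait = rng.normal(0.0, sd_trait, size=(n_trials, n_embryos))
    # Predicted scores built so that corr(score, trait) = r.
    noise = rng.normal(0.0, sd_trait, size=(n_trials, n_embryos))
    score = r * trait + np.sqrt(1 - r**2) * noise
    # Pick the embryo with the top score in each trial; compare to the sibling average.
    picked = trait[np.arange(n_trials), score.argmax(axis=1)]
    return picked.mean() - trait.mean()

# IQ example: assumed within-family SD of 12 points and predictor correlation 0.5.
print(expected_gain(10, r=0.5, sd_trait=12.0))
```

With these illustrative inputs the gain comes out to roughly nine IQ points, consistent with the five-to-ten-point range mentioned above; under this simple model the analytic answer is just the correlation times the within-family standard deviation times the expected maximum of ten standard normal draws.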
So far these topics that I've been discussing are purely scientific.
Let me talk a little bit about embryo selection and IVF and what has been going on, not from a purely scientific point of view, but in terms of competition in the space and also the sociology of families going through IVF and IVF clinics.
So here I have on the screen a subway ad from New York City, which in this case is from a company called Nucleus Genomics, which is also in this space. And we have in the past been a collaborator also with Nucleus. On this advertisement, if you can't see my screen, it says Have Your Best Baby, and then has a picture of three babies.
And so Nucleus has been advertising very aggressively. Herasite has been writing white papers aggressively stating the gains in traits like height and IQ, which are quite controversial. And so we're starting to see possibly a breakthrough in the Overton window, or in public consciousness, about what is possible through embryo selection in IVF. Now, our company, Genomic Prediction, is actually the original company that first genotyped embryos for embryo selection and also computed polygenic scores for embryos. We have been deliberately conservative in this area. We have never offered cognitive ability prediction, we have never offered height prediction, and we have never offered cosmetic trait prediction, even though we can do all of these things.
And the main reason is because, A, we thought society was really not ready for it, and also because individual IVF clinics, individual IVF doctors, were quite nervous about it. And so, to date, I believe we've worked with something like 300 IVF clinics around the world and we've genotyped something like 200,000 embryos.
So I think we're orders of magnitude, at least one order of magnitude, maybe one and a half orders of magnitude, beyond any of the competitors in the space. But in order to get there, we had to be relatively conservative in the polygenic scores that we offered, and to date we have only offered prediction of disease risk.
So major diseases, things like heart disease, diabetes, breast cancer, prostate cancer, but also some psychiatric conditions like schizophrenia. That was sort of the limit of what most IVF doctors could tolerate, and some of these new companies that are coming into the space are sort of gambling that the Overton window has shifted.
Now there are academic studies that suggest this is the case. There are surveys of the general population, IVF families, and also IVF physicians. And all of those show that since we founded the company Genomic Prediction in, I believe, 2018, so seven years ago, there's been a significant shift in acceptance levels of this technology, to the point where I would say a pretty strong majority now approve of embryo screening for polygenic disease risk. And a reasonably large minority approve of screening for traits like intelligence. If you aggregate the set of people who are in favor of allowing embryo screening for intelligence with those who are sort of neutral, or at least not strongly opposed, that is the majority of the population. So it's a minority of the population now that's strongly opposed to intelligence selection.
And so it could be that 2025 will be the beginning of that inflection point where you start to see public acceptance of screening on traits like intelligence. We'll just have to see; time will tell. Now, as I mentioned earlier, one of the breakthroughs of 2025 was good polygenic predictors for people of Asian ancestry.
And in that population, I believe there's always been very strong positive approval of embryo selection, even for selection on intelligence. And so, as we begin to be able to service that population with better and better predictors, I think the overall fraction of IVF users that are comfortable screening embryos for traits like intelligence is gonna move into the majority.
So it is gonna become, I think, accepted. Whether that takes just a year or two, or another five years, I don't know, but I think one can't deny that it's gonna happen. And for the people who have been following Genomic Prediction for the last seven years, you know, you can see that we've been moving very, very carefully along these lines.
If at some point we begin operating in a particular society or culture where that society and culture is strongly in favor of, or approves of, intelligence or height selection, things like this, then we may decide to offer it ourselves. So stay tuned for that kind of development. Now, another sign of increasingly aggressive use of these technologies is among the super-high-net-worth elites.
And it's been reported throughout the year that lots of super-high-net-worth Silicon Valley types have been using polygenic embryo selection in reproduction. Just recently the Wall Street Journal reported on Chinese billionaires coming to the US and doing aggressive IVF, often using donor eggs and surrogacy. And so for example, there was a super wealthy Chinese guy, Xo Bo, who was profiled in this Wall Street Journal article. Now he reportedly has something like a hundred or more sons that have been born through surrogacy and IVF.
Furthermore, it was reported that he prefers eggs from Jewish women. So he is a philo-semite, not an anti-Semite. He is a philo-semite; he wants eggs from Jewish women. And so this news got a lot of attention. If you can see my screen, you can see a little capture from a video. I think in that video there are about a dozen sons. They're little kids, little toddlers.
Very cute. And we'll see whether he really has a hundred. I'm not sure that's really true, but that was what was reported in the Wall Street Journal. This is a picture of Xo Bo from the Wall Street Journal article. Here's another snapshot, if you can see my screen, of some of his sons.
They're all very cute. They all seem to be pretty similar in age. And no doubt he used donor eggs and surrogates and, who knows, possibly polygenic prediction, polygenic selection of embryos, to produce these sons. So I think 2025 marks a number of important changes in the landscape for polygenic selection of embryos and advanced reproductive technologies.
We will have to see how this evolves into the future. Since it's the end of the year, I guess it's appropriate for me to make some predictions, and I think I can make several. One is that we will continue to see more and more validations of the core technology. So I think it'll become increasingly untenable for some bioethicist or some ideologue who just doesn't like embryo selection to claim that polygenic risk scores or polygenic prediction of traits like height or IQ just don't work. I think it's already scientifically untenable, but people can still get away with it talking to journalists; it's gonna become increasingly untenable to hold that position if you wanna have any level of scientific credibility. We're gonna continue to see improvement in the quality of predictors, to the point where the gain from doing embryo selection is going to be very, very obvious. The size of the gains is gonna be something that people can't ignore. I think that the Asia-Pacific, or generally the East Asian and South Asian, markets are going to grow very fast because there's no cultural resistance in those parts of the world to using this technology.
And now, finally, the technology is at a point where we can do pretty strong polygenic prediction across a variety of traits and disease risks for those populations. So my prediction is that that market will grow very fast in coming years. And then a third prediction is that elites will continue to be the leaders, the most aggressive, in using this technology.
And I suspect that the general population will start to appreciate that super rich people, people like Elon Musk, and I'm not saying Elon specifically is doing this, but people like Elon Musk, or Xo Bo, the guy whose babies are still on my screen, are doing it. And I think this is gonna change the attitudes of average people. They'll go from, oh, this is some icky, weird thing that we don't understand, and I'm afraid to say publicly that I approve of it because some woke scold is going to yell at me that I'm a eugenicist or that I'm a Nazi. I think that's gonna gradually go away, and instead it will be replaced by a kind of FOMO, fear of missing out. So that someone who isn't super wealthy, ultra-high net worth, but merely high net worth or merely affluent, is gonna ask themselves as they go through IVF: hey, what are we missing out on? What is my family missing out on, in terms of ensuring the health and wellbeing of my children, that someone like Peter Thiel or Elon Musk actually is engaging in? And so 2025 might mark that inflection point. So maybe next year at this time, I'll report back on what happened in 2026.
Let's move on now to the second topic for this year-end episode. We just completed our discussion of what happened in genomics, particularly in polygenic prediction of complex human traits and embryo selection. Now we'll talk about AI in 2025. And of course, AI is probably the single biggest topic in all discussion in media in 2025. It's already starting to change our lives. It's changing the way that professors like myself teach our courses. It's changing the way that scientists do research. And it's created what some people call an enormous investment bubble. In the remarks I'm about to make, I'm not gonna talk about the things which are most commonly covered, like the AI bubble, Nvidia, the hype cycle.
I'm gonna focus on an area that's, I think, maybe best described as AI research, and talk about the improvements in the highest-level capabilities of the models that happened in 2025. And I think information about this is not really available to the average person.
I think the average person is stuck more or less listening to a bunch of hype that comes from self-motivated AI founders, or possibly looking at some benchmarks but not necessarily being able to interpret very well or intuitively what those benchmark scores mean. I'm going to try to shed some light on the question of how much the models really improved in 2025. So if I take the best performing models available right now, end of December 2025, and compare them to what was available at the beginning of 2025, I would claim there's a very, very significant qualitative difference in the performance of those models. So the last episode of Manifold that I released was about theoretical physics with generative AI.
So that was Manifold episode 101. This one that I'm recording right now is 102. And let me just briefly review what was said in 101, or what happened with theoretical physics research and generative AI. I actually published a paper of original physics research which was largely driven by AI.
In other words, the core idea in that paper, which is about state-dependent or nonlinear modifications to quantum mechanics, came from AI. I published that paper in Physics Letters B, and it might be the first physics paper in which the core idea came from an AI, in this case from GPT-5. I wrote a companion paper to go with the actual physics research paper.
The companion paper is maybe more of interest both to AI researchers and to theoretical physicists. And then I had a discussion with two other theoretical physicists who are interested in exactly this topic and actually wrote, either on their Substack or blog or in the form of an actual scientific preprint or paper, a critique of the work that I had done.
And so on the screen, you can see a link to the previous episode of Manifold, where I discussed this. This is my own Substack page. If I scroll down here, you can see the posts on X that I made on this topic and the actual preprint. The actual paper that I wrote, which was published in Physics Letters B, is called Relativistic Covariance and Nonlinear Quantum Mechanics: Tomonaga-Schwinger Analysis. So that's a pretty esoteric title. I won't get into the physics content of this paper; I'm gonna focus more on the AI here. Here's a link to the companion paper, in which I described the contribution of the AI model to the actual work that was accomplished.
And here we have a YouTube recording. This is, I think, about 80 minutes of discussion with myself and two other professors, both of whom wrote reactions in some sense to the work that I had done. This is an earlier episode of Manifold, number 97, in which I interviewed Professor Lin Yang of UCLA, who has a background both in computer science and physics.
And Lin, in one of his research publications, shared a version of what I would call scaffolding around an AI, where that AI could be any of the leading models available, say, in mid-2025. So, for example, GPT-5, Gemini 2.5, Grok, any of those models, and I think Claude as well. Using this scaffolding, which I'll describe a little bit, he was able to get that model to perform at the gold medal level on the International Math Olympiad. So he took the most recent problems from the IMO, immediately when they were released, when the competition happened, so those problems were presumably not in the training data for the model. And he showed that by building this scaffolding around any of these off-the-shelf commercial models, he could get them to the point where they could write correct proofs for five out of six of the IMO problems. The architecture that he used to elicit this level of performance from the models is what I call, and I think he also calls, a generator-verifier pipeline. I also used that in the physics research that I performed.
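To give a sense of the shape of such a pipeline, here is a minimal sketch in Python. It is not Lin Yang's actual code: call_model is a hypothetical helper standing in for whatever model API you use, and the one-line role prompts are stand-ins for the much longer role instructions used in the real work.

```python
# Minimal sketch of a generator-verifier loop. call_model is a hypothetical
# helper that sends a prompt to a frontier model of your choice.
def call_model(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API of choice")

GENERATOR_ROLE = "You are a careful mathematician. Write a complete, rigorous proof."
VERIFIER_ROLE = "You are a skeptical referee. Find any gap or error; reply PASS only if the proof is airtight."

def solve_with_verification(problem: str, max_rounds: int = 5):
    attempt = call_model(GENERATOR_ROLE, problem)
    for _ in range(max_rounds):
        review = call_model(VERIFIER_ROLE, f"Problem:\n{problem}\n\nProposed proof:\n{attempt}")
        if review.strip().startswith("PASS"):
            return attempt                    # the verifier accepted the proof
        # Otherwise feed the critique back to the generator and refine.
        attempt = call_model(
            GENERATOR_ROLE,
            f"Problem:\n{problem}\n\nPrevious attempt:\n{attempt}\n\nReferee critique:\n{review}\n\nRevise the proof.",
        )
    return None                               # no verified proof within the budget
```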
Now, interestingly, since that work was done, since he wrote his paper and since I wrote my paper, just in the intervening weeks, DeepSeek released a version of their model, 3.2 Speciale. It's a funny name, 3.2 Speciale, and it has a very large token budget. And without scaffolding, without this generator-verifier structure that both Lin Yang and I used, that newest version of the model is actually able to perform at the gold medal level just by itself.
Okay. And this isn't true of any of the other off-the-shelf models. Typically, those off-the-shelf models would get maybe one problem correct out of six, and certainly not five out of six. But now we have examples like DeepSeek 3.2 Speciale, and also the scaffolded models, the models that are run through this generator-verifier pipeline, which can perform at that level.
And that is a level which is extremely high. At this sort of highest level of model performance on contest math, they're performing at the gold medal level; that's like, you know, maybe one in a hundred thousand humans can do that. And on a more sophisticated set of competition problems from, say, the US Putnam exam, which include higher-level, undergraduate-level mathematics, not just high-school-level mathematics, these models are also performing, I wouldn't say at a completely superhuman level, but comparably to really the best human problem solvers who are trying to do these very difficult contest math problems. So that's what the peak-level performance of these models looks like: able to solve these extremely difficult competition problems, able to assist humans in coding, and, in my case, able to actually produce some interesting ideas, and analysis of those ideas, for theoretical physics. Maybe the most economically impactful use of models right now is in the software industry, using them to actually write code and debug code; that's become a very big thing.
So that's what I want to talk about for maybe the next 20 minutes: this highest-level performance of the models, and where I think that is gonna go in the future. And one of the things that I want to emphasize is that this dramatic improvement in high-level capability happened on the timescale of basically one year.
So the models were not particularly good at this kind of thing a year ago. Now they're extremely good, to the point where very, very few humans can compete with them. And this is in areas which, in the case of software, writing code, software development, are very economically impactful, and which, in the case of solving math problems or understanding physics papers, are very, very impactful for the progress of frontier science.
Okay. So most professors, if they had a grad student that could do math as well as an IMO gold medalist, would be extremely happy. That would be a great find, a great addition to their research team, to have a student like that in their group. But now you can have access to that if you just turn on DeepSeek 3.2 Speciale, or you rig up this generator-verifier scaffolding around an off-the-shelf commercial model.
Also, I should mention that in my work in this area, I worked with a team at DeepMind who had built something called co-scientist, which is also a kind of scaffolded, enhanced version of their best Gemini model with a very large token budget. And that thing, co-scientist, I believe is also becoming quite useful to research scientists.
I am not saying that the models are at a point now where they could actually replace a highly experienced research scientist; the experienced scientist is still a necessary ingredient in producing new, novel, or important research results. But I would also make the case that AIs are becoming very useful for this kind of activity.
I mentioned this conversation that I had; pictured here is Professor Jonathan Oppenheim, who's at University College London, and the interlocutor who led the discussion between Jonathan and myself, Malia, who's at IIT. After we recorded this, and I don't recall whether it's in the actual discussion that was recorded,
it might have been something we discussed afterwards, but I think Malia and I were a little bit more positive about the use of AI in research than Jonathan, and we continued to discuss this after we stopped recording. I think that's right. And I said something like: I think within a year or two, or at least a relatively short amount of time, something like 70% of all the younger physicists will be regularly using AI to assist them in research, to assist them in a non-trivial way in their research.
To my surprise, Malia, who's younger than we are, younger than both Jonathan and myself, said something like, oh, I think that's already true. So, in other words, he said, among, say, physicists under 35, probably 70% of them are already using AI quite aggressively in their research. So I think most average people, who aren't research scientists at the frontier, find this to be pretty shocking.
I think if you were an attorney or an accountant or some kind of white-collar knowledge worker, you might have seen only the one-shot performance of the model: I just give the model a prompt, I maybe upload a legal case or some spreadsheet, and I ask the model to do some stuff.
I think there would be some fraction of people who would see the potential and say, yeah, this thing could be extremely powerful and replace a lot of human labor already. But there would be other people who would say, oh, I see a lot of mistakes. This thing makes mistakes. I think it's still, you know, not ready for prime time.
You know, it could be useful in some very narrow situation, but not broadly speaking, and it's not gonna replace that many hours of human labor. I think the key issue in those remarks is that these people only have access to the one-shot performance of the model. One-shot performance of the model means you maybe upload some information to it, then you write the prompt, and then you get the answer.
But this generator-verifier pipeline, or the more involved reasoning capability that Speciale has, or what people in the industry are calling agentic workflows, where multiple instances of the model are collaborating and critiquing the response before it's shown to a human: I think most of the people who have the kind of reaction that I described are not aware of the significant improvement in model performance when you scaffold the model or embed it in this kind of generator-verifier or agentic pipeline.
'Cause very few people have actually seen output from models that are used in that way. Again, if you want to see a dramatic example of this, look at Lin Yang's paper; there's a whole GitHub repository in which he shows you, I think, about three pages of prompting to put the model into the mode of a verifier, or into the mode of a generator of a proof, or a refiner of a proof in response to verification. So there are different roles that the model plays in the pipeline, and definitely a lot of tokens are burned through the process, but there's a dramatic difference: instead of being able to solve, say, one out of six problems, it's able to write a correct proof for five out of six problems.
So that's a huge delta, based on this extra scaffolding, which most people who purport to give you some opinion about AI capabilities have never seen and have not really looked at carefully. So one of my main take-home points is that people who have experimented with this kind of generator-verifier pipeline have already seen a qualitative bump in what models can do.
But that's a very small fraction of people: people who are working with the models at that level and who also have the core expertise to tell the difference between "the model is only as good as a first-year grad student at solving these math problems or physics problems, or understanding these physics papers, or maybe a first-year law school student" versus "the model is actually doing very non-trivial things if I scaffold it in this way."
Okay. So that gain in quality is going to make its way relatively soon into the off-the-shelf models, and Speciale is the first example of that. So DeepSeek 3.2 Speciale has IMO gold medal capability without any of this extra scaffolding. And I haven't tested it on physics knowledge, but I suspect it's qualitatively a jump beyond what the other off-the-shelf models will do in terms of analyzing a physics paper or doing some symbolic calculation. So I think we can already say with confidence that most people who are commenting on the current capabilities of AIs have not actually themselves seen or appreciated the peak performance of these models that's already available in 2025 and will become available in off-the-shelf models for sure, I think, in 2026. Okay. Now, beyond that, I think what is going to happen is that the baseline capabilities of the models, through pre-training and through RL, are going to continue to increase.
We are not seeing a slowdown in the rate of improvement of these models. And furthermore, as we make them better and better at agentic collaboration in a pipeline, so that you break the problem up into small pieces, you have different agents, generator and verifier agents if you wanna call them that, attacking different parts of the problem, the verifiers checking the solution, not showing any of that to the human user until it's been processed through a potentially huge number of tokens, you know, millions of tokens of inference, and then showing you the final answer. I think we don't know how good that will be, how much of an improvement we'll see in that capability just in the next year or so. And it could be quite dramatic.
So I wouldn't be that surprised if by the end of 2026 the models are extremely good at math, physics, and general science, and possibly extremely good at analyzing legal documents, combing through spreadsheets, looking at financial statements and such. I think it's definitely within reach to have a qualitative jump over the next year in the peak capabilities of these models. It could be very expensive in terms of inference costs: it could be that this kind of deep processing or deep research uses an order of magnitude more tokens than a typical one-shot response that you're getting right now, even from a very good thinking model.
It might be another order of magnitude more in inference tokens used, but I believe there will be a significant quality jump corresponding to that additional inference and additional scaffolding. So my prediction for 2026 is that we're gonna see continued improvement. Now, if you dig down into the model training and you ask, well, how are they actually getting this improvement?
I can make a couple of comments. Let me take a very specific case, which is the use of models to do symbolic math. Now here, I don't mean necessarily solving a tricky IMO problem. What I mean here is: you're a physicist or an engineer and you have some symbolic math you need done.
Like, you need the AI to do an integral for you, folded in with some other calculations, maybe solve some algebraic equation, maybe make a plot. Those are things that the best models have made huge strides in over the last year. So a year ago, if you asked it to do some symbolic calculation, and it were a very obvious textbook calculation that it's seen before, like it's literally seen that calculation done or something very similar, then it would have a decent shot at giving you a result, but it would also potentially make a mistake.
But I believe what's happened in the last year is that all the labs have prioritized making the models better at math and science, and they focused a lot on reinforcement learning. The models already have a decent understanding of the underlying concepts. So if you ask what an electron is, or what a photon is, or what a derivative is, or an integral, or matrix multiplication, the models, within their, you know, trillion-parameter structure, already have some understanding in some sense, or at least an encoding, of those core concepts. And I believe what's happened through RL is this: one can give the models, as they go through their reasoning steps, eval functions, training evals, in which a symbolic math problem is given to them.
Again, not a specifically intentionally tricky problem, but one in which a set of manipulations needs to be done. Maybe each individual manipulation is relatively straightforward, but there's a chain of them, and then the thing comes back with a result. One can generate synthetic data for reinforcement learning for that set of problems by just using symbolic engines like Mathematica.
Okay. So, for non-scientists: there are already existing programs like Mathematica, Steve Wolfram's product, the product of his company. That's already heavily used by quantitative scientists. So if you need to do an integral, you need to do a plot, you need to even, like, numerically solve some simple differential equation, a lot of people are doing it in Mathematica or in other open-source symbolic math packages that are similar to Mathematica. And generally the error rate there is close to zero, because when Mathematica is doing an integral for you, or it's simplifying some algebraic expression, it is following known algorithms. So it's not guessing answers or anything like that; it's following some procedures to get the answer. And generally, if it does succeed in giving you an answer, the answer is correct. But using something like Mathematica, or just some symbolic math package, I can produce an almost infinite amount of training data. The large language model already has baked into its connections, through the pre-training, a pretty good understanding of what an integral is, what a derivative is, you know, what a vector is, what a matrix is, or at least a compression of those concepts is in its connections. And I can use RL with synthetic data from symbolic calculations to make it really good at symbolic calculations: to get it into, in a sense, a habit, an RL-induced habit, of doing those symbolic computations step by step, doing them carefully, chaining together five steps to actually reach the right result. And so I think that's something that happened in the last year.
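Here is a minimal sketch of what that synthetic-data loop could look like, using the open-source sympy library as a stand-in for Mathematica; the problem generator and the reward function are my own illustrative assumptions, not what any lab actually does.

```python
import random
import sympy as sp

x = sp.symbols("x")
rng = random.Random(0)

def random_integrand():
    # Compose a small expression from pieces the symbolic engine can integrate exactly.
    pieces = [x ** rng.randint(1, 4),
              sp.sin(rng.randint(1, 3) * x),
              sp.exp(rng.randint(1, 2) * x),
              x * sp.cos(x)]
    return sum(rng.sample(pieces, k=2))

def make_training_example():
    f = random_integrand()
    prompt = f"Compute an antiderivative of {f} with respect to x."
    return prompt, f                      # ground truth is checked by differentiating

def reward(model_output: str, integrand) -> float:
    # Score the model's answer: 1 if its derivative matches the integrand, else 0.
    try:
        candidate = sp.sympify(model_output)
    except (sp.SympifyError, SyntaxError, TypeError):
        return 0.0
    return 1.0 if sp.simplify(sp.diff(candidate, x) - integrand) == 0 else 0.0

prompt, integrand = make_training_example()
print(prompt)
print(reward(str(sp.integrate(integrand, x)), integrand))   # the engine's own answer scores 1.0
```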
So if you're a physicist or an engineer and you're using the language models a lot, and you're using them to actually do calculations, there's been a tremendous improvement in their ability to do symbolic math. Again, not super, super hard Olympiad math, not necessarily generating a proof, although they have also independently improved in those areas. I'm talking about pretty prosaic things which, you know, a human could go through and do, but it might take hours. And now the LLM is capable of doing them, you know, very fast, within a matter of seconds. Okay, so that's an example of a very specific capability, but a capability which is central to progress, or just day-to-day research activity, and which the models, in the course of basically about a year, went from being pretty unreliable at to pretty reliable at.
I wouldn't necessarily say they have 99% accuracy at this stage yet, but I think the accuracy went up substantially. And it's to the point now where it's quite useful to the researcher. So if you just ask it to do some symbolic calculation, sure, you still need to check and look at the results to see whether it made a mistake, but it's quite likely that it didn't make a mistake.
And now you could reply, well, you could have already done that with Mathematica. But actually, in the case of Mathematica, you have to enter the calculation that you want the program to do in a very rigid, formal syntax. So you have to say, you know, Integrate, open bracket, the integrand, the measure, dah, dah, dah. You have to do all this in a very precise way; it's almost like writing a little program. Whereas the models, LLMs, can understand the context of the task that you're trying to give them; they can pretty much figure out what you want them to do. And so it's much easier just to write to them, not completely in English, but in somewhat the same kind of way that you would talk to a grad student or a research collaborator, and the model will understand and then do the symbolic calculation properly.
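As a concrete contrast, here is a small illustration, with sympy standing in for Mathematica's rigid Integrate[...] syntax, and a plain-English prompt of the kind you would instead hand to an LLM (the prompt text is just an example, not a real API call).

```python
import sympy as sp

# The formal, Mathematica-like way: the calculation must be spelled out exactly,
# almost like writing a little program (sympy used here as a stand-in).
x = sp.symbols("x", real=True)
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))   # prints sqrt(pi)

# The LLM way is just a prose request sent to whatever model you use:
prompt = "Please integrate e^(-x^2) over the whole real line and simplify the result."
```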
Okay. So I'm just giving that as an example of something where, on a timescale of a year, the models made tremendous improvement in that capability. There's no reason not to expect continued improvement like that, so that, oh, there's some standard kind of analysis that's done in physical chemistry or, you know, molecular biology or something like this.
And, you know, through the efforts of people in the labs trying to source high-quality data, create high-quality evals, and subject the model to better pre-training but also post-training, reinforcement learning, to give it that specific set of skills, all of a sudden the research utility of the models just gets that much better.
And I'm sure that similar things are happening in coding. I think even more energy is going into making them better at writing software, debugging software, understanding libraries, things like that. So I guess my prediction is that 2026 could be the tipping point, through these kinds of training efforts, both pre-training and post-training, but then also additional scaffolding. That scaffolding will increasingly look like agents: different instances of the model that have been prompted differently, or maybe even are themselves qualitatively different from each other, collaborating in a pipeline. That's the generalization of this generator-verifier pipeline that I was talking about. That aggregate agentic capability, I think, will also potentially improve dramatically in 2026.
So a year from now: I am predicting continued rapid advancement in the capabilities of these models. I'm not predicting a slowdown, even though there may be lots of challenges, like, oh, you can't easily increase the pre-training data set by an order of magnitude, et cetera, et cetera. But I think with synthetic data, with human input to generate good evals, there's still substantial progress that these labs can achieve in the models. So I think a year from now, we're gonna be amazed at how good the models are. I've gone on now, I think, just over an hour. I wasn't trying to cover the whole AI space; I just wanted to cover one particular thing that I noticed.
I could come back later perhaps and talk about US-China competition. What's going on with semiconductors and Nvidia? How is this AI bubble gonna play out? What's gonna happen with data centers? That's not my purpose here. Maybe I'll come back and do an episode, maybe with a guest, to talk about those other topics.
Thank you very much for being a Manifold listener in 2025. It's been a great year for me. So many fascinating things happening in the world. It's a great time to be alive. I hope that all of you are doing well. Thanks so much. Have a wonderful holiday and a Happy New Year.