

Machine Learning | Data Science | Artificial Intelligence

SDS 812: The AI Scientist: Towards Fully Automated, Open-Ended Scientific Discovery

Podcast Host: Jon Krohn

Friday Aug 23, 2024

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


In this episode of Five-Minute Friday, Jon Krohn investigates newly published findings from the startup Sakana AI and its co-authors at the University of Oxford, the University of British Columbia, and the Vector Institute in Toronto. Together, these authors explore the potential of The AI Scientist, a framework that could change the way we conduct scientific research forever.
 

"The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" confronts us with a hypothetical world where AI systems can generate entirely new research ideas and take them all the way through to completion, even culminating in a scientific write-up of the project. The paper offers a way to automate this process of research and evaluation using a framework called The AI Scientist.

The AI Scientist works across several domains of machine learning research to produce papers that, as adjudicated by the paper's own AI reviewer, may stand a chance of being accepted at top conferences.

As with any cutting-edge AI research, Jon expresses caution in how The AI Scientist might be used and applied. The framework is not free of the errors and hallucinations typical of groundbreaking AI tools, and he emphasizes the importance of AI-generated papers always being labelled as such. Jon also wants listeners to heed the potentially extreme risks if or when using the associated GitHub repo. Jon says, “This [...] could be especially dangerous if a system with autonomy like The AI Scientist had access to a robotic wet lab where real-world experiments are being run because it could end up manufacturing, say, novel and dangerous viral pathogens.” [10:08]

Nevertheless, the Japan-based startup Sakana AI has grand plans for its framework and is considering its applications in fields beyond machine learning, from chemistry to materials science.

Listen to how The AI Scientist works, including its cost-effectiveness, the models it uses, and how the quality of a project’s findings can be “peer-reviewed” in this fascinating episode and new entry into a growing roster of solutions to the world’s most pressing challenges.

ITEMS MENTIONED IN THIS PODCAST:
"The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" (paper)
The AI Scientist GitHub repo
Ars Technica's coverage of The AI Scientist's safety concerns
SDS episode 798 (on Claude 3.5 Sonnet)
SDS episode 806 (on Llama 3.1 405B)

(00:05): This is Five-Minute Friday on The A.I. Scientist. 

(00:19): Welcome back to The Super Data Science Podcast. I'm your host, Jon Krohn. I'm in a big rush this week, so we're going to skip the Apple Podcasts reviews we've received in recent weeks; we'll get back to those sometime soon. This week, we're jumping right into the meat of the episode: a team of researchers from a company called Sakana AI, a Japanese AI startup founded last year by Google alumni and reportedly valued at over a billion dollars in June by some big VCs in the Bay Area, this week published a paper titled "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery." That paper is making big waves and could revolutionize how we conduct scientific research as a society.

(01:22): The paper's authors, who in addition to folks from Sakana hail from the University of Oxford, the University of British Columbia, and the Vector Institute in Toronto, imagine a world where AI systems can independently generate novel research ideas, design and run experiments, analyze results, and even write up full scientific papers. And, while far from perfect, this new paper presents a comprehensive framework called The AI Scientist that aims to do exactly that: to automate the entire scientific discovery process from start to finish.

(01:53): Here's how it works: The AI Scientist uses large language models, surprise, surprise. Specifically, in this paper they use either GPT-4o from OpenAI, Claude 3.5 Sonnet from Anthropic (if you want to hear more about that model, you can listen to episode number 798 of this podcast), or the open-source Llama 3.1 405B (we've got an episode on that one too, number 806). So one of those three LLMs does the heavy lifting. These LLMs have already shown impressive capabilities in assisting human scientists with specific individual tasks like brainstorming ideas or writing code, but The AI Scientist takes things several steps further by chaining so many steps of the scientific process together.

(02:47): The system starts by generating novel research ideas in a given field. It then designs experiments to test these ideas, it writes the necessary code, and it executes the experiments. After collecting and analyzing the results, The AI Scientist writes up a full scientific paper, composed with LaTeX, describing its findings, complete with figures and proper formatting.
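To make that flow concrete, here's a minimal sketch of the idea-to-paper loop in Python. To be clear, this is my own illustration of the stages as just described, not code from the Sakana repo: the function names and the simple prompt-in, string-out LLM interface are assumptions.

```python
from typing import Callable

LLM = Callable[[str], str]  # any chat model wrapped as prompt -> response

def ai_scientist_cycle(llm: LLM, field: str) -> str:
    """One end-to-end research cycle; returns a paper as LaTeX source."""
    # 1. Idea generation: propose a novel, testable research direction.
    idea = llm(f"Propose one novel, testable research idea in {field}.")

    # 2. Experiment design, then code to implement the design.
    plan = llm(f"Design a concrete experiment to test this idea:\n{idea}")
    code = llm(f"Write runnable Python code for this experiment plan:\n{plan}")

    # 3. Execution: the real system runs the generated code and collects
    #    metrics; here that step is represented only by a placeholder.
    results = run_sandboxed(code)

    # 4. Analysis and write-up: a full LaTeX paper, figures included.
    analysis = llm(f"Summarize and interpret these results:\n{results}")
    return llm(f"Write a complete LaTeX paper.\nIdea: {idea}\nFindings: {analysis}")

def run_sandboxed(code: str) -> str:
    """Placeholder: execute LLM-written code only in an isolated environment."""
    raise NotImplementedError("containerize before running generated code")
```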

(03:09): Separately, in order to evaluate what The AI Scientist was creating, the researchers also developed an AI-powered review system to assess the quality of the generated papers. This automated reviewer provides feedback and scores that are supposed to be comparable to those of human reviewers at top machine learning conferences like NeurIPS.
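For a rough sense of what an LLM-as-reviewer can look like, here's a hedged sketch; the scoring axes, JSON schema, and acceptance threshold below are my assumptions for illustration, not the paper's actual review template.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # prompt in, response out

# The axes and JSON schema here are illustrative; the paper's actual
# reviewer follows conference-style (NeurIPS-like) review guidelines.
REVIEW_PROMPT = """You are reviewing a submission to a top ML conference.
Rate the paper from 1-10 on soundness, novelty, clarity, and significance,
then give an overall score. Respond with JSON only, e.g.:
{{"soundness": 6, "novelty": 7, "clarity": 6, "significance": 5, "overall": 6}}

Paper:
{paper}
"""

def review_paper(llm: LLM, paper_tex: str) -> dict:
    """Ask the LLM for a structured review and apply an acceptance bar."""
    review = json.loads(llm(REVIEW_PROMPT.format(paper=paper_tex)))
    # Mimic a conference threshold on the overall score (assumed cutoff).
    review["accept"] = review["overall"] >= 6
    return review
```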

(03:33): In the paper, the team went on to demonstrate The AI Scientist's capabilities across three distinct areas of machine learning research: diffusion modeling, transformer-based language modeling, and learning dynamics. Based on judgments by their own AI reviewer, so perhaps take this with a grain of salt, they found that across all three of those domains, The AI Scientist could produce papers that exceed the acceptance threshold for top ML conferences, again like NeurIPS.

(04:03): And if you happen to be watching the YouTube version of this, I actually have Figure 4 up on the screen right now, or you can check out Figure 4 yourself. In that figure, across the three machine learning areas they had The AI Scientist research in, they show the reviewer scores for the different LLMs: Claude 3.5 Sonnet, GPT-4o, and Llama 3.1. And it's interesting: at least for this use case, with the way these researchers set up The AI Scientist, Anthropic's LLM, Claude 3.5 Sonnet, is by far the best. And indeed, when you check out the paper or the GitHub repo, you can see that they primarily used Claude outputs for the example papers they generated.

(05:03): Moving on to another part of the paper, specifically Table 3, you can see hard data for one of the research areas, diffusion modeling, for which they created a bunch of papers. All of the models, whether Claude 3.5 Sonnet, GPT-4o, or Llama 3.1 405B, were able to generate lots of ideas, but not all of those ideas were novel. Claude generated the most novel ideas. Claude also got the most experiments correctly written in code, so that the experiments passed and resulted in completed papers. For example, in terms of completed papers, Claude 3.5 Sonnet created 38, whereas GPT-4o created less than half as many, only 16. Interestingly, the AI reviewer they created also gave a higher mean score to Claude's papers, and Claude cost a little bit less to run across all of those ideas and papers. The numbers here are really small: we're talking about $250 for Claude 3.5 Sonnet to generate 51 ideas, have 38 experiments pass, and write 38 completed papers.

(06:29): GPT-4o was a little more expensive at $300. So some interesting data points there, and they square with my own personal preference right now for using Claude 3.5 Sonnet for most use cases.

(06:45): Anyway, on that note, one of the most impressive aspects of this whole AI Scientist system is its cost-effectiveness. Based on the kinds of numbers I just cited, the researchers report that The AI Scientist can generate full research papers for as little as $15 each. This kind of price point could dramatically expand access to cutting-edge research capabilities, although note that the ML research itself would typically cost at least multiple orders of magnitude more than $15 to execute. So it's the paper writing that could become much, much cheaper, and that cost will keep going down as LLMs continue to become more of a commodity and more powerful. Really cool on the cost front.
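As a quick sanity check on those figures, here's the back-of-envelope arithmetic implied by the approximate Table 3 numbers quoted above:

```python
# Back-of-envelope cost per completed paper, using the approximate
# totals and paper counts for diffusion modeling quoted above.
runs = {
    "Claude 3.5 Sonnet": {"total_usd": 250, "papers": 38},
    "GPT-4o": {"total_usd": 300, "papers": 16},
}
for model, r in runs.items():
    print(f"{model}: ~${r['total_usd'] / r['papers']:.2f} per completed paper")
# -> Claude 3.5 Sonnet: ~$6.58 per completed paper
# -> GPT-4o: ~$18.75 per completed paper
# The episode's "as little as $15 each" headline sits within this range.
```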

(07:27): However, of course, like any cutting-edge research, there are tons of limitations and ethical considerations to work through here. The current version of The AI Scientist is prone to errors and hallucinations, and there are also valid concerns about how this technology could impact the scientific publishing ecosystem if we end up with tons of generated papers creating noise out there. The researchers, for example, emphasize that papers generated by AI systems should be clearly labelled as such. It does leave me wondering, though: if they become good enough, how do you tell, and how do you police that?

(08:02): Looking to the future, the authors envision expanding The AI Scientist's capabilities to other scientific domains beyond just machine learning. All of the example papers across the three distinct areas I mentioned earlier, diffusion modeling, transformer-based language modeling, and learning dynamics, are machine learning research, so they envision expanding beyond that in the future. The framework could even be integrated with robotic lab automation, so that systems similar to The AI Scientist could one day conduct actual physical real-world experiments in fields like biology, chemistry, and materials science.

(08:46): I encourage you to check out the full paper for more details on this fascinating AI Scientist development, including to see for yourself full examples of the generated ML papers… some of which are, at least superficially, compelling enough to pass as designed and written up by human experts. I've got a link to the paper in the show notes, of course, as well as a link to the associated GitHub repo; the paper authors generously open-sourced all of their code. On a final note, be careful using that repo! It includes an ominous warning, which I'm going to read in full: “Caution! This codebase will execute LLM-written code. There are various risks and challenges associated with this autonomy. This includes e.g. the use of potentially dangerous packages, web access, and potential spawning of processes. Use at your own discretion. Please make sure to containerize and restrict web access appropriately.”
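In that spirit, here's a minimal sketch of one way to follow the repo's advice: launching the codebase from Python inside a locked-down Docker container. The image name and entry-point script are assumptions on my part, and note that disabling networking entirely would also block LLM API calls, so in practice you'd allow only your model provider's endpoints.

```python
import os
import subprocess

# Hypothetical launch of The AI Scientist inside a locked-down container.
# Image name and entry point are assumptions; adapt to the actual repo.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",   # no web access for LLM-written code
        "--memory", "8g",      # cap RAM so runaway processes can't exhaust it
        "--cpus", "4",         # cap CPU
        "--read-only",         # filesystem immutable except explicit mounts
        "-v", f"{os.getcwd()}/results:/workspace/results",  # only writable dir
        "ai-scientist:local",  # hypothetical locally built image
        "python", "launch_scientist.py",
    ],
    check=True,
)
```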

(09:42): So that's the full warning, and be very careful to follow it because, as covered by the mainstream press, The AI Scientist displayed some fairly concerning power-seeking behaviors with implications for AI safety, including, for example, editing its own code to remove time constraints on how long a given agentic process can run for, allowing The AI Scientist to potentially consume far more resources than its human creators intended. This kind of power-seeking behavior could be especially dangerous if a system with autonomy like The AI Scientist had access to a robotic wet lab, a real lab where real-world experiments are being run, because then it could end up manufacturing, say, novel and dangerous viral pathogens without us knowing about it. Yeah, so that is definitely concerning, and I've got a link to an Ars Technica article in the show notes as well for reading more about these AI safety concerns.
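One standard mitigation for exactly that self-editing loophole is to enforce the time budget from a parent process that the generated code can't modify. A minimal sketch, with an illustrative script name:

```python
import subprocess

# Enforce the wall-clock budget outside the generated code itself, so the
# model can't simply edit the constraint away. Script name is illustrative.
try:
    subprocess.run(
        ["python", "generated_experiment.py"],
        timeout=3600,  # hard one-hour limit; the parent kills it on expiry
        check=True,
    )
except subprocess.TimeoutExpired:
    print("Experiment exceeded its time budget and was terminated.")
```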

(10:35): Despite the risks, I, as usual, remain optimistic that we can work out safeguards around the most dangerous risks and that this research represents a significant step forward towards realizing the potential of AI to be creative and productive in scientific discovery. While it's unlikely to fully replace human scientists anytime soon, The AI Scientist and systems like it could become powerful tools to accelerate innovation and tackle some of the world's most pressing challenges from clean-energy production to food security to healthcare. 

(11:06): All right, that's it for today's episode. If you enjoyed today's episode or know someone who might, consider sharing it with them, leave a review of the show on your favorite podcasting platform, tag me in a LinkedIn or Twitter post with your thoughts (I'll respond to those), and if you aren't already, be sure to subscribe to the show. Most importantly, however, I hope you'll just keep on listening. Until next time, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.
