The Mirage of Autonomous AI Scientists
Science as AI’s killer application cannot succeed without scientist-AI interaction: Introducing Hypogenic.ai.
Science is not inherently valuable. Most species on earth assign no value to scientific discoveries: a tree doesn’t prioritize climate research, and a bacterium doesn’t seek to understand evolution. What we choose to study, what we consider an important problem, and what we deem as a breakthrough all reflect human values, priorities, and needs. Without human judgment, discovery loses its meaning, its purpose, and its worth.
Yet recent efforts to build “AI scientists” often miss this fundamental point. Multiple labs are racing to develop systems that autonomously generate hypotheses, run experiments, and write papers (e.g., Sakana AI and AI Researcher). The implicit assumption is that more automation equals faster progress. But if science derives its value from human judgment, can we truly automate it away?
I believe the answer is a resounding “No.” AI will not replace human scientists; its real potential lies in reshaping how science is done. As AI expands the search space and takes over routine production tasks, the role of scientists will shift toward selection and evaluation. To look beyond the mirage of the “autonomous AI scientist” and reimagine how science moves forward, I’ll make the case in three parts:
Science is about resource allocation, not just automation.
Accountability in selection and evaluation is what separates science from AI slop.
Technical advances must support selection and evaluation, not just production.
Science is about resource allocation, not just automation
To see why the idea of autonomous AI scientists is a mirage, we need to reframe science as a problem of resource allocation rather than pure intelligence. AI leaders typically portray science as an intelligence problem: build smart enough AI, and breakthroughs follow. However, intelligence alone is unlikely to be enough. Even if genuine breakthroughs are buried in the millions of papers that AI generates, scientists may never recognize them given their limited attention. Moreover, AI scientists can be costly. Computational research already consumes vast amounts of compute, and many disciplines also depend on resource-intensive real-world experiments — from biology labs to climate fieldwork. AI doesn’t make those cheaper; instead, it risks multiplying the burden by generating ever more hypotheses that require testing. As Kapoor and Narayanan elegantly argued, AI may even slow science down precisely because it creates more work for the resource-constrained scientific process. The observation that actual progress has remained constant or even slowed despite the dramatic increase in the rate of publication is known as the production-progress paradox.
But the real issue isn’t just the expense. At its core, science is a problem of resource allocation: deciding what matters among infinite possibilities with limited time, attention, and funding. Every choice reflects priorities at multiple levels: funding agencies choose which proposals to fund, and scientists choose which ideas to pursue, which hypotheses to test, which experiments to run, and which papers to read or write.
These choices can’t be delegated to machines. They’re inherently social and value-laden. What makes a problem “important” depends on personal goals, ethical considerations, and collective priorities. Even if AI becomes effective at making some of these choices, it cannot be held accountable for those choices. Accountability requires human judgment and ownership.
This resource allocation perspective points to two key roles: selector (making resource allocation decisions) and evaluator (gathering information to inform those decisions). As AI takes over more production tasks, these become the central responsibilities of scientists. We will need new infrastructure to support these roles.
Accountability in selection and evaluation is what separates science from AI slop
If resource allocation is the bottleneck, then accountability is the principle that keeps those allocations meaningful. That’s why selection and evaluation will become the scientist’s defining roles. What does it mean to be a selector or evaluator?
With AI scientists handling many production tasks, human scientists can dedicate more effort to selection throughout the research process. This includes choosing among research ideas generated by AI, deciding which hypotheses to pursue from those identified through literature, data, or simulation, and selecting implementation strategies for promising directions. A key point is that selection is not a one-time decision at the start, but a continuous process of judgment as research unfolds. Which directions show promise? Which should be abandoned? Human intuition, values, and priorities are critical in these decisions when resources are limited. This applies not only to advancing human understanding, but even to seemingly clear goals like curing cancer, since we cannot run every possible clinical trial.
As AI generates more hypotheses, experimental plans, and code, scientists must rigorously evaluate these outputs. This means checking hypotheses for novelty, importance, and feasibility, detecting methodological flaws in AI-designed experiments, catching errors in AI-written code before they propagate, and assessing results before they cascade through the literature. This evaluation challenge is an instantiation of the scalable oversight problem: how do we maintain rigorous quality control when AI dramatically increases the volume of scientific output? We already see this challenge in the replication crisis. As AI accelerates generation, evaluation becomes increasingly critical.
Selection and evaluation happen at every step in the iterative process of science. For selection to work, scientists must “believe in” the idea; for evaluation to work, they must take responsibility for the results they publish. You can’t hide behind “the AI said so.” This accountability is what distinguishes the future of science from a world of AI-generated noise. The Virtual Lab of AI agents is a good example: scientists determine the goal, work with AI agents throughout, and thorough validation culminates in a nice paper in Nature.
Revisiting the Production-Progress Paradox
The emphasis on selection and evaluation could address the production-progress paradox. Progress in science comes from deep comprehension, not from producing more papers. When scientists invest effort in careful selection and think deeply about which directions matter and why, they build genuine understanding of the problem space. When they rigorously evaluate results, scrutinizing methodology, assessing validity, and connecting findings to their broader context, they deepen their grasp of what the results actually mean.
An analogy is the “forklift at the gym” problem: if you want to build strength, automating the lifting defeats the purpose. Similarly, automating away the process of understanding defeats the purpose of science itself. Our vision avoids this trap through a specific division of labor. AI eases generation and production, expanding the search space exponentially and bringing more possibilities to examine. Humans handle judgment and accountability, deciding what matters, evaluating quality, and taking ownership of those choices. This is using the forklift to bring more weights to the gym, not to lift them for you.
Technical advances must support selection and evaluation, not just production
Current AI research focuses heavily on automating production: better models, faster inference, more autonomous systems. But if scientists’ essential role is shifting to selection and evaluation, we also need tools and systems that help scientists perform these new roles effectively. Such advances must achieve three goals:
Augment selection. Tools should enhance scientists’ ability to select, while keeping humans in decision-making roles.
Scale up evaluation. As AI eases production, infrastructure must scale up human evaluation capabilities to match.
Incentivize wise selection over mere production. Create systems that reward good judgment and careful evaluation, not just output volume.
Hypogenic.ai: A First Step
We built hypogenic.ai to support this shift (shoutout to Haokun Liu!). hypogenic.ai currently supports two core features designed specifically to help with the selector role:
IdeaHub is a social platform for idea selection. Scientists can rate and evaluate both AI-generated and human-generated research ideas, comment on them, and indicate interest. The platform enables community evaluation of which directions are promising. Selection is not constrained to AI-generated or popular ideas: scientists can focus on ideas that receive little attention or are explicitly counterintuitive. You can also create an organization and use IdeaHub to share ideas within your own research group.
Ideation assistant helps with generating ideas. It dynamically engages in ideation and hypothesis generation, handles other requests as usual, and lets you share the resulting ideas on IdeaHub or keep them private.
We believe that this platform can open up new ways to start research projects and collaborate with people. We are very early in the process. Your feedback is highly valuable.
Addressing Common Questions
Q: Aren’t AI-generated ideas just more AI slop?
This is precisely why we emphasize the selector role. AI can surface possibilities that might otherwise be overlooked (e.g., the famous move 37 by AlphaGo), thus broadening the search space. But quantity does not equal quality: only through human selection and evaluation can those raw outputs become meaningful ideas rather than unfiltered noise.
As noted in “Could AI slow science,” which also makes excellent observations about positive use cases, AI can potentially make studies more replicable when used appropriately in production.
Q: Does this ruin student training?
This tool isn’t designed for student training in the traditional sense. However, students will increasingly need to develop their skills in selection and evaluation, and this tool can help them explore the space of AI-generated ideas and engage in related discussions.
More broadly, training requires its own dedicated tools, not just the same ones used for professional research. Students still need to practice production skills, such as designing experiments, writing code, and replicating results, but these training tasks can be separated from the workflows of professional science. AI makes this separation easier: we can design different systems for learning and practice, while reserving tools like IdeaHub for selection and evaluation in real research. Similar issues arise in peer review, where the training of future reviewers need not be built directly into today’s decision-making processes.
Q: Will my ideas get stolen if I share them on IdeaHub?
You have control over the visibility of ideas, which can stay in chat, be visible only to yourself, or be shared within an organization on IdeaHub. Furthermore, I believe that in the long run, credit assignment in science also needs to change, and IdeaHub can be part of that transformation. Identifying and proposing good ideas deserves credit in its own right; credit shouldn’t only go to production, since selection and evaluation matter too.
By creating a record of who proposed important ideas and identified promising directions early, IdeaHub could reshape how we evaluate contributions. Future grantmaking could also explore mechanisms of incorporating tools such as IdeaHub in the process.
What Else Needs to Happen: The Broader Ecosystem
Looking forward, the future of science requires much more. To give some examples:
AI systems should generate effective hypotheses for human selection from data, literature, and other computational approaches.
AI systems should suggest complete implementation plans while preserving human judgment on priorities.
Automation should handle execution tasks while humans retain accountability for key decisions.
Publication mechanisms need rethinking. AI-run venues like Agents4Science can surface new ideas and workflows, and we need to understand how they complement existing publication systems.
Funding mechanisms should allocate resources to encourage and reward selection and evaluation capacity.
Academic systems need reform to value contributions to understanding and to create career paths for evaluators and infrastructure builders.
We are working on a roadmap and curating other relevant resources. Please join us/email me if you are interested!
Conclusion: The Future of Science Lies in Human Judgment
The mirage of the autonomous AI scientist is tempting, but science without human judgment is not science at all. AI can expand the search space, but the bottleneck will shift to human accountability in selection and evaluation. If we build the right tools, incentives, and norms, AI can accelerate discovery; if not, AI may overwhelm the scientific process. The future of science will not be defined by autonomous machines, but by communities of scientists who take responsibility for discovery in partnership with AI.
I am grateful for valuable input from Davi Costa, Raul Castro Fernandez, James Evans, Ian Foster, Cristina Garbacea, Ari Holtzman, Xiao Liu, Hao Peng, Amit Sharma, and Ted Underwood.
If you find this article helpful, please cite:
@misc{tan2025mirage,
  author       = {Tan, Chenhao},
  title        = {The Mirage of Autonomous {AI} Scientists},
  howpublished = {Communication \& Intelligence (Substack)},
  year         = {2025},
  month        = {October},
  url          = {https://cichicago.substack.com/p/the-mirage-of-autonomous-ai-scientists}
}



