Prompting as Scientific Inquiry
Turns out asking nicely is a research method
Ari and Chenhao
Large language models have given us an unprecedented window into machine intelligence, and the primary way we've been looking through that window is prompting. Yet despite the fact that prompting has driven most major breakthroughs in our understanding of LLMs, it is often dismissed as "dark magic" or unscientific hackery. This is an aesthetic bias, and it is holding scientists back.
In our recent paper, we argue that prompting deserves recognition as legitimate scientific inquiry—not a workaround, but a fundamental method for understanding and controlling LLMs. The confusion stems from conflating two very different activities that both involve writing prompts: prompt engineering and prompt science.
Prompt Engineering vs. Prompt Science
Prompt engineering is the brute-force optimization of prompts for specific tasks: finding what works without caring why. Prompt science, on the other hand, uses prompts to discover and test hypotheses about how LLMs actually behave, first through exploratory prompting to surface new behavior and then through prompt studies. Prompt engineering is about performance; prompt science is about understanding.
Nearly every emergent capability we've discovered in LLMs was first unlocked through prompting. In-context learning, chain-of-thought reasoning, self-improvement through augmented data—these weren't predicted by architecture papers or found through weight analysis. They were found via experimentation with model inputs and outputs.
Behind the hand-waving dismissals lies an uncomfortable truth: prompting reveals model capabilities we never knew to look for, while interpretability has, so far, largely confirmed hypotheses we've already formulated. We say this as researchers who love interpretability and publish interpretability papers—the point is that we should recognize what different tools can give us.
The Great Rebranding
The bias against prompting is part of our field's persistent preference for optimization over exploration, for mathematical elegance over empirical discovery. Figure 2 shows a timeline with a predictable pattern: every time a prompting method proves genuinely transformative, it gets distanced from the "prompting" label. Chain-of-thought becomes "inference-time compute." Structured prompting becomes "programming foundation models." Data augmentation via synthetic data gets framed as a training methodology, with the recipe for such data often avoiding the word “prompting.”
Won’t Mechanistic Interpretability Tell Us What’s Actually Happening?
No. Or—kind of, but not in a satisfactory way.
As Figure 3 emphasizes, prompting and mechanistic approaches are complementary methods operating at different levels of analysis. Mechanistic interpretability excels at the implementation level, showing us how specific weights and activations realize particular functions. Prompting shines at the computational level, revealing what capabilities models have and why they might have developed them.
If we discovered an intelligent alien species, we would learn many things from playing simple card games with them and observing their reasoning patterns that we would have trouble learning by dissection. Similarly, prompting allows us to probe LLMs through their optimized communication channel, revealing capabilities and limitations that might remain hidden in their weights and activations if we don't know what to look for.
From this perspective, LLMs have been handed to us on a silver platter: they already speak approximately the same language we do. Shouldn't we be excited to be able to probe them with it?
Addressing the Skeptics
Figure 1 outlines the common critiques against prompting, and we address each systematically in the paper. Critics often point to prompt brittleness, lack of mathematical formalism, or concerns about generalization. But these objections miss the point. Prompt brittleness may make prompt engineering more difficult, but it makes the science of prompting significantly more meaningful. Understanding how human intentions in prompts don't perfectly transfer to model outputs is a key mechanism for cataloging model capabilities, rather than merely an error to be corrected.
The probability distributions emitted by LLMs are as much an objective part of their mechanistic description as anything else. Prompt science is falsifiable and scientifically rigorous when it is done correctly.
If we only had prompting at our disposal—if LLMs were truly black boxes—we would likely have made greater strides in probing their behavior systematically, designing experiments with rigor and creativity to uncover their internal structure through inputs and outputs alone.
What’s Next?
If you choose to take prompting seriously as a form of scientific inquiry, there are a number of directions we believe are still understudied, even in such a booming field.
Prompting as mechanism discovery. Imagine using systematic prompt perturbations to map exactly where model capabilities break down, potentially revealing the hidden structure of how knowledge is organized internally. This could give mechanistic interpretability researchers much clearer targets to investigate.
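As a toy illustration of what such a perturbation study might look like, here is a minimal harness. Everything in it is a hypothetical sketch: the perturbation operators are invented examples, and `toy_model` is a deliberately brittle stand-in for a real LLM call, not the paper's method.

```python
# Hypothetical perturbation operators: each maps a prompt to a variant
# that differs along one axis (casing, clause order, distraction).
PERTURBATIONS = {
    "identity": lambda p: p,
    "lowercase": lambda p: p.lower(),
    "reorder_clauses": lambda p: ". ".join(
        reversed(p.rstrip(".").split(". "))
    ) + ".",
    "add_distractor": lambda p: p + " Note: the sky was blue that day.",
}

def perturbation_map(prompt, model, expected):
    """Apply each perturbation and record whether the model still succeeds.

    `model` is any callable from prompt -> answer string. The returned dict
    maps perturbation names to booleans, sketching where capability breaks.
    """
    return {name: model(op(prompt)) == expected
            for name, op in PERTURBATIONS.items()}

def toy_model(prompt):
    # A brittle mock "model": it only succeeds when the instruction
    # appears verbatim at the start of the prompt.
    return "6" if prompt.startswith("Add 2 and 4") else "?"

result = perturbation_map(
    "Add 2 and 4. Reply with the number only.", toy_model, "6"
)
# The resulting map shows which perturbations break the (toy) capability:
# order and casing do, while an appended distractor does not.
```

A real study would substitute an actual LLM API for `toy_model` and aggregate over many prompts per capability, but the shape of the experiment is the same.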
Prompting for Effective Control of LLMs. What if we could identify a minimal set of "control primitives" in prompt space: basic building blocks that could be composed to achieve any behavioral modification we want? It's an open question whether such primitives even exist, but the implications for AI safety and alignment could be profound. If they don't exist, the implication is that prompt-based control is limited to a certain subspace of behaviors, which is perhaps even more interesting.
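To make "composition in prompt space" concrete, here is a toy sketch. The primitive names and instruction fragments are invented for the example; whether a small basis like this actually spans the space of desired behaviors is exactly the open question above.

```python
# Hypothetical "control primitives": reusable prompt fragments that get
# stacked into a single instruction block.
PRIMITIVES = {
    "terse": "Answer in one sentence.",
    "cite": "Cite a source for every factual claim.",
    "refuse_medical": "Decline to give medical advice.",
}

def compose(*names):
    """Combine primitives into one instruction block, preserving order."""
    return "\n".join(PRIMITIVES[n] for n in names)

system_prompt = compose("terse", "cite")
```

The scientific questions hide in what this sketch glosses over: whether primitives compose predictably (does "terse" + "cite" behave like both, or interfere?) and whether any finite set of them covers the behaviors we care about.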
Prompting as future-proofing. Whatever tomorrow's AI systems look like—different architectures, new modalities, radically enhanced capabilities—we'll likely still need language to communicate our intentions. That’s because it’s the best way that humans know how to articulate their desires. Understanding the fundamental principles of how minds communicate across intelligence gaps through prompting could prove invaluable as AI systems become vastly more capable than humans.
The symbiosis between prompting and mechanistic interpretability will grow if prompting is taken more seriously. Prompting discovers behavioral anomalies, and mechanistic work formalizes them so that they become mathematically manipulable. Without behavioral discovery to provide hypotheses and targets for interpretability, we are unlikely to scale interpretability to the kinds of explanations and guarantees many of us are dreaming of. To do that, we need to accept that prompting is classical exploratory science at its best: it's a key component in the science of LLMs, and it's time we started treating it that way.
@article{holtzman2025prompting,
title={Prompting as Scientific Inquiry},
author={Holtzman, Ari and Tan, Chenhao},
journal={arXiv preprint arXiv:2507.00163},
year={2025}
}


