Formalising Prompts as First-Class Research Objects
Why treating prompts as typed, portable artefacts changes how we reason about LLM behaviour — and how promptel implements this idea.
Most LLM-powered systems treat prompts as strings. They live inside application code, concatenated at runtime, shaped by whatever felt right during a debugging session at 2am. They are rarely versioned independently of the code that hosts them. They are almost never tested in isolation. And when something breaks — when the model starts producing garbage after what should have been a routine deployment — nobody can point to the diff that caused it.
This is not an acceptable state of affairs for a core component of system behaviour.
We argue that prompts deserve the same engineering rigour that we apply to source code: explicit types, version control, portability guarantees, and formal specifications. This is the thesis behind promptel, our declarative prompt specification language, and the extraction tooling in blogus.
The Problem with Prompts-as-Strings
Consider a typical LLM integration. Somewhere in your codebase, buried inside a function, there is a string literal or template that constitutes the actual instruction to the model. It might look like this:
def summarise(text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise summariser. "
                                          "Output exactly 3 bullet points. No preamble."},
            {"role": "user", "content": f"Summarise:\n\n{text}"}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content
This code conflates at least four concerns: the prompt text, the model selection, the inference parameters, and the I/O plumbing. The prompt itself — the actual instruction that determines system behaviour — is tangled into the application logic. You cannot version it independently. You cannot test it without invoking the full function. You cannot port it to a different provider without rewriting the integration code.
Worse, there is no schema. Nothing declares what text should look like, what the expected output shape is, or what constraints the prompt assumes. The contract between the prompt author and the model is entirely implicit.
Prompts as Typed Artefacts
promptel takes a different approach. A prompt is a standalone, declarative specification:
# summarise.prompt.yaml
kind: Prompt
version: "1.0"
name: summarise
description: "Produce a 3-bullet summary of input text."
input:
  type: object
  properties:
    text:
      type: string
      minLength: 50
      description: "The text to summarise."
output:
  type: array
  items:
    type: string
  minItems: 3
  maxItems: 3
system: |
  You are a concise summariser.
  Output exactly 3 bullet points. No preamble.
user: |
  Summarise:

  {{text}}
parameters:
  temperature: 0.3
  max_tokens: 256
providers:
  - openai:gpt-4
  - anthropic:claude-3-sonnet
  - local:llama-3-8b
Several things are different here. The prompt has a name and a version. Its input and output are typed using JSON Schema. The template variables are explicit. The inference parameters are declared alongside the prompt, not buried in application code. And the provider list makes portability an explicit concern — this prompt is designed to run on multiple backends.
This is what we mean by treating prompts as first-class objects. They have identity, structure, and a defined interface.
Why Types Matter
Typed inputs and outputs enable several things that string prompts cannot support.
Validation before inference. If the input schema declares minLength: 50, we can reject trivially short inputs before they ever reach the model. This is not just an optimisation — it is a correctness guarantee. A summarisation prompt given a three-word input will produce nonsense. Catching this at the schema level is cheaper and more reliable than hoping the model handles it gracefully.
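This pre-inference check can be sketched with the jsonschema library and the input schema from the spec above. The `validate_input` helper is our illustration, not part of promptel's API, and we add a `required` key for strictness that the original spec leaves implicit:

```python
from jsonschema import ValidationError, validate

# Input schema from the summarise spec above, with an explicit
# "required" constraint added for strictness.
INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string", "minLength": 50},
    },
    "required": ["text"],
}

def validate_input(payload: dict) -> bool:
    """Reject malformed inputs before any tokens are spent on inference."""
    try:
        validate(instance=payload, schema=INPUT_SCHEMA)
        return True
    except ValidationError:
        return False
```

A three-word input fails the `minLength` check immediately, for free, instead of producing a nonsense summary at full inference cost.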
Output verification. If the output schema declares an array of exactly three strings, we can validate the model’s response structurally. Did it actually produce three bullets? Is each one a string? This enables automated regression testing without requiring human evaluation for every case.
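The corresponding structural check on the response side might look like this, assuming the model is asked to emit its bullets as a JSON array (the `verify_output` name is ours):

```python
import json

from jsonschema import ValidationError, validate

# Output schema from the summarise spec: exactly three strings.
OUTPUT_SCHEMA = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 3,
    "maxItems": 3,
}

def verify_output(raw: str) -> bool:
    """Structurally validate a model response without human evaluation."""
    try:
        parsed = json.loads(raw)
        validate(instance=parsed, schema=OUTPUT_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```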
Contract-driven development. When the prompt’s interface is explicit, downstream code can depend on it. A function that consumes the summariser’s output knows it will receive three strings. If someone changes the prompt to produce five bullets, the schema violation is caught before deployment, not in production.
Portability and the Provider Problem
One of the less-discussed problems with prompts-as-strings is provider lock-in. A prompt tuned for GPT-4's instruction-following style may produce poor results on Claude or Llama. But because the prompt is embedded in provider-specific code, switching providers means rewriting the integration layer.
promptel addresses this by separating the prompt specification from the provider binding. The providers field is a declaration of compatibility, not an implementation detail. The runtime layer handles the translation — mapping the prompt template to each provider’s message format, adjusting parameters where semantics differ.
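That translation step can be sketched as follows. The message shapes reflect the real OpenAI and Anthropic chat APIs, but `to_provider_request` and the spec dictionary are our illustration, not promptel's actual runtime:

```python
def to_provider_request(spec: dict, variables: dict, provider: str) -> dict:
    """Map one prompt spec onto a provider-specific request payload."""
    user_text = spec["user"].replace("{{text}}", variables["text"])
    model = provider.split(":", 1)[1]
    if provider.startswith("openai:"):
        # OpenAI chat format: the system prompt travels inside the
        # messages list.
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": spec["system"]},
                {"role": "user", "content": user_text},
            ],
            "temperature": spec["parameters"]["temperature"],
        }
    if provider.startswith("anthropic:"):
        # Anthropic Messages format: the system prompt is a top-level
        # field, and max_tokens is mandatory.
        return {
            "model": model,
            "system": spec["system"],
            "messages": [{"role": "user", "content": user_text}],
            "max_tokens": spec["parameters"]["max_tokens"],
            "temperature": spec["parameters"]["temperature"],
        }
    raise ValueError(f"unsupported provider: {provider}")
```

The prompt specification stays identical; only the final serialisation differs per backend.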
This does not solve the fundamental problem that different models respond differently to the same text. But it does make the problem tractable. You can now test the same prompt specification against multiple providers and compare results systematically, rather than maintaining parallel implementations that diverge silently.
Extracting Prompts from Existing Codebases
For most teams, the prompts-as-typed-artefacts vision runs into an immediate practical obstacle: existing codebases already contain hundreds of prompts embedded in application code. Rewriting everything from scratch is not realistic.
This is where blogus fits in. blogus is a Python tool that performs static analysis on codebases to extract prompt patterns. It identifies LLM API calls, extracts the prompt text and parameters, and emits promptel-compatible specifications.
The extraction is necessarily heuristic — prompts constructed dynamically at runtime cannot always be fully resolved statically. But in practice, a large proportion of prompts follow predictable patterns (string literals, f-strings with known variables, template rendering) that blogus can handle.
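blogus's actual implementation is more involved, but the core pattern-matching idea can be sketched with Python's ast module: walk the syntax tree, match calls whose attribute chain ends in `.create`, and pull out literal message content. This toy version (ours, not blogus's code) handles only the plain-string-literal case:

```python
import ast

def extract_prompt_literals(source: str) -> list[str]:
    """Collect string literals passed as 'content' inside .create(...) calls.

    A toy version of static prompt extraction: it handles plain string
    constants only, not f-strings or runtime template rendering.
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        # Match calls like client.chat.completions.create(...).
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "create"):
            continue
        for kw in node.keywords:
            if kw.arg != "messages":
                continue
            # Look for {"role": ..., "content": "literal"} dictionaries.
            for sub in ast.walk(kw.value):
                if not isinstance(sub, ast.Dict):
                    continue
                for key, value in zip(sub.keys, sub.values):
                    if (isinstance(key, ast.Constant) and key.value == "content"
                            and isinstance(value, ast.Constant)
                            and isinstance(value.value, str)):
                        found.append(value.value)
    return found
```

Adjacent string literals (as in the summarise example earlier) are already merged into a single constant by the parser, so that common pattern falls out for free.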
The output is a catalogue of the prompts a codebase actually uses, each with its inferred type information and provider bindings. This catalogue serves as a starting point: review it, correct the inferences, and you have a versioned prompt library without a ground-up rewrite.
Implications for Reproducibility
Reproducibility is a persistent challenge in LLM research. Results depend on the exact prompt text, the model version, the inference parameters, and often the system prompt or few-shot examples. Papers frequently omit one or more of these, making replication difficult.
promptel specifications are designed to be self-contained. A .prompt.yaml file captures everything needed to reproduce a particular LLM interaction: the full prompt text, the template variables, the parameter settings, and the intended providers. You can commit it to a repository, attach it to a paper, or share it as a standalone artefact.
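Because the spec carries its own template, reproducing the exact text sent to the model reduces to deterministic substitution. A minimal renderer for the `{{variable}}` syntax shown above (the `render` helper is our sketch, not promptel's implementation):

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; fail loudly on missing variables."""
    def sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", sub, template)
```

Failing loudly on a missing variable matters for reproducibility: silently rendering an empty placeholder would send the model a different instruction than the one recorded in the spec.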
This is not a complete solution to reproducibility — model versions change, API behaviour drifts, and stochastic sampling means exact output matching is often impossible. But capturing the prompt specification precisely eliminates one of the most common sources of irreproducibility: ambiguity about what instruction was actually sent to the model.
What This Does Not Solve
We want to be clear about the boundaries of this approach. Formalising prompts does not make prompt engineering easier. It does not automatically produce better prompts. It does not eliminate the need for empirical testing against actual models.
What it does is make the engineering process around prompts tractable. Version control becomes meaningful when the artefact has structure. Testing becomes possible when inputs and outputs have types. Portability becomes a concrete property rather than a vague aspiration.
The analogy to programming languages is deliberate. We did not stop writing buggy code when we adopted type systems and version control. But we made bugs easier to find, changes easier to track, and collaboration easier to manage. That is the proposition for prompts.
Next Steps
In the next post in this series, we will discuss the prompt lifecycle in detail: how prompts move from initial drafting through testing, deployment, monitoring, and eventual retirement. We will cover the specific problem of prompt drift and show how blogus and promptel work together to detect and manage it.