Formalising Prompts as First-Class Research Objects
Why treating prompts as typed, portable artefacts changes how we reason about LLM behaviour — and how promptel implements this idea.
Most LLM-powered systems treat prompts as strings. They live inside application code, concatenated at runtime, shaped by whatever felt right during a debugging session at 2am. They are rarely versioned independently of the code that hosts them. They are almost never tested in isolation. And when something breaks — when the model starts producing garbage after what should have been a routine deployment — nobody can point to the diff that caused it.
This is not an acceptable state of affairs for a core component of system behaviour.
We argue that prompts deserve the same engineering rigour that we apply to source code: explicit types, version control, portability guarantees, and formal specifications. This is the thesis behind promptel, our declarative prompt specification language, and the extraction tooling in blogus.
The Problem with Prompts-as-Strings
Consider a typical LLM integration. Somewhere in your codebase, buried inside a function, there is a string literal or template that constitutes the actual instruction to the model. It might look like this:
def summarise(text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise summariser. "
                                          "Output exactly 3 bullet points. No preamble."},
            {"role": "user", "content": f"Summarise:\n\n{text}"}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content
This code conflates at least four concerns: the prompt text, the model selection, the inference parameters, and the I/O plumbing. The prompt itself — the actual instruction that determines system behaviour — is tangled into the application logic. You cannot version it independently. You cannot test it without invoking the full function. You cannot port it to a different provider without rewriting the integration code.
Worse, there is no schema. Nothing declares what text should look like, what the expected output shape is, or what constraints the prompt assumes. The contract between the prompt author and the model is entirely implicit.
Prompts as Typed Artefacts
promptel takes a different approach. A prompt is a standalone, declarative specification:
# summarise.prompt.yaml
kind: Prompt
version: "1.0"
name: summarise
description: "Produce a 3-bullet summary of input text."
input:
  type: object
  properties:
    text:
      type: string
      minLength: 50
      description: "The text to summarise."
output:
  type: array
  items:
    type: string
  minItems: 3
  maxItems: 3
system: |
  You are a concise summariser.
  Output exactly 3 bullet points. No preamble.
user: |
  Summarise:

  {{text}}
parameters:
  temperature: 0.3
  max_tokens: 256
providers:
  - openai:gpt-4
  - anthropic:claude-3-sonnet
  - local:llama-3-8b
Several things are different here. The prompt has a name and a version. Its input and output are typed using JSON Schema. The template variables are explicit. The inference parameters are declared alongside the prompt, not buried in application code. And the provider list makes portability an explicit concern — this prompt is designed to run on multiple backends.
This is what we mean by treating prompts as first-class objects. They have identity, structure, and a defined interface.
Why Types Matter
Typed inputs and outputs enable several things that string prompts cannot support.
Validation before inference. If the input schema declares minLength: 50, we can reject trivially short inputs before they ever reach the model. This is not just an optimisation — it is a correctness guarantee. A summarisation prompt given a three-word input will produce nonsense. Catching this at the schema level is cheaper and more reliable than hoping the model handles it gracefully.
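This pre-inference check can be sketched with the jsonschema library and the input schema from the spec above. The `validate_input` helper is our illustration, not part of promptel's API, and we add a `required` key for strictness that the original spec leaves implicit:

```python
from jsonschema import ValidationError, validate

# Input schema from the summarise spec above, with an explicit
# "required" constraint added for strictness.
INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string", "minLength": 50},
    },
    "required": ["text"],
}

def validate_input(payload: dict) -> bool:
    """Reject malformed inputs before any tokens are spent on inference."""
    try:
        validate(instance=payload, schema=INPUT_SCHEMA)
        return True
    except ValidationError:
        return False
```

A three-word input fails the `minLength` check immediately, for free, instead of producing a nonsense summary at full inference cost.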
Output verification. If the output schema declares an array of exactly three strings, we can validate the model’s response structurally. Did it actually produce three bullets? Is each one a string? This enables automated regression testing without requiring human evaluation for every case.
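The corresponding structural check on the response side might look like this, assuming the model is asked to emit its bullets as a JSON array (the `verify_output` name is ours):

```python
import json

from jsonschema import ValidationError, validate

# Output schema from the summarise spec: exactly three strings.
OUTPUT_SCHEMA = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 3,
    "maxItems": 3,
}

def verify_output(raw: str) -> bool:
    """Structurally validate a model response without human evaluation."""
    try:
        parsed = json.loads(raw)
        validate(instance=parsed, schema=OUTPUT_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```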
Contract-driven development. When the prompt’s interface is explicit, downstream code can depend on it. A function that consumes the summariser’s output knows it will receive three strings. If someone changes the prompt to produce five bullets, the schema violation is caught before deployment, not in production.
Portability and the Provider Problem
One of the less-discussed problems with prompts-as-strings is provider lock-in. A prompt tuned for GPT-4's instruction-following style may produce poor results on Claude or Llama. But because the prompt is embedded in provider-specific code, switching providers means rewriting the integration layer.
promptel addresses this by separating the prompt specification from the provider binding. The providers field is a declaration of compatibility, not an implementation detail. The runtime layer handles the translation — mapping the prompt template to each provider’s message format, adjusting parameters where semantics differ.
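That translation step can be sketched as follows. The message shapes reflect the real OpenAI and Anthropic chat APIs, but `to_provider_request` and the spec dictionary are our illustration, not promptel's actual runtime:

```python
def to_provider_request(spec: dict, variables: dict, provider: str) -> dict:
    """Map one prompt spec onto a provider-specific request payload."""
    user_text = spec["user"].replace("{{text}}", variables["text"])
    model = provider.split(":", 1)[1]
    if provider.startswith("openai:"):
        # OpenAI chat format: the system prompt travels inside the
        # messages list.
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": spec["system"]},
                {"role": "user", "content": user_text},
            ],
            "temperature": spec["parameters"]["temperature"],
        }
    if provider.startswith("anthropic:"):
        # Anthropic Messages format: the system prompt is a top-level
        # field, and max_tokens is mandatory.
        return {
            "model": model,
            "system": spec["system"],
            "messages": [{"role": "user", "content": user_text}],
            "max_tokens": spec["parameters"]["max_tokens"],
            "temperature": spec["parameters"]["temperature"],
        }
    raise ValueError(f"unsupported provider: {provider}")
```

The prompt specification stays identical; only the final serialisation differs per backend.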
This does not solve the fundamental problem that different models respond differently to the same text. But it does make the problem tractable. You can now test the same prompt specification against multiple providers and compare results systematically, rather than maintaining parallel implementations that diverge silently.
Extracting Prompts from Existing Codebases
For most teams, the prompts-as-typed-artefacts vision runs into an immediate practical obstacle: existing codebases already contain hundreds of prompts embedded in application code. Rewriting everything from scratch is not realistic.
This is where blogus fits in. blogus is a Python tool that performs static analysis on codebases to extract prompt patterns. It identifies LLM API calls, extracts the prompt text and parameters, and emits promptel-compatible specifications.
The extraction is necessarily heuristic — prompts constructed dynamically at runtime cannot always be fully resolved statically. But in practice, a large proportion of prompts follow predictable patterns (string literals, f-strings with known variables, template rendering) that blogus can handle.
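blogus's actual implementation is more involved, but the core pattern-matching idea can be sketched with Python's ast module: walk the syntax tree, match calls whose attribute chain ends in `.create`, and pull out literal message content. This toy version (ours, not blogus's code) handles only the plain-string-literal case:

```python
import ast

def extract_prompt_literals(source: str) -> list[str]:
    """Collect string literals passed as 'content' inside .create(...) calls.

    A toy version of static prompt extraction: it handles plain string
    constants only, not f-strings or runtime template rendering.
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        # Match calls like client.chat.completions.create(...).
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "create"):
            continue
        for kw in node.keywords:
            if kw.arg != "messages":
                continue
            # Look for {"role": ..., "content": "literal"} dictionaries.
            for sub in ast.walk(kw.value):
                if not isinstance(sub, ast.Dict):
                    continue
                for key, value in zip(sub.keys, sub.values):
                    if (isinstance(key, ast.Constant) and key.value == "content"
                            and isinstance(value, ast.Constant)
                            and isinstance(value.value, str)):
                        found.append(value.value)
    return found
```

Adjacent string literals (as in the summarise example earlier) are already merged into a single constant by the parser, so that common pattern falls out for free.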
The output is a catalogue of the prompts a codebase actually uses, each with its inferred type information and provider bindings. This catalogue serves as a starting point: review it, correct the inferences, and you have a versioned prompt library without a ground-up rewrite.
Implications for Reproducibility
Reproducibility is a persistent challenge in LLM research. Results depend on the exact prompt text, the model version, the inference parameters, and often the system prompt or few-shot examples. Papers frequently omit one or more of these, making replication difficult.
promptel specifications are designed to be self-contained. A .prompt.yaml file captures everything needed to reproduce a particular LLM interaction: the full prompt text, the template variables, the parameter settings, and the intended providers. You can commit it to a repository, attach it to a paper, or share it as a standalone artefact.
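Because the spec carries its own template, reproducing the exact text sent to the model reduces to deterministic substitution. A minimal renderer for the `{{variable}}` syntax shown above (the `render` helper is our sketch, not promptel's implementation):

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; fail loudly on missing variables."""
    def sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", sub, template)
```

Failing loudly on a missing variable matters for reproducibility: silently rendering an empty placeholder would send the model a different instruction than the one recorded in the spec.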
This is not a complete solution to reproducibility — model versions change, API behaviour drifts, and stochastic sampling means exact output matching is often impossible. But capturing the prompt specification precisely eliminates one of the most common sources of irreproducibility: ambiguity about what instruction was actually sent to the model.
What This Does Not Solve
We want to be clear about the boundaries of this approach. Formalising prompts does not make prompt engineering easier. It does not automatically produce better prompts. It does not eliminate the need for empirical testing against actual models.
What it does is make the engineering process around prompts tractable. Version control becomes meaningful when the artefact has structure. Testing becomes possible when inputs and outputs have types. Portability becomes a concrete property rather than a vague aspiration.
The analogy to programming languages is deliberate. We did not stop writing buggy code when we adopted type systems and version control. But we made bugs easier to find, changes easier to track, and collaboration easier to manage. That is the proposition for prompts.
Next Steps
In the next post in this series, we will discuss the prompt lifecycle in detail: how prompts move from initial drafting through testing, deployment, monitoring, and eventual retirement. We will cover the specific problem of prompt drift and show how blogus and promptel work together to detect and manage it.