Open Science in AI: Why We Publish Everything

The case for radical openness in AI research — reproducibility, falsifiability, and community trust through 24 open-source projects.

There was a time when the frontier of AI research was legible. Papers appeared on arXiv with full method sections. Training code was hosted on university servers. Datasets were downloadable. If you wanted to reproduce a result, you could — not easily, not always on the first try, but the information was there.

That time is ending. The most capable models now come from corporate labs that publish capability announcements rather than research papers. Training data is undisclosed. Model weights are proprietary. Evaluation benchmarks are run internally, with no way for outsiders to verify the claims. The most cited “papers” in the field increasingly read like product launches with equations.

This is not a complaint about profit motive. Companies are entitled to their business models. But it is a problem for science. Science requires that claims be falsifiable, that methods be reproducible, and that other researchers can build on your work without asking permission. When the most influential work in a field fails all three criteria, the field has a problem.

Skelf Research exists, in part, as a response to this problem. We publish everything: code, data, methodology, and — critically — our failures. Every one of our projects is open-source. Every claim we make is backed by an experiment someone else can run. This is not altruism. It is methodology.

Open Source Is Not Enough

Let us be precise about what “open” means, because the term has been diluted to near-meaninglessness. Releasing code on GitHub is a start, but code without context is just code. A repository full of Python scripts does not constitute open science any more than a pile of lab equipment constitutes a published experiment.

Open science requires three things beyond open source:

Runnable experiments. Not just code, but the full pipeline: data, dependencies, configuration, and instructions sufficient for a competent researcher to reproduce the result. “It works on my machine” is not reproducibility.

Documented hypotheses. Every experiment tests a claim. That claim should be stated explicitly before the results are presented. Post-hoc storytelling — running experiments and then constructing a narrative that explains whatever happened — is the most common form of soft fraud in machine learning research. We try to state what we expect and why, then report what actually happened.

Measurable claims. “Our approach is faster” is not a claim. “Our approach processes 10,000 queries in 340ms on an M2 MacBook Air, compared to 890ms for the baseline” is a claim. It can be verified. It can be falsified. It can be compared against future work. We aim for every performance assertion to be a number, with the benchmark code to produce that number sitting in the same repository.
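To make the point concrete, here is a minimal sketch of the kind of benchmark harness that turns "faster" into a number: warmup runs, repeated measurements, and median and p95 latencies rather than a single timing. The function names and the toy workload are illustrative, not taken from any Skelf repository.

```python
import statistics
import time

def benchmark(fn, payload, warmup=3, repeats=10):
    """Time fn over a payload; report median and p95 latency in ms."""
    for _ in range(warmup):          # warm caches before measuring
        fn(payload)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

# Toy workload standing in for "10,000 queries".
def baseline(queries):
    return [q * q for q in queries]

result = benchmark(baseline, list(range(10_000)))
print(result)
```

Shipping this harness alongside the claim is what makes "340ms on an M2 MacBook Air" a checkable statement rather than a press release.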

These are not novel ideas. They are the basic standards of experimental science, applied to a field that has largely abandoned them in the rush to scale.

How Openness Works Across Our Research

Abstract principles are easy to endorse and hard to practise. Here is how openness plays out concretely across four of our research domains, through five projects that span the range of what we study.

LLM Cognition: promptel

promptel treats prompts as typed, versionable artefacts. The specification format is YAML with JSON Schema types for inputs and outputs, declared inference parameters, and explicit provider bindings.
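As a concrete illustration, here is a hypothetical sketch of what such a specification might look like. The field names below are illustrative only: they follow the description above (JSON Schema types, declared inference parameters, explicit provider bindings) rather than the actual promptel schema.

```yaml
# Hypothetical promptel-style spec — field names are illustrative,
# not the real promptel schema.
name: summarise-ticket
version: 1.2.0
provider:
  binding: openai            # explicit provider binding
  model: gpt-4o-mini
inference:                   # declared inference parameters
  temperature: 0.2
  max_tokens: 256
inputs:                      # JSON Schema types for inputs
  type: object
  required: [ticket_text]
  properties:
    ticket_text: { type: string }
outputs:                     # ...and for outputs
  type: object
  properties:
    summary: { type: string }
    priority: { type: string, enum: [low, medium, high] }
prompt: |
  Summarise the support ticket below and assign a priority.
  Ticket: {{ ticket_text }}
```

A spec like this can be diffed across versions and tested against its declared schema, which is precisely the property the paragraph below relies on.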

The openness point here is specific: prompts are the most consequential and least inspectable component of modern AI systems. When a company says their AI assistant “follows instructions carefully,” the actual instruction — the system prompt — is proprietary. You cannot inspect it, test it, or compare it to alternatives.

promptel makes prompts concrete objects that anyone can read, modify, and test. Every prompt specification in our repositories is a complete, self-contained description of an LLM interaction. You can run it against your own models. You can diff it across versions. You can write your own test cases against the declared schema. The prompt is no longer a trade secret — it is a research artefact.

Safe Computing: memista

memista is a SQLite-backed approximate nearest neighbour search engine written in pure Rust. The research question is whether you need a dedicated vector database or whether lightweight embedded search can serve the same purpose.

For this question to be answerable, the benchmarks must be reproducible. memista ships with its benchmark suite: the datasets, the query sets, the measurement methodology, and the baseline comparisons. When we claim a particular recall rate at a particular query latency, you can verify that claim on your own hardware with your own data.

This matters because vector search benchmarks are notoriously sensitive to dataset characteristics, hardware, and measurement methodology. A claim of “95% recall” that cannot be independently verified is not a finding — it is marketing.
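As an illustration of why the methodology must ship with the claim, here is a minimal, self-contained recall@k check of the kind a vector-search benchmark suite needs: an exact brute-force baseline, and a recall measure for the approximate results. The "approximate" result below is synthetic; memista's own benchmark suite is the authoritative version.

```python
import random

def exact_knn(corpus, query, k):
    """Brute-force k nearest neighbours by squared Euclidean distance."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    return sorted(range(len(corpus)), key=lambda i: dist(corpus[i]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
corpus = [[random.random() for _ in range(8)] for _ in range(500)]
query = [random.random() for _ in range(8)]
truth = exact_knn(corpus, query, k=10)

# Stand-in for an ANN index: keep 8 of the 10 true neighbours and add two
# misses, mimicking an approximate engine that drops some neighbours.
misses = [i for i in range(len(corpus)) if i not in truth][:2]
approx = truth[:8] + misses

print(f"recall@10 = {recall_at_k(approx, truth):.2f}")  # → recall@10 = 0.80
```

Because the exact baseline is computed on the same data with the same distance function, a recall figure produced this way is reproducible on anyone's hardware.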

Safe Computing: fastC

fastC explores a restricted C dialect where the compiler enforces safety invariants that AI code generators can reliably satisfy. The research sits at the intersection of compiler design and AI code generation.

A compiler is one of the most verifiable artefacts in computer science. You can test it against any program. You can inspect every transformation it performs. You can write adversarial inputs designed to break its guarantees. fastC’s openness is not just about sharing source code — it is about inviting adversarial testing. Every safety claim the compiler makes is a claim that can be attacked, and we want it to be attacked. That is how compilers get better.

When we say fastC rejects a class of memory-unsafe programs, you can write your own unsafe program and feed it to the compiler. If it accepts it, we have a bug report. If it rejects it, we have evidence. This is what falsifiability looks like in systems research.

Edge Intelligence: llamafu

llamafu runs full language models on mobile hardware via Flutter and llama.cpp, with zero cloud dependency. The research question is about the practical limits of on-device inference: how much model can you run, at what speed, at what quality, on consumer hardware?

Performance claims about on-device inference are particularly prone to irreproducibility. Results depend on the exact device, the thermal state of the processor, the model quantisation, and the measurement methodology. Latency numbers from a lab benchmark on a cool device differ substantially from real-world performance on a phone that has been in someone’s pocket.

llamafu publishes not just the application code but the benchmarking harness, the model configurations, and the device-specific results. When we report inference latency on a Pixel 8, someone with a Pixel 8 can run the same benchmark. When we report memory usage for a 7B parameter model at Q4 quantisation, you can verify it. The measurement is the artefact, not just the code.
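A sketch of the "measurement is the artefact" idea: a harness that records device and configuration metadata alongside the latency number, so each result is self-describing. The model label and the toy workload here are placeholders, not llamafu's actual harness.

```python
import json
import platform
import time

def record_run(label, fn, *, model="7B-Q4", repeats=5):
    """Run an inference callable and emit a self-describing result record."""
    latencies = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - t0) * 1000.0)
    latencies.sort()
    return {
        "label": label,
        "model": model,                    # hypothetical quantised-model label
        "device": platform.machine(),      # record the hardware with the number
        "os": platform.system(),
        "median_latency_ms": latencies[len(latencies) // 2],
    }

# Placeholder workload standing in for a real inference call.
record = record_run("toy-inference", lambda: sum(range(100_000)))
print(json.dumps(record, indent=2))
```

A latency number published without the `device` and `model` fields attached is exactly the kind of unverifiable claim the paragraph above warns against.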

Formal Optimisation: savanty

savanty bridges natural language and constraint satisfaction — you describe an optimisation problem in English and receive a formally verified solution via mathematical solvers.

This is perhaps the domain where openness matters most, because the output has a mathematical guarantee. When savanty claims that a solution is optimal, that claim is verifiable by definition — you can check it against the formal constraints. But the more interesting openness question is about the pipeline: how does natural language become a formal specification? Where does the LLM interpretation introduce error? What classes of problems does the pipeline handle reliably, and where does it fail?

We publish the full pipeline, including the failure cases. The test suite includes problems where the natural language to formal specification translation goes wrong — where the LLM misinterprets a constraint, where the generated formulation is technically satisfiable but does not match the user’s intent. These failure cases are as valuable as the successes, because they define the boundaries of what the approach can do.
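The verification step can be illustrated on a toy problem. Assuming (hypothetically) that the pipeline has already produced the formal constraints below, a claimed optimum can be checked independently of the solver: feasibility by evaluating each constraint, optimality by exhaustive search over the small domain.

```python
from itertools import product

# Hypothetical toy problem: maximise 3x + 2y subject to
#   x + y <= 4,  x <= 3,  with x, y non-negative integers in 0..4.
constraints = [
    lambda x, y: x + y <= 4,
    lambda x, y: x <= 3,
]

def objective(x, y):
    return 3 * x + 2 * y

def feasible(x, y):
    return all(c(x, y) for c in constraints)

# A solver's claimed optimum can be checked without trusting the solver.
claimed = (3, 1)
assert feasible(*claimed), "claimed solution violates a constraint"

# Exhaustive search over the (small) domain confirms or falsifies optimality.
best = max(objective(x, y)
           for x, y in product(range(5), repeat=2) if feasible(x, y))
print(objective(*claimed), best)  # equal iff the claim is optimal → 11 11
```

On real problems the domain is too large to enumerate, but the feasibility check scales, and the solver's optimality certificate plays the role of the exhaustive search — the point is that both checks run outside the LLM pipeline.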

The Cost of Openness

We are not naive about the trade-offs. Publishing everything has real costs.

Competitive disadvantage. Every technique we develop is immediately available to anyone, including well-funded competitors who can execute on it faster than we can. There is no moat in open research. An idea we spent months developing can be replicated in weeks by a team with more resources.

Documentation overhead. Making research reproducible takes significant effort beyond making it work. Cleaning up code for public consumption, writing benchmark documentation, setting up CI pipelines that verify reproducibility — this work is invisible in the output but consumes real engineering hours. We estimate that making a project genuinely reproducible (not just open-source, but actually runnable by someone else) adds 30-40% to development time.

Premature exposure. Publishing work-in-progress means publishing things that are incomplete, sometimes wrong, occasionally embarrassing. There is a permanent public record of every dead end and mistaken assumption. This is uncomfortable. It is also, we believe, how science is supposed to work.

Why It Is Worth It

The costs are real but the benefits compound.

Community contributions. Our open repositories receive bug reports, benchmark comparisons, and occasionally code contributions from researchers we have never met. These contributions are disproportionately valuable because they come from people with different hardware, different use cases, and different assumptions than our own. A bug found by an external researcher using the code in a way we did not anticipate is worth more than ten bugs found by internal testing.

Reproducibility as quality control. The discipline of making experiments reproducible forces us to be more rigorous. When you know someone else will run your benchmark, you check your methodology more carefully. When your claims will be tested, you make more conservative claims. The external accountability makes the internal work better.

Trust. In a field where hype is the default register, verifiable claims are a differentiator. When we say a system achieves a particular performance number, and the benchmark to verify it is in the repository, that claim carries weight in a way that an unverifiable assertion cannot. Trust is built slowly, through accumulated evidence, and it is the most valuable asset a research organisation can have.

Join the Research

Everything we do is on GitHub. Every project has an open issue tracker. Every claim we make has a corresponding experiment you can run.

If you find a result you cannot reproduce, file an issue — that is the most valuable contribution you can make. If you run our benchmarks on hardware we have not tested, we want to see the numbers. If you think our methodology is flawed, tell us. We would rather be corrected in public than be wrong in private.

This is not a manifesto about how the world should work. It is a description of how we have chosen to work, and an invitation to hold us to it.