angl

The judge

Black-box verification, the repair loop, and why the judge is decidable.

The judge decides whether a generated edition is accepted. It never imports the artifact. It runs the generated judge adapter as a subprocess, feeds each example's input as JSON on stdin, and compares the JSON that comes back to the pinned expectation.

{"args": [...]}  on stdin
   -> the adapter calls the generated implementation
   -> {"ok": true, "result": ...} or {"ok": false, "error": msg} on stdout

Because the protocol is JSON over a subprocess boundary, the judge cannot tell what language the artifact is written in. That is the language-neutrality boundary, and it is why one program can mix Python, Go, Rust, and TypeScript chapters.

Acceptance is binary: every example passes or the edition is rejected. Nothing else grants acceptance. Not code review, not plausibility, not the English.

The repair loop

When an edition fails, the failure output goes back to the compiler agent and it tries again. Rejection is not an error state; it is the mechanism working. The first generated attempt regularly looks plausible and is wrong, and the judge catches it before it ever runs behind a real boundary.

Metatheory, in plain words

  • No determinism required. The contract, not the code, is the source of truth, so the compiler being an LLM is acceptable.
  • The judge is decidable. Finite cases, black-box I/O comparison, always terminates.
  • Composition is sound iff every boundary is contract-verified.
  • The judge is not an oracle. It checks the cases supplied by the source. English alone cannot decide correctness without collapsing back into model self-evaluation.

Contract strength

The toolchain includes contract-strength mutation tests: deliberately mutated behavior must be caught by the examples. This guards the failure mode where a contract is too weak to notice drift.

The flip side is the honest ceiling: the judge can validate a generated edition completely against the written expectations, and it cannot prove behavior the expectations never mention. Treat behavior outside the examples as unspecified until a new example pins it.

Where tests go

There are two different kinds of tests, and they live in different places:

  • Product behavior tests live inside each .angl chapter as executable examples. These are what the judge runs against generated editions.
  • Toolchain tests live in tests/ and protect Angl itself: parsing, linting, formatting, compiling, repair prompts, adapters, verification, book views, and contract-strength regressions.

Generated editions do not get committed with their own hand-written tests. They are disposable build outputs. If behavior matters, put it in the chapter as an example so every future edition has to pass it.

On this page