Benchmarks — AI Agent Productivity
Traditional language benchmarks measure execution speed. For Kōdo, the question is different: how fast can an AI agent go from “code written” to “code verified and deployed”?
Benchmark 1: Error→Fix Loop Speed
Scenario: An AI agent generates code with 10 type errors. How fast can it reach a clean compilation?
Python + mypy
# Agent generates code → runs mypy
$ mypy main.py
main.py:12: error: Incompatible types in assignment (expression has type "str", variable has type "int")
main.py:25: error: Argument 1 to "process" has incompatible type...
# Agent must: parse prose → regex match error locations →
# guess the fix → rewrite → re-run mypy
# Some errors are ambiguous. Auto-fix rate: ~60%
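For comparison, this is roughly what the parsing step looks like on the agent side. It is an illustrative sketch, not part of any Kōdo tooling: the regex only covers mypy's standard `file:line: error: message` format, and any actual fix still has to be inferred from the prose.

```python
import re
import subprocess

# Illustrative only: run mypy and regex-match its prose diagnostics.
MYPY_LINE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): error: (?P<message>.*)$")

result = subprocess.run(["mypy", "main.py"], capture_output=True, text=True)
errors = [m.groupdict() for m in map(MYPY_LINE.match, result.stdout.splitlines()) if m]

for err in errors:
    # The hard part: there is no machine-readable fix, so the agent must
    # guess an edit from the message text and re-run mypy to verify it.
    print(err["file"], err["line"], err["message"])
```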
Kōdo
// Agent generates code → runs kodoc check --json-errors
$ kodoc check main.ko --json-errors
{
  "code": "E0201",
  "message": "Type mismatch: expected Int, found String",
  "span": { "file": "main.ko", "start": 142, "end": 155 },
  "fix_patch": {
    "replacement": "parse_int(value)",
    "start_byte": 142,
    "end_byte": 155,
    "confidence": "high"
  },
  "fix_difficulty": "auto"
}
# Agent applies fix_patch directly — no guessing
$ kodoc fix main.ko
Fixed 10 errors. 0 remaining.
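For reference, applying a `fix_patch` is a plain byte splice. A minimal sketch, assuming the JSON above has been parsed into a Python dict; the helper name is ours, not part of kodoc:

```python
# Minimal sketch: splice the replacement into the file at the byte range
# given by the FixPatch. `patch` is the "fix_patch" object shown above.
def apply_fix_patch(path: str, patch: dict) -> None:
    with open(path, "rb") as f:
        data = bytearray(f.read())
    data[patch["start_byte"]:patch["end_byte"]] = patch["replacement"].encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)

apply_fix_patch("main.ko", {
    "replacement": "parse_int(value)",
    "start_byte": 142,
    "end_byte": 155,
})
```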
| Metric | Python + mypy | Kōdo |
|---|---|---|
| Error format | Prose (regex parsing needed) | Structured JSON |
| Fix mechanism | Agent guesses | FixPatch with byte offsets |
| Auto-fix rate | ~60% of type errors | 100% of errors with patches |
| Cycles to clean build | 2–5 | 1–2 |
Benchmark 2: Correctness by Construction
Scenario: An AI agent generates a division function. How many bugs reach runtime?
Python
def divide(a, b):
    return a / b  # No compile-time check — ZeroDivisionError at runtime

# Agent can add a check, but nothing *enforces* it
def divide_safe(a, b):
    if b == 0:
        raise ValueError("division by zero")
    return a / b

# Still no guarantee callers handle the error
Kōdo
fn divide(a: Int, b: Int) -> Int
    requires { b != 0 }
    ensures { result * b == a }
{
    return a / b
}

// Calling divide(10, 0) → compile-time error E0301:
//   "Precondition 'b != 0' cannot be satisfied:
//    argument 'b' is literal 0"
| Metric | Python | Kōdo |
|---|---|---|
| Division by zero | Runtime exception | Compile-time error (Z3 proves b != 0 is violated) |
| Contract enforcement | None (convention only) | Grammar-level requires/ensures |
| Bugs reaching runtime | Possible | Zero for statically verified contracts |
| Agent behavior | Hope the tests catch it | Compiler blocks the build — agent must fix |
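The compile-time check in that table can be reproduced with off-the-shelf Z3. A sketch using the `z3-solver` Python bindings; the encoding here is ours for illustration, not necessarily what kodoc emits:

```python
from z3 import Int, Solver, Not, sat

# Encode the call divide(10, 0): fix the arguments, then ask Z3 whether
# the precondition b != 0 can be violated. "sat" = a violation exists.
a, b = Int("a"), Int("b")
solver = Solver()
solver.add(a == 10, b == 0)   # argument values at the call site
solver.add(Not(b != 0))       # negation of the precondition

if solver.check() == sat:
    print("E0301: precondition 'b != 0' cannot be satisfied (b is literal 0)")
```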
Benchmark 3: Trust Propagation
Scenario: A module has 5 functions. One is experimental with @confidence(0.6). How fast is the risk detected?
Python
# No mechanism to track confidence or authorship
def stable_function():  # Who wrote this? How confident? No idea.
    return process(experimental_helper())

def experimental_helper():  # Agent generated this at 60% confidence
    return risky_computation()
# Risk: experimental code is silently used in production
# Detection: manual code review, maybe never
Kōdo
@authored_by(agent: "claude")
@confidence(0.95)
fn stable_function() -> Int {
    return process(experimental_helper())
    // ↑ E0260: Calling function with confidence 0.6
    //   from function with confidence 0.95.
    //   Add @reviewed_by to acknowledge the risk.
}

@authored_by(agent: "claude")
@confidence(0.6)
fn experimental_helper() -> Int {
    return risky_computation()
}
| Metric | Python | Kōdo |
|---|---|---|
| Confidence tracking | None | @confidence scores on every function |
| Risk propagation | Invisible | Transitive — min confidence propagates through call chains |
| Detection time | Manual review (hours/days/never) | Compile-time (instant) |
| Policy enforcement | None | Build blocked until @reviewed_by is added |
| Audit trail | git blame | Build certificates (.ko.cert.json) with per-function scores |
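The "transitive" row describes a min-over-the-call-chain rule. A minimal sketch of that rule, assuming the call graph is a plain dict; the function and field names are illustrative, not kodoc internals:

```python
# Effective confidence of a function = min of its own @confidence score
# and the effective confidence of everything it calls.
def effective_confidence(fn, call_graph, declared, seen=None):
    seen = (seen or set()) | {fn}
    score = declared[fn]
    for callee in call_graph.get(fn, []):
        if callee in seen:          # skip cycles
            continue
        score = min(score, effective_confidence(callee, call_graph, declared, seen))
    return score

declared = {"stable_function": 0.95, "experimental_helper": 0.6}
call_graph = {"stable_function": ["experimental_helper"]}
print(effective_confidence("stable_function", call_graph, declared))  # 0.6
```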
The closed-loop advantage
These benchmarks share a pattern: Kōdo moves verification left — from runtime to compile-time, from human review to automated checks.
For an AI agent operating in a tight loop:
┌─────────────────────────────────────────────┐
│  Agent writes code                          │
│          ↓                                  │
│  kodoc check --json-errors                  │
│          ↓                                  │
│  Parse JSON → apply FixPatch → recompile    │
│          ↓                                  │
│  All contracts verified by Z3               │
│          ↓                                  │
│  Confidence scores > threshold              │
│          ↓                                  │
│  Build certificate generated → deploy       │
└─────────────────────────────────────────────┘
No human in the loop. No hoping tests catch it. No “it works on my machine.”
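As a concrete illustration of that loop, here is a minimal agent-side driver. It assumes `kodoc check --json-errors` emits one JSON object per error on stdout (JSON-lines); the real output framing may differ, and the helper names are ours:

```python
import json
import subprocess

def check(path):
    # Assumption: one JSON error object per stdout line.
    out = subprocess.run(["kodoc", "check", path, "--json-errors"],
                         capture_output=True, text=True)
    return [json.loads(line) for line in out.stdout.splitlines() if line.strip()]

def apply_patch(path, patch):
    data = bytearray(open(path, "rb").read())
    data[patch["start_byte"]:patch["end_byte"]] = patch["replacement"].encode("utf-8")
    open(path, "wb").write(data)

for _ in range(5):                                 # bounded retry loop
    errors = check("main.ko")
    if not errors:
        break                                      # clean build
    patchable = [e for e in errors if e.get("fix_patch")]
    if not patchable:
        break                                      # hand the rest back to the agent
    # Apply from the end of the file so earlier byte offsets stay valid.
    for e in sorted(patchable, key=lambda e: e["fix_patch"]["start_byte"], reverse=True):
        apply_patch("main.ko", e["fix_patch"])
```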
kodo-bench: Quantitative Agent Evaluation
Scenario: Give Claude an LLM-readable language reference and 150 Kōdo coding tasks. Measure how often it produces correct code on the first try (pass@1).
Setup
- Model: claude-sonnet-4-20250514
- Runs per task: 3 (pass@1 computed with the unbiased Codex estimator)
- Validation: `kodoc check` → `kodoc build` → run binary → compare stdout
- System prompt: `bench/kodo-reference.md` (~2000 tokens)
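For reference, the unbiased estimator mentioned above is the standard pass@k formula from the Codex paper; with n = 3 runs and k = 1 it reduces to the fraction of passing runs. A quick sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=3, c=2, k=1))  # 0.666... for a task solved in 2 of 3 runs
```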
Results by Category
| Category | pass@1 | Notes |
|---|---|---|
| modules | 0.967 | pub fn/struct, meta blocks, encapsulation |
| ownership | 0.911 | own/ref/mut, borrow semantics |
| basics | 0.900 | loops, structs, conditionals |
| agent-traceability | 0.867 | @confidence, @authored_by, @reviewed_by |
| error-handling-advanced | 0.822 | Result chains, custom error enums |
| traits-generics | 0.733 | trait/impl, generic functions |
| contracts-advanced | 0.667 | requires/ensures, refinement types |
| intents | 0.400 | http_server, database, cache, cli |
| contracts | 0.292 | basic preconditions / postconditions |
| patterns | 0.185 | closures, higher-order fns, string interp |
| data-structures | 0.125 | Set, Map edge cases |
| concurrency | 0.083 | spawn / channels (sequential in v1) |
| error-handling | 0.000 | Result/Option patterns (stdlib API mismatch) |
Aggregate
| Metric | Value |
|---|---|
| pass@1 (150 tasks) | 0.602 |
| compile_rate | 0.864 |
| easy tasks (n=44) | 0.841 |
| medium tasks (n=78) | 0.504 |
| hard tasks (n=28) | 0.500 |
Interpretation
What works well: The features most distinctive to Kōdo — ownership, agent traceability annotations, contract-aware modules, and advanced error handling — are also where agents score highest. Once agents have the reference, Kōdo idioms click quickly. The improvement from the first run (0.502) to the current baseline (0.602) came entirely from fixing the reference — better examples for intents, closures, tuples, and the correct stdlib APIs.
What drags the score down:
- `error-handling` (0.000): The basic Result/Option stdlib API (e.g. `Option::Some(v)` pattern matching) has a known mismatch in this task set — under investigation.
- `concurrency` (0.083): `spawn`/`async`/`await` execute sequentially in v1; agents expect parallel semantics and produce wrong output.
- `intents` (0.400): Intent block syntax is unfamiliar; richer examples in the reference improved compile_rate from 0.089 to 0.333.
The verdict: 60.2% pass@1 from a 2000-token reference. The goal is 70%+ as we fix the error-handling task set and address v1 concurrency limitations.
# Reproduce these results
export ANTHROPIC_API_KEY=your_key
python3 bench/agent-eval.py --model claude-sonnet-4-20250514 --runs 3
Real-world example
We built a complete Task Management API in Kōdo that exercises all of these features in a single file — contracts, refinement types, agent traceability, closures, JSON serialization, HTTP server, and inline tests. The same project is also implemented in Python, TypeScript, Rust, and Go for reference.