Benchmarks — AI Agent Productivity

Traditional language benchmarks measure execution speed. For Kōdo, the question is different: how fast can an AI agent go from “code written” to “code verified and deployed”?


Benchmark 1: Error→Fix Loop Speed

Scenario: An AI agent generates code with 10 type errors. How fast can it reach a clean compilation?

Python + mypy

# Agent generates code → runs mypy
$ mypy main.py
main.py:12: error: Incompatible types in assignment
    (expression has type "str", variable has type "int")
main.py:25: error: Argument 1 to "process" has incompatible type...

# Agent must: parse prose → regex match error locations →
# guess the fix → rewrite → re-run mypy
# Some errors are ambiguous. Auto-fix rate: ~60%
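
A minimal sketch of the parsing step described above; the regex and continuation handling are illustrative, not a complete mypy output parser:

import re

# Roughly what an agent has to do with mypy's prose output: pull out
# file, line, and message with a regex. The fix itself is still a guess.
ERROR_LINE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): error: (?P<msg>.*)$")

def parse_mypy(output: str) -> list[tuple[str, int, str]]:
    errors: list[tuple[str, int, str]] = []
    for line in output.splitlines():
        m = ERROR_LINE.match(line)
        if m:
            errors.append((m["file"], int(m["line"]), m["msg"]))
        elif errors and line.startswith("    "):
            # Wrapped continuation of the previous message.
            f, n, msg = errors[-1]
            errors[-1] = (f, n, msg + " " + line.strip())
    return errors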

Kōdo

// Agent generates code → runs kodoc check --json-errors
$ kodoc check main.ko --json-errors
{
  "code": "E0201",
  "message": "Type mismatch: expected Int, found String",
  "span": { "file": "main.ko", "start": 142, "end": 155 },
  "fix_patch": {
    "replacement": "parse_int(value)",
    "start_byte": 142,
    "end_byte": 155,
    "confidence": "high"
  },
  "fix_difficulty": "auto"
}
# Agent applies fix_patch directly — no guessing
$ kodoc fix main.ko
Fixed 10 errors. 0 remaining.
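
Because the errors are structured, the agent-side consumer is a few lines. A minimal sketch, assuming the checker prints one JSON object per error on stdout and that start_byte/end_byte index into the raw file bytes (the exact output framing is not specified above):

import json
import subprocess

# Collect structured errors. Assumption: one JSON object per line on stdout.
result = subprocess.run(
    ["kodoc", "check", "main.ko", "--json-errors"],
    capture_output=True, text=True,
)
errors = [json.loads(line) for line in result.stdout.splitlines() if line.strip()]

with open("main.ko", "rb") as f:
    source = f.read()

# Apply patches from the end of the file backwards so earlier offsets stay valid.
patches = [e["fix_patch"] for e in errors
           if e.get("fix_patch") and e.get("fix_difficulty") == "auto"]
for patch in sorted(patches, key=lambda p: p["start_byte"], reverse=True):
    source = (source[:patch["start_byte"]]
              + patch["replacement"].encode()
              + source[patch["end_byte"]:])

with open("main.ko", "wb") as f:
    f.write(source)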

Metric                  | Python + mypy                 | Kōdo
Error format            | Prose (regex parsing needed)  | Structured JSON
Fix mechanism           | Agent guesses                 | FixPatch with byte offsets
Auto-fix rate           | ~60% of type errors           | 100% of errors with patches
Cycles to clean build   | 2–5                           | 1–2

Benchmark 2: Correctness by Construction

Scenario: An AI agent generates a division function. How many bugs reach runtime?

Python

def divide(a, b):
    return a / b  # No compile-time check — ZeroDivisionError at runtime

# Agent can add a check, but nothing *enforces* it
def divide_safe(a, b):
    if b == 0:
        raise ValueError("division by zero")
    return a / b
# Still no guarantee callers handle the error

Kōdo

fn divide(a: Int, b: Int) -> Int
    requires { b != 0 }
    ensures  { result * b == a }
{
    return a / b
}

// Calling divide(10, 0) → compile-time error E0301:
// "Precondition 'b != 0' cannot be satisfied:
//  argument 'b' is literal 0"

Metric                  | Python                    | Kōdo
Division by zero        | Runtime exception         | Compile-time error (Z3 proves b != 0 is violated)
Contract enforcement    | None (convention only)    | Grammar-level requires/ensures
Bugs reaching runtime   | Possible                  | Zero for statically verified contracts
Agent behavior          | Hope the tests catch it   | Compiler blocks the build; agent must fix
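
The "Z3 proves b != 0 is violated" row corresponds to a satisfiability check. A minimal sketch of that kind of query using the z3-solver Python bindings; the encoding kodoc actually uses is not shown here:

from z3 import Int, Solver

# Can the call site divide(10, 0) satisfy the precondition `b != 0`?
b = Int("b")
s = Solver()
s.add(b == 0)     # the argument at the call site is the literal 0
s.add(b != 0)     # the precondition from `requires { b != 0 }`
print(s.check())  # => unsat: no value of b works, so the call is rejected at compile time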

Benchmark 3: Trust Propagation

Scenario: A module has 5 functions. One is experimental with @confidence(0.6). How fast is the risk detected?

Python

# No mechanism to track confidence or authorship
def stable_function():  # Who wrote this? How confident? No idea.
    return process(experimental_helper())

def experimental_helper():  # Agent generated this at 60% confidence
    return risky_computation()

# Risk: experimental code is silently used in production
# Detection: manual code review, maybe never

Kōdo

@authored_by(agent: "claude")
@confidence(0.95)
fn stable_function() -> Int {
    return process(experimental_helper())
    //                 ↑ E0260: Calling function with confidence 0.6
    //                   from function with confidence 0.95.
    //                   Add @reviewed_by to acknowledge the risk.
}

@authored_by(agent: "claude")
@confidence(0.6)
fn experimental_helper() -> Int {
    return risky_computation()
}

Metric                 | Python                            | Kōdo
Confidence tracking    | None                              | @confidence scores on every function
Risk propagation       | Invisible                         | Transitive: min confidence propagates through call chains
Detection time         | Manual review (hours/days/never)  | Compile-time (instant)
Policy enforcement     | None                              | Build blocked until @reviewed_by is added
Audit trail            | git blame                         | Build certificates (.ko.cert.json) with per-function scores
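
A toy model of the "min confidence propagates through call chains" rule. The function names, scores, and call graph below are illustrative, not the compiler's implementation:

# Effective confidence of a function = min over itself and everything it
# (transitively) calls. Illustrative data, not a real build.
confidence = {"stable_function": 0.95, "experimental_helper": 0.6, "risky_computation": 0.7}
calls = {
    "stable_function": ["experimental_helper"],
    "experimental_helper": ["risky_computation"],
    "risky_computation": [],
}

def effective(fn: str, seen: frozenset = frozenset()) -> float:
    if fn in seen:                       # break cycles defensively
        return confidence[fn]
    children = [effective(c, seen | {fn}) for c in calls[fn]]
    return min([confidence[fn], *children])

# stable_function declares 0.95 but its effective confidence is 0.6,
# which is what triggers E0260 until @reviewed_by acknowledges the gap.
print(effective("stable_function"))      # 0.6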

The closed-loop advantage

These benchmarks share a pattern: Kōdo moves verification left — from runtime to compile-time, from human review to automated checks.

For an AI agent operating in a tight loop:

┌─────────────────────────────────────────────────┐
│  Agent writes code                              │
│       ↓                                         │
│  kodoc check --json-errors                      │
│       ↓                                         │
│  Parse JSON → apply FixPatch → recompile        │
│       ↓                                         │
│  All contracts verified by Z3                   │
│       ↓                                         │
│  Confidence scores > threshold                  │
│       ↓                                         │
│  Build certificate generated → deploy           │
└─────────────────────────────────────────────────┘

No human in the loop. No hoping tests catch it. No “it works on my machine.”
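
A minimal sketch of that loop as agent-side driver code, assuming kodoc check exits non-zero while errors remain and kodoc fix applies the emitted FixPatches (both assumptions, matching the behaviour shown in Benchmark 1):

import subprocess

def agent_cycle(path: str, max_rounds: int = 5) -> bool:
    """Drive the write -> check -> fix -> build loop for one source file."""
    for _ in range(max_rounds):
        check = subprocess.run(["kodoc", "check", path, "--json-errors"],
                               capture_output=True, text=True)
        if check.returncode == 0:
            # Contracts verified and confidence policy satisfied: build and deploy.
            return subprocess.run(["kodoc", "build", path]).returncode == 0
        # Apply the FixPatches the checker emitted, then go around again.
        subprocess.run(["kodoc", "fix", path])
    return False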


kodo-bench: Quantitative Agent Evaluation

Scenario: Give Claude an LLM-readable language reference and 150 Kōdo coding tasks. Measure how often it produces correct code on the first try (pass@1).

Setup

  • Model: claude-sonnet-4-20250514
  • Runs per task: 3 (pass@1 computed with the unbiased estimator from the Codex paper; see the sketch after this list)
  • Validation: kodoc check → kodoc build → run binary → compare stdout
  • System prompt: bench/kodo-reference.md (~2000 tokens)
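
For reference, the unbiased pass@k estimator, which with n=3 runs and k=1 reduces to the fraction of correct samples per task:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), for n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 runs per task and k=1 this is just c/n; the reported pass@1 is the
# mean of this value over the 150 tasks.
print(pass_at_k(n=3, c=2, k=1))   # 0.666...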

Results by Category

Category                 | pass@1 | Notes
modules                  | 0.967  | pub fn/struct, meta blocks, encapsulation
ownership                | 0.911  | own/ref/mut, borrow semantics
basics                   | 0.900  | loops, structs, conditionals
agent-traceability       | 0.867  | @confidence, @authored_by, @reviewed_by
error-handling-advanced  | 0.822  | Result chains, custom error enums
traits-generics          | 0.733  | trait/impl, generic functions
contracts-advanced       | 0.667  | requires/ensures, refinement types
intents                  | 0.400  | http_server, database, cache, cli
contracts                | 0.292  | basic preconditions / postconditions
patterns                 | 0.185  | closures, higher-order fns, string interp
data-structures          | 0.125  | Set, Map edge cases
concurrency              | 0.083  | spawn / channels (sequential in v1)
error-handling           | 0.000  | Result/Option patterns (stdlib API mismatch)

Aggregate

Metric               | Value
pass@1 (150 tasks)   | 0.602
compile_rate         | 0.864
easy tasks (n=44)    | 0.841
medium tasks (n=78)  | 0.504
hard tasks (n=28)    | 0.500

Interpretation

What works well: The features most distinctive to Kōdo — ownership, agent traceability annotations, contract-aware modules, and advanced error handling — are also where agents score highest. Once agents have the reference, Kōdo idioms click quickly. The improvement from the first run (0.502) to the current baseline (0.602) came entirely from fixing the reference — better examples for intents, closures, tuples, and the correct stdlib APIs.

What drags the score down:

  • error-handling (0.000): The basic Result/Option stdlib API (e.g. Option::Some(v) pattern matching) has a known mismatch in this task set — under investigation.
  • concurrency (0.083): spawn/async/await execute sequentially in v1; agents expect parallel semantics and produce wrong output.
  • intents (0.400): Intent block syntax is unfamiliar; richer examples in the reference improved compile_rate from 0.089 to 0.333.

The verdict: 60.2% pass@1 from a 2000-token reference. The goal is 70%+ as we fix the error-handling task set and address v1 concurrency limitations.

# Reproduce these results
export ANTHROPIC_API_KEY=your_key
python3 bench/agent-eval.py --model claude-sonnet-4-20250514 --runs 3

Real-world example

We built a complete Task Management API in Kōdo that exercises all of these features in a single file — contracts, refinement types, agent traceability, closures, JSON serialization, HTTP server, and inline tests. The same project is also implemented in Python, TypeScript, Rust, and Go for reference.