WheelWright AI Framework

Version 2.0.0 · Source: llms-full.txt

Complete documentation for the WheelWright AI Framework — the hub-and-spoke knowledge operating system.

Note: This page is generated from the framework's llms-full.txt and reflects the documentation as of commit 1ed6bde. Updated comprehensive documentation will be published once llms-full.txt is finalized. For the canonical, most current version, see the source in the framework repository on GitHub, and ensure you're working against a tagged release.

Philosophy

WheelWright treats AI context as a compounding asset. The framework provides two universal primitives — Skills (executable capabilities) and Lugs (actionable knowledge records) — connected by a single execution contract: Perceive / Execute / Verify (PEV).

Architecture

The hub-and-spoke model: the Hub is the analytical clearinghouse and shared memory. Each Spoke is a project with its own Ozi orchestration agent, Advisors, and local Lug store. Advisors navigate by folder convention — directory position defines scope.

Lug Schema Specification

Version 1.1.0

Lugs are WAI’s universal communication primitive — actionable records, not summaries. Every Lug represents decomposed, meaningful work with full traceability.

Core Principles: Actionable, Traceable, Idempotent, Decomposed, Self-contained.

Lug Types

| Type | Purpose | Example |
|---|---|---|
| task | Work to be done | "Migrate auth from JWT to sessions" |
| diagnosis | Problem identified | "SQL injection in auth handler" |
| prescription | Recommended fix | "Parameterize query at line 47" |
| decision | Judgment call | "Accepted risk on X because Y" |
| observation | Pattern recorded | "Coverage dropped 82% → 73%" |
| preference | Workflow preference | "User prefers terse confirmations" |
| signal | High-impact (≥8) | "Architecture change affects API" |
| session | Session summary | "Security review ran, 2 issues found" |

Required Fields

id: "lug-2026-02-11-001"          # Unique: lug-{date}-{sequence}
type: "diagnosis"                 # See Lug Types table
title: "SQL injection in auth handler"
status: "published"               # draft | published | acknowledged | in_progress | resolved
impact: 9                         # 1-10. ≥8 = signal
created_at: "2026-02-11T14:30:00Z"
created_by: "security-reviewer"
node: "ownersshare/cto"
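The required-field and status rules above can be checked mechanically. The helper below is an illustrative sketch, not part of the framework; the function name and return shape are assumptions.

```python
REQUIRED_FIELDS = {"id", "type", "title", "status", "impact",
                   "created_at", "created_by", "node"}
VALID_STATUSES = {"draft", "published", "acknowledged", "in_progress", "resolved"}

def validate_lug(lug: dict) -> list[str]:
    """Return a list of problems; an empty list means the lug is valid."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - lug.keys())]
    if lug.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {lug.get('status')!r}")
    impact = lug.get("impact")
    if not isinstance(impact, int) or not 1 <= impact <= 10:
        problems.append("impact must be an integer from 1 to 10")
    return problems
```

A lug built from the example fields above would validate cleanly; one with `impact: 11` would not.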

PEV Fields (Perceive / Execute / Verify)

Optional structured execution context. When present, agents follow these instead of interpreting the title.

perceive:
  look_at: ["src/auth/handler.js"]
  current_state: "Email accepts any string"
  success_state: "Email validates against RFC 5322"
execute:
  approach: "Add validation using phone.ts pattern"
  constraints: ["Do not modify phone validator"]
verify:
  commands: ["npm test -- --grep 'email'"]
  expected_output: "All tests pass"
PEV is optional and backward compatible. Existing Lugs without PEV continue to work.

Lifecycle

draft → published → acknowledged → in_progress → resolved
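The arrow diagram suggests transitions are strictly sequential; assuming that reading (the spec does not state it explicitly), a transition check could look like this hypothetical sketch:

```python
LIFECYCLE = ["draft", "published", "acknowledged", "in_progress", "resolved"]

def can_advance(current: str, target: str) -> bool:
    """True if target is the next stage after current in the lug lifecycle."""
    i = LIFECYCLE.index(current)  # raises ValueError for unknown statuses
    return i + 1 < len(LIFECYCLE) and LIFECYCLE[i + 1] == target
```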

Impact Scoring

| Score | Visibility | Example |
|---|---|---|
| 1–3 | Local only | "Refactored helper" |
| 4–7 | Project-wide | "API contract changed" |
| 8–10 | Wheel-wide signal → Hub | "Architecture pattern across projects" |
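The score-to-visibility mapping above is a simple threshold function; a minimal sketch (names are illustrative, not framework API):

```python
def visibility(impact: int) -> str:
    """Map an impact score (1-10) to its visibility tier."""
    if not 1 <= impact <= 10:
        raise ValueError("impact must be 1-10")
    if impact >= 8:
        return "wheel-wide signal"   # copied to the Hub
    if impact >= 4:
        return "project-wide"
    return "local"
```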

Cross-Node Communication

Outbound (Spoke → Hub): Impact ≥8 Lugs copy to hub/intake/. Spoke tracks pending acknowledgment.

Inbound (Hub → Spoke): On wakeup, spoke reads Hub Lugs newer than its cursor, creates local Lugs with source_id, decomposes into actionable work.
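The inbound cursor read can be sketched as follows. This assumes Hub Lugs live in a JSONL file and that `created_at` uses ISO-8601 UTC timestamps, which compare correctly as plain strings; the file path and function name are hypothetical.

```python
import json

def read_new_hub_lugs(hub_lugs_path: str, cursor: str) -> list[dict]:
    """Return Hub lugs whose created_at is newer than the spoke's cursor."""
    fresh = []
    with open(hub_lugs_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            lug = json.loads(line)
            # ISO-8601 UTC timestamps sort lexicographically
            if lug.get("created_at", "") > cursor:
                fresh.append(lug)
    return fresh
```

Each fresh lug would then get a local counterpart carrying `source_id` and be decomposed into actionable work.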

Decision Records

Decision Lugs are the apprenticeship engine. They capture reasoning and alternatives, teaching agents the conductor’s judgment over time.

Session Ledger

WAI-Ledger.jsonl — append-only log of requests, agreements, and deliveries. Ensures commitments survive context loss.

| Type | Creator | Meaning |
|---|---|---|
| request | conductor | "I want this done" |
| agreement | agent | "I will do this" |
| delivery | agent | "Done" + commit hash |
| verification | conductor | "Confirmed" or "Doesn't match" |
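Appending to the ledger is a one-line-per-entry write. The sketch below assumes a minimal entry shape (`type`, `creator`, `detail`, `at`); the actual ledger schema may carry more fields.

```python
import json
import time

def append_ledger_entry(ledger_path: str, entry_type: str,
                        creator: str, detail: str) -> dict:
    """Append one entry to the append-only WAI-Ledger.jsonl."""
    entry = {
        "type": entry_type,   # request | agreement | delivery | verification
        "creator": creator,   # conductor or agent name
        "detail": detail,
        "at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Because the file is append-only, a crashed session loses at most its last unwritten entry; prior commitments survive context loss.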

Skill Contract Specification

Version 1.1.0

Skills are executable capabilities — sub-agents with defined scope, cost profile, and output contract.

Skill Types

| Type | Purpose | Tier | Write Access |
|---|---|---|---|
| reviewer | Analyze, produce diagnoses | lightweight | Lugs only |
| watcher | Monitor state changes | lightweight | Lugs only |
| guardian | Enforce policies, block | standard | Lugs + block |
| worker | Implement tasks | advanced | Code + Lugs |
| advisor | BRIEF alignment | standard | Lugs only |
| orchestrator | Reconcile, plan | advanced | Lugs + plans |

Contract Schema

skill: security-review
version: 1.2.0
type: reviewer
model:
  tier: lightweight
  min_context: 32000
trigger:
  event: on_load
  frequency: per_session
scope:
  reads: ["src/**", "WAI-Lugs.jsonl"]
  writes: ["WAI-Lugs.jsonl"]
  never: ["src/**", ".env*"]

Trigger Configuration

| Event | Fires When |
|---|---|
| on_load | Wakeup sequence |
| on_commit | After git commit |
| on_content_change | Source files modified |
| on_demand | Explicitly requested |
| pre_refactor | Before structural changes |

Scope & Permissions

never overrides writes. Only worker Skills write source code. Scope violations are logged as Lugs.
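The "never overrides writes" rule can be sketched with glob matching. This is an illustrative check, not framework code; note that Python's `fnmatch` lets `*` cross path separators, so `src/**` behaves like a recursive glob here.

```python
from fnmatch import fnmatch

def may_write(path: str, scope: dict) -> bool:
    """Check a write against a skill's scope: `never` overrides `writes`."""
    if any(fnmatch(path, pat) for pat in scope.get("never", [])):
        return False  # never wins, regardless of writes
    return any(fnmatch(path, pat) for pat in scope.get("writes", []))
```

With the security-review scope above, a write to `WAI-Lugs.jsonl` passes while writes under `src/` or to `.env` files are denied (and would be logged as Lugs).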

Tests & Use Cases

Every Skill MUST include use_cases — documentation, agent context, and institutional memory of why the Skill exists.

safe-refactor (Guardian)

Git checkpoint before structural changes. Cannot be skipped. Origin: A rogue agent destroyed a Hub folder on 2026-02-10 with no recovery.

qc-check (Reviewer)

Runs tests, verifies startup, diagnoses failures. Agents fix mechanical problems autonomously — never asks the user to debug.

hub-watcher (Watcher)

Checks Hub for signals, updates, and pending acknowledgments. Priority 1 in wakeup sequence.

framework-updater (Worker)

Applies template updates. Categorizes changes as safe/review/breaking. Auto-applies safe, creates Lugs for the rest. Depends on safe-refactor.

brief-advisor (Advisor)

Reviews BRIEF against Lug patterns. Detects contradictions between policy and practice. The apprenticeship engine.

Full spec: For complete field references, outbound monitoring, migration patterns, and BRIEF integrity checking, see llms-full.txt on GitHub.

Bench Test

Feature v1.0.0 · Internal

Bench Test is the reception-side evaluator for WAI Tracks — a prompt laboratory, synchronization library, and scoring dashboard built inside WheelWright Vault.

The runtime prompt captures a session. Bench Test receives the output, scores it, compares it to prior runs, and generates grounded improvement suggestions. It turns prompt evolution from guesswork into a repeatable engineering practice.

One-sentence model: Upload a Track → get a score → compare to last run → adopt the suggestions that matter.

Use Cases

| Use Case | What You Do | What You Get |
|---|---|---|
| Baseline a prompt version | Upload a track from a fresh prompt | Objective 0–10 score across 6 categories |
| Detect regressions | Upload track after prompt edit, compare to baseline | Per-category delta — improved / unchanged / regressed |
| Evidence-grounded iteration | Review generated suggestions | Specific findings from your actual track, not generic tips |
| Manage a change queue | Adopt, defer, or reject each suggestion | Curated list of prompt changes for next iteration |
| Build a prompt history | Keep running Bench Test across versions | Full audit trail of why the prompt changed over time |

Workflow

Step 1 — Prepare your Track

Export the WAI Track JSONL from your session. The standard path is:

WAI-Spoke/sessions/track_session-YYYYMMDD-HHMM.jsonl

Each line must be a valid JSON object. The evaluator records parse failures as warnings — it does not reject a track because of a few malformed lines.
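The tolerant parse described above can be sketched like this; the function name and warning format are illustrative assumptions, not the evaluator's actual code.

```python
import json

def parse_track(jsonl_text: str) -> tuple[list[dict], list[str]]:
    """Parse a WAI Track; malformed lines become warnings, not failures."""
    turns, warnings = [], []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored entirely
        try:
            turns.append(json.loads(line))
        except json.JSONDecodeError as exc:
            warnings.append(f"line {n}: {exc.msg}")
    return turns, warnings
```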

Step 2 — Create a Run

Go to /dashboard/bench-test and fill in the form. Required fields: Project, Prompt Version, and the Track JSONL (upload file or paste). All other fields are optional metadata that help you filter and compare runs later.

Or POST directly via the API:

curl -X POST https://wheelwright.ai/api/bench-test/runs \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "project": "WAIWeb",
    "promptVersion": "v2.0.18",
    "trackContent": "{\"turn\":1,...}\n{\"turn\":2,...}",
    "model": "claude-sonnet-4-6",
    "sessionCodename": "session-20260323-0844"
  }'
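The same request can be assembled programmatically. The sketch below only builds the request (URL, headers, JSON body) so it mirrors the curl call above; sending it with `urllib.request` or any HTTP client is left to the caller, and the function name is an assumption.

```python
import json

def build_run_request(api_key: str, project: str,
                      prompt_version: str, track_content: str):
    """Assemble the POST /api/bench-test/runs request shown in the curl example."""
    body = json.dumps({
        "project": project,
        "promptVersion": prompt_version,
        "trackContent": track_content,
    })
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    return "https://wheelwright.ai/api/bench-test/runs", headers, body
```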

Step 3 — Review the Evaluation

The run detail page shows the overall score, a category breakdown with bar indicators, critical issues, and strengths. Everything on this page traces back to specific findings in your track — not boilerplate.

Step 4 — Attach Supporting Artifacts (optional)

Add a chat transcript, reviewer notes, or a review document to the same run:

curl -X POST https://wheelwright.ai/api/bench-test/runs/{runId}/artifacts \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "artifactType": "chat_transcript",
    "content": "<raw transcript text>"
  }'

Valid artifact types: chat_transcript, review, notes, derived

Step 5 — Compare to a Prior Run

If a prior run exists for the same project, a Compare vs Previous button appears on the detail page. The comparison view shows side-by-side category scores, per-category deltas, and a badge summary. Link directly:

/dashboard/bench-test/compare?a={priorRunId}&b={thisRunId}

Step 6 — Work the Suggestions

Bench Test generates a list of suggested prompt improvements grounded in the evaluation findings. Each is classified:

| Classification | Meaning |
|---|---|
| critical | Active failure — address before next run |
| structural | Architectural gap — worth a dedicated prompt change |
| optional | Nice-to-have — consider when trimming later |

Mark each suggestion Adopt, Defer, or Reject. Adopted suggestions form your change list for the next iteration.

Scoring

Each category scores 0–10. The overall score is a weighted average.

| Category | Weight | What It Checks |
|---|---|---|
| Integrity | 25% | Sequential turns, no duplicates, no gaps, clean parse |
| Schema | 20% | Required fields present, no key drift (e vs type) |
| Signal Capture | 20% | Decisions, insights, thinking, open threads populated |
| Drift Handling | 10% | Evolution field present, phase transitions documented |
| Readability | 15% | Focus and action field length and substance |
| Export Reliability | 10% | Closing phase, parse error rate, truncation signals |

| Score | Grade |
|---|---|
| 9–10 | Excellent — production-quality prompt output |
| 7–8 | Good — minor gaps, nothing structural |
| 5–6 | Acceptable — real issues present, addressable |
| 3–4 | Needs Work — structural problems affecting signal value |
| 0–2 | Poor — fundamental capture failure |
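The weighted average and grade bands translate directly into code. This is a sketch under assumptions: the category keys are illustrative, and how fractional scores fall between grade bands is not specified, so the thresholds below are one plausible reading.

```python
WEIGHTS = {
    "integrity": 0.25,
    "schema": 0.20,
    "signal_capture": 0.20,
    "drift_handling": 0.10,
    "readability": 0.15,
    "export_reliability": 0.10,
}

def overall_score(categories: dict) -> float:
    """Weighted average of the six category scores (each 0-10)."""
    return round(sum(categories[c] * w for c, w in WEIGHTS.items()), 2)

def grade(score: float) -> str:
    """Map an overall score to its grade band (boundary handling assumed)."""
    if score >= 9: return "Excellent"
    if score >= 7: return "Good"
    if score >= 5: return "Acceptable"
    if score >= 3: return "Needs Work"
    return "Poor"
```

For example, category scores of 9, 8, 7, 6, 8, 7 (in table order) yield an overall score of 7.75, which lands in the Good band.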

API Reference

| Endpoint | Method | Purpose |
|---|---|---|
| /api/bench-test/runs | POST | Create run, ingest JSONL, run evaluation |
| /api/bench-test/runs | GET | List all runs for the authenticated user |
| /api/bench-test/runs/:id/artifacts | POST | Attach transcript / review / notes |
| /api/bench-test/runs/:id/artifacts | GET | List artifacts for a run |
| /api/bench-test/suggestions/:id | PATCH | Set adoption status (adopted / deferred / rejected) |

All endpoints require auth: GitHub session cookie or X-API-Key header (generate from the Vault dashboard).

Submission Tips

Export the full session. Missing turns are detected and penalised. Partial exports will lower your Integrity score.
| Category | How to score higher |
|---|---|
| Integrity | Export the complete JSONL — no partial sessions |
| Schema | Use consistent field names throughout the session |
| Signal Capture | Prompt explicitly for decisions, insights, thinking, and open threads every turn |
| Drift Handling | Populate evolution on every turn after turn 1 |
| Readability | Keep focus to 15–80 chars; make action a substantive sentence |
| Export Reliability | End the session with phase: "review" or phase: "closeout" |