WheelWright AI Framework
Version 2.0.0 · Source: llms-full.txt
Complete documentation for the WheelWright AI Framework — the hub-and-spoke knowledge operating system.
Updated comprehensive documentation will be published once the framework's llms-full.txt is finalized. For the canonical version, see the source repository on GitHub, and ensure you're working against a tagged release.
Philosophy
WheelWright treats AI context as a compounding asset. The framework provides two universal primitives — Skills (executable capabilities) and Lugs (actionable knowledge records) — connected by a single execution contract: Perceive / Execute / Verify (PEV).
Architecture
The hub-and-spoke model: the Hub is the analytical clearinghouse and shared memory. Each Spoke is a project with its own Ozi orchestration agent, Advisors, and local Lug store. Advisors navigate by folder convention — directory position defines scope.
Lug Schema Specification
Version 1.1.0
Lugs are WAI’s universal communication primitive — actionable records, not summaries. Every Lug represents decomposed, meaningful work with full traceability.
Lug Types
| Type | Purpose | Example |
|---|---|---|
| task | Work to be done | “Migrate auth from JWT to sessions” |
| diagnosis | Problem identified | “SQL injection in auth handler” |
| prescription | Recommended fix | “Parameterize query at line 47” |
| decision | Judgment call | “Accepted risk on X because Y” |
| observation | Pattern recorded | “Coverage dropped 82% → 73%” |
| preference | Workflow preference | “User prefers terse confirmations” |
| signal | High-impact (≥8) | “Architecture change affects API” |
| session | Session summary | “Security review ran, 2 issues found” |
Required Fields
```yaml
id: "lug-2026-02-11-001"           # Unique: lug-{date}-{sequence}
type: "diagnosis"                  # See Lug Types table
title: "SQL injection in auth handler"
status: "published"                # draft | published | acknowledged | in_progress | resolved
impact: 9                          # 1-10. ≥8 = signal
created_at: "2026-02-11T14:30:00Z"
created_by: "security-reviewer"
node: "ownersshare/cto"
```
PEV Fields (Perceive / Execute / Verify)
Optional structured execution context. When present, agents follow these instead of interpreting the title.
```yaml
perceive:
  look_at: ["src/auth/handler.js"]
  current_state: "Email accepts any string"
  success_state: "Email validates against RFC 5322"
execute:
  approach: "Add validation using phone.ts pattern"
  constraints: ["Do not modify phone validator"]
verify:
  commands: ["npm test -- --grep 'email'"]
  expected_output: "All tests pass"
```
Lifecycle
draft → published → acknowledged → in_progress → resolved
Impact Scoring
| Score | Visibility | Example |
|---|---|---|
| 1–3 | Local only | “Refactored helper” |
| 4–7 | Project-wide | “API contract changed” |
| 8–10 | Wheel-wide signal → Hub | “Architecture pattern across projects” |
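The schema, lifecycle, and impact rules above can be folded into one small validator. This is a minimal sketch: the field names, statuses, and the ≥8 signal threshold come from this spec, but the helper names, the error messages, and the assumption that lifecycle stages may not be skipped are illustrative.

```python
REQUIRED = ("id", "type", "title", "status", "impact", "created_at", "created_by", "node")
LIFECYCLE = ["draft", "published", "acknowledged", "in_progress", "resolved"]

def validate_lug(lug: dict) -> list[str]:
    """Return a list of problems; an empty list means the Lug is well-formed."""
    errors = [f"missing field: {f}" for f in REQUIRED if f not in lug]
    if errors:
        return errors
    if lug["status"] not in LIFECYCLE:
        errors.append(f"unknown status: {lug['status']}")
    if not 1 <= lug["impact"] <= 10:
        errors.append("impact must be 1-10")
    return errors

def is_signal(lug: dict) -> bool:
    """Impact >= 8 promotes a Lug to a wheel-wide signal routed to the Hub."""
    return lug.get("impact", 0) >= 8

def can_transition(current: str, target: str) -> bool:
    """Assume strictly sequential lifecycle moves (the spec does not say stages may be skipped)."""
    if current not in LIFECYCLE or target not in LIFECYCLE:
        return False
    return LIFECYCLE.index(target) == LIFECYCLE.index(current) + 1
```

Under these assumptions, the diagnosis from the Required Fields example validates cleanly and registers as a signal.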
Cross-Node Communication
Outbound (Spoke → Hub): Impact ≥8 Lugs copy to hub/intake/. Spoke tracks pending acknowledgment.
Inbound (Hub → Spoke): On wakeup, spoke reads Hub Lugs newer than its cursor, creates local Lugs with source_id, decomposes into actionable work.
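The outbound and inbound flows can be sketched as two small functions. This is illustrative only: hub/intake/ comes from the spec, but the pending_ack.jsonl filename, the cursor as an ISO timestamp string, and the shape of the derived local Lugs are assumptions.

```python
import json
from pathlib import Path

def publish_outbound(lug: dict, spoke_dir: Path, hub_intake: Path) -> None:
    """Copy an impact >= 8 Lug to hub/intake/ and record it as pending acknowledgment."""
    if lug["impact"] >= 8:
        (hub_intake / f"{lug['id']}.json").write_text(json.dumps(lug))
        with open(spoke_dir / "pending_ack.jsonl", "a") as f:
            f.write(json.dumps({"id": lug["id"]}) + "\n")

def pull_inbound(hub_lugs: list[dict], cursor: str) -> list[dict]:
    """On wakeup, take Hub Lugs newer than the cursor and derive local draft Lugs with source_id."""
    fresh = [l for l in hub_lugs if l["created_at"] > cursor]
    return [{"source_id": l["id"], "title": l["title"], "status": "draft"} for l in fresh]
```

Comparing ISO 8601 timestamps lexically works because the format sorts chronologically.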
Decision Records
Decision Lugs are the apprenticeship engine. They capture reasoning and alternatives, teaching agents the conductor’s judgment over time.
Session Ledger
WAI-Ledger.jsonl — append-only log of requests, agreements, and deliveries. Ensures commitments survive context loss.
| Type | Creator | Meaning |
|---|---|---|
| request | conductor | “I want this done” |
| agreement | agent | “I will do this” |
| delivery | agent | “Done” + commit hash |
| verification | conductor | “Confirmed” or “Doesn’t match” |
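An append-only JSONL ledger like WAI-Ledger.jsonl can be maintained with a few lines. This is a sketch: the type and creator fields mirror the table above, but the detail field and the helper names are assumptions, not part of the spec.

```python
import json
from pathlib import Path

def append_entry(ledger: Path, entry_type: str, creator: str, detail: str) -> None:
    """Append one ledger line; the file is never rewritten, so commitments survive context loss."""
    line = json.dumps({"type": entry_type, "creator": creator, "detail": detail})
    with open(ledger, "a") as f:
        f.write(line + "\n")

def open_requests(ledger: Path) -> list[dict]:
    """Requests with no matching delivery are still open commitments."""
    entries = [json.loads(l) for l in ledger.read_text().splitlines()]
    delivered = {e["detail"] for e in entries if e["type"] == "delivery"}
    return [e for e in entries if e["type"] == "request" and e["detail"] not in delivered]
```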
Skill Contract Specification
Version 1.1.0
Skills are executable capabilities — sub-agents with defined scope, cost profile, and output contract.
Skill Types
| Type | Purpose | Tier | Write Access |
|---|---|---|---|
| reviewer | Analyze, produce diagnoses | lightweight | Lugs only |
| watcher | Monitor state changes | lightweight | Lugs only |
| guardian | Enforce policies, block | standard | Lugs + block |
| worker | Implement tasks | advanced | Code + Lugs |
| advisor | BRIEF alignment | standard | Lugs only |
| orchestrator | Reconcile, plan | advanced | Lugs + plans |
Contract Schema
```yaml
skill: security-review
version: 1.2.0
type: reviewer
model:
  tier: lightweight
  min_context: 32000
trigger:
  event: on_load
  frequency: per_session
scope:
  reads: ["src/**", "WAI-Lugs.jsonl"]
  writes: ["WAI-Lugs.jsonl"]
  never: ["src/**", ".env*"]
```
Trigger Configuration
| Event | Fires When |
|---|---|
| on_load | Wakeup sequence |
| on_commit | After git commit |
| on_content_change | Source files modified |
| on_demand | Explicitly requested |
| pre_refactor | Before structural changes |
Scope & Permissions
never overrides writes. Only worker Skills write source code. Scope violations are logged as Lugs.
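The never-overrides-writes rule can be expressed as a glob check. This is a sketch: fnmatch-style matching of the scope patterns is an assumption, and the function name is illustrative.

```python
from fnmatch import fnmatch

def may_write(path: str, scope: dict) -> bool:
    """A write is allowed only if some writes pattern matches and no never pattern does."""
    if any(fnmatch(path, p) for p in scope.get("never", [])):
        return False  # never always wins, even over an explicit writes grant
    return any(fnmatch(path, p) for p in scope.get("writes", []))
```

With the security-review scope from the Contract Schema, a write to WAI-Lugs.jsonl passes while any write under src/ is refused.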
Tests & Use Cases
Every Skill MUST include use_cases — documentation, agent context, and institutional memory of why the Skill exists.
safe-refactor (Guardian)
Git checkpoint before structural changes. Cannot be skipped. Origin: A rogue agent destroyed a Hub folder on 2026-02-10 with no recovery.
qc-check (Reviewer)
Runs tests, verifies startup, diagnoses failures. Agents fix mechanical problems autonomously — never asks the user to debug.
hub-watcher (Watcher)
Checks Hub for signals, updates, and pending acknowledgments. Priority 1 in wakeup sequence.
framework-updater (Worker)
Applies template updates. Categorizes changes as safe/review/breaking. Auto-applies safe, creates Lugs for the rest. Depends on safe-refactor.
brief-advisor (Advisor)
Reviews BRIEF against Lug patterns. Detects contradictions between policy and practice. The apprenticeship engine.
Bench Test
Feature v1.0.0 · Internal · Open Dashboard →
Bench Test is the reception-side evaluator for WAI Tracks — a prompt laboratory, synchronization library, and scoring dashboard built inside WheelWright Vault.
The runtime prompt captures a session. Bench Test receives the output, scores it, compares it to prior runs, and generates grounded improvement suggestions. It turns prompt evolution from guesswork into a repeatable engineering practice.
Use Cases
| Use Case | What You Do | What You Get |
|---|---|---|
| Baseline a prompt version | Upload a track from a fresh prompt | Objective 0–10 score across 6 categories |
| Detect regressions | Upload track after prompt edit, compare to baseline | Per-category delta — improved / unchanged / regressed |
| Evidence-grounded iteration | Review generated suggestions | Specific findings from your actual track, not generic tips |
| Manage a change queue | Adopt, defer, or reject each suggestion | Curated list of prompt changes for next iteration |
| Build a prompt history | Keep running Bench Test across versions | Full audit trail of why the prompt changed over time |
Workflow
Step 1 — Prepare your Track
Export the WAI Track JSONL from your session. The standard path is:
WAI-Spoke/sessions/track_session-YYYYMMDD-HHMM.jsonl
Each line must be a valid JSON object. The evaluator records parse failures as warnings — it does not reject a track because of a few malformed lines.
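The tolerant parsing described above might look like this sketch: malformed lines become warnings rather than rejections. The return shape and warning wording are assumptions, not Bench Test's actual internals.

```python
import json

def parse_track(raw: str) -> tuple[list[dict], list[str]]:
    """Parse a Track JSONL; malformed lines are recorded as warnings, not fatal errors."""
    turns, warnings = [], []
    for n, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines entirely
        try:
            turns.append(json.loads(line))
        except json.JSONDecodeError as exc:
            warnings.append(f"line {n}: {exc.msg}")
    return turns, warnings
```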
Step 2 — Create a Run
Go to /dashboard/bench-test and fill in the form. Required fields: Project, Prompt Version, and the Track JSONL (upload file or paste). All other fields are optional metadata that help you filter and compare runs later.
Or POST directly via the API:
```bash
curl -X POST https://wheelwright.ai/api/bench-test/runs \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "project": "WAIWeb",
    "promptVersion": "v2.0.18",
    "trackContent": "{\"turn\":1,...}\n{\"turn\":2,...}",
    "model": "claude-sonnet-4-6",
    "sessionCodename": "session-20260323-0844"
  }'
```
Step 3 — Review the Evaluation
The run detail page shows the overall score, a category breakdown with bar indicators, critical issues, and strengths. Everything on this page traces back to specific findings in your track — not boilerplate.
Step 4 — Attach Supporting Artifacts (optional)
Add a chat transcript, reviewer notes, or a review document to the same run:
```bash
curl -X POST https://wheelwright.ai/api/bench-test/runs/{runId}/artifacts \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "artifactType": "chat_transcript",
    "content": "<raw transcript text>"
  }'
```
Valid artifact types: chat_transcript, review, notes, derived
Step 5 — Compare to a Prior Run
If a prior run exists for the same project, a Compare vs Previous button appears on the detail page. The comparison view shows side-by-side category scores, per-category deltas, and a badge summary. Link directly:
/dashboard/bench-test/compare?a={priorRunId}&b={thisRunId}
Step 6 — Work the Suggestions
Bench Test generates a list of suggested prompt improvements grounded in the evaluation findings. Each is classified:
| Classification | Meaning |
|---|---|
| critical | Active failure — address before next run |
| structural | Architectural gap — worth a dedicated prompt change |
| optional | Nice-to-have — consider when trimming later |
Mark each suggestion Adopt, Defer, or Reject. Adopted suggestions form your change list for the next iteration.
Scoring
Each category scores 0–10. The overall score is a weighted average.
| Category | Weight | What It Checks |
|---|---|---|
| Integrity | 25% | Sequential turns, no duplicates, no gaps, clean parse |
| Schema | 20% | Required fields present, no key drift (e vs type) |
| Signal Capture | 20% | Decisions, insights, thinking, open threads populated |
| Drift Handling | 10% | Evolution field present, phase transitions documented |
| Readability | 15% | Focus and action field length and substance |
| Export Reliability | 10% | Closing phase, parse error rate, truncation signals |
| Score | Grade |
|---|---|
| 9–10 | Excellent — production-quality prompt output |
| 7–8 | Good — minor gaps, nothing structural |
| 5–6 | Acceptable — real issues present, addressable |
| 3–4 | Needs Work — structural problems affecting signal value |
| 0–2 | Poor — fundamental capture failure |
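The weighted average and grade bands from the two tables above might be computed like this sketch. The weights and bands come from the tables; the category keys and function names are illustrative, not Bench Test's actual code.

```python
WEIGHTS = {
    "integrity": 0.25, "schema": 0.20, "signal_capture": 0.20,
    "drift_handling": 0.10, "readability": 0.15, "export_reliability": 0.10,
}

def overall_score(categories: dict[str, float]) -> float:
    """Weighted average of the six 0-10 category scores."""
    return round(sum(categories[c] * w for c, w in WEIGHTS.items()), 1)

def grade(score: float) -> str:
    """Map an overall score onto the grade bands."""
    if score >= 9: return "Excellent"
    if score >= 7: return "Good"
    if score >= 5: return "Acceptable"
    if score >= 3: return "Needs Work"
    return "Poor"
```

Note the weights sum to 100%, so a uniform category score passes through unchanged.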
API Reference
| Endpoint | Method | Purpose |
|---|---|---|
| /api/bench-test/runs | POST | Create run, ingest JSONL, run evaluation |
| /api/bench-test/runs | GET | List all runs for the authenticated user |
| /api/bench-test/runs/:id/artifacts | POST | Attach transcript / review / notes |
| /api/bench-test/runs/:id/artifacts | GET | List artifacts for a run |
| /api/bench-test/suggestions/:id | PATCH | Set adoption status (adopted / deferred / rejected) |
All endpoints require auth: GitHub session cookie or X-API-Key header (generate from the Vault dashboard).
Submission Tips
| Category | How to score higher |
|---|---|
| Integrity | Export the complete JSONL — no partial sessions |
| Schema | Use consistent field names throughout the session |
| Signal Capture | Prompt explicitly for decisions, insights, thinking, and open threads every turn |
| Drift Handling | Populate evolution on every turn after turn 1 |
| Readability | Keep focus to 15–80 chars; make action a substantive sentence |
| Export Reliability | End the session with phase: "review" or phase: "closeout" |
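The tips above can be turned into a pre-submission lint pass. This is a sketch only: the 15–80 character focus bound and the turn-1 evolution rule come from the tables, while the word-count heuristic for a "substantive" action and the function name are assumptions.

```python
def lint_turn(turn: dict) -> list[str]:
    """Heuristic checks mirroring the submission tips; returns advisory messages, not errors."""
    tips = []
    focus = turn.get("focus", "")
    if not 15 <= len(focus) <= 80:
        tips.append("keep focus to 15-80 chars")
    if len(turn.get("action", "").split()) < 4:  # crude proxy for "a substantive sentence"
        tips.append("make action a substantive sentence")
    if turn.get("turn", 1) > 1 and "evolution" not in turn:
        tips.append("populate evolution on every turn after turn 1")
    return tips
```

Running it over each parsed turn before upload catches the most common Readability and Drift Handling deductions early.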