WheelWright AI Framework

Version 2.0.0 · Source: llms-full.txt

Complete documentation for the WheelWright AI Framework — the hub-and-spoke knowledge operating system.

Note: This page is generated from the framework's llms-full.txt and reflects the documentation as of commit 1ed6bde. Updated comprehensive documentation will be published once llms-full.txt is finalized. For the canonical, most current version, see the source in the framework repository on GitHub, and ensure you're working against a tagged release.

Philosophy

WheelWright treats AI context as a compounding asset. The framework provides two universal primitives — Skills (executable capabilities) and Lugs (actionable knowledge records) — connected by a single execution contract: Perceive / Execute / Verify (PEV).

Architecture

The hub-and-spoke model: the Hub is the analytical clearinghouse and shared memory. Each Spoke is a project with its own Ozi orchestration agent, Advisors, and local Lug store. Advisors navigate by folder convention — directory position defines scope.

Lug Schema Specification

Version 1.1.0

Lugs are WAI’s universal communication primitive — actionable records, not summaries. Every Lug represents decomposed, meaningful work with full traceability.

Core Principles: Actionable, Traceable, Idempotent, Decomposed, Self-contained.

Lug Types

| Type | Purpose | Example |
|---|---|---|
| task | Work to be done | "Migrate auth from JWT to sessions" |
| diagnosis | Problem identified | "SQL injection in auth handler" |
| prescription | Recommended fix | "Parameterize query at line 47" |
| decision | Judgment call | "Accepted risk on X because Y" |
| observation | Pattern recorded | "Coverage dropped 82% → 73%" |
| preference | Workflow preference | "User prefers terse confirmations" |
| signal | High-impact (≥8) | "Architecture change affects API" |
| session | Session summary | "Security review ran, 2 issues found" |

Required Fields

id: "lug-2026-02-11-001"          # Unique: lug-{date}-{sequence}
type: "diagnosis"                 # See Lug Types table
title: "SQL injection in auth handler"
status: "published"               # draft | published | acknowledged | in_progress | resolved
impact: 9                         # 1-10. ≥8 = signal
created_at: "2026-02-11T14:30:00Z"
created_by: "security-reviewer"
node: "ownersshare/cto"
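The required-field and status rules above can be checked mechanically. The helper below is an illustrative sketch, not part of the framework; the function name and return shape are assumptions.

```python
REQUIRED_FIELDS = {"id", "type", "title", "status", "impact",
                   "created_at", "created_by", "node"}
VALID_STATUSES = {"draft", "published", "acknowledged", "in_progress", "resolved"}

def validate_lug(lug: dict) -> list[str]:
    """Return a list of problems; an empty list means the lug is valid."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - lug.keys())]
    if lug.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {lug.get('status')!r}")
    impact = lug.get("impact")
    if not isinstance(impact, int) or not 1 <= impact <= 10:
        problems.append("impact must be an integer from 1 to 10")
    return problems
```

A lug built from the example fields above would validate cleanly; one with `impact: 11` would not.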

PEV Fields (Perceive / Execute / Verify)

Optional structured execution context. When present, agents follow these instead of interpreting the title.

perceive:
  look_at: ["src/auth/handler.js"]
  current_state: "Email accepts any string"
  success_state: "Email validates against RFC 5322"
execute:
  approach: "Add validation using phone.ts pattern"
  constraints: ["Do not modify phone validator"]
verify:
  commands: ["npm test -- --grep 'email'"]
  expected_output: "All tests pass"
PEV is optional and backward compatible. Existing Lugs without PEV continue to work.

Lifecycle

draft → published → acknowledged → in_progress → resolved
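The arrow diagram suggests transitions are strictly sequential; assuming that reading (the spec does not state it explicitly), a transition check could look like this hypothetical sketch:

```python
LIFECYCLE = ["draft", "published", "acknowledged", "in_progress", "resolved"]

def can_advance(current: str, target: str) -> bool:
    """True if target is the next stage after current in the lug lifecycle."""
    i = LIFECYCLE.index(current)  # raises ValueError for unknown statuses
    return i + 1 < len(LIFECYCLE) and LIFECYCLE[i + 1] == target
```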

Impact Scoring

| Score | Visibility | Example |
|---|---|---|
| 1–3 | Local only | "Refactored helper" |
| 4–7 | Project-wide | "API contract changed" |
| 8–10 | Wheel-wide signal → Hub | "Architecture pattern across projects" |
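The score-to-visibility mapping above is a simple threshold function; a minimal sketch (names are illustrative, not framework API):

```python
def visibility(impact: int) -> str:
    """Map an impact score (1-10) to its visibility tier."""
    if not 1 <= impact <= 10:
        raise ValueError("impact must be 1-10")
    if impact >= 8:
        return "wheel-wide signal"   # copied to the Hub
    if impact >= 4:
        return "project-wide"
    return "local"
```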

Cross-Node Communication

Outbound (Spoke → Hub): Impact ≥8 Lugs copy to hub/intake/. Spoke tracks pending acknowledgment.

Inbound (Hub → Spoke): On wakeup, spoke reads Hub Lugs newer than its cursor, creates local Lugs with source_id, decomposes into actionable work.
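The inbound cursor read can be sketched as follows. This assumes Hub Lugs live in a JSONL file and that `created_at` uses ISO-8601 UTC timestamps, which compare correctly as plain strings; the file path and function name are hypothetical.

```python
import json

def read_new_hub_lugs(hub_lugs_path: str, cursor: str) -> list[dict]:
    """Return Hub lugs whose created_at is newer than the spoke's cursor."""
    fresh = []
    with open(hub_lugs_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            lug = json.loads(line)
            # ISO-8601 UTC timestamps sort lexicographically
            if lug.get("created_at", "") > cursor:
                fresh.append(lug)
    return fresh
```

Each fresh lug would then get a local counterpart carrying `source_id` and be decomposed into actionable work.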

Decision Records

Decision Lugs are the apprenticeship engine. They capture reasoning and alternatives, teaching agents the conductor’s judgment over time.

Session Ledger

WAI-Ledger.jsonl — append-only log of requests, agreements, and deliveries. Ensures commitments survive context loss.

| Type | Creator | Meaning |
|---|---|---|
| request | conductor | "I want this done" |
| agreement | agent | "I will do this" |
| delivery | agent | "Done" + commit hash |
| verification | conductor | "Confirmed" or "Doesn't match" |
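Appending to the ledger is a one-line-per-entry write. The sketch below assumes a minimal entry shape (`type`, `creator`, `detail`, `at`); the actual ledger schema may carry more fields.

```python
import json
import time

def append_ledger_entry(ledger_path: str, entry_type: str,
                        creator: str, detail: str) -> dict:
    """Append one entry to the append-only WAI-Ledger.jsonl."""
    entry = {
        "type": entry_type,   # request | agreement | delivery | verification
        "creator": creator,   # conductor or agent name
        "detail": detail,
        "at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Because the file is append-only, a crashed session loses at most its last unwritten entry; prior commitments survive context loss.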

Skill Contract Specification

Version 1.1.0

Skills are executable capabilities — sub-agents with defined scope, cost profile, and output contract.

Skill Types

| Type | Purpose | Tier | Write Access |
|---|---|---|---|
| reviewer | Analyze, produce diagnoses | lightweight | Lugs only |
| watcher | Monitor state changes | lightweight | Lugs only |
| guardian | Enforce policies, block | standard | Lugs + block |
| worker | Implement tasks | advanced | Code + Lugs |
| advisor | BRIEF alignment | standard | Lugs only |
| orchestrator | Reconcile, plan | advanced | Lugs + plans |

Contract Schema

skill: security-review
version: 1.2.0
type: reviewer
model:
  tier: lightweight
  min_context: 32000
trigger:
  event: on_load
  frequency: per_session
scope:
  reads: ["src/**", "WAI-Lugs.jsonl"]
  writes: ["WAI-Lugs.jsonl"]
  never: ["src/**", ".env*"]

Trigger Configuration

| Event | Fires When |
|---|---|
| on_load | Wakeup sequence |
| on_commit | After git commit |
| on_content_change | Source files modified |
| on_demand | Explicitly requested |
| pre_refactor | Before structural changes |

Scope & Permissions

never overrides writes. Only worker Skills write source code. Scope violations are logged as Lugs.
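The "never overrides writes" rule can be sketched with glob matching. This is an illustrative check, not framework code; note that Python's `fnmatch` lets `*` cross path separators, so `src/**` behaves like a recursive glob here.

```python
from fnmatch import fnmatch

def may_write(path: str, scope: dict) -> bool:
    """Check a write against a skill's scope: `never` overrides `writes`."""
    if any(fnmatch(path, pat) for pat in scope.get("never", [])):
        return False  # never wins, regardless of writes
    return any(fnmatch(path, pat) for pat in scope.get("writes", []))
```

With the security-review scope above, a write to `WAI-Lugs.jsonl` passes while writes under `src/` or to `.env` files are denied (and would be logged as Lugs).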

Tests & Use Cases

Every Skill MUST include use_cases — documentation, agent context, and institutional memory of why the Skill exists.

safe-refactor (Guardian)

Git checkpoint before structural changes. Cannot be skipped. Origin: A rogue agent destroyed a Hub folder on 2026-02-10 with no recovery.

qc-check (Reviewer)

Runs tests, verifies startup, diagnoses failures. Agents fix mechanical problems autonomously — never asks the user to debug.

hub-watcher (Watcher)

Checks Hub for signals, updates, and pending acknowledgments. Priority 1 in wakeup sequence.

framework-updater (Worker)

Applies template updates. Categorizes changes as safe/review/breaking. Auto-applies safe, creates Lugs for the rest. Depends on safe-refactor.

brief-advisor (Advisor)

Reviews BRIEF against Lug patterns. Detects contradictions between policy and practice. The apprenticeship engine.

Full spec: For complete field references, outbound monitoring, migration patterns, and BRIEF integrity checking, see llms-full.txt on GitHub.

Bench Test

Feature v1.0.0 · Internal

Bench Test is the reception-side evaluator for WAI Tracks — a prompt laboratory, synchronization library, and scoring dashboard built inside WheelWright Vault.

The runtime prompt captures a session. Bench Test receives the output, scores it, compares it to prior runs, and generates grounded improvement suggestions. It turns prompt evolution from guesswork into a repeatable engineering practice.

One-sentence model: Upload a Track → get a score → compare to last run → adopt the suggestions that matter.

Use Cases

| Use Case | What You Do | What You Get |
|---|---|---|
| Baseline a prompt version | Upload a track from a fresh prompt | Objective 0–10 score across 6 categories |
| Detect regressions | Upload track after prompt edit, compare to baseline | Per-category delta — improved / unchanged / regressed |
| Evidence-grounded iteration | Review generated suggestions | Specific findings from your actual track, not generic tips |
| Manage a change queue | Adopt, defer, or reject each suggestion | Curated list of prompt changes for next iteration |
| Build a prompt history | Keep running Bench Test across versions | Full audit trail of why the prompt changed over time |

Workflow

Step 1 — Prepare your Track

Export the WAI Track JSONL from your session. The standard path is:

WAI-Spoke/sessions/track_session-YYYYMMDD-HHMM.jsonl

Each line must be a valid JSON object. The evaluator records parse failures as warnings — it does not reject a track because of a few malformed lines.
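The tolerant parse described above can be sketched like this; the function name and warning format are illustrative assumptions, not the evaluator's actual code.

```python
import json

def parse_track(jsonl_text: str) -> tuple[list[dict], list[str]]:
    """Parse a WAI Track; malformed lines become warnings, not failures."""
    turns, warnings = [], []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored entirely
        try:
            turns.append(json.loads(line))
        except json.JSONDecodeError as exc:
            warnings.append(f"line {n}: {exc.msg}")
    return turns, warnings
```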

Step 2 — Create a Run

Go to /dashboard/bench-test and fill in the form. Required fields: Project, Prompt Version, and the Track JSONL (upload file or paste). All other fields are optional metadata that help you filter and compare runs later.

Or POST directly via the API:

curl -X POST https://wheelwright.ai/api/bench-test/runs \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "project": "WAIWeb",
    "promptVersion": "v2.0.18",
    "trackContent": "{\"turn\":1,...}\n{\"turn\":2,...}",
    "model": "claude-sonnet-4-6",
    "sessionCodename": "session-20260323-0844"
  }'
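The same request can be assembled programmatically. The sketch below only builds the request (URL, headers, JSON body) so it mirrors the curl call above; sending it with `urllib.request` or any HTTP client is left to the caller, and the function name is an assumption.

```python
import json

def build_run_request(api_key: str, project: str,
                      prompt_version: str, track_content: str):
    """Assemble the POST /api/bench-test/runs request shown in the curl example."""
    body = json.dumps({
        "project": project,
        "promptVersion": prompt_version,
        "trackContent": track_content,
    })
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    return "https://wheelwright.ai/api/bench-test/runs", headers, body
```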

Step 3 — Review the Evaluation

The run detail page shows the overall score, a category breakdown with bar indicators, critical issues, and strengths. Everything on this page traces back to specific findings in your track — not boilerplate.

Step 4 — Attach Supporting Artifacts (optional)

Add a chat transcript, reviewer notes, or a review document to the same run:

curl -X POST https://wheelwright.ai/api/bench-test/runs/{runId}/artifacts \
  -H 'X-API-Key: <your-key>' \
  -H 'Content-Type: application/json' \
  -d '{
    "artifactType": "chat_transcript",
    "content": "<raw transcript text>"
  }'

Valid artifact types: chat_transcript, review, notes, derived

Step 5 — Compare to a Prior Run

If a prior run exists for the same project, a Compare vs Previous button appears on the detail page. The comparison view shows side-by-side category scores, per-category deltas, and a badge summary. Link directly:

/dashboard/bench-test/compare?a={priorRunId}&b={thisRunId}

Step 6 — Work the Suggestions

Bench Test generates a list of suggested prompt improvements grounded in the evaluation findings. Each is classified:

| Classification | Meaning |
|---|---|
| critical | Active failure — address before next run |
| structural | Architectural gap — worth a dedicated prompt change |
| optional | Nice-to-have — consider when trimming later |

Mark each suggestion Adopt, Defer, or Reject. Adopted suggestions form your change list for the next iteration.

Scoring

Each category scores 0–10. The overall score is a weighted average.

| Category | Weight | What It Checks |
|---|---|---|
| Integrity | 25% | Sequential turns, no duplicates, no gaps, clean parse |
| Schema | 20% | Required fields present, no key drift (e vs type) |
| Signal Capture | 20% | Decisions, insights, thinking, open threads populated |
| Drift Handling | 10% | Evolution field present, phase transitions documented |
| Readability | 15% | Focus and action field length and substance |
| Export Reliability | 10% | Closing phase, parse error rate, truncation signals |

| Score | Grade |
|---|---|
| 9–10 | Excellent — production-quality prompt output |
| 7–8 | Good — minor gaps, nothing structural |
| 5–6 | Acceptable — real issues present, addressable |
| 3–4 | Needs Work — structural problems affecting signal value |
| 0–2 | Poor — fundamental capture failure |
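The weighted average and grade bands translate directly into code. This is a sketch under assumptions: the category keys are illustrative, and how fractional scores fall between grade bands is not specified, so the thresholds below are one plausible reading.

```python
WEIGHTS = {
    "integrity": 0.25,
    "schema": 0.20,
    "signal_capture": 0.20,
    "drift_handling": 0.10,
    "readability": 0.15,
    "export_reliability": 0.10,
}

def overall_score(categories: dict) -> float:
    """Weighted average of the six category scores (each 0-10)."""
    return round(sum(categories[c] * w for c, w in WEIGHTS.items()), 2)

def grade(score: float) -> str:
    """Map an overall score to its grade band (boundary handling assumed)."""
    if score >= 9: return "Excellent"
    if score >= 7: return "Good"
    if score >= 5: return "Acceptable"
    if score >= 3: return "Needs Work"
    return "Poor"
```

For example, category scores of 9, 8, 7, 6, 8, 7 (in table order) yield an overall score of 7.75, which lands in the Good band.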

API Reference

| Endpoint | Method | Purpose |
|---|---|---|
| /api/bench-test/runs | POST | Create run, ingest JSONL, run evaluation |
| /api/bench-test/runs | GET | List all runs for the authenticated user |
| /api/bench-test/runs/:id/artifacts | POST | Attach transcript / review / notes |
| /api/bench-test/runs/:id/artifacts | GET | List artifacts for a run |
| /api/bench-test/suggestions/:id | PATCH | Set adoption status (adopted / deferred / rejected) |

All endpoints require auth: GitHub session cookie or X-API-Key header (generate from the Vault dashboard).

Submission Tips

Export the full session. Missing turns are detected and penalised. Partial exports will lower your Integrity score.
| Category | How to score higher |
|---|---|
| Integrity | Export the complete JSONL — no partial sessions |
| Schema | Use consistent field names throughout the session |
| Signal Capture | Prompt explicitly for decisions, insights, thinking, and open threads every turn |
| Drift Handling | Populate evolution on every turn after turn 1 |
| Readability | Keep focus to 15–80 chars; make action a substantive sentence |
| Export Reliability | End the session with phase: "review" or phase: "closeout" |