The Open-Source Toolkit That Stops AI Coding Agents From Shipping Sloppy Code

How we built a production-grade engineering layer for Claude, Codex, Cursor, Windsurf, and Copilot — after nearly merging an AI-generated database disaster.

#MIT Licensed #Open Source #Production Tested

May 22 • From Toan Vo

01 — The PR That Almost Shipped

A few months ago, one of our engineers asked Claude Code to build a new endpoint. Twenty minutes later, a beautiful diff appeared. Tests passed. Linter green. The PR description was — let's be honest — better than what most of us write at 4 PM on a Friday.

We almost merged it.

Then, on a hunch, we asked a different AI to review the same diff. Codex came back with one line:

src/api/users.kt:84: bug: N+1 query inside .map {} loop.  Each user triggers a separate `byBountyId()` call.  Fix: batch-fetch by ids before the loop.

It was right. The PR would have passed code review, passed staging, and quietly torched the database the first time someone hit the endpoint in production.
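To make the finding concrete, here is a minimal sketch of the pattern Codex flagged and the batch-fetch fix. It is written in Python for brevity rather than the Kotlin of the real PR, and `CountingRepo`, `all_ids`, `by_id`, and `by_ids` are hypothetical names invented for illustration, not code from the toolkit or from our service.

```python
from dataclasses import dataclass


@dataclass
class CountingRepo:
    """Hypothetical in-memory repository that counts simulated round trips."""
    rows: dict
    queries: int = 0

    def all_ids(self) -> list:
        self.queries += 1              # one query for the id list
        return list(self.rows)

    def by_id(self, user_id):
        self.queries += 1              # one query per call
        return self.rows[user_id]

    def by_ids(self, user_ids) -> dict:
        self.queries += 1              # one query for the whole batch
        return {uid: self.rows[uid] for uid in user_ids}


def names_n_plus_one(repo):
    # Anti-pattern: one repository call per element inside the loop.
    return [repo.by_id(uid) for uid in repo.all_ids()]


def names_batched(repo):
    # Fix: batch-fetch by ids before the loop, then read from memory.
    ids = repo.all_ids()
    found = repo.by_ids(ids)
    return [found[uid] for uid in ids]
```

With 100 rows, the first version issues 101 simulated queries and the second issues 2, while both return the same list. That gap is invisible in a unit test with three fixtures and very visible under production traffic.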

That was the moment we stopped trusting any single AI. It's also the moment Spartan AI Toolkit started taking shape.

Key insight: The problem wasn't that the AI wrote bad code. The problem was that no system existed to catch it.

02 — Three Failure Modes of AI Coding Agents

In the last 18 months we've shipped real features with Claude, Codex, Cursor, Windsurf, and Copilot. They are extraordinary. Out of the box, they are also junior engineers with no manager, no specs, no tests, and no memory.

Three failure modes show up over and over:

🚫 No Discipline. Vibe coding: prompt → diff → commit. No spec. No plan. No TDD. The AI ships the first thing that compiles.

🌫 No Memory. Every session starts from zero. Yesterday's architectural decision is gone. Last week's "do not use !! in this codebase" rule is gone.

♾ No Review. Your only check is your own eyes, and the diff looks fine, and the tests pass. See Section 01.

Linters don't catch this. Style guides in Confluence don't catch this. A single AI assistant reviewing its own output doesn't catch it either — it has the same blind spots as the model that wrote the code.

So we built a toolkit that fixes all three.

03 — Meet Spartan

Spartan AI Toolkit is an open-source engineering layer that sits on top of your AI coding agent. It turns Claude Code (and Codex, and Cursor, and Windsurf, and Copilot) from a chatbot into something that behaves like a structured engineering team.

What	How many
Slash commands	70
Domain skills	35
Specialized agents	9
Configurable rule files	29
AI tools supported	5 — Claude Code, Codex, Cursor, Windsurf, Copilot
MCP integrations	4+ — Playwright, Notion, Figma, Gemini
Stack profiles	8 — Kotlin/Micronaut, Next.js, Go, FastAPI, Django, Spring, Node, custom
License	MIT

npx @c0x12c/ai-toolkit@latest --local

154 commits in the last 60 days. Still moving fast. Still public.

The Four Principles Behind Every Command

Underneath the rules, the gates, and the workflows, four ideas hold Spartan together. They live in ETHOS.md and get injected into the preamble of every command.

Boil the Lake. AI makes completeness near-free. When the complete version costs minutes more than the shortcut — do the complete thing.

Search Before Building. Three layers: tried-and-true → new-and-popular → first-principles. Reach for the existing answer before inventing a new one.

User Sovereignty. AI models recommend. Users decide. This overrides all other rules. Two AI models agreeing is signal, not mandate.

Do the Work, Not the Performance of Work. Building is not the performance of building. It becomes real when it ships and solves a real problem for a real person.

04 — Cross-AI Peer Review: Claude Builds, Codex Reviews, Claude Fixes

This is the flagship. /spartan:ship-pr-codex --rounds 3 is one slash command. Behind it:

Claude Code → GitHub PR → Codex CLI reviews → Findings surfaced → Claude fixes → ✓ Merged

What makes this interesting is how the review escalates. Each round uses a sharper prompt:

Round	Stance
1	Surface review. Obvious bugs. Missing tests. Broken contracts.
2	Harder. Race conditions, N+1 queries, swallowed errors.
3+	Brutal. Reject AI-generic code, premature abstraction, untested branches.
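The escalation in the table can be sketched as a small loop. This is an illustration under stated assumptions, not the command's actual implementation: `review` and `fix` are hypothetical callables standing in for the Codex review call and the Claude fix step, and running every round even after a clean one is our simplification.

```python
# Stances copied from the table above; round 3 and later share the last one.
STANCES = {
    1: "Surface review. Obvious bugs. Missing tests. Broken contracts.",
    2: "Harder. Race conditions, N+1 queries, swallowed errors.",
    3: "Brutal. Reject AI-generic code, premature abstraction, untested branches.",
}


def stance_for(round_no: int) -> str:
    if round_no < 1:
        raise ValueError("rounds are numbered from 1")
    return STANCES[min(round_no, 3)]


def run_rounds(review, fix, rounds: int = 3) -> list:
    """Call review(stance) each round and hand any findings to fix().

    Returns a list of (round number, finding count) pairs.
    """
    history = []
    for round_no in range(1, rounds + 1):
        findings = review(stance_for(round_no))
        history.append((round_no, len(findings)))
        if findings:
            fix(findings)
    return history
```

The point of the structure is that a later round is not a rerun of the first one: the reviewer sees already-fixed code under a stricter brief.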

Codex runs in a read-only sandbox with --ask-for-approval never. Findings come back as path:line: severity: problem. fix. — Claude parses them, posts inline review comments on the PR, applies fixes, pushes a follow-up commit, and resolves each thread.
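A line-oriented format like this is easy to parse mechanically. The sketch below shows one way to do it. The regular expression and the `Finding` type are our own illustration, and treating a literal `Fix:` as the boundary between problem and fix is an assumption taken from the example finding in Section 01.

```python
import re
from dataclasses import dataclass
from typing import Optional

# path:line: severity: free text (the problem, then the suggested fix)
FINDING_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):\s*(?P<severity>\w+):\s*(?P<body>.+)$"
)


@dataclass
class Finding:
    path: str
    line: int
    severity: str
    problem: str
    fix: str


def parse_finding(text: str) -> Optional[Finding]:
    """Return a Finding, or None when the line is not in the expected shape."""
    match = FINDING_RE.match(text.strip())
    if match is None:
        return None
    # Assumption: the fix is introduced by a literal "Fix:" marker.
    problem, _, fix = match["body"].partition("Fix:")
    return Finding(
        path=match["path"],
        line=int(match["line"]),
        severity=match["severity"],
        problem=" ".join(problem.split()),   # collapse runs of whitespace
        fix=" ".join(fix.split()),
    )
```

Returning `None` for lines that do not match lets the caller skip free-form commentary from the reviewer instead of failing the whole round.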

End-to-end automated cross-AI peer review. We've never seen Claude and Codex both miss the same race condition.

Different models have different blind spots. The cheapest review you can buy is a second opinion from an AI that wasn't in the room when the code was written.

05 — Rules as Law: Standards the AI Cannot Ignore

Most coding-standards documents read like polite suggestions. Spartan's rules read like a law book:

# rules/backend-micronaut/

FORBIDDEN:
  - The force unwrap operator !! — no exceptions
  - NEVER use workarounds. Always fix root cause.
  - NEVER call a repository method inside .map {}, .forEach {}, or .filter {} loops
  - All mutations use @Post. Never use @Put, @Delete, or @Patch.

REQUIRED:
  - Root-cause fixes only
  - Architecture consistency across sessions
  - Enforced API conventions
  - Explicit error handling with typed exceptions

These are loaded into every session automatically. When the AI generates code, it refuses to violate them. When it reviews code, it cites them by name.

Your .spartan/config.yaml decides which rules apply:

stack: kotlin-micronaut
rules:
  backend:
    - rules/backend-micronaut/KOTLIN.md
    - rules/backend-micronaut/API_DESIGN.md
    - rules/custom/OUR_AUTH_RULES.md

Turn taste into tooling. Turn architecture into rules. Turn review comments into reusable constraints.
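Some of the FORBIDDEN rules above are simple enough to back with a mechanical check as well. The sketch below is not how Spartan enforces rules (those are loaded into the session as text). It is a hypothetical, deliberately naive scanner for two of them, reporting in the same `path:line: severity: problem. fix.` shape the review step uses.

```python
import re

# Hypothetical checks for two FORBIDDEN rules: (pattern, problem, fix).
CHECKS = [
    (re.compile(r"!!"), "force unwrap operator !!",
     "handle the null case explicitly"),
    (re.compile(r"@(Put|Delete|Patch)\b"), "mutation not using @Post",
     "use @Post"),
]


def scan(path: str, source: str):
    """Yield one finding per rule violation in a Kotlin source string.

    Naive on purpose: it drops everything after // on a line and does
    not understand string literals or block comments.
    """
    for line_no, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]
        for pattern, problem, fix in CHECKS:
            if pattern.search(code):
                yield f"{path}:{line_no}: bug: {problem}. Fix: {fix}."
```

A check like this will not catch the repository-call-in-a-loop rule, which needs real syntax awareness. That is the kind of gap the prompt-level rule and the cross-AI review are there to cover.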

06 — The Three-Layer Memory Architecture

"AI assistants forget everything between sessions" is a well-known problem. Load full history → burn thousands of tokens. Load nothing → lose every decision you've ever made.

Spartan splits the difference with a three-layer architecture:

Layer	When loaded	Purpose
Index	Every turn (always in context)	Quick map of what we know. Max ~150 chars per entry.
Topics	On demand, when relevant	Full ADRs, patterns, gotchas. No hard size limit.
Transcripts	Never. Grep-only.	Append-only session archive. Searchable, never pollutes context.
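The three layers can be sketched as a small data structure. This is an illustration of the idea, not the toolkit's storage format: the `Memory` class, its method names, and the rule that a topic counts as relevant when its key appears in the prompt are all assumptions made for the example.

```python
from dataclasses import dataclass, field

MAX_INDEX_CHARS = 150  # from the table: roughly 150 characters per entry


@dataclass
class Memory:
    index: dict = field(default_factory=dict)        # always in context
    topics: dict = field(default_factory=dict)       # loaded on demand
    transcripts: list = field(default_factory=list)  # append-only, grep-only

    def remember(self, key: str, summary: str, detail: str) -> None:
        if len(summary) > MAX_INDEX_CHARS:
            raise ValueError("index entry too long; put detail in the topic")
        self.index[key] = summary
        self.topics[key] = detail

    def log(self, line: str) -> None:
        self.transcripts.append(line)

    def context_for(self, prompt: str) -> str:
        """Whole index, plus any topic whose key appears in the prompt."""
        parts = [f"- {key}: {summary}" for key, summary in self.index.items()]
        parts += [self.topics[key] for key in self.index if key in prompt.lower()]
        return "\n".join(parts)

    def grep(self, needle: str) -> list:
        """Search the archive without ever adding it to the context."""
        return [line for line in self.transcripts if needle in line]
```

The property that matters is that `context_for` never reads `transcripts`: the archive can grow without bound while the per-turn cost stays tied to the size of the index.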

Instead of "Sure, let's try that architecture again" — the AI responds: "We rejected that approach in March because of scaling issues." That changes everything.

07 — Parallel Builds With Git Worktrees

/spartan:build doesn't just build a feature. It builds a feature in its own git worktree, on its own branch, with its own PR.

# Terminal 1                          # Terminal 2
$ /spartan:build "auth feature"       $ /spartan:build "payments feature"
   → .worktrees/auth/                    → .worktrees/payments/
   → feature/auth                        → feature/payments
   → PR #142                             → PR #143

Two terminals, two AI assistants, two features in parallel. Zero merge conflicts. Zero context bleed. We routinely run 3–4 parallel builds when prepping a release.
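The isolation comes from plain git. A minimal sketch of the underlying step, assuming the `.worktrees/<name>` and `feature/<name>` layout shown above. The helper names are hypothetical and the code only builds the command, so it can be inspected before anything touches the repository.

```python
def worktree_command(name: str, root: str = ".worktrees") -> list:
    """Build the git command for one isolated build; name is a short slug."""
    if not name or "/" in name or " " in name:
        raise ValueError(f"not a usable slug: {name!r}")
    # git worktree add -b <new-branch> <path>
    return ["git", "worktree", "add", "-b", f"feature/{name}", f"{root}/{name}"]


def plan(names: list) -> list:
    """One command per feature; duplicate names would collide on disk."""
    if len(set(names)) != len(names):
        raise ValueError("duplicate feature name")
    return [worktree_command(name) for name in names]
```

Each command can then be run with `subprocess.run(cmd, check=True)` from the repository root. Every worktree has its own working directory and checked-out branch, which is what keeps one agent's edits out of the other's files.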

08 — Anti-AI-Generic Design: The Eight-Phase UX Workflow

Frontend has its own failure mode: purple gradients, generic Inter font, bg-blue-500 everywhere, Lorem ipsum in production.

/spartan:ux wraps an eight-phase pipeline: Research → Define → Ideate → Design Tokens → Prototype + Gate → Test → Handoff → QA

🚫 FORBIDDEN in design: Tailwind defaults (bg-blue-500) when tokens exist · Generic fonts (Inter, Roboto) · Purple gradients on white · "Lorem ipsum" · "Unlock your potential"

✅ REQUIRED in design: Read tokens BEFORE any UI code · Design all states: default, loading, empty, error, success · Mobile 375px, tablet 768px, desktop 1440px · WCAG AA contrast
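Of the REQUIRED items, WCAG AA contrast is the one that reduces to arithmetic. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, with the AA thresholds of 4.5:1 for normal text and 3:1 for large text. The function names are ours, and how the toolkit itself checks contrast is not something this example claims to show.

```python
def _linear(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb colour."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg: str, bg: str, large_text: bool = False) -> bool:
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

Black on white comes out at 21:1, the maximum. On a white background, `#767676` passes AA for normal text and the barely lighter `#777777` does not, which is why this belongs in a check and not in a reviewer's eyeballs.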

The toolkit ships with: 67 design styles, 96 colour palettes, 57 font pairings, 25 chart types, 99 UX guidelines — over 700 design data points across 29 files.

09 — From Live App to PRD in One Command

The most interesting AI tools aren't the models themselves anymore — they're what happens when you wire multiple agents together with the right plumbing in between.

/spartan:web-to-prd is the kind of command that only exists because AI agents and MCPs can be glued together. Point it at a URL. Behind the scenes:

Playwright MCP crawls the live web app — every page, every modal, every state.
Claude extracts features, user flows, and information architecture from the screenshots and DOM.
The output is a structured PRD with epics, user stories, priorities, and dependencies.
Notion MCP exports the whole thing into your workspace as a multi-page document.

In v2 we cut the pipeline from 80+ AI calls down to 3–5 by letting Playwright do the structured DOM extraction and reserving Claude for the high-level synthesis.

/spartan:web-to-prd "https://competitor.com"

Twelve minutes. One competitor app in. One forty-page PRD on Notion out.

This is the workflow that turned us into MCP believers. The slash command itself is small. The intelligence isn't in the prompt — it's in the orchestration of four tools (Playwright + Claude + Notion + Spartan's prompt scaffolding) so that each one does what it's individually best at.

The next leap in AI engineering isn't a smarter model. It's smarter plumbing between the ones we already have.

10 — Five Lessons From Building With Coding Agents

1 — Delete More Than You Build

In v1.24 we deleted our entire project-management pack. Nobody used it.

Tooling that tries to be a methodology gets in the way. Tooling that gives engineers superpowers along the path they were already walking sticks.

2 — Two AI Models Agreeing Is Signal, Not a Mandate

AI models recommend. Users decide. This overrides all other rules. Two AI models agreeing is signal, not mandate.

3 — Skills Should Fix Real Failure Modes

kotlin-best-practices — Claude defaulted to !! for null handling, every single time
database-table-creator — SQL → ORM → Entity → Repository sync went wrong in subtle ways
js-security-audit — npm supply-chain attacks need a checklist, not vibes
ui-ux-pro-max — "make a dashboard" used to mean "give me purple gradients"

If the AI already does X well, you don't need a skill for X. Skills are for the gap between what the AI does and what production needs.

4 — Gates Beat Post-Hoc Review

Spec → Gate 1 → Plan → Gate 2 → Design → Design Gate → Build → Gate 3 → Code Review → Gate 3.5 → PR → Gate 4 → Merge

By the time a PR opens, the work has survived five reviews. None of these gates exist because we love process — they exist because every gate we don't enforce gets paid for at production prices.
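The sequence above behaves like a pipeline that stops at the first failed gate. A minimal sketch, assuming each gate can be modelled as a callable returning True or False. The stage and gate names come from the line above, and everything else is hypothetical.

```python
# Each stage is followed by the gate that guards the next one.
STAGES = ["Spec", "Plan", "Design", "Build", "Code Review", "PR"]
GATES = ["Gate 1", "Gate 2", "Design Gate", "Gate 3", "Gate 3.5", "Gate 4"]


def run_pipeline(checks: dict) -> tuple:
    """Walk the stages in order and stop at the first gate that fails.

    checks maps a gate name to a zero-argument callable returning bool.
    Returns (stage reached, blocking gate), with None when nothing blocked.
    """
    for stage, gate in zip(STAGES, GATES):
        if not checks[gate]():
            return stage, gate
    return "Merge", None
```

Because the loop returns on the first failure, later gates are never evaluated against work that has not earned them, and the cost of a bad spec is paid at Gate 1 and not at merge time.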

5 — Configurability Beats Opinions

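The rule files are configurable, the stack profile is yours to pick or replace with a custom one, and .spartan/config.yaml decides which rules apply. That is the difference between a toolkit and a framework. We aimed for toolkit.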

11 — Try It

npx @c0x12c/ai-toolkit@latest --local

Command	What it does
`/spartan:onboard`	Map this codebase. Set up rules. ~30 minutes.
`/spartan:build "feature"`	End-to-end: spec → plan → TDD → review → PR
`/spartan:ship-pr-codex`	Cross-AI peer review on your branch
`/spartan:ux prototype`	8-phase design workflow, dual-agent gate
`/spartan:debug "bug"`	Reproduce → investigate → fix → review → PR

GitHub: github.com/c0x12c/ai-toolkit

Final

Claude  = builder
Codex   = reviewer
Rules   = law
Memory  = institutional knowledge
Gates   = guardrails
Skills  = sharp tools for specific gaps
Humans  = product, taste, the final call

The quality of agentic software development depends less on one perfect prompt and more on the quality of the environment the agent operates inside.

If you're shipping production code with AI agents today, you've already hit at least three of the five problems Spartan solves. You don't need to solve them with Spartan specifically. But you do need to solve them.

We picked one way. We're sharing it. If it saves you the PR we almost merged, that's enough.
