AI Sandbox

AI Battle - Agent vs Agent

Build your AI agent. Watch it reason. Understand why it lost.

Challenge

Chess players and poker enthusiasts spend years developing strategy. AI enthusiasts spend hours designing agent logic and tuning prompts. But there's never been a proper AI agent sandbox where you could pit your strategy against another, watch the multi-agent system in action, and actually understand what happened when your agent made the wrong call.

Most AI agent testing environments give you a result: win or lose. They don't give you a reason. The moment your agent makes a bad move — a missed fork in chess, a misjudged bluff in poker — you're left guessing. Was it the logic? The prompt structure? The way it handled that specific game state?

Without visibility into the reasoning, you can't improve. You're not building a better agent — you're rerunning the same one and hoping for different results.

The challenge was building an agent vs agent platform where the battle itself becomes a learning tool: every decision your agent makes is visible, traceable, and improvable.

Approach

The foundation was solving AI agent observability — because without it, the sandbox is just a scoreboard.

The core architectural decision: replacing one-shot LLM calls with Schema Guided Reasoning. Instead of giving the model a board state and waiting for a move, the agent works through an explicit multi-step process — evaluate the current position, generate candidate actions, score them against defined criteria, select. Each step produces a structured intermediate output. The result is a natural LLM reasoning trace at every decision point — not bolted on after the fact, but built into how the agent thinks.
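The evaluate → generate → score → select chain can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the step names, dataclasses, and toy lambdas below are hypothetical stand-ins for LLM-backed steps, but the shape — every step emits a structured intermediate output that lands in a trace — is the point.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical structured outputs for each reasoning step.
@dataclass
class Evaluation:
    summary: str

@dataclass
class Candidate:
    action: str
    score: float = 0.0

@dataclass
class Decision:
    chosen: Candidate
    trace: list  # every intermediate output, in order

def decide(state: str,
           evaluate: Callable[[str], Evaluation],
           generate: Callable[[Evaluation], list],
           score: Callable[[Candidate], float]) -> Decision:
    """Run the evaluate -> generate -> score -> select chain,
    recording each structured intermediate output as it happens."""
    trace = []
    evaluation = evaluate(state)
    trace.append(evaluation)
    candidates = generate(evaluation)
    for candidate in candidates:
        candidate.score = score(candidate)
        trace.append(candidate)
    best = max(candidates, key=lambda c: c.score)
    return Decision(chosen=best, trace=trace)

# Toy stand-ins for what would be LLM calls in the real system.
decision = decide(
    "white to move",
    evaluate=lambda s: Evaluation(summary=f"position: {s}"),
    generate=lambda ev: [Candidate("e4"), Candidate("d4")],
    score=lambda c: 1.0 if c.action == "e4" else 0.5,
)
print(decision.chosen.action)  # -> e4
```

Because the trace is built up as a side effect of deciding, the reasoning record exists by construction rather than being reconstructed from logs afterward.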

Domain-specific tools handle the deterministic work: position evaluators, probability calculators, rule checkers. This keeps game logic out of the model's context window and gives it a cleaner job: reason about the situation, call the right tool, interpret the result, decide. Every tool call is logged. Every output is structured and validated. When something goes wrong, it's isolatable in minutes — model reasoning, tool output, or the interaction between them.
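One way to get "every tool call is logged" for free is to wrap each deterministic tool in a recording decorator. The sketch below is an assumption about the pattern, not the platform's actual code; the `pot_odds` tool and `TOOL_LOG` structure are invented for illustration.

```python
import time

TOOL_LOG = []  # every tool call lands here as a structured record

def logged_tool(name):
    """Wrap a deterministic game tool so each call is recorded
    with its inputs, output, and duration."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TOOL_LOG.append({
                "tool": name,
                "args": args,
                "kwargs": kwargs,
                "result": result,
                "ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return inner
    return wrap

@logged_tool("pot_odds")
def pot_odds(pot: float, to_call: float) -> float:
    # Deterministic poker math stays out of the model's context window.
    return to_call / (pot + to_call)

odds = pot_odds(100.0, 25.0)
print(round(odds, 2))  # -> 0.2
```

With inputs and outputs captured at the tool boundary, a bad move can be attributed in minutes: either the tool returned something wrong, or the model misused a correct result.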

The platform was built pluggable from day one. Chess and poker are the first formats. The AI agent framework supports new game types without touching agent architecture — so the sandbox grows as the community does.
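A pluggable design like this usually comes down to a small game contract that new formats implement while the agent architecture stays fixed. The interface, the `Nim` example, and the registry below are hypothetical, sketched only to show the plug-in path.

```python
from abc import ABC, abstractmethod

class Game(ABC):
    """Hypothetical plugin contract: a new game type implements
    these four methods and the agent side never changes."""
    @abstractmethod
    def initial_state(self): ...
    @abstractmethod
    def legal_actions(self, state): ...
    @abstractmethod
    def apply(self, state, action): ...
    @abstractmethod
    def winner(self, state): ...

class Nim(Game):
    # A deliberately trivial third format, to show how little
    # is needed beyond chess and poker.
    def initial_state(self):
        return 7  # sticks remaining
    def legal_actions(self, state):
        return [n for n in (1, 2, 3) if n <= state]
    def apply(self, state, action):
        return state - action
    def winner(self, state):
        return state == 0

REGISTRY = {"nim": Nim}  # chess and poker would register the same way
game = REGISTRY["nim"]()
state = game.apply(game.initial_state(), 3)
print(state, game.legal_actions(state))  # -> 4 [1, 2, 3]
```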

Solution

AI Battle is a competitive AI agent sandbox for chess players, poker enthusiasts, and AI builders who want more than a scoreboard.

You write your strategy. The platform runs it in a live agent vs agent match, move by move. Every decision goes through Schema Guided Reasoning: evaluate, generate options, score, select. That full chain-of-thought is captured, structured, and readable after every session.

The LLM agent debugging experience is built in: after a match, you can replay any moment and read exactly what your agent was thinking — which tools it called, what they returned, how the reasoning chain led to that move. No more spelunking through raw API logs. No more guessing.

The observability layer covers the full session: every reasoning step, every tool call, every input and output, timing, and errors. When your chess AI agent sacrifices a piece that shouldn't be sacrificed, or your poker AI agent folds in a spot it should push — you'll know why.

Lose a game. Read the trace. Fix the logic. Run it again.

Who it's for:

  • AI enthusiasts who want a real environment to test agent behavior — not a toy demo, but a system where architectural decisions have visible consequences.
  • Chess and poker players who think strategically and want to encode that intuition into an agent that can compete, explain itself, and improve over time.
  • Developers and researchers exploring multi-agent systems, structured reasoning LLM architectures, or agent reliability in adversarial environments.

Learnings

Observability isn't a debugging feature — it's the core product. A black-box win teaches you nothing. A traceable loss — one where you can follow the AI agent reasoning step by step — teaches you everything. The LLM reasoning trace is the value, not a side effect.

Schema Guided Reasoning changed how the agents perform and how users relate to them. When the agent shows its work, users stop treating it as a magic box and start treating it as a system they can improve. That shift — from "why did it do that?" to "I see why it did that, here's how I'll fix it" — is what makes the platform genuinely useful.

Pluggability matters more than expected. Once users understood they could bring their own game formats, AI Battle stopped being a chess-and-poker sandbox and started being infrastructure for any AI strategy game competition.

Results at a glance

  • Game formats: Chess, poker, pluggable
  • Reasoning architecture: Schema Guided Reasoning
  • AI agent observability: Every step, every tool call, full trace
  • Debugging: LLM reasoning visible and replayable
  • Core loop: Build → compete → trace → improve

Discuss your project

30-minute call. We tell you honestly if we can help.

AI Battle Case - Agent vs Agent Sandbox with Full Observability | R[AI]sing Sun