Benchmark skill effectiveness with controlled variables

skill-arena · skill · setup · L20
lythos-labs/lythoskill
What it does

Benchmark AI agent skills with controlled-variable A/B testing

Best for

A/B testing AI agent skills or deck configurations with controlled variables, native judge scoring, and parallel isolated execution.

Inputs
  • Deck TOML file (skill definitions)
  • Brief task description or prompt
  • Optional explicit player (kimi/codex/claude)
Outputs
  • Benchmark report with comparative scores
  • Artifact outputs from each side
  • Judge score and reasoning
Requires
  • npm (Node.js) or Bun
  • Firecrawl or another web scraper (optional; only for fetch-based tests)
Preconditions
  • Deck TOML file with valid skill definitions
  • Arena.toml with side definitions (vs mode only)
Failure modes
  • Invalid TOML syntax causes a parse failure
  • Cross-player mode requires the agent binary for each player to be installed
  • The run times out if the task exceeds the --timeout value
Trust signals
  • Agent-orchestrated by default (parallel subagents); cross-player mode runs via Bun.spawn
  • Isolated workdirs with deck link + skill preparation
  • Native judge scoring (not an external service)
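
To illustrate the general idea behind isolated workdirs and parallel execution, here is a minimal Python sketch. It is not skill-arena's implementation, which, per the list above, uses subagents or Bun.spawn. The function names `run_side` and `run_sides`, the `deck.toml` link name, and the result fields are all hypothetical. Each side gets its own temporary directory with the deck symlinked in, the sides run concurrently, and a per-side timeout is enforced.

```python
import concurrent.futures
import os
import subprocess
import tempfile


def run_side(name: str, deck_path: str, cmd: list[str], timeout: float) -> dict:
    """Run one side in a fresh temp directory with its deck symlinked in."""
    workdir = tempfile.mkdtemp(prefix=f"arena-{name}-")
    # "deck.toml" is a hypothetical link name chosen for this sketch.
    os.symlink(os.path.abspath(deck_path), os.path.join(workdir, "deck.toml"))
    try:
        proc = subprocess.run(
            cmd, cwd=workdir, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"side": name, "workdir": workdir, "timed_out": True, "stdout": ""}
    return {
        "side": name,
        "workdir": workdir,
        "timed_out": False,
        "stdout": proc.stdout.strip(),
    }


def run_sides(deck_paths: dict[str, str], cmd: list[str], timeout: float) -> list[dict]:
    """Run every side concurrently; results keep the order of `deck_paths`."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(run_side, name, path, cmd, timeout)
            for name, path in deck_paths.items()
        ]
        return [future.result() for future in futures]
```

Calling `run_sides({"a": "deck-a.toml", "b": "deck-b.toml"}, cmd, timeout=60)` with the same `cmd` for both sides keeps the command constant and varies only the deck, which is the controlled-variable setup the benchmark describes.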