Benchmark skill effectiveness with controlled variables
skill-arena · skill · setup · L2 · ★0
lythos-labs/lythoskill
What it does
Benchmark AI agent skills with controlled-variable A/B testing
Best for
A/B testing AI agent skills or deck configurations with controlled variables, native judge scoring, and parallel isolated execution.
Inputs
- Deck TOML file (skill definitions)
- Brief task description or prompt
- Optional explicit player (kimi, codex, or claude)
Outputs
- Benchmark report with comparative scores
- Artifact outputs from each side
- Judge score and reasoning
Requires
- npm or Bun (Node.js or Bun runtime)
- Firecrawl or another web scraper (optional; only for fetch-based tests)
Preconditions
- Deck TOML file with valid skill definitions
- Arena.toml with side definitions (required for vs mode)
Failure modes
- Invalid TOML syntax causes a parse failure
- Cross-player mode fails unless the agent binary for each player is installed
- The run times out if the task exceeds the --timeout value
Trust signals
- Agent-orchestrated by default (parallel subagents); cross-player mode runs via Bun.spawn
- Isolated workdirs with deck link + skill preparation
- Native judge scoring (no external service)
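As an illustration of what parallel, isolated execution can look like, here is a minimal Python sketch. It is not the tool's implementation (which, per the above, uses subagents or Bun.spawn); `run_side`, `run_sides`, and the `deck.toml` link name are assumptions made for the example. Each side gets its own temporary working directory containing a symlink to its deck, and all sides run concurrently.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_side(name, deck_path, task):
    """Run one side in a fresh directory and return (name, result)."""
    with tempfile.TemporaryDirectory(prefix=f"arena-{name}-") as workdir:
        work = Path(workdir)
        # Assumed stand-in for the "deck link" step: expose the deck
        # inside the workdir without copying it.
        (work / "deck.toml").symlink_to(Path(deck_path).resolve())
        return name, task(work)


def run_sides(sides, task):
    """sides maps side name -> deck path; all sides run concurrently.

    task is any callable that takes the side's workdir (a Path).
    """
    with ThreadPoolExecutor(max_workers=len(sides)) as pool:
        futures = [pool.submit(run_side, n, d, task) for n, d in sides.items()]
        return dict(f.result() for f in futures)
```

Because each side writes only inside its own directory, artifacts from one side cannot leak into the other, which is what keeps the deck the only variable that differs between sides.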