Benchmark coding agents
agent-eval · skill · setup · L3 · ★0
Sheshiyer/skill-clusters ↗
What it does
Measure agent output quality and safety
Best for
Measuring multi-agent system output quality, safety alignment, and hallucination rates.
Outputs
- · prioritized violation report
- · code fixes or worklist
- · execution log or transcript
Requires
- · GitHub API
- · AST-Grep
- · Git
Preconditions
The agent runtime must be initialized with a message queue; a model service is optional.
Failure modes
- · Agent process hangs or enters an infinite loop
- · Communication channel deadlock
- · Memory or CPU exhaustion
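One way a benchmark harness might guard against the first failure mode is to run each agent as a child process with a wall-clock limit. The sketch below is only an illustration under stated assumptions: the helper name `run_with_timeout` and the stand-in commands are hypothetical and are not part of this skill. It uses only Python's standard `subprocess` module and does not address channel deadlocks or memory limits.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and classify the result.

    Returns (status, stdout), where status is "ok", "failed", or "timeout".
    Hypothetical helper for illustration only.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so a hung or looping agent does not outlive the benchmark run.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


if __name__ == "__main__":
    # Hypothetical stand-in for an agent stuck in an infinite loop.
    hung_agent = [sys.executable, "-c", "while True: pass"]
    status, _ = run_with_timeout(hung_agent, timeout_s=1.0)
    print(status)
```

A harness built this way could record the `timeout` status in its execution log rather than blocking the remaining benchmark cases.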
Trust signals
- · Includes regression test safety gates
- · Leverages LSP and AST-based code analysis
- · Optimizes for memory and CPU efficiency