cyberneticlibrary

Benchmark coding agents

agent-eval · skill · setup · L3 · ★0

Sheshiyer/skill-clusters

What it does

Measure agent output quality and safety

Best for

Measuring multi-agent system output quality, safety alignment, and hallucination rates.

Outputs

· prioritized violation report
· code fixes or worklist
· execution log or transcript
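
As a rough illustration of what "prioritized" could mean for the violation report, the sketch below sorts findings by severity and then by location. The `Violation` fields, the severity names, and the `prioritize` helper are all hypothetical; the skill's actual report format is not documented here.

```python
from dataclasses import dataclass

# Hypothetical severity scale; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Violation:
    rule: str       # hypothetical rule identifier
    severity: str   # one of the SEVERITY_RANK keys
    location: str   # e.g. "path/to/file.py:12"


def prioritize(violations):
    """Return violations ordered by severity, then by location.

    Unknown severities sort after all known ones.
    """
    unknown = len(SEVERITY_RANK)
    return sorted(
        violations,
        key=lambda v: (SEVERITY_RANK.get(v.severity, unknown), v.location),
    )


findings = [
    Violation("unused-import", "low", "a.py:3"),
    Violation("unsafe-shell-call", "critical", "b.py:40"),
    Violation("missing-timeout", "high", "a.py:17"),
]
report = prioritize(findings)
```

Sorting on a `(rank, location)` tuple keeps the ordering stable and easy to extend, for example by adding a confidence score as a third key.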

Requires

· GitHub API
· AST-Grep
· Git

Preconditions

Agent runtime initialized with a message queue and, optionally, a model service

Failure modes

· Agent process hangs or enters an infinite loop
· Communication channel deadlock
· Memory or CPU exhaustion
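
One common way to guard against the hang and infinite-loop cases is to run each agent invocation as a child process with a wall-clock timeout. The sketch below is a minimal example using Python's standard `subprocess` module; the `run_with_timeout` helper and the stand-in commands are assumptions for illustration, not part of this skill. It does not address memory or CPU limits.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process with a wall-clock limit.

    Returns (status, returncode), where status is "ok", "failed",
    or "timeout". On timeout the child is killed by subprocess.run.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.returncode)


# Stand-in commands (hypothetical): one finishes, one loops forever.
quick = [sys.executable, "-c", "print('done')"]
stuck = [sys.executable, "-c", "while True: pass"]

quick_result = run_with_timeout(quick, timeout_s=30)
stuck_result = run_with_timeout(stuck, timeout_s=1)
```

A benchmark harness could record the `"timeout"` status as a failed run instead of blocking the rest of the evaluation.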

Trust signals

· Includes regression test safety gates
· Leverages LSP and AST-based code analysis
· Optimizes for memory and CPU efficiency