cyberneticlibrary

Benchmark coding agents

agent-eval · skill · setup · L30
Sheshiyer/skill-clusters
What it does

Measure agent output quality and safety

Best for

Measuring multi-agent system output quality, safety alignment, and hallucination rates.
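A minimal sketch of how per-agent quality, safety, and hallucination rates might be aggregated from labeled evaluation records. The `EvalRecord` fields and the `summarize` helper are hypothetical names chosen for illustration; the skill's actual schema is not documented here.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical per-response labels, assumed to come from a human or automated grader.
    agent: str
    passed_quality: bool
    safety_violation: bool
    hallucinated: bool


def summarize(records):
    """Return per-agent rates in [0, 1] for each labeled dimension."""
    totals = {}
    for r in records:
        t = totals.setdefault(
            r.agent, {"n": 0, "quality": 0, "safety": 0, "hallucination": 0}
        )
        t["n"] += 1
        # Booleans count as 0 or 1 when added to an int.
        t["quality"] += r.passed_quality
        t["safety"] += r.safety_violation
        t["hallucination"] += r.hallucinated
    return {
        agent: {
            "quality_pass_rate": t["quality"] / t["n"],
            "safety_violation_rate": t["safety"] / t["n"],
            "hallucination_rate": t["hallucination"] / t["n"],
        }
        for agent, t in totals.items()
    }
```

Reporting rates per agent rather than one global figure makes it easier to see which agent in a multi-agent system is responsible for a drop.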

Outputs
  • prioritized violation report
  • code fixes or worklist
  • execution log or transcript
Requires
  • GitHub API
  • AST-Grep
  • Git
Preconditions

Agent runtime initialized with a message queue and an optional model service

Failure modes
  • Agent process hangs or infinite loop
  • Communication channel deadlock
  • Memory or CPU exhaustion
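One common guard against the first failure mode is to run each agent invocation as a child process with a wall-clock timeout. The sketch below uses only the Python standard library; `run_with_timeout` and its status strings are hypothetical, and it is not taken from the skill itself. It does not address channel deadlocks or memory limits.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, stdout); status is 'ok', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


if __name__ == "__main__":
    # Stand-in for an agent that never finishes.
    hung_agent = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(hung_agent, timeout_s=1))
```

Recording the `timeout` status in the execution log lets a hung agent count as a benchmark failure instead of stalling the whole run.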
Trust signals
  • Includes regression test safety gates
  • Leverages LSP and AST-based code analysis
  • Optimizes for memory and CPU efficiency
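A regression safety gate can be as simple as comparing current benchmark scores against a stored baseline and blocking when any metric drops too far. The following is a sketch under that assumption; `regression_gate`, the metric names, and the default tolerance are illustrative and not taken from the skill.

```python
def regression_gate(baseline, current, tolerance=0.02):
    """Return the names of metrics that regressed beyond tolerance.

    Both arguments map metric name -> score, where higher is better.
    A metric missing from `current` is treated as a regression.
    """
    failures = []
    for name, base_score in baseline.items():
        cur_score = current.get(name)
        if cur_score is None or cur_score < base_score - tolerance:
            failures.append(name)
    return failures


if __name__ == "__main__":
    baseline = {"quality": 0.90, "safety": 0.99}
    current = {"quality": 0.85, "safety": 0.99}
    failed = regression_gate(baseline, current)
    if failed:
        raise SystemExit(f"regression gate failed for: {', '.join(failed)}")
```

An empty result means the gate passes; a nonzero exit code lets a CI job block the change.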