cyberneticlibrary

Test and benchmark LLM agents

agent-evaluation · skill · setup · L3 · ★0

Sheshiyer/skill-clusters ↗

What it does

Evaluate agent reasoning and outputs

Best for

Measuring multi-agent system output quality, safety alignment, and hallucination rates.

Inputs

· file, module, or project scope

Outputs

· prioritized violation report
· code fixes or worklist
· execution log or transcript

Requires

· AST-Grep

Preconditions

Agent runtime initialized with a message queue and, optionally, a model service

Failure modes

· Agent process hangs or enters an infinite loop
· Communication channel deadlock
· Memory or CPU exhaustion
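One way to contain the hang and deadlock cases above is to run each agent invocation as a child process with a wall-clock limit. The sketch below is an illustration only, not part of the skill: `run_agent_with_timeout` and its return shape are hypothetical names, and it assumes the agent can be launched as a command line. It does not cap memory or CPU use.

```python
import subprocess
import sys


def run_agent_with_timeout(cmd, timeout_s):
    """Run one agent invocation as a child process with a wall-clock limit.

    Hypothetical helper: `cmd` is whatever command starts the agent.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a hung or
        # deadlocked agent does not outlive the evaluation run.
        return {"status": "timeout", "returncode": None, "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


if __name__ == "__main__":
    # Stand-in for a real agent command: a short Python one-liner.
    result = run_agent_with_timeout([sys.executable, "-c", "print('done')"], 10)
    print(result["status"])
```

Recording the `timeout` status in the execution log keeps hung runs visible in the results instead of silently dropping them.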

Trust signals

· Uses statistical significance testing with a pre-committed sample size
· Includes regression test safety gates
· Leverages LSP and AST-based code analysis
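To make the first trust signal concrete, here is a minimal sketch of fixing a sample size before an evaluation starts. It assumes the metric is a pass rate compared between two agent versions with a two-sided two-proportion z-test; the listing does not say which test the skill uses, and `samples_per_arm` is a hypothetical name.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_baseline, p_candidate, alpha=0.05, power=0.8):
    """Approximate test cases needed per agent version to detect the gap
    between two pass rates (normal approximation, unpooled variance)."""
    if p_baseline == p_candidate:
        raise ValueError("pass rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    effect = p_baseline - p_candidate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Example: baseline passes 80% of cases, candidate is hoped to pass 90%.
    print(samples_per_arm(0.80, 0.90))
```

Committing to this number up front avoids stopping a benchmark early the moment a difference happens to look significant.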