Test and benchmark LLM agents
agent-evaluation · skill · setup · L3 · ★0
Sheshiyer/skill-clusters
What it does
Evaluate agent reasoning and outputs
Best for
Measuring multi-agent system output quality, safety alignment, and hallucination rates.
Inputs
- · file, module, or project scope
Outputs
- · prioritized violation report
- · code fixes or worklist
- · execution log or transcript
Requires
- · AST-Grep
Preconditions
Agent runtime initialized with a message queue and, optionally, a model service
Failure modes
- · Agent process hangs or enters an infinite loop
- · Communication channel deadlock
- · Memory or CPU exhaustion
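One way to contain the failure modes above during an evaluation run is to drive the agent under an explicit step budget and wall-clock deadline. The sketch below is illustrative only: `run_with_budget`, `RunResult`, and the `step` callable are hypothetical names, not part of this skill, and the deadline is checked only between steps, so a single step that blocks forever would still need a process-level timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RunResult:
    status: str            # "completed", "step_budget_exceeded", or "timed_out"
    steps: int             # number of steps actually executed
    transcript: List[str]  # messages collected for the execution log


def run_with_budget(
    step: Callable[[int], Tuple[str, bool]],
    max_steps: int = 50,
    timeout_s: float = 30.0,
) -> RunResult:
    """Drive a hypothetical agent step function under step and time budgets.

    `step(i)` is assumed to return (message, done) for step index i.
    """
    deadline = time.monotonic() + timeout_s
    transcript: List[str] = []
    for i in range(max_steps):
        # Deadline is checked between steps; it cannot interrupt a blocked step.
        if time.monotonic() > deadline:
            return RunResult("timed_out", i, transcript)
        message, done = step(i)
        transcript.append(message)
        if done:
            return RunResult("completed", i + 1, transcript)
    # The agent never signalled completion: treat it as a probable loop.
    return RunResult("step_budget_exceeded", max_steps, transcript)
```

A run that ends with `step_budget_exceeded` or `timed_out` can then be recorded in the execution log as a failure rather than left to exhaust memory or CPU.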
Trust signals
- · Uses statistical significance testing with a pre-committed sample size
- · Includes regression test safety gates
- · Leverages LSP and AST-based code analysis
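As a rough illustration of the first two trust signals, the sketch below compares failure counts (for example, hallucinated answers) between a baseline and a candidate agent using a two-proportion z-test, and refuses to run unless both arms used the sample size fixed in advance. The function names, the choice of test, and the `alpha` threshold are assumptions made for this example; the skill's actual method is not documented here.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    fail_a: int, n_a: int, fail_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in failure rates b - a."""
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms have identical all-pass or all-fail results.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def regression_gate(
    baseline_fail: int,
    n_baseline: int,
    candidate_fail: int,
    n_candidate: int,
    planned_n: int,
    alpha: float = 0.05,
) -> bool:
    """Return True if the candidate passes, False if it is significantly worse.

    Raises ValueError when either arm deviates from the pre-committed
    sample size, which guards against stopping early on a lucky result.
    """
    if n_baseline != planned_n or n_candidate != planned_n:
        raise ValueError("sample size differs from the pre-committed plan")
    z, p = two_proportion_z_test(
        baseline_fail, n_baseline, candidate_fail, n_candidate
    )
    return not (z > 0 and p < alpha)
```

For instance, `regression_gate(10, 200, 30, 200, planned_n=200)` blocks the candidate, because 30 failures against a baseline of 10 is a statistically significant regression at the assumed 5% level.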