
Moderate LLM outputs with LlamaGuard

llamaguard · skill · setup · L2 · ★ 9,423

Orchestra-Research/AI-Research-SKILLs

What it does

Classifies text against 6 safety categories.

Best for

Production input/output filtering where you need a specialized 7B moderation model instead of a general-purpose LLM.

Inputs

· Conversation turns
· User prompts or bot responses
· Safety context
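
A minimal sketch of how the conversation-turn input might be prepared. The role/content dictionaries are the message shape used by Hugging Face chat templates; the helper name `build_conversation` and the strict user-first alternation check are assumptions made for illustration, not part of the skill itself.

```python
from typing import Dict, List, Tuple

VALID_ROLES = ("user", "assistant")


def build_conversation(turns: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Convert (role, text) pairs into chat messages.

    Hypothetical helper: enforces that turns start with the user and
    then alternate user/assistant, and rejects empty content.
    """
    if not turns:
        raise ValueError("conversation must contain at least one turn")
    messages = []
    for index, (role, text) in enumerate(turns):
        expected = VALID_ROLES[index % 2]  # user first, then alternate
        if role != expected:
            raise ValueError(f"turn {index}: expected role {expected!r}, got {role!r}")
        if not text.strip():
            raise ValueError(f"turn {index}: empty content")
        messages.append({"role": role, "content": text.strip()})
    return messages


# A prompt plus a bot response, ready to be checked as an output filter.
chat = build_conversation([
    ("user", "How do I reset my router?"),
    ("assistant", "Hold the reset button for ten seconds."),
])
```

Passing only the user turn checks the prompt (input filtering); passing the prompt and the response checks the bot's reply (output filtering).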

Outputs

· Classification (safe/unsafe)
· Category (S1-S6)
· Confidence
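
The classification and category outputs can be pulled out of the model's reply with a small parser. This sketch assumes the raw reply is a first line reading `safe` or `unsafe`, optionally followed by a line of comma-separated category codes; the `Verdict` dataclass and `parse_verdict` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse a raw moderation reply into a structured verdict.

    Assumed format: line 1 is "safe" or "unsafe"; line 2 (only when
    unsafe) lists category codes separated by commas.
    """
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    codes = lines[1].split(",") if len(lines) > 1 else []
    return Verdict(safe=False, categories=[c.strip() for c in codes if c.strip()])
```

The confidence value is not part of this text reply. One common approach, stated here as an assumption about how this skill derives it, is to read the probability the model assigns to the first generated token (`safe` versus `unsafe`).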

Requires

· transformers
· torch
· vllm (optional)
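
A sketch of how `transformers` and `torch` might be wired together for a single check. The card does not name a checkpoint, so `meta-llama/LlamaGuard-7b` is an assumption, as are the function names `load_guard` and `moderate`. Running it needs a Hugging Face token with access to the weights and enough GPU memory.

```python
MODEL_ID = "meta-llama/LlamaGuard-7b"  # assumption: checkpoint not named in the card


def load_guard(model_id=MODEL_ID):
    """Load tokenizer and model once; reuse them across calls."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return tokenizer, model.to(device), device


def moderate(chat, tokenizer, model, device, max_new_tokens=100):
    """Return the raw text verdict for a list of role/content messages."""
    # The tokenizer's chat template wraps the conversation in the moderation prompt.
    inputs = tokenizer.apply_chat_template(
        chat, return_tensors="pt", return_dict=True
    ).to(device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=0)
    prompt_len = inputs["input_ids"].shape[-1]
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Typical use would be `tokenizer, model, device = load_guard()` once at startup, then `moderate(chat, tokenizer, model, device)` per request. For higher throughput, the optional `vllm` dependency can serve the same checkpoint instead.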

Preconditions

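Hugging Face auth token; 8 GB VRAM for the 7B model. The 8 GB figure assumes quantized weights, since 7B parameters at 16-bit precision need about 14 GB for the weights alone.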

Failure modes

· False positives (over-blocking)
· Category ambiguity (S3 vs. S4)
· Content lost when long inputs are truncated
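
One way to reduce the truncation failure mode is to split long text into overlapping windows and flag the whole text if any window is flagged. This is a sketch, not the skill's own mitigation: word counts stand in for token counts, and `is_unsafe` is a hypothetical callable that wraps the moderation model.

```python
from typing import Callable, List


def split_into_windows(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows so no content is dropped."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return windows


def any_window_unsafe(text: str, is_unsafe: Callable[[str], bool], **kwargs) -> bool:
    """Flag the text if the (hypothetical) classifier flags any window."""
    return any(is_unsafe(window) for window in split_into_windows(text, **kwargs))
```

Flagging on any single window favors recall, which can add to the over-blocking failure mode listed above, so the window size and overlap are worth tuning against real traffic.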

Trust signals

· Meta's specialized safety model
· Reported 94-95% accuracy on safety benchmarks