Moderate LLM outputs with LlamaGuard

llamaguardskillsetup L29,423
Orchestra-Research/AI-Research-SKILLs
What it does

Classify text for 6 safety categories

Best for

Production input/output filtering where you need a specialized 7B moderation model instead of general LLM.

Inputs
  • · Conversation turns
  • · User prompts or bot responses
  • · Safety context
Outputs
  • · Classification (safe/unsafe)
  • · Category (S1-S6)
  • · Confidence
Requires
  • · transformers
  • · torch
  • · vllm (optional)
Preconditions

HuggingFace auth token; 8GB VRAM for 7B model

Failure modes
  • · False positives (over-blocking)
  • · Category ambiguity (S3 vs. S4)
  • · Truncated text loss
Trust signals
  • · Meta's specialized safety model
  • · 94-95% accuracy on safety benchmarks