cyberneticlibrary

Train safer AI with constitutional methods

constitutional-ai skill setup L19,423
Orchestra-Research/AI-Research-SKILLs
What it does

Align LLMs via self-critique and AI feedback
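The self-critique step can be sketched as a loop: draft a response, critique it against each principle, then revise. This is a minimal illustration, not the skill's actual implementation. The function name `critique_and_revise`, the prompt wording, and the `generate` callable (any text-in, text-out wrapper around a model) are all assumptions made for this example.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Dict[str, object]:
    """Run one critique-and-revision pass per principle (hypothetical helper)."""
    # Draft an initial response with the unaligned model.
    initial = generate(prompt)
    response = initial
    critiques: List[str] = []

    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        critiques.append(critique)

        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )

    return {
        "prompt": prompt,
        "initial": initial,
        "critiques": critiques,
        "revised": response,
    }
```

Because the loop runs exactly once per principle, it always terminates. The revised responses can then serve as supervised fine-tuning targets.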

Best for

When you want safety alignment without human-written harm labels and need refusals that explain their reasoning.

Inputs
  • · Base model
  • · Constitution (principles)
  • · Prompts and model responses to critique
  • · Preference dataset (optional)
Outputs
  • · Aligned model weights
  • · RLAIF preference pairs
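An RLAIF preference pair can be built by asking an AI judge which of two responses better follows a principle. The sketch below assumes a hypothetical `judge` callable that returns `"A"` or `"B"`. The `prompt` / `chosen` / `rejected` keys follow the common preference-dataset layout, and whether this skill emits exactly that layout is an assumption.

```python
from typing import Callable, Dict, Optional


def build_preference_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str, str, str, str], str],
) -> Optional[Dict[str, str]]:
    """Turn an AI judge's verdict into one preference record (hypothetical helper)."""
    # The judge is expected to answer "A" or "B"; tolerate stray whitespace and case.
    verdict = judge(prompt, response_a, response_b, principle).strip().upper()

    # Drop the example if the judge gives an unusable answer.
    if verdict not in ("A", "B"):
        return None

    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Records that come back as `None` are skipped rather than guessed, so noisy verdicts do not end up in the training data.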
Requires
  • · transformers
  • · torch
  • · trl
Preconditions

A base LLM, plus enough VRAM to hold the model and run feedback generation

Failure modes
  • · Over-refusal (evasive responses)
  • · Constitution principles contradicting each other
  • · Critique loop divergence
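Over-refusal can be monitored with a rough check: run the aligned model on prompts known to be benign and measure how often it refuses. The sketch below is a crude heuristic under stated assumptions. The `REFUSAL_MARKERS` list and the `refusal_rate` helper are illustrative and not part of the skill, and substring matching will miss refusals phrased differently.

```python
from typing import Iterable

# Illustrative phrases only; a real evaluation would need a broader detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses containing a refusal marker (hypothetical helper)."""
    responses = list(responses)
    if not responses:
        return 0.0

    # Count responses that contain at least one marker, case-insensitively.
    refused = sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses
    )
    return refused / len(responses)
```

A rate on benign prompts that rises after alignment suggests the constitution or the critique prompts are pushing the model toward evasive answers.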
Trust signals
  • · Method developed by Anthropic and used to train Claude
  • · RLAIF avoids expensive human preference labeling