Train safer AI with constitutional methods
constitutional-ai · skill · setup L1 · ★ 9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Align LLMs via self-critique and AI feedback
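As a rough illustration of the self-critique step, here is a minimal sketch of a critique-and-revise loop. The `generate` callable, the two example principles, and the prompt templates are all assumptions made for illustration; they are not taken from the skill itself. In practice `generate` would wrap a call to the base model.

```python
import random
from typing import Callable, Dict, List

# Hypothetical principles; a real constitution would be longer and more specific.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that explains its reasoning rather than refusing flatly.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str] = CONSTITUTION,
    rounds: int = 2,
    seed: int = 0,
) -> Dict[str, str]:
    """Draft a response, then critique and revise it `rounds` times."""
    rng = random.Random(seed)
    initial = generate(f"Human: {prompt}\n\nAssistant:")
    response = initial
    for _ in range(rounds):
        # Sample one principle per round (an assumption of this sketch).
        principle = rng.choice(principles)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique.\nRevision:"
        )
    return {"prompt": prompt, "initial": initial, "revised": response}


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without loading a model.
    def echo_generate(text: str) -> str:
        return f"[model output for {len(text)} chars of input]"

    print(critique_and_revise("How do I pick a lock?", echo_generate))
```

Each call makes one draft generation plus two generations per round, so the cost of feedback generation grows linearly with `rounds`.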
Best for
When you want safety alignment without human preference labels and need refusals that explain their reasoning.
Inputs
- · Base model
- · Constitution (principles)
- · Model responses to prompts, to be critiqued
- · Preference dataset (optional)
Outputs
- · Aligned model weights
- · RLAIF preference pairs
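One way the RLAIF preference pairs listed above might be assembled is sketched below. The `judge` callable and the question template are hypothetical; the `prompt` / `chosen` / `rejected` record layout matches the column names that TRL's preference trainers such as `DPOTrainer` accept, though whether this skill uses that trainer is an assumption.

```python
from typing import Callable, Dict


def build_preference_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],
) -> Dict[str, str]:
    """Ask a feedback model which response better follows a principle."""
    question = (
        f"Consider this prompt: {prompt}\n"
        f"{principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B:"
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    elif verdict.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        # Surface unparseable verdicts instead of guessing a label.
        raise ValueError(f"Unrecognized verdict: {verdict!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Raising on an unparseable verdict keeps mislabelled pairs out of the dataset; a caller could catch the error and skip the example.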
Requires
- · transformers
- · torch
- · trl
Preconditions
A base LLM, plus enough VRAM to hold the model and run feedback generation
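For a rough sense of the VRAM precondition, the sketch below computes a lower bound from the weights alone. The function name is hypothetical, and the figure excludes activations, the KV cache used during feedback generation, gradients, and optimizer state, so real usage will be higher.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GiB (2 bytes per parameter = fp16/bf16)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # A 7-billion-parameter model in 16-bit precision.
    print(f"{weights_gib(7e9):.2f} GiB")
```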
Failure modes
- · Over-refusal (evasive responses)
- · Constitution principles contradicting each other
- · Critique loop divergence
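Two of the failure modes above can be monitored with simple guards, sketched here under stated assumptions: the refusal marker list, the similarity threshold, and both function names are hypothetical, and keyword matching is only a crude proxy for over-refusal.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# Hypothetical markers; tune these for the model's actual refusal phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses containing a refusal marker (over-refusal proxy)."""
    if not responses:
        return 0.0
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)


def revise_until_stable(
    response: str,
    revise: Callable[[str], str],
    max_rounds: int = 4,
    stable_at: float = 0.95,
) -> str:
    """Revise until successive drafts barely change, with a hard round cap."""
    for _ in range(max_rounds):
        revised = revise(response)
        similarity = SequenceMatcher(None, response, revised).ratio()
        response = revised
        if similarity >= stable_at:
            break  # drafts have converged
    return response  # the cap bounds a loop that never converges
```

Tracking `refusal_rate` on a set of benign prompts before and after training gives an early signal of evasive behaviour, and the `max_rounds` cap ensures a diverging critique loop terminates.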
Trust signals
- · Method introduced by Anthropic and used to train Claude
- · RLAIF avoids expensive human preference labeling