cyberneticlibrary

Interpret 70B models without a local GPU

nnsight-remote-interpretability · skill · setup · L39,423
Orchestra-Research/AI-Research-SKILLs
What it does

Interpret and patch neural network activations using nnsight proxy objects

Best for

Running the same interpretability code on GPT-2 locally and Llama-405B remotely, with only the remote-execution flag changing between the two, so mechanistic interpretability research scales to models too large for local hardware.
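
The local/remote symmetry described above can be sketched as follows. This is a minimal illustration, not the skill's own code: it assumes nnsight>=0.5 (where `.save()` yields the tensor after the trace exits), a GPT-2-style module tree, and a transformers version in which a GPT-2 block returns a tuple whose first element is the hidden state. The function name `trace_hidden_state` is hypothetical.

```python
def trace_hidden_state(model_name, prompt, layer, remote=False):
    """Return (hidden_state, logits) for one prompt.

    Assumes a GPT-2-style module tree (model.transformer.h[i]);
    Llama-style models expose model.model.layers[i] instead.
    """
    # Imported lazily so this function can be defined without nnsight installed.
    from nnsight import LanguageModel

    model = LanguageModel(model_name, device_map="auto")

    # remote=True asks nnsight to run the same traced computation on NDIF
    # instead of the local machine (an NDIF API key must be configured).
    with model.trace(prompt, remote=remote):
        # Assumption: the block output is a tuple and element 0 is the
        # hidden state of shape [batch, seq, hidden].
        hidden = model.transformer.h[layer].output[0].save()
        logits = model.lm_head.output.save()

    return hidden, logits
```

A local call might look like `trace_hidden_state("openai-community/gpt2", "The Eiffel Tower is in", layer=8)`. For a remotely hosted model the module path would need to match that architecture, and `remote=True` would be passed.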

Inputs
  • PyTorch language model (any architecture: Llama, GPT, Mistral, custom)
  • Input text prompt(s)
  • Layer indices and module paths to inspect
  • Activation patching specifications (which activations to replace/zero/modify)
  • Optional remote NDIF API key for massive models (70B+)
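
One way to write down an activation patching specification is sketched below. `PatchSpec` and `apply_patch` are hypothetical names introduced only for illustration, not part of nnsight. The sketch works on plain nested lists standing in for a single `[seq][hidden]` activation, so it shows the bookkeeping (which position, which operation) rather than the tensor mechanics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchSpec:
    module_path: str       # where to patch, e.g. "transformer.h.8" (illustrative)
    position: int          # token position to modify
    mode: str = "replace"  # "replace", "zero", or "scale"
    scale: float = 1.0     # used only when mode == "scale"


def apply_patch(activation, spec, source=None):
    """Return a patched copy of a [seq][hidden] activation (nested lists)."""
    if spec.mode not in ("replace", "zero", "scale"):
        raise ValueError(f"unknown patch mode: {spec.mode}")

    patched = [list(row) for row in activation]  # leave the input untouched
    target = patched[spec.position]

    if spec.mode == "zero":
        patched[spec.position] = [0.0] * len(target)
    elif spec.mode == "scale":
        patched[spec.position] = [x * spec.scale for x in target]
    else:
        # "replace": copy the same position from another run's activation.
        if source is None:
            raise ValueError("replace mode needs a source activation")
        if len(source[spec.position]) != len(target):
            raise ValueError("source and target hidden sizes differ")
        patched[spec.position] = list(source[spec.position])

    return patched
```

With a clean run `[[1.0, 2.0], [3.0, 4.0]]` and a corrupted run `[[9.0, 9.0], [8.0, 7.0]]`, `apply_patch(clean, PatchSpec("transformer.h.8", 1), corrupted)` swaps in the second row only, leaving the first position as it was.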
Outputs
  • Saved activations (hidden states as [batch, seq, hidden]; logits as [batch, seq, vocab]; attention weights)
  • Patched generation output (tokens, logits) with modified activations
  • Comparative metrics (original vs. patched probability, entropy, token prediction)
  • Mechanistic interpretability findings (which layers matter for which predictions)
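
The comparative metrics listed above can be computed from the final-position logits of the original and patched runs. The sketch below uses plain Python lists in place of tensors; the helper names (`softmax`, `entropy`, `compare_runs`) are illustrative assumptions, not part of nnsight.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def compare_runs(original_logits, patched_logits, target_id):
    """Summarize how a patch changed the prediction for one token position."""
    p_orig = softmax(original_logits)
    p_patch = softmax(patched_logits)
    return {
        "prob_original": p_orig[target_id],
        "prob_patched": p_patch[target_id],
        "prob_delta": p_patch[target_id] - p_orig[target_id],
        "entropy_original": entropy(p_orig),
        "entropy_patched": entropy(p_patch),
        "top_token_original": max(range(len(p_orig)), key=p_orig.__getitem__),
        "top_token_patched": max(range(len(p_patch)), key=p_patch.__getitem__),
    }
```

A large negative `prob_delta` for the correct token after patching a layer suggests that the layer carries information the prediction depends on; a rise in entropy indicates the patched model has become less certain overall.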
Requires
  • nnsight>=0.5.0
  • torch>=2.0.0
  • transformers (HuggingFace)
  • Optional: NDIF API key for remote execution (login.ndif.us)
  • Optional: vLLM for faster batched inference
Preconditions
  • PyTorch model loadable via LanguageModel wrapper
  • GPU memory for local execution (or NDIF API key for remote)
  • Knowledge of model architecture (layer counts, module names)
  • Familiarity with transformers internals (self-attention, MLPs, layer normalization)
  • Input text tokenizable by model's tokenizer
Failure modes
  • Wrong module path (e.g., model.model.layers[8] on Llama vs. model.transformer.h[8] on GPT-2) → AttributeError
  • Proxy object operations outside trace context → fails silently
  • Patching activations with wrong shape → dimension mismatch error
  • NDIF remote execution timeout → incomplete results (increase timeout setting)
  • Saving too many large activations → out-of-memory during trace context exit
  • Activation dimensions change mid-trace (due to position embeddings) → indexing fails
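
Several of these failures can be caught before a trace is launched. The sketch below shows three hypothetical pre-flight helpers (none of them nnsight APIs): one resolves a dotted module path and reports where it breaks, one compares shapes before patching, and one estimates the memory that saved activations will occupy. The `stub_model` object is a stand-in for a real model and exists only to make the example self-contained.

```python
from types import SimpleNamespace


def resolve_module(root, path):
    """Walk a path such as 'transformer.h.8'; numeric parts index containers."""
    current = root
    walked = []
    for part in path.split("."):
        walked.append(part)
        try:
            current = current[int(part)] if part.isdigit() else getattr(current, part)
        except (AttributeError, IndexError, KeyError, TypeError) as exc:
            failed_at = ".".join(walked)
            raise AttributeError(f"cannot resolve '{failed_at}' in '{path}'") from exc
    return current


def check_patch_shape(target_shape, patch_shape):
    """Raise early if a replacement activation does not match its target."""
    if tuple(target_shape) != tuple(patch_shape):
        raise ValueError(f"patch shape {tuple(patch_shape)} != target shape {tuple(target_shape)}")


def activation_bytes(batch, seq, hidden, n_saved, bytes_per_value=2):
    """Rough size of n_saved activations of shape [batch, seq, hidden]."""
    return batch * seq * hidden * n_saved * bytes_per_value


# Stand-in for a GPT-2-style model with 12 blocks (illustration only).
stub_model = SimpleNamespace(
    transformer=SimpleNamespace(h=[f"block{i}" for i in range(12)])
)
```

With the stub, `resolve_module(stub_model, "transformer.h.8")` succeeds, while a Llama-style path such as `"model.layers.8"` raises an `AttributeError` naming the first part that failed. As a sizing example, saving one 16-bit hidden state per layer across 80 layers at hidden size 8192 and sequence length 2048 gives `activation_bytes(1, 2048, 8192, 80)`, which is 2,684,354,560 bytes (2.5 GiB).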
Trust signals
  • Peer-reviewed ICLR 2025 paper (arXiv:2407.14561)
  • GitHub 730+ stars, active maintenance
  • Unique capability: remote execution via NDIF without rewriting local code
  • Transparent proxy object model (operations are recorded, not executed immediately)
  • Integration with established PyTorch ecosystem (no vendor lock-in)
  • Supports activation patching workflows from top interpretability papers (Li et al. 2023, etc.)