Interpret 70B models without local GPU
nnsight-remote-interpretability · skill · setup L3 · ★9,423
Orchestra-Research/AI-Research-SKILLs
What it does
Interpret and patch neural network activations using nnsight proxy objects
Best for
Running the same interpretability code on GPT-2 locally and Llama-405B remotely without code changes, enabling scalable mechanistic interpretability research on massive models.
Inputs
- · PyTorch language model (any architecture: Llama, GPT, Mistral, custom)
- · Input text prompt(s)
- · Layer indices and module paths to inspect
- · Activation patching specifications (which activations to replace/zero/modify)
- · Optional remote NDIF API key for massive models (70B+)
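One way to picture an activation patching specification is as a small record naming the module path, the token position, and the operation to apply. The sketch below is illustrative only: `PatchSpec` and `apply_patch` are hypothetical names, not part of nnsight, and activations are plain nested lists of shape [seq, hidden] rather than tensors.

```python
from dataclasses import dataclass
from typing import List, Optional

Activation = List[List[float]]  # [seq, hidden] for a single prompt


@dataclass
class PatchSpec:
    """Hypothetical description of one patch (not an nnsight type)."""
    module_path: str                      # e.g. "transformer.h.8"
    position: int                         # token position to modify
    mode: str                             # "replace", "zero", or "scale"
    source: Optional[Activation] = None   # donor activations for "replace"
    factor: float = 1.0                   # multiplier for "scale"


def apply_patch(act: Activation, spec: PatchSpec) -> Activation:
    """Return a patched copy of `act`; the input is left untouched."""
    if not -len(act) <= spec.position < len(act):
        raise IndexError(
            f"position {spec.position} out of range for seq length {len(act)}"
        )
    patched = [row[:] for row in act]
    hidden = len(patched[spec.position])
    if spec.mode == "zero":
        patched[spec.position] = [0.0] * hidden
    elif spec.mode == "scale":
        patched[spec.position] = [x * spec.factor for x in patched[spec.position]]
    elif spec.mode == "replace":
        if spec.source is None:
            raise ValueError("replace mode needs source activations")
        donor = spec.source[spec.position]
        if len(donor) != hidden:
            raise ValueError(f"hidden size mismatch: {len(donor)} vs {hidden}")
        patched[spec.position] = donor[:]
    else:
        raise ValueError(f"unknown mode: {spec.mode}")
    return patched
```

Keeping the specification separate from the code that applies it makes it easier to sweep over layers and positions: the same list of specs can be replayed against different prompts.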
Outputs
- · Saved activations: hidden states in shape [batch, seq, hidden]; attention weights and logits have their own shapes (e.g., logits are [batch, seq, vocab])
- · Patched generation output (tokens, logits) with modified activations
- · Comparative metrics (original vs. patched probability, entropy, token prediction)
- · Mechanistic interpretability findings (which layers matter for which predictions)
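The comparative metrics listed above can be computed from two logit vectors for the final token position. This is a minimal sketch in pure Python, assuming the logits have already been extracted as lists of floats; `compare_logits` and the key names in its result are illustrative, not part of nnsight.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def compare_logits(original: List[float], patched: List[float], target_id: int) -> dict:
    """Compare target-token probability, entropy, and top token before and after patching."""
    p_orig = softmax(original)
    p_patch = softmax(patched)
    return {
        "p_target_original": p_orig[target_id],
        "p_target_patched": p_patch[target_id],
        "p_target_delta": p_patch[target_id] - p_orig[target_id],
        "entropy_original": entropy(p_orig),
        "entropy_patched": entropy(p_patch),
        "top_token_original": max(range(len(p_orig)), key=p_orig.__getitem__),
        "top_token_patched": max(range(len(p_patch)), key=p_patch.__getitem__),
    }
```

A large negative `p_target_delta` after patching a given layer suggests that the layer carries information the model needs for that prediction.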
Requires
- · nnsight>=0.5.0
- · torch>=2.0.0
- · transformers (HuggingFace)
- · Optional: NDIF API key for remote execution (login.ndif.us)
- · Optional: vLLM for faster batched inference
Preconditions
- · PyTorch model loadable via LanguageModel wrapper
- · GPU memory for local execution (or NDIF API key for remote)
- · Knowledge of model architecture (layer counts, module names)
- · Familiarity with transformers internals (self-attention, MLPs, layer normalization)
- · Input text tokenizable by model's tokenizer
Failure modes
- · Wrong module path (e.g., model.layers[8] vs. model.transformer.h[8]) → AttributeError
- · Proxy object operations outside trace context → fail silently
- · Patching activations with wrong shape → dimension mismatch error
- · NDIF remote execution timeout → incomplete results (increase timeout setting)
- · Saving too many large activations → out-of-memory during trace context exit
- · Activation dimensions change mid-trace (due to position embeddings) → indexing fails
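The wrong-module-path failure can be caught before a trace is opened by resolving the path up front. The helper below is a sketch: `resolve_path` is a hypothetical name, and the two stand-in objects only loosely mirror a GPT-2 style `transformer.h` layout and a Llama style `model.layers` layout. It should work on any object tree that supports attribute access and integer indexing.

```python
from types import SimpleNamespace
from typing import Any


def resolve_path(root: Any, path: str) -> Any:
    """Walk a dotted path such as 'transformer.h.8'; numeric parts index into sequences."""
    node = root
    walked = []
    for part in path.split("."):
        try:
            node = node[int(part)] if part.isdigit() else getattr(node, part)
        except (AttributeError, IndexError) as err:
            here = ".".join(walked) or "<root>"
            raise AttributeError(
                f"cannot resolve '{part}' under '{here}' in path '{path}'"
            ) from err
        walked.append(part)
    return node


# Stand-in objects for illustration only (not real models).
gpt2_like = SimpleNamespace(
    transformer=SimpleNamespace(h=[f"block{i}" for i in range(12)])
)
llama_like = SimpleNamespace(
    model=SimpleNamespace(layers=[f"layer{i}" for i in range(32)])
)

print(resolve_path(gpt2_like, "transformer.h.8"))
print(resolve_path(llama_like, "model.layers.8"))
```

Because the error message names the part that failed and the prefix that resolved, it is easier to see whether the architecture uses a different naming scheme or the layer index is out of range.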
Trust signals
- · Peer-reviewed ICLR 2025 paper (arxiv:2407.14561)
- · GitHub 730+ stars, active maintenance
- · Unique capability: remote execution via NDIF without changing local code
- · Transparent proxy object model (operations are recorded, not executed immediately)
- · Integration with established PyTorch ecosystem (no vendor lock-in)
- · Supports activation patching workflows from top interpretability papers (Li et al. 2023, etc.)
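The "recorded, not executed immediately" proxy idea can be illustrated with a toy class. This is not how nnsight is implemented internally; `RecordingProxy` is a hypothetical name used only to show deferred execution. Operations on the proxy build up a list of steps, and nothing runs until a concrete value is supplied.

```python
from typing import Any, Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class RecordingProxy:
    """Toy proxy: operations are recorded as steps and replayed later."""

    def __init__(self, ops: Optional[List[Step]] = None):
        self.ops: List[Step] = list(ops or [])

    def _extend(self, name: str, fn: Callable[[Any], Any]) -> "RecordingProxy":
        # Each operation returns a new proxy, leaving the original unchanged.
        return RecordingProxy(self.ops + [(name, fn)])

    def __getitem__(self, idx):
        return self._extend(f"index {idx}", lambda x: x[idx])

    def __mul__(self, k):
        return self._extend(f"mul {k}", lambda x: x * k)

    def __add__(self, k):
        return self._extend(f"add {k}", lambda x: x + k)

    def run(self, value: Any) -> Any:
        """Replay the recorded steps against a concrete value."""
        for _, fn in self.ops:
            value = fn(value)
        return value


# Building the expression executes nothing; it only records three steps.
expr = RecordingProxy()[0] * 2 + 1
print([name for name, _ in expr.ops])
print(expr.run([10, 20]))
```

Recording first and executing later is what allows the same description of an experiment to be replayed somewhere else, which is the property the remote-execution claim above relies on.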