Interpret 70B models without local GPU

nnsight-remote-interpretability · skill · setup · L3 · ★9,423

What it does

Interpret and patch neural network activations using nnsight proxy objects.

Best for

Running the same interpretability code on GPT-2 locally and Llama-405B remotely without code changes, enabling scalable mechanistic interpretability research on massive models.

Inputs

· PyTorch language model (any architecture: Llama, GPT, Mistral, custom)
· Input text prompt(s)
· Layer indices and module paths to inspect
· Activation patching specifications (which activations to replace/zero/modify)
· Optional remote NDIF API key for massive models (70B+)
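An activation patching specification can be written down as a small record that names the module path, the token position, and the operation to apply. The sketch below is illustrative only: `PatchSpec` and `apply_patch` are hypothetical names, not part of nnsight; the activation is a plain nested list of shape [seq, hidden] rather than a tensor; and "scale" stands in for the more general "modify" case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatchSpec:
    """Hypothetical record describing one activation patch (not an nnsight type)."""
    module_path: str                      # e.g. "transformer.h.8" on a GPT-2 style model
    position: int                         # token index whose activation is patched
    mode: str                             # "replace", "zero", or "scale"
    value: Optional[List[float]] = None   # replacement vector, used by "replace"
    factor: float = 1.0                   # multiplier, used by "scale"


def apply_patch(activation: List[List[float]], spec: PatchSpec) -> List[List[float]]:
    """Return a patched copy of a [seq, hidden] activation; the input is not mutated."""
    if not -len(activation) <= spec.position < len(activation):
        raise IndexError(f"position {spec.position} is outside a sequence of length {len(activation)}")

    patched = [row[:] for row in activation]
    row = patched[spec.position]

    if spec.mode == "zero":
        patched[spec.position] = [0.0] * len(row)
    elif spec.mode == "replace":
        # A replacement vector of the wrong size is the "wrong shape" failure case.
        if spec.value is None or len(spec.value) != len(row):
            raise ValueError("replacement vector must match the hidden size")
        patched[spec.position] = list(spec.value)
    elif spec.mode == "scale":
        patched[spec.position] = [x * spec.factor for x in row]
    else:
        raise ValueError(f"unknown patch mode: {spec.mode!r}")
    return patched
```

Keeping the specification separate from the code that applies it makes it easier to reuse the same list of patches across a small local model and a large remote one.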

Outputs

· Saved activations: hidden states of shape [batch, seq, hidden], attention weights, and logits of shape [batch, seq, vocab]
· Patched generation output (tokens, logits) with modified activations
· Comparative metrics (original vs. patched probability, entropy, token prediction)
· Mechanistic interpretability findings (which layers matter for which predictions)
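The comparative metrics listed above can be computed from the final-position logits of the original and patched runs. This is a minimal sketch using only the standard library; `compare_runs` and its dictionary keys are hypothetical names chosen for illustration, and the logits are assumed to have already been extracted as plain lists of floats.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def compare_runs(original_logits: List[float], patched_logits: List[float], target_token: int) -> dict:
    """Compare next-token distributions from an original and a patched forward pass."""
    p_orig = softmax(original_logits)
    p_patch = softmax(patched_logits)
    return {
        "orig_prob": p_orig[target_token],
        "patched_prob": p_patch[target_token],
        "prob_change": p_patch[target_token] - p_orig[target_token],
        "orig_entropy": entropy(p_orig),
        "patched_entropy": entropy(p_patch),
        # Index of the most likely token in each run (first index wins on ties).
        "orig_top_token": max(range(len(p_orig)), key=p_orig.__getitem__),
        "patched_top_token": max(range(len(p_patch)), key=p_patch.__getitem__),
    }
```

A large negative `prob_change` for the correct token after patching one layer is the kind of evidence that suggests the layer matters for that prediction.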

Requires

· nnsight>=0.5.0
· torch>=2.0.0
· transformers (HuggingFace)
· Optional: NDIF API key for remote execution (login.ndif.us)
· Optional: vLLM for faster batched inference

Preconditions

· PyTorch model loadable via LanguageModel wrapper
· GPU memory for local execution (or NDIF API key for remote)
· Knowledge of model architecture (layer counts, module names)
· Familiarity with transformers internals (self-attention, MLPs, layer normalization)
· Input text tokenizable by model's tokenizer

Failure modes

· Wrong module path (e.g., model.layers[8] vs. model.transformer.h[8]) → AttributeError
· Proxy object operations outside trace context → fail silently
· Patching activations with wrong shape → dimension mismatch error
· NDIF remote execution timeout → incomplete results (increase timeout setting)
· Saving too many large activations → out-of-memory during trace context exit
· Activation dimensions change mid-trace (due to position embeddings) → indexing fails
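The wrong-module-path failure can be caught before a trace is opened by checking candidate paths against the model object. The sketch below is an assumption-laden illustration, not part of nnsight: `resolve_module` and `first_valid_path` are hypothetical helpers, and the two "models" are `SimpleNamespace` stand-ins that only mimic the GPT-2 style (`transformer.h`) and Llama style (`model.layers`) layouts.

```python
from types import SimpleNamespace
from typing import Any, List


def resolve_module(root: Any, path: str) -> Any:
    """Walk a dotted path such as 'transformer.h.8'; numeric parts index into sequences."""
    current = root
    walked: List[str] = []
    for part in path.split("."):
        try:
            current = current[int(part)] if part.isdigit() else getattr(current, part)
        except (AttributeError, IndexError, TypeError) as exc:
            prefix = ".".join(walked) or "<root>"
            raise AttributeError(
                f"cannot resolve {part!r} after {prefix!r} in path {path!r}"
            ) from exc
        walked.append(part)
    return current


def first_valid_path(root: Any, candidates: List[str]) -> str:
    """Return the first candidate path that resolves on this model."""
    for path in candidates:
        try:
            resolve_module(root, path)
            return path
        except AttributeError:
            continue
    raise AttributeError(f"none of the candidate paths resolve: {candidates}")


# Stand-in objects (assumptions for illustration, not real models).
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=[f"block{i}" for i in range(12)]))
llama_like = SimpleNamespace(model=SimpleNamespace(layers=[f"layer{i}" for i in range(32)]))

candidates = ["model.layers.8", "transformer.h.8"]
gpt2_path = first_valid_path(gpt2_like, candidates)
llama_path = first_valid_path(llama_like, candidates)
```

Because the error message names the exact path component that failed, a typo or an architecture mismatch is reported up front instead of surfacing as an AttributeError in the middle of a trace.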

Trust signals

· Peer-reviewed ICLR 2025 paper (arXiv:2407.14561)
· GitHub 730+ stars, active maintenance
· Unique capability: remote execution via NDIF without changing local code
· Transparent proxy object model (operations are recorded, not executed immediately)
· Integration with established PyTorch ecosystem (no vendor lock-in)
· Supports activation patching workflows used in published interpretability work (e.g., Li et al. 2023)