SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 29, 2026
Mechanistic interpretability has a foundational assumption: that the circuits we extract from language models faithfully describe how those models compute their outputs. Cross-Layer Transcoders (CLTs), introduced by Anthropic in 2025 as the infrastructure for their attribution graphs and the "Biology of a Large Language Model" research, are the current standard-bearer for this kind of circuit analysis. They produce interpretable feature graphs. They run at scale. They powered the first serious attempt to reverse-engineer a frontier model's internal computation.
They can also, under specific but real training conditions, produce a circuit that is behaviorally accurate but computationally false. This is the problem that Lange, Dearstyne, Maher, and colleagues demonstrated in February 2026. And it matters because interpretability tools that skip intermediate computation cannot distinguish between a model that reasons correctly and one that pattern-matches, or between a model with aligned internal goals and one that conceals its true reasoning.
This newsletter dissects the CLT architecture: what a transcoder is, what makes a CLT different from a per-layer transcoder (PLT), how the joint sparsity loss creates an incentive toward unfaithful circuit compression, what the Boolean toy model experiment proved, and what the CLT-Forge and circuit-tracer toolchain reveals in practice.
Scope: CLT architecture, the faithfulness problem, the toy model proof, practical toolchain (circuit-tracer, CLT-Forge), skip transcoders (arXiv:2501.18823), and the vision transformer application (arXiv:2604.13304). Not covered: SAE training details beyond comparison to transcoders, or the full mechanistic interpretability stack beyond CLTs.
What It Actually Does
A transcoder (introduced by Dunefsky et al., 2024) is a sparse neural network trained to approximate the input-output function of a single MLP layer. Where a Sparse Autoencoder (SAE) decomposes activations at one point in the residual stream into interpretable features, a transcoder takes the pre-MLP residual stream and predicts the post-MLP output as a sparse linear combination of learned features. This is the critical difference: transcoder features describe what an MLP computes, not just what activations look like.
A Cross-Layer Transcoder (CLT), introduced by Anthropic in Ameisen et al. (2025), extends this by sharing features across multiple layers. A feature encoded at layer ℓ has decoder weights for layers ℓ, ℓ+1, ..., L. The same feature can contribute to the residual stream at multiple downstream layers. CLTs are trained jointly across all layers against a combined reconstruction loss plus a single sparsity penalty summed across all layers.
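To make the cross-layer structure concrete, here is a minimal forward-pass sketch. The shapes and names are illustrative assumptions, not the CLT-Forge or Anthropic implementation, and a plain ReLU stands in for the JumpReLU used in practice:
import torch
import torch.nn as nn

class MinimalCLT(nn.Module):
    """Minimal cross-layer transcoder sketch (illustrative, not CLT-Forge)."""
    def __init__(self, d_model: int, n_layers: int, n_features: int):
        super().__init__()
        self.n_layers = n_layers
        # One encoder per layer: reads the pre-MLP residual stream at layer l
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features) for _ in range(n_layers)
        )
        # Cross-layer decoders: a feature encoded at layer `src` has a separate
        # decoder for every target layer src, src+1, ..., n_layers-1
        self.decoders = nn.ModuleDict({
            f"{src}_{tgt}": nn.Linear(n_features, d_model, bias=False)
            for src in range(n_layers)
            for tgt in range(src, n_layers)
        })

    def forward(self, pre_mlp):
        # pre_mlp: dict {layer: (batch, d_model)} residual stream before each MLP
        feats = {l: torch.relu(self.encoders[l](pre_mlp[l])) for l in pre_mlp}
        # The predicted MLP output at layer `tgt` sums contributions from all
        # features encoded at layers <= tgt (this is the cross-layer part)
        recon = {
            tgt: sum(self.decoders[f"{src}_{tgt}"](feats[src]) for src in range(tgt + 1))
            for tgt in range(self.n_layers)
        }
        return feats, recon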
The practical result: CLTs collapse redundant "amplification chains" (where similar features activate across many consecutive layers) into single cross-layer features, dramatically reducing attribution graph size. This is what made the "Biology of a Large Language Model" analysis tractable: attribution graphs with hundreds of nodes instead of thousands.
The ecosystem around CLTs has grown rapidly:
circuit-tracer (Anthropic): Given a model with pre-trained transcoders, computes the attribution graph, visualizes features, and enables interventions.
CLT-Forge (Max Planck / Vector Institute): Scalable end-to-end library for CLT training with distributed sharding, compressed activation caching, automated interpretability pipeline, and Circuit-Tracer integration.
crosscode: Crosscoder implementation for model diffing and cross-layer feature analysis.
transcoder_circuits: Original Dunefsky et al. transcoder circuit analysis code.
The Architecture, Unpacked

Focus on the joint training loss. The summed sparsity penalty across all layers is what makes CLTs compact. It is also what can make them unfaithful: a two-hop circuit A → B → C can be approximated by two single-hop circuits A → C and B → C at lower total sparsity cost, causing the CLT to skip the intermediate feature B.
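Written out schematically (a simplified sketch; Ameisen et al. use a Tanh-shaped penalty on decoder-norm-weighted activations rather than a raw L1), the joint objective is
L_total = Σ_ℓ ‖ MLP_ℓ(x_ℓ) − ŷ_ℓ ‖² + λ · Σ_ℓ Σ_i φ(a_i^(ℓ)),   with   ŷ_ℓ = Σ_{ℓ' ≤ ℓ} W_dec^(ℓ'→ℓ) a^(ℓ')
where a^(ℓ) are the sparse feature activations encoded from the pre-MLP residual stream at layer ℓ and φ is the per-feature sparsity penalty. Because the single coefficient λ applies to the sum over all layers, the optimizer is free to trade a small increase in reconstruction error for a large reduction in total active features, which is exactly the trade that prices out an intermediate feature B.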
The replacement model approach is the conceptual foundation. When CLTs are trained, the goal is for the set of CLT features plus attention SAEs to form a "replacement model" that mirrors the original model's computation. The circuits derived from this replacement model are then assumed to describe the base model. The assumption is not validated, it is adopted. The faithfulness question is: when is that assumption safe, and when is it not?
The Code
Snippet One: Training a CLT with CLT-Forge (JumpReLU, joint loss)
# Based on CLT-Forge library: https://github.com/LLM-Interp/CLT-Forge
# This implements the L1-regularized JumpReLU CLT training loop
# (the same architecture used in Anthropic's circuit tracing work)
import torch
from clt_forge import CrossLayerTranscoder, CLTConfig, ActivationStore
# CLT configuration for a 2-layer toy model (mirroring the faithfulness experiment)
config = CLTConfig(
    d_model=128,      # residual stream dimension
    n_layers=2,       # number of MLP layers to transcode
    n_features=32,    # features per layer (overcomplete basis)
    # ← JumpReLU activation: jumps at learned threshold θ, zero below
    #   Produces cleaner sparsity than ReLU and is empirically preferred for CLTs
    activation="jumprelu",
    # ← THIS is the critical hyperparameter: the sparsity coefficient
    #   Higher λ → sparser features → more compression → higher risk of feature skipping
    #   Ameisen et al. used a Tanh sparsity penalty with tuned λ
    sparsity_coeff=1e-3,
    # ← L1-of-norms: encourages sparse feature activation per layer
    #   L2-of-norms (alternative): incentivizes features to spread across layers
    #   L1 is safer for faithfulness; L2 maximizes compression but risks skipping
    sparsity_type="l1_of_norms",
)
clt = CrossLayerTranscoder(config)
optimizer = torch.optim.Adam(clt.parameters(), lr=1e-4)
# Training loop: joint loss summed across all layers
def train_step(batch_activations):
    """
    batch_activations: dict {layer_idx: (pre_mlp, post_mlp)} tensors
        pre_mlp: residual stream before the MLP (the transcoder input)
        post_mlp: residual stream after the MLP (the reconstruction target)
    """
    total_loss = 0.0
    for layer_idx in range(config.n_layers):
        pre, post = batch_activations[layer_idx]
        # Forward pass through the CLT for this layer's encoder
        features, reconstructions = clt.forward(pre, layer_idx)
        # features: (batch, n_features) — the sparse feature activations
        # reconstructions: dict {layer: (batch, d_model)} — cross-layer outputs

        # Reconstruction loss for this layer
        # ← We compare reconstructed post-MLP output to the actual model output
        #   This is the "replacement model" objective: CLT should mirror the MLP
        recon_loss = torch.nn.functional.mse_loss(
            reconstructions[layer_idx], post
        )

        # ← ALSO add reconstruction loss for all DOWNSTREAM layers
        #   This is what makes it cross-layer: a feature at layer 0 is trained
        #   to reconstruct MLP outputs at layers 1, 2, ..., L as well
        for downstream_layer in range(layer_idx + 1, config.n_layers):
            if downstream_layer in reconstructions:
                _, post_down = batch_activations[downstream_layer]
                recon_loss += torch.nn.functional.mse_loss(
                    reconstructions[downstream_layer], post_down
                )

        # Sparsity penalty: L1 norm of feature activations summed across layers
        # ← THIS is the incentive toward feature skipping:
        #   A two-hop circuit A→B→C uses 2 features (B and C active separately)
        #   A single-hop approximation A→C uses 1 feature at lower sparsity cost
        #   If the CLT can approximate the two-hop computation with one feature,
        #   the sparsity loss will reward it for doing so, even if unfaithful.
        sparsity_loss = config.sparsity_coeff * features.abs().sum()
        total_loss += recon_loss + sparsity_loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
The sparsity loss is not a regularization detail. It is an architectural incentive that shapes which circuits the CLT learns. Setting it too high pushes toward unfaithful compression. Setting it too low produces large, redundant graphs. The correct value is task-dependent and cannot be determined by training loss alone.
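A minimal sketch of that tuning loop, reusing the assumed clt_forge names and the train_step loop from Snippet One (activation_store and held_out are placeholders for whatever cached-activation iterator and validation split are available):
# Sketch: sweep the sparsity coefficient and log the reconstruction/L0 tradeoff.
# Neither metric alone reveals whether the learned circuit is faithful.
results = []
for coeff in [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]:
    config = CLTConfig(d_model=128, n_layers=2, n_features=32,
                       activation="jumprelu", sparsity_coeff=coeff,
                       sparsity_type="l1_of_norms")
    clt = CrossLayerTranscoder(config)
    optimizer = torch.optim.Adam(clt.parameters(), lr=1e-4)
    for batch in activation_store:       # assumed iterator over cached activations
        train_step(batch)                # the joint-loss step from Snippet One
    with torch.no_grad():                # evaluate on held-out activations
        pre, post = held_out[0]          # layer-0 (pre_mlp, post_mlp), assumed
        features, reconstructions = clt.forward(pre, 0)
        mse = torch.nn.functional.mse_loss(reconstructions[0], post).item()
        l0 = (features != 0).float().sum(dim=-1).mean().item()
    results.append((coeff, mse, l0))
for coeff, mse, l0 in results:
    print(f"lambda={coeff:.0e}  recon_mse={mse:.4f}  mean_L0={l0:.1f}")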
Snippet Two: Attribution Graph Extraction with circuit-tracer
# Based on: https://github.com/safety-research/circuit-tracer
# Circuit tracing: compute the attribution graph for a specific prompt
# This is the downstream consumer of a trained CLT
from circuit_tracer import AttributionGraph, TranscoderModel
import torch
# Load a pre-trained CLT-wrapped model
# ← TranscoderModel replaces each MLP with its CLT equivalent
# The resulting "replacement model" is what circuit tracing operates on
transcoder_model = TranscoderModel.from_pretrained(
    base_model="gpt2",
    transcoder_path="path/to/trained/clt",
    device="cuda",
)
# Define the prompt and target token to explain
prompt = "The capital of France is"
target_token = " Paris" # the token whose logit we want to explain
# Compute the attribution graph
# ← This runs a forward pass through the replacement model,
# then computes direct effects of each feature on each downstream feature
# using linear approximations through the residual stream
graph = AttributionGraph.compute(
    model=transcoder_model,
    prompt=prompt,
    target_token=target_token,
    # ← attribution_threshold: minimum edge weight to include in graph
    #   Too low: graph explodes in size (hundreds of irrelevant edges)
    #   Too high: important causal pathways get pruned away
    attribution_threshold=0.05,
)
# Inspect the top features in the circuit
for node in graph.top_nodes(n=10):
    print(f"Layer {node.layer}, Feature {node.feature_idx}")
    print(f"  Attribution score: {node.attribution:.3f}")
    print(f"  Auto-interpretation: {node.auto_interp_label}")
    print(f"  Example activating tokens: {node.top_activating_examples[:3]}")
# ← THIS is the critical limitation identified by Lange et al.:
# If the CLT learned an unfaithful circuit (A→C instead of A→B→C),
# feature B will not appear in this graph even if it is causally necessary
# in the original model. The graph is behaviorally accurate but mechanistically wrong.
# There is no way to detect this from the graph output alone.
# Intervention: set a feature to zero to verify causal role
with graph.intervention(feature_idx=42, layer=3, value=0.0):
    modified_output = transcoder_model(prompt)
# If the feature is truly causal in the replacement model, output changes.
# If the CLT is unfaithful, a causally necessary feature in the BASE model
# may be ABSENT from the replacement model entirely.
The attribution graph looks correct whether or not the CLT is faithful. Both a faithful circuit and an unfaithful one produce a graph that predicts the output token correctly. The only way to distinguish them is to ablate model components in the BASE model and check whether the attribution graph predicted those ablations would matter.
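In practice that check looks roughly like the following. This is a sketch in plain PyTorch hooks rather than any specific circuit-tracer API: base_model, tokens, target_id, layer_mlp, and direction (the decoder vector of the CLT feature under test) are all assumed inputs.
import torch

def ablate_direction_hook(direction: torch.Tensor):
    """Forward hook that projects one feature direction out of an MLP's output."""
    d = direction / direction.norm()
    def hook(module, inputs, output):
        coeff = (output @ d).unsqueeze(-1)   # (batch, seq, 1) component along d
        return output - coeff * d            # returning a value replaces the output
    return hook

@torch.no_grad()
def logit_delta(base_model, tokens, target_id, layer_mlp, direction):
    """Change in the target-token logit when the direction is ablated at one MLP
    of the BASE model (HuggingFace-style model with a .logits output assumed)."""
    clean = base_model(tokens).logits[0, -1, target_id].item()
    handle = layer_mlp.register_forward_hook(ablate_direction_hook(direction))
    try:
        ablated = base_model(tokens).logits[0, -1, target_id].item()
    finally:
        handle.remove()
    return clean - ablated

# If the attribution graph says the feature matters, the delta should be large.
# If ablating a direction the graph omits still moves the logit, the replacement
# model has hidden a causally relevant component (the Lange et al. failure mode).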
It In Action: End-to-End Worked Example
The Boolean Toy Model Experiment (Lange et al., February 2026)
Goal: Verify whether a trained CLT faithfully recovers a known ground-truth circuit.
Input: A toy model with two MLP-only layers implementing (a XOR b) AND (c XOR d), with four binary inputs (+1 or -1). Ground-truth circuit is known.
Step 1: Ground truth circuit (hand-crafted)
Layer 0 neurons (MLP 0):
e = ReLU(a - b - 1) → true if a=+1, b=-1
f = ReLU(b - a - 1) → true if b=+1, a=-1
g = ReLU(c - d - 1) → true if c=+1, d=-1
h = ReLU(d - c - 1) → true if d=+1, c=-1
Layer 1 neurons (MLP 1):
i = ReLU(e + f + g + h - 1) → true iff BOTH XORs are true
Ground truth features: {a, b, c, d} (inputs), {e, f, g, h} (intermediate), {i} (output)
Ground truth circuit: inputs → 4 XOR features (Layer 0) → AND feature (Layer 1)
This is a 2-hop circuit: inputs → intermediate features → output
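The Step 1 circuit is small enough to verify directly. A minimal sketch in plain Python (not the authors' code) that implements the two MLP layers and checks the truth table:
# Sketch of the Step 1 ground-truth circuit (not the authors' implementation).
from itertools import product

def relu(x):
    return max(0.0, x)

def toy_model(a, b, c, d):
    # Layer 0 (MLP 0): four XOR partial products
    e = relu(a - b - 1)   # fires iff a=+1, b=-1
    f = relu(b - a - 1)   # fires iff b=+1, a=-1
    g = relu(c - d - 1)   # fires iff c=+1, d=-1
    h = relu(d - c - 1)   # fires iff d=+1, c=-1
    # Layer 1 (MLP 1): AND of the two XORs
    i = relu(e + f + g + h - 1)   # fires iff both XORs are true
    return i

# Verify against the Boolean specification (a XOR b) AND (c XOR d)
for a, b, c, d in product([+1, -1], repeat=4):
    expected = (a != b) and (c != d)
    assert (toy_model(a, b, c, d) > 0) == expected
print("ground-truth circuit matches (a XOR b) AND (c XOR d) on all 16 inputs")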
Step 2: Train PLT (per-layer transcoder) and CLT on this toy model
PLT result: recovers the 2-hop circuit correctly
PLT features at Layer 0: recover e, f, g, h (the four XOR partial products)
PLT features at Layer 1: recover i (the AND combination)
PLT circuit: a,b,c,d → e,f,g,h → i (FAITHFUL to ground truth)
CLT result (with L1 sparsity penalty): FAILS to recover the 2-hop circuit
CLT learns: TWO parallel single-hop circuits
Circuit 1: a → i directly (skipping intermediate features e, f)
Circuit 2: c → i directly (skipping intermediate features g, h)
CLT circuit: a,c → i (UNFAITHFUL: B-features {e,f,g,h} are absent)
Step 3: Why the CLT learned the unfaithful circuit
Two-hop faithful circuit cost (in L1 sparsity terms):
Features needed: {e, f, g, h} at Layer 0 + {i} at Layer 1 = 5 feature activations
Single-hop unfaithful approximation cost:
CLT can approximate i by: a→i and c→i (two cross-layer features each skipping one layer)
Features needed: 2 cross-layer features with decoder weights spanning Layer 0 and Layer 1
If the CLT approximation error is small enough, the loss function REWARDS the unfaithful
decomposition because it achieves lower total L1 cost.
The sparsity penalty incentivizes compression.
Compression, in this case, means erasing intermediate features.
Erasing intermediate features means hiding multi-step computation.
Step 4: Real language model evidence (preliminary)
Lange et al. trained both PLTs and CLTs on real language models and compared their
implied circuits for the same prompts. In several cases, PLTs and CLTs implied
"sharply different circuit-level interpretations for the same behavior."
This is preliminary evidence (not proof) that the toy model failure mode
generalizes to models where ground truth is unknown.
The critical concern: for questions like:
- "Does this model perform intermediate calculations on math problems, or memorize?"
- "Does this model's chain-of-thought actually reflect its internal computation?"
A CLT that collapses multi-hop circuits into single-hop circuits cannot answer these
questions correctly, even if its attribution graph looks plausible.
Why This Design Works, and What It Trades Away
The joint training with cross-layer decoder matrices is the correct engineering decision for reducing attribution graph size. When a feature genuinely participates in cross-layer superposition (multiple consecutive layers reinforcing the same feature direction without meaningful intermediate interaction), collapsing it into a single CLT feature is correct. The circuit is simpler and more accurate. This is the mechanism that made Anthropic's "Biology of a Large Language Model" analysis feasible at scale.
The sparsity penalty design is the tradeoff that creates the faithfulness risk. The L1 norm of feature activations, summed across all layers, penalizes complexity globally. A two-hop circuit with four intermediate features costs more in sparsity than a single-hop approximation with two cross-layer features. When the single-hop approximation is close enough in reconstruction loss, the optimizer chooses it. This is not a training error. It is the objective function working as designed.
The skip transcoder architecture (arXiv:2501.18823, Paulo, Shabalin, Belrose, EleutherAI) is a partial fix. A skip transcoder adds an affine skip connection from the MLP input directly to the MLP output prediction, before the sparse bottleneck. This reduces reconstruction error without affecting interpretability of the sparse features, achieving Pareto improvements over both standard transcoders and SAEs on the interpretability-vs-performance tradeoff. Skip transcoders evaluated on SAEBench show higher auto-interpretability scores than SAEs at matched model sizes, supporting the conclusion (arXiv:2501.18823) that "interpretability researchers should shift their focus away from sparse autoencoders trained on the outputs of MLPs and toward skip transcoders."
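Schematically, a skip transcoder predicts the MLP output as a sparse decoder reconstruction plus an affine function of the MLP input. A minimal sketch following that description (assumed names, not the EleutherAI implementation):
import torch
import torch.nn as nn

class SkipTranscoder(nn.Module):
    """Minimal skip transcoder sketch: sparse features plus an affine skip path."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        # Affine skip connection from the MLP input to the predicted MLP output.
        # It bypasses the sparse bottleneck, so it soaks up linear structure
        # without consuming interpretable features.
        self.skip = nn.Linear(d_model, d_model, bias=False)

    def forward(self, pre_mlp: torch.Tensor):
        feats = torch.relu(self.encoder(pre_mlp))            # sparse feature activations
        prediction = self.decoder(feats) + self.skip(pre_mlp)
        return feats, prediction

# Trained exactly like a plain transcoder: MSE against the true MLP output
# plus a sparsity penalty on `feats`.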
What CLTs trade away:
Guaranteed faithfulness. The replacement model assumption is not verified during training. CLTs can produce circuits that match output behavior while systematically hiding intermediate computational steps. The diagnostic for detecting this (ablating base model components and checking if the attribution graph predicted those ablations would matter) requires access to ground truth or careful experimental design.
Cross-layer decoder matrix scaling. A feature at layer 0 in a 12-layer model has decoder matrices for 12 layers. The parameter count of the CLT decoder grows quadratically with the number of layers. CLT-Forge addresses this with distributed training and compressed activation caching, but training CLTs on large models remains substantially more expensive than training per-layer SAEs.
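The quadratic growth is easy to see with a back-of-the-envelope count. A sketch assuming a fixed feature budget per layer and one d_model-sized decoder vector per feature per target layer:
# Decoder parameter count: CLT vs per-layer transcoders (PLTs).
def clt_decoder_params(n_layers: int, feats_per_layer: int, d_model: int) -> int:
    # A feature encoded at layer l has one decoder vector for each of the
    # (n_layers - l) target layers l, l+1, ..., n_layers-1.
    return sum((n_layers - l) * feats_per_layer * d_model for l in range(n_layers))

def plt_decoder_params(n_layers: int, feats_per_layer: int, d_model: int) -> int:
    # A per-layer transcoder decodes only into its own layer.
    return n_layers * feats_per_layer * d_model

L, F, d = 12, 32_768, 768                        # GPT-2-small-scale example
print(clt_decoder_params(L, F, d))               # ~1.96e9 decoder parameters
print(plt_decoder_params(L, F, d))               # ~3.02e8 decoder parameters
# The CLT decoder is (L + 1) / 2 = 6.5x larger here; the gap widens with depth.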
Technical Moats
The faithfulness diagnostic requires ground truth. The toy model experiment works because the authors built the circuit manually and know what features should appear. In real language models, ground truth is unknown. Verifying CLT faithfulness requires careful ablation studies that check whether model components the CLT claims are irrelevant actually affect outputs when removed from the base model. This is time-consuming, requires interpretability expertise, and cannot be automated at scale without benchmarks that do not yet exist.
JumpReLU CLTs outperform all alternatives at matched sparsity. Neuronpedia's circuits research landscape (August 2025) evaluated JumpReLU, TopK, and ReLU CLTs on GPT-2 and found JumpReLU CLTs produce the best replacement scores. JumpReLU activation (activates only when input exceeds a learned per-feature threshold θ) produces cleaner sparsity distributions than ReLU and more stable training than TopK. The hyperparameter sensitivity is real, but the performance ceiling is higher.
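For reference, the JumpReLU activation in isolation (a minimal sketch; real implementations train the threshold θ with a straight-through gradient estimator, which is omitted here):
import torch

def jumprelu(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: pass the pre-activation through unchanged where it exceeds the
    learned per-feature threshold theta, output zero below it."""
    return x * (x > theta).to(x.dtype)

# Per-feature thresholds: entries below theta are zeroed, entries above pass through
acts = torch.tensor([0.05, 0.4, 1.2, 0.3])
theta = torch.tensor([0.10, 0.10, 0.50, 0.50])
print(jumprelu(acts, theta))   # tensor([0.0000, 0.4000, 1.2000, 0.0000])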
Protein circuit tracing via CLTs (arXiv:2602.12026) demonstrates unexpected generalization. CLTs trained on protein language models recover biological circuit motifs that correspond to known biochemical pathways. This is the strongest evidence that CLTs are learning something structurally meaningful rather than just compressing activations. The same architecture that works for LLM circuit tracing generalizes to biological sequence models, suggesting the cross-layer feature sharing principle reflects something real about how residual stream models organize computation.
Insights
Insight One: The faithfulness problem is not a bug in CLT training. It is a fundamental tension between compression and mechanistic accuracy, and the community has not yet established principled criteria for when to trust a CLT-derived circuit.
The Lange et al. result shows that the same sparsity incentive that makes CLTs compact also makes them capable of learning unfaithful circuits. These are not separable properties. You cannot have maximum compression and guaranteed faithfulness simultaneously under the current training objective. Choosing a lower sparsity coefficient helps but does not eliminate the problem. The community's response should be: treat attribution graphs as hypotheses that require ablation validation, not as ground truth descriptions of model computation. The current practice in many mechanistic interpretability papers is to present attribution graphs without ablation validation. That practice is now known to be insufficient.
Insight Two: The vision transformer application (arXiv:2604.13304) is a harder test than language models, and CLTs partially pass it, which is the most informative single benchmark in the current literature.
"Can Cross-Layer Transcoders Replace Vision Transformer Activations?" applies CLTs to ViTs (Vision Transformers) where the cross-layer structure is different from language models: patch tokens interact spatially, attention patterns are less dominated by positional structure, and MLP layers process spatially-distributed features. CLTs trained on ViTs achieve competitive replacement scores, meaning the cross-layer feature sharing principle is not specific to the sequential, causal structure of autoregressive language models. This matters because it provides evidence that CLTs are discovering something real about multi-layer residual networks in general, not just exploiting properties of text data or causal masking. The partial success and the documented failure modes in the ViT setting are both informative: CLTs capture some of the cross-layer structure but miss spatial feature interactions that are unique to vision models.
Takeaway
Transcoders demonstrably outperform Sparse Autoencoders for mechanistic interpretability, and the community's continued focus on SAEs as the primary interpretability tool is not scientifically justified.
Paulo, Shabalin, and Belrose (arXiv:2501.18823, EleutherAI) evaluated transcoders and skip transcoders against SAEs across diverse models up to 2B parameters using SAEBench. Skip transcoders Pareto-dominate SAEs: they achieve lower reconstruction error AND higher automated interpretability scores simultaneously at all tested sizes (32,768, 65,536, and 131,072 latents). The improvement is not marginal. The paper explicitly concludes that "interpretability researchers should shift their focus away from sparse autoencoders trained on the outputs of MLPs and toward skip transcoders."
The reason transcoders outperform SAEs is structural. SAEs decompose intermediate activations, which are a mixture of past MLP outputs and unprocessed inputs to the current MLP. Transcoders model the functional behavior of the MLP itself, which is what researchers actually care about when asking "what is this computation doing." Despite this, SAEs dominate the interpretability literature because they are easier to train, have a longer research history, and the transcoder faithfulness concerns create legitimate uncertainty about when to trust their circuits. The correct response is not to return to SAEs. It is to use skip transcoders while developing better faithfulness diagnostics for CLTs.
TL;DR For Engineers
A Cross-Layer Transcoder (CLT) replaces each MLP (Multilayer Perceptron) with a sparse bottleneck where a feature encoded at layer ℓ can decode to any downstream layer from ℓ through L. All encoders and decoders are trained jointly under a summed reconstruction plus sparsity loss.
The faithfulness problem: the joint sparsity penalty incentivizes compressing multi-hop circuits A → B → C into single-hop approximations A → C, producing attribution graphs that match behavior while hiding intermediate computation B. This was proved on a Boolean toy model with known ground truth (Lange et al., 2026).
Practical implication: treat CLT attribution graphs as hypotheses, not ground truth. Validate causal claims by ablating components in the base model and verifying that the attribution graph predicted those ablations would matter.
Skip transcoders (arXiv:2501.18823) Pareto-dominate SAEs on reconstruction error and interpretability simultaneously. Use skip transcoders instead of SAEs for single-layer MLP interpretability. Use CLTs for cross-layer circuit analysis, with faithfulness validation.
Toolchain: circuit-tracer (Anthropic, attribution graph computation and intervention), CLT-Forge (arXiv:2603.21014, scalable CLT training with distributed sharding and compressed activation caching), crosscode (model diffing and cross-layer features), transcoder_circuits (original Dunefsky et al. code).
The Attribution Graph Is a Hypothesis, Not a Description
CLTs are the best available tool for mechanistic interpretability of residual stream models at scale. The circuits they produce are compact, partially validated, and generalize across model types (language models, protein models, vision transformers). They enabled the most ambitious mechanistic interpretability project in the field's history. They are also capable of producing circuits that look correct while hiding the actual computation.
The correct response is not to abandon CLTs. It is to stop treating their output as ground truth. Every attribution graph is a claim about how a model computes, not a verified description of it. The claim needs to be tested by ablating the base model components the graph implies are relevant, and checking whether the graph predicted those ablations would matter. The mechanistic interpretability community has the tools to do this. The field's credibility depends on making it standard practice.
References
Cross-Layer Transcoders are Incentivized to Learn Unfaithful Circuits, Lange, RGRGRG, Dearstyne, Maher, February 2026
CLT-Forge: A Scalable Library for CLTs and Attribution Graphs, arXiv:2603.21014, Draye et al., Max Planck / Vector Institute, 2026
Transcoders Beat Sparse Autoencoders for Interpretability, arXiv:2501.18823, Paulo, Shabalin, Belrose, EleutherAI, February 2025
circuit-tracer GitHub Repository, Anthropic safety research
crosscode GitHub Repository, Oli Clive-Griffin
transcoder_circuits GitHub Repository, Dunefsky et al.
Transcoders enable fine-grained interpretable circuit analysis for language models, Dunefsky et al., 2024
Sparse Crosscoders for Cross-Layer Features and Model Diffing, Anthropic Transformer Circuits, October 2024
Circuit Tracing: Revealing Computational Graphs in Language Models, Ameisen et al., Anthropic, 2025
The Circuits Research Landscape: Results and Perspectives, Neuronpedia, August 2025
Sparsely-connected Cross-layer Transcoders, LessWrong, June 2025
Cross-Layer Transcoders (CLTs), introduced by Anthropic in 2025, are sparse neural networks that replace each MLP (Multilayer Perceptron) layer with a shared feature basis where a feature encoded at layer ℓ can contribute to reconstructions at all downstream layers through jointly trained decoder matrices. CLTs compress attribution graphs by collapsing cross-layer superposition into single features, enabling scalable mechanistic interpretability. However, Lange et al. (2026) proved on a Boolean toy model with known ground truth that the joint sparsity penalty can incentivize CLTs to rewrite multi-hop circuits A → B → C as single-hop approximations A → C, producing attribution graphs that match behavior while hiding intermediate computation B. Skip transcoders (Paulo et al., arXiv:2501.18823) Pareto-dominate SAEs on reconstruction error and interpretability simultaneously. The practical implication: CLT attribution graphs are hypotheses that require ablation validation against the base model, not ground truth descriptions of model computation.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad; it helps us keep building and delivering value 🚀
