Transformer internals: massive activations, attention sinks, and an architecture fix in the wild

Key Questions

What causes brittle failures in current transformer-based LLMs according to the highlight?

Brittle LLM failures are linked to huge activation magnitudes, pre-norm 1/n attention scaling that creates attention sinks, and residual-mixing artifacts affecting pruning and KV-cache behavior.

What is Moonshot AI's Attention-Residuals (AttnRes) approach and its reported benefits?

AttnRes applies depth-wise attention over residuals. Moonshot AI claims roughly 1.25x compute and stability gains, and the architecture fix is still being validated.

What inference mitigations are currently in use for transformer issues?

LookaheadKV and Sleep-Time Compute are in use as inference-time mitigations while architecture fixes are validated. Open questions remain around reproducibility at scale, long contexts, and multimodal stacks.

"Transformers Deconstructed" (Mar 2026) links brittle LLM failures to three internals: huge activation magnitudes, pre-norm 1/n attention scaling that creates attention sinks, and residual-mixing artifacts that break pruning and KV-cache behavior. Moonshot AI's Attention-Residuals (AttnRes), which applies depth-wise attention over residuals, claims ~1.25x compute/stability gains; inference mitigations (LookaheadKV, Sleep-Time Compute) are being used while architecture fixes are validated. Main open questions: reproducibility at scale, long contexts, and multimodal stacks.
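
To make "depth-wise attention over residuals" concrete, here is a minimal sketch for a single token. It is an illustration under stated assumptions, not Moonshot AI's implementation: the function names, the per-layer `query` vector, and the 1/sqrt(d) score scaling are all hypothetical. The idea shown is that a layer takes a softmax-weighted mix of earlier layer outputs instead of the usual unweighted residual sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def plain_residual_sum(layer_outputs):
    """Baseline: the standard residual stream, an unweighted sum over depth."""
    d = len(layer_outputs[0])
    return [sum(h[j] for h in layer_outputs) for j in range(d)]


def depthwise_residual_mix(layer_outputs, query):
    """Softmax-weighted mix of earlier layer outputs for one token.

    layer_outputs: list of vectors (lists of d floats), one per earlier layer.
    query: hypothetical learned vector of length d for the current layer
           (an assumption made for this sketch).
    Returns (mixed_vector, weights), where weights sum to 1 over depth.
    """
    d = len(query)
    # One score per earlier layer: scaled dot product with the query.
    scores = [
        sum(h_j * q_j for h_j, q_j in zip(h, query)) / math.sqrt(d)
        for h in layer_outputs
    ]
    weights = softmax(scores)
    # Convex combination of the earlier layer outputs.
    mixed = [
        sum(w * h[j] for w, h in zip(weights, layer_outputs))
        for j in range(d)
    ]
    return mixed, weights


if __name__ == "__main__":
    outputs = [[1.0, 0.0], [0.0, 1.0], [100.0, 0.0]]  # toy values, not real data
    mixed, weights = depthwise_residual_mix(outputs, query=[0.0, 0.0])
    print("weights over depth:", weights)
    print("attention mix:", mixed)
    print("plain residual sum:", plain_residual_sum(outputs))
```

Because the weights form a convex combination, the mixed vector's norm cannot exceed the largest norm among the inputs, whereas the plain sum can keep growing with depth. Whether this property accounts for the reported gains is not something the highlight states.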

Sources (2)

Updated May 25, 2026

AI Daily Brief

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving