AI & ML Daily Digest

New attention, depth, and routing tricks for next-gen LLMs

Rewiring the Modern Transformer

This cluster tracks a wave of architectural experiments aimed at pushing LLMs beyond vanilla Transformers. Posts cover new attention mechanisms (Mixture-of-Depths/MoDA, attention residuals, attention sinks, directional routing, XSA), efficiency upgrades like FlashAttention variants and Mamba-3 SSMs, and hybrid or modular designs such as Nemotron 3 Super and FineRMoE. Together, these works explore how to scale depth, handle longer and denser contexts, and combine Transformers with state-space models and memory banks for more capable, agentic reasoning systems. The theme is clear: future gains are coming from smarter architectures, not just bigger models.
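Of the techniques named above, Mixture-of-Depths is the most self-contained to illustrate: a learned per-layer router decides which tokens pay for the full block's compute and which skip it via the residual stream. The PyTorch sketch below is a minimal rendering of that idea under our own assumptions; the class name `MoDBlock`, the 12.5% default capacity, and the sigmoid gate are illustrative choices, not code from any of the posts covered here.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Mixture-of-Depths-style routing (illustrative sketch): a scalar
    router picks the top-k tokens to run through the expensive sub-block;
    all other tokens skip it unchanged on the residual path."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)  # one routing score per token
        self.block = block                   # e.g. an attention + MLP sub-block
        self.capacity = capacity             # fraction of tokens processed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                            # (batch, seq_len, d_model)
        k = max(1, int(t * self.capacity))
        scores = self.router(x).squeeze(-1)          # (b, t)
        top = scores.topk(k, dim=-1).indices         # tokens that get compute
        idx = top.unsqueeze(-1).expand(-1, -1, d)    # (b, k, d) gather index
        selected = x.gather(1, idx)
        # Gating by the router score keeps the router on the gradient path.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        processed = selected + gate * self.block(selected)  # residual update
        out = x.clone()                              # skipped tokens pass through
        out.scatter_(1, idx, processed)
        return out

# Toy usage: with 16 tokens and 12.5% capacity, only 2 run through `mlp`.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
layer = MoDBlock(d_model=64, block=mlp)
y = layer(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

The appeal is the compute profile: the heavy sub-block runs on only k of t tokens per layer, so depth can grow without a proportional FLOP increase, which is the same lever several of the routing and depth-scaling posts in this cluster are pulling.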

Sources (18)
Updated Mar 18, 2026