Inference Opts: Quant/Arch + New Models + Scaling

Key Questions

What optimizations improve LLM inference speed and memory use?

Techniques like OScaR INT2 KV cache quantization achieve 3x decoding speedup and 5.3x memory compression with near-lossless accuracy. Speculative decoding methods such as EAGLE-3 and MTP further enhance performance.
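To make the idea of 2-bit KV cache quantization concrete, here is a minimal sketch of generic group-wise asymmetric INT2 quantization. It is not OScaR's algorithm, whose details are not given here; the function names, the group size, and the sample values are all assumptions chosen for illustration.

```python
def quantize_int2(values, group_size=4):
    """Asymmetric 2-bit quantization: one (scale, zero_point) pair per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # 2 bits give 4 levels (codes 0..3), so the range is split into 3 steps.
        # Fall back to 1.0 when the group is constant to avoid dividing by zero.
        scale = (hi - lo) / 3 or 1.0
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_int2(groups):
    """Reconstruct approximate values from the 2-bit codes."""
    return [lo + scale * c for scale, lo, codes in groups for c in codes]


# Hypothetical slice of a key cache.
keys = [0.10, -0.40, 0.75, 0.30, 1.20, 1.00, 0.90, 1.10]
packed = quantize_int2(keys)
restored = dequantize_int2(packed)
max_err = max(abs(a - b) for a, b in zip(keys, restored))
print("max reconstruction error:", max_err)
```

Each stored value shrinks from 16 bits to 2 bits, a nominal 8x reduction; the per-group scale and zero point are extra metadata, so the effective compression of any such scheme is lower than that nominal figure.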

What new models and architectures are highlighted for scaling?

Cohere's Command A+ is a 218B sparse MoE model for agentic workflows that runs on as few as two H100 GPUs, while Gated DeltaNet-2 improves memory editing in linear attention. Gemma MTP and a 122% increase in hyperscaler capacity are also noted.
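A rough back-of-the-envelope check shows why weight precision matters for fitting a 218B-parameter model on two GPUs. The digest does not state which precision Command A+ uses, so the bit widths below are assumptions, as is the 80 GB per-GPU memory figure; KV cache and activation memory are ignored.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 218e9   # total parameters, from the model's stated size
GPU_MEMORY_GB = 80     # assumed memory per H100
NUM_GPUS = 2

budget = GPU_MEMORY_GB * NUM_GPUS
for bits in (16, 8, 4):
    need = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if need <= budget else "does not fit"
    print(f"{bits:>2}-bit weights: {need:.0f} GB, {verdict} in {budget} GB")
```

Under these assumptions only the 4-bit case leaves headroom, which illustrates how closely deployment footprint is tied to quantization. A sparse MoE activates only a fraction of its experts per token, but all expert weights still have to be resident in memory.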

How efficient are AMD MI350 GEMM kernels?

GEMM kernels on the AMD MI350 reach a reported 99% efficiency, that is, near-peak throughput for the matrix multiplications at the core of inference workloads.
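GEMM efficiency is commonly expressed as achieved throughput divided by the hardware's peak throughput. The sketch below assumes that definition; the digest does not say exactly what the 99% figure is measured against. The peak rate and the timing are hypothetical placeholders, not MI350 specifications or measurements.

```python
def gemm_flops(m, n, k):
    """An (m x k) by (k x n) matrix multiply does m*n*k multiply-add pairs."""
    return 2 * m * n * k


def gemm_efficiency(m, n, k, seconds, peak_flops_per_s):
    """Fraction of peak throughput achieved by one timed GEMM."""
    achieved = gemm_flops(m, n, k) / seconds
    return achieved / peak_flops_per_s


# Hypothetical numbers for illustration only.
PEAK = 1.0e15        # assumed peak rate in FLOP/s
MEASURED_S = 1.11e-3  # assumed wall-clock time for the kernel

eff = gemm_efficiency(8192, 8192, 8192, MEASURED_S, PEAK)
print(f"efficiency: {eff:.1%}")
```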

What is CODA and its role in transformer optimization?

CODA rewrites transformer blocks as GEMM-epilogue programs, which enables high-performance kernels. The project drew 88 points on Hacker News.
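The general GEMM-epilogue idea can be sketched in a few lines: instead of writing the matrix product to memory and then running separate passes for bias and activation, the extra work is applied to each accumulator before it is stored. This is only an illustration of that pattern in plain Python, not CODA's implementation or API; all function names here are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def gemm(a, b):
    """Plain matrix multiply that materializes the full output."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def linear_unfused(a, b, bias):
    """Three passes over the output: GEMM, then bias add, then activation."""
    c = gemm(a, b)
    c = [[x + bias[j] for j, x in enumerate(row)] for row in c]
    return [[relu(x) for x in row] for row in c]


def gemm_with_epilogue(a, b, epilogue):
    """One pass: the epilogue runs on each accumulator before it is stored."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = sum(row[p] * b[p][j] for p in range(inner))
            out_row.append(epilogue(acc, j))
        out.append(out_row)
    return out


a = [[1.0, -2.0], [0.5, 3.0]]
b = [[2.0, 0.0], [1.0, -1.0]]
bias = [0.5, -0.25]

unfused = linear_unfused(a, b, bias)
fused = gemm_with_epilogue(a, b, lambda acc, j: relu(acc + bias[j]))
print(fused)
```

On a GPU the benefit comes from avoiding extra round trips to memory for the intermediate output; the pure-Python version only shows that the fused and unfused forms compute the same result.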

What benchmarks show gains from claude-flow quantization?

claude-flow M4 quantization delivers a reported 2.7x performance improvement in the claude-flow/guidance benchmarks, which cover 4 scale points.

How does NVIDIA's 4-bit pretraining validation impact scaling?

NVIDIA validated 4-bit pretraining where a 12B Mamba-Transformer matches larger models, extending microscaling formats beyond inference.
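Microscaling-style formats store a block of low-precision elements together with one shared scale. The following is a simplified sketch of that idea using a power-of-two block scale and the FP4 E2M1 value grid; it is not NVIDIA's training recipe, and the block contents and function names are assumptions for illustration.

```python
import math

# Non-negative magnitudes representable by a 4-bit E2M1 float.
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (shared power-of-two scale, list of FP4 grid values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Pick the scale so the largest magnitude lands in the grid's top binade.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    elems = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        elems.append(math.copysign(nearest, x))
    return scale, elems


def dequantize_block(scale, elems):
    return [scale * e for e in elems]


block = [0.02, -0.7, 1.3, 3.1]   # hypothetical weights or activations
scale, elems = quantize_block(block)
print(scale, dequantize_block(scale, elems))
```

Because the scale is shared per block rather than per tensor, a single outlier only degrades the precision of its own block, which is the main motivation for block-scaled 4-bit formats.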

What are the main components of the LLM inference optimization stack?

The stack covers quantization, speculative decoding, and kernel- and attention-level optimizations such as DashAttention, a differentiable, adaptable sparse hierarchical attention method.
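Speculative decoding, one layer of this stack, can be sketched with two toy models: a cheap draft proposes several tokens, and the target model keeps the longest prefix it agrees with. This is the basic greedy-verification variant only; EAGLE-3 and MTP use their own drafting mechanisms, and both toy "models" below are hypothetical stand-ins.

```python
def target_next(tokens):
    """Toy 'large model': next token is the context sum modulo 7."""
    return sum(tokens) % 7


def draft_next(tokens):
    """Toy 'small model': agrees with the target unless the last token is 3."""
    guess = sum(tokens) % 7
    return (guess + 1) % 7 if tokens[-1] == 3 else guess


def speculative_step(tokens, k=4):
    """Draft k tokens, then keep the prefix the target model agrees with."""
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verification. A real system scores all k positions in one batched pass.
    accepted, ctx = [], list(tokens)
    for t in proposal:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))  # bonus token when every draft matched
    return accepted


def generate(prompt, num_tokens, k=4):
    """Generate with speculation; returns (tokens, number of target steps)."""
    tokens, steps = list(prompt), 0
    while len(tokens) - len(prompt) < num_tokens:
        tokens.extend(speculative_step(tokens, k))
        steps += 1
    return tokens[:len(prompt) + num_tokens], steps


def generate_baseline(prompt, num_tokens):
    """Ordinary one-token-per-step greedy decoding with the target model."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, spec_steps = generate([1, 2], 12)
print(spec_tokens, "in", spec_steps, "verification steps")
```

The output matches plain greedy decoding with the target model exactly; the gain is that each verification step can emit more than one token whenever the draft guesses correctly.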

What status do these inference developments hold?

These developments are tagged "Climaxing," reflecting rapid advances in quantization, architectures, and hardware scaling.

OScaR INT2 KV, speculative decoding (EAGLE-3/MTP), Gated DeltaNet-2, CODA kernels, Cohere Command A+, Gemma MTP, hyperscaler +122% capacity; AMD MI350 GEMM (99% eff.), claude-flow M4 quant (2.7x). Climaxing.

Sources (43)

Updated May 23, 2026

claude-flow/guidance performance benchmarks — 4 scale points, multi- ...

From Naive to Near-Peak: Building High-Performance GEMM Kernels ...

The LLM Inference Optimization Stack: From Quantization to Speculative ...

Gated DeltaNet-2: Better Memory Editing for Linear Attention

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Speculative Decoding Guide

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

arXiv:2605.19660: OScaR, INT2-quantized KV cache achieves 3x decoding speedup | 24 AI

@EliasEskin reposted: In self-distillation, there are various choices for what can be added as privile...

DashAttention: Differentiable and Adaptable Sparse Hierarchical Attention

@EliasEskin: 🚨 AVSD is a new self-distillation method that enables learning from multiple "views" of privileged i...

@EliasEskin reposted: On-Policy Self-Distillation methods usually condition on only one form of privil...

OScaR: The Occam's Razor for Extreme KV Cache Quantization in ...

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

NVIDIA validates 4-bit pretraining: a 12B Mamba-Transformer matches ...

Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O

chenhongyu2048/LLM-inference-optimization-paper

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert ...

InferenceBench Benchmark 2026: 15 aggregate speedup rows

The AI Gateway: Scaling Centralized Inference Across Decentralized ...

Quantifying the Impact of Inference Backends on LLM Reproducibility

5 Hidden Features That Made 671B Models Actually Work

@jeremyphoward reposted: Excited to share our new paper: RoPE Distinguishes Neither Positions Nor Tokens ...

Semidynamics Secures a Strategic Investment to Advance Memory-Centric AI Inference Chips

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and ...

Kill AI Latency: Run Local Inference for Real-Time Code

Advancing Training and Inference Efficiency in Large-Scale Models

LLM fine-tuning with LoRA & QLoRA

SNLP: Layer-Parallel Inference via Structured Newton Corrections

Lossless 7.67× LoRA / 8.35× Full FT speedup for Qwen3.5 on DGX Spark ( ...

MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Post-Trained MoE Can Skip Half Experts via Self-Distillation

@LukeZettlemoyer reposted: MoEs are everywhere, but the design space is confusing: total vs active experts?...

New Power, Memory, Interconnect, and Thermal Architectures for AI Infrastructure at Scale

Zero-copy ML: Doing inference inside the Linux Kernel by Jesper Derehag

People overestimate how confident AI systems are in their responses, experiments reveal

(Podcast) Triple the Speed How Gemma 4 MTP Drafters Are Changing AI Inference

Self-Distillation Enables Continual Learning [pdf]

@_akhaliq reposted: Can fast generative models still be likelihood-based? Excited to share our new ...

@rasbt: New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4. I focu...

Local AI on a budget GPU: Qwen 3.6 35B and 27B tested