LLM Insight Tracker

Inference Opts: Quant/Arch + New Models + Scaling

Key Questions

What optimizations improve LLM inference speed and memory use?

Techniques like OScaR INT2 KV cache quantization achieve 3x decoding speedup and 5.3x memory compression with near-lossless accuracy. Speculative decoding methods such as EAGLE-3 and MTP further enhance performance.
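
To make the memory side of low-bit KV cache quantization concrete, here is a minimal sketch of generic group-wise asymmetric 2-bit quantization. It is not OScaR's algorithm, whose details are not given here. The group size of 32, the FP16 scale and minimum stored per group, and all function names are assumptions chosen for illustration.

```python
def quantize_group(values, bits=2):
    """Asymmetric uniform quantization of one group of floats to `bits` bits."""
    levels = (1 << bits) - 1                      # 3 steps, i.e. 4 codes, for INT2
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest code in [0, levels].
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


def compression_ratio(bits=2, group_size=32, meta_bits=32, baseline_bits=16):
    """Baseline bits per value divided by quantized bits per value.

    meta_bits is the per-group overhead (assumed: FP16 scale + FP16 minimum).
    """
    return baseline_bits / (bits + meta_bits / group_size)


group = [0.10, -0.42, 0.37, 0.05, -0.20, 0.31, -0.08, 0.22]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(worst, 4), round(compression_ratio(), 2))
```

Under these assumed parameters each value costs 2 + 32/32 = 3 bits against 16 for FP16, a ratio of about 5.3x. Real schemes differ in how they pick groups, handle outliers, and store metadata, so the match with the figure above should not be read as a description of OScaR.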

What new models and architectures are highlighted for scaling?

Cohere's Command A+ is a 218B sparse MoE model runnable on two H100 GPUs, while Gated DeltaNet-2 improves linear attention memory editing. Gemma MTP and hyperscaler capacity increases of 122% are also noted.
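
A quick back-of-the-envelope check shows what fitting 218B parameters on two GPUs implies for weight precision. This sketch assumes 80 GB of memory per H100 and counts weights only, ignoring KV cache, activations, and runtime overhead. The precision Command A+ is actually served at is not stated here.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params, bits_per_weight, num_gpus=2, gb_per_gpu=80):
    """True if the weights alone fit in the aggregate GPU memory (assumed sizes)."""
    return weight_memory_gb(num_params, bits_per_weight) <= num_gpus * gb_per_gpu


PARAMS = 218e9  # total parameter count quoted above

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(PARAMS, bits):6.1f} GB, "
          f"fits on 2 x 80 GB: {fits(PARAMS, bits)}")
```

Note that sparsity in a MoE model reduces compute per token, not the memory needed to hold all experts, so the total parameter count is the right input for this estimate.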

How efficient are AMD MI350 GEMM kernels?

AMD MI350 achieves 99% efficiency in GEMM operations, demonstrating near-peak performance for inference workloads.
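
For reference, GEMM efficiency is usually reported as achieved throughput divided by the hardware's peak throughput, with an M x N x K matrix multiply counted as 2·M·N·K floating-point operations. The sketch below shows that calculation. The matrix size, timing, and peak rate are hypothetical placeholders, not MI350 measurements.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in C = A @ B (one multiply and one add per term)."""
    return 2 * m * n * k


def gemm_efficiency(m, n, k, seconds, peak_flops_per_s):
    """Fraction of peak throughput achieved by a GEMM that took `seconds`."""
    achieved = gemm_flops(m, n, k) / seconds
    return achieved / peak_flops_per_s


# Hypothetical numbers, chosen only to exercise the formula.
eff = gemm_efficiency(4096, 4096, 4096, seconds=1.0e-3, peak_flops_per_s=1.5e14)
print(f"efficiency: {eff:.1%}")
```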

What is CODA and its role in transformer optimization?

CODA rewrites transformer blocks as GEMM-epilogue programs, which enables high-performance fused kernels. The project drew 88 points on Hacker News.
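
The idea of a GEMM epilogue can be illustrated in a few lines: instead of writing the matrix product to memory and then running bias and activation as separate passes, the extra work is applied to each accumulator just before it is stored. This is a generic illustration in plain Python, not CODA's programming model. The function names and the bias-plus-ReLU epilogue are assumptions.

```python
def gemm_with_epilogue(a, b, epilogue):
    """Compute epilogue(A @ B) with the epilogue applied per output element."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            # Epilogue runs while the accumulator is still "in registers",
            # so no intermediate matrix is written out and read back.
            out[i][j] = epilogue(acc, i, j)
    return out


def bias_relu(bias):
    """Build an epilogue that adds a per-column bias and applies ReLU."""
    def epilogue(acc, i, j):
        return max(0.0, acc + bias[j])
    return epilogue


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
print(gemm_with_epilogue(a, b, bias_relu([-2.0, 0.5])))
```

On a GPU the payoff comes from saved memory traffic and kernel launches. The Python loop only shows where the fused step sits.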

What benchmarks show gains from claude-flow quantization?

claude-flow M4 quantization delivers 2.7x performance improvements across multiple scale points in guidance benchmarks.

How does NVIDIA's 4-bit pretraining validation impact scaling?

NVIDIA validated 4-bit pretraining where a 12B Mamba-Transformer matches larger models, extending microscaling formats beyond inference.
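
As background on what a microscaling 4-bit format looks like, the sketch below quantizes a block of values to FP4 (E2M1) magnitudes with one shared power-of-two scale per block, in the spirit of the MX formats. It is a simplified illustration, not NVIDIA's training recipe. The scale-selection rule, nearest-value rounding, and function names are assumptions, and real formats fix the block size and scale encoding.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign stored separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest power of two in the grid (4.0 = 2**2)


def quantize_block(block):
    """Return (shared_exponent, quantized elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Choose a power-of-two scale so the largest value lands near the grid's top.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elems = []
    for v in block:
        # Round the scaled magnitude to the nearest grid point (clamps at 6.0).
        mag = min(E2M1, key=lambda g: abs(abs(v) / scale - g))
        elems.append(math.copysign(mag, v))
    return shared_exp, elems


def dequantize_block(shared_exp, elems):
    """Reconstruct approximate floats from the shared exponent and elements."""
    return [e * 2.0 ** shared_exp for e in elems]


exp, q = quantize_block([12.0, -1.0, 5.0, 0.3])
print(exp, q, dequantize_block(exp, q))
```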

What are the main components of the LLM inference optimization stack?

The stack covers quantization, speculative decoding, and kernel optimizations like DashAttention for sparse hierarchical attention.
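
The speculative decoding layer of the stack can be sketched as a draft-and-verify loop: a cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with plus one token of its own. The version below is the basic greedy scheme with toy stand-in models, not EAGLE-3 or MTP. `draft_next`, `target_next`, and the toy token rules are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy draft-and-verify step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each position. A real system scores all
    #    k positions in a single batched forward pass.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes a bonus token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next))
print(speculative_step([5], draft_next, target_next))
```

Because every accepted token is one the target would have produced greedily, the output matches target-only decoding. The speedup depends on how often the draft agrees with the target.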

What status do these inference developments hold?

The tracker marks this inference-optimization highlight as "Climaxing," reflecting rapid advances in quantization, architectures, and hardware scaling.

OScaR INT2 KV, speculative decoding (EAGLE-3/MTP), Gated DeltaNet-2, CODA kernels, Cohere Command A+, Gemma MTP, hyperscaler +122% capacity; AMD MI350 GEMM (99% eff.), claude-flow M4 quant (2.7x). Climaxing.

Sources (43)
Updated May 23, 2026