LLM Inference Optimization Stack

Key Questions

What is DSpark and how does it speed up LLM inference?

DeepSeek's DSpark reports a 60-85% speedup with zero quality loss by combining suffix decay, Markov heads, and load-aware scheduling. Its code has been open-sourced as DeepSpec.

How does OSCAR optimize KV cache in LLM serving?

OSCAR is an open-source, attention-aware 2-bit KV cache quantization scheme integrated with SGLang. Storing the cache at 2 bits per value cuts memory use, with the aim of preserving model quality in long-context inference.
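
As a rough illustration of what a 2-bit KV cache costs and saves, the sketch below applies plain per-group min/max quantization to four levels using NumPy. It is not OSCAR's attention-aware method, which this digest does not describe. The group size of 32, the FP16 baseline, the FP16 scale and offset per group, and all function names are assumptions made for the example.

```python
import numpy as np

def quantize_2bit(x: np.ndarray, group: int = 32):
    """Asymmetric min/max quantization to 4 levels per group (illustrative only)."""
    g = x.astype(np.float32).reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    # 4 levels span 3 steps; fall back to 1.0 for constant groups to avoid dividing by zero
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    codes = np.clip(np.rint((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo, shape):
    """Map 2-bit codes back to approximate float values."""
    return (codes.astype(np.float32) * scale + lo).reshape(shape)

def bytes_per_value(group: int = 32) -> float:
    """Storage cost if codes were bit-packed: 2 bits per value plus
    one FP16 scale and one FP16 offset shared by each group (assumed layout)."""
    return 2 / 8 + (2 + 2) / group

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 128)).astype(np.float16)  # toy (tokens, head_dim) slice

codes, scale, lo = quantize_2bit(kv)          # codes are kept unpacked in uint8 here
restored = dequantize_2bit(codes, scale, lo, kv.shape)

max_err = float(np.abs(restored - kv.astype(np.float32)).max())
ratio = 2.0 / bytes_per_value()               # FP16 baseline: 2 bytes per value
print(f"max abs error: {max_err:.3f}, compression vs FP16: {ratio:.2f}x")
```

Under these assumptions the per-group metadata matters: the nominal 8x saving from 16 bits to 2 bits drops to about 5.3x once a scale and offset are stored for every 32 values. Larger groups recover more of the saving at the cost of coarser quantization.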

What throughput gains does MBALL deliver through tiering?

MBALL reports a 60% throughput gain by tiering memory across VRAM, DRAM, and NVMe during inference.
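
The general idea behind VRAM/DRAM/NVMe tiering can be sketched as a multi-level LRU store: recently used cache blocks stay in the fastest tier, and overflow is demoted one tier at a time. This is a toy model under assumed names and capacities (`TieredCache`, entry counts rather than bytes), not MBALL's actual policy, which is not described here.

```python
from collections import OrderedDict

class TieredCache:
    """Toy multi-tier LRU store: hot entries in the fastest tier, overflow demoted."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_where_found); a hit promotes the entry to the fastest tier."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)
                return value, name
        return None, None

    def where(self, key):
        """Report which tier currently holds the key, without touching recency."""
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None

    def _remove(self, key):
        for _, _, store in self.tiers:
            store.pop(key, None)

    def _insert(self, level, key, value):
        _, cap, store = self.tiers[level]
        store[key] = value
        if len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used entry
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)  # demote one tier down
            # entries pushed off the last tier are dropped and would be recomputed

cache = TieredCache([("vram", 2), ("dram", 2), ("nvme", 4)])
for i in range(5):
    cache.put(f"block{i}", i)

print(cache.where("block0"), cache.where("block1"), cache.where("block4"))
value, found_in = cache.get("block0")   # hit in the slowest tier, then promoted
print(value, found_in, cache.where("block0"))
```

A real system would size tiers in bytes, move data asynchronously, and prefetch ahead of the decode step; the sketch only shows the promotion and demotion bookkeeping.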

How does FlexLLM enable co-serving of inference and finetuning?

FlexLLM (NSDI '26) supports token-level co-serving, allowing simultaneous inference and finetuning workloads on the same infrastructure.
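
One way to picture token-level co-serving is a per-iteration token budget: latency-sensitive inference tokens are scheduled first, and finetuning tokens fill whatever slack remains in the batch. The sketch below is a simplified model under that assumption; the function names, the fixed budget, and the strict priority rule are hypothetical and are not taken from FlexLLM.

```python
from dataclasses import dataclass

@dataclass
class Iteration:
    inference_tokens: int
    finetune_tokens: int

def plan_iteration(pending_inference: int, pending_finetune: int, token_budget: int) -> Iteration:
    """Fill one batch: inference tokens first, finetuning tokens in the leftover budget."""
    inf = min(pending_inference, token_budget)
    ft = min(pending_finetune, token_budget - inf)
    return Iteration(inf, ft)

def simulate(inference_arrivals, finetune_backlog: int, token_budget: int):
    """Run one plan per arrival step; return the plans and the unfinished finetuning tokens."""
    plans = []
    queued = 0
    for arriving in inference_arrivals:
        queued += arriving
        plan = plan_iteration(queued, finetune_backlog, token_budget)
        queued -= plan.inference_tokens
        finetune_backlog -= plan.finetune_tokens
        plans.append(plan)
    return plans, finetune_backlog

plans, remaining = simulate([100, 300, 0], finetune_backlog=400, token_budget=256)
for step, plan in enumerate(plans):
    print(step, plan.inference_tokens, plan.finetune_tokens)
print("finetune tokens left:", remaining)
```

In this run the finetuning job makes progress in the first and third iterations and yields entirely during the inference burst in the second, which is the behavior token-level sharing is meant to enable on shared hardware.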

What are the key features of NVIDIA Nemotron 3 Ultra?

NVIDIA Nemotron 3 Ultra combines a hybrid Mamba-Transformer architecture, LatentMoE, and NVFP4 to deliver a reported 5x throughput gain, although a high-concurrency bug has been flagged.

OSCAR open-sourced 2-bit attention-aware KV cache quantization integrated with SGLang. MBALL achieved a 60% throughput gain via VRAM/DRAM/NVMe tiering. FlexLLM (NSDI '26) enables token-level co-serving of inference and finetuning. Tether AI open-sourced TurboQuant, which reduces KV cache size by 5x. Perplexity AI's split-compute approach achieves a 60% cost reduction. LMCache provides tiered KV caching for long-context workloads. NVIDIA Nemotron 3 Ultra combines a hybrid Mamba-Transformer architecture, LatentMoE, and NVFP4 for a reported 5x throughput gain, but a high-concurrency bug has been flagged. General Instinct compresses a 245GB MoE model to a 48GB GGUF for edge deployment. Code2LoRA generates repo-specific LoRAs.

New: fairness-aware scheduling, a survey of memory-per-token optimizations, productionizing TurboQuant on AMD GPUs, a concurrency-aware methodology, a practical guide to optimizing and serving LLMs, and a deep dive on FlashAttention. Also: GPU utilization strategies, a prefill vs decode primer, and an AI inference engineering guide.

New: SMEPilot CPU inference, Luce-org/lucebox-hub fast speculative inference, and P-EAGLE managed speculative decoding on SageMaker, along with a comprehensive overview of LLM inference optimization. AMD's Mext acquisition signals memory tiering innovation for MoE serving. A video on the LLM architecture revolution highlights Key Value Sharing and MHC. Other items cover ATOMesh, which unlocks AMD hardware, as well as hiding GPU heterogeneity and latency prediction for NPU systems.

Latest: a practical video on cutting AI inference costs with vLLM, a piece relating KV cache optimization to classic caching patterns, a semantic caching measurement guide, Ray Serve on GKE with a reported 5x throughput gain, RAG data hygiene, a FinOps playbook, a speculative decoding playbook, a cache-resident CPU inference paper, a deep dive on KV cache memory tradeoffs for CPU inference, and TensorRT 11.0 multi-device inference. PersistentKV introduces page-aware decode scheduling. Red Hat has published deployment blueprints for distributed inference. A piece on serverless inference consistency exposes hidden provider decisions. An inference profitability guide provides economic benchmarks.

New: a load-aware prefill deflection paper; AMD Eagle3 speculative decoding on MI350X/MI355X with Quark FP8 draft quantization; and the HOLA paper, which pairs a compressive recurrent state with a small exact memory for linear attention. DeepSeek's DSpark achieves a 60-85% speedup with zero quality loss via suffix decay, Markov heads, and load-aware scheduling, with code open-sourced as DeepSpec. New: vLLM × HPC-Ops upstreamed a load-balanced decode scheduler and a fused MoE pipeline, achieving a 2.95x attention speedup and a 24% TTFT reduction on H20. A new inference chips explainer compares Inferentia2, TPU, Groq LPU, and Tenstorrent for LLM serving workloads.
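
Several items above concern speculative decoding (P-EAGLE, AMD Eagle3, the speculative decoding playbook). A minimal sketch of the greedy verification step that makes such schemes lossless follows: a cheap draft proposes tokens, and the target model keeps them only while they match its own next-token choice. The function names and the toy target are hypothetical, and none of this reflects the specific draft designs named above. Real systems score all draft positions in a single target forward pass; the loop below calls the target sequentially only for clarity.

```python
from typing import Callable, List, Sequence

def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification: accept draft tokens while they equal the target's choice.

    The returned tokens are exactly what greedy decoding with the target alone
    would have produced, so output quality is unchanged.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)      # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # every draft token matched: one bonus token
    return accepted

def toy_target(context: List[int]) -> int:
    """Stand-in for a target model: always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10

print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # draft diverges at the third token
print(verify_draft([1, 2], [3, 4], toy_target))        # draft fully accepted
```

Each call emits at least one token and at most `len(draft) + 1`, so the speedup depends on how often the draft agrees with the target.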

Sources (6)