LLM Serving and Infrastructure

Key Questions

What optimizations were upstreamed in vLLM × HPC-Ops?

The vLLM × HPC-Ops work upstreamed a load-balanced decode scheduler and a fused MoE pipeline into vLLM, with a reported 2.95x attention speedup and a 24% reduction in time to first token (TTFT) on H20 GPUs.

How do inference chips differ for LLM serving workloads?

An explainer compares Inferentia2, TPU, Groq LPU, and Tenstorrent chips, emphasizing differences in latency profiles, compiler maturity, and workload shape compatibility.

What is the focus of the new load-aware prefill deflection paper?

The paper explores load-aware prefill deflection techniques for disaggregated inference to improve efficiency in distributed LLM serving.
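
As a rough illustration of the general idea (not the paper's algorithm), the sketch below sends each prefill request to the prefill pool unless that pool is overloaded, in which case the request is deflected to the least-loaded decode worker. The `Worker` class, the queued-token load metric, and the `deflect_threshold` value are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    role: str               # "prefill" or "decode"
    queued_tokens: int = 0  # hypothetical load metric: tokens waiting to be processed


def route_prefill(prompt_tokens, workers, deflect_threshold=8192):
    """Pick a worker for a prefill request and record the added load.

    Assumed policy: prefer the least-loaded prefill worker; if that worker
    would exceed deflect_threshold, deflect to the least-loaded decode
    worker, but only when the decode worker is actually less loaded.
    """
    prefill = min((w for w in workers if w.role == "prefill"),
                  key=lambda w: w.queued_tokens)
    decode = min((w for w in workers if w.role == "decode"),
                 key=lambda w: w.queued_tokens)
    overloaded = prefill.queued_tokens + prompt_tokens > deflect_threshold
    if overloaded and decode.queued_tokens < prefill.queued_tokens:
        target = decode
    else:
        target = prefill
    target.queued_tokens += prompt_tokens
    return target
```

With one busy prefill worker and two lightly loaded decode workers, a large prompt would be deflected while a small one would stay on the prefill pool. A production policy would likely also weigh KV-cache transfer cost and decode latency targets, which this sketch ignores.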

How does GKE Inference Gateway enhance LLM serving?

GKE Inference Gateway supports prefix caching and other optimizations that improve throughput for production LLM deployments.
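
To make the prefix-caching idea concrete, here is a minimal sketch of prefix-aware routing: a request goes to the replica that already holds the longest cached prefix of its prompt. This is a generic illustration, not the GKE Inference Gateway API; the function names, the block size, and the in-memory `replica_caches` mapping are assumptions for the example.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Hash each full block of tokens, chaining in the previous hash so that
    every block hash identifies the entire prefix up to that block."""
    hashes, prev = [], b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full_len, block_size):
        block = ",".join(map(str, token_ids[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


def pick_replica(token_ids, replica_caches, block_size=4):
    """Return (replica_name, matched_blocks) for the replica whose cache
    holds the longest matching prefix; ties go to the first replica."""
    hashes = block_hashes(token_ids, block_size)
    best_name, best_len = None, -1
    for name, cached in replica_caches.items():
        matched = 0
        for h in hashes:
            if h not in cached:
                break
            matched += 1
        if matched > best_len:
            best_name, best_len = name, matched
    return best_name, best_len
```

Requests that share a long system prompt would land on the same replica and reuse its cached blocks. A real router would typically combine this score with replica load so that one popular prefix does not overload a single replica.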

What risks are associated with self-hosting LLMs?

Self-hosting carries risks, including model deprecation, and requires careful management of the serving infrastructure to maintain reliability.

What new serving capabilities does TensorRT 11.0 introduce?

TensorRT 11.0 adds multi-device inference support for scaling LLM workloads across heterogeneous hardware setups.

How does AMD Eagle3 improve speculative decoding?

AMD Eagle3 enables speculative decoding on MI350X/MI355X GPUs using Quark FP8 draft quantization for faster inference.
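
For readers new to speculative decoding, the sketch below shows one greedy draft-and-verify step in its simplest form. It is a generic illustration, not Eagle3's drafting mechanism, and it does not model FP8 quantization; `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step; returns the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. A real system
    #    scores all k positions in a single batched forward pass.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched: the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted
```

Under greedy verification the accepted tokens are the same ones the target model would have produced on its own, so the speedup depends on how often the draft model agrees with the target.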

What are common anti-patterns in LLM streaming responses?

AWS Well-Architected guidance (AGENTPERF02-BP04) highlights anti-patterns and KPIs for optimizing TTFT and streaming performance in agentic serving.
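
As a small illustration of how TTFT can be tracked for a streaming response, the sketch below timestamps each token as it arrives and derives TTFT and mean inter-token latency. The function name and the injectable `clock` parameter are assumptions for this example; the KPI definitions in the AWS guidance itself are not reproduced here.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and return
    (ttft_seconds, mean_inter_token_seconds, n_tokens)."""
    start = clock()
    # Record a timestamp as each token is yielded by the stream.
    stamps = [clock() for _ in token_iter]
    if not stamps:
        return None, None, 0
    ttft = stamps[0] - start
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap, len(stamps)
```

Passing a fake `clock` makes the function easy to unit-test, and wrapping a real client's token iterator with it gives per-request numbers that can be aggregated into percentiles.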

DriveNets $410M Series D. Supermicro partners with Arm. NVIDIA Dynamo distributed inference. Intel Crescent Island GPU. Cerebras CEO discusses inference disaggregation. Self-hosting risks and LLM deprecation survival pattern. Production GenAI Architecture video. Andrew Ng course on serving LLMs. NVIDIA DGX Spark review. Supabase raises $500M. General Instinct edge devices. Manus migrates to TiDB. Sentdex flags Nemotron 3 Ultra high-concurrency bug. WebSockets anti-pattern for LLM streaming. CNCF explores Kubernetes AI conformance. Apple WWDC26 distributed inference with MLX. GKE Inference Gateway prefix caching. Fairness-aware scheduling. OpenPcc confidential LLM serving. AMD Helios rack-scale. AI Gateway patterns. Scaling RAG. New serving platform claims 300K+ QPS. vLLM tuning tip. vLLM in 2026 challenges. Practical walkthrough for deploying LLMs on Amazon EKS. Concurrency-aware methodology. Marvell 102.4 Tbps switch silicon. OpsPilot auto-generates K8s scaling policies. llm-d on Red Hat AI. Software aging in GPU LLM serving. Bifrost open-source AI gateway. BOute cost-efficient serving with heterogeneous GPUs. Real-time analytics for AI systems using Apache Doris. ATOMesh. Hiding GPU heterogeneity. Ray Serve on GKE optimization. Semantic caching measurement guide. Shift-Left Performance Engineering for RAG and LLM. Reliable LLM Inference lessons from Databricks. Running a 70B LLM on a single OpenMetal H200. TensorRT 11.0 multi-device inference. Cloud cost comparison guide. Red Hat deployment blueprints for distributed inference. Serverless inference consistency exposes hidden provider decisions. Inference profitability guide provides economic benchmarks. Best AI data pipeline tools compare web scraping tools for LLM pipelines. 
New: load-aware prefill deflection paper; Databricks production guide on Model Serving cost/latency; AMD Eagle3 speculative decoding on MI350X/MI355X; SambaNova disaggregated inference explainer; Runway's capacity controller for GPU reallocation.
Serving infrastructure interview trend confirms industry shift to systems-level LLM knowledge.
SGLang guide on GPU sandboxes with MXFP4-quantized models and RadixAttention.
New article on web-scale RCA methods for GPU-driven LLM systems adds debugging and reliability angle to production serving.
New: Inference chips explainer compares Inferentia2, TPU, Groq LPU, Tenstorrent for LLM serving workloads, emphasizing workload shape and compiler maturity.
New: vLLM × HPC-Ops kernel optimizations upstreamed into vLLM, achieving 2.95x attention speedup and 24% TTFT reduction on H20.
