Inference Optimization, Costs & Economics

Key Questions

What new papers address inference optimization?

Recent papers include LongAttnComp, ChartArena, κ-SwiGLU, VaSE, TRON, and OmniOPD; OmniOPD reports a 28.64% gain on math tasks. These works focus on efficiency gains in attention and reasoning.

Which models emphasize low-cost inference?

StepFun's Step 3.7 Flash, a 198B MoE VLM, is priced at $0.20 per million input tokens. VibeThinker-3B shows that small models can be efficient, and DiffusionGemma reports a 4x speedup.
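To put the $0.20 per million input tokens figure in context, here is a minimal sketch of the arithmetic. The workload numbers (50,000 requests averaging 2,000 input tokens) and the function name `input_cost_usd` are hypothetical, chosen only for illustration. Output-token pricing is not given in this digest, so the estimate covers the input side only.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float = 0.20) -> float:
    """Input-side cost in USD at a flat per-million-token price.

    The default of 0.20 is the quoted input price; output tokens are
    not included because no output price is stated in the digest.
    """
    return input_tokens / 1_000_000 * price_per_million_usd


# Hypothetical workload (assumption, not from the digest).
requests = 50_000
avg_input_tokens = 2_000
total_tokens = requests * avg_input_tokens

print(f"{total_tokens:,} input tokens -> ${input_cost_usd(total_tokens):.2f}")
```

Under these assumptions, 100 million input tokens would cost about $20 on the input side. A real bill would also depend on output tokens and any provider-specific pricing tiers.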

How are companies improving AI hardware efficiency?

DriveNets raised $410 million for Ethernet AI fabric, ZutaCore secured $100 million for liquid cooling, and Flourish raised $500 million for brain-inspired designs. These efforts target reduced power and latency in data centers.

What advancements support test-time computation scaling?

LoopCoder-v2 introduces methods for efficient test-time computation scaling, while LCLMs achieve 16:1 compression ratios. An Azure innovations session also covered RDMA for inference and KV cache optimization.

How is Cloudflare expanding its AI capabilities?

Cloudflare is adding talent from Ensemble AI to focus on model compression and infrastructure. The goal is to lower the cost of running advanced AI models at scale.

New papers: LongAttnComp, ChartArena, κ-SwiGLU, VaSE, TRON, OmniOPD (+28.64% on math). SG-OPD improves on-policy distillation with sign-consistency gating (+1.98 to +7.50 points on math). LoopCoder-v2 covers efficient test-time computation scaling. LLMCodec uses video compression. NF-CoT introduces a new reasoning method.

Models: StepFun Step 3.7 Flash (198B MoE VLM, $0.20/M input tokens). Google Gemma 4 12B. DiffusionGemma (4x faster) and LCLMs (16:1 compression) advance efficiency. VibeThinker-3B shows small-model efficiency. AWS Bedrock adds Claude Opus 4.8.

Hardware and funding: DriveNets $410M for Ethernet AI fabric. ZutaCore $100M for liquid cooling. Flourish $500M for brain-inspired efficiency. Intel rack-scale inference systems.

Infrastructure: Azure innovations session covers RDMA for inference and KV cache optimization. Cloudflare expands its AI infrastructure team with Ensemble AI talent for model compression.

Updated Jun 20, 2026

AI Pulse Digest