Efficiency & reasoning primitives + new evals
Key Questions
What speed improvements does MTP bring to llama.cpp?
Multi-token prediction (MTP) support in llama.cpp raises inference throughput from 65 to 99 tokens per second on H100 GPUs. A full walkthrough covers the upgrade for local deployments.
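As a quick check on those figures, the relative speedup can be worked out directly. This is only arithmetic on the two reported numbers; the function name `speedup` is illustrative and not part of llama.cpp.

```python
def speedup(baseline_tps: float, improved_tps: float) -> float:
    """Return the throughput ratio improved / baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return improved_tps / baseline_tps


ratio = speedup(65, 99)
print(f"{ratio:.2f}x speedup, {100 * (ratio - 1):.0f}% more tokens per second")
# prints: 1.52x speedup, 52% more tokens per second
```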
How fast is Gemma 4 inference on local setups?
Gemma 4 reaches 65 tokens per second in local inference with Ollama, vLLM, and llama.cpp, which makes on-device LLM use practical.
What efficiency gains come from DeepSeek's KV-cache and attention optimizations?
DeepSeek's KV-cache and attention reductions lower memory usage during long-context inference. Combined with Unsloth and mlx Vulkan, these techniques lower overall compute costs.
How does KVBoost accelerate Hugging Face inference?
KVBoost is an open-source tool that reuses the KV cache at chunk level, delivering 5-48x faster time to first token (TTFT) on Hugging Face models.
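The idea of chunk-keyed caching can be illustrated with a toy sketch. This is not KVBoost's API: `ChunkKVCache`, `chunk_key`, `CHUNK_SIZE`, and the placeholder `compute_kv` callback are all hypothetical names. The sketch also ignores that a chunk's real KV values depend on its position and on the tokens before it, which an actual system has to handle.

```python
import hashlib
from typing import Callable, Dict, List

CHUNK_SIZE = 4  # tokens per chunk; an arbitrary value for illustration


def chunk_key(tokens: List[int]) -> str:
    """Hash a chunk's token IDs so identical chunks share a cache key."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()


class ChunkKVCache:
    """Toy cache mapping a chunk hash to a stand-in for its KV tensors."""

    def __init__(self, compute_kv: Callable[[List[int]], list]) -> None:
        self.compute_kv = compute_kv  # hypothetical expensive prefill step
        self.store: Dict[str, list] = {}
        self.hits = 0
        self.misses = 0

    def prefill(self, tokens: List[int]) -> List[list]:
        """Return per-chunk KV entries, computing only chunks not seen before."""
        out = []
        for i in range(0, len(tokens), CHUNK_SIZE):
            chunk = tokens[i:i + CHUNK_SIZE]
            key = chunk_key(chunk)
            if key in self.store:
                self.hits += 1
            else:
                self.misses += 1
                self.store[key] = self.compute_kv(chunk)
            out.append(self.store[key])
        return out


# Placeholder "KV" computation: doubles each token ID.
cache = ChunkKVCache(compute_kv=lambda chunk: [t * 2 for t in chunk])
cache.prefill([1, 2, 3, 4, 5, 6, 7, 8])  # two new chunks: two misses
cache.prefill([9, 9, 9, 9, 5, 6, 7, 8])  # one new chunk, one reused chunk
print(cache.hits, cache.misses)  # prints: 1 3
```

Each cache hit skips the prefill computation for that chunk, which is where a reduction in time to first token would come from.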
What is Gated DeltaNet-2 and its relation to prior architectures?
Gated DeltaNet-2 is a new linear attention architecture that decouples the erase and update operations. It shows strong similarity to RWKV-7's diagonal-plus-low-rank (DPLR) recurrence.
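The source gives no equations, so the following is a sketch under stated assumptions. The first line is the gated delta rule as commonly written for the original Gated DeltaNet. The second is one plausible reading of "decoupled erase and update", with separate strengths $\beta^{e}_t$ and $\beta^{w}_t$; these symbols are assumptions, and the actual Gated DeltaNet-2 parameterization may differ. The third is a generic DPLR transition for comparison.

```latex
\begin{align*}
% Gated delta rule: one coefficient beta_t controls both erase and write
S_t &= \alpha_t \, S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top} \\
% Assumed decoupled variant: separate erase and write strengths
S_t &= \alpha_t \, S_{t-1}\left(I - \beta^{e}_t k_t k_t^{\top}\right) + \beta^{w}_t v_t k_t^{\top} \\
% Generic diagonal-plus-low-rank (DPLR) transition
S_t &= S_{t-1}\left(D_t + a_t b_t^{\top}\right) + v_t k_t^{\top}
\end{align*}
```

Setting $D_t = \alpha_t I$, $a_t = -\alpha_t \beta_t k_t$, and $b_t = k_t$ recovers the transition of the gated delta rule (up to the scaling of the write term), which is why delta-rule models fall inside the DPLR family.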
Which new benchmarks evaluate dynamic memory and agent tasks?
ESI-Bench, Artifact-Bench, OSWorld, and MINTEval assess dynamic memory interference and agent performance. They target verifiable subproblems and real-world terminal tasks.
What fine-tuning approaches improve tiny LLMs for on-device agents?
Curriculum reinforcement learning and LoRA/QLoRA fine-tuning raise tiny LLM accuracy from 46% to 90% on agentic tasks. Google presented results using Function Gemma.
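To show why LoRA suits small on-device models, here is a minimal sketch of its parameter count. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in. The layer size and rank below are hypothetical and are not taken from the Function Gemma results.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 8
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"{lora} of {full} weights trainable ({100 * lora / full:.2f}%)")
# prints: 32768 of 4194304 weights trainable (0.78%)
```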
How do recent papers address reasoning credit assignment in LLMs?
Curriculum reinforcement learning breaks reasoning chains into verifiable subproblems for better credit assignment. This approach improves LLM reasoning transparency and training efficiency.
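The credit-assignment idea can be sketched as one reward per verifiable subproblem rather than a single reward at the end of the chain. This is a toy illustration, not the method of any specific paper; `Subproblem` and `step_rewards` are hypothetical names, and the verifiers are simple string checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subproblem:
    prompt: str
    verify: Callable[[str], bool]  # returns True when the answer is correct


def step_rewards(subproblems: List[Subproblem], answers: List[str]) -> List[float]:
    """Return one reward per subproblem so each step gets its own credit."""
    if len(subproblems) != len(answers):
        raise ValueError("need exactly one answer per subproblem")
    return [1.0 if sp.verify(ans) else 0.0 for sp, ans in zip(subproblems, answers)]


chain = [
    Subproblem("12 * 3", lambda a: a.strip() == "36"),
    Subproblem("36 + 5", lambda a: a.strip() == "41"),
    Subproblem("41 - 1", lambda a: a.strip() == "40"),
]
rewards = step_rewards(chain, ["36", "42", "40"])
print(rewards)  # prints: [1.0, 0.0, 1.0]
```

A single end-of-chain reward would only signal that something went wrong, while the per-step rewards point to the second subproblem as the failing step.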
llama.cpp MTP (65→99 tok/s on H100), Gemma 4 local inference (65 tok/s), DeepSeek KV/attention cuts, Unsloth, mlx Vulkan, KVBoost (5-48x TTFT on HF). Gated DeltaNet-2, a new linear attention architecture (similarity to RWKV-7 noted). ESI-Bench, Artifact-Bench, OSWorld, MINTEval for dynamic memory interference.