Applied AI Insights

AI Inference: Cloud, Edge, and On-Device

Key Questions

What major cloud deals are shaping AI inference consumption?

Snowflake's $6B AWS deal, together with a 38% stock surge after earnings, reinforces the consumption-based model. AWS SageMaker now offers an OpenAI-compatible API.

Which new chips target high-speed on-device inference?

SambaNova SN50 delivers 600-700 tokens per second, while AMD Ryzen AI Halo and NVIDIA RTX Spark feature 128GB unified memory. XCENA raised $135M for its MX1 chip.
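To put the 600-700 tokens-per-second figure in perspective, the short sketch below converts a decode rate into wall-clock time for a single response. The 1,000-token response length is a hypothetical value chosen for illustration, and the calculation assumes a steady decode rate and ignores prompt processing and network latency.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response length, not a figure from the article.
RESPONSE_TOKENS = 1000

# The two ends of the throughput range quoted for the SN50.
for rate in (600, 700):
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{RESPONSE_TOKENS} tokens at {rate} tok/s: {seconds:.2f} s")
```

Under these assumptions, a 1,000-token answer takes roughly 1.4 to 1.7 seconds to generate.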

How do open models compare for laptop versus large-scale deployment?

Nemotron 3 Ultra (550B MoE, 1M context) contrasts with the laptop-friendly Gemma 4 12B. Research like dMoE reduces memory usage by 77-80% for efficient inference.
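A rough weight-memory estimate shows why the two models land on such different hardware. The sketch below multiplies parameter count by bytes per parameter and compares the result with the 128GB unified memory of the new PC chips. The bit widths are assumed quantization levels rather than published configurations. The estimate covers weights only and ignores the KV cache and activations. It also assumes that all experts of an MoE model stay resident in memory, even though only a subset is active per token.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9


def fits(params_billions: float, bits_per_param: int, budget_gb: float) -> bool:
    """True if the weights alone fit inside the memory budget."""
    return weight_memory_gb(params_billions, bits_per_param) <= budget_gb


# Memory budget taken from the 128GB unified memory figure in the article.
UNIFIED_MEMORY_GB = 128

# Parameter counts (in billions) as quoted in the article.
MODELS = {"Gemma 4 12B": 12, "Nemotron 3 Ultra": 550}

# Assumed precisions for illustration: 16-bit, 8-bit, 4-bit.
for name, size in MODELS.items():
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size, bits)
        ok = fits(size, bits, UNIFIED_MEMORY_GB)
        print(f"{name} @ {bits}-bit: {gb:.0f} GB, fits in 128 GB: {ok}")
```

Under these assumptions, the 12B model needs about 24 GB at 16-bit and fits comfortably, while the 550B model needs about 275 GB even at 4-bit and exceeds a single 128GB device.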

Cloud and research: Snowflake's $6B AWS deal and its 38% stock surge after earnings reinforce the consumption-based model. AWS SageMaker now offers an OpenAI-compatible API. SambaNova's SN50 chips deliver 600-700 tokens per second, and XCENA raised $135M for its MX1 chip. On the research side, the dMoE paper reduces memory usage by 77-80%, and the VaSE paper proposes value-aware stochastic KV cache eviction to address a bottleneck in reasoning models. New: Perplexity orchestrates hybrid PC-cloud AI inference, a practical deployment pattern.

On-device: Computex 2026 pushes local AI. AMD's Ryzen AI Halo and Max PRO 400 come with 128GB unified memory. NVIDIA's RTX Spark (N1X) pairs 128GB unified memory with 1 petaflop of compute, targets creative apps and AI agents, and arrives in fall 2026. Adobe offers offline generative AI. Gemma 4 12B, an open multimodal model, is on Kaggle, with a new demo showing coding and generative art. Sebastian Raschka highlights 4 new open-weight LLMs for consumer hardware. Microsoft and NVIDIA provide tools for personal AI agents on Windows PCs.

New comparison: Nemotron 3 Ultra (550B MoE, 1M context) versus Gemma 4 12B (laptop-friendly) highlights the divergence in open model deployment.

Updated Jun 7, 2026