On-device inference and quantized models make local prototyping viable
Key Questions
Why is Gemma 4 notable on Hugging Face?
Gemma 4 tops the Hugging Face rankings and runs on phones without an internet connection, enabling offline agentic demos. The family spans 2B to 31B parameters (MoE), reaching roughly 40 tokens per second via Ollama.
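As a quick sanity check on that throughput figure, here is a minimal sketch of calling a locally served model through Ollama's REST API; the model tag `gemma4` is an assumption, so substitute whatever tag `ollama list` shows after pulling the model.

```python
# Minimal sketch: query a locally served model via Ollama's REST API
# and measure generation throughput. The tag "gemma4" is an assumption.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4",  # hypothetical tag; check `ollama list`
        "prompt": "Summarize MoE routing in one sentence.",
        "stream": False,
    },
    timeout=120,
)
data = resp.json()
print(data["response"])

# Ollama reports eval_count (tokens generated) and eval_duration
# (nanoseconds), so the ~40 tokens/sec claim can be checked directly.
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tps:.1f} tokens/sec")
```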
Which open models match closed-model performance?
MiniMax M2.7, Kimi 2.5, and Qwen 3.6 deliver 75-80% of closed-model performance at roughly a tenth of the cost, and that efficiency is driving explosive adoption.
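To make the value claim concrete, a back-of-envelope comparison under the figures quoted above; the absolute prices are illustrative assumptions, not published rates.

```python
# Quality-per-dollar under the quoted figures (75-80% of closed-model
# quality at ~1/10th the price). Prices below are assumed for illustration.
closed_quality, closed_price = 1.00, 10.00   # normalized quality, $ per 1M tokens (assumed)
open_quality, open_price = 0.775, 1.00       # midpoint of 75-80%, 1/10th the price

value_ratio = (open_quality / open_price) / (closed_quality / closed_price)
print(f"quality per dollar: {value_ratio:.2f}x the closed model")  # ~7.75x
```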
What hardware supports local inference?
Tinybox, NVIDIA, AWS, Groq, and sllm's GPU sharing enable efficient local runs, while TurboQuant, Nemotron, and Mistral Small 4 advance quantized-model optimization.
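For intuition on what quantization buys, here is a minimal sketch of symmetric int8 weight quantization in numpy; the actual algorithms in tools like TurboQuant are more sophisticated, so treat this as an illustration of the core idea only.

```python
# Symmetric per-tensor int8 quantization: one float scale maps the
# largest-magnitude weight to 127, shrinking storage 4x vs float32.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Quantize float weights to int8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```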
Can Gemma 4 run locally on iPhone?
Yes, Gemma 4 performs local inference on iPhones, sparking discussion of a "zero-token" era. It supports deployments from mobile edge devices up to workstations under broad licensing.
What is Qwen-3.6-Plus's achievement?
Qwen-3.6-Plus is the first model to process over 1T tokens in a single day, underscoring the push toward efficient local deployment as LLM bills rise.
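The headline figure implies a striking sustained rate; the arithmetic is simple:

```python
# What "1T tokens in a day" implies for sustained throughput.
tokens = 1e12
seconds_per_day = 86_400
print(f"{tokens / seconds_per_day:,.0f} tokens/sec sustained")  # ~11.6 million
```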
What is George Hotz's tinybox initiative?
George Hotz aims to build a $100 AI box for running models locally. It targets affordable on-device inference for broader accessibility.
How does sllm help with GPU costs?
sllm splits GPU costs via a cohort sharing model, providing access to large models like DeepSeek V3. It makes local prototyping more viable.
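Illustrative cohort math, assuming a hypothetical hourly GPU rate and cohort size rather than sllm's published pricing:

```python
# Hypothetical cohort split: the rate and cohort size are assumptions.
gpu_hourly_rate = 2.50   # $ per GPU-hour (assumed)
cohort_size = 10         # members sharing the GPU (assumed)
print(f"${gpu_hourly_rate / cohort_size:.2f}/hr per member")  # $0.25/hr
```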
What licensing does Gemma 4 have?
Google launched Gemma 4 under the permissive Apache 2.0 license. The 31B model ranks #3 globally and runs on local hardware.
In short: Gemma 4 hits #1 on Hugging Face with iPhone and offline agentic demos (2-31B MoE, ~40 tps via Ollama); MiniMax M2.7, Kimi 2.5, and Qwen 3.6 reach 75-80% of closed-model performance at a tenth of the cost, with adoption exploding; sllm GPU sharing, TurboQuant, Nemotron, and Mistral Small 4 extend quantized local inference; Tinybox, NVIDIA, AWS, and Groq supply the hardware; and rising LLM bills keep pushing efficient deployments.