On‑device momentum: Gemma 4 + INT4/TurboQuant + Qualcomm MX + NVIDIA/AMD/Jetson/RTX + Mac/eGPU/mini PC/$500 laptops + llama.cpp + Docker + Ollama/OpenClaw + Anthropic pivot + MLX/Google AI Edge/iPhone + ExecuTorch + Meta Avocado/Mango + Clarifai
Key Questions
What is Gemma 4 and its significance for developers?
Gemma 4 is a fully open-source model family released under Apache 2.0, positioned as a stripped-down, open-weight counterpart to Google Gemini. It enables local deployments and is part of the broader momentum behind small models and Mixture of Experts (MoE) architectures for developers.
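As a quick illustration, here is a minimal local-inference sketch using the official `ollama` Python client; the `gemma4` model tag is an assumption, so substitute whatever tag Ollama's library actually publishes.

```python
# Minimal local-inference sketch with the official `ollama` Python client.
# Assumes the Ollama daemon is running locally and that a "gemma4" tag
# exists in its model library (hypothetical; use the published tag).
import ollama

ollama.pull("gemma4")  # download the quantized weights once

response = ollama.chat(
    model="gemma4",
    messages=[{"role": "user", "content": "Summarize MoE routing in one sentence."}],
)
print(response["message"]["content"])
```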
How does Qualcomm Matrix Extension boost Llama models?
The Qualcomm Matrix Extension (MX) adds dedicated matrix-multiply instructions to Qualcomm mobile CPUs, accelerating Llama-family inference for on-device AI workloads directly on the CPU.
What is AWQ quantization and why is it popular?
AWQ (Activation-aware Weight Quantization) has become the de facto default INT4 quantization method, letting teams deploy LLMs at roughly half the GPU cost of full-precision serving. It won out over earlier INT4 schemes because it protects the small fraction of salient weight channels, identified from activation statistics, that matter most for accuracy.
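For readers who want to try it, here is a minimal quantization sketch assuming the AutoAWQ package and a Hugging Face checkpoint; the model path is illustrative, and the config values mirror the library's documented defaults.

```python
# Sketch: INT4 AWQ quantization of a Hugging Face checkpoint via AutoAWQ.
# Model path is illustrative; config values follow AutoAWQ's documented defaults.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # any causal LM checkpoint
quant_path = "mistral-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # calibrate and pack INT4 weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```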
How can OpenClaw be run fully locally?
Atomic Bot lets you run OpenClaw entirely on local hardware, such as a Mac, by pointing your personal AI assistant at a locally hosted model instead of a cloud API.
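OpenClaw's own configuration format isn't reproduced here; the sketch below only illustrates the generic local-model wiring such setups rely on, using Ollama's documented OpenAI-compatible endpoint.

```python
# Generic sketch of "fully local" agent wiring: Ollama exposes an
# OpenAI-compatible API at localhost:11434/v1, so any framework that
# accepts a custom base URL can run offline. Model name is illustrative;
# OpenClaw's actual config will differ.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama endpoint
    api_key="ollama",  # required by the client but unused locally
)

reply = client.chat.completions.create(
    model="llama3.2",  # any model already pulled into Ollama
    messages=[{"role": "user", "content": "Plan my day in three bullets."}],
)
print(reply.choices[0].message.content)
```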
What is Google AI Edge Eloquent?
Google AI Edge Eloquent is an offline-first dictation app powered by on-device Gemma models, so it transcribes speech without any internet connectivity.
What role does ExecuTorch play in on-device AI?
ExecuTorch, PyTorch's on-device inference runtime, is now part of PyTorch Core, giving local AI workflows first-class export-and-deploy support for running models on phones and embedded devices.
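A minimal sketch of the standard ExecuTorch export flow: `torch.export` captures the graph, `to_edge` lowers it, and the serialized `.pte` file is what the on-device runtime loads. `TinyModel` is a toy stand-in for whatever model you are deploying.

```python
# Sketch of the standard ExecuTorch export pipeline.
import torch
from executorch.exir import to_edge


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

exported = torch.export.export(model, example_inputs)  # capture the graph
edge_program = to_edge(exported)                        # lower to edge dialect
et_program = edge_program.to_executorch()               # serialize for the runtime

with open("tiny_model.pte", "wb") as f:                 # artifact loaded on-device
    f.write(et_program.buffer)
```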
Is Meta releasing open-source versions of its AI models?
Meta plans to release open-source versions of its upcoming Avocado and Mango models, reinforcing the boom in local hardware and open-source multimodal deployments.
Can large LLMs like 122B parameters run on a MacBook?
Yes. Quantization techniques on Apple Silicon now outperform MXFP4 and standard quantization baselines, making 122B-parameter LLMs runnable on MacBooks amid the local AI hardware surge.
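As a concrete starting point, here is a minimal sketch of loading a quantized model on Apple Silicon with `mlx-lm`; the 4-bit repo name is illustrative, and a 122B model would need correspondingly large unified memory.

```python
# Sketch: quantized on-device inference on Apple Silicon via mlx-lm.
# The repo name is illustrative; pick any 4-bit community conversion
# sized for your machine's unified memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=64,
)
print(text)
```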
What hardware runs production-scale local agents?
RTX CUDA workflows cover quantized Qwen 3.5, OpenClaw, and Ollama deployments on the desktop, while EPYC, NVIDIA, AMD, and M4 Mac Mini builds anchor server strategies for production-scale agents. Together they reinforce the local hardware boom amid the OSS multimodal deploy wave, as in the sketch below.
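A hedged sketch of the desktop side of this, assuming `llama-cpp-python` built with CUDA support; the GGUF path is hypothetical, standing in for any local quant from a Qwen/Ollama workflow.

```python
# Sketch: GGUF-quantized model on an RTX GPU via llama-cpp-python
# (requires a CUDA-enabled build). Model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3.5-q4_k_m.gguf",  # hypothetical local quant
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=8192,       # context window sized for agent-style prompts
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Ping?"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```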