Local Coding Model on a 24 GB Mac: Why My NVFP4 Setup Failed Quietly

TL;DR

I wanted to find the best locally runnable coding model for my MacBook M4 Pro with 24 GB of unified memory and put it into actual use. Researched it, pulled it, configured it, started it -- and stalled at 0.1 tokens per second. According to Ollama the model was running on the GPU, everything looked correct, and yet it did not work. This article is the honest write-up: what I tried, what went wrong, five lessons I am taking forward, and what you should try instead on whichever hardware you have.

Why I bothered in the first place

Most of my coding sessions run on cloud models. That works, it is fast, and it is robust. But there are three reasons why I keep trying to get a good model running locally:

Data sovereignty. Some codebases I do not want to send into someone else's inference pipeline, even when the providers promise zero retention.
Offline resilience. When I am at an airport or in an attic with poor reception, I want to keep working.
Control and learning. Local models force you to understand the mechanics: quantisation, context lengths, backends, memory layout. Once you have done that cleanly, you make better choices in the cloud too.

In April 2026 the picture for 24 GB Macs looked good at first glance. Qwen released Qwen3.6-35B-A3B-Coding, a model that scores 73.4% on SWE-Bench Verified. As a Mixture-of-Experts model with only 3 billion active parameters per token, it promises performance on par with dense 24-billion-parameter models, at a footprint of 22 GB in 4-bit format. That fits exactly into my memory.

On top of that, Ollama 0.21 brought the MLX backend, Apple's native accelerator for Apple Silicon. In theory, the fastest local setup for my Mac was within reach.

The plan

Multiple sources claimed the trick on machines under 32 GB was a specific quantisation tag: nvfp4. NVFP4 is a 4-bit format from the Nvidia ecosystem that MLX supports natively. Ollama 0.19+ only enables the MLX backend automatically from 32 GB and up; with the explicit NVFP4 tag, the sources said, you could trigger it on 24 GB machines as well.

My plan was simple:

Pull qwen3.6:35b-a3b-coding-nvfp4 via Ollama
Build a custom Modelfile with reduced context (8K) and the official Qwen sampling parameters
Quantise the KV cache to q8_0 and enable Flash Attention
Run a smoke test, measure tokens per second
Wire it into the agentic coding setup

Sounds like 30 minutes of work. It was not.
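For illustration, steps two and three of the plan can be sketched as a small helper that renders the Modelfile text. This is a minimal sketch, not my exact file: the function name render_modelfile is hypothetical, and the temperature value is a placeholder rather than the official Qwen sampling parameters. FROM, PARAMETER and num_ctx are regular Modelfile syntax, and the two environment variables are how Ollama exposes Flash Attention and KV cache quantisation on the server side.

```python
def render_modelfile(base_model: str, num_ctx: int, sampling: dict) -> str:
    """Render an Ollama Modelfile with a reduced context window."""
    lines = [f"FROM {base_model}", f"PARAMETER num_ctx {num_ctx}"]
    # Sampling parameters are passed in by the caller; none are hard-coded here.
    for name, value in sorted(sampling.items()):
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


# Server-side settings, set in the environment of the Ollama daemon.
SERVER_ENV = {
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
}

# Placeholder sampling value, NOT the official Qwen recommendation.
modelfile = render_modelfile(
    "qwen3.6:35b-a3b-coding-nvfp4", 8192, {"temperature": 0.7}
)
print(modelfile)
```

Saved as Modelfile, the result would be registered with ollama create qwen36-coder-24gb -f Modelfile.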

What actually happened

Wall 1: The invisible memory ceiling

First smoke test, immediate error:

Error: 500 Internal Server Error: model requires 20.4 GiB
but only 17.3 GiB are available (after 512.0 MiB overhead).

My first reflex: not enough RAM free, kill other apps. Codex.app, Chrome helpers, Notion frameworks. Soon I had 16 GB free. Same error message, same number: 17.3 GiB available.

What I only understood after three web searches: that number was not free RAM, it was the internal macOS limit on GPU wired memory. By default macOS lets the GPU wire only about 75% of unified memory, around 18 GB on a 24 GB Mac. The value lives in the kernel parameter iogpu.wired_limit_mb. With its default of 0, which means "use the system default", a 24 GB Mac gets a hard ceiling of roughly 18 GB for GPU-resident models, completely independent of how much RAM is actually free.
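The arithmetic behind that ceiling can be sketched in a few lines. The 75% fraction and the 512 MiB overhead are taken from the text and the error message above. Both function names are hypothetical, and the real calculation clearly differs a little, since Ollama reported 17.3 GiB rather than the 17.5 GiB this approximation yields.

```python
def default_gpu_ceiling_gib(total_gib: float, gpu_fraction: float = 0.75) -> float:
    """Approximate default GPU wired-memory ceiling (assumed fraction)."""
    return total_gib * gpu_fraction


def model_fits(required_gib: float, ceiling_gib: float, overhead_gib: float = 0.5) -> bool:
    """True if the model fits under the ceiling after subtracting the overhead."""
    return required_gib <= ceiling_gib - overhead_gib


default_ceiling = default_gpu_ceiling_gib(24)
print(default_ceiling)                    # 18.0
print(model_fits(20.4, default_ceiling))  # False, no matter how much RAM is free
```

Free RAM is not an input to either function, which is the whole point: closing apps cannot change the outcome.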

Fix: sudo sysctl iogpu.wired_limit_mb=21504. I lifted the value to 21 GB, which counts as aggressive but is still safe on a tidy system. Above 22 GB you enter freeze territory.
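Where the number 21504 comes from, as a minimal sketch: the parameter is given in MB, and 21 times 1024 gives the value above. The helper names are hypothetical, and the 22 GB guard simply encodes my own rule of thumb from this section, not a documented macOS limit.

```python
FREEZE_RISK_GIB = 22  # rule of thumb from this article, not an official limit


def wired_limit_mb(limit_gib: int) -> int:
    """Convert a ceiling in GiB into the value for iogpu.wired_limit_mb."""
    if limit_gib > FREEZE_RISK_GIB:
        raise ValueError(f"{limit_gib} GiB is above the {FREEZE_RISK_GIB} GiB comfort zone")
    return limit_gib * 1024


def sysctl_command(limit_gib: int) -> str:
    """Build the command string; this sketch does not execute it."""
    return f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(limit_gib)}"


print(sysctl_command(21))  # sudo sysctl iogpu.wired_limit_mb=21504
```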

Wall 2: The bug I did not know about

Sysctl set, verified, model restarted -- and Ollama still reported the same 17.3 GiB. Turns out: Ollama only reads the wired limit at daemon start, not on every inference. Known open bug since 2024, issue #1826. Fix: a full cold start of the Ollama daemon.

The moment of hope

Cold start, fresh attempt. This time the model loaded cleanly. ollama ps showed:

Model on GPU: 100%
Memory wired: 21 GB (right at the cap)
Runner argument: --mlx-engine

Meaning: Apple Silicon acceleration was active. The NVFP4 tag had genuinely triggered the MLX path in Ollama, even though my machine sits below the 32 GB auto threshold. I felt a brief, quiet flash of triumph. This was exactly the setup the research had promised.

Wall 3: Silence

API call sent. Curl waiting. 30 seconds. 60. Timeout at 90 seconds. Zero bytes of response.

Health check on /api/version answered immediately. The daemon was alive. But generation was not happening. Activity Monitor showed 100% GPU usage, so the model was computing. Just extremely slowly.

The server log eventually revealed the actual number: prefill at roughly 0.1 tokens per second. On an M-series GPU with MLX, 100 to 200 tokens per second would be normal. That is a factor of 1,000 to 2,000 too slow.

I tried twice more with different context lengths and sampling parameters. Same result. The setup loaded technically correctly, but the computation itself appears to have fallen back to a software emulation of the NVFP4 kernels. Apple's MLX acceleration for NVFP4 seems to reach usable speed only with the Neural Accelerators that arrived with the M5 generation, which my M4 Pro does not have.

Honest diagnosis

This is a reasoned hypothesis, not a proven fact:

The research sources claimed "NVFP4 runs on any hardware that has kernels". Formally true. In practice "runs" only means "produces correct output", not "performs well".
Apple's own publications explicitly reference the M5 Neural Accelerators for the 4x prefill speedup. That was the reliable signal I missed.
Memory was fine (peak 19.67 GiB under the 21 GB cap), GPU routing was fine, the runner was using the right backend. The bottleneck was not in the setup, it was in the hardware assumption.

What I can rule out: memory pressure, CPU fallback, broken model, KV cache pressure, or a generic Ollama issue. All five would have shown up in the server log or in Activity Monitor.

What I am taking from this

Five lessons that I think generalise.

1. A quantisation tag is not a universal accelerator

NVFP4 on hardware without Neural Accelerators is effectively an anti-pattern: the backend gets activated, but the kernels appear to be emulated in software. The 32 GB auto-activation threshold in Ollama likely has a hardware reason, not just a memory reason. If you find a "trick" that bypasses that threshold on under-spec hardware, that should make you suspicious, not euphoric.

2. Tech blogs routinely oversell hardware portability

"Runs on any hardware" in tech-blog speak almost always means "is syntactically supported", rarely "is performant". When the original vendor names a hardware generation as a prerequisite, that is the more reliable signal. In my case Apple's own MLX-and-M5 publication explicitly tagged the Neural Accelerators as the source of the speedup. Three tech blogs suggesting otherwise cost me two hours.

3. "100% GPU" in `ollama ps` is not proof of performance

The column only tells you the model is loaded onto the GPU. It says nothing about computational efficiency. Real tokens per second only show up in the server log or the API response metrics. If you only read the status command, you can mistake a dead setup for a healthy one. I am now baking this into my diagnostic workflow: status first, then always check the server log for the real prefill and decode rates.
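A minimal sketch of that second step, assuming the metrics come from the final JSON object of a non-streaming Ollama /api/generate call. The field names prompt_eval_count, prompt_eval_duration, eval_count and eval_duration are the ones Ollama reports, with durations in nanoseconds. The helper names and the sample numbers are made up for illustration.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Tokens per second from a token count and a duration in nanoseconds."""
    if duration_ns <= 0:
        return 0.0
    return count / (duration_ns / 1e9)


def rates(response: dict) -> dict:
    """Prefill and decode rates from an Ollama generate response."""
    return {
        "prefill": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0),
        ),
        "decode": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0),
        ),
    }


# Made-up sample: 9 prompt tokens processed in 90 seconds, nothing generated yet.
sample = {"prompt_eval_count": 9, "prompt_eval_duration": 90_000_000_000}
result = rates(sample)
print(result)
```

A prefill rate in the 0.1 range next to a "100% GPU" status is exactly the mismatch I ran into.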

4. The memory ceiling and performance are separate problems

iogpu.wired_limit_mb solves the question "does the model fit into memory". It does not solve the question "how fast does the model compute". Tutorials often treat both in the same breath, but they are orthogonal. A clean diagnostic order: loading first, then backend, then prefill speed, then decode speed. Four separate questions, four separate proofs.
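The diagnostic order can be written down as a tiny checklist that reports the first stage without proof. This is only a sketch of the idea: the stage names mirror the four questions above, the function name is hypothetical, and how each check is actually established (log line, status command, measured rate) is left to the caller.

```python
from typing import Optional

# The four questions, in the order they should be answered.
STAGES = ("loading", "backend", "prefill", "decode")


def first_failure(checks: dict) -> Optional[str]:
    """Return the first stage that is missing or failed, or None if all pass."""
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None


# My NVFP4 attempt: model loaded, MLX backend active, prefill unusable.
my_run = {"loading": True, "backend": True, "prefill": False, "decode": False}
print(first_failure(my_run))  # prefill
```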

5. Quantisation and hardware acceleration are two separate axes

A quantisation like NVFP4, Q4_K_M, or MXFP8 solves memory problems. A hardware accelerator like MLX kernels, CUDA, or Metal solves speed problems. The fact that both exist and are compatible does not mean their combination runs at usable speed. It needs matching kernel implementations for the specific hardware generation. This separation was intellectually clear to me, but in the concrete case I missed it.
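To make the memory axis concrete, here is a rough footprint estimate for the weights alone. It is a sketch under stated assumptions: 35 billion parameters is the nominal model size, and 4.5 bits per weight assumes a block-scaled 4-bit format that stores one 8-bit scale per 16 values. KV cache, runtime buffers and any layers kept at higher precision are not included, and the function name is hypothetical.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GiB for a given effective bit width."""
    return n_params * bits_per_weight / 8 / 2**30


four_bit = weight_footprint_gib(35e9, 4.5)   # assumed effective bits for a block-scaled 4-bit format
half_prec = weight_footprint_gib(35e9, 16)   # unquantised 16-bit weights for comparison
print(round(four_bit, 1), round(half_prec, 1))  # 18.3 65.2
```

Nothing in this calculation depends on the chip generation, and nothing in it predicts tokens per second. That is the separation in one function.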

What you should try instead

General recommendations by hardware class, based on the April 2026 picture and what appears to actually work. No promises, just the most reliable paths from my research.

16 to 24 GB Mac, agentic coding wanted

Devstral 2 24B as a dense model at 14 GB in 4-bit format is the robust daily driver. Specifically tuned for tool calling and agentic workflows, no NVFP4 trick needed, runs through the classic llama.cpp path in Ollama. Expected performance on an M4 Pro: roughly 25 to 35 tokens per second.

Alternatively Qwen3 14B as a dense model at around 9 GB. Lots of context headroom, very fast, weaker on pure coding benchmarks but more than enough for many tasks.

24 GB Mac, larger model wanted

Qwen3.6-35B-A3B in the classic Q4_K_M format (not NVFP4) over the llama.cpp path in Ollama. It loads about 22 GB and runs stably without MLX tricks. Expected performance: roughly 25 to 35 tokens per second. Slower than a hypothetically working MLX setup, but it actually works.

Important here too: set the KV cache to q8_0, limit context to 8K to 16K, enable Flash Attention. Otherwise the setup will tip into swap on the first longer tool-calling run.
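Why the KV cache setting and the context limit matter can be estimated with the usual formula: two tensors (keys and values) per layer, each of size KV heads times head dimension times context length. The layer count, KV head count and head dimension below are hypothetical placeholders, not the real Qwen3.6 architecture. The bytes per element for q8_0 follow from its block layout of 32 one-byte values plus a two-byte scale.

```python
# Bytes per cached element: f16 is 2 bytes, q8_0 stores 34 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, cache_type: str = "f16") -> float:
    """Approximate KV cache size in GiB for a standard attention stack."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical architecture: 48 layers, 8 KV heads, head dimension 128.
for ctx in (8192, 16384):
    f16 = kv_cache_gib(48, 8, 128, ctx, "f16")
    q8 = kv_cache_gib(48, 8, 128, ctx, "q8_0")
    print(ctx, f16, q8)
```

With these placeholder numbers, q8_0 roughly halves the cache and every doubling of context doubles it again. On a machine with about a gigabyte of slack, that difference decides whether a long tool-calling run stays in memory.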

32 GB+ Mac, any M generation

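This is where, according to the sources, the NVFP4 path becomes viable. Ollama auto-activates MLX from 32 GB upward, so the explicit tag trick is no longer needed, and Qwen3.6-35B-A3B-Coding-NVFP4 should run the way the blog sources described it. One caveat: I could not test this, and if my hypothesis above holds, chips before the M5 may still hit slow NVFP4 kernels regardless of memory size, so check the prefill rate before you commit. Also: LM Studio with MLX-DWQ quantisations is a separate toolchain with a more mature MLX implementation than Ollama right now. If you want maximum quality at 4-bit, look there.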

M5 generation, any memory tier

With the Neural Accelerators on the M5 GPUs, NVFP4 becomes "the right" pick for the first time, according to everything Apple and the early benchmarks say. Reported prefill speedups are in the region of 4x. If you are on new hardware, you can take the path I could not take.

What I am not covering

I am deliberately leaving out comparisons against my personal upgrade plans. If you are making a purchase decision, base it on concrete benchmarks from current sources, not on a write-up of a failed setup. The model and backend landscape in 2026 shifts month by month. The half-life of dominant models is currently around three months.

Clean rollback

If you followed the recipe and walked into the same wall, three steps to clean up:

ollama rm qwen36-coder-24gb
ollama rm qwen3.6:35b-a3b-coding-nvfp4
sudo sysctl iogpu.wired_limit_mb=0

The last command resets the memory ceiling to the macOS default. The aggressive cap otherwise costs you system headroom and can cause stutters under load.

What stays

Two hours in a dead end, an empty model cache, and an honestly documented hypothesis. That is not a wasted evening. The diagnostic order, the five lessons, and the hardware recommendations will save me hours on the next local-LLM attempt. More importantly: I will read research sources differently now. When a tech blog promises a trick that bypasses the official hardware requirements, that is a warning sign, not a hot tip.

If you took the same path, or had a different experience: write to me. Local LLMs are a field that only moves forward collectively -- everyone has a different slice of the hardware reality on their desk.