MLX and NVFP4 FTW
In Which the Author Checks Back In on Local Model Coding
It had been a bit since I seriously tried using local models in interactive development sessions. So, I decided to update OpenCode to 1.17.4 and give a few local models a try... and things are getting really close to being good. In fact, I am starting to use a local model for bits of active development.
My local model environment is a Mac Studio with 64GB RAM and an M2 Ultra CPU. Notably, this is Apple Silicon, and Ollama now supports MLX, Apple's high-speed ML framework.
My test was having the local models implement a milestone in an actual development plan. This milestone involved:
- Creating a Kotlin source file containing unit tests
- Running those tests to confirm that they failed
- Creating a Kotlin source file with the implementation of the code being tested
- Running those tests to confirm that they now succeed
- Running Detekt to see if there are any issues with any of that Kotlin code
Each test started with the project in the same state. Each test involved three prompts:
- A "warmup" prompt (literally "how can you help me?"), just to get Ollama to load the model into memory
- A "review this plan" prompt, asking for feedback
- A prompt directing the coding agent to implement Milestone 1 of that plan
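I did the warmup through OpenCode, but the same idea can be sketched directly against Ollama's local HTTP API. This is a minimal sketch, assuming Ollama is listening on its default port (11434) and using its /api/generate endpoint. The helper names build_request and send are mine, purely for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a single, non-streaming generate call."""
    return {"model": model, "prompt": prompt, "stream": False}


def send(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload to Ollama and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling send(build_request("qwen3.6:35b-a3b-nvfp4", "how can you help me?")) should force the model to load, so the later prompts do not pay the load cost.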
I started my testing with qwen3.6:35b-a3b-nvfp4. In local model names like that, the suffix (here, -nvfp4) represents how the model is encoded. You can think of these encodings as a form of lossy compression on the model: a name with no suffix is the original, and a name with a suffix means the original weights were compressed into smaller numeric types. NVFP4 is NVIDIA's 4-bit floating-point format, whereas MXFP8 is an 8-bit floating-point format. MXFP8 preserves more of the original's quality than NVFP4 does, but you need to choose your model based on its final size, and MXFP8-encoded models run larger than their NVFP4 counterparts. In my case, I was testing the largest NVFP4-encoded model that would reasonably fit in this 64GB Mac.
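To get a rough feel for why the encoding matters on a 64GB machine, here is a back-of-the-envelope sketch. It assumes that the "35b" in the model name means roughly 35 billion parameters and that the unsuffixed original uses 16 bits per weight. It also ignores per-block scale factors, the context cache, and everything else competing for memory. The function name is hypothetical.

```python
def estimated_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: "35b" in the model name means about 35 billion parameters.
PARAMS_BILLIONS = 35

# Assumption: the uncompressed original stores 16 bits per weight.
for label, bits in [("16-bit original", 16), ("MXFP8", 8), ("NVFP4", 4)]:
    size = estimated_weight_gb(PARAMS_BILLIONS, bits)
    print(f"{label}: ~{size:.1f} GB")
```

Under those assumptions, the weights come to about 70 GB at 16 bits, 35 GB at 8 bits, and 17.5 GB at 4 bits, which is why the 4-bit encoding leaves so much more headroom on a 64GB Mac.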
The output from qwen3.6:35b-a3b-nvfp4 was decent but not awesome. I then tried gemma4:31b-nvfp4, which was about 50% faster and had slightly better output.
The real hit was qwen3.6:35b-a3b-coding-nvfp4: same as the first one, but focused on use by coding agents. This was about 33% faster than the Gemma 4 model and had fairly respectable output. It was not perfect by any means, but it was well within the range of "I can work with this".
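Those two percentages stack. As a quick sketch, assuming "X% faster" means X% more throughput (my reading, not a measured definition), the combined speedup of the coding model over the first Qwen model works out like this. The function name is hypothetical.

```python
def compound_speedup(*percent_faster: float) -> float:
    """Multiply successive 'N% faster' steps into one overall factor."""
    factor = 1.0
    for pct in percent_faster:
        factor *= 1 + pct / 100
    return factor


# Gemma 4 was ~50% faster than the first model,
# and the coding model was ~33% faster than Gemma 4.
overall = compound_speedup(50, 33)
print(f"Coding model vs. first model: about {overall:.1f}x")
```

In other words, if that reading holds, the coding-focused model ran at roughly twice the speed of the first one I tried.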
My next steps are:
- Optimize my Ollama setup a bit
- Continue tweaking my development plan format to better provide instructions to models
- Revise my milestone-implementer skill that agents use, as I have neglected that end of the process 😞
- See what OpenCode uses as a system prompt and learn how I might adapt that for use with Knosh
- Start using qwen3.6:35b-a3b-coding-nvfp4 with OpenCode and Knosh "for realz"
For example, the next Knosh release will contain a plan-review command: read a development plan and provide feedback. Doing this ad-hoc in OpenCode, qwen3.6:35b-a3b-coding-nvfp4 found issues in a development plan created by Claude Opus.
Exciting times!
Mark Shust is going down the same path of local models on a 64GB Mac. In the past month, he has had several excellent posts on how he has configured his environment, such as this one where he independently settled on the same qwen3.6:35b-a3b-coding-nvfp4 model. Now, all I need to do is convince him to add an RSS or Atom feed to his site... 😀