On 15 June 2026, Vicki Boykis published a short, technically clean post on her blog titled "Running local models is good now." The headline is the point. After three years of local model releases that were always six months behind the frontier, Boykis — a working ML engineer who has been on the local-inference side of this since llama.cpp was a weekend project — is willing to say out loud: the gap just closed enough to matter. The "vibe metric" she uses to make the call is the one anyone who has shipped with a local model eventually lands on: do I still need to double-check this against an API model? When the answer stops being "yes, every time," the local model has crossed a threshold. The post is the documented version of that crossing. It is worth reading closely.
The setup: where local models actually are in 2026
Boykis's working stack, on a 2022 M2 Mac with 64 GB of RAM and 1 TB of storage, is the one most engineers who care about local inference end up on: raw llama.cpp with Open WebUI on top, llama-cpp-python, Ollama, llamafiles, and LM Studio as the desktop client. The model list she has been driving is the actual frontier of small open-weights: Mistral 7B (the early one), Gemma 3, OpenAI's OSS-20B, Qwen 3 MOE, and the Qwen 2.5 Coder variants. None of this is exotic. All of it is in the public model registries; all of it runs on hardware a senior engineer can buy off the shelf.
Where Boykis's post is sharp is the inflection point she names. For years, the local tier has been "fast personalized Google" — useful for "what is the syntax for X in library Y" lookups, slow for anything that required sustained reasoning. The release of GPT-OSS, in her telling, was the first time the double-check reflex stopped firing. The latest Google releases, in the Gemma 4 family, are the first time local agentic coding loops "work at about ~75% the accuracy/speed of frontier models." That is the claim, and the claim is what the rest of this post is built on.
What the post is actually demonstrating
Boykis is not benchmarking. She is reporting on a setup she has been running. The specific things she has gotten a local Gemma 4 26B-A4B model running through the Pi agent harness to do:
- Refactor a Python script from a notebook into a five-or-six-module repo, with a separate pass to clean up generic type hints.
- Proofread blog posts.
- Write unit tests.
- Bootstrap a recommendation-system repo from a blank slate and watch what the agent produces.
- Build out the surface that scrapes trending topics from arXiv papers.
The "Docker container with limited execution" framing is the one detail that matters most for any reader who is going to try this at work. Boykis runs Pi in a sandbox with bash permissions only — no Python, no web browsing — and plans to add curl in a separate image for the research tasks. The 64 GB K-V cache ceiling on long-context runs is the part she is honest about: local means local, and the hardware bound is real.
The "75%" claim, held up to the light
The number worth arguing with is 75%. It is Boykis's read, not a benchmark number, and the metric she is using is "accuracy/speed of frontier models" — which is a two-axis read on a single subjective scale. Three things are true at the same time:
- 75% of frontier is enough for almost all the day-to-day engineering work that does not require a frontier-tier reasoning model. The "personalized Google" use case is the dominant one in any working engineer's day, and a local model that handles it without the API round trip is, in expectation, faster end-to-end.
- 75% of frontier is not enough for the 25% that does. The benchmarking, the long-horizon agentic work, the multi-file refactor with architectural judgment, the security-sensitive code review — these still want the frontier. The threshold-crossing is a threshold-crossing for a specific workload, not a general capability ceiling.
- The 75% number is the floor, not the ceiling, of this release cycle. Gemma 4 12B-QAT, which Boykis flags at the end of the post, is already the model she is migrating to. Smaller, faster, "without much sacrifice in accuracy." The 75% number is going to move up over the next two release cycles, and the threshold is going to move with it.
The defensible read of the post is not "local models have caught up." It is "the local tier crossed the threshold for the use case that most engineers spend most of their time on, and the threshold is not going to go back down."
What the post leaves out, and what to do about it
Three things Boykis does not say, and that the local-model reader needs to hear:
Hardware is the actual constraint, not model quality. The 64 GB M2 Mac is the floor for the workloads she describes. The 16 GB laptop a junior engineer is running is not. The LM Studio system-requirements page calls 16 GB the recommended minimum; the working note for 8 GB Macs is to stick to smaller models and modest context sizes. The 75% number does not transfer to the 16 GB tier. The realistic expectation for a 16 GB M-series laptop is Gemma 3 4B and Qwen 2.5 Coder 7B at modest context, and the workloads that work at that size are the personalized-Google ones, not the agentic ones.
The harness is half the product. Pi in read-only mode is doing a lot of the work Boykis credits to the model. The local model is producing the tokens; the agent harness is producing the file structure, the test scaffolding, the import graph. If you swap the harness — Aider, Claude Code pointed at the local endpoint, OpenHands — the same model produces a different 75%. The post is a "local model + Pi + LM Studio + Docker sandbox" report, not a "local model" report. The stack is the unit of analysis.
The "introspect everything" angle is the underrated one. The closing of Boykis's post is the part that should land hardest for the developer-tools audience. With a local model, you can watch the token inference in real time, change the context window and watch the performance move, swap the quantization, swap the system prompt, swap the model entirely, and see what each swap does to the output. That is not a debugging story. It is a learning story. The frontier API is a black box; the local stack is not. For an engineer who is trying to develop intuition for what these models actually do, the local setup is the only one that gives you the loop.
The original take: the 75% threshold is a labor-market signal, not a model-quality signal
The thing nobody is making explicit: the local-tier model crossing the 75% threshold is a signal about what counts as a "developer job" in 2026, not a signal about model capability. The model has not "caught up" in any objective sense — frontier models are still frontier, and the gap on the long tail of agentic and reasoning workloads is real. What has changed is that the work that is actually most of a working engineer's day — the syntax lookup, the test scaffold, the lint pass, the blog post proofread, the bootstrap — has been moved out of the "requires a human engineer" column and into the "requires a 75%-of-frontier model" column. That reclassification is permanent. It does not reverse when the next model release lands.
The labor-market consequence is the part the post does not make. When a single engineer with a 64 GB laptop and a local model can ship the work that used to take a team, the question is not "are local models good enough." The question is "what is the team for." The 75% threshold is the point at which the team-shape question becomes the question, full stop. The engineers who can name the workloads that benefit and the workloads that don't are the ones who will be hirable through the transition. The engineers who are still benchmarking local models against frontier models in 2027 are the ones who missed the reclassification.
What this means for you
- If you are an engineer who has not run a local model yet — pick the smallest model that fits your hardware (Gemma 3 4B or Qwen 2.5 Coder 7B on a 16 GB laptop; Gemma 4 26B-A4B or gpt-oss-20B on a 64 GB desktop) and run a personalized-Google workflow through it for a week. The point is to develop intuition for what the threshold is on your hardware, not to beat a benchmark.
- If you are a tech lead making tooling decisions — the question is not "do we buy frontier API access." The question is "which workflows do we run locally, which we run on frontier, and which we run on a small fine-tune." The 75% threshold means a meaningful slice of the answer is "local." The cost model and the data-handling model both change.
- If you are evaluating agent harnesses — the local stack is the right place to do the comparison. Swap the model, keep the harness; swap the harness, keep the model; look at the diff. The harness matters as much as the model. Pi is not the only option; it is the one Boykis is using, and it is worth trying.
- If you are writing about local models — the right unit of analysis is the stack, not the model. "Gemma 4 12B on a Mac M2 with Pi in Docker" is the real subject. "Gemma 4 12B" is not.
What to do this week
## Step 1. Install LM Studio (skip if you have it). System requirements
# are documented at https://lmstudio.ai/docs/app/system-requirements:
# - macOS: Apple Silicon M1/M2/M3/M4, macOS 14+, 16 GB+ RAM
# - Windows: x64 or ARM (Snapdragon X Elite), AVX2 required, 16 GB+ RAM
# - Linux: x64 or ARM64, Ubuntu 20.04+, distributed as AppImage
# 4 GB of dedicated VRAM is the recommended amount for hardware that
# has a discrete GPU.
## Step 2. Download the model Boykis is migrating to (gemma-4-12b-qat)
# if your hardware can run it; otherwise start with gemma-3-4b or
# qwen2.5-coder-7b. The exact file you want is in the LM Studio model
# browser; pick the Q4_K_M quantization if you are RAM-constrained.
## Step 3. Set up the agent harness. Pi is at
# https://github.com/earendil-works/pi (formerly badlogic/pi-mono).
# The models.json you need to point Pi at LM Studio is in the
# primary source. Use docker-compose to run Pi in a sandbox with
# bash-only permissions, the same way Boykis does.
## Step 4. Run a personalized-Google workflow for a week. Pick a real
# task you do every day (a syntax lookup, a test scaffold, a lint
# pass, a blog post proofread) and run it through the local model.
# The point is not to publish the result. The point is to develop
# intuition for what the 75% threshold feels like on your hardware.
## Step 5. If the local stack works for you, file the ticket. "Move
# this workflow off the frontier API" is a procurement decision, a
# cost-of-inference decision, and a data-handling decision. The
# ticket is the audit trail; the result of running it for a week
# is the input.
Related reads from this blog
- GLM-5.2 Hits 1M Context and Lands in Claude Code for $18 — the open-weights-vs-frontier pricing story; the GLM-5.2 post is about an open-weights model being good enough to displace a frontier subscription. The local-model story is the same reclassification at the per-engineer level.
- Rio's 'Homegrown' 397B LLM Is Just Nex + Qwen With a Mask — the open-weights-merger story; the 75% threshold is what makes the "we fine-tuned an open-weights model" claim viable in 2026. The model architecture itself is upstream of the local-tier result.
- An AI Agent Burned $6,531 on AWS to Scan a Hobby Network Nobody Asked It to Scan — the cost-control angle. The dn42 incident is what happens when a frontier API runs unattended; the local-model story is the cost-control answer for the 75% of workflows that do not need a frontier model in the first place.
Disclosure
Drafted with AI assistance. Primary source: Vicki Boykis, "Running local models is good now," 15 June 2026,
https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/(retrieved 17 June 2026 00:30 UTC+8 viacurl -L --compressed; the page body extracted was ~34 KB of rendered HTML). The "75% of frontier" figure is Boykis's read, not a derived benchmark; the model list (Mistral 7B, Gemma 3, OSS-20B, Qwen 3 MOE, Qwen 2.5 Coder), the harness (Pi), the inference client (LM Studio), the hardware spec (2022 M2 Mac, 64 GB RAM, 1 TB storage), the 64 GB K-V cache ceiling, and the Docker-sandbox-with-bash-only pattern are all Boykis's. The "Pi" agent harness is athttps://github.com/earendil-works/pi(the canonical repo; the priorbadlogic/pi-monoURL is a 301 redirect to it, verified viacurl -sI). The LM Studio system requirements (16 GB RAM recommended, macOS 14+ on Apple Silicon, 4 GB VRAM recommended, AppImage on Linux) are fromhttps://lmstudio.ai/docs/app/system-requirements, retrieved 17 June 2026 00:35 UTC+8. The "75% threshold is a labor-market signal, not a model-quality signal" framing in the original-take section is this blog's editorial position, not a claim in the Boykis post. The "Pi in read-only mode" defensive framing is Boykis's; the local-model-as-learning-loop framing in the "what the post leaves out" section is the blog's. The hardware-bound argument (16 GB tier cannot run Gemma 4 26B) is a derived claim from the LM Studio system-requirements document, not a direct quote from Boykis. No quote in the body is presented as a verbatim Boykis sentence; the paraphrases are marked as such. Limit on inference: the "75%" figure is not a benchmark and is not a stable cross-workload metric; treat it as Boykis's reading of her own setup.
Sources
- Vicki Boykis, "Running local models is good now," 15 June 2026 —
https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/ - LM Studio, "System Requirements" (retrieved 17 June 2026 00:35 UTC+8) —
https://lmstudio.ai/docs/app/system-requirements - Pi agent harness repository (canonical URL; prior
badlogic/pi-monois a 301 redirect) —https://github.com/earendil-works/pi - Google, "Gemma 4 model card" (background on the Gemma 4 family that Boykis is migrating to) —
https://ai.google.dev/gemma - OpenAI, "gpt-oss-20b model card" (the OSS-20B reference in Boykis's model list) —
https://openai.com/index/gpt-oss-20b/(link returned HTTP 403 as of 17 June 2026; the same model on Hugging Face athttps://huggingface.co/openai/gpt-oss-20bis the live reference) - LM Studio, "Integrations: Claude Code" (the documented path for pointing Claude Code at a local LM Studio endpoint) —
https://lmstudio.ai/docs/integrations/claude-code
No comments:
Post a Comment