Programming guides for beginner...
Any comments are welcomed....
I hope it helps!!! Thanks for drop by...
Powered By Blogger

Tuesday, June 9, 2026

Xiaomi Hit 1000 t/s on a 1T Model. The Race Just Changed.

Xiaomi Hit 1000 t/s on a 1T Model. The Race Just Changed.

Disclosure: This post was researched and drafted with AI assistance. Primary source: Xiaomi MiMo Team, "MiMo-V2.5-Pro-UltraSpeed", mimo.xiaomi.com/blog/mimo-tilert-1000tps, 8 June 2026 (HN front page 9 June 2026, 476 points). Secondary: DFlash paper, arXiv:2602.06036; HN thread 48446639; TileRT blog. The 1000 tps figure, the 1T-parameter MoE, the 8-GPU single-node footprint, MXFP4 on Experts only, DFlash block-level drafting with 6.30 / 5.56 / 4.29 acceptance on Coding / Math / Agent, the 9–23 June 2026 trial window, the 3× base-cost pricing, the FP4-DFlash checkpoint on HuggingFace, and the TileRT persistent-kernel / warp-specialization execution model are all from those sources. The quoted phrases "essentially on par" and "one breath per verification round" are direct lifts from the Xiaomi blog post. The "speed is the new scaling" thesis, the parallel-reasoning / coding-agent / real-time-decision-loops downstream taxonomy, the experts-only quantization observation, and "the original take" are the blog's own. The "~42B active parameters" figure is one HN commenter's read of the architecture, presented as such, not a confirmed spec.

A 1-trillion-parameter model, generating roughly 1,000 tokens per second, on a single 8-GPU commodity node. That is the headline from Xiaomi and TileRT on 8 June 2026. For two years the axis was "bigger model wins." As of this week it is "fast model wins," and the new speed comes not from exotic silicon but from how you quantize the experts, how you draft the next block of tokens, and how you keep the GPU pipeline full. The 1000-tps number is not a vanity stat. It is a step change that lets a frontier-class model enter real-time decision loops — and the model weights are public, on HuggingFace, today.

What Xiaomi actually claims: 1T at 1000 tps on one 8-GPU node

MiMo-V2.5-Pro-UltraSpeed is a 1-trillion-parameter Mixture-of-Experts model with roughly 42B parameters active per token, per one HN commenter's read of the architecture (Xiaomi's post does not state the active-params figure explicitly). Decode speed is 1000+ tps, peaking near 1200 tps. It runs on a single standard 8-GPU commodity node — no wafer-scale Cerebras, no on-chip SRAM Groq, no bespoke interconnect. The price is 3× the cost of standard MiMo-V2.5-Pro for ~10× the generation speed, available by application only, trial window 9–23 June 2026 (Beijing time), application-gated. The FP4-DFlash checkpoint is open-sourced. A frontier-tier model, made fast, on off-the-shelf hardware, with the weights shipped. That is the shape that makes the number land.

How they got there: model-system codesign, not one trick

FP4 quantization on the experts only. The 1T model is MoE. Most parameters live in the Experts, and Experts tolerate low-bit quantization much better than the rest of the model. Xiaomi quantizes only the Experts to MXFP4 (the OCP Microscaling spec) and leaves the rest at higher precision. Quantization-aware training keeps the capability "essentially on par" with the FP8 baseline. This is not "run a 1T model in 4-bit and pray." It is "run the 90% of the 1T that is structured for low bit, at low bit, and leave the 10% that isn't, at higher bit."

DFlash, block-level parallel drafting. Speculative decoding normally uses a small draft model that generates autoregressively — fast, but still serial. DFlash, the arXiv paper Xiaomi cites, replaces the autoregressive draft with a lightweight block diffusion model that fills an entire block of masked positions in one forward pass. The draft uses Sliding Window Attention, which makes per-prediction compute constant in context length rather than linear. The training pipeline pushes mask sampling down to GPU-local shards, so a single sequence yields tens of thousands of independent training signals per step. The acceptance lengths Xiaomi reports are unusually high: 6.30 for Coding, 5.56 for Math / Reasoning, 4.29 for Agent. Block size is capped at 8, which keeps verification overhead low and concurrency high. "The large model can confirm more content in one breath per verification round" is how the post puts it.

TileRT, a runtime that stops launching operators. At 1000 tps each operator's lifecycle is microseconds. Launch overhead, synchronization stalls, global-memory round-trips — at this clock frequency they become the bottleneck. TileRT discards the per-operator launch paradigm. A persistent engine kernel keeps the whole compute pipeline resident on the GPU, prefetching the next tile while the current tile is still on Tensor Cores. Warp specialization decomposes communication, data movement, and tensor computation into physically separated work. Each layer of the stack — quantization, drafting algorithm, kernel design — was chosen to be compatible with the others. That is the codesign.

Why 1000 tps is a category change

Parallel reasoning paths. When a hard problem is one slow generation, the developer waits. When the model is 10× faster, the same wall-clock budget runs ten candidate paths in parallel (Best-of-N, tree search, self-verification). Parallel sampling at inference time can substitute for longer chains at training time. The evidence has been stacking up for a year. 1000 tps makes the math work in production — a hard problem stops being a serial wait and becomes ten candidate paths in the same wall-clock budget.

Coding agents stop being a multi-second wait. At 1000 tps code generation becomes an interactive act. "A fast agent feels more like a partner" is the same observation that drove inline completions at ~50 ms, scaled up to whole-file generation.

Real-time decision loops for 1T models. High-frequency trading, fraud interception, voice assistants, surgical assistance — all have latency budgets tighter than the typical 50-tps frontier model can meet. A 1T model at 1000 tps fits inside most of them.

A 1T model is, in 2026, not new. What is new is the price-performance point: frontier-class capability, commodity hardware, near-real-time speed, the FP4-DFlash weights public. The HN thread's consensus is that the other frontier labs will need to match this number on commodity hardware. The more important fact is that the path does not require a custom chip. TileRT and Xiaomi shipped a model-system codesign, not a hardware moat. The same algorithmic choices can be made by anyone with the weights and a competent kernel team. Execution speed is a movable surface.

What you can do with this

  • If you build agent infrastructure: 1000 tps is the new baseline for code generation and tool-call loops. Plan capacity around near-real-time.
  • If you run inference at scale: MXFP4 quantization on the Experts of an MoE is the highest-leverage cost optimization available right now. Verify your GPU (H100, B200, MI300X) has the FP4 path before betting the cost model on it.
  • If you write speculative-decoding code: DFlash's block-diffusion drafting is the most credible challenge to autoregressive-draft speculative decoding at frontier scale. The "tiny autoregressive draft" pattern behind EAGLE-3 is the path to retire first.
  • If you are a CTO buying frontier model access: the price gap between Western closed-weights APIs and Chinese open-weights serving is widening. MiMo UltraSpeed (3× base for ~10× speed) is still well below the effective per-token cost of premium US closed APIs.

The original take: speed is the new scaling

For two years the AI race has been a parameter race. GPT-4 at ~1.8T, Llama 4 at 2T, the next model at 5T. Each reset the capability-vs-cost curve because it was bigger. Xiaomi and TileRT show the curve can be reset in the other direction: same capability, ~10× faster, same hardware budget. The obvious next move is not "build a 10T model" but "find the next 5–10× speedup on what we already have." Speculative decoding, expert-only quantization, persistent kernels, and warp specialization are the first four moves. The next ones will look like memory-tier orchestration, sparsity-aware scheduling, and more aggressive multi-token verification. The frontier capability story and the frontier cost story are decoupling.

The corollary: the latency budget of "what you can do with one model call" just got 10× larger. The 2027 product roadmap is being written this month, by the teams that figure out what becomes possible when a frontier model is faster than the developer's keystrokes.

What to do this week

# 1. Pull the FP4-DFlash checkpoint and benchmark your workload.
#    huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
#    Check: first-token latency (TTFT) on a 32k context;
#           sustained tps at 8k context on one 8x H100 / 8x B200 node;
#           quality on your eval set, not the public benchmarks.

# 2. If you still use EAGLE-3 or a vanilla draft-model speculative
#    decoding setup, read the DFlash paper (arXiv:2602.06036) and
#    prototype a block-diffusion draft. Acceptance length 6.3 on
#    Coding translates to real throughput, not peak-spec wins.

# 3. If you run an MoE model in production, instrument expert-level
#    precision. Quantizing only the Experts to MXFP4 is the cheapest
#    inference win available. Verify your GPU has the FP4 path first.

# 4. If you sell "fast inference," your public tps number is now a
#    buy/no-buy criterion. Publish sustained-tps at 8k context on
#    commodity hardware, or stop quoting peak.

# 5. If you price a token plan, re-run unit economics with 10x decode
#    speed. The cost-per-completed-task curve bends non-linearly once
#    you can fan out to parallel sampling.

The bottom line

Xiaomi and TileRT did not invent a new model and they did not invent a new chip. They combined a small set of existing techniques — MoE, FP4, block-diffusion drafting, persistent kernels — in a way that the parts compound. The result is a 1T model running at near-real-time speed, with the weights public, on commodity hardware. The race is no longer "whose model is biggest." The race is "whose model is fastest, and who can keep the speed as the models get smarter." This week, that race just started.

Related reads from this blog

Sources

Monday, June 8, 2026

Miasma Worm: Your Settings.json Is a Shell Prompt Now

Miasma Worm: Your Settings.json Is a Shell Prompt Now

Disclosure: This post was researched and drafted with AI assistance. Primary source: SafeDep Team, "Config Files That Run Code: Supply Chain Security Blindspot", safedep.io, 6 June 2026 (HN front page the week of 8 June 2026). Secondary source: SafeDep Team, "Mini Shai-Hulud 'Miasma: The Spreading Blight' Hits @redhat-cloud-services", safedep.io, 1 June 2026. The seven-launcher taxonomy, the .github/setup.js dropper (4,348,254 bytes, Caesar shift), the icflorescu/mantine-datatable commit f72462d9, the braune-digital/BrauneDigitalImagineBundle and mhar-andal/MyBlok launchers, the 121-repository figure, the workspace-trust-prompt mechanics, the claude -p headless prompt-skip, the CVE-2025-59536 / CVE-2026-21852 references, the npm-preinstall example on @redhat-cloud-services/[email protected], and the 32-package / 96-version Red Hat figure are all from those two posts. The Trigger / Authority / Grammar framework and the synthesis in "the original take" are the blog's own. CVE-2025-59536 and CVE-2026-21852 are reproduced from the SafeDep write-up and have not been independently verified against an NVD listing.

There is a class of supply-chain attack that does not need a malicious dependency, a typosquatted package, or a hijacked maintainer account. It only needs a folder. The folder can be empty except for a handful of ordinary-looking config files. The moment you open it in your editor, start an AI coding agent, or run the install command, the attack fires. The trigger is not the code — it is the config. The Miasma worm, which hit npm this month and surfaced on HN this week, is the clearest worked example. The threat model it breaks is the one most security checklists still assume holds: that opening a fresh clone is safe until you run npm install.

The seven config files Miasma uses to fire on open

The SafeDep post walks a single icflorescu/mantine-datatable commit (f72462d9, titled chore: update dependencies [skip ci]) showing that one commit added six files. Five exist to launch the sixth: a 4,348,254-byte dropper at .github/setup.js. None of the launchers contains the payload. They each carry node .github/setup.js and rely on the developer's own tools to evaluate it.

  • .claude/settings.json / .gemini/settings.json — byte-identical SessionStart hook configs that run a shell command the moment an agent session opens. Once the folder is trusted, the hook runs without further confirmation. Since Claude Code 2.1.0, SessionStart hooks run silently.
  • .cursor/rules/setup.mdc — Cursor has no shell-hook primitive, so the attacker used a project rule with alwaysApply: true instructing the agent to run the dropper. Prompt injection committed to the repo.
  • .vscode/tasks.json — a task with runOptions.runOn: "folderOpen". The workspace-trust prompt is the only gate; it flags that a hook exists, not that the hook's 4.3 MB target is a Caesar-shifted eval launcher.
  • package.json test script"test": "node .github/setup.js". Needs a deliberate action, but the deliberate action is npm test or a CI test step, both run on autopilot.
  • composer.json post-install-cmd in braune-digital/BrauneDigitalImagineBundle — runs on every composer install, no trust gate.
  • Gemfile line one in mhar-andal/MyBloksystem("node .github/setup.js"). A Gemfile is Ruby, evaluated top to bottom. bundle install, bundle exec, any Rails command reading it runs the dropper. No malicious gem in the dependency tree.

Seven surfaces. One dropper. The variety is the point: the attacker is betting on the category of tool that reads config and acts, not on any one editor.

What the dropper actually does

The .github/setup.js file is one statement in a try/catch. The first visible bytes are a Caesar shift over a numeric char-code array fed to eval. Statically decoding it (shift of 4) yields a staged Bun loader that AES-decrypts a credential stealer, scanning the host for AWS, Azure, GCP, Vault, Kubernetes, npm, and GitHub secrets and exfiltrating them to attacker-created public GitHub repos.

Two design choices are worth lingering on. The file is sized to be just above the limit where GitHub's code search stops indexing — roughly 384 KB — so the launcher files are what show up in search, not the dropper. And the obfuscation shape (numeric array, rotation function, eval, encrypted second stage) is the same harness SafeDep says they keep seeing recompiled across separate Miasma waves and across unrelated malicious-package campaigns. The rotation amount and the AES key change between builds, so the SHA-256 changes, but the structure stays.

The Red Hat / @redhat-cloud-services compromise on 1 June was a parallel Miasma wave abusing npm's GitHub trusted publishing via short-lived oidc-<hex> branches: push a branch that rewrites the trusted CI workflow into a self-publishing job, exchange its OIDC token for an npm publish token, repackage the legitimate tarball with a malicious preinstall, republish with valid provenance. Same target (the developer's npm install step), different distribution channel.

The trust prompts are a permission dialog, not a security control

VS Code, Claude Code, and Gemini CLI all show a workspace-trust prompt the first time a session starts in a new directory. The attack does not defeat those prompts. It relies on the developer granting trust the way they dismiss a cookie banner, and on the prompt flagging that a hook exists without making its 4.3 MB target obvious. Two situations skip the prompt outright: pulling the malicious commit into a repo that was already trusted, and running headless (claude -p), which disables trust verification — the CI case. The package-manager vectors have no trust gate at all. npm test, composer install, bundle install, bundle exec, and any Rails command that reads the Gemfile run their hooks as a normal part of the work.

What makes a config file dangerous

A config file is dangerous when a tool reads it and acts without asking, and when its format can carry a command. Score any config on three axes:

  • Trigger. What event reads the file? Folder open, agent session start, dependency install, test run, lint, build. The earlier the trigger fires, the more dangerous the file.
  • Authority. What stands between the trigger and execution? A folder-trust prompt on first open (still bypassable in headless mode), an LLM agent deciding whether to follow an instruction in its context (the Cursor .mdc case), or nothing at all (npm test, Composer post-install, Gemfile top-level Ruby).
  • Grammar. Whether the format can carry a shell command or arbitrary code. JSON hook configs carry commands by design. Markdown rules carry instructions that a sufficiently compliant agent will treat as commands. A Gemfile is a full programming language.

The most dangerous configs combine an early trigger, a low-authority gate, and a high-grammar format. Miasma maximised on all three.

The original take: the attack is on the threat model, not the tool

The conventional reading of Miasma is that Claude Code, Cursor, and the rest have a security bug. The bug is real — the lack of re-warning on hook changes in Claude Code is a clear gap, and claude -p should not skip the trust prompt. But the deeper issue is the threat model most security teams still operate with: "scanned the dependency tree, found no known-bad packages, the project is safe to open." That model is structurally wrong for an ecosystem where the attack is in the project's own config.

Opening a folder is the same risk class as running npm install. A new commit to .claude/settings.json, .vscode/tasks.json, package.json, composer.json, or Gemfile is the same supply-chain event as a new version of a pinned dependency. The trust prompt is a permission dialog, not a security control. The realistic compromise is to scope the blast radius: a project with access to your cloud credentials should not be the same dev environment that opens arbitrary GitHub repos.

What this means for you

  • If you use Claude Code or Cursor on third-party repos: treat the first git clone as if it were npm install. Read .claude/settings.json, .cursor/rules/, and .vscode/tasks.json before opening. A SessionStart hook or an alwaysApply: true rule is a shell command.
  • If you maintain an editor with a hook primitive: re-warn on hook changes (Gemini does; Claude Code does not), and never skip the trust prompt in headless mode by default.
  • If you maintain a package: audit package.json for preinstall/postinstall/test on every release. The Red Hat compromise shipped a one-line preinstall with valid provenance.
  • If you run AI agents in CI: claude -p is the headless trust-bypass case. Pin a commit SHA and diff config files before invoking.

What to do this week

# 1. Audit the last 10 repos you opened. For each, check:
#    .claude/settings.json   -> hooks.SessionStart
#    .gemini/settings.json   -> hooks.SessionStart
#    .cursor/rules/*.mdc     -> alwaysApply: true
#    .vscode/tasks.json      -> runOn: "folderOpen"
#    package.json            -> scripts: preinstall / postinstall / test
#    composer.json           -> scripts: post-install-cmd
#    Gemfile                 -> top-level system() or backtick calls
#    Treat any matching entry as a one-line shell command.

# 2. If you maintain a project that ships an AI-agent config:
#    - Don't add a SessionStart hook to the project's own settings.
#    - If you must, gate it: no node, no eval, no shell, no curl.
#    - Re-warn on hook command changes (the Gemini pattern).

# 3. If you run claude -p or similar in CI:
#    - Pin the commit SHA in the checkout step.
#    - Diff .claude/, .gemini/, .cursor/, .vscode/, package.json,
#      composer.json, Gemfile before invoking the agent.
#    - Treat any added hook or script as a build-breaking event.

The bottom line

The supply chain has a new surface: the project's own config. The seven Miasma files are not exotic; they are the files developers commit every day. They are an execution layer, not metadata — and the supply chain has to score them on the same axis as a dependency change.

Related reads from this blog

Sources

Linear Is Fast Because the Browser Is the Database

Linear Is Fast Because the Browser Is the Database

Disclosure: This post was researched and drafted with AI assistance. Primary source: Dennis Brotzky, "How's Linear so fast? A technical breakdown", performance.dev, 3 May 2026; surfaced on the HN front page the week of 8 June 2026. The sync-engine description, the Parcel → Rollup → Vite → Rolldown bundler arc, the React + TypeScript + MobX + Postgres + Redis + turbopuffer stack, the 50% / 30% / 59% / 70–80% build-pipeline numbers, the modulepreload + service-worker precache technique, the inlined boot script, the "render first, authenticate second" pattern, the per-property MobX observable + observer() granular re-render model, the 0.10s–0.35s transition variables, and the transform / opacity / paint / layout property tiering are all from that post. The author is an outside observer; he has never worked at Linear and has not seen their code. Architectural inferences in the "original take" section are the blog's synthesis. Stack entries and numbers were not independently verified.

A CRUD app takes 300ms to update an issue. Linear does the same update in a few milliseconds. The difference is a single architectural inversion: Linear does not treat the server as the source of truth for the UI. The server is a sync target. The database is in the browser. Almost every other optimization in Dennis Brotzky's reverse-engineering write-up — which hit the HN front page this week — is a downstream consequence of that one decision.

The architectural move worth studying in 2026 is the data layer. Everything else is downstream.

The local-first sync engine, in three parts

Brotzky's write-up is a tour, not a discovery, and the three pieces of the sync engine are the part most worth re-stating clearly.

1. The data is already there. When the app boots, it hydrates from IndexedDB into an in-memory MobX object pool, and every UI query hits that pool. There is no "loading issues" state because the issues are already on the user's machine. Heavy tables like Issue and Comment lazy-hydrate on demand: a 10,000-issue workspace boots about as fast as a 100-issue one because startup cost tracks workspace structure, not workspace size.

2. Mutations do not wait for the network. Changing a status updates the MobX observable, writes the change to a durable transaction queue in IndexedDB, and queues it for the server. The network is touched last. If the server rejects, the observable reverts and there is a brief flicker; in practice, this almost never happens because invalid mutations are caught before the transaction is even created.

3. One delta, one cell. When a server confirmation arrives — yours or a collaborator's — the client receives a small JSON envelope describing what moved and applies it by writing to the corresponding MobX observable. Because every property on every model is its own observable, MobX knows which components depend on which fields. A 50-issue update is 50 cell re-renders, not a list re-render.

Take any one of those three away and the app starts to feel slow. A local database without optimistic writes still spins on save. Optimistic writes without granular observables still jank on every update. Granular observables without a local database still wait on initial load. Linear's speed is a property of the system, not any single layer.

The first-load pipeline is a separate engineering project

If the sync engine is the answer to "feels fast while you work," the loader is the answer to "feels fast when you arrive." Brotzky's account of Linear's build pipeline is a four-migration arc — Parcel → Rollup → Vite → Rolldown — driven by the same goal each time: ship less code, faster. The numbers Linear published from their own migration: 50% less code shipped, 30% smaller after compression, cold-cache page loads 10 to 30% faster, time-to-first-paint of the active-issues view dropped 59% on Safari, memory usage dropped 70 to 80%.

The bulk of the win came from dropping legacy browsers (no polyfills, no ES5 transpilation, no nomodule fallback), tighter dead-code elimination, and aggressive code splitting. Even after all of this, Linear still ships roughly 21 MB of minified JavaScript, but split into hundreds of route-level chunks fetched on demand. The entry script fires modulepreload tags for the whole critical path so the browser parallel-fetches them before the entry script's first import resolves, collapsing the water-fall into a single parallel batch. A service worker with a precache manifest of about 1,200 hashed assets then pulls down the rest of the route chunks lazily after the first page load; within a few seconds of hitting the login screen, the full app is sitting in cache, and the app is offline-capable because the local-first sync engine already has the user's data in IndexedDB.

The boot script is the part most teams will copy first

The cheapest Linear trick to reproduce is also the one most likely to slip past you: the inlined boot logic in <head>. Before any bundle has parsed, the inline JavaScript reads localStorage.splashScreenConfig, restores the user's remembered shell tokens (sidebar background, base color, border color, sidebar width, dark mode), and applies them to document.documentElement.style. It checks whether localStorage.ApplicationStore exists. If it does, the user has used Linear in this browser before, which means their workspace is already in IndexedDB. If it does not, the shell flips to the logged-out layout and the login flow takes over.

The bundle never tries to be smart about authentication. The actual session token lives in a cookie. The next request — the WebSocket handshake, a sync delta, any HTTP call — is the thing that fails with a 401 if the session has gone stale, and the client redirects to login. Render first, authenticate second. The pattern is consistent with the rest of the architecture: trust the local, the server is the source of truth for correctness, the two reconcile asynchronously.

Stack composition: a deliberate refusal of the modern default

The stack list in the write-up is interesting mostly because of what is not in it. React, TypeScript, MobX, Postgres, a CDN, a service worker, IndexedDB. No Next.js, no React Server Components, no TanStack Query, no edge database, no fancy framework. Brotzky calls out the simplicity as a feature, not an oversight: keeping the app entirely client-side removes the constant question of "am I on the server or the client" and gives a single mental model for the entire app.

Backend is Node.js + TypeScript, PostgreSQL on Cloud SQL with the issues table partitioned 300 ways, Memorystore Redis as event bus + cache + sync cursors, turbopuffer for similar-issue vector search, Kubernetes on GCP with one workload per concern, and Cloudflare Workers as a multi-region edge proxy. The two big concessions to the modern web are Rolldown-Vite (with plugin-react-oxc, not @vitejs/plugin-react) and the inline app shell in the head. Everything else is straight 2018-React-with-MobX, and that is a deliberate choice: the technology that ships the data fastest is the technology that ships the data.

The original take: the design is also the bottleneck

Most write-ups of Linear's performance end on the bundler or the sync engine. The post's most underrated observation is in the "Designed for speed" section: a perfectly built sync engine still loses to a slow input model. If the fastest path to an action requires a mouse, three menus, and a click, the user pays for those steps regardless of how fast the engine runs.

Single letters edit the focused issue. Two-letter combos navigate. ⌘ K opens a command palette that searches the local MobX object pool, not a server. Every common action has a shortcut, and every action can be done with a mouse. Engineering speed makes a single interaction fast. Design speed makes the path to each interaction short. For a tool used all day, the difference between a shortcut and a two-second mouse path compounds over every action.

The animation rules complete the same thesis. Browsers have three tiers of property changes — composited (transform, opacity), paint (color, background-color, border-color, fill), and layout (width, height, top, left, margin, padding) — and Linear only animates the first two. The margin-left: 2px; transition: all 0.2s example in the post is a perfect villain: a small visual change that recomputes the layout of every row beneath the hovered one, on every frame, for the full 200ms of the transition. Durations sit at 0.10s–0.35s, well below the 100ms cause-and-effect threshold, and Linear defaults to asymmetric timing — instant on enter, 150ms fade on exit.

The synthesis most people will miss: the fast app is one where every layer is in the same conversation. The data is local, the mutations are optimistic, the observables are granular, the input is keyboard-first, the animations stay on the GPU, the loader ships less code, and the service worker fills in the gaps. None of those are the trick. The trick is the discipline of refusing to let any one layer leak latency into the next.

What this means for you

  • If your team treats the server as the source of truth for the UI: the cheapest single change is the optimistic update. SWR and TanStack Query both support it; the mutate(key, optimistic, false) pattern gets you surprisingly close to Linear's feel without rewriting the data layer.
  • If you maintain a Vite or Rollup config: the manualChunks pattern in the post — one chunk per npm package above ~3 KB, cached independently — is the move. Bump a single dependency, invalidate one chunk, not the whole vendor graph.
  • If you animate anything in a tool used all day: audit your CSS for transition: all. Replace margin and padding animations with transform. Default new transitions to 0.1s–0.25s, not 0.3s. The 100ms cause-and-effect threshold is real.
  • If you build for slow networks or emerging markets: the service-worker precache + modulepreload pair is the single highest-leverage combination in the post. It collapses a multi-second cold load into a single parallel batch and makes the rest of the app offline-capable for free.

What to do this week

# 1. If your app makes a /me or /api/user call before rendering:
#    - Add the inlined localStorage boot check to your <head>.
#    - If localStorage.<your-app-store> exists, render the shell
#      immediately and let the next request do the 401 detection.
#    - One inline script removes one round-trip from every cold load.

# 2. If you maintain a Vite config:
#    - Switch to per-package manualChunks above ~3 KB.
#    - Add <link rel=modulepreload> tags for the critical-path
#      vendor chunks in your index.html template.
#    - Add a service worker with a precache manifest of route chunks.
#      Warm the cache in the background after first paint.

# 3. If you build for slow networks or emerging markets:
#    - The service-worker precache + modulepreload pair is the
#      single highest-leverage combination. It collapses a
#      multi-second cold load into a single parallel batch and
#      makes the rest of the app offline-capable for free.

The bottom line

Linear feels fast because of a single architectural decision: the data the user came to edit is already on their machine. Rolldown-Vite, modulepreload, the service worker, MobX, the IndexedDB hydration, the boot script, the keyboard-first input model, the animation tiers — all downstream of it. If you want a fast web app, the question is "why is my CRUD waiting on the network at all," and the answer in 2026 is "it does not have to."

Related reads from this blog

Sources

Sunday, June 7, 2026

Speculative KV Coding: 4× Lossless Cache Compression

Speculative KV Coding: 4× Lossless Cache Compression

Disclosure: This post was researched and drafted with AI assistance. Primary source: "kkm", "Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model", fergusfinn.com, posted 8 May 2026; surfaced on the HN front page on 4 June 2026. The arithmetic-coder framing, the 11-bits-per-scalar bf16 cache entropy number, the ~4× lossless / ~8× gross compression claim, and the analogy to Leviathan et al.'s speculative decoding (2022) are all from the post. The "predictor is the product" framing in the original-take section is the author's synthesis. The comments quoted in the discussion section are real HN comments on that thread, permalinked to the right authors; we did not paraphrase around them. Benchmarks were not independently reproduced.

In 2026, the bottleneck in long-context LLM serving is VRAM holding the KV cache and PCIe moving it — not flops. As agentic workflows (coding agents, long-document RAG, multi-hour research sessions) push average context windows past the 200K mark, the cache stops being "a little memory" and starts being the dominant line item on the inference bill. A new write-up from kkm on fergusfinn.com describes a method called Speculative KV coding that gets you up to ~4× lossless compression of the cache using a cheaper predictor model, stacking on top of the lossy FP8 compression everyone is already doing for a gross ~8× reduction. The post hit the HN front page on June 4 with 79 points and a comment thread that is, unusually, a real engineering discussion rather than a flame war. It deserves more attention than the ranking suggests.

The headline is "4×." The interesting number is buried in the setup cost.

What speculative KV coding actually does

The classical way to make a KV cache smaller is lossy quantization: drop K and V from bf16 to FP8 (or FP4), accept the quality hit, and run evals until your benchmarks stop screaming. TurboQuant is the most-discussed recent example of this family, and sits in the same conceptual neighborhood as the fergusfinn post.

Speculative KV coding is a different move. It is lossless — the reconstructed cache is bit-identical to the original — and it works by analogy with speculative decoding (Leviathan, Kalman, Matias, 2022):

  1. Pick a predictor model — a smaller, faster model whose forward pass on the same prompt gives a per-scalar guess μ and a calibrated uncertainty σ² of the target model's KV cache.
  2. Both the encoder (who has access to the target model) and the decoder (who will reconstruct the cache) run the predictor on the prompt. The predictor is cheap, so running it twice is fine. Both sides end up with the same (μ, σ) per scalar.
  3. The encoder runs the target model to get the real KV cache, then feeds (KV_full, μ, σ) into an arithmetic coder (the same family of coders behind rANS / tANS — the post links to prior work on both). The coder emits a bitstream whose length is bounded by the cross-entropy H(p, q) = H(p) + KL(p || q). Because the KV cache is a deterministic function of weights and prompt, its "true" entropy is zero; every bit the coder emits is pure KL against the predictor.
  4. The decoder consumes the bitstream alongside its locally reconstructed (μ, σ) and recovers KV_full exactly.

The whole point is the split cost. The encoder pays one full target-model forward pass (it has to, that's the only way to get the real cache). The decoder pays a predictor forward pass per token and some arithmetic. The bandwidth between them is just the bitstream. In a long-context agent session, the decoder side is the side that runs many many times — the encoder is prefill-once, the decoder is decode-many — so the asymmetry is the entire point of the method.

The numbers that matter

The post gives three numbers worth keeping in your head.

  • bf16 KV cache is about 11 bits per scalar of bytewise entropy, roughly 30% smaller than the raw 16-bit format. So even a perfect general-purpose entropy coder, with no model of the cache at all, gets you ~1.45×. That is the floor.
  • ~4× lossless compression of a bf16 cache with the predictor-model approach. The author is explicit that this is on top of any lossy FP8 quantization you were already doing — which, because the bf16→FP8 step is already saving 2× on its own, gives a gross ~8× reduction in cache size for an FP8 cache you are now losslessly compressing.
  • The bitrate is ~½ ln(2Ï€e σ²) bits per scalar in expectation, which is just log(typical error magnitude). Better predictor → smaller typical error → fewer bits. The marginal cost of a smarter predictor is paid in flops; the marginal benefit is paid in VRAM and bandwidth. The arbitrage is in the ratio.

That last equation is the reason this is interesting. The cost of a forward pass through a predictor model scales with the predictor's parameters. The savings scale with how well that predictor's μ matches the target's KV_full. There is a break-even point, and the post is careful to say it does not yet know exactly where.

The comment thread is the real story

The HN discussion is roughly a dozen comments long and unusually high-signal. Three exchanges in particular are worth quoting at length.

wongarsu lays out the cost curve: "The tradeoff gets better the bigger your primary model, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model. For serving a 1T model with 16 concurrent requests this could make a lot of sense. For a 8B model with a single request far less so."

That is the post in one sentence. The economics only flip in your favor when the cache is genuinely expensive, which today means frontier models in production serving, not your laptop running Llama 8B.

0-_-0 raises the obvious follow-up: "You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it." That is the trivial upper bound the post walks you through, and yes, paying a full target forward pass to predict your own forward pass is silly. The author's framing is that the predictor needs to be cheaper than the target — and the choice of predictor is the cost-versus-bits tradeoff the whole post is organized around.

saagarjha makes the cleaner point: "Speculation is only worth it if you can profit from it. Not every context allows this or has a similar idea of what can be speculated." A predictor model only helps if its forward pass on the prompt is correlated with the target's forward pass. If you pick a predictor that is just a bag of weights with no shared structure, you get the floor (the 11-bit entropy, ~1.45×). The post's choice — "an optimised version of the same model" — is the obvious, principled answer: same architecture, same prompt, same attention pattern, just a cheaper optimization. That is what makes the conditional entropy H(KV_full | M_pred(prompt)) actually small.

The original take: the cheap predictor is the product, not the compression

Most coverage of compression releases treats the codec as the product and the predictor as a black box. The framing in this post has it exactly backwards, and that is the part most people will miss.

The codec is two pages of rANS. It is the part that has been solved for twenty years. The predictor is the part that has just become cheap enough to use, because in 2026 you can serve a small open-weights model in a few hundred milliseconds on a single GPU. The cost of running a 1B-parameter predictor model on a 200K-token prompt, in 2026, is in the range of seconds. The cost of not compressing your 1T-parameter target model's KV cache is in the range of not-fitting-it-in-memory.

That cost curve is what makes the method timeable. Two years ago, the predictor would have been a quarter of the target's flops and the arbitrage would not have closed. Two years from now, the predictor will be a single forward pass of a distilled version of the same model trained specifically to predict the target's cache, and the 4× number will probably be 6×. The interesting number is the one we will get when someone trains that predictor end-to-end.

Expect the first production deployments to stay within a single model family — same architecture, same tokenizer, same training data. Cross-family prediction is possible (the arithmetic coder is still lossless) but the bitrate will be much higher because the conditional entropy gets larger as the predictor and target diverge.

What this means for you

Four reader profiles, four different calls:

  • If you serve frontier models in production (≥ 70B, long contexts, batched traffic): the 4× lossless number is real and the 8× gross number is the one that matters for your VRAM bill. This is the deployment profile wongarsu describes, and it is the only profile where the cost curve is unambiguously in your favor. The integration cost is one predictor-model forward pass per request, which is in the noise relative to a 1T-class prefill.
  • If you run smaller models locally (8B–13B, single-user, sub-100K context): you are on the wrong side of the break-even. The predictor model would cost you a meaningful fraction of the target's flops, and the cache is not the bottleneck yet. Hold off.
  • If you build agentic systems: this is the workload that should make you care the most. An agent loop that holds a 500K-token context across many turns is paying the cache cost on every decode. The 4× compression is bandwidth between your LLM provider and your agent runtime, which today is the single biggest cap on agent session length. Watch for vendor support here first; it will land in the inference stacks that already do speculative decoding (vLLM, TensorRT-LLM, SGLang) before it shows up in any closed API.
  • If you build ML systems for a living: the predictor-quality story is the next thing to pay attention to. A predictor model trained end-to-end to minimize the cross-entropy of (target KV | predictor forward pass) is a much smaller research project than the codec work was, and the marginal value is large. The post is essentially an open call for that work.

What to do this week

# 1. If you maintain a vLLM / TensorRT-LLM / SGLang fork:
#    - Find the existing speculative-decoding code path.
#    - The (encoder, decoder) asymmetry it implements is structurally
#      identical to what Speculative KV coding needs.
#    - The 4× number is a vRAM win, not a flops win. Plan the benchmark
#      for batched traffic, not single-stream.

# 2. If you serve a frontier model with > 200K context windows:
#    - Measure the share of your inference cost that is cache storage
#      and cache transfer. If it is < 20%, skip. If it is > 40%, this
#      is worth a prototype.
#    - Start with same-family predictor (e.g., target = Llama-3 70B,
#      predictor = Llama-3 8B at INT4). Cross-family is a research
#      project, not a deploy.

# 3. If you build agents: be ready to switch inference providers
#    the day one of them ships this. An 8× cache-size win is the
#    difference between a 30-minute session and a 4-hour session
#    on the same hardware, and whichever provider gets there first
#    is the one whose API key ends up in your framework's default.

# 4. If you write ML systems posts: do not lead with "4× lossless
#    compression." Lead with "the predictor model is the product."
#    That is the framing nobody else has and it is the part of
#    the post that will still be true in 2028.

The bottom line

Speculative KV coding is not a clever codec trick. It is a cost-curve observation: that a 1B-parameter predictor model in 2026 is cheap enough to run as a side computation, that a frontier model's KV cache is expensive enough to make that side computation worth it, and that the gap between those two facts has been closing for three years and will continue to close. The 4× number is real. The interesting question is what the number will be in twelve months, when the predictor is trained end-to-end against the target's actual cache distribution, and the answer is almost certainly "larger than 4×, and the predictor itself is the thing someone ships as a model."

This is the post to send to the engineer on your team who keeps saying "we'll just quantize harder." It is also the post to send to the person who keeps saying "VRAM is the new FLOPS." Both of them are right. The 2026 argument is about which side of the cost curve you are on, and this Speculative KV coding write-up is the cleanest published version of that argument I have read.

Related reads from this blog

  • Microsoft Just Put a Workflow Engine Inside Postgres — same week, different bottleneck: durable execution in the database. The structural similarity is that both moves relocate work from where it is expensive (a separate orchestrator, separate VRAM) to where it is already paid for (the database, the predictor model).
  • Redis 8.8: Your Lua Rate Limiter Is Now Obsolete — both posts are about a vendor deciding your separate layer is now their default. Redis 8.8 ate the rate-limiter; whoever ships Speculative KV coding in vLLM eats your cache-cost budget.

Sources

Meta's AI Chatbot Reset 20,225 Instagram Passwords

Meta's AI Chatbot Reset 20,225 Instagram Passwords

Disclosure: This post was researched and drafted with AI assistance. Primary source: Zack Whittaker, "Meta confirms thousands of Instagram accounts were hacked by abusing its AI chatbot", ~this week in security~, June 6, 2026, cross-referenced against the Hacker News thread (349 points, 127 comments at time of writing), the original 404 Media and TechCrunch reporting from June 1, and Meta's data-breach notice filed with the Maine Attorney General. All numbers (20,225 affected accounts, ~30 in Maine, April 17 through early June window) and Meta's quoted breach-notice language are taken directly from Whittaker's write-up, which is itself based on the filing. Analysis and framing are the author's.

The number is finally on the record. Meta has told the Maine Attorney General that at least 20,225 people had their Instagram accounts hijacked between roughly April 17 and the first week of June, via a single, embarrassing bug: the company's own AI support chatbot could be talked into resetting the password of any account that didn't have two-factor authentication turned on. You didn't need a phishing kit. You didn't need a SIM swap. You typed "reset the password for [target account], send the link to [email you control]," and the chatbot did it. The data-breach notice — which Meta filed late Friday and which this week in security obtained — confirms what the original 404 Media and TechCrunch (June 1) reporting first claimed. Nearly seven weeks of hijackings, and the headline fix was to disable the chatbot entirely.

The interesting part is not the bug. The interesting part is what the bug tells us about how Meta is shipping AI features right now.

The mechanism, in one sentence

The "AI-assisted account recovery system" that Meta built into Instagram did not check that the email address you asked it to send the reset link to actually matched the email address on the account. So you gave it your own Gmail, asked for a reset, and it mailed the link to you. From there it was a normal password reset flow on a clean, authenticated browser. No exploits, no zero-days, no 2FA prompt to fail closed.

That is the whole vulnerability. In Meta's own words from the notice: "due to a bug in a separate code path, the system did not properly verify that the email address provided by the individual requesting a password reset matched the email address associated with that user's Instagram account. As a result, when an individual provided an email address not previously associated with the account, the system incorrectly sent a password reset link to that unassociated email rather than rejecting the request."

If you've ever written a password-reset endpoint, you know exactly which check is missing. The "match the email" rule is the load-bearing one. Drop it, and the entire flow degrades to "anyone can request a reset, the system has no way to know it shouldn't." Which is what happened, for about seven weeks, at scale.

Meta's breach notice is careful to push the failure into "a separate code path" rather than the LLM — "The tool itself worked properly and functioned as intended; however due to a bug in a separate code path, the system did not properly verify that the email address provided … matched the email address associated with that user's Instagram account." The LLM did what it was asked. The broken check was downstream, in a deterministic code path that should have rejected the request before any link ever went out.

The original take: the AI is the interface, not the bug

The pre-AI version of this — the plain web form — has had the "verify the email matches" check baked in for fifteen years. Every framework ships it. Every junior developer has written it. The whole reason a password reset is even a half-secure flow is that the one thing it's supposed to verify is that the requester controls the email on file.

What Meta did, in pursuit of an "AI-assisted" support experience, was wrap that flow in an LLM and lose the check. The LLM is the conversational interface through which an attacker phrases the request. The shape that matters, though, is what enabled the missing check to ship: the AI layer was treated as a service that could call the password-reset primitive, and the hard server-side invariant ("the email on the request must equal the email on file") stopped being load-bearing. It became one of several "validations" the AI could route around by rephrasing. The interface, in other words, is the policy. Whether Meta wants to call that "the tool worked" or "a bug in a separate code path" is a phrasing preference; the structural fact is that the AI layer is the only thing standing between an arbitrary prompt and a password-reset email.

This is the same shape as the smart-TV residential-proxy SDK story from yesterday, and the reason it keeps showing up is the same: a feature was added on top of an existing surface, the integration loosened a check that the underlying surface was relying on, and the failure mode was quiet. The proxy SDK didn't need to be malicious. The chatbot didn't need to be jailbroken. They just needed to be less careful than the thing they were augmenting.

The 2FA detail is the only thing that limited the blast radius

If you want to know how bad this could have been, count the people who didn't get hit. The whole reason the number is 20,225 and not 20,225,000 is that the attack only worked against accounts without two-factor authentication enabled. Anyone with a TOTP authenticator, a hardware key, or even SMS-based 2FA turned on would have hit a second wall the attacker couldn't get past, because the password reset alone wouldn't be enough.

This is a useful data point. It is also the only one. Most consumer services do not publish what fraction of their user base has 2FA on, but the honest internal number at most large consumer apps is in the low single digits for the methods that actually stop this kind of attack. SMS 2FA is the most common form by far, and it has its own bypass ecosystem. The attackers who found this bug were not 2FA-on accounts; they were the long tail of accounts whose owners never opened the security settings screen.

Meta did not, in the breach notice, disclose how many of the 20,225 victims had been notified that 2FA was available. The notice instructs users to "reset passwords and re-authenticate through secure, verified channels"; turning on 2FA is the obvious next step, and the only one the framing of the notice points users toward. That framing places the cost of the company's architectural mistake on the user.

The layoffs are not a coincidence

The original 404 Media piece on the bug, and Whittaker's follow-up, both land the observation that the hack came shortly after Meta laid off thousands of employees while continuing to reward top executives with stock incentives. The instinct is to read that as a one-line "context aside." It isn't. It is the causal mechanism.

An account-recovery system that wraps a password reset in an LLM is the kind of feature that gets greenlit by a product manager who needs an "AI-powered" demo for a quarterly review, and that gets shipped by an engineering team that is two reorganizations smaller than it was a year ago. The team that would have caught the missing email-match check in code review is one of the teams that has been told, in 2026, that its function is being consolidated. The security team that would have flagged "we are letting a non-deterministic model arbitrate a security primitive" is the team whose headcount was cut to make the margin number. The result is exactly what you would predict: a feature that, in a smaller and more cautious Meta, would not have shipped, did ship, and shipped wrong. This is our read, not Meta's — but it is a read that the breach notice conspicuously does not refute.

This is not a story about a single bug. It is a story about the kind of bug a company ships when its incentive structure rewards "AI features shipped" over "AI features shipped safely." Meta's quarterly calls are full of AI capability announcements; the risk-disclosure language is, by a wide margin, the shorter section. The 20,225 figure is what the gap looks like when it finally shows up in a regulatory filing.

What Meta actually did about it

Three things, in the notice:

  1. Disabled the AI chatbot for now.
  2. Removed the code path that allowed the chatbot to reset user accounts.
  3. Said it is "checking other chatbots across its platforms to prevent a repeat incident."

Item 3 is the one to watch. If "checking other chatbots" turns into "we also removed password-reset capabilities from our other AI support surfaces," that is a real fix. If "checking" turns into "we reviewed the prompts and added a system message," that is a security-theater answer — a language-model guardrail on a security primitive that should be enforced in code. The history of these incidents is that the second answer is much more common than the first, because the second answer is faster and the executives who set the security budget are the same executives who set the AI-ship budget. There is no organizational structure that resolves this without a regulator forcing it.

What this means for you

If you have an Instagram account and you do not have 2FA on, turn it on today. Use a TOTP authenticator (Authy, 1Password, Google Authenticator) rather than SMS — SMS-based 2FA is bypassable by carrier-port attacks, and you do not want your second factor to be weaker than the bug that broke the first one. If you run any consumer-facing service with a password-reset flow, audit it this week for the exact check Meta forgot: that the address the reset link is sent to is the address on file, and that the change-of-email path requires the existing email to confirm. The pre-LLM-era server-side check. The boring one. It still matters, and the fact that a company with the resources of Meta missed it is not a reason to skip it — it's a reason to add it explicitly to your test plan.

If you are on a product team that is being asked to wrap an existing security-sensitive flow in an LLM: refuse, on the record, and copy the security team. The cost of being the person who said "this should not be a model decision" when the postmortem gets written is much lower than the cost of being the person who didn't say it. A language model is the wrong place to enforce an invariant that has to hold every time. Use a model to interpret the request, then call a hard server-side check that is deterministic, reviewable, and covered by a test that has been there since 2015.

What to do this week

# 1. Audit your password-reset flow for the missing email-match check.
#    In your reset handler, the logic must include:
#
#    if requested_email != on_file_email:
#        reject("email does not match account")
#
#    If you can find a path where this check is missing, you have the bug.
#
# 2. If you have AI in any password, account-recovery, or 2FA path,
#    confirm the model is *advisory* and the server enforces the rule.
#    Grep your repo for "openai", "anthropic", "claude", "llm" near
#    auth/, login/, reset/, recovery/, 2fa/, otp/.
#
# 3. Turn on TOTP 2FA on every account that supports it.
#    SMS 2FA is better than nothing; TOTP is the floor.

The Meta breach is going to age into a textbook case, the same way the Cloudflare Just Bought the Build Tool That Runs the Web, Redis 8.8: Your Lua Rate Limiter Is Now Obsolete, and Gemma 4 12B Just Killed the Multimodal Encoder stories from earlier this month will. The category of bug is new: "we let a model be the policy." The lesson is older than the technology. The technology just made it cheaper to ship the wrong version of it.

Who pays the cost when the org chart says the security team is too expensive to keep around, and the product team is too important to slow down?

Saturday, June 6, 2026

Your Smart TV Is a Node in the AI Scraping Economy

Your Smart TV Is a Node in the AI Scraping Economy

Disclosure: This post was researched and drafted with AI assistance. Primary source: buchodi / Include Security, The Smart TV in Your Living Room Is a Node in the AIScraping Economy (June 5, 2026), cross-referenced against the Hacker News front-page discussion (85 points, 19 comments at time of writing). All claims, framework versions, endpoint hostnames, and per-country bandwidth tiers are taken directly from the buchodi write-up, which itself documents the reverse-engineering of a consent-installed partner app over 30 days. Analysis and framing are the author's.

The write-up of the week is buchodi's at Include Security: a forensic look at Bright Data's "consent SDK" for residential proxying, and an argument — backed by reverse-engineered binaries and 30 days of captured traffic — that the connected TV in your living room is the ideal exit node for the AI training data economy. The interesting part is not the SDK itself, but that the legal supply side of the residential-proxy market has been engineered to be invisible to the people whose homes it runs in. Most of the existing press is looking at the illegal supply side and missing it.

Why the TV, not the phone

The reason CTV (connected TV — any TV with a built-in internet connection and apps, including Roku, Apple TV, Fire TV, and smart TVs from Samsung, LG, etc.) matters more than the mobile phone — where the same SDK already lives in apps like EarnApp and XYO COIN — is form factor:

Factor Mobile phone Smart TV / CTV
Power Battery most of the day Always plugged in
Network WiFi + cellular Always WiFi, high-speed
Uptime Intermittent 24/7 in standby
Bandwidth ceiling Low (cellular caps) Effectively unlimited
User attention Actively used Often unattended
Corporate / family oversight Higher (MDM, mobile EDR) Virtually none

A phone hits 1% battery, gets locked, jumps networks, and has EDR (endpoint detection and response — software that monitors a device for suspicious behavior, common on corporate and BYOD phones) watching it. A TV in your guest room doesn't. Once the SDK is past its install screen, it owns a residential IP that is online every night while the user is asleep, on a fast unmetered connection, in a household that has no idea it's running.

How the SDK actually works

The protocol design is the part most people will find surprising, because the implementation choices are deliberately aimed at the mobile app-security tooling that would normally catch this kind of behavior.

The config endpoint is unauthenticated. On every launch, the SDK calls https://clientsdk.bright-sdk.com/sdk_config_ios.json?appid=<bundle>&ver=<sdk-version>&uuid=sdk-ios-<32hex>. The server only gates on appid (a bundle ID you can read off the App Store listing) and ver (an SDK version string). Pass any random UUID, get the same config a real device gets: feature flags, idle thresholds, country bandwidth caps, and the partner manifest.

The peer tunnel is a plain WebSocket. After config fetch, the SDK opens a persistent wss://proxyjs.brdtnet.com:443. The TLS cert is CN=*.luminatinet.com — the corporate name Bright Data used before its 2018 rebrand. Active SDK infrastructure still runs on the legacy cert, which is a clean detection pivot: any *.luminatinet.com or *.brdtnet.com traffic on your network is specifically the peer-tunnel plane, not customer-side Bright Data usage.

No message signing, no client certificate, no device attestation. The server filters peers by IP reputation. The IPC envelope is plain JSON with commands like tunnel_init, cid_set, status_get, and cmd_tun. Once the device reports favorable idle state, the server pushes a cmd_tun frame, which the SDK executes as a real HTTP request against a third-party site, sourced from your residential IP.

The idle rules are not what you think they are

The config ships an explicit rulebook for when the device is eligible to relay someone else's traffic:

"idle_metrics": {
  "ignore_screen_on": true,
  "ignore_on_call": true,
  "max_bw_ratio": 1,
  "min_battery": 0.2,
  "wifi_on_battery": true,
  "min_battery_wifi": 0.2,
  "max_cpu_usage": 70,
  "max_mem_usage": 90,
  "mem_screen_off": true,
  "idle_timeout": 30,
  "not_idle_timeout": 10
}

The ignore_screen_on and ignore_on_call flags are the important ones. In the SDK's rulebook, "idle" means the device's CPU, memory, and battery are within thresholds — not that the user is away. A user actively on a phone call, reading the screen, counts as idle. So does a TV in the background during dinner.

"Consent" is a TV-remote problem

This is where most coverage is going to get the framing wrong. Petflix — a Roku app documented by The Verge and cited by buchodi as a representative consent-dialog example (not a partner-manifest entry) — has a consent screen that reads:

"To enjoy Petflix for free with fewer ads, you are allowing Bright Data to occasionally use your device's free resources and IP address to download public web data from the internet. Bright Data will only use your IP address for approved business-related use cases. None of your personal information is accessed or collected except your IP address. Period."

The word "occasionally" does a lot of work. The same SDK's publicly queryable config sets max_bw_monthly_wifi: 200,000,000,000 bytes — a 200 GB default monthly WiFi budget. Privacy-policy disclosure on a TV navigated by arrow keys is the wrong control surface.

The VPN bypass is the actual problem for security teams

The single technical finding that should change how enterprise security teams think about this SDK is the use_netifs flag, which triggers code in the binary that constructs its NWConnection with a specific requiredInterfaceen0 (WiFi) or pdp_ip0 (cellular) — rather than the system default route. On iOS, this bypasses any configured VPN's tun0 (the virtual network interface a VPN creates on the device) entirely. The peer tunnel does not cross a user-configured VPN, even when the rest of the app's HTTPS traffic does.

Buchodi verified this empirically with transparent TLS interception: every HTTPS call the SDK made was captured except the peer tunnel to proxyjs.brdtnet.com:443, despite port 443 being explicitly redirected to the inspector.

The SDK uses two independent inspection bypasses, one per plane:

  • Control plane (config fetch, telemetry): built on CFHTTPMessage primitives rather than URLSession. This defeats URLSession-level instrumentation (swizzling, network extensions, URLProtocol subclasses) commonly used in mobile app-security tooling.
  • Data plane (peer tunnel): built on NWConnection with requiredInterface set to the physical interface. This is what defeats VPNs and ensures the scraping is executed from a residential IP.

Both choices are legitimate Apple APIs. The combination is the interesting artifact: the data plane is invisible to VPN-based inspection and the control plane is invisible to URLSession-based hooks. Researchers who rely on either single technique see only half the SDK's behavior. For enterprise security teams running MDM (mobile device management — software that lets an organization enforce policy on phones and tablets, typically installed on company-issued or BYOD devices), corporate-VPN traffic inspection, or home-router parental controls: the most sensitive channel this SDK operates is designed to go around your visibility layer.

The original take: legal ≠ invisible

The wider story this drops into is the AI training data economy. Cloudflare's pay-per-crawl program, the Gemma 4 multimodal encoder consolidation we covered a few days ago, the rise of rate-limited retrieval-augmented agents — all of this is downstream of an LLM training pipeline that depends on scraping data that increasingly has owners who would prefer not to give it up. Residential proxies are how scrapers route around that resistance. They are the load-bearing infrastructure of the post-Cloudflare web.

Most of the press on residential proxies has focused on the illegal supply side: botnets like Aisuru and Kimwolf, trojanized apps like the HUMAN Security PROXYLIB disclosure, pre-infected IoT hardware in the Google/Mandiant IPIDEA takedown. The FBI issued a formal advisory earlier this year. These are the bad actors. They are also the ones that get reported on, because they have obvious victims and obvious villains.

Bright Data is the legal supply side. The SDK ships as a documented commercial product. The "consent" comes from a publisher that put it in their app's EULA. The user is told the device is being monetized, in language designed to be skimmed past on a TV. The scraping jobs that go through the network are bound to be "approved business-related use cases" because Bright Data is also the customer side and gets to define what that means.

What this changes is the defensive posture itself: the press, the takedowns, the FBI advisories have implicitly assumed the supply side is a thing that gets installed on a victim's device by an adversary, not a thing the victim consented to. The defensive posture does not currently distinguish between a TV that has been rooted by a botnet herder and a TV that has been enrolled in a "free ad-supported app." From the perspective of network telemetry, both are the same: an iOS device on a residential IP, opening a long-lived WebSocket to proxyjs.brdtnet.com, executing inbound HTTP jobs. The detection signal is the same. The remediation story is harder.

What this means for you

Home / small business / school network you control — the buchodi write-up gives you five DNS hostnames to block at the router. They will not affect any customer who legitimately uses Bright Data's customer-facing proxy service on a different domain.

# Block at your router's DNS — Pi-hole, NextDNS, Cloudflare Gateway, OpenWrt+dnsmasq, etc.
proxyjs.brdtnet.com
proxyjs.luminatinet.com
proxyjs.bright-sdk.com
clientsdk.bright-sdk.com
clientsdk.brdtnet.com

For deeper inspection: TLS SNI (Server Name Indication — the unencrypted hostname field in a TLS handshake, readable at the network boundary without decrypting the traffic) filtering on *.brdtnet.com, *.luminatinet.com, *.luminati.io works at the network boundary without TLS interception. The *.brdtnet.com and *.luminatinet.com TLS certificate fingerprints are stable until the next Sectigo rotation (current certs valid through mid-2026, per the write-up).

Corporate security stack relying on VPN-based traffic inspection or MDM with URLSession-level instrumentation — the use_netifs + CFHTTPMessage combination is built to defeat both. Add a host-based or app-store binary check for the Swift symbols BrdWebSocketFacade and BrdNetwork.DNSResolver to your managed-fleet scanning.

If you build consumer apps or CTV platforms — the most uncomfortable finding is the per-country bandwidth tier table, which suggests deliberate market segmentation:

Country Min battery to relay Daily cap Monthly cap
Uzbekistan 1% 1 GB 30 GB
Oman 1% 1 GB 30 GB
Qatar 20% 40 MB 250 MB
UAE 20% 40 MB 250 MB
Default (worldwide) 20% 50 MB 500 MB

Uzbekistan and Oman devices are permitted to relay down to 1% battery, with daily caps 20× the default and monthly caps 60× the default. The default-worldwide allowance still permits 500 MB of someone else's traffic per month over the user's home internet. There is a market design choice being made here that the consumer-facing copy does not describe.

What to do this week

The 30-day experiment in the buchodi write-up is reproducible without any special tooling. On a spare iOS device with mitmproxy and a partner app installed (XYO COIN is publicly named in the research), you can capture the same clientsdk.bright-sdk.com config fetch, the same wss://proxyjs.brdtnet.com:443 upgrade, and the same JSON envelopes — ipc_call with cmd=tunnel_init / cmd=cid_set. You will also see, in your own network logs, that the tunnel does not cross the iOS device's VPN if you have one configured. That is the part that is hard to argue with.

The bigger question — whether the consent-dialog model for residential-proxy enrollment survives the moment a regulator or a major platform holder decides to look at the SDK's actual config vs. its marketing copy — is one this post is not going to answer. But the buchodi write-up is now the public artifact that lets the question be asked in concrete terms, and that is the part that is going to matter.


Related on the blog: Cloudflare Just Bought the Build Tool That Runs the Web (the upstream half of the scraping-detection story), Redis 8.8: Your Lua Rate Limiter Is Now Obsolete (where rate-limited scrape traffic ends up), and Gemma 4 12B Just Killed the Multimodal Encoder (where the scraped data is going).

Key terms used in this post: CTV = connected TV (a TV with built-in internet and apps, including Roku, Apple TV, Fire TV, and most smart TVs); MDM = mobile device management (software that lets an organization enforce policy on phones and tablets, common on company-issued and BYOD devices); EDR = endpoint detection and response (software that monitors a device for suspicious behavior, common on corporate endpoints); SNI = Server Name Indication (the unencrypted hostname field in a TLS handshake, visible at the network boundary without decryption); tun0 = the virtual network interface a VPN creates on a device, which most traffic-inspection tools rely on for visibility.