Xiaomi Hit 1000 t/s on a 1T Model. The Race Just Changed.
Disclosure: This post was researched and drafted with AI assistance. Primary source: Xiaomi MiMo Team, "MiMo-V2.5-Pro-UltraSpeed", mimo.xiaomi.com/blog/mimo-tilert-1000tps, 8 June 2026 (HN front page 9 June 2026, 476 points). Secondary: DFlash paper, arXiv:2602.06036; HN thread 48446639; TileRT blog. The 1000 tps figure, the 1T-parameter MoE, the 8-GPU single-node footprint, MXFP4 on Experts only, DFlash block-level drafting with 6.30 / 5.56 / 4.29 acceptance on Coding / Math / Agent, the 9–23 June 2026 trial window, the 3× base-cost pricing, the FP4-DFlash checkpoint on HuggingFace, and the TileRT persistent-kernel / warp-specialization execution model are all from those sources. The quoted phrases "essentially on par" and "one breath per verification round" are direct lifts from the Xiaomi blog post. The "speed is the new scaling" thesis, the parallel-reasoning / coding-agent / real-time-decision-loops downstream taxonomy, the experts-only quantization observation, and "the original take" are the blog's own. The "~42B active parameters" figure is one HN commenter's read of the architecture, presented as such, not a confirmed spec.
A 1-trillion-parameter model, generating roughly 1,000 tokens per second, on a single 8-GPU commodity node. That is the headline from Xiaomi and TileRT on 8 June 2026. For two years the axis was "bigger model wins." As of this week it is "fast model wins," and the new speed comes not from exotic silicon but from how you quantize the experts, how you draft the next block of tokens, and how you keep the GPU pipeline full. The 1000-tps number is not a vanity stat. It is a step change that lets a frontier-class model enter real-time decision loops — and the model weights are public, on HuggingFace, today.
What Xiaomi actually claims: 1T at 1000 tps on one 8-GPU node
MiMo-V2.5-Pro-UltraSpeed is a 1-trillion-parameter Mixture-of-Experts model with roughly 42B parameters active per token, per one HN commenter's read of the architecture (Xiaomi's post does not state the active-params figure explicitly). Decode speed is 1000+ tps, peaking near 1200 tps. It runs on a single standard 8-GPU commodity node: no wafer-scale Cerebras, no on-chip SRAM Groq, no bespoke interconnect. Pricing is 3× the cost of standard MiMo-V2.5-Pro for ~10× the generation speed. Access is by application only, with a trial window of 9–23 June 2026 (Beijing time). The FP4-DFlash checkpoint is open-sourced. A frontier-tier model, made fast, on off-the-shelf hardware, with the weights shipped. That is the shape that makes the number land.
How they got there: model-system codesign, not one trick
FP4 quantization on the experts only. The 1T model is MoE. Most parameters live in the Experts, and Experts tolerate low-bit quantization much better than the rest of the model. Xiaomi quantizes only the Experts to MXFP4 (the OCP Microscaling spec) and leaves the rest at higher precision. Quantization-aware training keeps the capability "essentially on par" with the FP8 baseline. This is not "run a 1T model in 4-bit and pray." It is "run the 90% of the 1T that tolerates low-bit formats at low bit, and keep the 10% that does not at higher precision."
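To make the MXFP4 idea concrete, here is a minimal pure-Python sketch of block quantization as the OCP MX spec describes it: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is an illustration, not Xiaomi's implementation. It breaks rounding ties toward the smaller magnitude rather than to even, it ignores the range limits and NaN encoding of the 8-bit shared scale, and the function names are made up for this post. In an experts-only setup, only the expert weight matrices would pass through something like `quantize_tensor`.

```python
import math

# Magnitudes representable in FP4 E2M1 (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # MX formats share one scale per 32 elements

# 4 bits per element plus one 8-bit shared scale amortized over the block.
MXFP4_BITS_PER_ELEMENT = 4 + 8 / BLOCK_SIZE


def quantize_block(block):
    """Return (shared_exponent, quantized_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # The shared scale is a pure power of two, chosen so the largest
    # element lands near the top of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to FP4 max
        # Simplification: ties go to the smaller magnitude, not to even.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - magnitude))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def quantize_tensor(values):
    """Quantize a flat list of floats block by block."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

The storage cost works out to 4.25 bits per element, which is where the memory and bandwidth savings on the expert weights come from.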
DFlash, block-level parallel drafting. Speculative decoding normally uses a small draft model that generates autoregressively — fast, but still serial. DFlash, the arXiv paper Xiaomi cites, replaces the autoregressive draft with a lightweight block diffusion model that fills an entire block of masked positions in one forward pass. The draft uses Sliding Window Attention, which makes per-prediction compute constant in context length rather than linear. The training pipeline pushes mask sampling down to GPU-local shards, so a single sequence yields tens of thousands of independent training signals per step. The acceptance lengths Xiaomi reports are unusually high: 6.30 for Coding, 5.56 for Math / Reasoning, 4.29 for Agent. Block size is capped at 8, which keeps verification overhead low and concurrency high. "The large model can confirm more content in one breath per verification round" is how the post puts it.
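The verification half of that loop can be sketched in a few lines. This is a toy, greedy-decoding version under stated assumptions: `target_next` and `draft_block_fn` are hypothetical stand-ins for the large model and the block drafter, and the target is called once per position for readability, whereas a real system scores the whole block in a single forward pass. It is not the DFlash algorithm itself, only the accept-the-longest-matching-prefix rule that any block drafter plugs into.

```python
BLOCK_SIZE = 8  # block cap reported in the Xiaomi post


def verify_block(prefix, draft_block, target_next):
    """Commit the longest draft prefix the target agrees with, plus one target token."""
    committed = []
    context = list(prefix)
    for token in draft_block[:BLOCK_SIZE]:
        expected = target_next(context)
        if token != expected:
            committed.append(expected)  # the target's token replaces the first miss
            return committed
        committed.append(token)
        context.append(token)
    committed.append(target_next(context))  # bonus token when the whole block passes
    return committed


def generate(prompt, draft_block_fn, target_next, max_new_tokens):
    """Run draft-then-verify rounds; return (new_tokens, verification_rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block_fn(out, BLOCK_SIZE)
        out.extend(verify_block(out, block, target_next))
        rounds += 1
    return out[len(prompt):len(prompt) + max_new_tokens], rounds


# Toy models: the "target" counts upward; the "draft" is right except at index 3.
def toy_target_next(context):
    return (context[-1] + 1) % 1000


def toy_draft_block(context, n):
    block = [(context[-1] + 1 + i) % 1000 for i in range(n)]
    block[3] = -1  # a deliberate miss, so only three draft tokens are accepted
    return block
```

With the toy models, every round commits four tokens (three accepted plus the target's correction), and the output matches what the target alone would have produced. If the reported acceptance length counts tokens committed per round, which is an assumption about how the metric is defined, then 1000 tps at 6.30 works out to roughly 159 verification rounds per second.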
TileRT, a runtime that stops launching operators. At 1000 tps each operator's lifecycle is measured in microseconds. Launch overhead, synchronization stalls, and global-memory round-trips become the bottleneck at that token rate. TileRT discards the per-operator launch paradigm. A persistent engine kernel keeps the whole compute pipeline resident on the GPU, prefetching the next tile while the current tile is still on Tensor Cores. Warp specialization decomposes communication, data movement, and tensor computation into physically separated work. Each layer of the stack (quantization, drafting algorithm, kernel design) was chosen to be compatible with the others. That is the codesign.
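A back-of-the-envelope calculation shows why launch overhead matters at this rate. The sketch below is arithmetic only, and every input in the example call is a placeholder rather than a measured number: the kernel count per verification round and the per-launch overhead are assumptions chosen for illustration, not figures from the Xiaomi or TileRT posts.

```python
def launch_overhead_share(tokens_per_second, tokens_per_round,
                          kernels_per_round, launch_overhead_us):
    """Fraction of each verification round spent on kernel launches alone."""
    # Wall-clock budget for one round, in microseconds.
    round_budget_us = 1_000_000 * tokens_per_round / tokens_per_second
    # Time lost if every operator is launched as its own kernel.
    overhead_us = kernels_per_round * launch_overhead_us
    return overhead_us / round_budget_us


# Hypothetical inputs: 600 kernel launches per round (say 60 layers at 10
# operators each) and 5 microseconds of overhead per launch.
share = launch_overhead_share(
    tokens_per_second=1000,
    tokens_per_round=6.30,   # Coding acceptance length, treated as tokens per round
    kernels_per_round=600,   # assumption, not a published figure
    launch_overhead_us=5.0,  # assumption, not a published figure
)
```

Under those placeholder numbers, launches alone would eat close to half of each round's budget before any math runs. A persistent kernel removes that term instead of shrinking it.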
Why 1000 tps is a category change
Parallel reasoning paths. When a hard problem is one slow generation, the developer waits. When the model is 10× faster, the same wall-clock budget covers ten candidate paths (Best-of-N, tree search, self-verification). Parallel sampling at inference time can substitute for longer chains at training time, and the evidence for that has been stacking up for a year. 1000 tps makes the math work in production.
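The budget arithmetic is simple enough to write down. This sketch assumes candidates are generated back to back at the per-stream decode speed, which is the conservative reading; batching candidates on the same node would change the numbers. The helper names, the 2,000-token path length, and the 20-second budget are illustrative, not from the source.

```python
def paths_in_budget(budget_s, tokens_per_path, tokens_per_second):
    """How many full candidate paths fit in a wall-clock budget."""
    return int((budget_s * tokens_per_second) // tokens_per_path)


def best_of_n(candidates, score):
    """Pick the candidate that the scoring function rates highest."""
    return max(candidates, key=score)


# Illustrative numbers: a 2,000-token reasoning path and a 20-second budget.
slow = paths_in_budget(budget_s=20, tokens_per_path=2000, tokens_per_second=100)
fast = paths_in_budget(budget_s=20, tokens_per_path=2000, tokens_per_second=1000)
```

At 100 tps the budget holds one path; at 1000 tps it holds ten, and `best_of_n` with a verifier or test suite as `score` can pick among them.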
Coding agents stop being a multi-second wait. At 1000 tps code generation becomes an interactive act. "A fast agent feels more like a partner" is the same observation that drove inline completions at ~50 ms, scaled up to whole-file generation.
Real-time decision loops for 1T models. High-frequency trading, fraud interception, voice assistants, surgical assistance: all have latency budgets tighter than the typical 50-tps frontier model can meet. A 1T model at 1000 tps fits inside far more of them, although microsecond-scale budgets such as the tightest trading paths stay out of reach at roughly 1 ms per token.
A 1T model is, in 2026, not new. What is new is the price-performance point: frontier-class capability, commodity hardware, near-real-time speed, the FP4-DFlash weights public. The HN thread's consensus is that the other frontier labs will need to match this number on commodity hardware. The more important fact is that the path does not require a custom chip. TileRT and Xiaomi shipped a model-system codesign, not a hardware moat. The same algorithmic choices can be made by anyone with the weights and a competent kernel team. Execution speed is a movable surface.
What you can do with this
- If you build agent infrastructure: 1000 tps is the new baseline for code generation and tool-call loops. Plan capacity around near-real-time.
- If you run inference at scale: MXFP4 quantization on the Experts of an MoE is the highest-leverage cost optimization available right now. Verify that your GPU has a native FP4 path before betting the cost model on it: Blackwell-class parts such as B200 do, while H100 and MI300X top out at FP8 in hardware.
- If you write speculative-decoding code: DFlash's block-diffusion drafting is the most credible challenge to autoregressive-draft speculative decoding at frontier scale. The "tiny autoregressive draft" pattern behind EAGLE-3 is the path to retire first.
- If you are a CTO buying frontier model access: the price gap between Western closed-weights APIs and Chinese open-weights serving is widening. MiMo UltraSpeed (3× base for ~10× speed) is still well below the effective per-token cost of premium US closed APIs.
The original take: speed is the new scaling
For two years the AI race has been a parameter race. GPT-4 at a rumored ~1.8T, Llama 4 Behemoth at an announced ~2T, the next model at 5T. Each reset the capability-vs-cost curve because it was bigger. Xiaomi and TileRT show the curve can be reset in the other direction: same capability, ~10× faster, same hardware budget. The obvious next move is not "build a 10T model" but "find the next 5–10× speedup on what we already have." Speculative decoding, expert-only quantization, persistent kernels, and warp specialization are the first four moves. The next ones will look like memory-tier orchestration, sparsity-aware scheduling, and more aggressive multi-token verification. The frontier capability story and the frontier cost story are decoupling.
The corollary: the latency budget of "what you can do with one model call" just got 10× larger. The 2027 product roadmap is being written this month, by the teams that figure out what becomes possible when a frontier model is faster than the developer's keystrokes.
What to do this week
# 1. Pull the FP4-DFlash checkpoint and benchmark your workload.
# huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
# Check: first-token latency (TTFT) on a 32k context;
# sustained tps at 8k context on one 8x H100 / 8x B200 node;
# quality on your eval set, not the public benchmarks.
# 2. If you still use EAGLE-3 or a vanilla draft-model speculative
# decoding setup, read the DFlash paper (arXiv:2602.06036) and
# prototype a block-diffusion draft. Acceptance length 6.3 on
# Coding translates to real throughput, not peak-spec wins.
# 3. If you run an MoE model in production, instrument expert-level
# precision. Quantizing only the Experts to MXFP4 is the cheapest
# inference win available. Verify your GPU has the FP4 path first.
# 4. If you sell "fast inference," your public tps number is now a
# buy/no-buy criterion. Publish sustained-tps at 8k context on
# commodity hardware, or stop quoting peak.
# 5. If you price a token plan, re-run unit economics with 10x decode
# speed. The cost-per-completed-task curve bends non-linearly once
# you can fan out to parallel sampling.
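For step 1, a minimal timing harness might look like the sketch below. It measures time to first token and sustained decode rate from any iterable that yields tokens as they arrive. `token_stream` is a stand-in for whatever streaming client you use; no particular serving API is assumed, and `measure_stream` is a name made up for this post.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Return token count, time to first token, and decode tokens per second."""
    start = clock()
    first_arrival = None
    last_arrival = start
    count = 0
    for _ in token_stream:
        now = clock()
        if first_arrival is None:
            first_arrival = now  # the first token marks the end of prefill
        last_arrival = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    decode_time = last_arrival - first_arrival
    # Decode rate excludes the first token, whose latency is dominated by prefill.
    if count > 1 and decode_time > 0:
        decode_tps = (count - 1) / decode_time
    else:
        decode_tps = float("nan")
    return {
        "tokens": count,
        "ttft_s": first_arrival - start,
        "decode_tps": decode_tps,
    }
```

Run it once with a 32k-token prompt to read `ttft_s`, and once with a long generation at 8k context to read `decode_tps`. Keeping the two numbers separate avoids the common mistake of folding prefill time into a throughput claim.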
The bottom line
Xiaomi and TileRT did not invent a new model and they did not invent a new chip. They combined a small set of existing techniques — MoE, FP4, block-diffusion drafting, persistent kernels — in a way that the parts compound. The result is a 1T model running at near-real-time speed, with the weights public, on commodity hardware. The race is no longer "whose model is biggest." The race is "whose model is fastest, and who can keep the speed as the models get smarter." This week, that race just started.
Related reads from this blog
- Linear Is Fast Because the Browser Is the Database — Same thesis, different surface: how a local-first client-side architecture resets the cost / latency tradeoff in product UX.
- Speculative KV Coding: 4× Lossless Cache Compression — Another inference-engineering move that compounds: cache compression as a partner to decoding speed.
- Cloudflare Just Bought the Build Tool That Runs the Web — When the build tool is the distribution channel, the next generation of frontend speed wins will look like this MiMo move: a model-system codesign, not a single magic technique.
Sources
- Xiaomi MiMo Team, "MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Generation Speed to 1000 TPS", mimo.xiaomi.com, 8 June 2026. https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
- DFlash authors, "DFlash: Block-Level Diffusion Drafting for Speculative Decoding", arXiv:2602.06036. https://arxiv.org/abs/2602.06036
- OCP, "Microscaling Formats (MX) v1.0 Spec", opencompute.org. https://opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
- Hacker News, "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second", 8 June 2026 (476 points; HN front page 9 June). https://news.ycombinator.com/item?id=48446639
- TileRT technical blog, referenced from the MiMo post: https://tilert.ai/blog/breaking-1000-tps.html
- MiMo-V2.5-Pro-FP4-DFlash open-weights checkpoint: https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
