After HackerRank published its open-source applicant tracking system, a developer named Dan Kinsky opened a terminal, pointed his own resume at it a hundred times, and watched the same document score anywhere from 66 to 99 out of 100. The repo is real, the runs are reproducible, and the spread traces back to a design choice everyone in hiring tooling has been quietly making for three years.
The tool in question is interviewstreet/hiring-agent: a Python pipeline that parses a PDF resume, calls a local LLM (default: gemma3:4b) six times to pull structured fields out of work history, education, skills, projects, and awards, optionally enriches the result with GitHub repository scans, and then asks the model to grade the whole bundle out of 100. Up to 20 bonus points get stacked on top for startup experience, a portfolio site, or a technical blog. It is MIT-licensed, with 3,592 stars on GitHub at time of writing and 253 open issues, most of which are the same complaint from different people. The repo didn't appear out of nowhere either: it dates to July 2025, but the link only went viral after a LinkedIn and r/leetcode pass that started roughly two months later, per Kinsky's correction footnote (footnote 1, which links one LinkedIn post and one Reddit thread). Anyone who has been watching the AI-in-hiring discourse knows the pattern by now: an LLM is wired into a pipeline that touches millions of decisions, the LLM's behavior changes under load, and nobody on the buying side inspects which version of stochastic they actually deployed.
Kinsky's experiment is the part that should change how the industry talks about the space. With the tool set to its default temperature of 0.1, a setting most people would call "effectively deterministic," the same resume graded against the same rubric returns a 33-point spread over 100 trials. Toggling DEVELOPMENT_MODE off, hard-coding the inputs, and changing nothing except deleting a print() statement already shifts the score by 16 points; looping the model produces the full range. Re-running with Gemini instead of gemma3:4b tightens the distribution, but only to a 48-64 band, which is still a 16-point spread. At a cutoff of 60, the same resume would still fail on roughly 28% of submissions (Kinsky's number, not a separate reproduction). The non-determinism is a sampling problem, and the sampling never goes away.
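The loop itself is small enough to sketch. What follows is a minimal harness under stated assumptions, not Kinsky's script: `repeat_scores`, `summarize`, and `make_fake_scorer` are hypothetical names, and the fake scorer only draws uniform noise so the example runs without a model. To test a real screener you would pass in your own function that takes resume text and returns a score.

```python
import random
import statistics
from typing import Callable, Dict, List


def repeat_scores(score_fn: Callable[[str], float], resume_text: str, runs: int = 100) -> List[float]:
    """Score the same resume text `runs` times and keep every result."""
    return [score_fn(resume_text) for _ in range(runs)]


def summarize(scores: List[float], cutoff: float) -> Dict[str, float]:
    """Report the spread and the share of runs that fall below a fixed cutoff."""
    below = sum(1 for s in scores if s < cutoff)
    return {
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),
        "fail_rate": below / len(scores),
    }


def make_fake_scorer(seed: int) -> Callable[[str], float]:
    """Hypothetical stand-in for a real screener call: uniform noise, not the tool."""
    rng = random.Random(seed)
    return lambda resume_text: float(rng.randint(66, 99))


if __name__ == "__main__":
    scores = repeat_scores(make_fake_scorer(0), "identical resume text")
    print(summarize(scores, cutoff=80))
```

The two numbers worth reading off are `spread` and `fail_rate`: the first is the headline range, the second is how often an unchanged resume would be rejected at whatever cutoff you pick.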
The numbers that matter
Most resume-screeners, including this one, grade on a 100-point rubric anchored to a handful of weighted categories. Hiring-agent's breakdown is unusually explicit about what it's optimizing for: 35 points for open source contributions, 30 for personal projects, 25 for work experience, 10 for technical skills, plus up to 20 in bonus. Read it once and you see what the tool is for: a fairly specific kind of engineer with a specific kind of artifact trail. Candidates whose work happens inside a corporation and stays there — the majority of working engineers, by every measure — start the test at a structural disadvantage that has nothing to do with their quality.
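The arithmetic behind that disadvantage is easy to make concrete. The sketch below uses the category caps quoted above and assumes two things the source does not confirm: that the categories are simply additive, and that a resume with no public artifacts earns nothing in the open source and personal projects categories. The names `CATEGORY_MAX`, `BONUS_MAX`, and `ceiling` are illustrative only.

```python
# Category caps as quoted in the article; treating them as additive is an assumption.
CATEGORY_MAX = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
    "technical_skills": 10,
}
BONUS_MAX = 20  # startup experience, portfolio site, technical blog


def ceiling(categories_with_evidence):
    """Best possible base score if only these categories can earn points."""
    return sum(CATEGORY_MAX[name] for name in categories_with_evidence)


base_total = sum(CATEGORY_MAX.values())
# Assumption: an engineer whose work stays inside an employer has evidence
# only for work experience and technical skills.
corporate_only = ceiling(["work_experience", "technical_skills"])

print(base_total, corporate_only)  # 100 35
```

Under those assumptions, a candidate with a perfect employment record and no public trail tops out at 35 of the 100 base points before the model samples a single token.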
That structural tilt is what makes the non-determinism land so hard. Kinsky ran the tool against the "technical skills" category and watched it score 8 out of 10 in 98 of 100 trials. That is almost a hard rule, because "did this candidate list React" is the kind of check that any extraction model can do reliably. The "work experience" category came back 25/25 in every run, including against a stripped-down resume listing only one internship: the rubric is two lines long, contains no anchor examples, and the LLM has nothing to vary on, so it just agrees with itself. Categories with something to judge are exactly the categories the tool can't judge consistently. Projects swings wildly. Open source, with the rubric actually reading like a rubric, swings less than it used to but still swings. Kinsky's projects were described as ones that "lack architectural complexity" or, with comparable frequency, as ones that "demonstrate real-world deployment": two opposite readings of the same input, sampled roughly evenly across runs, and the only meaningful difference between them is the random seed the sampler hit.
Temperature 0 is a story the model tells you
The HN thread on Kinsky's post spent the first hundred comments litigating the same argument, and it happens to be the part of the story that most deserves a closer reading. In theory, "temperature 0" produces deterministic outputs from a sampling model. In theory-theory, which is the theory library developers actually mean when they quote it, temperature 0 doesn't really exist as a fixed point. The softmax divides the logits by the temperature, so T = 0 is only defined as a limit. In that limit the distribution collapses onto the single highest-logit token only when there's a unique highest-logit token at every position; when two logits tie, the limit splits the probability evenly between them and something else has to break the tie. Floating-point quirks normally paper over that, but the assumption that no two logits will ever tie is exactly the kind of assumption you don't want underwriting a hiring decision.
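The temperature argument can be checked with a few lines of standard-library Python. This is a sketch of the textbook temperature-scaled softmax, not code from the tool, and the logit values are made up for illustration; `softmax_with_temperature` and `greedy_limit` are hypothetical names.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax; T must be positive because T = 0 is only a limit."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_limit(logits: List[float]) -> List[float]:
    """The T -> 0 limit: all mass on the maximum, split evenly across exact ties."""
    peak = max(logits)
    winners = [i for i, x in enumerate(logits) if x == peak]
    return [1.0 / len(winners) if i in winners else 0.0 for i in range(len(logits))]


if __name__ == "__main__":
    # Made-up logits: the top two candidates are close, the third is far behind.
    print(softmax_with_temperature([2.0, 1.9, 1.0], temperature=0.1))
    # An exact tie never resolves, however low the temperature goes.
    print(softmax_with_temperature([2.0, 2.0, 1.0], temperature=0.01))
    print(greedy_limit([2.0, 2.0, 1.0]))
```

With these illustrative logits, a gap of 0.1 at temperature 0.1 still leaves more than a quarter of the probability on the runner-up token, which is why a low temperature is not the same thing as a fixed answer.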
The deeper issue is that the model is asked to do two jobs with one set of weights: parse a document into structured fields (the part LLMs are good at), and score a candidate against a rubric (the part LLMs are uniquely bad at, because rubric scoring is a discriminative task and chat models are trained to be generative). The tool's own prompt for experience is two lines long, per the rubric Kinsky quotes (see the Production section in the repo): an instruction to analyze work and volunteer sections for real-world or internship experience, plus a special-consideration line that awards extra for founder or early-stage engineer roles. No anchors. No examples. No definition of "real-world." The model is being asked to invent a calibration it was never trained on, and with nothing in the prompt to separate one resume from another, it hands out the same mark every time. That's why an intern and a principal engineer both get 25/25: the prompt can't tell them apart, and neither can the model.
The reproducibility budget is the only metric that matters
Most AI-in-hiring coverage focuses on bias — and deservedly so; the Brookings April 2025 study on gender, race, and intersectional bias in LLM-driven resume retrieval put real numbers behind the failure mode. But reproducibility is the failure mode people who aren't in the literature are about to discover, and it doesn't need a bias-detection study to demonstrate — it just needs Kinsky's terminal loop. A tool whose identical inputs produce non-identical outputs is a tool whose identical candidates produce non-identical outcomes. At any fixed cutoff, the failure rate of "this qualified candidate didn't make it past the screen" is structurally non-zero, and the candidates that fall on the wrong side of the cutoff are random with respect to merit. That's the function the tool is performing. Calling it a "filter" understates it; calling it a "luck filter" catches it.
There are two things worth keeping separate, even though they often get tangled together. The first is LLM bias — outputs that differ systematically across groups, the bias problem the literature has spent two years measuring. The second is LLM noise — outputs that differ across identical inputs, the reproducibility problem Kinsky is documenting. The first matters because fairness is a legal category and a moral category. The second matters because anything with this much noise is unfit for the actual decision even if you fix the bias. A noise-free version of a biased tool is still biased. A noise-heavy version of a fair tool is unfit to use.
Open source changed the optics but not the math
The interesting decision HackerRank made was opening the source. A closed-source LLM screener with 33-point variance would be the kind of "actuarial non-decision" enterprise software tends to hide; an open-source one is a reproducible experiment. Kinsky's loop is the unit-test the entire industry should have been writing since AI resume screeners started shipping in 2022. Anyone can replicate it — and many will, because the cost of doing so is a laptop, a pip install, and an hour. What they will find is what Kinsky found: the tool's accuracy, as a filter, is the same as flipping a weighted coin. Whatever signal the company thought they were buying is in the noise floor.
The reproducibility question matters even more on the buyer side. A screening tool is a ranking function, and if its top-K is unstable across runs, its top-K is arbitrary. Companies buying these tools should be asking, before they wire one into Workday, Greenhouse, or Lever, what the tool's reproducibility budget is for the population they're screening. If your top-of-funnel conversion is 10% and your screener has a 30% pass rate at the cutoff, the screen is responsible for roughly half of your funnel noise. Halving the variance by switching to a smaller, deterministic model and tighter prompts would do more for hire quality than any number of model upgrades. Anyone who's been on the receiving end of an unexplained rejection knows this already.
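One way a buyer could put a number on top-K stability is to score the same candidate pool several times and compare the shortlists. The sketch below uses mean pairwise Jaccard overlap, which is my choice of metric for illustration and not something the article or the tool defines; `top_k` and `mean_top_k_overlap` are hypothetical names and the candidate scores are invented.

```python
from itertools import combinations
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Names of the k highest-scoring candidates; ties broken by name for repeatability."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return set(ranked[:k])


def mean_top_k_overlap(runs: List[Dict[str, float]], k: int) -> float:
    """Average Jaccard overlap of the top-k shortlist across every pair of runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    shortlists = [top_k(run, k) for run in runs]
    pairs = list(combinations(shortlists, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented scores for the same three candidates on two separate runs.
    run_a = {"ana": 90, "bo": 80, "cy": 70}
    run_b = {"ana": 90, "bo": 60, "cy": 70}
    print(mean_top_k_overlap([run_a, run_b], k=2))
```

An overlap of 1.0 means every run produced the same shortlist; anything well below that means the shortlist depends on which run happened to be the one that counted.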
What to do this week
If you're a job seeker:
- Assume a non-trivial share of the screen is a coin flip. Use that as license to apply to roles your gut says you're a fit for, even when your heuristic says you're not.
- HackerRank-style tools weight the resume rubric heavily toward open source and personal projects. If you have those, surface them more prominently: GitHub README polish, a one-paragraph portfolio, a working demo URL. The tool is explicitly grading on artifacts that look like artifacts.
- If you have none of those, your path through this filter is rougher regardless of quality. Lean on referrals and on company-specific application tracks that bypass the automated screen.
If you're an engineer with a say in how your company screens:
- Run Kinsky's loop on your own tool with your own population. The "100 runs against the same resume" test is the smallest possible reproducible experiment and you should have its output before you trust it.
- Treat any LLM-based screener that returns a single candidate score as inadmissible. Demand either a structured decomposition (the model returns per-rubric scores so you can audit which parts are stable) or a calibration band (each score comes with a standard deviation across N runs).
- If the screener doesn't expose its rubric, what you have is a vibe check with extra steps. The vibe check is the part you don't want.
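The structured decomposition and the calibration band can be combined in one report. Here is a minimal sketch, assuming your screener can return per-category scores as a dictionary on each run; the function names, the category keys, and the numbers in the demo are all hypothetical.

```python
import statistics
from typing import Dict, List


def calibration_bands(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Mean and sample standard deviation of each rubric category across repeated runs."""
    bands = {}
    for category in runs[0]:
        values = [run[category] for run in runs]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        bands[category] = {"mean": statistics.mean(values), "stdev": spread}
    return bands


def unstable_categories(bands: Dict[str, Dict[str, float]], max_stdev: float) -> List[str]:
    """Categories whose run-to-run deviation exceeds the tolerance you chose."""
    return sorted(name for name, band in bands.items() if band["stdev"] > max_stdev)


if __name__ == "__main__":
    # Invented per-category scores for one resume across three runs.
    runs = [
        {"work_experience": 25, "projects": 10},
        {"work_experience": 25, "projects": 28},
        {"work_experience": 25, "projects": 19},
    ]
    bands = calibration_bands(runs)
    print(bands)
    print(unstable_categories(bands, max_stdev=2.0))
```

The tolerance is a policy decision, not a statistical one: the point of the report is to see which parts of the rubric hold still before anyone agrees on a cutoff.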
If you're running the screener yourself:
- Lower the temperature only after you have measured the temperature=1 distribution — the noise floor has to be known to be lowered.
- Replace single-call score generation with multi-sample consensus, or with discriminative models trained on labeled paired comparisons (the actual right tool for the job).
- The single most valuable line in the open-source repo is the temperature: 0.1 default. Change it to 0, document the new spread, and ship the difference.
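Multi-sample consensus can be sketched in a few lines. This version takes the median of several calls and, as a design choice of mine rather than anything in the repo, sends a resume to a person whenever the sampled band straddles the cutoff. `consensus_score` and `triage` are hypothetical names, and `score_fn` stands in for whatever call your screener makes.

```python
import statistics
from typing import Callable, Dict


def consensus_score(score_fn: Callable[[str], float], resume_text: str, samples: int = 9) -> Dict[str, float]:
    """Call the scorer several times and report the median with the observed band."""
    scores = [score_fn(resume_text) for _ in range(samples)]
    return {"median": statistics.median(scores), "min": min(scores), "max": max(scores)}


def triage(result: Dict[str, float], cutoff: float) -> str:
    """Auto-decide only when every sample agrees; otherwise route to a person."""
    if result["min"] >= cutoff:
        return "pass"
    if result["max"] < cutoff:
        return "fail"
    return "human_review"


if __name__ == "__main__":
    # Invented scores standing in for three model calls on one resume.
    fake_calls = iter([70.0, 80.0, 90.0])
    result = consensus_score(lambda text: next(fake_calls), "resume text", samples=3)
    print(result, triage(result, cutoff=75))
```

The median reduces the noise in the reported number, but it is the band check that changes the outcome: a resume whose samples land on both sides of the cutoff is not something the tool has an opinion about.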
The feature, renamed
The industry-wide reflex when a reproducibility paper appears is to call the problem "non-determinism" and promise a fix in the next model. Non-determinism is the property, not a bug to patch — and it's a direct consequence of how these models generate text. A model that returns 100/100 with seed 0 and 73/100 with seed 1 is doing exactly what it was trained to do; the prompt engineer has not yet built a system that constrains the sampler. The fix is to stop pretending the model is a sensor when it's a sampler, and to put determinism back into the pipeline by routing it through a part of the system that actually has it. Structured extraction can be done deterministically. Rubric scoring, with the right anchors, can be done deterministically. The middle distance — "judge me on my projects, please" — is where the sampler takes over, and the sampler is supposed to take over there. The honest answer is to admit that's a part of the decision a human has to make.
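To show what "rubric scoring with anchors" could look like once extraction has produced structured fields, here is a deliberately simple sketch. Every threshold and point value in it is invented for illustration; it is not the tool's rubric and it is not a recommendation for how experience should be weighted. The only property it is meant to demonstrate is that the same input always yields the same score.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Role:
    title: str
    months: int
    is_internship: bool


def score_work_experience(roles: List[Role]) -> int:
    """Map extracted roles to a 0-25 score using explicit, hypothetical anchors."""
    full_time_months = sum(r.months for r in roles if not r.is_internship)
    if full_time_months >= 60:   # anchor (assumed): five or more years full time
        return 25
    if full_time_months >= 24:   # anchor (assumed): two or more years full time
        return 18
    if full_time_months >= 6:    # anchor (assumed): at least six months full time
        return 10
    if roles:                    # anchor (assumed): internships only
        return 5
    return 0


if __name__ == "__main__":
    intern = [Role("Software Engineering Intern", 3, True)]
    principal = [Role("Principal Engineer", 96, False)]
    print(score_work_experience(intern), score_work_experience(principal))
```

Whether those anchors are the right ones is a human argument, which is the point: the judgment moves into a rule people can read and dispute, and the sampler is left with the extraction step.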
Kinsky's post is honest about that in a way the industry usually isn't. He isn't angry at HackerRank. He's angry at himself for thinking the tool was testing something it wasn't. Plenty of other readers will be angry at HackerRank; they're right to be, but only about the secondary thing. The primary thing is that the entire category of tool is built on a category error, and the open-source release is the moment that became undeniable. Once you see the same resume swing from 66 to 99 on a hundred deterministic-looking runs, every score that came out of every other LLM screener starts to look like the same number — just with a different seed you can't reproduce.
Disclosure
Drafted with AI assistance. Primary source: Dan Kinsky's 28 Jun 2026 post at danunparsed.com/p/hackerrank-open-source-ats, fetched and cached locally on 29 Jun 2026. GitHub repo interviewstreet/hiring-agent confirmed live via the GitHub REST API on the same date. Brookings 25 Apr 2025 piece on bias is cited only for the bias vs. noise distinction in the body, not for any specific finding. Per-claim attribution and live numbers are in the Sources section below.
Sources
- HackerRank's open-source ATS: Dan Kinsky, "HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74/100. No — 88/100. Actually 83/100.", danunparsed.com/p/hackerrank-open-source-ats, 28 Jun 2026. Primary source for all experimental claims in the body (66–99 spread, 65% cutoff failure rate, 48–64 Gemini band, 98/100 technical-skills consistency, 25/25 experience rubric outcome). Fetched 29 Jun 2026.
- The GitHub repo itself: github.com/interviewstreet/hiring-agent, MIT-licensed Python project, 3,592 stars / 745 forks / 253 open issues at time of writing. Repo created 2025-07-29; first viral LinkedIn/Reddit pass ~Oct 2025 per Kinsky's footnote. Confirmed via GitHub REST API on 29 Jun 2026.
- The HN discussion: Hacker News item 48713832. 730 points / 309 comments at time of writing; thread moving. Used for the temperature-zero analysis and the broader engineering reaction.
- Brookings 25 Apr 2025 on bias in LLM-based resume screening: Kyra Wilson and Aylin Caliskan, "Gender, race, and intersectional bias in AI resume screening via language model retrieval," brookings.edu/articles/gender-race-and-intersectional-bias-in-ai-resume-screening-via-language-model-retrieval/. Used only for the bias vs. noise distinction; no specific findings paraphrased.
- The Reddit r/leetcode pass: referenced in Kinsky's correction footnote (footnote 1) as one of the two original viral-sharing surfaces, 28 Jun 2026. Linked but not directly fetched (Reddit returned a block page to my fetch attempt).