Research
We measured 73,580 AI peer reviews. Models favor their own answers.
Every AI council has to answer one question: who judges the answers? We run cross-model peer review in production, which produces something unusual — a large record of models grading answers when one of the answers is their own. The result is unambiguous: language models systematically rank their own work higher, the strongest models do it most, and it changes real outcomes.
The setup
In every LLM Council run, several models answer the same question, then each model ranks the anonymized answers — including, unknowingly, its own. Because the same answer is ranked both by its author and by non-authors, each council is a natural paired experiment: how does a model rank its own answer versus how everyone else ranks that identical answer?
We measured this across 73,580 paired judgments from production councils. No lab prompts, no synthetic tasks — real questions from real users.
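A minimal sketch of the paired comparison described above, for readers who want to reproduce the measurement on their own councils. The data layout (a dict of rankings per judge, with rank 1 as best) and the function name `self_preference` are assumptions made for illustration, not the production schema.

```python
from statistics import mean

def self_preference(rankings, authors):
    """Return {model: peer_mean_rank - self_rank} for one council.

    rankings: {judge_model: {answer_id: rank}}, where rank 1 is best (assumed layout).
    authors:  {answer_id: author_model}.
    A positive value means the author placed its own answer closer to
    first place than its peers placed that same answer.
    """
    deltas = {}
    for answer_id, author in authors.items():
        own_ranking = rankings.get(author, {})
        if answer_id not in own_ranking:
            continue  # the author did not rank its own answer; no pair to compare
        peer_ranks = [
            ranking[answer_id]
            for judge, ranking in rankings.items()
            if judge != author and answer_id in ranking
        ]
        if not peer_ranks:
            continue  # no peer judged this answer
        deltas[author] = mean(peer_ranks) - own_ranking[answer_id]
    return deltas

# Toy council: three hypothetical models, each ranking all three answers.
authors = {"a": "A", "b": "B", "c": "C"}
rankings = {
    "A": {"a": 1, "b": 2, "c": 3},
    "B": {"a": 2, "b": 1, "c": 3},
    "C": {"a": 2, "b": 3, "c": 1},
}
print(self_preference(rankings, authors))
```

Averaging these per-council deltas for each model across many councils would give a per-model self-preference figure of the kind reported below.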
The findings
Models favor their own answers even when they cannot see the author names. On average, a model places its own answer about a third of a rank position higher (closer to first place) than its peers place the same answer. Forty-seven of the forty-nine models measured showed positive self-preference.
The bias scales with capability. The strongest frontier models showed the largest self-preference, up to +1.72 rank positions, while smaller workhorse models showed the least. One reading is that as models get better, they become more convinced by their own reasoning, not less.
It changes outcomes. When we recompute historical councils with self-votes removed, the winning answer changes in 4.0% of councils. One verdict in twenty-five was decided by models voting for themselves.
- 73,580 paired same-answer judgments, production data
- 47 of 49 models rank their own answers higher
- Strongest frontier models: up to +1.72 rank positions of self-preference
- Removing self-votes changes the winner in 4.0% of councils
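The recomputation behind the 4.0% figure can be sketched as follows. The aggregation rule is an assumption: this example picks the answer with the lowest mean rank, which may differ from the rule the production councils use. The names `winner` and `flip_rate` are hypothetical.

```python
from statistics import mean

def winner(rankings, authors, drop_self_votes=False):
    """Return the answer id with the lowest mean rank (rank 1 is best).

    rankings: {judge_model: {answer_id: rank}}; authors: {answer_id: author_model}.
    With drop_self_votes=True, each author's rank for its own answer is ignored.
    """
    scores = {}
    for answer_id, author in authors.items():
        ranks = [
            ranking[answer_id]
            for judge, ranking in rankings.items()
            if answer_id in ranking and not (drop_self_votes and judge == author)
        ]
        if ranks:
            scores[answer_id] = mean(ranks)
    # Ties are broken by answer id so the result is deterministic.
    return min(scores, key=lambda a: (scores[a], a))

def flip_rate(councils):
    """Fraction of councils whose winner changes once self-votes are removed."""
    flipped = sum(
        winner(r, a) != winner(r, a, drop_self_votes=True) for r, a in councils
    )
    return flipped / len(councils)

# Toy data: in the first council the self-vote decides the verdict.
authors = {"a": "A", "b": "B", "c": "C"}
decided_by_self_vote = {
    "A": {"a": 1, "b": 2, "c": 3},
    "B": {"c": 1, "a": 2, "b": 3},
    "C": {"b": 1, "a": 2, "c": 3},
}
unanimous = {m: {"a": 1, "b": 2, "c": 3} for m in ("A", "B", "C")}
print(flip_rate([(decided_by_self_vote, authors), (unanimous, authors)]))
```

In the first toy council, answer "a" wins only because its author ranked it first; once that vote is removed, "b" has the better mean rank.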
Why this matters for every AI council
Multi-model products are arriving fast — Perplexity now ships a Model Council on its Max plan, and open-source councils run everywhere from GitHub to Hugging Face. Most share one design decision: a single synthesizer model acts as the judge that writes the final answer.
Our data suggests that design inherits a measurable bias. A judge that favors its own reasoning and cannot be audited is a silent thumb on the scale. This is consistent with peer-reviewed findings that LLM evaluators recognize and favor their own generations; what our data adds is scale, production conditions, and the capability gradient.
What we are doing about it
Anonymous cross-model peer review makes the bias measurable. Correcting it is the next step: a bias-corrected verdict that drops self-votes and subtracts each model’s measured self-preference constant is in testing on our staging environment now. We will publish the before/after results — including cases where correction changes the served answer — when the experiment reads out.
Two honest limits. First, a changed verdict is not automatically a better verdict; proving quality improvement requires outcome data, which is exactly what the live experiment measures. Second, these constants describe our council configurations and traffic; other pipelines will differ in magnitude, though the direction is consistent with published research.
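As an illustration of what a bias-corrected verdict could look like, here is a hedged sketch. The text above does not say how dropping self-votes and subtracting the self-preference constant combine, so this example treats them as two alternative modes. The function name, the `mode` parameter, the mean-rank aggregation, and the constants in the usage example are all assumptions for illustration, not the staging implementation or measured values.

```python
from statistics import mean

def corrected_scores(rankings, authors, self_pref, mode="drop"):
    """Mean rank per answer (lower is better) after correcting self-votes.

    rankings:  {judge_model: {answer_id: rank}}, rank 1 is best.
    authors:   {answer_id: author_model}.
    self_pref: {model: measured self-preference in rank positions} (hypothetical values).
    mode="drop":   ignore each author's vote on its own answer.
    mode="adjust": keep the self-vote but push it back by the model's constant.
    """
    if mode not in ("drop", "adjust"):
        raise ValueError("mode must be 'drop' or 'adjust'")
    scores = {}
    for answer_id, author in authors.items():
        ranks = []
        for judge, ranking in rankings.items():
            if answer_id not in ranking:
                continue
            rank = ranking[answer_id]
            if judge == author:
                if mode == "drop":
                    continue
                # A larger rank number is a worse placement, so removing the
                # bias means adding the constant to the self-assigned rank.
                rank = rank + self_pref.get(judge, 0.0)
            ranks.append(rank)
        if ranks:
            scores[answer_id] = mean(ranks)
    return scores

authors = {"a": "A", "b": "B", "c": "C"}
rankings = {
    "A": {"a": 1, "b": 2, "c": 3},
    "B": {"c": 1, "a": 2, "b": 3},
    "C": {"b": 1, "a": 2, "c": 3},
}
self_pref = {"A": 1.0, "B": 0.0, "C": 0.0}  # made-up constants
print(corrected_scores(rankings, authors, self_pref, mode="drop"))
print(corrected_scores(rankings, authors, self_pref, mode="adjust"))
```

Either mode only changes which answer is served. As noted above, whether the changed verdict is a better one still has to be checked against outcome data.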
The one-line takeaway
If an AI system lets a model judge work it produced — or work its siblings produced — assume self-preference unless it is measured and corrected. In our data, that assumption was right 47 times out of 49.