opinions · reddit · April 21, 2026 at 06:55 PM

New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order?

Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%.
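To make the flip metric concrete, here is a minimal sketch of how a flip rate over decisive pairs could be computed. It is not the benchmark's actual code: the `PairResult` record, its field names, and the reading of "decisive" as "both orderings produced a non-tie pick" are all assumptions made for illustration. It also reports how often the first-shown story wins, as a related statistic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PairResult:
    """Hypothetical record for one case: stories A and B judged in both orders."""
    pick_ab: Optional[str]  # story picked when A is shown first ("A", "B", or None for a tie)
    pick_ba: Optional[str]  # story picked when B is shown first ("A", "B", or None for a tie)


def flip_rate(results: List[PairResult]) -> float:
    """Share of decisive pairs where the picked story changes after the swap.

    Assumption: a pair is "decisive" when both orderings yield a non-tie pick.
    Returns 0.0 when there are no decisive pairs.
    """
    decisive = [r for r in results if r.pick_ab is not None and r.pick_ba is not None]
    if not decisive:
        return 0.0
    flips = sum(1 for r in decisive if r.pick_ab != r.pick_ba)
    return flips / len(decisive)


def first_shown_pick_rate(results: List[PairResult]) -> float:
    """Share of non-tie judgments in which the first-shown story was picked."""
    picked_first = []
    for r in results:
        if r.pick_ab is not None:
            picked_first.append(r.pick_ab == "A")  # A was shown first
        if r.pick_ba is not None:
            picked_first.append(r.pick_ba == "B")  # B was shown first
    if not picked_first:
        return 0.0
    return sum(picked_first) / len(picked_first)


# Made-up example data, not taken from the benchmark.
results = [
    PairResult("A", "A"),   # consistent: A wins in both orders
    PairResult("A", "B"),   # flip: the first-shown story wins both times
    PairResult("B", "A"),   # flip: the second-shown story wins both times
    PairResult("A", None),  # not decisive: one ordering was a tie
    PairResult("A", "B"),   # flip: the first-shown story wins both times
]

print(f"flip rate: {flip_rate(results):.2f}")
print(f"first-shown pick rate: {first_shown_pick_rate(results):.2f}")
```

A judge with no position bias would keep the same story in both orders, so its flip rate would be near 0% and its first-shown pick rate near 50%.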

More info, including charts, per-case metrics, raw judge outputs, and the parsed answer dump: https://github.com/lechmazur/position_bias

This benchmark isolates one basic and frustrating failure mode. The model-average first-shown pick rate is 63%. GPT-5.4 (high) is the most position-sensitive model in the run. Many models don't just pick the first story more often; they also rate it higher. The average first-position rating bonus is +0.26 on a 1-7 scale. Mistral Large 3 is the outlier in the op
