opinionsreddit2026년 4월 21일 오후 06:55

New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%.

More info, including charts, per-case metrics, raw judge outputs, and the parsed answer dump: https://github.com/lechmazur/position_bias This benchmark isolates one basic and frustrating failure mode. The model-average first-shown pick rate is 63%. GPT-5.4 (high) is the most position-sensitive model in the run. Many models don't just pick the first story more often, they also rate it higher. Average first-position rating bonus is +0.26 on a 1-7 scale. Mistral Large 3 is the outlier in the op

원문 보기 →

관련 기사

OpenAI 라이브스트림

ChatGPT Images 2.0 출시, 이미지 생성 기능 대폭 업그레이드

2025년의 "6개월만 더 기다려" 주장이 단 한 번의 업데이트로 무너졌다

AMD Strix Halo에서 Mistral Medium 3.5 돌려봤더니 느려 죽겠네—밤새 돌려야 함