DailyGlimpse

GPT-5.5 Tops AI Benchmarks but Hallucinates More Than Predecessor, Costs 20% More via API

AI · April 26, 2026 · 3:58 PM

OpenAI's latest model, GPT-5.5, has reclaimed the top spot on AI benchmarks, but its high hallucination rate and increased API costs raise concerns.

Key Findings:

  • Benchmark Performance: GPT-5.5 leads the Artificial Analysis Intelligence Index with 60 points, edging out Claude Opus 4.7 and Gemini 3.1 Pro Preview (both at 57). However, its hallucination rate on the AA Omniscience benchmark is 86%, compared to 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro Preview.
  • Token Efficiency: The listed API price has doubled to $5/$30 per million input/output tokens, but the model uses about 40% fewer tokens than GPT-5.4, so the effective cost per task rises by only roughly 20% (see the sketch after this list).
  • BullshitBench Struggles: On a benchmark that tests a model's ability to refuse nonsensical questions, GPT-5.5 managed only a 45% pushback rate—similar to GPT-5.4. GPT-5.5 Pro fared worse at 35%. Claude models from Anthropic lead this benchmark, while OpenAI and Google models often accept nonsense confidently.
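
To make the token-efficiency arithmetic concrete, here is a minimal Python sketch. The GPT-5.5 prices are the article's listed figures; GPT-5.4's prices (assumed to be half, per the "doubled" claim) and the per-task token counts are hypothetical placeholders used only to illustrate the calculation.

    # Effective API cost comparison: GPT-5.5 vs. GPT-5.4.
    # GPT-5.5 prices are the article's listed figures; GPT-5.4's are
    # assumed to be half, per the article's "doubled" claim.
    PRICE = {
        "gpt-5.4": {"input": 2.50, "output": 15.00},  # assumption: half of 5.5
        "gpt-5.5": {"input": 5.00, "output": 30.00},  # listed price per article
    }

    def task_cost(model, input_tokens, output_tokens):
        """Dollar cost of one task at the model's per-million-token rates."""
        p = PRICE[model]
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

    # Hypothetical workload: GPT-5.5 is assumed to use ~40% fewer tokens
    # than GPT-5.4 on the same task, as the article reports.
    base_in, base_out = 10_000, 4_000
    old = task_cost("gpt-5.4", base_in, base_out)
    new = task_cost("gpt-5.5", int(base_in * 0.6), int(base_out * 0.6))
    print(f"GPT-5.4: ${old:.4f}   GPT-5.5: ${new:.4f}   change: {new / old - 1:+.0%}")

Because price and token count enter the cost multiplicatively, the 2.0 × 0.6 = 1.2 factor (a 20% net increase) holds for any workload, provided input and output usage both fall by the same ~40%.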

Expert Insight: Peter Gostev, AI Capability Lead at Arena.ai, notes that increasing compute for reasoning doesn't automatically improve pushback. Models may rationalize nonsense rather than refuse it, suggesting training methodology matters more than scale.

Bottom Line: GPT-5.5 delivers strong performance but at a higher cost and with persistent hallucination issues. For tasks requiring factual accuracy and careful reasoning, alternatives like Claude Opus 4.7 may be safer choices.