Researchers have released MobilityBench, a benchmark designed to evaluate how well large language models (LLMs) handle real-world route-planning tasks. Unlike simple navigation benchmarks, MobilityBench uses real user queries and a deterministic API-replay sandbox to test agents on complex, preference-constrained trip planning.
The benchmark reveals that current LLMs handle straightforward route suggestions well but struggle when asked to satisfy multiple constraints at once, such as preferred departure times, cost limits, and required waypoints. This gap highlights a key hurdle for building agentic AI systems that can assist with everyday decision-making.
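To make the evaluation setup concrete, the sketch below shows one way a deterministic API-replay sandbox could work: tool responses are recorded once and keyed by a canonical form of the request, so every agent run sees identical outputs. All names here (ReplaySandbox, the "route" endpoint, and its parameters) are illustrative assumptions, not MobilityBench's actual interface, which the article does not describe.

```python
import json
from typing import Any


class ReplaySandbox:
    """Hypothetical sketch of a deterministic API-replay sandbox.

    Recorded responses are keyed by a canonicalized request string, so
    identical agent calls always return identical results, independent
    of live traffic or service availability.
    """

    def __init__(self, recorded: dict[str, Any]):
        self.recorded = recorded

    @staticmethod
    def _key(endpoint: str, params: dict) -> str:
        # Sort params so argument order does not change the lookup key.
        return endpoint + "?" + json.dumps(params, sort_keys=True)

    def call(self, endpoint: str, params: dict) -> Any:
        key = self._key(endpoint, params)
        if key not in self.recorded:
            # Unrecorded calls fail loudly rather than hitting a live API.
            raise KeyError(f"unrecorded request: {key}")
        return self.recorded[key]


# Illustrative recording: a route query with a cost constraint.
recorded = {
    'route?{"max_cost": 20, "to": "museum"}': {"mode": "bus", "cost": 3.5},
}
sandbox = ReplaySandbox(recorded)

# Same logical request, different argument order: still a cache hit.
result = sandbox.call("route", {"to": "museum", "max_cost": 20})
```

The key design choice in such a setup is canonicalization: by normalizing requests before lookup, the sandbox stays deterministic even when an agent phrases the same query in different argument orders.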
The findings are expected to guide future research on improving LLM reasoning under realistic constraints.