A new benchmark called ConTextual evaluates how well multimodal AI models can jointly process and reason over text and images in scenes where text is embedded in the visual context. Early results show that even advanced models struggle with this kind of joint reasoning in text-rich environments.
The benchmark tests models on tasks that require understanding the relationship between written text and the surrounding visual elements, such as interpreting signs, diagrams, or documents. The researchers found that while models perform well on pure-text or pure-image tasks, their performance drops significantly when they must combine the two modalities.
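To make that task setup concrete, the following is a minimal sketch of what an evaluation loop over such text-in-image tasks might look like. The sample format, the `query_model` stub, and the exact-match scoring are illustrative assumptions, not the benchmark's actual harness.

```python
from typing import Dict, List

# Illustrative sample in an assumed format: an image containing embedded
# text, paired with an instruction that can only be answered by combining
# that text with its surrounding visual context.
SAMPLES: List[Dict[str, str]] = [
    {
        "image": "parking_sign.jpg",
        "instruction": "According to the sign, may I park here at 6 pm on Sunday?",
        "answer": "yes",
    },
]


def query_model(image_path: str, instruction: str) -> str:
    """Stand-in for a vision-language model call (hypothetical).

    A real harness would send the image and instruction to whatever
    multimodal API is under evaluation and return its text response.
    """
    return ""  # placeholder response


def evaluate(samples: List[Dict[str, str]]) -> float:
    """Score model responses against reference answers.

    Exact-match scoring is used purely for illustration; the benchmark's
    actual grading scheme may differ (e.g., human or LLM-based judging
    of free-form responses).
    """
    correct = sum(
        query_model(s["image"], s["instruction"]).strip().lower()
        == s["answer"].strip().lower()
        for s in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    print(f"Joint text-image accuracy: {evaluate(SAMPLES):.1%}")
```

Keeping the model call separate from the scorer makes it straightforward to swap in a different grading scheme without touching the rest of the loop.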
"Current models often treat text and image as separate streams, failing to integrate them for complex reasoning," the team noted. ConTextual aims to drive progress by providing a standardized evaluation set.
The study highlights the need for better multimodal integration, especially for applications like autonomous navigation, assistive technology, and document analysis.