DailyGlimpse

LAVE: Can Large Language Models Match Fine-Tuned Models in Document Visual Question Answering?

April 26, 2026 · 4:28 PM

A new benchmark, LAVE, evaluates large language models (LLMs) on the Docmatix dataset for zero-shot visual question answering (VQA), meaning the models receive no task-specific fine-tuning. The study explores whether LLMs prompted with a document image and a text question can rival specialized models fine-tuned on document images.

Preliminary results suggest that while LLMs show promise in understanding document layouts and extracting information, fine-tuned models still achieve higher accuracy, particularly on complex queries requiring spatial reasoning. However, the gap narrows with larger LLMs and improved prompting strategies.
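
To make the notion of an accuracy "gap" concrete, here is a minimal Python sketch that scores two sets of predictions against reference answers and reports the difference. It assumes normalized exact-match scoring, which is a common VQA convention but not necessarily the scoring LAVE itself uses; the function names `normalize`, `exact_match_accuracy`, and `accuracy_gap` are hypothetical and chosen for illustration only.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", stripped).strip()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization.

    Assumption: exact match stands in for whatever scoring the benchmark uses.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def accuracy_gap(zero_shot_preds, fine_tuned_preds, references):
    """Fine-tuned accuracy minus zero-shot accuracy on the same questions."""
    return (exact_match_accuracy(fine_tuned_preds, references)
            - exact_match_accuracy(zero_shot_preds, references))
```

A positive gap means the fine-tuned model answered more questions correctly; a gap near zero would indicate that the zero-shot model has caught up under this particular metric.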

The paper raises critical questions about the necessity of task-specific fine-tuning in an era of increasingly capable foundation models.