DailyGlimpse

TAPEX: Redefining Table Pre-training by Learning SQL Execution from Synthetic Data

AI
April 26, 2026 · 5:32 PM

Recent advances in language model pre-training have largely relied on massive text corpora. Objectives like masked language modeling have proven effective, but a persistent gap remains between pre-training objectives (e.g., language modeling) and downstream tasks (e.g., table question answering). As a result, an enormous amount of pre-training data is often needed before meaningful downstream improvements appear. A new approach, TAPEX (Table Pre-training via Execution), tackles this by using synthetic data instead of real-world text.

Overview

In the paper "TAPEX: Table Pre-training via Learning a Neural SQL Executor," researchers propose a novel method: pre-training a model to mimic a SQL executor on a fully synthetic corpus. The process systematically samples executable SQL queries over tables, computes their execution results, and trains a language model (such as BART) to predict those results. This closes the gap between pre-training and downstream tasks by focusing on the structural reasoning required for table-based question answering. A sketch of the synthesis loop follows.
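To make the recipe concrete, here is a minimal sketch of the corpus-synthesis loop. The toy table and the two query templates are illustrative assumptions for this sketch; the paper samples from a much richer SQL grammar over real web tables.

```python
# Illustrative corpus-synthesis loop (templates and sampling strategy are
# assumptions for this sketch, not the paper's exact implementation).
import random
import sqlite3

import pandas as pd

# A toy table standing in for one drawn from the web-table corpus.
table = pd.DataFrame({
    "Year": [1896, 1900, 1904],
    "City": ["Athens", "Paris", "St. Louis"],
    "Country": ["Greece", "France", "United States"],
})

# Two hypothetical query templates; the real grammar also covers
# aggregation, arithmetic, comparison operators, and more.
TEMPLATES = [
    'SELECT "{col}" FROM t ORDER BY "{key}" ASC LIMIT 1',
    """SELECT COUNT(*) FROM t WHERE "{col}" = '{val}'""",
]

def sample_example(df: pd.DataFrame) -> tuple[str, str]:
    """Sample one executable SQL query and compute its gold output."""
    conn = sqlite3.connect(":memory:")
    df.to_sql("t", conn, index=False)
    col, key = random.sample(list(df.columns), 2)
    val = random.choice(df[col].astype(str).tolist())
    sql = random.choice(TEMPLATES).format(col=col, key=key, val=val)
    rows = conn.execute(sql).fetchall()
    conn.close()
    # The execution result becomes the seq2seq target string.
    answer = ", ".join(str(v) for row in rows for v in row)
    return sql, answer

sql, answer = sample_example(table)
print(sql, "->", answer)
```

Because both the query sampler and the executor are programs, every (query, table, result) triple is guaranteed to be consistent, which is what lets the corpus scale without human annotation.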

Pre-training Process

At each pre-training step, a table is retrieved from the web. For example, consider a table of Olympic Games host cities. A sampled SQL query such as SELECT City WHERE Country = France ORDER BY Year ASC LIMIT 1 (written in the paper's simplified dialect, which omits the FROM clause) is executed with an off-the-shelf SQL engine (e.g., MySQL) to obtain the result "Paris." The model is then trained to produce this result from the concatenation of the SQL query and the flattened table. Unlike natural language sentences, synthetic SQL queries can be sampled with controlled diversity, difficulty, and scale, enabling a high-quality, large-scale pre-training corpus; the sketch below shows how one such instance is assembled.
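The following sketch assembles one pre-training instance along these lines, using sqlite3 in place of MySQL and rewriting the query in standard SQL so the engine accepts it. The "col : ... row 1 : ..." linearization follows the paper's description, though the exact separators and casing here are approximations.

```python
# A minimal sketch of assembling one pre-training instance: flatten the
# table, execute the sampled query, and pair input with target.
import sqlite3

import pandas as pd

table = pd.DataFrame({
    "Year": [1896, 1900, 1904],
    "City": ["Athens", "Paris", "St. Louis"],
    "Country": ["Greece", "France", "United States"],
})

def flatten(df: pd.DataFrame) -> str:
    """Linearize a table into the flat sequence the encoder consumes."""
    parts = ["col : " + " | ".join(df.columns)]
    for i, row in enumerate(df.itertuples(index=False), start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)

# The blog's example query, rewritten in standard SQL so sqlite3 accepts it
# (the paper's figure uses a simplified dialect without a FROM clause).
sql = "SELECT City FROM t WHERE Country = 'France' ORDER BY Year ASC LIMIT 1"

conn = sqlite3.connect(":memory:")
table.to_sql("t", conn, index=False)
answer = conn.execute(sql).fetchone()[0]  # -> 'Paris'
conn.close()

# Encoder input: query concatenated with the flattened table;
# decoder target: the execution result.
model_input = sql + " " + flatten(table)
print(model_input)
print("target:", answer)
```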

Fine-tuning and Usage

During fine-tuning, the model receives a natural language question paired with a table and outputs the answer. TAPEX is available in 🤗 Transformers, and you can fine-tune it using the provided scripts. Pre-trained models like microsoft/tapex-large-sql-execution come with interactive widgets for easy experimentation.
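For instance, running the released SQL-execution checkpoint takes only a few lines; the snippet below follows the usage documented on the microsoft/tapex-large-sql-execution model card.

```python
# Runs the released checkpoint; mirrors the usage documented on the
# microsoft/tapex-large-sql-execution model card.
import pandas as pd
from transformers import BartForConditionalGeneration, TapexTokenizer

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-sql-execution")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-sql-execution")

table = pd.DataFrame({
    "year": ["1896", "1900", "1904"],
    "city": ["athens", "paris", "st. louis"],
})

# This checkpoint executes (simplified) SQL; fine-tuned variants such as
# microsoft/tapex-large-finetuned-wtq answer natural language questions instead.
query = "select city where year = 1900"
encoding = tokenizer(table=table, query=query, return_tensors="pt")

outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # e.g. [' paris']
```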

Performance

TAPEX achieves state-of-the-art results on four benchmark datasets: WikiSQL (89.5% denotation accuracy, +2.3% over the previous SOTA), TabFact (84.2% accuracy, +3.2%), SQA (74.5% denotation accuracy, +3.5%), and WikiTableQuestions (57.5% denotation accuracy, +4.8%). These gains are substantial, especially against plain BART baselines (e.g., +15.9% on SQA). Notably, TAPEX is the first table pre-training method to exploit synthetic executable programs.

Comparing to Previous Work

Earlier table pre-training models such as TAPAS and TaBERT relied on general-purpose objectives like masked language modeling over real tables and text. TAPEX takes a different route, sacrificing the naturalness of the data in exchange for a domain-adaptive task, SQL execution, which aligns far more closely with the structural reasoning needed for table question answering. This shift demonstrates how synthetic data can overcome the limitations of real-world text.