Developers can now query Hugging Face datasets using plain English, thanks to a new integration that combines the Hugging Face Dataset Viewer API with MotherDuck's DuckDB-NSQL-7B model. This approach allows users to fetch dataset previews and convert natural language questions into SQL queries without manual query writing.
The Hugging Face Dataset Viewer API provides a simple way to access dataset metadata and sample rows. By connecting this API to DuckDB-NSQL-7B, a large language model fine-tuned for text-to-SQL tasks, users can ask questions like "How many rows have a rating above 4?" and receive the corresponding SQL query and results.
MotherDuck, a serverless analytics platform built on DuckDB, powers the SQL execution engine. The NSQL-7B model, developed by DuckDB Labs, is specialized in translating natural language to SQL, making it ideal for this workflow. The entire pipeline runs efficiently, enabling real-time querying of datasets hosted on Hugging Face.
To set up the system, developers need to configure the Hugging Face Dataset Viewer API to return a sample of the dataset in JSON format. This sample is then fed into the NSQL-7B model along with the user's question. The model generates a SQL query, which is executed against the dataset stored in MotherDuck. The results are returned in a readable format.
This integration lowers the barrier for non-technical users to explore datasets, democratizing data access. It also showcases the growing synergy between large language models and data infrastructure, where natural language interfaces are becoming standard.
For a hands-on demonstration, refer to the original article on MotherDuck's blog.