DailyGlimpse

DuckDB Integration Unlocks SQL Queries on Hugging Face Datasets

April 26, 2026 · 4:54 PM

The Hugging Face Hub has introduced a feature that lets users run SQL queries on any of its 50,000+ datasets using DuckDB, a high-performance analytical database. The integration aims to make dataset exploration more accessible: SQL ranked as the third most popular programming language in Stack Overflow's 2022 developer survey.

The Hub's dataset viewer already converts all public datasets into Parquet files automatically for efficient storage and analysis. Now, with DuckDB's httpfs extension, users can query these remote Parquet files directly via SQL without downloading them. The process has three steps: obtain the Parquet file URLs through the /parquet API endpoint, install and load the httpfs extension in DuckDB, and run queries against those URLs.

For example, to analyze the blog authorship corpus, a user can fetch the Parquet file URLs and execute a query like:

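import duckdb
import requests

# Ask the dataset viewer API for the dataset's Parquet files.
r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=blog_authorship_corpus")
r.raise_for_status()  # stop early if the API call failed
j = r.json()
# Keep only the files that belong to the train split.
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
url = urls[0]

# The httpfs extension lets DuckDB read remote files over HTTP(S).
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# Average post length per horoscope sign, longest five first.
con.sql(f"""SELECT horoscope,
    count(*),
    AVG(LENGTH(text)) AS avg_blog_length
    FROM '{url}'
    GROUP BY horoscope
    ORDER BY avg_blog_length DESC
    LIMIT 5""").show()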

This capability is especially valuable for the large datasets common in the LLM era: Parquet's columnar format lets DuckDB read only the columns a query needs, which keeps remote queries fast. The Hub's dataset viewer automatically shards big datasets into 500 MB Parquet files, and DuckDB can query several of those files in a single statement.
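
For a sharded dataset, the same aggregate can be run across every shard at once by passing a list of URLs to DuckDB's read_parquet function. The following is a minimal sketch rather than the Hub's own code: the helper names parquet_source and top_horoscopes are hypothetical, and the urls argument is assumed to be the list of shard URLs returned by the /parquet endpoint.

```python
def parquet_source(urls):
    """Build a DuckDB read_parquet([...]) expression covering every URL."""
    if not urls:
        raise ValueError("need at least one Parquet URL")
    # Double any single quote so each URL is a valid SQL string literal.
    quoted = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    return f"read_parquet([{quoted}])"


def top_horoscopes(urls, limit=5):
    """Hypothetical helper: run the aggregate over all shards, return rows."""
    # Imported here so parquet_source stays usable without DuckDB installed.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    query = f"""
        SELECT horoscope, count(*) AS n, AVG(LENGTH(text)) AS avg_blog_length
        FROM {parquet_source(urls)}
        GROUP BY horoscope
        ORDER BY avg_blog_length DESC
        LIMIT {int(limit)}
    """
    # fetchall() returns the result as a list of tuples.
    return con.execute(query).fetchall()
```

Calling top_horoscopes(urls) with the full list of train-split URLs, instead of only urls[0], would aggregate over the whole split, assuming every shard shares the same schema.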

"Understanding dataset content is crucial for model quality," the Hugging Face team stated. "By enabling SQL queries, we're empowering users to gain deeper insights and promoting open access to data."

The feature is now available for all public datasets on the Hub, with documentation providing further details.