DailyGlimpse

Databricks Boosts Hugging Face with Spark Integration, Slashes Dataset Loading Time by More Than 40%

AI
April 26, 2026 · 4:59 PM

Generative AI has been transforming industries, and Databricks, the data and AI company, is making it easier for organizations to fine-tune large language models (LLMs) with its latest contribution to the Hugging Face ecosystem. The company has released code that converts Apache Spark DataFrames directly into Hugging Face datasets, cutting dataset loading time by more than 40%.

Previously, users had to write Spark DataFrames to Parquet files and then reload them into Hugging Face datasets, a cumbersome round trip that wasted time and resources. For a 16GB dataset, that method took about 22 minutes. With the new from_spark function in the Hugging Face Datasets library, the same conversion takes only about 12 minutes.

"It's been great to see Databricks release models and datasets to the community, and now we see them extending that work with direct open source commitment to Hugging Face. Spark is one of the most efficient engines for working with data at scale, and it's great to see that users can now benefit from that technology to more effectively fine tune models from Hugging Face." — Clem Delangue, Hugging Face CEO

The integration combines Spark's efficiency in handling large-scale data transformations with Hugging Face's optimizations like memory-mapping and smart caching. This is crucial for organizations that need to augment AI models with their proprietary data to achieve better domain-specific performance.

Databricks plans further contributions, including streaming support for even faster dataset loading. The company is also enhancing other open source projects, such as adding transformers and LangChain support to MLflow, and releasing AI Functions for Databricks SQL.

This article was originally published on April 26, 2023 on the Databricks blog.