
Scaling Document Near-Deduplication for Large Language Models: Lessons from BigCode

April 26, 2026 · 4:56 PM

Large language models (LLMs) are only as good as the data they are trained on. As the saying goes, "garbage in, garbage out", and one of the most critical steps in preparing high-quality training data is deduplication. In a new technical overview, a researcher shares the challenges and solutions behind the massive near-deduplication efforts for the BigCode and BigScience projects.

Why Deduplication Matters

Duplicated data can cause models to regurgitate training examples verbatim, harm privacy, and inflate benchmark scores when evaluation data leaks into the training set. Deduplication also makes training more efficient, often achieving the same or better performance in fewer steps. Moreover, it reduces storage and transfer costs, making it easier for smaller teams to work with large datasets.

From BigScience to BigCode

The journey began with a LinkedIn conversation in which Huu Nguyen invited the author to work on deduplication for BigScience. The scale was daunting: terabytes of text and thousands of dollars in cloud compute. After much trial and error, the author adapted the same techniques for BigCode's code datasets, where deduplication also improved model performance.

The MinHash Approach

The core method used in BigCode is MinHash combined with Locality-Sensitive Hashing (LSH). The process has two main steps, illustrated in the sketch after this list:

  1. Shingling and MinHashing: Each document is broken into n-grams (shingles), which are hashed to create a compact signature.
  2. LSH: Signatures are grouped into bands to efficiently find near-duplicate candidates without comparing every pair of documents.
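To make the two steps concrete, here is a minimal sketch using the open-source datasketch library. The parameter values (128 permutations, a 0.7 similarity threshold, 5-word shingles) and the toy documents are illustrative assumptions, not the settings used in BigCode, and the overview does not say this exact library powered the project's pipeline, so treat it as a small-scale demonstration of the technique.

```python
# A minimal MinHash + LSH near-duplicate sketch with datasketch.
# All parameter values below are illustrative assumptions.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128   # hash permutations per signature (assumed)
THRESHOLD = 0.7  # Jaccard similarity threshold for candidates (assumed)
NGRAM = 5        # shingle size in words (assumed)

def shingles(text: str, n: int = NGRAM) -> set:
    """Step 1a: break a document into word-level n-grams (shingles)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash(text: str) -> MinHash:
    """Step 1b: hash every shingle into a compact MinHash signature."""
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

# Toy documents: "a" and "b" differ by one word; "c" is unrelated.
docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river edge",
    "c": "completely unrelated text about training large language models",
}

# Step 2: LSH splits each signature into bands internally, so candidate
# pairs surface without comparing every pair of documents.
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
signatures = {key: minhash(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

for key, sig in signatures.items():
    # query() returns keys of likely near-duplicates, including the doc itself.
    print(key, "->", sorted(lsh.query(sig)))
```

Running this flags "a" and "b" as candidates of each other while "c" matches only itself; tightening the threshold trades recall for precision, which is exactly the trade-off the tunable parameters control.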

Parameters such as the number of permutations, the similarity threshold, and the n-gram size can all be tuned, and the original overview includes an interactive demo for exploring the underlying math.
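That math is worth a quick look. In the standard LSH analysis (a textbook result, not spelled out in this summary), a signature of n permutations is split into b bands of r rows each, with n = b * r, and a pair of documents with Jaccard similarity s collides in at least one band with probability:

P(candidate) = 1 - (1 - s^r)^b

As a worked example under an assumed split of 128 permutations into b = 32 bands of r = 4 rows: a pair with s = 0.8 is flagged with probability 1 - (1 - 0.8^4)^32, which is essentially 1, while a pair with s = 0.3 is flagged only about 23% of the time. Tuning b and r shifts this S-curve, which is how the similarity threshold is effectively set.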

Comparing Deduplication Methods

The overview includes a comparison of deduplication results across datasets. On the multilingual ROOTS corpus, for example, document-level deduplication removed 0.07-2.7% of the data, while substring deduplication removed 10.61-32.30%. For code datasets, MinHash + LSH has been used in models such as CodeParrot and The Stack.

Key Takeaways

Deduplication is not a one-size-fits-all process. The choice of method and parameters depends on the data type, language, and desired trade-offs between precision and recall. The author emphasizes that thorough deduplication is essential for building reliable LLMs, especially as models grow larger and data quality becomes ever more critical.