A Language Model From Scratch: Training EsperBERTo
Over the past months, the Hugging Face team has improved the transformers and tokenizers libraries to make it easier to train a new language model from scratch. This post walks through training a small RoBERTa-like model (84 million parameters) on Esperanto, a constructed language designed for easy learning and international communication.
Why Esperanto?
Esperanto is relatively low-resource, with about 2 million speakers, making this demo more interesting than training yet another English model. Its highly regular grammar—nouns end in -o, adjectives in -a—means even small datasets can yield meaningful linguistic results. Plus, the language's goal of fostering world peace aligns with the NLP community's mission.
Note: You don't need to know Esperanto to follow along, but if you want to learn, Duolingo offers a course with 280k active learners.
Our model is named… EsperBERTo.
Step 1: Find a Dataset
We use the Esperanto portion of the OSCAR corpus (299 MB) from INRIA, combined with the Esperanto sub-corpus of the Leipzig Corpora Collection (news, literature, Wikipedia). The final training corpus is 3 GB. For better results, use larger datasets.
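If you prefer to fetch the OSCAR portion programmatically, the datasets library exposes it. Here is a minimal sketch; the oscar script with the unshuffled_deduplicated_eo configuration and the ./eo_data/oscar_eo.txt output path are assumptions that may need adjusting for your datasets version and setup:

from pathlib import Path
from datasets import load_dataset

Path("./eo_data").mkdir(exist_ok=True)

# Download the deduplicated Esperanto split of OSCAR and write it out as
# one document per line, so the tokenizer below can consume plain .txt files.
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")
with open("./eo_data/oscar_eo.txt", "w", encoding="utf-8") as f:
    for doc in oscar_eo:
        f.write(doc["text"].replace("\n", " ") + "\n")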
Step 2: Train a Tokenizer
We train a byte-level BPE tokenizer (like GPT-2) with RoBERTa special tokens and a vocabulary size of 52,000. Byte-level BPE ensures no unknown tokens.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Gather every .txt file in the corpus directory
paths = [str(x) for x in Path("./eo_data/").glob("**/*.txt")]

# Byte-level BPE, as used by GPT-2, with RoBERTa's special tokens
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>", "<pad>", "</s>", "<unk>", "<mask>",
])

# Save the vocabulary and merges files to disk
tokenizer.save_model(".", "esperberto")
Training takes about 5 minutes. The output includes vocab.json and merges.txt. This tokenizer is optimized for Esperanto, representing native words unsplit and handling diacritics (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ). Sequence encoding is ~30% shorter than with the pretrained GPT-2 tokenizer.
Usage example:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./models/EsperBERTo-small/vocab.json",
    "./models/EsperBERTo-small/merges.txt",
)

# Wrap every sequence in RoBERTa's <s> ... </s> special tokens
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# "Mi estas Julien." = "I am Julien."
print(tokenizer.encode("Mi estas Julien."))
# Encoding(num_tokens=7, ...)
# tokens: ['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
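To see the efficiency gain mentioned above, you can compare token counts against the pretrained GPT-2 tokenizer. A quick sketch; the sentence is arbitrary and the exact ratio will vary:

from transformers import GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
sentence = "La ĉielarko aperis super la montoj."  # "The rainbow appeared over the mountains."
print(len(gpt2_tok.tokenize(sentence)))         # token count with the pretrained GPT-2 tokenizer
print(len(tokenizer.encode(sentence).tokens))   # token count with the EsperBERTo tokenizer (includes <s> and </s>)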
Step 3: Train a Language Model from Scratch
Using the run_language_modeling.py script from transformers (set --model_name_or_path to None to train from scratch), we train a RoBERTa-like model with a masked language modeling objective. We need to:
- Implement a Dataset subclass to load text files.
- Choose hyperparameters (e.g., 6 layers, 768 hidden size, 12 attention heads), as in the sketch below.
The model is trained on the tokenized Esperanto corpus to predict masked tokens.
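If you would rather drive this from Python than from the example script, a minimal sketch with the Trainer API looks like the following. The paths, batch size, and epoch count are illustrative, and LineByLineTextDataset (deprecated in newer transformers releases in favor of the datasets library) is just one simple way to read one example per line:

from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    LineByLineTextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# Small RoBERTa-like configuration: 6 layers, 768 hidden size, 12 attention heads
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Point this at the directory holding the vocab.json and merges.txt from Step 2
tokenizer = RobertaTokenizerFast.from_pretrained("./models/EsperBERTo-small", model_max_length=512)

# One training example per line of the corpus file; path is illustrative
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./eo_data/oscar_eo.txt",
    block_size=128,
)

# Dynamically mask 15% of tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./models/EsperBERTo-small",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()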
Step 4: Check the Model
After training, we sanity-check the model by asking it to fill in masked tokens: the top predictions should be plausible Esperanto, though some outputs may still look odd.
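One quick way to run this check is the fill-mask pipeline. A sketch, assuming the trained model and tokenizer were saved to ./models/EsperBERTo-small:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./models/EsperBERTo-small",
    tokenizer="./models/EsperBERTo-small",
)

# "La suno <mask>." = "The sun <mask>." — the top predictions should be plausible Esperanto
print(fill_mask("La suno <mask>."))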
Step 5: Fine-tune on a Downstream Task
Fine-tune the model on part-of-speech tagging to evaluate its understanding of Esperanto grammar.
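Here is a sketch of the setup with a token-classification head. The label set and checkpoint path are placeholders; in practice you would feed it POS-annotated Esperanto data (e.g. in CoNLL format) through the run_ner.py example script or your own Trainer loop:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder label set; replace with the tags used by your POS dataset
pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ", "PUNCT"]

tokenizer = AutoTokenizer.from_pretrained("./models/EsperBERTo-small")
model = AutoModelForTokenClassification.from_pretrained(
    "./models/EsperBERTo-small",
    num_labels=len(pos_tags),
)
# From here, fine-tune with Trainer on (token, tag) pairs much like the MLM step,
# swapping the data collator for DataCollatorForTokenClassification.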
Step 6: Share Your Model
Finally, upload your trained model and tokenizer to the Hugging Face Hub so the community can download and use them.
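A minimal sketch, assuming you have already authenticated with huggingface-cli login; the repository name is up to you:

# Pushes the model weights, config, and tokenizer files to a repo under your account
model.push_to_hub("EsperBERTo-small")
tokenizer.push_to_hub("EsperBERTo-small")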
Conclusion
This guide demonstrates the complete pipeline for training a language model from scratch, using a unique language to highlight the process. The tools and techniques are applicable to any language or domain.