A Language Model From Scratch: Training EsperBERTo
Over the past months, the Hugging Face team has improved the transformers and tokenizers libraries to make it easier to train a new language model from scratch. This post walks through training a small RoBERTa-like model (84 million parameters) on Esperanto, a constructed language designed for easy learning and international communication.
Why Esperanto?
Esperanto is relatively low-resource, with about 2 million speakers, making this demo more interesting than training yet another English model. Its highly regular grammar—nouns end in -o, adjectives in -a—means even small datasets can yield meaningful linguistic results. Plus, the language's goal of fostering world peace aligns with the NLP community's mission.
Note: You don't need to know Esperanto to follow along, but if you want to learn, Duolingo offers a course with 280k active learners.
Our model is named… EsperBERTo.
Step 1: Find a Dataset
We use the Esperanto portion of the OSCAR corpus (299 MB) from INRIA, combined with the Esperanto sub-corpus of the Leipzig Corpora Collection (news, literature, Wikipedia). The final training corpus is 3 GB. For better results, use larger datasets.
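If you prefer to fetch the OSCAR portion programmatically, the datasets library exposes it. Here is a minimal sketch; the oscar script with the unshuffled_deduplicated_eo configuration and the ./eo_data/oscar_eo.txt output path are assumptions that may need adjusting for your datasets version and setup:

from pathlib import Path
from datasets import load_dataset

Path("./eo_data").mkdir(exist_ok=True)

# Download the deduplicated Esperanto split of OSCAR and write it out as
# one document per line, so the tokenizer below can consume plain .txt files.
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")
with open("./eo_data/oscar_eo.txt", "w", encoding="utf-8") as f:
    for doc in oscar_eo:
        f.write(doc["text"].replace("\n", " ") + "\n")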
Step 2: Train a Tokenizer
We train a byte-level BPE tokenizer (like GPT-2) with RoBERTa special tokens and a vocabulary size of 52,000. Byte-level BPE ensures no unknown tokens.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Gather every .txt file in the corpus directory
paths = [str(x) for x in Path("./eo_data/").glob("**/*.txt")]

# Byte-level BPE, as used by GPT-2, with RoBERTa's special tokens
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>", "<pad>", "</s>", "<unk>", "<mask>",
])

# Save the vocabulary and merges files to disk
tokenizer.save_model(".", "esperberto")
Training takes about 5 minutes. The output includes vocab.json and merges.txt. This tokenizer is optimized for Esperanto, representing native words unsplit and handling diacritics (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ). Sequence encoding is ~30% shorter than with the pretrained GPT-2 tokenizer.
Usage example:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./models/EsperBERTo-small/vocab.json",
    "./models/EsperBERTo-small/merges.txt",
)

# Wrap every sequence in RoBERTa's <s> ... </s> special tokens
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# "Mi estas Julien." = "I am Julien."
print(tokenizer.encode("Mi estas Julien."))
# Encoding(num_tokens=7, ...)
# tokens: ['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
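To see the efficiency gain mentioned above, you can compare token counts against the pretrained GPT-2 tokenizer. A quick sketch; the sentence is arbitrary and the exact ratio will vary:

from transformers import GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
sentence = "La ĉielarko aperis super la montoj."  # "The rainbow appeared over the mountains."
print(len(gpt2_tok.tokenize(sentence)))         # token count with the pretrained GPT-2 tokenizer
print(len(tokenizer.encode(sentence).tokens))   # token count with the EsperBERTo tokenizer (includes <s> and </s>)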
Step 3: Train a Language Model from Scratch
Using the run_language_modeling.py script from transformers (set --model_name_or_path to None to train from scratch), we train a RoBERTa-like model with a masked language modeling objective. We need to:
- Implement a Dataset subclass to load text files.
- Choose hyperparameters (e.g., 6 layers, 768 hidden size, 12 attention heads), as in the sketch below.
The model is trained on the tokenized Esperanto corpus to predict masked tokens.
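If you would rather drive this from Python than from the example script, a minimal sketch with the Trainer API looks like the following. The paths, batch size, and epoch count are illustrative, and LineByLineTextDataset (deprecated in newer transformers releases in favor of the datasets library) is just one simple way to read one example per line:

from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    LineByLineTextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# Small RoBERTa-like configuration: 6 layers, 768 hidden size, 12 attention heads
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# Point this at the directory holding the vocab.json and merges.txt from Step 2
tokenizer = RobertaTokenizerFast.from_pretrained("./models/EsperBERTo-small", model_max_length=512)

# One training example per line of the corpus file; path is illustrative
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./eo_data/oscar_eo.txt",
    block_size=128,
)

# Dynamically mask 15% of tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./models/EsperBERTo-small",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()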
Step 4: Check the Model
After training, we sanity-check the model by asking it to fill in masked tokens: the top predictions should be plausible Esperanto, though some outputs may still look odd.
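One quick way to run this check is the fill-mask pipeline. A sketch, assuming the trained model and tokenizer were saved to ./models/EsperBERTo-small:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./models/EsperBERTo-small",
    tokenizer="./models/EsperBERTo-small",
)

# "La suno <mask>." = "The sun <mask>." — the top predictions should be plausible Esperanto
print(fill_mask("La suno <mask>."))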
Step 5: Fine-tune on a Downstream Task
Fine-tune the model on part-of-speech tagging to evaluate its understanding of Esperanto grammar.
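Here is a sketch of the setup with a token-classification head. The label set and checkpoint path are placeholders; in practice you would feed it POS-annotated Esperanto data (e.g. in CoNLL format) through the run_ner.py example script or your own Trainer loop:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder label set; replace with the tags used by your POS dataset
pos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ", "PUNCT"]

tokenizer = AutoTokenizer.from_pretrained("./models/EsperBERTo-small")
model = AutoModelForTokenClassification.from_pretrained(
    "./models/EsperBERTo-small",
    num_labels=len(pos_tags),
)
# From here, fine-tune with Trainer on (token, tag) pairs much like the MLM step,
# swapping the data collator for DataCollatorForTokenClassification.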
Step 6: Share Your Model
Finally, upload your trained model and tokenizer to the Hugging Face Hub so the community can download and use them.
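A minimal sketch, assuming you have already authenticated with huggingface-cli login; the repository name is up to you:

# Pushes the model weights, config, and tokenizer files to a repo under your account
model.push_to_hub("EsperBERTo-small")
tokenizer.push_to_hub("EsperBERTo-small")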
Conclusion
This guide demonstrates the complete pipeline for training a language model from scratch, using a unique language to highlight the process. The tools and techniques are applicable to any language or domain.