Introduction for Biologists: What Is a Language Model?
The models used to study proteins draw heavily on large language models like BERT and GPT. To understand how they work, let's go back to early 2016, before Trump's election and before Brexit, when deep learning was the hottest new technique in machine learning. Deep learning uses artificial neural networks to learn complex patterns in data, but it needs massive training datasets, which are often unavailable for niche tasks.
Imagine training a model to determine if an English sentence is grammatically correct. You'd collect examples like:
| Text | Label |
|---|---|
| The judge told the jurors to think carefully. | Correct |
| The judge told that the jurors to think carefully. | Incorrect |
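In machine-learning terms, this is a binary text-classification problem. Here is a minimal, purely illustrative PyTorch sketch: a toy bag-of-words featurizer and a linear classifier trained on just these two examples (far too few to learn anything real, but enough to make the setup concrete; all names here are our own, not from any library).

```python
# Purely illustrative: frame grammaticality checking as binary classification.
import torch
import torch.nn as nn

# Tiny labeled dataset (1 = grammatically correct, 0 = incorrect).
examples = [
    ("The judge told the jurors to think carefully.", 1),
    ("The judge told that the jurors to think carefully.", 0),
]

# Toy word-level vocabulary built from the training sentences.
vocab = {w: i for i, w in enumerate(sorted({w for s, _ in examples for w in s.lower().split()}))}

def featurize(sentence: str) -> torch.Tensor:
    """Bag-of-words vector: 1.0 wherever a vocabulary word appears."""
    x = torch.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in vocab:
            x[vocab[w]] = 1.0
    return x

# A randomly initialized classifier: it must learn everything from these examples alone.
model = nn.Linear(len(vocab), 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

X = torch.stack([featurize(s) for s, _ in examples])
y = torch.tensor([label for _, label in examples])
for _ in range(100):  # a few gradient steps on the toy data
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```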
In 2016, models were randomly initialized for each new task, forcing them to learn everything from scratch. This is like being asked to learn a task from examples in a language you don't speak (here, Irish):
| Text | Label |
|---|---|
| Is í an stiúrthóir is fearr ar domhan! | 1 |
| Is fuath liom an scannán seo. | 0 |
Without pre-existing knowledge of the language, you'd need thousands of examples to spot the patterns. But if you already know English, the same task (sentiment analysis, where 1 means positive and 0 means negative) becomes trivial:
| Text | Label |
|---|---|
| She's the best director in the world! | 1 |
| I hate this movie. | 0 |
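In model terms, "already knowing English" means starting from pretrained weights rather than random ones. Below is a minimal sketch using the Hugging Face transformers library; `distilbert-base-uncased` is just one illustrative choice of pretrained English checkpoint, and a real fine-tune would loop over many batches with an optimizer.

```python
# Fine-tuning a pretrained English model: the weights already "know English".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # one illustrative pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["She's the best director in the world!", "I hate this movie."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # the library computes the classification loss
outputs.loss.backward()  # one fine-tuning step; a real run would use an optimizer over many batches
```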
The Critical Breakthrough: Transfer Learning
Transfer learning allows models to leverage prior knowledge. In 2018, ULMFiT and BERT showed that pre-training on abundant text data and then fine-tuning on a specific task yielded massive gains, equivalent to having roughly 100x more labeled data. The reason? During pre-training, neural networks learn the underlying structure of the data (grammar, word meanings, common patterns), and that knowledge transfers across tasks.

The same approach is now applied to proteins: models pre-trained on vast databases of protein sequences learn fundamental biological properties, which can then be fine-tuned for predicting structure, function, or interactions. The era of protein language models has begun.
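The recipe carries over almost verbatim to proteins. Here is a hedged sketch: `facebook/esm2_t6_8M_UR50D` is a small, real ESM-2 checkpoint on the Hugging Face Hub, but the sequences, labels, and the binary task itself are made-up placeholders.

```python
# The same fine-tuning recipe, applied to a pretrained protein language model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 model, pre-trained on protein sequences
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder sequences and labels for an imaginary binary property-prediction task.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MGSSHHHHHHSSGLVPRGSH"]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # classification head's loss
loss.backward()  # from here, fine-tune exactly as you would a text model
```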
Explore our example notebooks to see these models in action: