Introduction for Biologists: What Is a Language Model?
The models used to study proteins draw heavily on large language models like BERT and GPT. To understand how they work, let's go back to early 2016, before Trump's election and before Brexit, when deep learning was the hottest new technique in machine learning. Deep learning uses artificial neural networks to learn complex patterns in data, but it needs massive training datasets, which are often unavailable for niche tasks.
Imagine training a model to determine if an English sentence is grammatically correct. You'd collect examples like:
| Text | Label |
|---|---|
| The judge told the jurors to think carefully. | Correct |
| The judge told that the jurors to think carefully. | Incorrect |
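In machine-learning terms, this is a binary text-classification problem. Here is a minimal, purely illustrative PyTorch sketch: a toy bag-of-words featurizer and a linear classifier trained on just these two examples (far too few to learn anything real, but enough to make the setup concrete; all names here are our own, not from any library).

```python
# Purely illustrative: frame grammaticality checking as binary classification.
import torch
import torch.nn as nn

# Tiny labeled dataset (1 = grammatically correct, 0 = incorrect).
examples = [
    ("The judge told the jurors to think carefully.", 1),
    ("The judge told that the jurors to think carefully.", 0),
]

# Toy word-level vocabulary built from the training sentences.
vocab = {w: i for i, w in enumerate(sorted({w for s, _ in examples for w in s.lower().split()}))}

def featurize(sentence: str) -> torch.Tensor:
    """Bag-of-words vector: 1.0 wherever a vocabulary word appears."""
    x = torch.zeros(len(vocab))
    for w in sentence.lower().split():
        if w in vocab:
            x[vocab[w]] = 1.0
    return x

# A randomly initialized classifier: it must learn everything from these examples alone.
model = nn.Linear(len(vocab), 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

X = torch.stack([featurize(s) for s, _ in examples])
y = torch.tensor([label for _, label in examples])
for _ in range(100):  # a few gradient steps on the toy data
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```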
In 2016, models were randomly initialized for each new task, forcing them to learn everything from scratch. This is like being asked to learn a task from examples in a language you don't speak (here, Irish):
| Text | Label |
|---|---|
| Is í an stiúrthóir is fearr ar domhan! | 1 |
| Is fuath liom an scannán seo. | 0 |
Without pre-existing knowledge of the language, you'd need thousands of examples to spot the patterns. But if you already know English, the same task (sentiment analysis, where 1 means positive and 0 means negative) becomes trivial:
| Text | Label |
|---|---|
| She's the best director in the world! | 1 |
| I hate this movie. | 0 |
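In model terms, "already knowing English" means starting from pretrained weights rather than random ones. Below is a minimal sketch using the Hugging Face transformers library; `distilbert-base-uncased` is just one illustrative choice of pretrained English checkpoint, and a real fine-tune would loop over many batches with an optimizer.

```python
# Fine-tuning a pretrained English model: the weights already "know English".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # one illustrative pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["She's the best director in the world!", "I hate this movie."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # the library computes the classification loss
outputs.loss.backward()  # one fine-tuning step; a real run would use an optimizer over many batches
```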
The Critical Breakthrough: Transfer Learning
Transfer learning allows models to leverage prior knowledge. In 2018, ULMFiT and BERT showed that pre-training on abundant text data and then fine-tuning on a specific task yielded massive gains, equivalent to having roughly 100x more labeled data. The reason? During pre-training, neural networks learn the underlying structure of the data (grammar, word meanings, common patterns), and that knowledge transfers across tasks.

The same approach is now applied to proteins: models pre-trained on vast databases of protein sequences learn fundamental biological properties, which can then be fine-tuned for predicting structure, function, or interactions. The era of protein language models has begun.
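The recipe carries over almost verbatim to proteins. Here is a hedged sketch: `facebook/esm2_t6_8M_UR50D` is a small, real ESM-2 checkpoint on the Hugging Face Hub, but the sequences, labels, and the binary task itself are made-up placeholders.

```python
# The same fine-tuning recipe, applied to a pretrained protein language model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 model, pre-trained on protein sequences
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder sequences and labels for an imaginary binary property-prediction task.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MGSSHHHHHHSSGLVPRGSH"]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # classification head's loss
loss.backward()  # from here, fine-tune exactly as you would a text model
```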
Explore our example notebooks to see these models in action: