Meta AI and BigScience recently open-sourced very large language models that won't fit into the memory (RAM or GPU) of most consumer hardware. At Hugging Face, part of our mission is to make even those large models accessible, so we developed tools to allow you to run those models even if you don't own a supercomputer. All the examples in this blog post run on a free Colab instance (with limited RAM and disk space). If you have access to more disk space, feel free to pick larger checkpoints.
Here is how we can run OPT-6.7B:
import torch
from transformers import pipeline
# This works on a base Colab instance.
# Pick a larger checkpoint if you have time to wait and enough disk space!
checkpoint = "facebook/opt-6.7b"
generator = pipeline("text-generation", model=checkpoint, device_map="auto", torch_dtype=torch.float16)
# Perform inference
generator("More and more large language models are opensourced so Hugging Face has")
We'll explain what each argument does in a moment, but first consider the traditional model loading pipeline in PyTorch. It usually consists of the following steps:
- Create the model
- Load its weights into memory (in an object usually called state_dict)
- Load those weights into the created model
- Move the model to the device for inference
While that has worked well in the past, very large models make this approach challenging. Here the model has 6.7 billion parameters. In the default precision, just step 1 (creating the model) takes roughly 26.8GB in RAM (1 parameter in float32 takes 4 bytes). This can't even fit in the RAM you get on Colab.
Then step 2 loads a second copy of the model into RAM (another 26.8GB in the default precision). If you try this with the largest models, like BLOOM (176 billion parameters) or OPT-175B (175 billion parameters), you would need roughly 1.4 terabytes of CPU RAM. That is excessive! And all of this just to move the model to one (or several) GPU(s) at step 4.
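For reference, here is a minimal sketch of what those four steps look like in plain PyTorch. The checkpoint file name is a placeholder (large checkpoints are usually split across several files), so treat this as an illustration of the pattern rather than code to run as-is:
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-6.7b")
# Step 1: create the model (allocates a full copy of the weights in RAM)
model = AutoModelForCausalLM.from_config(config)
# Step 2: load the pretrained weights into memory (a second full copy)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # placeholder file name
# Step 3: fill the model with those weights
model.load_state_dict(state_dict)
# Step 4: move the model to the device used for inference
model.to("cuda")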
Clearly we need something smarter. In this blog post, we explain how Accelerate leverages PyTorch features to load and run inference with very large models, even if they don't fit in RAM or one GPU. In a nutshell, it changes the process above like this:
- Create an empty (i.e., without weights) model
- Decide where each layer will go (when multiple devices are available)
- Load parts of its weights into memory
- Load those weights into the empty model
- Move the weights to the device for inference
- Repeat from step 3 for the next weights until all weights are loaded
Creating an empty model
PyTorch 1.9 introduced a new kind of device called the meta device. This allows us to create tensors without any data attached: a tensor on the meta device only needs a shape. As long as you are on the meta device, you can create arbitrarily large tensors without worrying about CPU (or GPU) RAM.
For instance, the following code will crash on Colab:
import torch
large_tensor = torch.randn(100000, 100000)
This large tensor requires 4 * 10**10 bytes (default precision is FP32, so each element takes 4 bytes) — 40GB of RAM. The same on the meta device works just fine:
import torch
large_tensor = torch.randn(100000, 100000, device="meta")
If you display this tensor, PyTorch prints:
tensor(..., device='meta', size=(100000, 100000))
There is no data associated, just a shape.
You can instantiate a model directly on the meta device:
large_model = torch.nn.Linear(100000, 100000, device="meta")
But for an existing model, this syntax would require rewriting all modeling code so that each submodule accepts a device keyword argument. Since this was impractical for the 150 models in the Transformers library, we developed a context manager that instantiates an empty model for you.
Here is how you can instantiate an empty version of BLOOM:
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
This works on any model, but you get back a shell you can't use directly: some operations are implemented for the meta device, but not all yet. For instance, you can use the large_model defined above with an input, but not the BLOOM model. Even when using it, the output will be a tensor on the meta device, so you get the shape of the result but nothing more.
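To make this concrete, here is a small sketch (with arbitrary shapes) showing that a meta-device module accepts a meta-device input and returns a tensor that only carries a shape and dtype:
import torch

large_model = torch.nn.Linear(100000, 100000, device="meta")
x = torch.randn(4, 100000, device="meta")
output = large_model(x)
print(output.shape)   # torch.Size([4, 100000])
print(output.device)  # meta -- no values were actually computed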
As further work, the PyTorch team is developing a new class FakeTensor, which is like tensors on the meta device but with device information (on top of shape and dtype).
Since we know the shape of each weight, we can know how much memory they will consume once we load the pretrained tensors fully. Therefore, we can decide how to split our model across CPUs and GPUs.
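As a rough illustration (not the exact computation Accelerate performs), the footprint can be estimated directly from the empty model, since numel() and element_size() only need the shape and dtype:
# model is the empty (meta-device) model created above
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"~{total_bytes / 1024**3:.1f} GiB needed to materialize the weights")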
Computing a device map
Before loading the pretrained weights, we need to know where to put them. This way we free CPU RAM each time we have placed a weight in its right location. This can be done with the empty model on the meta device, since we only need the shape and dtype of each tensor to compute its memory footprint.
Accelerate provides a function to automatically determine a device map from an empty model. It tries to maximize the use of all available GPUs, then CPU RAM, and finally flags weights that don't fit for disk offload. Let's look at OPT-13b:
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model)
This returns a dictionary mapping modules or weights to a device. On a machine with one Titan RTX, we get:
{'model.decoder.embed_tokens': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
...
'model.decoder.layers.9': 0,
'model.decoder.layers.10.self_attn': 0,
'model.decoder.layers.10.activation_fn': 0,
'model.decoder.layers.10.self_attn_layer_norm': 0,
'model.decoder.layers.10.fc1': 'cpu',
'model.decoder.layers.10.fc2': 'cpu',
'model.decoder.layers.10.final_layer_norm': 'cpu',
...
'model.decoder.layers.17': 'cpu',
'model.decoder.layers.18': ...}
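The automatic map can also be constrained. For example, infer_auto_device_map accepts a max_memory dictionary to cap how much each device may hold and a no_split_module_classes list to keep residual blocks on a single device; the values below are only illustrative:
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "30GiB"},        # limit GPU 0 and CPU usage
    no_split_module_classes=["OPTDecoderLayer"],    # never split a decoder block across devices
)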
Sharding state dicts
Once we have a device map, we can load the weights in chunks rather than all at once. For very large models, the pretrained weights on the Hub are split into several checkpoint files (shards), each a few gigabytes in size. Accelerate loads one shard at a time, puts each weight on the device the device map assigns it to, and frees that shard from RAM before loading the next, so the full checkpoint never sits in memory at once.
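For instance, Transformers can write such sharded checkpoints via save_pretrained with a maximum shard size; the path and size below are arbitrary, and this assumes model holds real weights (not the empty meta-device shell):
# Save the weights in shards of at most 10GB each instead of one huge file
model.save_pretrained("opt-13b-sharded", max_shard_size="10GB")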
Running a model split on several devices
Finally, passing device_map="auto" to the pipeline (or to from_pretrained) triggers this whole process: Accelerate computes a device map, loads the sharded weights accordingly, and attaches hooks to the model so that, during inference, inputs are moved to the right device and offloaded weights are copied onto a GPU just before the layer that needs them runs. The model is used exactly as usual; the data movement happens under the hood.
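For a model loaded outside the pipeline, roughly the same effect can be obtained with Accelerate's load_checkpoint_and_dispatch; in this sketch, "opt-13b-checkpoint" is a placeholder for a local folder containing the downloaded (sharded) weights:
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    "opt-13b-checkpoint",       # placeholder path to the downloaded weights
    device_map="auto",
    offload_folder="offload",   # where weights that fit neither on GPU nor in RAM go
)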
With these techniques, even massive models like BLOOM-176B can be run on a single machine with multiple GPUs and sufficient CPU RAM. This democratizes access to state-of-the-art language models for researchers and enthusiasts without supercomputers.