DailyGlimpse

Run a ChatGPT-Like Chatbot on a Single AMD GPU with ROCm

AI
April 26, 2026 · 4:57 PM

A new guide shows how to run the Vicuna 13B open-source chatbot on a single AMD GPU using ROCm and GPTQ quantization.

ChatGPT has revolutionized AI, but running large language models typically requires multiple expensive GPUs. Now, a team from UC Berkeley, CMU, Stanford, and UC San Diego has created Vicuna, a 13-billion-parameter chatbot that reaches more than 90% of ChatGPT's quality in a GPT-4-judged evaluation, and it can run on a single AMD GPU.

What is Vicuna?

Vicuna is an open-source chatbot fine-tuned from LLaMA using 70,000 user-shared conversations from ShareGPT. The training cost was only around $300. To reduce memory requirements, the model uses GPTQ quantization, which compresses it to 4-bit precision without significant accuracy loss.
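
To build intuition for what 4-bit quantization does, here is a simplified round-to-nearest sketch in pure Python. This is not the actual GPTQ algorithm (GPTQ additionally corrects rounding error using second-order weight statistics); it only illustrates the idea of storing small integers plus a per-group scale.

```python
# Simplified sketch of group-wise 4-bit quantization (round-to-nearest).
# Real GPTQ is more sophisticated; this only shows the storage idea.

def quantize_group(weights, bits=4):
    """Quantize one group of floats to signed ints in [-8, 7] plus a scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.31, 0.07, -0.88, 0.44, -0.02, 0.66]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)       # each integer fits in 4 bits
print(err)     # worst-case error, roughly scale / 2
```

With a group size of 128 (the `--groupsize 128` used below), each group of 128 weights gets its own scale, which keeps the rounding error small relative to the memory saved.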

Running Vicuna on AMD GPU with ROCm

This guide demonstrates running Vicuna 13B on an AMD Instinct MI210 or Radeon RX 6900 XT with ROCm 5.4.3 and PyTorch 2.0.

Step-by-Step

  1. Install ROCm on Ubuntu 22.04:
    sudo apt update && sudo apt upgrade -y
    wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/jammy/amdgpu-install_5.4.50403-1_all.deb
    sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb
    sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms
    sudo reboot
    
  2. Verify installation:
    rocm-smi
    sudo rocminfo
    
  3. Pull Docker container:
    docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
    docker run --device=/dev/kfd --device=/dev/dri --group-add video \
      --shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
      --ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER} \
      rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
    
  4. Download quantized model:
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
    
  5. Compile GPTQ kernels and run:
    mkdir -p repositories && cd repositories
    git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
    cd GPTQ-for-LLaMa
    python setup_cuda.py install
    python llama_inference.py ../../models/vicuna-13b --wbits 4 --load ../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text "Your input text here"
    
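The "4-bit" in those flags means two weights share a single byte in memory. A minimal pure-Python sketch of that packing is below; the actual GPTQ-for-LLaMa kernels use a different layout (eight 4-bit values per 32-bit word), so this is illustrative only.

```python
# Sketch: packing two unsigned 4-bit values per byte.
# (GPTQ-for-LLaMa's kernels actually pack eight 4-bit values per 32-bit int.)

def pack4(values):
    """Pack ints in [0, 15] into a bytes object, two values per byte."""
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack4(packed):
    """Recover the original 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)        # high nibble
        out.append(b & 0x0F)      # low nibble
    return out

vals = [3, 12, 0, 15, 7, 8]
packed = pack4(vals)
print(len(packed))               # 3 bytes for 6 values: half the storage
print(unpack4(packed) == vals)   # True: the packing itself is lossless
```

Note that the packing is lossless; all the accuracy cost of GPTQ comes from the quantization step, not from this bit-level storage.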

The quantized model uses about 7GB of GPU memory, making it feasible on a single AMD GPU. This opens the door for researchers and hobbyists to run powerful chatbots locally.
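
The ~7GB figure is consistent with back-of-the-envelope arithmetic. The sketch below assumes one fp16 scale per 128-weight group and ignores zero-points, activations, and the KV cache, which account for the remaining headroom up to the observed usage.

```python
# Back-of-the-envelope VRAM estimate for a 13B-parameter 4-bit model.
params = 13e9
weight_bytes = params * 4 / 8            # 4 bits per weight
scale_bytes = params / 128 * 2           # one fp16 scale per 128-weight group
total_gib = (weight_bytes + scale_bytes) / 2**30
print(round(total_gib, 2))               # ~6.2 GiB for the weights alone
```

Compare this with the ~26GB an unquantized fp16 copy of the same model would need, which is why the 4-bit version fits comfortably on a single consumer GPU.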