DailyGlimpse

Run a ChatGPT-Like Chatbot on a Single AMD GPU with ROCm

AI
April 26, 2026 · 4:57 PM

A new guide shows how to run the Vicuna 13B open-source chatbot on a single AMD GPU using ROCm and GPTQ quantization.

ChatGPT has revolutionized AI, but running large language models typically requires multiple expensive GPUs. Now, a team from UC Berkeley, CMU, Stanford, and UC San Diego has created Vicuna, a 13-billion-parameter chatbot that reaches more than 90% of ChatGPT's quality in a GPT-4-judged evaluation, and it can run on a single AMD GPU.

What is Vicuna?

Vicuna is an open-source chatbot fine-tuned from LLaMA using 70,000 user-shared conversations from ShareGPT. The training cost was only around $300. To reduce memory requirements, the model uses GPTQ quantization, which compresses it to 4-bit precision without significant accuracy loss.
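
To build intuition for what 4-bit quantization does, here is a simplified round-to-nearest sketch in pure Python. This is not the actual GPTQ algorithm (GPTQ additionally corrects rounding error using second-order weight statistics); it only illustrates the idea of storing small integers plus a per-group scale.

```python
# Simplified sketch of group-wise 4-bit quantization (round-to-nearest).
# Real GPTQ is more sophisticated; this only shows the storage idea.

def quantize_group(weights, bits=4):
    """Quantize one group of floats to signed ints in [-8, 7] plus a scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.31, 0.07, -0.88, 0.44, -0.02, 0.66]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)       # each integer fits in 4 bits
print(err)     # worst-case error, roughly scale / 2
```

With a group size of 128 (the `--groupsize 128` used below), each group of 128 weights gets its own scale, which keeps the rounding error small relative to the memory saved.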

Running Vicuna on AMD GPU with ROCm

This guide demonstrates running Vicuna 13B on an AMD Instinct MI210 or Radeon RX 6900 XT with ROCm 5.4.3 and PyTorch 2.0.

Step-by-Step

  1. Install ROCm on Ubuntu 22.04:
    sudo apt update && sudo apt upgrade -y
    wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/jammy/amdgpu-install_5.4.50403-1_all.deb
    sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb
    sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms
    sudo reboot
    
  2. Verify installation:
    rocm-smi
    sudo rocminfo
    
  3. Pull Docker container:
    docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
    docker run --device=/dev/kfd --device=/dev/dri --group-add video \
      --shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
      --ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER} \
      rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
    
  4. Download quantized model:
    git clone https://github.com/oobabooga/text-generation-webui.git
    cd text-generation-webui
    python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
    
  5. Compile GPTQ kernels and run:
    mkdir -p repositories && cd repositories
    git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
    cd GPTQ-for-LLaMa
    python setup_cuda.py install
    python llama_inference.py ../../models/vicuna-13b --wbits 4 --load ../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text "Your input text here"
    
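The "4-bit" in those flags means two weights share a single byte in memory. A minimal pure-Python sketch of that packing is below; the actual GPTQ-for-LLaMa kernels use a different layout (eight 4-bit values per 32-bit word), so this is illustrative only.

```python
# Sketch: packing two unsigned 4-bit values per byte.
# (GPTQ-for-LLaMa's kernels actually pack eight 4-bit values per 32-bit int.)

def pack4(values):
    """Pack ints in [0, 15] into a bytes object, two values per byte."""
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1]
                 for i in range(0, len(values), 2))

def unpack4(packed):
    """Recover the original 4-bit values from the packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)        # high nibble
        out.append(b & 0x0F)      # low nibble
    return out

vals = [3, 12, 0, 15, 7, 8]
packed = pack4(vals)
print(len(packed))               # 3 bytes for 6 values: half the storage
print(unpack4(packed) == vals)   # True: the packing itself is lossless
```

Note that the packing is lossless; all the accuracy cost of GPTQ comes from the quantization step, not from this bit-level storage.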

The quantized model uses about 7GB of GPU memory, making it feasible on a single AMD GPU. This opens the door for researchers and hobbyists to run powerful chatbots locally.
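
The ~7GB figure is consistent with back-of-the-envelope arithmetic. The sketch below assumes one fp16 scale per 128-weight group and ignores zero-points, activations, and the KV cache, which account for the remaining headroom up to the observed usage.

```python
# Back-of-the-envelope VRAM estimate for a 13B-parameter 4-bit model.
params = 13e9
weight_bytes = params * 4 / 8            # 4 bits per weight
scale_bytes = params / 128 * 2           # one fp16 scale per 128-weight group
total_gib = (weight_bytes + scale_bytes) / 2**30
print(round(total_gib, 2))               # ~6.2 GiB for the weights alone
```

Compare this with the ~26GB an unquantized fp16 copy of the same model would need, which is why the 4-bit version fits comfortably on a single consumer GPU.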