This blog post demonstrates how to fine-tune pre-trained Transformer models on Graphcore Intelligence Processing Units (IPUs) using the Hugging Face Optimum library. As a practical example, we provide a step-by-step guide and notebook that trains a vision transformer (ViT) model on a large chest X-ray dataset.
Introducing Vision Transformer (ViT) Models
In 2017, Google AI researchers introduced the transformer architecture, characterized by a self-attention mechanism, which quickly became the standard for natural language processing (NLP). While transformers like GPT and BERT excel in language tasks, they are versatile enough for computer vision (CV). The vision transformer (ViT), introduced in a 2021 paper by Google Research, applies self-attention to image recognition. Instead of processing pixel arrays like convolutional neural networks (CNNs), ViT divides an image into patches—similar to words in a sentence—and encodes each patch into a vector. Pre-training allows ViT to learn internal image representations, which can be used for downstream tasks like classification by adding a linear layer on top of the [CLS] token.
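To make the patch-and-encode idea concrete, here is a minimal PyTorch sketch of the input side of a ViT-Base style model. It is an illustration rather than the model's actual implementation; the dimensions follow the 224x224 image, 16x16-patch configuration, and the variable names are our own.

```python
import torch
import torch.nn as nn

# ViT-Base style settings: a 224x224 image split into 16x16 patches -> 14*14 = 196 patches
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2

# A convolution whose kernel and stride equal the patch size flattens and
# linearly projects each patch in a single step.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable position embeddings

pixel_values = torch.randn(2, 3, image_size, image_size)              # a dummy batch of two images
patches = patch_embed(pixel_values)                                   # (2, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)                          # (2, 196, 768): one vector per patch
cls = cls_token.expand(pixel_values.shape[0], -1, -1)                 # one [CLS] token per image
tokens = torch.cat([cls, patches], dim=1) + pos_embed                 # (2, 197, 768)

# `tokens` is what the transformer encoder consumes; for classification, a
# linear head is applied to the encoded [CLS] token (tokens[:, 0]).
print(tokens.shape)  # torch.Size([2, 197, 768])
```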
ViT models have matched or exceeded the accuracy of CNNs while requiring substantially less compute to train, and they are used in applications such as image classification, object detection, and segmentation. In healthcare, they have been applied to detect COVID-19, femur fractures, emphysema, breast cancer, and Alzheimer's disease.
Why ViT Models Are a Perfect Fit for IPU
Graphcore IPUs are well-suited to ViT models because they can parallelize training through a combination of data pipelining and model parallelism. The IPU's MIMD architecture and its IPU-Fabric scale-out solution accelerate this process. Pipeline parallelism makes larger batch sizes possible, improves memory-access efficiency, and reduces the communication time of parameter aggregation in data-parallel training.
The Hugging Face Optimum Graphcore library includes pre-optimized transformer models, making it easy to achieve high performance when running ViT on IPUs. Graphcore provides ready-to-use IPU-trained model checkpoints and configuration files, so users can start from pre-trained checkpoints on the Hugging Face model hub rather than training from scratch. Optimum also shortens the model development lifecycle: any public dataset can be plugged in with minimal code changes.
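As an illustration of how those ready-made configurations are consumed, the snippet below loads one of Graphcore's published IPU configurations from the Hugging Face Hub. The repository name used here (Graphcore/vit-base-ipu) is an assumption and should be checked against the Graphcore organisation page on the Hub.

```python
from optimum.graphcore import IPUConfig

# Graphcore publishes IPU execution configurations (e.g. how encoder layers are
# split across IPUs) on the Hugging Face Hub; the repo name below is assumed.
ipu_config = IPUConfig.from_pretrained("Graphcore/vit-base-ipu")
print(ipu_config)
```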
For this post, we use a ViT model pre-trained on ImageNet-21k, based on the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. We fine-tune it on the ChestX-ray14 Dataset.
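To make that setup concrete, here is a minimal sketch (assuming the standard transformers API) that loads the checkpoint with a fresh classification head sized for the 14 finding labels in ChestX-ray14, configured for multi-label outputs since a single X-ray can carry several findings.

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification

checkpoint = "google/vit-base-patch16-224-in21k"

# ChestX-ray14 annotates each image with any subset of 14 findings, so we ask
# for a multi-label head (independent sigmoid per finding) instead of a softmax.
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=14,
    problem_type="multi_label_classification",
)

# The feature extractor resizes and normalizes images to the 224x224 resolution
# the checkpoint was pre-trained with.
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
```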
The Value of ViT Models for X-ray Classification
Medical imaging tasks are challenging because radiologists must detect subtle differences in X-rays. Computer-aided detection and diagnosis (CAD) techniques can improve clinician workflows and patient outcomes. However, developing X-ray classification models faces several hurdles: training from scratch requires large amounts of labeled data, the high resolution of X-ray images demands significant compute, and multi-label problems such as pulmonary diagnosis add further complexity.
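To show what "multi-label" means in practice, the short sketch below encodes the findings of one hypothetical image as a multi-hot vector and scores it with the binary cross-entropy loss typically used for multi-label classification; the finding names are a subset of the ChestX-ray14 label set.

```python
import torch

# A single chest X-ray can show several findings at once, so each image gets a
# multi-hot label vector rather than a single class index.
label_names = ["Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass"]
findings = {"Cardiomegaly", "Effusion"}   # findings reported for one hypothetical image

target = torch.tensor([[1.0 if name in findings else 0.0 for name in label_names]])
logits = torch.randn(1, len(label_names))  # stand-in for the model's raw outputs

# Multi-label classification applies an independent sigmoid to each finding,
# which is what BCEWithLogitsLoss does internally.
loss = torch.nn.BCEWithLogitsLoss()(logits, target)
print(target, loss.item())
```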
Using Hugging Face Optimum, we avoid training from scratch by using model weights from the Hugging Face model hub. We use the google/vit-base-patch16-224-in21k checkpoint, converted from the TIMM repository and pre-trained on 14 million images from ImageNet-21k. The configuration is available through the Graphcore-ViT model card. If this is your first time using IPUs, refer to the IPU Programmer's Guide, the PyTorch basics tutorial, and the Hugging Face Optimum Notebooks.
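Putting the pieces together, a fine-tuning run with Optimum Graphcore looks roughly like the sketch below. IPUTrainer and IPUTrainingArguments are intended to mirror the familiar Trainer API; the hyperparameter values are illustrative, the IPU config repo name is assumed, train_dataset and eval_dataset are placeholders for the prepared ChestX-ray14 splits, and the accompanying notebook remains the authoritative reference.

```python
from transformers import ViTForImageClassification
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

# Pre-trained weights from the Hub, with a multi-label head for the 14 findings.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=14,
    problem_type="multi_label_classification",
)

# IPU execution settings from Graphcore's published configuration (repo name assumed).
ipu_config = IPUConfig.from_pretrained("Graphcore/vit-base-ipu")

training_args = IPUTrainingArguments(
    output_dir="./vit-chest-xray",
    per_device_train_batch_size=1,    # illustrative values only
    gradient_accumulation_steps=128,
    learning_rate=3e-4,
    num_train_epochs=3,
    dataloader_drop_last=True,
)

# train_dataset and eval_dataset are assumed to be the prepared ChestX-ray14 splits.
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
```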