In a previous article, we covered the theoretical foundations of machine learning on graphs. Now, we dive into practical implementation: how to perform graph classification using the Hugging Face Transformers library. This tutorial focuses on Microsoft's Graphormer, currently the only graph transformer model available in Transformers, and walks through data loading, preprocessing, model setup, and training.
Requirements
You'll need datasets and transformers (version >= 4.27.2). Install them with:
pip install -U datasets transformers
Data
You can use your own graph datasets or those available on the Hugging Face Hub. We'll use the ogbg-molhiv dataset from the Open Graph Benchmark.
Loading
Loading a graph dataset from the Hub is straightforward:
from datasets import load_dataset
dataset = load_dataset("OGB/ogbg-molhiv")
dataset = dataset.shuffle(seed=0)
This dataset includes train, validation, and test splits, each containing columns like edge_index, edge_attr, y, num_nodes, and node_feat.
You can visualize graphs using libraries like NetworkX and matplotlib:
import networkx as nx
import matplotlib.pyplot as plt

# Take the first graph of the training split
graph = dataset["train"][0]
edges = graph["edge_index"]  # two parallel lists: source and target node indices
num_nodes = graph["num_nodes"]

# Build an undirected NetworkX graph from the edge index
G = nx.Graph()
G.add_nodes_from(range(num_nodes))
G.add_edges_from(zip(edges[0], edges[1]))

nx.draw(G)
plt.show()
Format
Graph datasets on the Hub are stored as lists of graphs in JSONL format. Each graph is a dictionary with:
- edge_index: two parallel lists of integers (source and target node indices) representing the edges.
- num_nodes: integer, total number of nodes in the graph (nodes are assumed to be numbered sequentially from 0).
- y: list of labels, one per graph (integers for classification, floats for regression, lists for multi-task classification).
- node_feat (optional): list of lists of integers, one feature vector per node.
- edge_attr (optional): list of lists of integers, one attribute vector per edge.
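To make the format concrete, here is a hypothetical toy graph in this structure (illustrative values only, not an actual ogbg-molhiv entry), together with the consistency constraints the fields imply:

```python
# A hypothetical 3-node graph in the Hub's graph format
# (toy values for illustration; not a real ogbg-molhiv record).
graph = {
    # Two parallel lists: edge i goes from edge_index[0][i] to edge_index[1][i].
    "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]],
    "num_nodes": 3,                      # nodes are numbered 0..num_nodes-1
    "y": [1],                            # one classification label for the whole graph
    "node_feat": [[6], [8], [6]],        # one integer feature vector per node
    "edge_attr": [[1], [1], [2], [2]],   # one integer attribute vector per edge
}

# Basic consistency checks implied by the format:
assert len(graph["edge_index"][0]) == len(graph["edge_index"][1])
assert len(graph["node_feat"]) == graph["num_nodes"]
assert len(graph["edge_attr"]) == len(graph["edge_index"][0])
assert all(0 <= v < graph["num_nodes"]
           for side in graph["edge_index"] for v in side)
```

Note that undirected edges are stored in both directions in edge_index, as in the example above.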
Preprocessing
Graphormer requires specific preprocessing to generate features like degree information and shortest path matrices. Use:
from transformers.models.graphormer.collating_graphormer import preprocess_item, GraphormerDataCollator
dataset_processed = dataset.map(preprocess_item, batched=False)
Alternatively, for large datasets, you can skip the map step and let the data collator preprocess each batch on the fly (GraphormerDataCollator accepts an on_the_fly_processing flag for this).
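To give an intuition for what this preprocessing computes, Graphormer's structural encodings include node degrees and all-pairs shortest-path distances between nodes. The following is an illustrative, dependency-free sketch of those two quantities, not the actual preprocess_item implementation:

```python
from collections import deque

def shortest_path_matrix(num_nodes, edge_index):
    """All-pairs shortest-path distances via BFS from every node.

    Illustrates the spatial encoding Graphormer relies on; unreachable
    pairs are marked -1. A sketch, not the real preprocess_item.
    """
    # Build adjacency lists from the two parallel edge lists.
    adj = [[] for _ in range(num_nodes)]
    for src, dst in zip(edge_index[0], edge_index[1]):
        adj[src].append(dst)

    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for start in range(num_nodes):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[start][v] == -1:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

# Path graph 0 - 1 - 2, with each undirected edge stored in both directions:
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
print(shortest_path_matrix(3, edge_index))  # → [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

# In-degree of each node, read directly off the target list:
print([edge_index[1].count(node) for node in range(3)])  # → [1, 2, 1]
```

The real preprocessing additionally encodes edge features along these shortest paths and packs everything into the tensors the model expects.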
Model
Loading
Load a pretrained Graphormer model and fine-tune it for your downstream task. For binary classification, set num_classes=2:
from transformers import GraphormerForGraphClassification
model = GraphormerForGraphClassification.from_pretrained(
    "clefourrier/pcqm4mv2_graphormer_base",
    num_classes=2,  # binary classification head
    ignore_mismatched_sizes=True,  # the pretrained head has a different output size
)
You can also create a randomly initialized model from scratch.
Training
Use the Trainer class with a TrainingArguments configuration and an evaluation metric, then call trainer.train(). For the complete training loop, check the full notebook linked in the original article.
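On the metric side, ogbg-molhiv is conventionally evaluated with ROC-AUC. In practice you would likely use sklearn.metrics.roc_auc_score or the official OGB evaluator inside your compute_metrics function; as a dependency-free illustration, ROC-AUC can be computed as a rank statistic, the probability that a random positive example scores higher than a random negative one:

```python
def roc_auc(labels, scores):
    """ROC-AUC as the rank statistic P(score_pos > score_neg),
    counting ties as 0.5. O(n_pos * n_neg): fine as a sketch,
    too slow for large evaluation sets."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# A perfect ranking yields an AUC of 1.0:
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```

Wrapped in a compute_metrics(eval_pred) function that extracts labels and predicted scores from eval_pred, a metric like this plugs directly into the Trainer.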
Ending Note
This tutorial demonstrated graph classification using Graphormer in Hugging Face Transformers. With the Hub's datasets and pre-trained models, you can quickly adapt this workflow to your own graph classification tasks.