Active learning in machine learning involves iteratively adding labeled data, retraining a model, and serving it to users. This article explores how to combine AutoNLP and Prodigy to create an efficient active learning workflow.
AutoNLP
AutoNLP, developed by Hugging Face, enables building state-of-the-art deep learning models with minimal coding. It leverages Hugging Face's transformers, datasets, and Inference API to automatically fine-tune and deploy models. At the time of writing, AutoNLP supports tasks including binary classification, regression, multi-class classification, token classification (e.g., named entity recognition), question answering, and summarization. It also supports multiple languages, including English, French, German, Spanish, Hindi, Dutch, and Swedish, with custom model options for unsupported languages.
Prodigy
Prodigy, created by Explosion (the makers of spaCy), is a web-based annotation tool for real-time data labeling. It supports NLP tasks like named entity recognition and text classification, as well as computer vision and custom tasks. Prodigy is a commercial tool, but a live demo is available on the Prodigy website.
Dataset
For this project, we used the BBC News Classification dataset from Kaggle, which contains news articles labeled with five categories: business, entertainment, politics, sport, and tech. Training a multi-class classification model on this dataset with AutoNLP is straightforward:
- Download the dataset.
- Create a new project on AutoNLP.
- Upload the training dataset and choose auto-splitting.
- Accept the pricing (as low as $10 per model) and train.
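Before uploading in step 3, it can help to sanity-check the CSV locally. The sketch below uses only the standard library and assumes the `Text`/`Category` column names used by the Kaggle file; adjust if your copy differs.

```python
import csv
from collections import Counter

# The five labels in the BBC News Classification dataset.
EXPECTED = {"business", "entertainment", "politics", "sport", "tech"}

def label_counts(path):
    """Return per-category article counts from the training CSV.

    Assumes a "Category" column, as in the Kaggle BBC News
    Classification training file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row["Category"] for row in csv.DictReader(f))
    missing = EXPECTED - set(counts)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return counts
```

A quick look at the counts also confirms the classes are reasonably balanced before letting AutoNLP auto-split the data.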
After about 15 minutes, the best model achieved 98.67% accuracy. However, the dataset lacked entity recognition labels, so we used Prodigy to annotate for named entity recognition (NER).
Active Learning
With Prodigy installed, we ran the following command to start labeling:
prodigy ner.manual bbc blank:en BBC_News_Train.csv --label PERSON,ORG,PRODUCT,LOCATION
This command creates a Prodigy dataset named bbc, tokenizes the text with spaCy's blank English pipeline (blank:en), and restricts annotation to the specified labels. After labeling around 20 samples, we trained a token classification model using AutoNLP. The initial model performed poorly (86% accuracy, with precision and recall of 0), as expected with so little data. After labeling 70 samples, accuracy rose to 92%, with precision and recall around 0.52 and 0.42, respectively. Despite the improvement, the model was not yet satisfactory.
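For context, combining those precision and recall figures into their harmonic mean gives an F1 of roughly 0.46, which makes the gap to a usable model concrete:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Scores reported after labeling 70 samples.
print(round(f1(0.52, 0.42), 2))  # prints 0.46
```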
To convert Prodigy annotations to AutoNLP format, we used a script that extracts tokens and IOB tags, saving them as a JSONL file. This file can then be used to train a token classification model in AutoNLP, following the same steps as before but selecting the "Token Classification" task.
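The conversion can be sketched as follows. This is a minimal version, not the exact script we used: it assumes annotations exported with Prodigy's db-out command, where each record carries "tokens" and "spans" with inclusive token_start/token_end indices, and keeps only accepted examples.

```python
import json

def prodigy_to_iob(records):
    """Convert accepted Prodigy NER records to token/IOB-tag examples.

    Assumes the record layout produced by `prodigy ner.manual`:
    "tokens" is a list of token dicts, and each span in "spans" has
    token_start, token_end (inclusive), and a label.
    """
    examples = []
    for rec in records:
        if rec.get("answer") != "accept":
            continue  # skip rejected/ignored annotations
        tokens = [t["text"] for t in rec["tokens"]]
        tags = ["O"] * len(tokens)
        for span in rec.get("spans", []):
            tags[span["token_start"]] = "B-" + span["label"]
            for i in range(span["token_start"] + 1, span["token_end"] + 1):
                tags[i] = "I-" + span["label"]
        examples.append({"tokens": tokens, "tags": tags})
    return examples

def write_jsonl(examples, path):
    """Write one JSON object per line, the format AutoNLP expects."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

The resulting JSONL pairs each token list with its tag list, ready for upload as a token classification dataset.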
This iterative process demonstrates how active learning with AutoNLP and Prodigy can progressively improve model performance with minimal manual effort.