DailyGlimpse

Build a Text Classifier with Kili and HuggingFace AutoTrain: A Step-by-Step Guide

April 26, 2026 · 5:37 PM
Understanding user feedback is vital for any customer-focused business, but analyzing thousands of reviews manually is costly and time-consuming. Machine learning offers a scalable alternative, and with AutoML tools you can build capable models with minimal coding.

This tutorial walks you through creating an active learning pipeline for text classification using HuggingFace AutoTrain and Kili. Kili is a data annotation platform that enables you to build high-quality training datasets, while AutoTrain automates model training and hyperparameter tuning.

We'll use a real-world example: classifying user reviews of the Medium app from the Google Play Store. By the end, you'll categorize reviews into topics and perform sentiment analysis to gain actionable insights.

What is AutoTrain?

AutoTrain is an automated machine learning framework built on top of HuggingFace transformers, datasets, and inference APIs. It handles data cleaning, model selection, and hyperparameter optimization automatically. At the time of writing, it supports binary and multi-label text classification, token classification, extractive question answering, summarization, and text scoring, across multiple languages.

What is Kili?

Kili is an end-to-end AI training platform for data-centric businesses. It provides collaborative annotation tools for text, image, video, PDF, and voice data, along with quality management features. You can access it via a web interface or powerful GraphQL and Python APIs. A free developer account is available to try its features.

Project Setup

We extracted about 40,000 Medium reviews from the Google Play Store. Our goal is to classify them into four categories:

  • Subscription: Feedback about Medium's paid membership.
  • Content: Opinions on the quality and variety of articles.
  • Interface: Thoughts on UI, search, recommendations, and payments.
  • User Experience: General sentiments not fitting other categories.

We'll also add two extra labels for ambiguous or multi-category reviews: "Other" and "Multiple."

Step 1: Create a Project in Kili

You can create a multi-class text classification project via Kili's web interface. After creating the project, upload your dataset (up to 25,000 samples by default). Configure the labeling interface with the defined categories on the Settings page.
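If you would rather define the labeling interface in code than on the Settings page, the categories can be described as a small JSON document. The sketch below only builds that document. The overall layout (`jobs`, `mlTask`, `content`, `input`) is an assumption about Kili's JSON interface format, and the job name `CLASSIFICATION_JOB` is hypothetical, so check both against your Kili version before using it.

```python
import json

# Category names come from the project setup above. The dictionary layout is
# an assumption about Kili's JSON interface format and may differ by version.
CATEGORIES = [
    "Subscription",
    "Content",
    "Interface",
    "User Experience",
    "Other",
    "Multiple",
]


def to_key(name: str) -> str:
    """Turn a display name into an upper-case key ('User Experience' -> 'USER_EXPERIENCE')."""
    return name.upper().replace(" ", "_")


def build_interface(categories):
    """Build one single-choice classification job covering the given categories."""
    return {
        "jobs": {
            "CLASSIFICATION_JOB": {  # hypothetical job name
                "mlTask": "CLASSIFICATION",
                "required": 1,
                "content": {
                    "input": "radio",  # exactly one label per review
                    "categories": {to_key(c): {"name": c} for c in categories},
                },
            }
        }
    }


interface = build_interface(CATEGORIES)
print(json.dumps(interface, indent=2))
```

Using a single-choice input matches the multi-class setup: a review that fits several topics gets the "Multiple" label instead of several labels.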

Step 2: Label Data

Using Kili's interface, annotate a subset of reviews. The platform supports active learning: you can start with a small labeled set, train an initial model, and then label more samples that the model is uncertain about. This iterative process reduces labeling effort while improving model performance.
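One common way to decide which samples the model is "uncertain about" is least-confidence sampling: pick the reviews whose highest predicted class probability is lowest. The article does not say which uncertainty measure is used, so treat this as one possible choice. The function name and the example probabilities are illustrative only.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top predicted probability is lowest."""
    # Pair each sample's best-class probability with its index, then sort ascending.
    confidence = sorted((max(p), i) for i, p in enumerate(probabilities))
    return [i for _, i in confidence[:k]]


# Made-up class probabilities for four unlabeled reviews (three classes each).
probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],  # most uncertain
]

to_label_next = least_confident(probs, k=2)
print(to_label_next)
```

The selected indices are the reviews to send back to annotators in the next labeling round.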

Step 3: Export Labeled Data

Once you have a sufficient number of labeled reviews, export the data using Kili's API or direct download. Save the result in a tabular format such as CSV, with one column for the review text and one for the label, so that HuggingFace datasets and AutoTrain can read it.
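As a minimal sketch of that export step, the snippet below writes labeled reviews to a two-column CSV file. It assumes each exported record has already been reduced to a dict with `text` and `label` keys. That shape is hypothetical, since Kili's actual export contains more fields, and the column names `text` and `target` are likewise assumptions to match against your training tool.

```python
import csv


def write_training_csv(records, path):
    """Write labeled records to a two-column CSV and return the number of rows written."""
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "target"])
        writer.writeheader()
        for rec in records:
            text = rec["text"].strip()
            if not text or not rec.get("label"):
                continue  # skip empty or still-unlabeled reviews
            writer.writerow({"text": text, "target": rec["label"]})
            written += 1
    return written


# Made-up records standing in for a simplified export.
sample = [
    {"text": "The membership is too expensive.", "label": "SUBSCRIPTION"},
    {"text": "Great articles every day!", "label": "CONTENT"},
    {"text": "Not labeled yet", "label": None},
]

rows_written = write_training_csv(sample, "labeled_data.csv")
print(rows_written)
```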

Step 4: Train with AutoTrain

With your labeled dataset ready, use AutoTrain to fine-tune a transformer model. Simply provide your dataset and specify the task (text classification). AutoTrain will handle the rest, including model selection and hyperparameter tuning.

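# Conceptual example only: flag names differ between AutoTrain versions,
# so check the tool's built-in help for the exact syntax before running it.
autotrain text-classification --model roberta-base --data ./labeled_data.csv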

Step 5: Train a Custom Model (Optional)

If you prefer more control, you can build a model without AutoTrain using HuggingFace transformers. This involves writing a training script, but gives you flexibility to tweak every parameter.
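A custom training script could look roughly like the sketch below, which uses the `Trainer` API from transformers. The model name `roberta-base` echoes the example command above. The hyperparameters, the output directory, the 80/20 split, and the function names are illustrative assumptions, not the settings used for this project. The heavy imports sit inside `fine_tune` so that the small label-mapping helper can be used on its own.

```python
def build_label_maps(labels):
    """Map each distinct label name to an integer id, and back."""
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def fine_tune(texts, labels, model_name="roberta-base", output_dir="review-classifier"):
    """Fine-tune a sequence classifier on (text, label) pairs. Values here are illustrative."""
    # Imported here so build_label_maps works without these packages installed.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Trainer expects the target column to be called "label".
    dataset = Dataset.from_dict(
        {"text": texts, "label": [label2id[name] for name in labels]}
    )
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True), batched=True
    )
    split = dataset.train_test_split(test_size=0.2, seed=42)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(label2id), id2label=id2label, label2id=label2id
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=split["train"],
        eval_dataset=split["test"],
        data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
    )
    trainer.train()
    return trainer


label2id, id2label = build_label_maps(["CONTENT", "SUBSCRIPTION", "CONTENT"])
print(label2id)
```

Calling `fine_tune(texts, labels)` downloads the pretrained weights and starts training, so it is best run on a machine with a GPU.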

Results and Analysis

After training, evaluate your model on a held-out test set. Apply it to the remaining unlabeled reviews to classify them. Then, perform sentiment analysis on each category to understand user satisfaction. For example, you might find that users love the content but are frustrated with subscription pricing.
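For the held-out evaluation, overall accuracy alone can hide weak categories, so it helps to look at precision, recall, and F1 per class. The sketch below computes them in plain Python. The labels and predictions are made-up values for illustration, not results from this project.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, and F1."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report


# Made-up test labels and model predictions.
y_true = ["CONTENT", "CONTENT", "SUBSCRIPTION", "INTERFACE"]
y_pred = ["CONTENT", "SUBSCRIPTION", "SUBSCRIPTION", "INTERFACE"]

accuracy, report = evaluate(y_true, y_pred)
print(accuracy)
for label, scores in report.items():
    print(label, scores)
```

A category with low recall is a natural candidate for more labeling in the next active learning round.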

Conclusion

By combining Kili's annotation platform with HuggingFace AutoTrain, you can quickly build a production-ready text classifier with minimal effort. This pipeline scales to thousands of reviews and can be adapted to other classification tasks. The complete code and dataset are available on the GitHub repository.

Start analyzing user feedback at scale today!