
Object Detection Leaderboard: Decoding Metrics and Avoiding Evaluation Pitfalls


Welcome to our latest dive into the world of leaderboards and model evaluation. In a previous post, we navigated the waters of evaluating Large Language Models. Today, we set sail to a different, yet equally challenging domain – Object Detection.

Recently, we released our Object Detection Leaderboard, ranking object detection models available in the Hub according to several metrics. In this blog, we will demonstrate how the models were evaluated and demystify the popular metrics used in Object Detection, from Intersection over Union (IoU) to Average Precision (AP) and Average Recall (AR). More importantly, we will spotlight the inherent divergences and pitfalls that can occur during evaluation, ensuring that you're equipped with the knowledge not just to understand but to assess model performance critically.

Every developer and researcher aims for a model that can accurately detect and delineate objects, and our Object Detection Leaderboard is the right place to find an open-source model that best fits your application needs. But what does "accurate" truly mean in this context? Which metrics should one trust? How are they computed? And, perhaps more crucially, why might some models present divergent results across different reports? All these questions will be answered in this blog.

So, let's embark on this exploration together and unlock the secrets of the Object Detection Leaderboard!

What's Object Detection?

In the field of Computer Vision, Object Detection refers to the task of identifying and localizing individual objects within an image. Unlike image classification, where the task is to determine the predominant object or scene in the image, object detection not only categorizes the object classes present but also provides spatial information, drawing bounding boxes around each detected object. An object detector can also output a "score" (or "confidence") per detection, representing the probability that the detected object belongs to the predicted class.

The following image, for instance, shows five detections: one "ball" with a confidence of 98% and four "person" detections with confidences of 98%, 95%, 97%, and 97%.

Object detection models are versatile and have a wide range of applications across various domains. Some use cases include vision in autonomous vehicles, face detection, surveillance and security, medical imaging, augmented reality, sports analysis, smart cities, and gesture recognition.

The Hugging Face Hub has hundreds of object detection models pre-trained on different datasets, able to identify and localize various object classes.
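For example, running one of these checkpoints takes only a few lines with the transformers pipeline. The snippet below is a minimal sketch; the checkpoint facebook/detr-resnet-50 and the image path are purely illustrative choices:

```python
from transformers import pipeline

# Load a pre-trained object detection model from the Hub
# (facebook/detr-resnet-50 is used here only as an illustrative checkpoint).
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# The pipeline accepts an image path, URL, or PIL image.
results = detector("path/to/image.jpg")

# Each detection carries a class label, a confidence score, and a bounding box, e.g.:
# {'score': 0.98, 'label': 'person', 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 320}}
print(results)
```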

One specific type of object detection model, called zero-shot, can receive additional text queries describing the target objects to search for. This allows them to detect objects they haven't seen during training, rather than being constrained to a fixed set of classes.
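Here is a minimal sketch of how a zero-shot detector can be queried with free-form text labels; the checkpoint (google/owlvit-base-patch32), the image path, and the candidate labels are illustrative assumptions:

```python
from transformers import pipeline

# Zero-shot detectors take free-form text queries for the classes to find.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

results = detector(
    "path/to/image.jpg",
    candidate_labels=["a soccer ball", "a referee", "a goal post"],
)

# The output mirrors the regular object-detection pipeline:
# one label, score, and bounding box per match.
print(results)
```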

The diversity of detectors goes beyond the range of output classes they can recognize. They vary in terms of underlying architectures, model sizes, processing speeds, and prediction accuracy.

Popular metrics used to evaluate the accuracy of an object detection model's predictions are Average Precision (AP) and its variants, which will be explained later in this blog.

Evaluating an object detection model involves several components: a dataset with ground-truth annotations, detections (output predictions), and metrics.

First, a benchmarking dataset containing images with ground-truth bounding box annotations is chosen and fed into the object detection model. The model predicts bounding boxes for each image, assigning class labels and confidence scores to each box. During the evaluation phase, these predicted bounding boxes are compared with the ground-truth boxes in the dataset. The evaluation yields a set of metrics, each in the range [0, 1], reflecting a specific evaluation criterion. In the next section, we'll dive into the computation of the metrics in detail.
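To make this concrete, the sketch below shows one way to pair ground-truth annotations with a model's predictions and compute AP-style metrics. The box coordinates, labels, and scores are made-up values, and torchmetrics is just one of several tools implementing these metrics:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# Ground-truth annotations from the benchmarking dataset (one dict per image).
# Boxes are in [x_min, y_min, x_max, y_max] format; values here are illustrative.
target = [{
    "boxes": torch.tensor([[50.0, 50.0, 150.0, 150.0]]),
    "labels": torch.tensor([0]),
}]

# Model predictions for the same image: boxes, class labels, and confidence scores.
preds = [{
    "boxes": torch.tensor([[60.0, 60.0, 170.0, 160.0]]),
    "scores": torch.tensor([0.98]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")
metric.update(preds, target)
print(metric.compute())  # dict with 'map', 'map_50', 'map_75', 'mar_100', ... all in [0, 1]
```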

Metrics

This section will delve into the definition of Average Precision and Average Recall, their variations, and their associated computation methodologies.

What's Average Precision and how to compute it?

Average Precision (AP) is a single number that summarizes the Precision x Recall curve. Before we explain how to compute it, we first need to understand the concept of Intersection over Union (IoU), and how to classify a detection as a True Positive or a False Positive.

IoU is a metric represented by a number between 0 and 1 that measures the overlap between the predicted bounding box and the actual (ground-truth) bounding box. It's computed by dividing the area where the two boxes overlap by the total area covered by both boxes combined, i.e. the area of their intersection divided by the area of their union.
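Before moving on, here is a small, self-contained sketch of the IoU computation for two boxes in [x_min, y_min, x_max, y_max] format (the coordinates in the example are made up):

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes in [x_min, y_min, x_max, y_max] format."""
    # Coordinates of the intersection rectangle
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[2], box_b[2])
    y_bottom = min(box_a[3], box_b[3])

    # No overlap at all
    if x_right <= x_left or y_bottom <= y_top:
        return 0.0

    intersection = (x_right - x_left) * (y_bottom - y_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # IoU = intersection area / union area
    return intersection / (area_a + area_b - intersection)


# Example: a predicted box vs. a ground-truth box
print(iou([60, 60, 170, 160], [50, 50, 150, 150]))  # ~0.63
```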