
Comparing object detection models


Mar 1, 2019 - 15 minute read

Jaroslaw Szczegielniak


Computer Vision with AI is amazing technology. Our eyes and brains have evolved to easily search complex images for details with incredible speed. But our ability to repeat this reliably and consistently over long durations or with similar images is limited. We get bored, we get tired, we get distracted. And there are many business and health applications where the implications of human failure are so high, it’s worth investing significant resources to either augment or replace the people performing the visual checks.

There are many tools, techniques and models that professional data science teams can deploy to find those visual markers. In general, we call this object detection. In this article I explore some of the ways to measure how effectively those methods work, which helps us choose the best one for any given problem. I’ve written this article mainly with aspiring machine learning and computer vision specialists in mind. If you’re interested in the subject and would like to understand it from a more business-oriented point of view, please contact me. We are always willing to help.

What is object detection

Object detection is a computer technology allowing us to annotate images with information about objects and their locations in a picture. It is a natural extension of the image classification technology - where image classification assigns a label to a picture, object detection = localization + classification.

Using one of the images provided by Microsoft in Object Detection QuickStart, we can see the difference between image classification and object detection below:

Image classification

Object detection

Object detection benefits are more obvious if the image contains multiple overlapping objects (taken from CrowdHuman dataset. Shao, 2018):

Simple bounding-boxes returned with labels add very useful information, that may be used in further analysis of the picture. For example, we can count objects, we can determine how close or far they are from each other. If we analyze a sequence of frames, we can even predict collisions or the progression of objects over time - such as marks on our skin or scans of our organs.

Before we can deploy a solution, we need a number of trained models/techniques to compare in a highly controlled and fair way. Only then can we choose the best one for that particular job. To be comparable, our tests must be rigorous, and to increase our certainty that we have chosen the “best”, the data volumes will be high. Comparing them properly is a complex undertaking and we should not underestimate the challenge.

After this high-level introduction to the problem, it is time to dive deeper into the specific techniques for establishing how well each method fits and for comparing methods to find the best one.

Which detection is the better one?

Typically detection tools return a list of objects giving data about the location on the image, their size, what that object is and the confidence of the classification. Depending on the complexity of the image and the application, that list could have only one entry or there could be hundreds per image, described in a format similar to the snippet below:
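
As a rough illustration, assuming a hypothetical result structure (the field names and values below are invented for this sketch and do not belong to any particular tool), such a list might look like this:

```python
# Hypothetical detection results for one image.
# Field names ("label", "confidence", "box") are assumptions made for this sketch.
results = [
    {
        "label": "fork",       # what the object is
        "confidence": 0.92,    # how sure the model is, between 0 and 1
        "box": {"x": 120, "y": 45, "width": 310, "height": 80},  # location and size in pixels
    },
    {
        "label": "fork",
        "confidence": 0.31,
        "box": {"x": 10, "y": 200, "width": 60, "height": 25},
    },
]

for item in results:
    print(item["label"], item["confidence"], item["box"])
```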

For now, please don’t worry about the returned “confidence” attribute – we will come back to it shortly.

Ok, so we have multiple tools, each of them returning many items. How hard can it be to work out which is the best one? Let’s try to determine it using our previous fork example and visualization of 4 sample results from different methods (i.e. object detection models):

Method A:

Looks pretty good but a part of the fork is cropped out:

Method B

The cropping is better but there is a phantom fork detected on the left side:

Method C

Cropping is pretty good:

Method D

What?? It’s sort of near the fork but doesn’t really look correct: 

Intuitively I would prefer Method A or Method C, but how should I explain it to the computer? To do this we need a list of factors to consider when calculating “a score” for a result, and a “ground truth” describing all objects visible in an image together with their true locations. In our case, the “truth” could be visualized as presented below.


Naive approach: binary scoring

How can we compare our results for Methods A-D to this truth (T)? A naive way would be to use a binary score, similar to those we might use in classification tasks:

  • Score = 1 if a result matches T (i.e. each detected object has the same coordinates that are defined in the “ground truth”).
  • Score = 0 in any other case.

Unfortunately, in this case simple does not mean reasonable – all our results A-D would get the same score of 0, which is not useful. Because the chance of getting a perfect match is close to 0, in practice we cannot use this score to compare any results, so we need to keep looking.

Building a set of useful metrics

We need to find a way to calculate a value between 0 and 1, where 1 means a perfect match, and 0 means no match at all. This function should reflect the following factors:

  1. Does the detection result contain all the objects that are visible on the image?
  2. Does the detection result contain some objects that in fact are not present on the image?
  3. Are detected objects in the locations matching the ground-truth?
  4. How close are the detected bounding boxes to the true boxes (as defined by the ground-truth annotations)?

One more factor is the “confidence” value that we have ignored so far. This value, between 0 (not sure at all) and 1 (pretty sure), reflects how confident the given model is in its prediction. It is important to remember that a confidence value is subjective and cannot be compared directly with values returned by other models. For example, results with confidence 0.9 from one “overly optimistic” model may in fact be worse than results with confidence 0.6 from another, more realistic one.

Thankfully there is no need to reinvent the wheel for object detection purposes, as we can rely on the Precision and Recall metrics well known from classification tasks. In the object detection context they have the following meaning:

  • Precision: what fraction of the detected objects really exist in the analyzed image (i.e. are in the “ground truth”)?
    • 0 means that no “true” object was detected, 1 means that all detected objects are “true” objects.
    • Put more simply, if out of every 100 image sections that the model believes are forks, 89 are actual forks, then Precision equals 89/100.
  • Recall: what fraction of the objects that should be detected (i.e. “true” objects) were detected?
    • 0 means that no “true” object was detected, 1 means that all “true” objects were detected (regardless of whether any “false” objects were detected as well).
    • Continuing with our fork example, if out of every 100 actual forks shown to the model, 97 are detected correctly, then Recall equals 97/100.

Sometimes it may be easier to remember what a low score means:

  • Low Precision means that many detected objects were incorrectly labeled.
  • Low Recall means that many objects were not detected at all.

As you can see, Precision and Recall describe the same result from two different perspectives, and only together do they provide us with a complete picture.

To describe Precision and Recall more formally, we need to introduce 4 additional terms:

  • TP – “true positives” – a number of detected objects that should be detected (i.e. “true” objects that are included in the “ground truth”).
  • FP – “false positives” – a number of detected objects that shouldn’t be detected (i.e. not included in the “ground truth”).
  • FN – “false negatives” – a number of objects that weren’t detected despite being included in the “ground truth”.
  • TN – “true negatives” – added only for completeness, because it doesn’t make much sense in object detection context (and it is not required in following calculations). It is quite handy in classification tasks though.

Having these values, we can construct equations for Precision and Recall:

Precision = TP / (TP + FP) (i.e. TP / all detected objects).

Recall = TP / (TP + FN) (i.e. TP / all “ground truth” objects).
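
The two equations above translate directly into code. The following is a minimal sketch; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration, and the counts in the usage example are invented.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true positive, false positive and false negative counts."""
    detected = tp + fp   # all detected objects
    expected = tp + fn   # all "ground truth" objects
    # Assumption for this sketch: report 0.0 instead of dividing by zero.
    precision = tp / detected if detected else 0.0
    recall = tp / expected if expected else 0.0
    return precision, recall


# Invented counts: 8 correct detections, 2 phantom detections, 4 missed objects.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(precision, recall)
```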

If you ever find this confusing, the following image (from Wikipedia article) always does the trick for me: 

Unfortunately, even having defined Precision and Recall, we still don’t know at least two important things:

  • How to calculate TP, FP and FN for object detection results? As we have determined previously, we cannot expect a model to provide us with results that exactly match the defined “ground truth”.
  • How to compare two results without a single metric value? It is often tricky, especially when we need to deal with a trade-off between Precision and Recall (i.e. increasing Precision we decrease Recall and vice versa).

I will try to answer these questions in the remaining sections.

Calculating precision and recall for object detection results

It is time to calculate TP, FP and FN, and it is usually done using yet another value – IoU (Intersection over Union):

Green rectangle – expected (“truth”) result; Red rectangle – calculated result;   Red rectangle – intersection; Gray area – union;

Knowing both the intersection and union areas of the “ground truth” box and the returned result, we simply divide the former by the latter to obtain IoU. The value lies between 0 and 1 (inclusive): IoU = 0 means completely separate areas, and 1 means a perfect match.
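
For axis-aligned bounding boxes, the division described above can be sketched as follows. The box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for this example, not something prescribed by any particular tool.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 so separate boxes give an empty intersection.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
```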

Now we are finally ready to calculate the true positives and false positives / negatives (TP, FP, FN), using the following steps:

  1. We set an IoU threshold value (i.e. a fixed value that we will compare calculated IoU values against). The higher the value, the more accuracy we expect from our model. In extreme cases we can set it to 0 (virtually “turning it off”) or to 1 (enforcing a “perfect match” for true positives).
  2. We prepare a list of “ground truth” annotations, grouped by associated image.
  3. We sort all obtained results by “confidence” in descending order (where a result here means a single detected object instance together with the coordinates of its “bounding box” and a link to the related image and its “ground truth” annotations).
  4. For each result (starting from the most “confident”)
    1. We calculate IoU between the result and ALL instances of the same class of objects within linked “ground truth” annotations.
    2. For each calculated IoU value we compare it to the fixed IoU threshold. If IoU >= IoU threshold: We mark the result as “true positive” (TP), and then we remove the corresponding “ground truth” annotation from the linked “ground truths” (so it is not used again in IoU calculation for subsequent results).
  5. When all results are processed, we can calculate FP and FN:
    1. FP = all results not marked as “true positives” (TP) in any execution of step 4.2 above.
    2. FN = all items left in our “ground truth” annotations list (i.e. not removed during any execution of step 4.2 above).
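
The steps above can be sketched for a single object class as follows. The data layout (tuples of image id, confidence and box) is an assumption for this example. The sketch also picks the best-overlapping “ground truth” box when several pass the threshold, which is a common convention rather than something the steps spell out.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (image_id, confidence, box); ground_truth: {image_id: [box, ...]}."""
    # Step 2: copy the annotations so matched boxes can be removed safely.
    remaining = {image_id: list(boxes) for image_id, boxes in ground_truth.items()}
    tp = 0
    fp = 0
    # Steps 3 and 4: process results from the most to the least confident.
    for image_id, _confidence, box in sorted(detections, key=lambda d: d[1], reverse=True):
        candidates = remaining.get(image_id, [])
        best_index, best_iou = -1, -1.0
        for index, gt_box in enumerate(candidates):
            value = iou(box, gt_box)
            if value >= iou_threshold and value > best_iou:
                best_index, best_iou = index, value
        if best_index >= 0:
            tp += 1
            del candidates[best_index]  # a "ground truth" box can be matched only once
        else:
            fp += 1
    # Step 5: whatever is left in the annotations was never detected.
    fn = sum(len(boxes) for boxes in remaining.values())
    return tp, fp, fn


ground_truth = {"img1": [(0, 0, 10, 10), (20, 20, 30, 30)]}
detections = [
    ("img1", 0.9, (0, 0, 10, 9)),      # good match for the first box
    ("img1", 0.8, (0, 0, 10, 10)),     # duplicate of an already matched box
    ("img1", 0.3, (50, 50, 60, 60)),   # phantom detection
]
print(count_tp_fp_fn(detections, ground_truth))
```

Note that the second detection overlaps the first “ground truth” box perfectly but still counts as a false positive, because that annotation was already consumed by a more confident result.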

This way we have all values we need to calculate Precision and Recall, but we still lack a simple way to compare results provided by different models.

Average Precision (AP and mAP)

The most common approach to end with a single value allowing for model comparison is calculating Average Precision (AP) – calculated for a single object class across all test images, and finally mean Average Precision (mAP) – a single value that can be used to compare models handling detection of any number of object classes.

While I wasn’t able to determine when exactly this metric was used for the first time to compare different object detection models, its mainstream use started together with the popularisation of large datasets and challenges such as Pascal VOC (Visual Object Classes) or COCO (Common Objects in Context).

Overall, the mAP calculation is divided into 2 main steps:

  1. Calculate AP (average precision) per each class of detected objects and configured IoU threshold.
  2. Calculate mAP as a mean from previously calculated AP values.

The main differences between the various approaches to mAP calculation are related to the IoU threshold value. In some cases, a fixed value is used (e.g. 0.5 in Pascal VOC), while in others an array of different values is used to calculate average precision (AP) for each class and threshold combination, and the final mAP value is the mean of all these results.

Apart from that – to my knowledge – the AP calculation for a given class and IoU threshold is mostly the same, and – to some surprise – doesn’t start with calculating Precision and Recall using the steps described previously ;-). A slightly changed process is used to calculate the AP instead (changes start from step 4.3 below):

  1. We set an IoU threshold value (i.e. a fixed value that we will compare calculated IoU values against). The higher the value, the more accuracy we expect from our model. In extreme cases we can set it to 0 (virtually “turning it off”) or to 1 (enforcing a “perfect match” for true positives).
  2. We prepare a list of “ground truth” annotations, grouped by associated image.
  3. We sort all obtained results by “confidence” in descending order (where a result here means a single detected object instance together with the coordinates of its “bounding box” and a link to the related image and its “ground truth” annotations).
  4. For each result (starting from the most “confident”)
    1. We calculate IoU between the result and ALL instances of the same class of objects within linked “ground truth” annotations.
    2. For each calculated IoU value we compare it to the fixed IoU threshold. If IoU >= IoU threshold: We mark the result as “true positive” (TP), and then we remove the corresponding “ground truth” annotation from the linked “ground truths” (so it is not used again in IoU calculation for subsequent results)
    3. We calculate cumulative:
      • TP count (i.e. count of all TP values encountered so far).
      • Precision (as TP count / total number of so far processed results).
      • Recall (as TP count / total number of object instances in the “ground truth” set).
  5. Having all results processed, we end up with calculations like the table below (the first 2 columns contain input data, Is TP? was calculated using the 4.2 formula, the remaining 3 columns were calculated using the 4.3 formula):
    #    Confidence  Is TP?  TP Count  Precision  Recall
    1    0.90        1       1         1.00       0.17
    2    0.90        1       2         1.00       0.33
    3    0.80        0       2         0.67       0.33
    4    0.80        0       2         0.50       0.33
    5    0.70        1       3         0.60       0.50
    6    0.70        0       3         0.50       0.50
    7    0.70        0       3         0.43       0.50
    8    0.60        1       4         0.50       0.67
    9    0.40        1       5         0.56       0.83
    10   0.20        0       5         0.50       0.83
    11   0.20        0       5         0.45       0.83
    12   0.10        1       6         0.50       1.00
    13   0.05        0       6         0.46       1.00
    14   0.05        0       6         0.43       1.00
    As you can see, for a typical model Precision tends to go down (because results with higher “confidence” should be more likely to be “true positives” (TP) than low-confidence ones), and Recall can only go up (because its value increases together with the number of TP encountered when processing sorted results, regardless of the number of False Positives along the way). Please note that Precision may not follow this pattern if your model returns relatively more “true positives” with low confidence than with high confidence, but I would say that this indicates a more general problem with the measured model anyway.

    In the case above all “ground truth” positives were found and returned by the model (6 of them), though with various confidence, therefore, the Recall gets up to the maximum 1 value. In theory, we could calculate some sort of average precision weighted by recall directly from the table above, but this is not how the Average Precision (AP) is obtained (to make it more stable and “comparable” between various models). Thus we need one last step.
  6. We decide which values of Recall should be considered when calculating Precision in order to obtain its mean. In many cases the following 11 Recall values are used: Rt = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0].
    For each of these values (Rti), the Precision is calculated as the MAXIMUM precision from table in step 5, selected from all rows with Recall equal or higher than Rti.
    In our case it leads to the following results:
    Rti Prec.
    0,00 1,00
    0,10 1,00
    0,20 1,00
    0,30 1,00
    0,40 0,60
    0,50 0,60
    0,60 0,56
    0,70 0,56
    0,80 0,56
    0,90 0,50
    1,00 0,50
    To compare the “raw” and “smoothed” precision values (from the tables in points 5 and 6 respectively), they may be visualized as follows:
    As you can see, the last calculation ensures that Precision only goes “flat” or down as Recall increases.
    The last thing left to do is to calculate the Average Precision (AP) from the table above, by adding all Prec. values and dividing the sum by 11 (the number of pre-selected Rt values), which in this case equals 0.72.
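
Steps 4.3 to 6 can be sketched in a few lines. This is a minimal illustration of the 11-point calculation described above, assuming the “Is TP?” flags are already sorted by descending confidence; the function name is invented for this example. The input reproduces the “Is TP?” column from the table in step 5, with 6 “ground truth” objects.

```python
def average_precision_11pt(is_tp, num_ground_truth):
    """11-point interpolated AP from TP flags (1 or 0) sorted by descending confidence."""
    precisions, recalls = [], []
    tp_count = 0
    # Step 4.3: cumulative TP count, Precision and Recall after each processed result.
    for processed, flag in enumerate(is_tp, start=1):
        tp_count += flag
        precisions.append(tp_count / processed)
        recalls.append(tp_count / num_ground_truth)

    # Step 6: for each Recall level take the maximum Precision among rows
    # whose Recall is equal to or higher than that level.
    recall_levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in recall_levels:
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)

    return sum(interpolated) / len(recall_levels)


flags = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
print(round(average_precision_11pt(flags, num_ground_truth=6), 2))  # 0.72
```

Recall levels that the model never reaches contribute a Precision of 0 in this sketch, so a model that misses objects is penalised at the high-Recall end.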

As long as we are dealing with a model with a single class of objects, that is all. If – which is a much more likely scenario – there are more classes (e.g. 20 in Pascal VOC or 80 in COCO datasets), you need to calculate AP for each class, and then calculate mAP as the mean of the obtained AP values (the sum of AP values divided by their count).
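
The final averaging step is simple enough to show in full. The class names and AP values below are invented for illustration only.

```python
def mean_average_precision(ap_per_class):
    """mAP as the plain mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values.
ap_per_class = {"fork": 0.72, "knife": 0.77, "spoon": 0.61}
print(round(mean_average_precision(ap_per_class), 2))  # 0.7
```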

As I’ve mentioned before, in some cases you may want to calculate not a single AP per class, but several AP values, one for each IoU threshold. For example, in the case of the COCO challenge, the main metric is mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
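
A sketch of this multi-threshold averaging is shown below. It assumes the per-class AP values for every IoU threshold have already been calculated; the nested dictionary layout and the AP numbers are invented for illustration and are not real benchmark results.

```python
# IoU thresholds from 0.5 to 0.95 in steps of 0.05 (10 values).
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_over_thresholds(ap_table):
    """Mean of all AP values in a {iou_threshold: {class_name: ap}} table."""
    values = [ap for per_class in ap_table.values() for ap in per_class.values()]
    return sum(values) / len(values)


# Hypothetical AP values that drop as the IoU threshold gets stricter.
ap_table = {
    t: {"fork": 0.9 - 0.5 * (t - 0.5), "knife": 0.8 - 0.5 * (t - 0.5)}
    for t in iou_thresholds
}
print(iou_thresholds)
print(map_over_thresholds(ap_table))
```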

Regardless of which approach is taken, if it is used consistently, the obtained mAP value allows us to directly compare results of different models (or variants of models) on the same test dataset.

Is mAP enough?

Having mAP calculated, it is tempting to blindly trust and use it to choose production models. I would strongly discourage it though, as unfortunately, it is not that simple.

As with every “average”, mAP may be misleading, especially if we don’t consider all detected objects equally important. Depending on your needs, you may expect better detection of large objects than of smaller ones, or you may consider correctly detecting people much more important than detecting trees. In such cases you may still use mAP as a “rough” estimate of the object detection model quality, but you need to use some more specialized techniques and metrics as well.

For example, in case of object counting, the AP/mAP value is immune to false positives with low confidence, as long as you have already covered “ground truth” objects with higher-confidence results.

Please consider following table:

#    Confidence  Is TP?  TP Count  Precision  Recall  |  Rti   Prec.
1    0.90        1       1         1.00       0.20    |  0.00  1.00
2    0.90        1       2         1.00       0.40    |  0.10  1.00
3    0.80        0       2         0.67       0.40    |  0.20  1.00
4    0.80        0       2         0.50       0.40    |  0.30  1.00
5    0.70        1       3         0.60       0.60    |  0.40  1.00
6    0.70        0       3         0.50       0.60    |  0.50  0.60
7    0.70        0       3         0.43       0.60    |  0.60  0.60
8    0.60        1       4         0.50       0.80    |  0.70  0.56
9    0.40        1       5         0.56       1.00    |  0.80  0.56
10   0.20        0       5         0.50       1.00    |  0.90  0.56
11   0.20        0       5         0.45       1.00    |  1.00  0.56
                                                      |  AP    0.77

To calculate Average Precision (AP), we take the maximum Precision (the share of correctly classified objects among all detected objects) at or above a given Recall (the share of correctly detected and classified objects among all objects that should be detected). As soon as Recall reaches 1.0 in row 9 above (because rows 1 - 9 already contain all True Positives, i.e. all objects that should be detected), no number of additional invalid detections (False Positives, FP) is going to change the result. In the example above, the detections from rows 10 and 11 have no impact on the AP (and therefore mAP) result, and even thousands of further “false” results would not change it.
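
This behaviour is easy to check with a small experiment. The sketch below assumes the 11-point AP calculation described earlier (the function name is invented for this example), takes the “Is TP?” column from the table above with 5 “ground truth” objects, and then appends a thousand extra false positives at the end of the list.

```python
def average_precision_11pt(is_tp, num_ground_truth):
    """11-point interpolated AP from TP flags (1 or 0) sorted by descending confidence."""
    precisions, recalls = [], []
    tp_count = 0
    for processed, flag in enumerate(is_tp, start=1):
        tp_count += flag
        precisions.append(tp_count / processed)
        recalls.append(tp_count / num_ground_truth)
    total = 0.0
    for level in [i / 10 for i in range(11)]:
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


flags = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0]   # rows 1 - 11 from the table
ap_original = average_precision_11pt(flags, num_ground_truth=5)

# Append 1000 low-confidence false positives after the last row.
ap_with_noise = average_precision_11pt(flags + [0] * 1000, num_ground_truth=5)

print(round(ap_original, 2))          # 0.77
print(ap_original == ap_with_noise)   # True
```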

You may say that you shouldn’t consider results with low confidence anyway, and you would be right in most cases of course, but this is something that you need to remember, especially if for whatever reason avoiding incorrect detections (False Positives) is more important than correctly detecting and classifying all of the existing objects (True Positives).

With proper tools in hand, it is always worth getting your hands dirty and using them in practice to better understand them and their limitations. In the case of the object detection task, that would mean comparing different models working on the same dataset. As this article – as usual in my case – has already got too long, this is a topic for another one though 😉.

So finally...

If you got all the way to here, thanks very much for taking the time. I hope it helped to deepen your understanding of object detection and the strategies we can devise to help us pick the best models and techniques for a particular problem. If detecting objects within images is the key to unlocking value, then we need to invest time and resources to make sure we’re doing the best job that we can. If you’d like to understand in more detail how we use these techniques (and others) to help our clients create value from data, please drop me a line.
