Facebook AI Open-Sources the DEtection TRansformer (DETR)

Contents

Introduction

Occasionally, a library or machine learning framework comes along that changes the landscape of the field. Recently, Facebook open-sourced such a framework: DETR, or DEtection TRansformer.


In this post, we will quickly understand the concept of object detection and then we will dive directly into DETR and what it brings.

Object detection at a glance

In Computer Vision, object detection is a task where we want our model to distinguish foreground objects from the background and predict the locations and categories of the objects present in the image. Current deep learning approaches treat object detection as a classification problem, a regression problem, or both.

As an example, in the R-CNN algorithm, several regions of interest are first identified in the input image. These regions are then classified as objects or background and, finally, a regression model generates the bounding boxes for the identified objects.

The YOLO (You Only Look Once) framework, on the other hand, handles object detection differently. It takes in the entire image in a single pass and predicts the bounding box coordinates and class probabilities for those boxes.

For more information on object detection, see these posts:

Introducing Facebook AI's DEtection TRansformer (DETR)

As you saw in the previous section, current deep learning algorithms perform object detection in multiple steps. They also suffer from the problem of near-duplicate predictions, in other words, false positives. To simplify this pipeline, Facebook AI researchers have devised DETR, an innovative and efficient approach to the object detection problem.

The original paper is here, the open-source code is here, and you can consult the Colab notebook here.

detection transformer

Source: https://arxiv.org/pdf/2005.12872.pdf

This new model is quite simple, and you don't need to install any extra library to use it. DETR treats object detection as a direct set prediction problem, with the help of a transformer-based encoder-decoder architecture. By set, I mean the set of bounding boxes. Transformers are the new generation of deep learning models that have performed outstandingly in the domain of NLP.
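The "direct set prediction" idea relies on matching the model's fixed set of predictions one-to-one against the ground-truth objects with a bipartite (Hungarian) matching. The sketch below is only a rough illustration of that matching step, not DETR's actual loss (which combines class-probability and box terms): it uses a simple L1 cost between made-up predicted and ground-truth boxes, and SciPy's Hungarian-algorithm implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical predicted boxes (normalized cx, cy, w, h) and ground truths.
pred_boxes = np.array([[0.5, 0.5, 0.2, 0.2],
                       [0.1, 0.1, 0.1, 0.1],
                       [0.8, 0.8, 0.3, 0.3]])
gt_boxes = np.array([[0.82, 0.78, 0.3, 0.3],
                     [0.5, 0.5, 0.25, 0.2]])

# Pairwise L1 cost between every prediction and every ground truth: shape (3, 2).
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Hungarian matching: each ground truth gets exactly one prediction;
# unmatched predictions would be trained toward a "no object" class.
pred_idx, gt_idx = linear_sum_assignment(cost)
matches = [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]
print(matches)  # prediction 0 matches ground truth 1, prediction 2 matches ground truth 0
```

Because the matching is one-to-one, two predictions can never be "paid" for the same object, which is how DETR avoids the duplicate detections that other pipelines suppress with post-processing.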

The authors of the paper evaluated DETR on one of the most popular object detection datasets, COCO, against a very competitive Faster R-CNN baseline.

In the results, DETR achieved comparable performance. More precisely, DETR demonstrates significantly better performance on large objects. However, it didn't work as well on small objects. I'm sure the researchers will figure that out very soon.

DETR architecture

The general architecture of DETR is quite simple to understand. It contains three main components:

  • a CNN backbone
  • an encoder-decoder transformer
  • a simple feedforward network

object detection transformer

Source: https://arxiv.org/pdf/2005.12872.pdf

Here, the CNN backbone generates a feature map from the input image. The output of the CNN backbone is then flattened into a one-dimensional feature sequence and passed to the transformer encoder as input. The output of this encoder is N fixed-length embeddings (vectors), where N is the number of objects the model assumes are in the image.
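The flattening step above can be pictured with plain array operations. This sketch (the shapes are illustrative, not DETR's real dimensions) turns a CNN feature map of shape (channels, height, width) into the (sequence_length, channels) token sequence a transformer encoder expects:

```python
import numpy as np

# Hypothetical backbone output: 256 channels on a 7x7 spatial grid.
feature_map = np.random.rand(256, 7, 7)

# Collapse the spatial grid into a sequence: (C, H, W) -> (H*W, C).
c, h, w = feature_map.shape
tokens = feature_map.reshape(c, h * w).T

print(tokens.shape)  # (49, 256): 49 spatial positions, each a 256-dim token
```

Each row is one spatial position of the feature map, so the encoder's attention can relate any two locations in the image directly, regardless of distance.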

The transformer decoder then decodes these embeddings into bounding box coordinates with the help of self-attention and encoder-decoder attention mechanisms.

Finally, feedforward neural networks predict the normalized center coordinates, height, and width of the bounding boxes, and a linear layer predicts the class label using a softmax function.
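A toy version of those two heads might look like the following. The weights and sizes here are made up for illustration (DETR's real box head is a small multi-layer network): a sigmoid keeps the box outputs in [0, 1] as normalized coordinates, and a softmax turns the class scores into probabilities, with one extra "no object" class for unmatched predictions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
embedding = rng.normal(size=256)           # one decoder output embedding
w_box = rng.normal(size=(4, 256)) * 0.1    # box head: 4 outputs (cx, cy, w, h)
w_cls = rng.normal(size=(92, 256)) * 0.1   # class head: 91 COCO classes + "no object"

box = sigmoid(w_box @ embedding)    # normalized center, width, height in [0, 1]
probs = softmax(w_cls @ embedding)  # class probabilities summing to 1

print(box.shape, probs.shape)
```

Each of the N decoder embeddings goes through the same heads independently, producing the final set of (class, box) predictions in one shot.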

Final thoughts

This is a truly exciting framework for all deep learning and computer vision enthusiasts. A big thank you to Facebook for sharing its approach with the community.

Time to buckle up and use this for our next deep learning project!!
