This article was automatically translated from the original Turkish version.
The Detection Transformer (DETR) is an object detection model introduced by Facebook AI in 2020 that adopts an end-to-end learning approach. Unlike traditional object detection methods, DETR was the first model to build its architecture around the Transformer, predicting the locations and classes of all objects in an image directly, as a set.

DETR sample schematic (Medium)
Traditional object detection systems typically involve a multi-stage processing pipeline: feature extraction, region proposal, classification, and bounding box refinement. Such systems commonly rely on CNN (Convolutional Neural Network) architectures and require hand-designed post-processing steps such as Non-Maximum Suppression (NMS).
DETR simplifies this classical workflow by providing an end-to-end solution using only a CNN and a Transformer architecture. This eliminates the need for independent stages and handcrafted rules.
The DETR architecture consists of three key components:
CNN-based feature extraction: A convolutional backbone such as ResNet turns the input image into a low-resolution but semantically rich feature map.
Transformer encoder-decoder structure: The feature map is flattened into a sequence and processed by the Transformer. The encoder enriches this sequence with global context via self-attention, while the decoder consumes it through learned "object queries" to produce one output vector per potential object.
FFN (Feed-Forward Network): Small feed-forward heads convert each decoder output into a class label and a bounding box prediction.
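The three components above can be sketched as a shape-level walkthrough. This is not DETR's actual implementation: the backbone, projection, and decoder are replaced by random matrices and a stub, and the dimensions are illustrative (only `num_queries=100` and `d_model=256` match the paper's defaults).

```python
import numpy as np

# Illustrative dimensions; num_queries=100 and d_model=256 match the paper.
d_model, num_queries, num_classes = 256, 100, 91

# 1) CNN backbone output: a low-resolution feature map (C, H, W).
feat = np.random.rand(2048, 25, 34)

# Project channels down to d_model (in DETR, a 1x1 conv; here a matmul),
# then flatten the spatial grid into a sequence of H*W tokens.
proj = np.random.rand(d_model, 2048)
tokens = (proj @ feat.reshape(2048, -1)).T         # (H*W, d_model)

# 2) Transformer encoder-decoder (stubbed): the decoder turns num_queries
# learned embeddings into num_queries output vectors of size d_model.
queries = np.random.rand(num_queries, d_model)
decoder_out = queries                              # stand-in for attention layers

# 3) FFN heads: a linear map for class logits (+1 for "no object") and a
# head for normalized box coordinates (cx, cy, w, h), squashed by sigmoid.
W_cls = np.random.rand(d_model, num_classes + 1)
W_box = np.random.rand(d_model, 4)
logits = decoder_out @ W_cls                       # (100, 92)
boxes = 1 / (1 + np.exp(-(decoder_out @ W_box)))   # values in (0, 1)

print(logits.shape, boxes.shape)  # (100, 92) (100, 4)
```

The key structural point is that the output is always a fixed-size set of 100 (class, box) pairs, regardless of how many objects the image contains.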

Sample schematic (Medium)
The Transformer architecture at the core of DETR is inspired by the 2017 paper “Attention is All You Need” by Vaswani et al. Transformers operate using self-attention, multi-head attention, and feed-forward network layers. DETR is among the first successful applications of this structure to image-based tasks.
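The self-attention operation at the heart of the Transformer can be shown in a few lines of numpy. This is a minimal single-head sketch of scaled dot-product attention as defined by Vaswani et al.; the token count and dimensions are arbitrary.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # context-mixed tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))          # 6 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8)
```

Because every token attends to every other token, each output row mixes information from the whole sequence; in DETR this is what lets each position in the flattened feature map see the entire image.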
DETR generates a fixed number of learned "object queries" (100 in the original paper), each representing a potential object in the image. The model performs all object predictions in parallel over these queries. During training, the Hungarian algorithm computes a one-to-one bipartite matching between predictions and ground-truth objects, a strategy that discourages redundant or overlapping predictions.

Hungarian algorithm (Medium)
As a result, classical filtering operations such as Non-Maximum Suppression (NMS) are no longer required. Each prediction is directly assigned to an object or to the “no object” class.
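The bipartite matching can be demonstrated with `scipy.optimize.linear_sum_assignment`, which implements Hungarian-style assignment (scipy is an assumption here; DETR's own code builds a cost matrix from class probabilities and box distances, whereas the numbers below are made up for illustration).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = 4 predictions, columns = 2 ground-truth objects.
cost = np.array([
    [0.9, 0.2],   # prediction 0: cheap match for object 1
    [0.1, 0.8],   # prediction 1: cheap match for object 0
    [0.7, 0.6],
    [0.5, 0.9],
])

pred_idx, gt_idx = linear_sum_assignment(cost)     # minimizes total cost
pairs = [(int(p), int(g)) for p, g in zip(pred_idx, gt_idx)]
print(pairs)  # [(0, 1), (1, 0)]
```

Each ground-truth object is matched to exactly one prediction; the remaining predictions (here, 2 and 3) are assigned the "no object" class, which is why no NMS-style deduplication is needed afterwards.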
DETR employs a loss function composed of two main components: a classification term and a bounding box localization term (in the original paper, an L1 distance combined with a generalized IoU loss). The total loss is computed over the Hungarian matching between the model's predictions and the ground truth objects. The "no object" class is down-weighted during loss computation to mitigate class imbalance, since most of the queries match nothing.
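The two loss components can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `matched_loss`, the toy weights, and the inputs are all hypothetical, and the generalized IoU term from the paper is omitted for brevity.

```python
import numpy as np

def matched_loss(logits, boxes, gt_labels, gt_boxes, match, no_obj_weight=0.1):
    """Sketch of DETR's loss: cross-entropy on classes plus L1 on boxes.
    `match` is a list of (prediction_index, ground_truth_index) pairs,
    e.g. from a Hungarian matching. The GIoU box term is omitted."""
    num_queries, num_cls = logits.shape
    no_obj = num_cls - 1                       # last index = "no object"
    targets = np.full(num_queries, no_obj)     # default every query to "no object"
    for pred_i, gt_i in match:
        targets[pred_i] = gt_labels[gt_i]

    # Softmax cross-entropy, with the "no object" class down-weighted
    # to counter class imbalance (most queries match nothing).
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    weights = np.where(targets == no_obj, no_obj_weight, 1.0)
    cls_loss = -(weights * np.log(probs[np.arange(num_queries), targets])).mean()

    # L1 box loss, computed only on matched queries.
    box_loss = sum(np.abs(boxes[p] - gt_boxes[g]).sum() for p, g in match)
    return cls_loss + box_loss

rng = np.random.default_rng(1)
logits = rng.normal(size=(5, 4))      # 5 queries, 3 classes + "no object"
boxes = rng.random((5, 4))
gt_labels = np.array([0, 2])
gt_boxes = rng.random((2, 4))
loss = matched_loss(logits, boxes, gt_labels, gt_boxes, match=[(1, 0), (3, 1)])
print(float(loss))
```

Note how the matching fully determines the classification targets: matched queries inherit a ground-truth label, and everything else is trained toward "no object".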
End-to-end learning: DETR simplifies the entire object detection pipeline by performing all steps within a single model.
Generalization capability: It can be easily adapted to different datasets without requiring complex hand-tuned operations.
Transformer advantages: It enables learning long-range dependencies and supports parallel processing.
Eliminates NMS requirement: Direct matching of predictions to objects removes the need for post-processing filtering.
DETR converges more slowly than classical approaches and performs comparatively poorly on small objects. It requires long training schedules and large datasets. Additionally, its inference latency may limit its use in real-time applications.
DETR is a pioneering work that transformed the paradigm of object detection by successfully applying the Transformer architecture. By overcoming the limitations of traditional methods, it offers a simpler and more general solution. It is widely regarded as a foundational milestone that paved the way for the next generation of detectors in computer vision and artificial intelligence.