Multiformer

More heads are better than one.

Nate Cibik
19 min read · Dec 13, 2023

Among the most safety-critical deployments of artificial intelligence (AI) technology in the world today, autonomous driving perception stacks simultaneously perform numerous computer vision tasks to understand the scenes they encounter, typically including semantic segmentation, object detection, and depth estimation as fundamental components. With multiple task modules at play, feeding them with a common set of features from a single backbone is not only desirable from a parameter efficiency and inference speed standpoint, but is actually known to be beneficial to the learning of each individual task. Developing models with efficient multi-task performance allows for training and deployment of autonomous systems on smaller hardware, and democratizes research in this field by lowering the barrier to entry. Multiformer is a vision architecture designed to perform the essential tasks for autonomous perception using a small memory footprint by exploiting the benefits in parameter efficiency offered by both transformers and multi-task learning. Training experiments demonstrate the effective simultaneous learning of monocular depth estimation, semantic segmentation, and 2D object detection using the combination of a shared backbone and task-specific modules which altogether amount to just over 8 million parameters. Isolating object detection as the target task, training with simultaneous supervision from all three tasks shows a clear advantage over transfer learning alone.

Multiformer inference on an urban scene from the SHIFT dataset.

Multiformer was inspired by the desire to experiment with full autonomy stacks at manageable scales. The Hugging Face implementation of the model and pretrained weights are publicly available. Training and inference scripts as well as setup instructions can be found in the project GitHub. For a detailed breakdown of training methodology and results, see the Weights & Biases report.

The contributions of Multiformer can be summarized as follows:

  1. A unified transformer-based vision architecture for complementary perception tasks: object detection, semantic segmentation, and depth estimation, with extensibility to panoptic segmentation and future work in monocular 3D object detection.
  2. Multi-task learning framework with a shared backbone allows for the learning of more robust and generalizable features with high parameter efficiency, resulting in a lightweight model that is performant on multiple tasks and which can be run on smaller hardware with higher framerates. Experiments show multi-task training offers a clear advantage in the learning of an individual task compared with transfer learning only.
  3. Model code and pretrained weights are made available for use and future development by the community, offering the opportunity to explore interesting research questions such as the effects of multi-task learning on synthetic-to-real transfer learning.
  4. Modular design allows for the flexible scaling of components to larger (or smaller) sizes depending on memory budget, the future replacement of modules with more recent or preferred designs, and the frictionless addition of new task modules.

To achieve the aspirations of a powerful-yet-lightweight perception module, Multiformer exploits the benefits of multi-task learning, synthetic data, and hierarchical transformers to generate the most descriptive features possible from its ~8M (million) parameters. Let’s explore how these components come together and allow Multiformer to produce impressive multi-task performance with a small memory footprint.

Related Work

Multi-Task Learning

Multi-Task Learning (MTL) is a field of study that explores the advantages of learning multiple related tasks simultaneously with shared parameters, and is particularly relevant to autonomous vehicle (AV) perception, where visual input data is used to infer multiple complementary computer vision tasks needed to perform safe navigation. MTL is intuitively more similar to human learning, where fundamental tasks like walking, grasping, and object manipulation can be learned, then applied to more complex tasks like pitching a baseball. An analogy in the context of computer vision is using knowledge of a scene’s geometry and contents to inform 3D object detection and tracking. Such knowledge priors allow for much faster learning using fewer examples, since the network can leverage knowledge learned from related tasks.

“The human ability to rapidly learn with few examples is dependent on this process of learning concepts which are generalizable across multiple settings and leveraging these concepts for fast learning.”

— From Crawshaw’s survey of MTL in Deep Neural Networks.

Research in MTL has shown that training networks on multiple related tasks enables them to learn representations that are common across all tasks, producing more robust and generalizable features. It induces a bias that the model must perform well at multiple tasks, and therefore the features learned must be generally useful to all of them, preventing overfitting on noise and task idiosyncrasies. Training on multiple tasks simultaneously allows the model to learn features for one task which may be complementary or useful to another, but which might not have been learned if the tasks were trained separately. Further, the presence of multiple supervision signals acts as a form of regularization on the gradients at each training step, helpful particularly when training volatile tasks like object detection, where sharp loss gradients can toss around weights and slow convergence. Moreover, the weight sharing between tasks in MTL comes with a much smaller memory footprint, meaning higher training and inference framerates on smaller hardware.

Predicting simultaneous tasks with a single Convolutional Neural Network (CNN) is an idea that dates nearly back to the “Big Bang” of CNNs in 2012, when GPU acceleration enabled AlexNet to surpass the previous state-of-the-art methods in image classification by 10.8 percentage points. OverFeat used a single network to simultaneously perform image classification while achieving state-of-the-art scores in object localization and detection tasks in the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013). In 2014, Eigen & Fergus used a common backbone to perform joint semantic segmentation, surface normal estimation, and dense depth prediction using the NYUDepth datasets. Multi-task Network Cascades (MNCs) predict each task in a user-defined order, appending the predictions of the previous task to the inputs of the next task to create a cascading flow of information. Other approaches investigated the sharing of information between multiple shared backbones, as in Multi-Gate Mixture-of-Experts, or separate backbones, such as in Cross-Stitch Networks.

UberNet was another trailblazer in the design philosophy of attaching multiple task decoders to a shared backbone (the “shared trunk” template in Crawshaw’s MTL survey). It addressed the challenge of training on multiple datasets to cover the disjoint annotation spaces of seven total training tasks: boundary detection, normal estimation, saliency estimation, semantic segmentation, human part segmentation, semantic boundary detection, and object detection. This was done by only calculating loss when labels for a task are available, and by asynchronously updating weights only after observing a sufficient number of training samples that relate to them.

“One can expect that such all-in-one, ‘swiss knife’ architectures will become indispensable for general AI, involving for instance robots that will be able to recognize the scene they are in, recognize objects, navigate towards them, and manipulate them. Furthermore, having a single visual module to address a multitude of tasks will make it possible to explore methods that improve performance on all of them, rather than developping [sic] narrow, problem-specific techniques.”

— A keen foresight in 2016 from UberNet.

As an early example of the “shared trunk” design being applied in autonomous navigation, MultiNet performed simultaneous road segmentation, vehicle detection, and street classification in 2016 using the popular VGG and ResNet backbones of the time, remarking on the importance of using shared weights to increase inference speed for AV perception, and beating the previous state of the art on the KITTI road segmentation task with real-time inference framerates.

Although it has proven time and again to be effective at simultaneously improving parameter efficiency and individual task performances, multi-task learning must be executed carefully in order to work. If multiple supervision signals are applied incautiously, negative transfer or destructive interference can occur, where competing and non-aligned needs of the tasks hurt their individual or collective performance. That is why it is very important when designing multi-task architectures to share just the right amount of the feature space between tasks, balancing the size of the backbone with the task-specific modules, as well as the magnitude of their loss signals so that the gradient landscape is not dominated by any single task. This balancing of losses is generally done using empirically determined, user-defined constants, but GradNorm proposes an approach to normalize the training gradients across tasks in a learned way, so that each task contributes equally at each training step, avoiding overfitting on individual tasks, and reducing the requirement for expensive and time-consuming hyperparameter grid searching to find ideal loss coefficients.
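For illustration, here is a minimal PyTorch sketch of the GradNorm-style balancing loss described above. It is not part of the Multiformer training code (which uses fixed loss coefficients), and the function and argument names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Minimal GradNorm-style balancing sketch (illustrative, not from the Multiformer repo).
# task_losses:    per-task scalar losses for the current step (with graphs attached)
# initial_losses: the losses recorded at step 0 (floats)
# task_weights:   learnable positive weights, one per task
# shared_weight:  the weight tensor of the last shared layer
def gradnorm_balancing_loss(task_losses, initial_losses, task_weights, shared_weight, alpha=1.5):
    # Gradient norm of each *weighted* task loss w.r.t. the shared layer.
    norms = []
    for w_i, loss_i in zip(task_weights, task_losses):
        grad = torch.autograd.grad(w_i * loss_i, shared_weight,
                                   retain_graph=True, create_graph=True)[0]
        norms.append(grad.norm())
    norms = torch.stack(norms)

    # Targets: the mean gradient norm scaled by each task's relative inverse training rate,
    # so tasks that are learning more slowly are pushed to contribute larger gradients.
    with torch.no_grad():
        loss_ratios = torch.stack([l.detach() / l0 for l, l0 in zip(task_losses, initial_losses)])
        inverse_rates = loss_ratios / loss_ratios.mean()
        targets = norms.mean() * inverse_rates ** alpha

    # Backpropagate this only into task_weights (e.g., via a separate optimizer),
    # then renormalize the weights so they sum to the number of tasks.
    return F.l1_loss(norms, targets, reduction="sum")
```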

Synthetic Data

A major hindrance in MTL research is that there are unfortunately few open-source real datasets which provide annotations for more than a few tasks at most. Synthetic data has become a popular field of research (especially in the automotive industry, where data collection and labeling is particularly expensive), not only because it allows for sandboxing concepts in a safe and controlled environment, but also because it has been shown to pretrain models with knowledge that is transferable to the real domain. Since synthetic datasets are created programmatically, annotations can be derived directly from the simulation without the need for human labeling. This allows synthetic datasets like SHIFT to provide a full annotation suite, and creates the opportunity to explore the effects of multi-task learning and prediction distillation without complicated data loading setups that combine multiple datasets.

This chart from the 2022 SHIFT paper gives us a recent view of how rare a full annotation suite really is, in both the real and synthetic domains.

The transfer of knowledge from the synthetic to real domain is not perfect, however, because of the so-called “domain gap,” which describes any difference in the distributions between domains, for example, those found in the color/noise/blur of images, the agent types and behaviors seen, or the sensor responses to weather and lighting. Fortunately, because of the promising rewards that synthetic data has to offer, there has been much investigation in overcoming this obstacle, from modifying network architectures to lessen its impact by learning domain-invariant features, to increasing the realism of the data being rendered to reduce the domain gap preemptively.

Synthetic-to-real domain adaptation (DA) is a field of research which aims to unlock the full potential of synthetic-to-real transfer learning by reducing the impact of the domain gap. Color-space adaptations like the Reinhard method can reduce the domain gap without affecting the image geometry, preserving annotation alignment. Generative Adversarial Networks (GANs) have been applied to reduce domain gap at the pixel level, the feature level, or both by making training examples indistinguishable between domains. Ganin & Lempitsky (2014) demonstrated a simple unsupervised approach to aligning the learned feature space by backpropagating the reversed loss gradient from an auxiliary domain classification head. ADVENT achieved unsupervised DA (UDA) in semantic segmentation via entropy minimization on the target domain predictions, and FADA demonstrated the importance of fine-grained feature alignment at the class level.
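As a concrete illustration of the gradient reversal idea from Ganin & Lempitsky, here is a minimal PyTorch sketch (illustrative only, not code from any of the cited works; the backbone and domain head names are placeholders):

```python
import torch

# Gradient reversal layer: identity on the forward pass, flipped (and scaled)
# gradient on the backward pass, so the backbone is pushed toward features
# that an auxiliary domain classifier cannot separate.
class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

# Usage sketch (backbone and domain_head are hypothetical modules):
#   features = backbone(images)
#   domain_logits = domain_head(GradientReversal.apply(features, 1.0))
#   domain_loss = F.binary_cross_entropy_with_logits(domain_logits, domain_labels)
```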

Synscapes demonstrated the power of addressing the synth-to-real domain gap at the data generation level, using photorealistic rendering techniques carefully tuned to match the domain of the real Cityscapes dataset in the sensor, agent, and annotation spaces, producing superior transfer learning results in semantic segmentation and object detection compared to previous synthetic driving datasets. While Synscapes demonstrates the benefit of photorealism in improving synth-to-real transfer learning, it also reveals that synthetic datasets need not be photorealistic in order to be useful: the two less-photorealistic synthetic datasets benchmarked for comparison still show significant improvements across the board when compared with target-only training, as seen in the chart below:

Comparison of transfer learning results from multiple synthetic sources to the target Cityscapes domain on the semantic segmentation task. Note that while the curated photorealism of Synscapes provides superior results, all examples of fine-tuned synth-to-real transfer learning beat the Cityscapes-only benchmarks.

In low-level computer vision tasks like optical flow, which depend on understanding scene geometry more than semantic detail, photorealism has proved far less important. For example, the delightfully absurd Flying Chairs dataset gained widespread adoption for pretraining optical flow networks despite a complete lack of realism, because it provides a large variety of motion patterns and scenes from which models can learn fundamental concepts of motion.

Hierarchical Transformers

Vision Transformers (ViT) have shown an exceptional descriptive capacity by using pure transformer architectures, but are generally large and difficult to train. For real-time robotics applications like autonomous driving, more nimble architectures are appropriate. In the interest of parameter efficiency and inference speed, Multiformer utilizes a recent family of lightweight yet powerful image encoders called Hierarchical Transformers. These architectures benefit from the self-attention mechanism of transformers, which enables them to learn dynamic relationships between feature pixels with global context in ways previously impossible for Convolutional Neural Network (CNN) backbones. Meanwhile, the strategic incorporation of convolution layers infuses them with the useful inductive biases of CNNs that make them highly efficient at learning image data, which pure transformer architectures must otherwise learn from scratch.
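To make the convolution-transformer hybrid concrete, below is a rough sketch of the overlapping, zero-padded patch embedding used by hierarchical backbones such as PVTv2 (layer sizes here are illustrative defaults, not taken from the Multiformer code):

```python
import torch.nn as nn

# Overlapping patch embedding: a zero-padded strided convolution that both
# downsamples the image and, thanks to the padding, leaks positional cues
# into the tokens before they reach the transformer blocks.
class OverlapPatchEmbed(nn.Module):
    def __init__(self, in_channels=3, embed_dim=32, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: [B, 3, H, W]
        x = self.proj(x)                        # [B, C, H/stride, W/stride]
        height, width = x.shape[2:]
        tokens = x.flatten(2).transpose(1, 2)   # [B, N, C] token sequence
        return self.norm(tokens), (height, width)
```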

Recent work using this family of models has shown that their features carry more descriptive power in fewer parameters and with comparable data efficiency to traditional CNN backbones. This study shows that they are also quite capable of learning representations that are useful across multiple tasks. The second rendition of the Pyramid Vision Transformer (PVTv2) stands out among its peers in this family of models, leading to its adoption as the backbone for subsequent breakthroughs in lightweight transformer-based vision architectures. Segformer achieved remarkable results in semantic segmentation by attaching a lightweight all-MLP decoding head to PVTv2, while Global-Local Path Network (GLPN) took a slightly different approach to develop a lightweight head for monocular depth estimation, using a form of attention to combine information across feature layers. Panoptic Segformer silently revealed the potency of PVTv2 in a table that showed this backbone taking a razor-thin second place behind the top-performing Swin-L-backed model with less than half the parameters and FLOPs, shown here:

Chart from Panoptic Segformer shows a clear dominance in parameter efficiency for the PVTv2-B5 backbone.

Detection Transformers

The Detection Transformer (DETR) was the first attempt to deploy transformers on the object detection task. Learned positional embeddings called object queries are passed through the decoder, selectively embedding information into themselves along the way by iteratively attending to the encoded feature maps; an MLP head then decodes this information so that each query becomes a box proposal, which may be given a “no object” classification. Loss is calculated by choosing a set of prediction boxes through bipartite matching with the ground truth via the Hungarian algorithm. At inference, a confidence threshold is used to filter out the non-object box proposals: since logits are passed through a sigmoid rather than a softmax function, each proposal has independent probabilities of belonging to each class that need not sum to one, so although we get a box proposal for each query, negative proposals learn to show a low probability for all classes.
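A sketch of that inference-time filtering step might look like the following (the tensor shapes and threshold value are assumptions for illustration, not the library's actual post-processing code):

```python
import torch

# class_logits: [num_queries, num_classes]; boxes: [num_queries, 4] as normalized (cx, cy, w, h)
def filter_box_proposals(class_logits, boxes, threshold=0.5):
    probs = class_logits.sigmoid()      # independent per-class probabilities (need not sum to 1)
    scores, labels = probs.max(dim=-1)  # best class and its confidence for each query
    keep = scores > threshold           # queries with low confidence everywhere act as "no object"
    return boxes[keep], labels[keep], scores[keep]
```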

DETR marked the first fully end-to-end object detector, eliminating the need for traditional hand-crafted components. First, the loss design of DETR discourages duplicate box proposals for a single object instance by penalizing high confidence scores in any of the object queries not assigned during bipartite matching, whereas traditional object detectors must use non-maximum suppression (NMS) post-processing to filter out the many duplicate box proposals they generate. Second, concerns over region proposals vs. sliding windows become irrelevant, as the queries learn to sample information and localize themselves in an end-to-end fashion.

Deformable DETR offered a large improvement on the performance and convergence time of DETR by introducing deformable attention modules, which drastically reduce the complexity of attention in the transformer layers by learning to selectively sample the feature space (inspired by Deformable Convolution). These modules learn to estimate positional offsets which are used by the queries to selectively attend to a fixed number of relevant key points in the feature volume, rather than endure the computational cost of attending over the entire feature maps.
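The core sampling mechanism can be sketched in a deliberately simplified single-scale, single-head form as follows (the real module is multi-scale and multi-head and handles reference points and offset normalization differently; all names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified deformable attention: each query predicts a few sampling offsets around
# its reference point, gathers features at those locations, and combines them with
# learned attention weights instead of attending over the entire feature map.
class SimpleDeformableAttention(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)   # (x, y) offset per sampling point
        self.attn_weights = nn.Linear(dim, num_points)  # weight per sampling point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.num_points = num_points

    def forward(self, queries, ref_points, feature_map):
        # queries: [B, Q, C]; ref_points: [B, Q, 2] in [0, 1]; feature_map: [B, C, H, W]
        B, Q, C = queries.shape
        value = self.value_proj(feature_map.flatten(2).transpose(1, 2))      # [B, H*W, C]
        value = value.transpose(1, 2).view(B, C, *feature_map.shape[2:])     # [B, C, H, W]
        offsets = self.offsets(queries).view(B, Q, self.num_points, 2)
        weights = self.attn_weights(queries).softmax(dim=-1)                 # [B, Q, P]
        # Sampling locations mapped to [-1, 1], as expected by grid_sample.
        locs = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1       # [B, Q, P, 2]
        sampled = F.grid_sample(value, locs, align_corners=False)            # [B, C, Q, P]
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)   # [B, Q, C]
        return self.out_proj(out)
```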

Thanks to the reduced complexity of deformable attention, Deformable DETR is able to make use of the multi-scale feature maps which typically contain important information for fine-grained understanding, but which were necessarily discarded by DETR to prevent untenable complexity; this leads to a pronounced gain in small-box performance. The advantages in performance and convergence time offered by the Deformable DETR design are demonstrated in the two graphics below:

Results from Deformable DETR demonstrating superior performance over DETR in 10x fewer training epochs.

Methodology

Architecture

Multiformer is a combination of networks that perform complementary computer vision tasks and have demonstrated successful pairing with the PVTv2 backbone. Together, they provide a strong set of predictions across fundamental tasks in autonomous navigation with a wieldy total of 8.1M and 8.3M parameters in the M0 and M1 variants, respectively. This model opens the door for research in small-scale autonomy stacks which can be trained and tested on consumer hardware and deployed in smaller units, and also provides a promising scenario for expansion into 3D object detection. A diagram of the model is shown below in Fig. 1:

Figure 1: Anatomy of the Multiformer model.

In the spirit of keeping things compact, the smallest (B0) version of PVTv2 backbone was used in all training configurations. Additionally, the 2D detection module was downsized substantially from the default Deformable DETR configuration which comprises 6 encoder and 6 decoder layers with a hidden size of 1024, and instead uses a much more compact configuration of 3 encoder and 3 decoder layers with hidden size 256, reducing the number of weights in the 2D detection module by 71% while still proving effective, perhaps another testament to the potency of the PVTv2 features.
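At a high level, the composition in Fig. 1 can be sketched roughly as follows (the module classes and attribute names are placeholders for illustration, not the actual Hugging Face implementation):

```python
import torch.nn as nn

# High-level composition sketch: one shared hierarchical backbone feeding
# three task-specific heads that consume its multi-scale feature maps.
class MultiformerSketch(nn.Module):
    def __init__(self, backbone, semantic_head, depth_head, detection_head):
        super().__init__()
        self.backbone = backbone              # PVTv2-B0: multi-scale features at strides 4/8/16/32
        self.semantic_head = semantic_head    # Segformer-style all-MLP decoder
        self.depth_head = depth_head          # GLPN-style decoder
        self.detection_head = detection_head  # downsized Deformable DETR (3 encoder / 3 decoder layers)

    def forward(self, images):
        features = self.backbone(images)      # list of four feature maps
        return {
            "semantic": self.semantic_head(features),
            "depth": self.depth_head(features),
            "detection": self.detection_head(features),
        }
```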

Training Configurations

All training runs started with backbone weights that were pretrained first on semantic segmentation, then simultaneously on semantic segmentation and depth estimation as part of the Multitask Segformer project. Models were trained with a batch size of 8 over 150,000 training steps, optimized using AdamW with a cosine learning rate schedule from 0.0002 to zero. Following the approach of DETR, the learning rate for the backbone weights was set an order of magnitude lower than that of the detection module. The depth and semantic heads were pretrained with the backbone, and their learning rates were set to 1/5 that of the detection module.
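A rough sketch of that differential learning-rate setup might look like the following (the module attribute names and weight decay value are assumptions; see the project repository for the actual training script):

```python
import torch

def build_optimizer(model, base_lr=2e-4, total_steps=150_000):
    # The detection module trains at the full learning rate, the pretrained backbone
    # an order of magnitude lower, and the pretrained dense heads at 1/5 of the full rate.
    param_groups = [
        {"params": model.detection_head.parameters(), "lr": base_lr},
        {"params": model.backbone.parameters(), "lr": base_lr * 0.1},
        {"params": model.semantic_head.parameters(), "lr": base_lr * 0.2},
        {"params": model.depth_head.parameters(), "lr": base_lr * 0.2},
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)  # weight decay assumed
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0.0)
    return optimizer, scheduler
```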

Two configurations, M0 and M1, were trained to convergence, with the latter differing only in the inclusion of the largest encoder feature map into the Deformable DETR module input after a spatial reduction via 2x2 convolution (see Fig. 1). Early training of the M0 architecture showed small box performance lagging considerably behind the larger sizes, which is consistent with the results in Deformable DETR and typical for the object detection task at large, but is likely also exacerbated by the precision of the synthetic ground truth labels, which will include many challenging, tiny examples that human annotators would be likely to miss. However, since the high-resolution feature maps excluded by default in Deformable DETR are known to be informative for dense prediction tasks like semantic segmentation and depth, it is a reasonable hypothesis that incorporating this discarded information would be beneficial to the fine-grained detection of small objects. To test this, the M1 architecture was adapted to include the largest feature map by first passing it through a 2x2 convolution to reduce the spatial dimension before adding it to the flattened hidden states (see Fig. 1). Ablations were tested on the M0 configuration for expedience.
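A minimal sketch of the M1 modification might look like this (the stride-2 setting and channel counts are assumptions inferred from the description above and the PVTv2-B0 architecture):

```python
import torch.nn as nn

# M1 variant: the highest-resolution (stride-4) backbone feature map is spatially
# reduced with a 2x2 convolution and projected to the detection hidden size before
# being flattened and concatenated with the other feature levels fed to Deformable DETR.
class HighResReduction(nn.Module):
    def __init__(self, in_channels=32, hidden_size=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, hidden_size, kernel_size=2, stride=2)

    def forward(self, feature_map):              # [B, C, H, W]
        x = self.reduce(feature_map)             # [B, hidden, H/2, W/2]
        return x.flatten(2).transpose(1, 2)      # [B, (H/2)*(W/2), hidden] token sequence
```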

An additional M0 configuration was trained that concatenated the semantic logits and predicted log depth onto the inputs of the Deformable DETR encoder. This was an attempt at prediction distillation to aid the detection task and lay the groundwork for expansion into 3D detection, but while the model still performed reasonably well, the naïve concatenation seemed to confuse learning and led to degraded performance overall. A more sophisticated injection of this knowledge, or a longer training cycle that allows time to properly model its relationship to the detection task, may yield better results.

Loss

The semantic task is trained with cross-entropy. Monocular depth is trained with a linear combination of scale-invariant log (SiLog) and mean absolute error (MAE) losses (the latter being included to enforce true-scale accuracy). The 2D detection module uses the loss design from Deformable DETR: bipartite matching with the Hungarian algorithm, focal loss for the box classification, and a mixture of normalized center coordinate L1 loss and generalized IoU (GIoU) loss for the box proposals. The lambda constants for semantic segmentation (semseg), depth, and 2D detection losses were set to 5.0, 1.0, and 1.0, respectively. The SiLog and MAE components of the depth loss were given lambdas of 1.0 and 0.1 to balance their magnitudes, and the 2D detection loss components (focal loss, L1 center loss, GIoU loss) used the lambdas from the DETR literature: 1.0, 5.0, and 2.0, respectively. Written formally, the loss is calculated as follows:
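Since the original loss figure is not reproduced here, the following is one plausible formalization assembled from the components and coefficients listed above:

```latex
\mathcal{L} \;=\; \lambda_{\text{sem}}\,\mathcal{L}_{\text{CE}}
\;+\; \lambda_{\text{depth}}\left(\lambda_{\text{SiLog}}\,\mathcal{L}_{\text{SiLog}} + \lambda_{\text{MAE}}\,\mathcal{L}_{\text{MAE}}\right)
\;+\; \lambda_{\text{det}}\left(\lambda_{\text{cls}}\,\mathcal{L}_{\text{focal}} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{\text{GIoU}}\,\mathcal{L}_{\text{GIoU}}\right)
```

with λ_sem = 5.0, λ_depth = λ_det = 1.0, λ_SiLog = 1.0, λ_MAE = 0.1, λ_cls = 1.0, λ_L1 = 5.0, and λ_GIoU = 2.0.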

Results

Please see the full training report for interactive, enlargeable charts covering all the details of training. A summary graphic of the most pertinent evaluation metrics is below:

Figure 2: Training results from Multiformer configurations. For full breakdown of results and interactive plots, see the training report.

Given that the Deformable DETR module weights were randomly initialized for each training run, this model was able to learn strong 2D detection in a surprisingly manageable 150,000 training steps at batch size 8, or 1.2M total image examples (8 epochs of the 1Hz front camera data in the SHIFT dataset). By comparison, Deformable DETR was trained for 50 epochs on the COCO 2017 train split of 118K images, equivalent to 5.9M total image examples. This is not a clean comparison, however: the classes and domain in the SHIFT dataset are far less diverse, with only 6 annotated object categories appearing only in the context of driving, compared to the 80 object categories of COCO appearing in many contexts. Future work should therefore establish a better comparison to isolate the impact of the multi-task learning, though this would require depth pseudo-labeling, since COCO does not provide depth annotations. Further, this model's backbone was pretrained in-domain on the auxiliary tasks, whereas the backbone for Deformable DETR transferred ImageNet-pretrained weights into the COCO domain, further obfuscating any comparison. Finally, as mentioned above, the 2D detection module used in Multiformer is less than a third the size of the default Deformable DETR configuration, with half the number of encoder and decoder layers using 1/4 of their default hidden dimension, which would also contribute to faster convergence due to substantially fewer trainable parameters.

While the addition of the largest feature map in the M1 configuration does increase the training footprint of the model substantially (roughly 20% at this image resolution), it led to significant improvements across the board, with +1.25 overall mAP, +1.23 mAP medium, and +0.52 mAP small, which represent 3.19%, 1.92%, and 2.66% performance gains, respectively. Large box performance also saw a notable gain of +1.32 mAP in the M1 configuration, meaning that the information contained in the previously discarded feature map was beneficial to more than just fine-grained detection performance.

Ablations

A training run of the Multiformer M0 was performed with supervision from 2D detection only, starting from the same backbone pretrained on the auxiliary tasks but without passing any supervision signal from those tasks at train time, providing a clean comparison of multi-task training against transfer learning. The results show a clear advantage for multi-task training over transfer learning alone: +3.25 mAP, an 8.73% relative gain in performance, and a strong signal that multi-task supervision should be used whenever training labels are available or a teacher network can provide pseudo-labels.

Islam et al. (2020) found that CNNs actually do have the ability to encode positional information from the zero-padding in their layers. This property has enabled both the Convolutional Vision Transformer (CvT) and PVTv2 to demonstrate the safe removal of position embeddings from their architectures by instead relying on zero-padding in strategically placed convolutional layers to infuse positional information naturally, allowing these transformer-based feature extractors to easily run inference on images of different resolutions than they were trained on. Deformable DETR, however, adds positional encoding into the feature layers it takes from any backbone, whatever the type. Since hierarchical transformers purposefully encode positional information into their feature maps using zero-padding, a question arises: do we actually need to add more positional embeddings on top of these features before they are passed through Deformable DETR, or is this redundant? Our test showed that there was a slight advantage to leaving these additional positional embeddings in, and since the sine embeddings provide this benefit with no additional learned parameters, it seems appropriate to use them always.

Conclusion

The PVTv2 backbone has shown an adept ability to learn a common feature space which is useful to all three of the vision tasks learned, even at the smallest (B0) size. Further, similar to the findings of Segformer in semantic segmentation and GLPN in monocular depth estimation, the successful extraction of 2D detection from this feature space can be done with a surprisingly small number of parameters, achieving compelling performance with a Deformable DETR configuration which is less than 1/3 the default size for this module. Isolating 2D detection as the target task, multi-task training with auxiliary supervision signals from semantic segmentation and depth showed a clear advantage over training 2D detection with transfer learning only, with an increase of +3.25 mAP (an 8.7% relative gain in performance). This provides a strong case for using this training method whenever training labels are available, or a teacher network can be used to generate pseudo-labels, even when the final application only requires a single task: the representations learned will be more robust and general, as knowledge learned in other tasks can be leveraged to perform the target task.

The combination of multi-task prowess and parameter efficiency in computer vision demonstrated by a model like Multiformer is highly complementary to autonomous robotics applications, particularly at smaller scales, since it can perform inference at higher framerates and on smaller hardware. As the required footprint of the perception module decreases, the newfound memory surplus can be used to explore combinations with planning and control modules on consumer-sized hardware, democratizing research in full-stack autonomy. Further, since these weights have been trained on data captured in the CARLA simulator, they provide a great starting point for researching and developing autonomous vehicle stacks in that environment.

Future Work

Multiformer has been developed for and within resource-constrained environments, but future studies should scale the architecture to use larger configurations of PVTv2, Segformer, GLPN, and Deformable DETR to monitor the performance deltas which can be achieved when more resources are made available.

The combination of dense depth prediction and 2D detection is a promising recipe for 3D object detection, which is the natural next extension for this model. Panoptic Segformer has already shown the potential for adapting Deformable DETR to the panoptic segmentation task, and there may be potential for exploiting common feature representations among these detection modalities to provide rich inputs for a 3D object detection module with relatively few parameters. Further, it may be possible to apply the deformable attention transformer architecture directly to the 3D object detection task, but it is yet to be determined whether the weights can be shared between the 2D and 3D tasks, or whether these tasks would instead show destructive interference and require separate Deformable DETR modules.

The multi-task learning in this model was balanced using empirically determined lambda values to weight the many loss components; however, it would be informative to test the incorporation of GradNorm into the architecture to avoid relying on handcrafted values, and instead learn to use the optimal combination of loss signals at each training step.

This model is being developed as part of a larger project for a lightweight, open-source autonomous driving stack for CARLA, the simulator in which the SHIFT dataset was captured. It was therefore not trained with the direct intention of transfer learning research, and has not yet been benchmarked on real datasets. However, it would be interesting to see whether synthetic-to-real transfer learning works better with multi-task learned features than with features learned only from the target task. If so, it would mean that the much-loathed synthetic-to-real domain gap can be partially overcome through multi-task learning in the synthetic space, where full annotation suites are naturally more available. Experimenting with transfer learning of these pretrained weights to real datasets would be an excellent next step in determining the benefits of synthetic multi-task learning in a common comparison framework.
