Image courtesy of NVidia

Synthetic Data and Autonomous Vehicles

The role of simulation in AV development

Nate Cibik
6 min read · Apr 19, 2021


Driving simulators play a key role in the development of autonomous vehicles (AVs). Collecting and labeling the data used to train the components of AV software stacks is expensive and time-consuming, and it is difficult to obtain data that cover edge cases while abiding by the rules of the road. Such datasets are also invariably biased toward the location and sensor configuration used for collection, and without special consideration, models trained on them will not generalize well. Augmenting AV training data is necessary to reduce bias and broaden the range of driving scenarios, but this process produces artifacts that can compromise the quality of learning when neural networks learn to recognize the artifacts rather than the driving scenarios. Further, whenever self-driving software is updated, it must be validated for safety, and the only feasible way to do this is to test it in numerous virtual scenarios before deployment. For these reasons, synthetic data and simulation are essential elements of AV development, but they have limitations of their own. In this post, we will explore the key roles of synthetic data in AV development and the ways in which these limitations are being addressed.

For a neural network to learn how to operate in any given driving scenario, it must be trained on labeled data which includes that scenario. For instance, in order to learn the proper behavior for a situation in which the car is drifting off of the road or into the oncoming lane, there must be data collected of such a situation. Clearly, data collection teams cannot safely obtain data representing these unsafe driving situations. There are many scenarios that an autonomous vehicle must be prepared to handle that we hope never to encounter in the real world. This is one reason that simulation is such a powerful tool in autonomous vehicle research, since these necessary edge cases can be hand-crafted and added to the training data.

Another key benefit of data generated through simulation is that the ground truth is known with certainty. Labeling real-world data is immensely time-consuming and expensive, and the amount of data involved in the domain of autonomous driving is staggering. As Mo Elshenawy, Senior VP of Engineering at Cruise, remarked in his talk with NVIDIA’s Neda Cvijetic at GTC21: “in one month we have more data stored and processed and easily accessible than the entire movie industry ever created.” As one can imagine, labeling this much data is a gargantuan task. When data is generated with a simulator, however, bounding boxes, segmentation, depth maps, trajectories, and velocities are all known to the simulator, so the annotation comes for free.
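To make the "annotation comes for free" point concrete, here is a minimal sketch of how a 2D bounding box label could be derived directly from simulator state. Everything in it is an assumption for illustration: a simple pinhole camera looking along +z, hypothetical intrinsics (`focal_px`, `cx`, `cy`), and an axis-aligned 3D box that lies entirely in front of the camera. It is not taken from any particular simulator.

```python
from itertools import product

def project_point(x, y, z, focal_px, cx, cy):
    """Pinhole projection of a camera-frame point (z forward, z > 0) to pixels."""
    return (focal_px * x / z + cx, focal_px * y / z + cy)

def bounding_box_2d(center, size, focal_px=1000.0, cx=640.0, cy=360.0):
    """2D box (u_min, v_min, u_max, v_max) around the projected corners of a 3D box.

    center: (x, y, z) of the object in the camera frame, in metres
    size:   (width, height, length) of the object, in metres
    The intrinsics are hypothetical defaults, not values from a real sensor.
    """
    half = [s / 2.0 for s in size]
    # Enumerate the eight corners of the axis-aligned 3D box.
    corners = [
        (center[0] + sx * half[0], center[1] + sy * half[1], center[2] + sz * half[2])
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]
    pixels = [project_point(x, y, z, focal_px, cx, cy) for x, y, z in corners]
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return (min(us), min(vs), max(us), max(vs))

# A car 2 m wide, 1.5 m tall, and 4 m long, centred 20 m ahead of the camera.
box = bounding_box_2d(center=(0.0, 0.0, 20.0), size=(2.0, 1.5, 4.0))
print(box)
```

Because the simulator already knows the object's pose and dimensions, the label is a by-product of rendering rather than a separate manual step.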

In training PilotNet, a deep neural network that steers a car down a roadway using camera input, the NVIDIA team used classical computer vision techniques to augment the dataset and reduce its bias. By applying rotations and translations to the camera images, they modified the apparent position and orientation of the vehicle in the lane, seeking to represent driving scenarios that were not safe to collect. These image transformations introduced artifacts into the data, and the team found that because there was a correlation between the desired control outputs and the type of artifact present in an image, the neural network learned to “cheat” on its key performance indicator (KPI) tests by learning the desired controls based on these artifacts, rather than the driving scenario itself.
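The shortcut described above can be illustrated with a toy example. This is a simplified sketch, not the PilotNet pipeline: it only shifts a tiny grayscale image sideways, pads the exposed edge with a constant, and adjusts the steering label with a hypothetical linear rule (`STEERING_PER_PIXEL` and its sign convention are made up for illustration). The point is that the width of the padded band ends up perfectly correlated with the label.

```python
STEERING_PER_PIXEL = 0.01  # hypothetical correction gain, not a value from PilotNet

def shift_image(image, shift_px, fill=0):
    """Shift every row horizontally (positive = right), padding with `fill`."""
    width = len(image[0])
    shift_px = max(-width, min(width, shift_px))  # clamp to the image width
    shifted = []
    for row in image:
        if shift_px >= 0:
            shifted.append([fill] * shift_px + row[: width - shift_px])
        else:
            shifted.append(row[-shift_px:] + [fill] * (-shift_px))
    return shifted

def augment(image, steering, shift_px):
    """Return the shifted image and a steering label adjusted for the shift."""
    return shift_image(image, shift_px), steering - STEERING_PER_PIXEL * shift_px

def padded_columns(image, fill=0):
    """Count columns that consist entirely of the fill value (the artifact)."""
    return sum(all(row[c] == fill for row in image) for c in range(len(image[0])))

# Toy 4x8 "camera frame" with pixel values 1..9, so 0 only appears as padding.
road = [[(r + c) % 9 + 1 for c in range(8)] for r in range(4)]
samples = [augment(road, 0.0, s) for s in (-3, -1, 0, 2)]
artifact_vs_label = [(padded_columns(img), label) for img, label in samples]
print(artifact_vs_label)
```

In this toy setup, a model could recover the magnitude of the steering correction just by measuring the padded band, without looking at the road at all, which is the kind of shortcut the paragraph above describes.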

The above example gives us an idea of why generating synthetic data to represent these driving scenarios would be so useful. But this comes with its own challenges.

First, there is a documented domain gap between simulated and real driving data, due in large part to the difficulty of creating realistic representations of the real world using computer graphics. To mitigate this issue, researchers have had success in using Generative Adversarial Networks (GANs) to increase the realism of synthetic data. This is done by training one neural network to modify the data in a way that fools another neural network into thinking that the data comes from a real-world dataset. In a recent study, researchers at Ford found that such “Sim-to-Real” data was effective in reducing the bias of real-world datasets, allowing models trained with a combination of real and synthetic data to generalize far better. However, as Alperen Degirmenci, Senior Deep Learning R&D Engineer at NVIDIA, commented in his presentation during GTC21, GANs can also introduce artifacts into images. Considering the NVIDIA team's experience catching DNNs cheating with artifacts from data augmentation, it is possible that GAN artifacts could cause similar issues.

Another clear direction in bridging the domain gap between real and synthetic data is simply to make the simulator's graphics more realistic. To this end, multiple companies are stepping up to the plate, deploying teams of seasoned 3D graphics developers to tackle the challenge. NVIDIA will be releasing version 2.0 of their Drive Sim later this year, which promises to be a huge step in this direction. With cutting-edge graphics techniques such as ray-traced lighting, this simulator promises to bring synthetic data to a new level of realism. Another powerful simulator comes from Applied Intuition, and it has become the weapon of choice for autonomous trucking company Kodiak.

The second major challenge in harnessing the power of synthetic data is the ease and convenience of its generation. Handcrafting each driving scenario from scratch in a simulated environment is almost as labor-intensive as collecting and labeling real data. Thus, a streamlined process by which AV developers can obtain the high-quality synthetic data they need to cover an operational domain is a must. Parallel Domain is a company focused on providing efficient, procedural methods for generating synthetic data by creating numerous variations of domains captured in real-world data, through “combinatorial parameter sweeps, generating data for each vehicle, multiplied by every lighting condition, by every weather condition, by every paint color.” Further, they simulate the sensor readings of these recreated environments according to the specific configuration of the platform which captured the original data, making it easy for developers to put this synthetic data directly into their ML training pipelines.
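The idea of a combinatorial parameter sweep can be sketched in a few lines. The parameter names follow the quote above, but the specific values and the `scenario_variations` helper are hypothetical placeholders, not Parallel Domain's API or configuration format.

```python
from itertools import product

# Hypothetical sweep values; a real pipeline would use far larger lists.
SWEEP = {
    "vehicle": ["sedan", "pickup", "bus"],
    "lighting": ["dawn", "noon", "night"],
    "weather": ["clear", "rain", "fog"],
    "paint_color": ["white", "black", "red"],
}

def scenario_variations(sweep):
    """Yield one configuration dict per combination of parameter values."""
    keys = list(sweep)
    for values in product(*(sweep[k] for k in keys)):
        yield dict(zip(keys, values))

variations = list(scenario_variations(SWEEP))
print(len(variations))  # 3 * 3 * 3 * 3 = 81 combinations
print(variations[0])
```

Even this small example shows why procedural generation matters: the number of variations grows multiplicatively with each parameter added, which quickly outpaces what could be handcrafted.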

Simulation already plays an enormous role in validating the safety of updates to self-driving software. Whenever software stacks are updated, a company must verify that the new version is still safe for use, and that the updates have not created bugs, conflicts, or unexpected behavior. There is no other way to do this than to deploy the software in simulation at a massive scale in order to prove that it still operates as expected in all scenarios after these updates. As Jeff Garnier of Kodiak comments on building the company’s safety case: “it’s not the number of miles we simulate that matters, it’s whether we can get sufficient coverage of the full range of driving scenarios we see in our Operational Design Domain.” This again demonstrates the need to ensure full domain coverage in the data, which is only feasible through synthetic data generation.
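The distinction between miles and coverage can be made concrete with a small sketch. The scenario names, the run records, and the `coverage_report` helper are all hypothetical, and real safety cases use far richer scenario taxonomies. The sketch assumes the list of required scenario types is non-empty.

```python
def coverage_report(odd_scenarios, runs):
    """Summarise which required scenario types were exercised and which failed.

    odd_scenarios: scenario-type names that define the operational domain
    runs: (scenario_type, miles, passed) tuples from simulation
    """
    required = set(odd_scenarios)
    seen, failed, miles = set(), set(), 0.0
    for scenario, run_miles, passed in runs:
        miles += run_miles
        if scenario in required:
            seen.add(scenario)
            if not passed:
                failed.add(scenario)
    return {
        "miles": miles,
        "coverage": len(seen) / len(required),
        "missing": sorted(required - seen),
        "failed": sorted(failed),
    }

# Hypothetical domain and simulation log: many miles, but only two scenario types.
odd = ["highway_cruise", "merge", "cut_in", "construction_zone"]
runs = [
    ("highway_cruise", 5000.0, True),
    ("highway_cruise", 7000.0, True),
    ("merge", 40.0, True),
]
report = coverage_report(odd, runs)
print(report)
```

Here more than twelve thousand simulated miles exercise only half of the required scenario types, which is the situation the quote warns against.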

As we have seen, driving simulators and synthetic data play a massive role in the development of autonomous vehicles. Necessary for dataset augmentation as well as safety assurance, simulators are used daily by autonomous vehicle companies, and the massive demand for high-quality synthetic data in the AV space will continue to push the envelope in both graphics and ease of data generation. We can expect to see simulation and autonomous technologies growing symbiotically over the coming years.



Nate Cibik

Data Scientist with a focus in autonomous navigation