The Nexar Dashcam Crash Prediction Challenge presented a compelling task: predict whether a collision or near-collision event is imminent based on dashcam video footage. The training dataset comprised 1,500 real-world driving videos, each approximately 40 seconds long, annotated with precise timestamps for both the event (collision or near-collision) and the earliest moment it could be predicted – the alert time. The test set included 1,344 videos, half of which ended either 0.5, 1.0, or 1.5 seconds before an event.

Rather than focusing on finding the best model architecture, doing hyperparameter tuning, adding optical flow data, or incorporating a secondary transformer/LSTM network on top of predicted time series, I wanted to keep it simple! I adopted a 100% data-centric approach. My goal was to see how far performance could be pushed using only the feedback from a standard model – mvit_v2_s (a Multiscale Vision Transformer from PyTorch’s Torchvision library) – and the 3LC framework to iteratively refine the training data based on the model’s feedback. This included inspecting predictions to build understanding, removing ambiguous data, weighting valuable examples, and debugging via the embedding space.

This was also the first time I used 3LC on video models – usually I have used it for object detection or instance segmentation, so that was interesting in itself.

This article walks through the process, decisions, and insights that ultimately elevated my score from 0.71 to 0.898 on the leaderboard – and won the competition – without altering the model architecture or playing with model parameters.

👉 Code repository available here

Registering the Videos in 3LC Tables

The first step was converting the videos to individual 256×256 frames and registering them as a table in 3LC. Early testing confirmed my suspicion: training on frames after the event/crash was detrimental. The footage after a crash was quite chaotic, and the goal was to predict when an alert should go off, so I trimmed those frames out during the registration step.
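
As a rough illustration, the extraction-with-trimming step could look something like the sketch below (using OpenCV; the file layout, helper name, and the convention that videos without an event have no event timestamp are my assumptions, not the competition code):

```python
import cv2
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, event_time: float | None) -> list[Path]:
    """Extract 256x256 frames, dropping everything after the event (if any)."""
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    out_dir.mkdir(parents=True, exist_ok=True)
    frame_paths, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Trim everything after the crash/near-miss: those frames are chaotic and
        # irrelevant to predicting when an alert should go off.
        if event_time is not None and idx / fps > event_time:
            break
        frame = cv2.resize(frame, (256, 256))
        out_path = out_dir / f"{idx:05d}.jpg"
        cv2.imwrite(str(out_path), frame)
        frame_paths.append(out_path)
        idx += 1
    cap.release()
    return frame_paths
```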

Each registered frame included the following metadata:

  • Time to Event
  • Time to Alert
  • Event Occurs (if the frame is within the alert-to-event window)
  • Has Event (if the frame belongs to a video that contains an event)

 

This setup allowed me to filter and inspect the dataset intelligently even before training.
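
A minimal sketch of how those per-frame columns can be derived from the annotation timestamps is shown below; the column names and the -1.0 sentinel for videos without an event are illustrative assumptions, and the actual registration into a 3LC table happens afterwards via the 3lc Python package.

```python
def frame_metadata(frame_idx: int, fps: float,
                   event_time: float | None, alert_time: float | None) -> dict:
    """Per-frame metadata mirroring the columns registered in the 3LC table."""
    t = frame_idx / fps
    has_event = event_time is not None
    event_occurs = (has_event and alert_time is not None
                    and alert_time <= t <= event_time)   # inside the alert-to-event window
    return {
        "time_to_event": (event_time - t) if has_event else -1.0,   # -1.0 = no event (assumed sentinel)
        "time_to_alert": (alert_time - t) if alert_time is not None else -1.0,
        "event_occurs": int(event_occurs),
        "has_event": int(has_event),
    }
```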

Plot of video vs. frame number, colored by Event Occurs in green

Exploring Alert-to-Event Frames

Using 3LC’s filtering tools, I focused on frames between the alert and the event. Exploring those, I judged that the model needed around 2 seconds of context to make a good prediction – subjectively, that was “enough” change in the situation. So even though each training sample was anchored on a single frame, I loaded the 15 preceding frames (step size = 4) in the PyTorch dataset for each sample during training. The mvit_v2_s model accepts 16-frame inputs, so that worked well, and at 30 fps it amounts to about 2 seconds of context per training sample.
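
Here is a sketch of that clip assembly as a PyTorch dataset, assuming each registered row knows its video’s ordered frame paths and its own index within the video (field names are hypothetical; Kinetics-style normalization is omitted for brevity):

```python
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class ClipDataset(Dataset):
    """One anchor frame per sample, plus its 15 preceding frames at step 4 (~2 s at 30 fps)."""

    def __init__(self, rows, num_frames: int = 16, step: int = 4):
        self.rows = rows              # one dict per registered frame (paths, labels, metadata)
        self.num_frames = num_frames
        self.step = step

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, i: int):
        row = self.rows[i]
        paths = row["video_frames"]   # ordered frame paths of this video (hypothetical field)
        anchor = row["frame_index"]   # anchor frame's index within the video
        idxs = [max(anchor - k * self.step, 0)        # clamp at the start of the video
                for k in range(self.num_frames - 1, -1, -1)]
        frames = [read_image(str(paths[j])).float() / 255.0 for j in idxs]
        clip = torch.stack(frames, dim=1)             # (C, T, H, W), as mvit_v2_s expects
        return clip, float(row["event_occurs"])
```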

Filtering in only those frames that are between alert and event

I trained on 6,000 samples per epoch, using PyTorch’s WeightedRandomSampler. Since each sample includes 16 frames, this meant loading 96,000 frames per epoch. I reserved 20% of the videos for validation early on, later reducing that to 2%, and I ensured that frames from the same video were never split between the training and validation sets.
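
Below is a sketch of the video-level split plus the weighted sampling, under the assumption that each row carries a video_id and a sample_weight column (the actual weight values are discussed later); batch size and worker counts are guesses:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Hold out whole videos so frames from one video never cross the train/val boundary.
video_ids = sorted({r["video_id"] for r in rows})
val_videos = set(video_ids[: max(1, len(video_ids) // 50)])   # ~2% of videos for validation
train_rows = [r for r in rows if r["video_id"] not in val_videos]

# 6,000 weighted draws per epoch; the per-row weights come from a column in the table.
weights = torch.tensor([r["sample_weight"] for r in train_rows], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=6000, replacement=True)

train_loader = DataLoader(ClipDataset(train_rows), batch_size=8,
                          sampler=sampler, num_workers=8, pin_memory=True)
```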

Each epoch took ~3 minutes on an NVIDIA A100. For validation, I sampled sparsely (every 64th frame for non-events, every 4th for events), keeping validation under a minute. I observed best results after 5–20 epochs of training.
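
The sparse validation selection is simple to express; the modulo-on-frame-index approach below (continuing the previous sketch) is my assumption about how “every 64th / every 4th frame” was realised:

```python
# Sparse validation: every 4th frame inside the alert-to-event window,
# every 64th frame elsewhere, keeping a full validation pass under a minute.
val_rows = [r for r in rows if r["video_id"] in val_videos]
val_subset = [r for r in val_rows
              if (r["event_occurs"] and r["frame_index"] % 4 == 0)
              or (not r["event_occurs"] and r["frame_index"] % 64 == 0)]
val_loader = DataLoader(ClipDataset(val_subset), batch_size=8, num_workers=8)
```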

After the initial training run, I selected the best-performing epoch (on the validation set) and ran inference on the training set, saving the results as a 3LC run. Inference on the full training set alone took ~4 hours, so I ran it separately after training. Usually I run inference and add metrics to the 3LC run for each epoch, but here that would simply take too long.
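
A hedged sketch of that post-training inference pass follows – the checkpoint path and batch size are placeholders, and attaching the resulting per-frame probabilities (and embeddings) to a 3LC run is done with the 3lc package afterwards:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.models.video import mvit_v2_s

# Post-training inference over the full training set; the resulting per-frame
# probabilities are what get analyzed alongside the table metadata in the UI.
model = mvit_v2_s(weights=None, num_classes=1)
model.load_state_dict(torch.load("best_epoch.pt"))   # placeholder checkpoint path
model.eval().cuda()

probabilities = []
with torch.no_grad():
    for clips, _ in DataLoader(ClipDataset(train_rows), batch_size=16, num_workers=8):
        probs = torch.sigmoid(model(clips.cuda())).squeeze(1)
        probabilities.extend(probs.cpu().tolist())
```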

Opening the training run in 3LC, I could plot and analyze the original data together with the training metrics captured for each frame, and start making the necessary changes to the training data!

Middle view shows where events occur in each video; the right view shows predictions at a threshold of 0.95

Refining the Training Data

Using the 3LC UI, I began by inspecting high-confidence “event” predictions that were incorrect. In many of these cases (subjectively speaking), I believe an alert would actually have been useful in real life! However, since I wasn’t confident enough to re-label these frames, I chose to delete them from the training table instead.

I found many frames where the model was quite certain a collision was about to happen, even though the samples were not labelled as such

To do this, I first isolated the cases by using text filtering on the left-hand panel – targeting the specific videoIDs I’d identified – and then used the lasso tool to select the sequences of frames I wanted to remove.

Selecting sequences around false positives for deletion

After selecting, I deleted all frames currently filtered in. Deleting samples in a 3LC table doesn’t modify the underlying data; it simply creates a sparse revision. However, when you load this revision in Python and use it as a PyTorch dataset, the changes take effect immediately. This makes it possible to rapidly experiment with dataset versions without copying or rewriting files.
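
In code, picking up such a revision is just a matter of loading it by URL; the snippet below assumes the 3lc package’s Table.from_url and map-style row access, so treat the exact calls and the placeholder URL as approximations rather than the definitive API.

```python
import tlc

# Assumed usage: load a specific table revision by URL and treat it as a dataset.
table = tlc.Table.from_url("<url-of-the-revision-with-deletions>")
print(len(table))    # the deleted rows are simply absent in this revision
row = table[0]       # rows expose the registered columns (paths, labels, weights, ...)
```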

Final deletion step in UI

Analyzing Prediction Errors

I created a derived column in 3LC to highlight mismatches between Event Occurs and prediction > 0.95. Errors were minimal in non-event videos and were mostly concentrated just before the alert in videos that had an event – was the model predicting events earlier than expected?
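
An offline equivalent of that derived column (the 3LC UI lets you define it directly) might look like this with pandas; the input arrays are illustrative stand-ins for the registered label column and the inference probabilities:

```python
import pandas as pd

# Flag frames where the Event Occurs label and a 0.95-thresholded prediction disagree.
df = pd.DataFrame({"event_occurs": event_occurs_flags,   # illustrative inputs
                   "prediction": probabilities})
df["error"] = (df["prediction"] > 0.95) != df["event_occurs"].astype(bool)
print(df["error"].mean())   # overall error rate at this threshold
```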

Most errors occur around the transition from pre-alert to alert; there are few errors on the right side for videos with no alert/event – errors in green

To investigate further, I plotted the error frames in the model’s embedding space – they clustered tightly in one region. Most other frames were correctly predicted.
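
For reference, per-sample embeddings like the ones plotted here can be captured with a forward hook on the classification head; the `head` attribute name follows torchvision’s MViT implementation, but treat the details as an assumption rather than the exact code used:

```python
import torch
from torchvision.models.video import mvit_v2_s

model = mvit_v2_s(weights=None, num_classes=1).eval()

embeddings = []
def grab_head_input(_module, inputs, _output):
    embeddings.append(inputs[0].detach().cpu())   # features entering the final classifier

handle = model.head.register_forward_hook(grab_head_input)
with torch.no_grad():
    model(torch.randn(2, 3, 16, 224, 224))        # dummy clip batch (B, C, T, H, W)
handle.remove()
print(embeddings[0].shape)                        # (batch, embedding_dim)
```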

(Side note: in general, when capturing metrics on training data, I have found it is better to take an early epoch than a later one, to avoid results that have overfitted.)

Embedding space showing error clusters in green

Zooming in on the frames in embedding space where the model struggles.

Green frames are incorrect predictions, shown in embedding space

Further Training Data Edits

Based on all these insights, I made three key changes:

  1. Deleted the last 0.25 seconds before a crash, but only if an alert had been active for at least 0.5 seconds. My hypothesis: frames just before impact may reduce the model’s ability to learn early signals.
  2. Weighted up the alert-to-event frames so they appeared as frequently as all other frames. I added a weight column in 3LC and passed it to the WeightedRandomSampler.
  3. Doubled the weight of samples from videos with events (but outside the alert zone) to encourage the model to learn distinctions within the same video. Based on the analysis above, I saw that there were almost no errors on no-event videos. (A code sketch of all three edits follows below.)
Lassoing these in embedding space shows that errors are mostly on frames just before or after the alert
Deleting frames from 0.25 s to 0 s before the event, as long as the alert has been active for more than 0.5 seconds
Weighting the remaining samples in the alert zone by 30x so they show up as often in training as the other frames
Weighting event-video frames outside the alert-to-event window by 2x
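
Expressed as plain Python over the registered frame rows, the three edits look roughly like the sketch below (in practice they were made as 3LC revisions – deletions plus a weight column fed to the WeightedRandomSampler; field names follow the earlier sketches):

```python
kept_rows, sample_weights = [], []
for r in rows:
    in_alert_zone = bool(r["event_occurs"])

    # 1. Drop the last 0.25 s before impact, but only if the alert has already been
    #    active for at least 0.5 s (time_to_alert <= -0.5 means it started >= 0.5 s ago).
    if in_alert_zone and 0.0 <= r["time_to_event"] <= 0.25 and r["time_to_alert"] <= -0.5:
        continue

    if in_alert_zone:
        weight = 30.0      # 2. alert-to-event frames: sampled about as often as the rest
    elif r["has_event"]:
        weight = 2.0       # 3. event videos outside the alert window: doubled
    else:
        weight = 1.0

    kept_rows.append(r)
    sample_weights.append(weight)
```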

Things That Didn’t Work

One of the experiments I spent a lot of time on was weighting up the region in embedding space where the model consistently struggled to learn the correct outcome. My hope was that if the model got to see those samples more often, it would eventually learn them better.

Unfortunately, it didn’t help. Those difficult frames seemed inherently ambiguous. That said, I believe this approach could work in a larger-scale setting with unlabeled data. You could run inference with a trained model, capture embeddings, and use the hard-to-learn region as a filter for active learning – targeting those samples for human annotation.

It’s a direction worth exploring further.

Key Insights

A few takeaways from the final phase of the project:

  • The final training table scored 0.898 on 50% of the test set.
  • It took 61 epochs to reach that, which amounts to roughly 366,000 training samples out of a pool of ~1.2 million (effectively far fewer unique samples, since statistically about half of each epoch consisted of the roughly 30,000 event samples, and there was also overlap between epochs due to weighted random resampling).
  • However, I achieved almost the same score with only 6–15 epochs, indicating that perhaps just ~2-5% of the data contained the most impactful training signals.
  • This reinforces the importance of targeted data selection. Simply throwing more data at the model wouldn’t help unless that data actually contributes to the model’s understanding.
  • An unsupervised pretraining approach on a larger chunk of unlabelled data, followed by supervised fine-tuning on intelligently curated data, could likely yield even stronger results.

Results

I was very happy with the improvement from 0.71 → 0.898 on the competition leaderboard, especially since my only changes were to the training data – no model architecture tweaks, no parameter sweeps!

This experiment underlined for me how powerful it can be to understand where your model struggles and to treat the data as the primary lever for performance.

The biggest challenge? Designing a fast but accurate validation strategy. The model converged quickly, so it was crucial that validation was fast and precise as well. I spent a lot of time tuning validation subsampling to balance speed and precision, and while I never fully perfected it, it was good enough to guide my decisions.

Below is a view of the final revisions derived from the initial table (I did, however, experiment a lot with other revisions before this).

Different table revisions

Final Thoughts

This was an extremely fun challenge and I feel I learned a lot! Huge thanks to Nexar Inc. and Daniel Moura for hosting it!

Dataset provided by Nexar Inc.

Original article written by Paul Endresen and posted on LinkedIn on May 9, 2025: https://www.linkedin.com/pulse/nexar-dashcam-kaggle-challenge-paul-endresen-ayd2c/?trackingId=htKIkR7lQAYGrWHdFRolUA%3D%3D

Moura, Daniel C., and Zvitia, Orly. “Nexar Collision Dataset.” Hugging Face, 2025, https://huggingface.co/datasets/nexar-ai/nexar_collision_prediction.
