The Nexar Dashcam Crash Prediction Challenge presented a compelling task: predict whether a collision or near-collision event is imminent based on dashcam video footage. The training dataset comprised 1,500 real-world driving videos, each approximately 40 seconds long, annotated with precise timestamps for both the event (collision or near-collision) and the earliest moment it could be predicted – the alert time. The test set included 1,344 videos, half of which ended either 0.5, 1.0, or 1.5 seconds before an event.
Rather than hunting for the best model architecture, doing hyperparameter tuning, adding optical flow data, or bolting on a secondary transformer/LSTM network over predicted time series, I wanted to keep it simple! I adopted a 100% data-centric approach. My goal was to see how far performance could be pushed using only a standard model – mvit_v2_s (a Multiscale Vision Transformer from PyTorch’s Torchvision library) – and the 3LC framework to iteratively refine the training data based on the model’s feedback. This included inspecting predictions to build understanding, removing ambiguous data, weighting valuable examples, and debugging via the embedding space.
This was also the first time I used 3LC on video models – usually it has been object detection or instance segmentation – so that was interesting in itself.
This article walks through the process, decisions, and insights that ultimately lifted my score from 0.71 to 0.898 on the leaderboard – and won me the competition – without altering the model architecture or playing with model parameters.
👉 Code repository available here
Registering the Videos in 3LC Tables
The first step was converting the videos to individual 256×256 frames and registering them as a table in 3LC. Early testing confirmed my suspicion: training on frames after the event/crash was detrimental. The footage after a crash was quite chaotic, and the goal was to predict when an alert should go off – so I trimmed those frames out during the registration step.
Each registered frame included the following metadata:
- Time to Event
- Time to Alert
- Event Occurs (if the frame is within the alert-to-event window)
- Has Event (if the video the frame belongs to contains an event)
This setup allowed me to filter and inspect the dataset intelligently even before training.
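To make the registration step concrete, here is a minimal sketch of how the frame extraction and per-frame metadata could be produced, using OpenCV. The function name and column names are illustrative, and the actual step of writing these rows into a 3LC table via the 3LC SDK is not shown.

```python
import cv2

def extract_frames(video_path, event_time, alert_time, fps=30, size=(256, 256)):
    """Extract resized frames plus the per-frame metadata described above.

    Frames after the event are trimmed out, since training on post-crash
    footage proved detrimental. `event_time` / `alert_time` are in seconds,
    and None for videos without an event.
    """
    cap = cv2.VideoCapture(video_path)
    rows = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = frame_idx / fps
        has_event = event_time is not None
        if has_event and t > event_time:
            break  # trim everything after the crash
        rows.append({
            "image": cv2.resize(frame, size),
            "time_to_event": (event_time - t) if has_event else -1.0,  # -1.0 = no event
            "time_to_alert": (alert_time - t) if has_event else -1.0,
            "event_occurs": has_event and alert_time <= t <= event_time,
            "has_event": has_event,
        })
        frame_idx += 1
    cap.release()
    return rows
```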

Exploring Alert-to-Event Frames
Using 3LC’s filtering tools, I focused on frames between the alert and the event. Exploring those, I decided the model needed around 2 seconds of context to make a good prediction, as that was subjectively “enough” change in the situation. So even though each training sample was anchored on a single frame, I loaded the 15 preceding frames (step size = 4) in the PyTorch dataset for each sample during training. The mvit_v2_s model accepts 16-frame inputs, so that worked well – at 30 fps, each training sample covered about 2 seconds.
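A minimal sketch of such a clip dataset is shown below. It assumes each table row exposes the frame as a CHW tensor under `"image"`, and it clamps indices at the start of the table for simplicity; a real implementation would also make sure a clip never crosses a video boundary.

```python
import torch
from torch.utils.data import Dataset

class ClipDataset(Dataset):
    """Each sample is one anchor frame plus its 15 preceding frames (step 4),
    i.e. 16 frames spanning ~2 seconds at 30 fps - the input mvit_v2_s expects."""

    def __init__(self, frames_table, num_frames=16, step=4):
        self.table = frames_table  # per-frame rows with image + metadata
        self.num_frames = num_frames
        self.step = step

    def __len__(self):
        return len(self.table)

    def __getitem__(self, idx):
        # Offsets 60, 56, ..., 4, 0: the anchor frame and the 15 frames before it.
        offsets = range((self.num_frames - 1) * self.step, -1, -self.step)
        indices = [max(idx - o, 0) for o in offsets]  # oldest -> newest, clamped at 0
        clip = torch.stack([self.table[i]["image"] for i in indices])  # (T, C, H, W)
        clip = clip.permute(1, 0, 2, 3)  # (C, T, H, W), as torchvision video models expect
        label = float(self.table[idx]["event_occurs"])
        return clip, label
```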

I trained on 6,000 samples per epoch, using torch’s WeightedRandomSampler. Since each sample includes 16 frames, this meant loading 96,000 frames per epoch. I reserved 20% of the videos for validation early on, later reducing that to 2%. I ensured that frames from the same video were never split between the training and validation sets.
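In code, the per-epoch resampling looks roughly like the sketch below, assuming a `sample_weights` array with one weight per sample (how those weights were derived is described further down); the batch size and worker count are illustrative.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# One weight per sample in the training table; 6,000 draws per epoch with replacement.
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=6_000,
    replacement=True,
)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler, num_workers=8)
```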
Each epoch took ~3 minutes on an NVIDIA A100. For validation, I sampled sparsely (every 64th frame for non-events, every 4th for events), keeping validation under a minute. I observed best results after 5–20 epochs of training.
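The validation subsampling can be expressed as a simple index filter. A minimal sketch, reusing the ClipDataset from above and assuming each row exposes the Event Occurs flag:

```python
from torch.utils.data import DataLoader, Subset

# Sparse validation: every 4th frame inside the alert-to-event window,
# every 64th frame elsewhere, to keep validation under a minute.
val_indices = []
for i, row in enumerate(val_table):
    stride = 4 if row["event_occurs"] else 64
    if i % stride == 0:
        val_indices.append(i)

val_loader = DataLoader(Subset(ClipDataset(val_table), val_indices),
                        batch_size=16, shuffle=False)
```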
After the initial training run, I selected the best-performing epoch (on the validation set) and ran inference on the training set, saving the results as a 3LC run. Inference on the full training set alone took ~4 hours, so I ran it separately post-training. Usually I run inference and add metrics to the 3LC run for each epoch, but here that would simply have taken too long.
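The inference pass itself is a plain PyTorch loop that collects a probability per sample; registering those predictions as a 3LC run is a separate 3LC-specific step not shown here. The single-logit head is an assumption about how the classifier was set up.

```python
import torch

@torch.no_grad()
def collect_predictions(model, loader, device="cuda"):
    """Run inference over every sample in the training table and return
    per-sample event probabilities (assuming a single-logit head)."""
    model.eval().to(device)
    probs_out = []
    for clips, _ in loader:
        logits = model(clips.to(device))
        probs_out.extend(torch.sigmoid(logits.squeeze(-1)).cpu().tolist())
    return probs_out
```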
Opening the training run in 3LC, I could plot and analyze the original data alongside the training metrics captured for each frame, and start making the necessary changes to the training data!

Refining the Training Data
Using the 3LC UI, I began by inspecting high-confidence “event” predictions that were incorrect. In many of these cases (subjectively speaking), I believe it would have been useful to receive an alert in real life! However, since I wasn’t confident enough to re-label these frames, I chose to delete them from the training table instead.

To do this, I first isolated the cases by using text filtering on the left-hand panel – targeting the specific videoIDs I’d identified – and then used the lasso tool to select the sequences of frames I wanted to remove.

After selecting, I deleted all frames currently filtered in. Deleting samples in a 3LC table doesn’t modify the underlying data; it simply creates a sparse revision. However, when you load this revision in Python and use it as a PyTorch dataset, the changes take effect immediately. This makes it possible to rapidly experiment with dataset versions without copying or rewriting files.
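As a rough sketch of how a revision is picked up on the Python side (the exact 3LC SDK entry point is from memory and the table URL is purely illustrative, so check the 3LC docs for the real call):

```python
import tlc  # the 3LC Python package

# Load the sparse revision created in the 3LC UI; the URL here is illustrative.
table = tlc.Table.from_url("s3://my-bucket/3lc/nexar-frames/deleted-ambiguous-events")

# The revision behaves like a regular map-style dataset, so the same
# ClipDataset / DataLoader code runs unchanged against the edited data.
train_dataset = ClipDataset(table)
```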

Analyzing Prediction Errors
I created a derived column in 3LC to highlight mismatches between Event Occurs and prediction > 0.95. Errors were minimal in non-event videos and mostly concentrated just before the alert in videos that did have an event – was the model predicting events earlier than expected?
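An offline equivalent of that derived column is a one-liner over the saved predictions; the sketch below assumes `video_ids`, `event_occurs`, and `predictions` arrays collected from the inference run.

```python
import pandas as pd

df = pd.DataFrame({
    "video_id": video_ids,
    "event_occurs": event_occurs,  # bool: frame is inside the alert-to-event window
    "prediction": predictions,     # per-sample probability from the inference run
})
# Flag frames where a confident "event" prediction disagrees with the label.
df["error"] = (df["prediction"] > 0.95) != df["event_occurs"]

# Which videos concentrate the errors?
print(df.groupby("video_id")["error"].sum().sort_values(ascending=False).head(10))
```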

To investigate further, I plotted the error frames in the model’s embedding space – they clustered tightly in one region. Most other frames were correctly predicted.
(Sidenote: in general, when capturing metrics on training data, I have found it is better to take an early epoch than a later one, to avoid results that have overfitted.)

Zooming in on the frames in embedding space where the model struggles.
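For readers who want to reproduce this kind of view outside the 3LC UI, here is a rough sketch: it assumes torchvision’s MViT exposes its classification head as `.head`, and uses PCA merely as a stand-in for 3LC’s own dimensionality reduction.

```python
import copy
import torch
from sklearn.decomposition import PCA

@torch.no_grad()
def extract_embeddings(model, loader, device="cuda"):
    # Swap the classification head for Identity so the forward pass returns
    # the pooled features instead of logits.
    backbone = copy.deepcopy(model).eval().to(device)
    backbone.head = torch.nn.Identity()
    feats = [backbone(clips.to(device)).cpu() for clips, _ in loader]
    return torch.cat(feats)

# Project to 2D for plotting, similar in spirit to the embedding view in 3LC.
embeddings = extract_embeddings(model, val_loader)
coords = PCA(n_components=2).fit_transform(embeddings.numpy())
```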

Further Training Data Edits
Based on all these insights, I made three key changes:
- Deleted the last 0.25 seconds before a crash, but only if an alert had been active for at least 0.5 seconds. My hypothesis: frames just before impact may reduce the model’s ability to learn early signals.
- Weighted up the alert-to-event frames so they appeared as frequently as all other frames. I added a weight column in 3LC and passed it to the WeightedRandomSampler.
- Doubled the weight of samples from videos with events (but outside the alert zone) to encourage the model to learn distinctions within the same video. Based on the analysis above, I had seen almost no errors on non-event videos. A sketch of the resulting weighting is shown right after this list.
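A minimal sketch of how such a weight column could be built; the exact values are illustrative, only the relative weighting between the three groups matters.

```python
import numpy as np

event_occurs = np.asarray(event_occurs, dtype=bool)  # frame inside alert-to-event window
has_event = np.asarray(has_event, dtype=bool)        # frame from a video containing an event

weights = np.ones(len(event_occurs))
# Alert-to-event frames drawn roughly as often as everything else combined (~50% of draws).
weights[event_occurs] = (~event_occurs).sum() / event_occurs.sum()
# Double weight for event-video frames outside the alert zone.
weights[has_event & ~event_occurs] = 2.0

# This array is stored as a weight column in the 3LC table and passed to the
# WeightedRandomSampler shown earlier.
```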

Things That Didn’t Work
One of the experiments I spent a lot of time on was weighting up the region in embedding space where the model consistently struggled to learn the correct outcome. My hope was that if the model got to see those samples more often, it would eventually learn them better.
Unfortunately, it didn’t help. Those difficult frames seemed inherently ambiguous. That said, I believe this approach could work in a larger-scale setting with unlabeled data. You could run inference with a trained model, capture embeddings, and use the hard-to-learn region as a filter for active learning – targeting those samples for human annotation.
It’s a direction worth exploring further.
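As a rough illustration of that idea, selecting annotation candidates could be as simple as ranking unlabeled samples by their distance to the hard region in embedding space. The function below is a hypothetical sketch, not something from the project.

```python
import numpy as np

def select_for_annotation(unlabeled_embeddings, hard_region_embeddings, k=500):
    """Rank unlabeled samples by distance to the centroid of the region where
    the model struggled, and return the k closest as annotation candidates."""
    centroid = hard_region_embeddings.mean(axis=0)
    distances = np.linalg.norm(unlabeled_embeddings - centroid, axis=1)
    return np.argsort(distances)[:k]
```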
Key Insights
A few takeaways from the final phase of the project:
- The final training table scored 0.898 on 50% of the test set.
- It took 61 epochs to reach that, which amounts to roughly 366,000 training samples drawn from a pool of ~1.2 million frames. The number of unique samples was actually much lower than 366,000, since statistically about half the samples in each epoch came from the ~30,000 event frames, and there was also overlap between epochs due to the random weighted resampling.
- However, I achieved almost the same score with only 6–15 epochs, indicating that perhaps just ~2–5% of the data contained the most impactful training signals.
- This reinforces the importance of targeted data selection. Simply throwing more data at the model wouldn’t help unless that data actually contributes to the model’s understanding.
- An unsupervised pretraining approach on a larger chunk of unlabelled data, followed by supervised fine-tuning on intelligently curated data, could likely yield even stronger results.
Results
I was very happy with the improvement from 0.71 → 0.898 on the competition leaderboard, especially since my only changes were to the training data – no model architecture tweaks, no parameter sweeps!
This experiment underlined for me how powerful it can be to understand where your model struggles and to treat the data as the primary lever for performance.
The biggest challenge? Designing a fast but accurate validation strategy. The model converged quickly, so it was crucial that validation was fast and precise as well. I spent a lot of time tuning the validation subsampling to balance speed and precision, and while I never fully perfected it, it was good enough to guide my decisions.
Below is a view of the final revisions made from the initial table (I did, however, experiment a lot with other revisions before this).

Final Thoughts
This was an extremely fun challenge and I feel I learned a lot! Huge thanks to Nexar Inc. and Daniel Moura for hosting it!
Dataset provided by Nexar Inc.
Original article written by Paul Endresen and posted on LinkedIn on May 9, 2025: https://www.linkedin.com/pulse/nexar-dashcam-kaggle-challenge-paul-endresen-ayd2c/?trackingId=htKIkR7lQAYGrWHdFRolUA%3D%3D
Moura, Daniel C., and Zvitia, Orly. “Nexar Collision Dataset.” Hugging Face, 2025, https://huggingface.co/datasets/nexar-ai/nexar_collision_prediction.