Object tracking in video using a MOSSE tracker implemented in Rust

Thanks to @chriamue for building this browser-based demo using wasm-pack!

Tracking objects in videos is a typical problem in computer vision. Note that tracking is distinct from detection: we’re assuming we’ve already detected the object we want to track (by finding its centroid or bounding box). When working on tracking, we want to find the object of interest again in subsequent video frames.

Numerous tracking algorithms exist. For a nice overview, check out the 2010 paper ‘Visual Object Tracking using Adaptive Correlation Filters’ by David S. Bolme et al. In this paper, the authors discuss the state of the art (at the time) and present a novel tracking algorithm called MOSSE. As I understand it, MOSSE came to dominate the state of the art at the time because of its sheer speed (realtime 30+ FPS tracking on commodity hardware) combined with high tracking fidelity.

To better understand this algorithm and the domain of Object Tracking in general, I took a stab at implementing this algorithm from scratch in Rust.

Due to the algorithm’s speed and low resource footprint, I found that it was quite trivial to track multiple objects simultaneously through the same video. While objects are tracked independently, it turned out to be easy to share the results of certain heavy operations on a frame between trackers.
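To make the sharing idea concrete, here is a minimal sketch of one way it could look. It is not the code from the repository. It assumes that the shared "heavy operation" is the grayscale conversion of the frame, and the names GrayFrame, Tracker, to_gray and crop_all are hypothetical, made up for this illustration. The frame is converted once, after which every tracker crops its own window from the same buffer.

```rust
/// Grayscale frame, computed once per video frame and shared by all trackers.
/// (Hypothetical type, for illustration only.)
struct GrayFrame {
    width: usize,
    height: usize,
    pixels: Vec<f32>,
}

/// Convert interleaved 8-bit RGB to grayscale using the Rec. 601 luma weights.
fn to_gray(rgb: &[u8], width: usize, height: usize) -> GrayFrame {
    assert_eq!(rgb.len(), width * height * 3, "expected 3 bytes per pixel");
    let pixels: Vec<f32> = rgb
        .chunks_exact(3)
        .map(|p| 0.299 * p[0] as f32 + 0.587 * p[1] as f32 + 0.114 * p[2] as f32)
        .collect();
    GrayFrame { width, height, pixels }
}

/// Hypothetical tracker that only remembers the window it is following.
struct Tracker {
    center_x: usize,
    center_y: usize,
    window: usize,
}

impl Tracker {
    /// Crop this tracker's square window from the shared frame.
    /// Coordinates that fall outside the frame are clamped to the nearest edge.
    fn crop(&self, frame: &GrayFrame) -> Vec<f32> {
        let half = self.window / 2;
        let mut out = Vec::with_capacity(self.window * self.window);
        for dy in 0..self.window {
            for dx in 0..self.window {
                let x = (self.center_x + dx).saturating_sub(half).min(frame.width - 1);
                let y = (self.center_y + dy).saturating_sub(half).min(frame.height - 1);
                out.push(frame.pixels[y * frame.width + x]);
            }
        }
        out
    }
}

/// Preprocess the frame once, then let every tracker read from the same buffer.
fn crop_all(rgb: &[u8], width: usize, height: usize, trackers: &[Tracker]) -> Vec<Vec<f32>> {
    let gray = to_gray(rgb, width, height);
    trackers.iter().map(|t| t.crop(&gray)).collect()
}
```

The point of the design is that the cost of to_gray is paid once per frame, regardless of how many trackers are active. Each tracker only pays for its own small window.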

I tested my implementation against a busy bike traffic scene. This scene is particularly challenging because the moving objects in it tend to overlap. The video below should give an impression of the results:

Overall, I found the algorithm to be relatively easy to implement, primarily because I could use the rustfft and image libraries to abstract away the complexities of dealing with Fourier transforms and images. I didn’t bother integrating with FFmpeg or another video decoding lib, so my MOSSE implementation only works on video frames stored as image files for now.

I ended up being pretty impressed by the algorithm. Tracking multiple objects in a video turned out to be very fast (sub-5ms per frame on 720p). I hardly paid attention at all to numerical optimizations and the potential for parallelism. There’s probably quite a bit of low-hanging fruit in terms of performance.

Like most algorithms, MOSSE is pretty sensitive to its hyperparameters. The learning rate in particular largely determines tracking fidelity. As the algorithm tries to update its representation of the object it is tracking with each frame, high learning rates may lead to ‘promiscuous’ trackers. With a high learning rate and a busy scene, it is likely that the tracker will start following a different object when it overlaps with the tracked object for a long time. This is nicely visible in the example video (note the scooter). The only way to remedy this is to tune the learning rate and other hyperparameters to better ‘fit’ the scene you are tracking in. Conversely, setting the learning rate too low will make it hard to track objects that move fast or change shape rapidly. Additionally, in my experience, using higher resolution video makes it easier to track in busy scenes.
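To show where the learning rate enters, here is a minimal sketch of the running-average update from the MOSSE paper. It is not a copy of the code in my repository. The paper keeps a numerator A and a denominator B in the frequency domain and blends each new frame into them with weight η (the learning rate): A = η·G·conj(F) + (1−η)·A_prev and B = η·F·conj(F) + (1−η)·B_prev. The Complex and FilterState types and the update_filter function are hypothetical names for this illustration. A tiny hand-rolled complex type is used so the sketch needs no external crates.

```rust
/// Minimal complex number, so the sketch needs no external crates.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Complex {
    re: f32,
    im: f32,
}

impl Complex {
    fn conj(self) -> Complex {
        Complex { re: self.re, im: -self.im }
    }
    fn mul(self, o: Complex) -> Complex {
        Complex {
            re: self.re * o.re - self.im * o.im,
            im: self.re * o.im + self.im * o.re,
        }
    }
    fn scale(self, s: f32) -> Complex {
        Complex { re: self.re * s, im: self.im * s }
    }
    fn add(self, o: Complex) -> Complex {
        Complex { re: self.re + o.re, im: self.im + o.im }
    }
}

/// Running numerator and denominator of the filter, in the frequency domain.
struct FilterState {
    numerator: Vec<Complex>,   // A in the paper
    denominator: Vec<Complex>, // B in the paper
}

/// Blend the current frame into the filter state.
/// `f` is the FFT of the preprocessed window around the object,
/// `g` is the FFT of the desired (Gaussian-shaped) response.
fn update_filter(state: &mut FilterState, f: &[Complex], g: &[Complex], learning_rate: f32) {
    assert_eq!(f.len(), g.len());
    assert_eq!(f.len(), state.numerator.len());
    assert_eq!(f.len(), state.denominator.len());
    let keep = 1.0 - learning_rate; // weight given to everything seen so far
    for i in 0..f.len() {
        let a_new = g[i].mul(f[i].conj());
        let b_new = f[i].mul(f[i].conj());
        state.numerator[i] = a_new.scale(learning_rate).add(state.numerator[i].scale(keep));
        state.denominator[i] = b_new.scale(learning_rate).add(state.denominator[i].scale(keep));
    }
}
```

With a learning rate of 1.0 the filter forgets everything but the current frame, which is exactly the ‘promiscuous’ behaviour described above. With a learning rate of 0.0 the filter never changes after initialization, so it cannot follow changes in the object's appearance.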

I have hosted the code and related instructions on GitHub at https://github.com/jjhbw/mosse-tracker.git. You can find the original video of the traffic scene used in the example here.

All ideas for improvements are of course welcome!