Computer vision: video background subtraction using a Gaussian Mixture Model in Rust

To get more of a feel for computer vision techniques and their applications, I decided to dive into one of computer vision’s most described applications: traffic monitoring. I wanted to not use OpenCV for once, and decided this was a great excuse to explore Rust, a modern systems language.

The full implementation discussed in this post can be found on Github.

Goal

Count traffic in a fixed-view, color surveillance camera feed.

Let’s cut this problem up into pieces that we can solve one at the time:

  1. Separate the moving foreground objects (the traffic) from the stationary background (the road, trees, traffic signs, etc.)
  2. Extracting the ‘features’ that define each foreground object (e.g. using corner or contour detection)
  3. Tracking these objects across video frames using the extracted features
  4. Optionally, we can classify each object (is it a car, cyclist, van, or bus?).

In this post I’ll discuss the first step: how to separate the traffic (the ‘foreground’) from the background of the video feed. This task is usually called background subtraction in computer vision lingo.

Background subtraction is a crucial step in many computer vision pipelines, making it a subject of intensive study. Many background subtraction algorithms have been developed in the last two decades. For reference, you can take a look at the brilliant bgslibrary, an extensive C++ library of background subtraction algorithms based on OpenCV. If you want to know more about this topic, the literature of the bgslibrary maintainer, Andrews Sobral, is a nice place to start.

In addition to the methods implemented in bgslibrary, OpenCV itself has some pretty nice functions for background subtraction, contour detection and object tracking. However, simply using these won’t really teach me anything about the algorithms involved. Moreover, it will lock me into using either C++ or OpenCV’s awkward Python wrappers. This being a learning project, I’d rather take the opportunity to build everything from scratch in a language I wanted to explore for some time: Rust.

Rust claims to be a systems programming language with compile-time memory safety guarantees and first class concurrency support, aiming for the niche currently occupied by the likes of C / C++. Ostensibly, its zero-cost abstractions make for highly performant code without sacrificing readability.

Modelling the background

To separate the foreground from the background of an image, you first need a model of the background. Simple techniques like just taking the median of a set of frames won’t cut it in complex dynamic scenes, as they are too easily fooled by variations in lighting.

To solve this, many probabilistic models have been developed over the years. In this post, I will discuss a method based on a Gaussian Mixture Model as it is relatively straightforward and computationally inexpensive.

In this method, the probability of a pixel belonging to the background at a certain time is modelled using an independent Gaussian mixture model for each pixel. The parameters for each pixel’s model are updated with each new frame. This method is in essence an on-line unsupervised clustering algorithm and was first described by Stauffer and Grimson in 1999. In 2004, it was slightly reformulated and extended by Zivkovic. For this post, I implemented the Stauffer-Grimson algorithm the way it is described in the Zivkovic paper, but did not implement Zivkovic’ extension as I wanted to see how well the original performed first.

While the state of the art may have moved on since 2004, I appreciate this method for its elegance and its relative computational efficiency (compared to e.g. deep-learning based methods).

In the sections below, I will briefly digest the algorithm as described in the Zivkovic paper.

The Gaussian Mixture Model (GMM)

For each pixel in a new frame, the decision that a pixel belongs to the background at a certain time $t$ given the 3-sized (R,G,B) pixel value vector $\vec{x}^{\,t}$ can be formulated like this:

$$\frac{p(\vec{x}^{\,t}|BG)}{p(\vec{x}^{\,t}|FG)} > 1$$

This means that we say a pixel belongs to the background if its background probability divided by its foreground probability is larger than 1.

To estimate the background and foreground probabilities we must first describe the entire probability distribution of pixel values, without discriminating between potential background and foreground observations. To do this, we assume the following Gaussian Mixture Model consisting of $M$ Gaussian components.

$$\hat{p}(\vec{x}^{\,t}|X_{T},BG+FG)=\displaystyle\sum_{m=1}^{M} \hat{\pi}_{m}\mathcal{N}(\vec{x}^{\,t}; \hat{\vec{\mu}}_{m},\hat{\Sigma}_{m})$$

In this model:

  • $X_{T}$ is the $T$-sized set of previous observations (training data), i.e. $X_{T} = \{ \vec{x}^{\,t}, ..., \vec{x}^{\,t-T} \}$
  • $\hat{\pi}_{m}$ is the (nonnegative) mixing weight, a parameter that indicates the ‘importance’ of Gaussian component $m$. The sum of the mixing weights of all components in the model should add up to 1. This does not follow directly from the math, so we are going to have to enforce this in the implementation (discussed below).
  • $\mathcal{N}$ denotes the multivariate Gaussian probability density function.
  • $\hat{\vec{\mu}}_{m}$ represents an estimate of the mean that describes Gaussian component $m$.
  • $\hat{\Sigma}_{m}$is the estimated covariance matrix that describes Gaussian component $m$. To limit mathematical and computational complexity, this covariance matrix is kept isotropic (i.e. a diagonal matrix with equal elements along the diagonal) and will thus look like this: $$\begin{bmatrix} \hat{\sigma^2}_{m} & 0 & 0 \\ 0 & \hat{\sigma^2}_{m} & 0 \\ 0 & 0 & \hat{\sigma^2}_{m} \end{bmatrix}$$ where $\hat{\sigma^2}_{m}$ is an estimate of the variance describing Gaussian component $m$. Considering only this shape of the covariance matrix will help us make some useful optimizations down the road.

To put it in words: the sum of the probabilities produced by the Gaussian components in $M$, weighted by their respective mixing weights $\hat{\pi}_{m}$ represents the sum of the background and foreground probabilities.

Updating the model parameters

To be able to use this model, we need good estimates of$\hat{\pi}_{m}$, $\hat{\vec{\mu}}_{m}$, and $\hat{\Sigma}_{m}$. We can do this in an on-line fashion, meaning we can update the model parameters with each new observation (i.e. pixel value $\vec{x}^{\,t}$). For each video frame, we apply the following recursive update equations to each component of each pixel’s GMM. For this, Zivkovic outlines the following equations:

$$\hat{\pi}_{m} \leftarrow \hat{\pi}_{m} + \alpha(o^{(t)}_{m} - \hat{\pi}_{m})$$
$$\hat{\vec{\mu}}_{m} \leftarrow \hat{\vec{\mu}}_{m} + o^{(t)}_{m}( \alpha / \hat{\pi}_{m}) \vec{\delta}_{m} $$
$$\hat{\sigma^2}_{m} \leftarrow \hat{\sigma^2}_{m} + o^{(t)}_{m} (\alpha / \hat{\pi}_{m} ) (\vec{\delta}_{m}^\intercal \vec{\delta}_{m}-\hat{\sigma^2}_{m})$$

Where:

$$\vec{\delta}_{m} = \vec{x}^{\,t} - \hat{\vec{\mu}}_{m} $$

The $\alpha$ parameter represents the learning rate, which we will discuss later.

$o^{(t)}_{m}$ is the ‘ownership parameter’, which is set to 1 if the new observation $\vec{x}^{\,t}$ is ‘close’ to the component $m$, and 0 if it is not. To determine whether the observation is ‘close’, the following heuristic based on the Mahalanobis distance$D^2$ is used:

$$ D^2 (\vec{x}^{\,t}) = \vec{\delta}_{m}^\intercal \vec{\delta}_{m}/\hat{\sigma^2}_{m} $$

So, when we use a ‘closeness’ cutoff value of 3:

$$ o^{(t)}_{m} = \begin{cases} 1, & \text{if } D^2 (\vec{x}^{\,t}) \leq 3\\ 0, & \text{otherwise} \end{cases} $$

In summary, we update the parameters of each Gaussian component depending on its current parameter values and on whether the observation $\vec{x}^{\,t}$ qualifies as being ‘close’ to this component (i.e. the component ‘explains’ the observation).

After each model update cycle, we have to normalise the mixing weights of all components in the model to ensure they add up to one.

Lifecycle of the Gaussian components

When none of the existing Gaussian components are ‘close’, we generate a new component to explain the observation using:

$$\hat{\pi}_{M+1} = \alpha$$
$$\hat{\vec{\mu}}_{M+1} = \vec{x}^{\,t}$$
$$\hat{\sigma^2}_{M+1} = \sigma_{0}$$

Where $\sigma_{0}$ is some sensible value for the initial variance. I picked 15.

Before starting the training procedure, we decide on the maximum number of Gaussian components that each pixel’s GMM may consist of. Increasing this number increases the flexibility of the GMM and may improve the fit for more complex situations, such as recurring fluctuations in lighting. However, having to compute and update more components will also increase the model’s computational complexity.

If the maximum number of components is reached but none of them adequately explain the new observation $\vec{x}^{\,t}$ (i.e. none of the components are ‘close’), we discard the components with the smallest mixing weight $\hat{\pi}_{m}$ before adding the new component.

Separating background and foreground probabilities

Once the online training stabilises, we end up with a decent model of each pixel’s probability distribution $\hat{p}(\vec{x}^{\,t}|X_{T},BG+FG)$. In other words, we can compute the probability of a certain pixel value occurring, but we still don’t know whether a certain value belongs to the background or the foreground.

To do this, we need a heuristic to tell which of the Gaussian components in the model describe the background and which the foreground. We can rely on a pretty simple assumption: foreground objects are ‘rare’ and the Gaussian components describing their pixel values will have lower mixing weights than those describing background objects. Consequently, all we need to do is find the set $B$ containing the largest components so we can compute just the background probability:

$$\hat{p}(\vec{x}^{\,t}|X_{T},BG) \sim \displaystyle\sum_{m=1}^{B} \hat{\pi}_{m}\mathcal{N}(\vec{x}^{\,t}; \hat{\vec{\mu}}_{m},\hat{\Sigma}_{m})$$

To find $B$, Zivkovic takes the following heuristic, initially presented in the Stauffer and Grimson paper:

$$B = \underset{b}{\arg\min}(\displaystyle\sum_{m=1}^{b}\hat{\pi}_{m} \gt (1 - c_{f}))$$

In words: we need to find the smallest set of $b$ components that together have an aggregate mixing weight that is higher than a certain value that we choose beforehand, represented by $c_{f}$. Choosing lower (more ‘strict’) values for $c_{f}$ will lead to higher background probability estimates across the image, reducing false positives at the cost of sensitivity to foreground objects.

We can simplify the computation of $B$by ensuring that the components are sorted by order of decreasing mixing weight $\pi_{m}$ after each model update cycle.

The learning rate

The learning rate, $\alpha$ determines how many frames the model ‘remembers’ when making predictions on a new frame. It is easier to reason about this rate by thinking of it as:

$$\alpha = 1 / T$$

Where $T$ is the size of the ‘history’ (i.e. the number of previous training frames on which the current state of the model is based). We need to decide on a rational value for $T$. To help us decide, we can reason about it like this: a higher value for $T$ gives us a lower learning rate $\alpha$, and ensures that an object that enters the frame needs to remain static for a longer period of time before it is considered part of the background. To estimate the number of frames that an object needs to remain still in the video, Zivkovic provides the following formula:

$$log( 1 - c_{f}) / log(1-\alpha)$$

When using $c_{f} = 0.1$ and $\alpha = 0.001$, we can calculate that the object will need to remain still (i.e. pixels belonging to the object will need to be approximately the same value) for 105 frames.

Dampening the learning rate

Note that we use $\alpha$ as the initial value for $\hat{\pi}_{M+1}$ when a new component is added to the GMM. This means that, when we start training the GMM on a completely new scene, the first few frames have an outsized impact on the naive model. To remedy this, we initiate $\alpha$ at 1.0, and update it as $\alpha = 1 / T$ with each new frame until $\alpha_{min}$ is reached. $\alpha_{min}$ can be considered a hyperparameter in my implementation of the model, and it is set to a fixed value before training.

Rust implementation details

Massive parallelisation

As you may already have realised, model updates and background predictions only consider one pixel at a time, without sharing any information between them. This makes parallelising the algorithm a no-brainer. Writing a parallel implementation is made even easier by Rust’s fantastic concurrency support.

Consider the following method method of the BackgroundModel struct, which takes a reference to a new frame from the video feed as a RgbImage.

pub fn update(&mut self, frame: &RgbImage) {
    self.pixel_models
        .iter_mut()
        .for_each(|(x, y, model)| {
            model.update(&pixel_to_vector(frame.get_pixel(*x, *y)));
        });
}

We iterate over the collection of models (one GMM for each pixel), yielding a mutable reference to each model along with the coordinates of the pixel to which it corresponds (x and y). We apply the .update() routine to each.

The Rayon crate makes it extremely trivial to run this operation in parallel. Below is the parallel implication of the same function. Try to spot the difference!

pub fn update(&mut self, frame: &RgbImage) {
    self.pixel_models
        .par_iter_mut() // <-- spoiler alert: it's in here.
        .for_each(|(x, y, model)| {
            model.update(&pixel_to_vector(frame.get_pixel(*x, *y)));
        });
}

With this minor change, the model update operations for all pixels are spread across $k$ threads. The value of$k$ is decided automatically by Rayon, defaulting to the number of physical CPU cores.

Optimising the multivariate Gaussian probability density function

Each time the background of a new frame needs to be predicted, we need to compute the multivariate Gaussian probability density for each component in each pixel’s GMM. This means we need to invoke the corresponding function a lot.

Luckily, several assumptions and computational simplifications made by the Stauffer-Grimson method allow us to remove several expensive linear algebra operations.

The function below takes the pixel value $\vec{x}^{\,t}$ as x, the mean of the component $\hat{\vec{\mu}}_{m}$ as mu and the variance $\hat{\sigma^2}_{m}$ as variance.

While technically not necessary (could just be stack-allocated arrays, not enough time to refactor), we use the Vector3<f32> data structures from the brilliant nalgebra library.

fn simplified_mvn_density(x: &Vector3<f32>, mu: &Vector3<f32>, variance: f32) -> f32 {
    let pixel_vector_size = 3;
    let inv_cov = &Matrix3::from_diagonal_element(1.0 / variance);
    let determinant = variance.powi(pixel_vector_size);

    // right hand side of the equation
    let rhs = inv_cov * (x - mu);

    // self.exp() means e^(self)
    // self.tr_dot() returns the dot product between the transpose of self and rhs.
    let top = (-0.5 * (x - mu).transpose().tr_dot(&rhs)).exp();
    let bottom = ((2.0 * f32::consts::PI).powi(pixel_vector_size) * determinant).sqrt();
    return top / bottom;
}

Because the algorithm uses a simple, isotropic covariance matrix (i.e. matrix is diagonal and all diagonal elements are equal), we can:

  • simply pass the function the value of $\hat{\sigma^2}_{m}$ as variance instead of the full covariance matrix
  • compute the determinant simply by taking the product of all diagonal elements
  • compute the inverse of the covariance matrix simply by taking the inverse of the diagonal entries.

Initially, my implementation did not exploit this simplification and instead computed the full matrix inverse and determinant using methods implemented in Nalgebra. The above optimisations produced a nice speedup (benchmarked using the awesome Criterion package on a 200x200 test image, single-threaded on my 3,1 GHz Intel Core i5 MacBook). Based on 5k iterations, we can see the that the new implementation (blue) is some 20% faster than the naive implementation (red).

Results

Segmentation quality

Due to the immaturity of Rust’s mp4/video parsing library ecosystem, I wrapped my implementation of the algorithm in a simple CLI tool that takes a collection of .png files as input data.

To test my implementation, I pulled up some datasets used for benchmarking background subtraction algorithms from this site, conveniently hosted by the University of Naples, Italy (thanks!).

I applied the algorithm on several benchmark scenes using the following parameters:

let settings = GaussianMixtureSettings {
    alphamin: 0.0001,
    max_components: 5,
    initial_variance: 15.0,
    cf: 0.1,
    mahal_threshold :  3.0 * 3.0, // note: we provide D^2
};

In the examples below, you can see the original video on the left and the binary mask representing the foreground/background segmentations on the right.

The HighwayI dataset

The HighwayII dataset

A few things stand out:

  • Segmentation works reasonably well. The algorithm quickly builds a decent model of each scene’s background.
  • Segmented objects in the binary mask are not fully ‘closed’. Smaller objects may look like ‘clouds’ of pixels. We may be able to solve this using a morphological closing filter, like the one implemented in the imageproc library or by tuning the model parameters (primarily the $c_{f}$ threshold, see below).
  • The shadow that a foreground object casts on the background is considered part of the object (mainly in HighwayI). This is ugly but expected, because we have not implemented any way to distinguish shadows. Shadow detection is an extension of the general background detection problem and is usually solved with an extra post-processing step. We will leave it for now.
  • The wind perturbing the foliage, like in the rightmost part of HighwayII, leads to some false positive noise. I think this is an inherent weakness of a pixel-local method.

Performance benchmarks

Let’s take a look at some benchmarks. All of them were obtained using the Criterion package on my 3.1 GHz Intel Core i5 (MacBook), while forcing Rayon to use only a single thread using the RAYON_NUM_THREADS=1 environment variable.

Updating the model

The relationship between the size of the input video frame and the time it takes to fully update the model with it is best explained using the following violin plot.

As you can see, the processing time per frame increases linearly with the size of the image. It stands to reason that the processing time also has an approximately linear inverse relationship with the number of available parallel threads.

Computing segmentations (predictions)

The figure below shows the time it takes for the model to segment the background of a single 320 x 240 frame, which averages at around 35ms. This is significantly higher than the time it takes to update the model, mainly because the computations are more demanding.

I think there is also some memory allocation overhead involved in building the binary mask, which is fixable by allocating a single binary image at startup that is overwritten with each prediction.

Tuning the parameters

Let’s see what the algorithm parameters actually do to impact segmentation quality.

Cf threshold

I expected that we may be able to reduce the noise in the segmented foreground objects by increasing the $c_{f}$ value. If we take a look at the corresponding equations, increasing $c_{f}$ should make the model assign a larger part of the probability distribution to the foreground. This may help in reducing ‘false negatives’, if you will.

If you take a closer look at the video below, we can see that the segmentation improves as expected. A lot of background noise still needs to be filtered, though.

On the left $c_{f} = 0.1$ and on the right $c_{f} = 0.3$.

Max. number of components

Theoretically, increasing the maximum number of components should make for a more flexible Gaussian mixture model, allowing it to fit more complex probability distributions. Adding additional components also adds serious computational complexity, so this change definitely does not come free.

Comparing max_components settings of 3, 5, 8 and 10, I did not observe any improvement in segmentation quality. This may be because the test data I have available is not sufficiently complex to benefit from a higher max_components setting. To see the impact of this parameter, we would first need training data with a more complex dynamic lighting situation, for example.

Conclusion

All things considered, this algorithm seems very well suited for background subtraction in simple static scenes. Surprisingly, the model parameters don’t require excessive tuning for new scenes and some sensible defaults work fine on new scenes out-of-the-box.

The algorithm struggles with complex lighting situations and with moving foliage. Shadows are also a problem, as they are usually an undesirable part of the foreground and the current model has no way of distinguishing a foreground object from the shadow it casts on the background. Maybe I will look into extending the model with a shadow detection module later on.

Interestingly, a more recent background subtraction method called LaBGeN promises to be more resistant to noise like that introduced by dynamic lighting. Instead of working only at the pixel level, the authors also propose a version of their algorithm that relies on optical flow representations. This algorithm has a completely different foundation and would also make for interesting study material.

This project gave me a nice feel for the Rust language, which contains some really interesting ideas like lifetimes, easy concurrency, zero-cost abstractions, and intuitive metaprogramming with generics. It turned out to be extremely easy to parallelize my code thanks to Rust’s Rayon library. However, all this goodness does come with a learning curve, and I would say Rust takes a lot more time master than, for example, Go.

Another minor thing I thought Rust lacked when implementing this algorithm is GCC’s --ffast-math flag, which sacrifices the accuracy of floating point math in C / C++ code in a bid to make it run faster. Using accurate f32 floating point math for this algorithm is complete overkill and I bet we can squeeze out a lot more performance if the Rust compiler had a flag like that. Luckily, I am not the only one who wants this and an implementation may one day land in stable Rust.

The full implementation discussed in this post can be found on Github.