Optimization or Architecture: How to Hack Kalman Filtering


Why neural networks may seem better than the KF even when they are not — and how to both fix this and improve your KF itself

This post introduces our recent paper from NeurIPS 2023. Code is available on PyPI.

Background

The Kalman Filter (KF) has been a celebrated method for sequential forecasting and control since 1960. While many new methods have been introduced in recent decades, the KF’s simple design makes it a practical, robust and competitive method to this day. The original paper from 1960 has 12K citations in the last 5 years alone. Its broad applications include navigation, medical treatment, marketing analysis, deep learning and even getting to the moon.

Illustration of the KF on Apollo mission (image by author)

Technically, the KF predicts the state x of a system (e.g. a spaceship's location) from a sequence of noisy observations (e.g. radar or camera readings). It estimates a distribution over the state (e.g. a location estimate plus its uncertainty). At every time-step, it predicts the next state according to the dynamics model F and increases the uncertainty according to the dynamics noise Q. At every observation, it updates the state and its uncertainty according to the new observation z and its noise R.

An illustration of a single step of the KF (image by author)
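To make the predict/update cycle concrete, here is a minimal NumPy sketch of a single KF step. It illustrates the standard equations rather than code from our package; H denotes the observation model, which maps states to observations and is left implicit in the text above.

```python
import numpy as np

def kf_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman Filter.

    x, P : current state estimate and its covariance
    z    : new observation
    F, H : dynamics and observation models
    Q, R : dynamics-noise and observation-noise covariances
    """
    # Predict: propagate the state through the dynamics F,
    # and inflate the uncertainty by the dynamics noise Q.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q

    # Update: weigh the prediction against the observation z,
    # with the Kalman gain K balancing P_pred against the noise R.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```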

Kalman Filter or a Neural Network?

The KF prediction model F is linear, which is somewhat restrictive. So we built a fancy neural network on top of the KF. This gave us better prediction accuracy than the standard KF! Good for us!

Finalizing our experiments for a paper, we conducted some ablation tests. In one of them, we removed the neural network completely and just optimized the internal KF parameters Q and R. Imagine the look on my face when this optimized KF outperformed not only the standard KF, but also my fancy network! The exact same KF model, with the same 60-year-old linear architecture, becomes superior just by changing the values of its noise parameters.

Prediction errors of the KF, the Neural KF (NKF), and the Optimized KF (OKF) (image from our paper)

The KF beating a neural network is interesting, but anecdotal to our particular problem. More important is the methodological insight: before these extended tests, we were about to declare the network superior to the KF — just because we hadn’t compared the two properly.

Message 1: To make sure that your neural network is actually better than the KF, optimize the KF just as carefully as you optimize the network.

Remark — does this mean that the KF is better than neural networks? We certainly make no such general claim. Our claim is about the methodology — that both models should be optimized similarly if you’d like to compare them. Having said that, we do demonstrate *anecdotally* that the KF can be better in the Doppler radar problem, despite the non-linearity of the problem. In fact, this was so hard for me to accept that I lost a bet on my neural KF, along with many weeks of hyperparameter optimization and other tricks.

Optimizing the Kalman Filter

When comparing two architectures, optimize them similarly. Sounds somewhat trivial, doesn’t it? As it happens, this flaw was not unique to our research: in the literature of non-linear filtering, the linear KF (or its extension EKF) is usually used as a baseline for comparison, but is rarely optimized. And there is actually a reason for that: the standard KF parameters are “known” to already yield optimal predictions, so why bother optimizing further?

The standard closed-form equation for the KF parameters Q and R. We focus on the settings where offline data is available with both states {x} and observations {z}, hence the covariances can be calculated directly. Other methods to determine Q and R are typically intended for other settings, e.g. without data of {x}.
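As a rough sketch of what this closed-form estimation looks like in code (an illustration under our own naming, not the package API; F and H denote the dynamics and observation models), the noise covariances are simply the sample covariances of the model residuals over the offline data:

```python
import numpy as np

def closed_form_noise(X, Z, F, H):
    """Estimate Q and R directly from offline data.

    X : states of shape (T, dim_x), i.e. x_1..x_T
    Z : observations of shape (T, dim_z), i.e. z_1..z_T
    """
    # Dynamics residuals: deviation of x_{t+1} from F x_t.
    dyn_res = X[1:] - X[:-1] @ F.T
    # Observation residuals: deviation of z_t from H x_t.
    obs_res = Z - X @ H.T

    # Q and R are taken as the sample covariances of the residuals.
    Q_hat = np.cov(dyn_res, rowvar=False)
    R_hat = np.cov(obs_res, rowvar=False)
    return Q_hat, R_hat
```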

Unfortunately, the optimality of the closed-form equations does not hold in practice, as it relies on a set of quite strong assumptions, which rarely hold in the real world. In fact, in the simple, classic, low-dimensional problem of a Doppler radar, we found no fewer than 4 violations of the assumptions. In some cases, the violation is tricky to even notice: for example, we simulated iid observation noise — but in spherical coordinates. Once transformed to Cartesian coordinates, the noise is no longer iid!

The violation of 4 KF assumptions in the Doppler radar problem: non-linear dynamics; non-linear observations; inaccurate initial state distribution; and non-iid observation noise. (image by author)

Message 2: Do not trust the KF assumptions, and thus avoid the closed-form covariance estimation. Instead, optimize the parameters with respect to your loss — just as with any other prediction model.

In other words, in the real world, noise covariance estimation is no longer a proxy for minimizing the prediction error. This discrepancy between the objectives creates surprising anomalies. In one experiment, we replace noise estimation with an *oracle* KF that knows the exact noise in the system. This oracle is still inferior to the Optimized KF — since the desired objective is not accurate noise estimation, but accurate state prediction. In another experiment, the KF *deteriorates* when it is fed more data, since it effectively pursues a different objective than the MSE!

Test errors vs. train data size. The standard KF is not only inferior to the Optimized KF, but also deteriorates with the data, since its parameters are not set to optimize the desired objective. (image from our paper)

So how to optimize the KF?

Behind the standard noise-estimation method for KF tuning stands the view of the KF parameters as representations of the noise. This view is beneficial in some contexts. However, as discussed above, for the sake of optimization we should “forget” about this role of the KF parameters and just treat them as model parameters, whose objective is loss minimization. This alternative view also tells us how to optimize: just like any sequential prediction model, such as an RNN! Given the data, we simply make predictions, calculate the loss, backpropagate the gradients, update the model, and repeat.

The main difference from an RNN is that the parameters Q and R are covariance matrices, so they must remain symmetric and positive definite. To handle this, we use the Cholesky decomposition to write Q = LLᵀ, and optimize the entries of L. This guarantees that Q remains positive definite regardless of the optimization updates. The same trick is used for R.
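Here is a minimal PyTorch-style sketch of this parameterization (names are illustrative, not the actual package API): the learnable parameter is a free lower-triangular factor L, and the covariance is rebuilt as Q = LLᵀ on every forward pass.

```python
import torch

class CholeskyCov(torch.nn.Module):
    """A covariance matrix parameterized by its Cholesky factor L."""

    def __init__(self, dim):
        super().__init__()
        # Initialize L to the identity, i.e. Q = I.
        self.L_raw = torch.nn.Parameter(torch.eye(dim))

    def forward(self):
        # Keep only the lower triangle, so L is a valid Cholesky factor;
        # Q = L @ L.T is then symmetric and positive (semi-)definite,
        # no matter how the optimizer updates L_raw.
        L = torch.tril(self.L_raw)
        return L @ L.T
```

In practice, one may also pass the diagonal of L through exp or softplus to keep it strictly positive, which guarantees strict positive definiteness.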

Pseudocode of the OKF training procedure (from our paper)
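And here is a schematic, end-to-end sketch of such a training loop (illustrative names and a plain squared-error loss; this is not the actual OKF package code): the KF recursion is written with differentiable torch operations, so the gradient of the prediction error flows back into the Cholesky factors of Q and R.

```python
import torch

def train_okf(trajectories, F, H, epochs=10, lr=1e-2):
    """Gradient-based tuning of the KF noise parameters Q and R.

    trajectories : list of (X, Z) pairs: states X of shape (T, dim_x)
                   and observations Z of shape (T, dim_z), as tensors
    F, H         : fixed dynamics and observation models (tensors)
    """
    dim_x, dim_z = F.shape[0], H.shape[0]
    # Learnable Cholesky factors of Q and R (see the previous snippet).
    LQ = torch.nn.Parameter(torch.eye(dim_x))
    LR = torch.nn.Parameter(torch.eye(dim_z))
    opt = torch.optim.Adam([LQ, LR], lr=lr)

    for _ in range(epochs):
        for X, Z in trajectories:
            Q = torch.tril(LQ) @ torch.tril(LQ).T
            R = torch.tril(LR) @ torch.tril(LR).T
            # Initial uncertainty taken as the identity for simplicity.
            x, P = X[0], torch.eye(dim_x)
            loss = 0.0
            for t in range(1, len(X)):
                # Standard KF predict/update, written with torch ops
                # so that gradients reach LQ and LR.
                x_pred = F @ x
                P_pred = F @ P @ F.T + Q
                S = H @ P_pred @ H.T + R
                K = P_pred @ H.T @ torch.linalg.inv(S)
                x = x_pred + K @ (Z[t] - H @ x_pred)
                P = (torch.eye(dim_x) - K @ H) @ P_pred
                # The loss is the state-estimation error (MSE-style),
                # not the fit of Q and R to the "true" noise.
                loss = loss + torch.sum((x - X[t]) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Return the optimized covariances.
    with torch.no_grad():
        return torch.tril(LQ) @ torch.tril(LQ).T, torch.tril(LR) @ torch.tril(LR).T
```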

This optimization procedure proved fast and stable in all of our experiments, as the number of parameters is several orders of magnitude smaller than in typical neural networks. And while the training is easy to implement yourself, you may also use our PyPI package, as demonstrated in this example 🙂

Summary

As summarized in the diagram below, our main message is that the KF assumptions cannot be trusted, and thus we should optimize the KF directly — whether we use it as our primary prediction model, or as a reference for comparison with a new method.

Our simple training procedure is available on PyPI. More importantly, since our architecture remains identical to the original KF, any system using a KF (or Extended KF) can be easily upgraded to OKF just by re-learning the parameters — without adding a single line of code at inference time.

Summary of our scope and contribution. Since the KF assumptions are often violated, the KF must be optimized directly. (image from our paper)

