A Transport View of Filtering
Interpreting Bayes’ Law via Optimal Transport for Filtering Problems
This blog summarizes the works (Taghvaei & Hosseini, 2021; Taghvaei & Hosseini, 2022; Al-Jarrah et al., 2024), which develop an optimal transport perspective on Bayes’ law for updating the conditional distribution of the state given new observations in filtering problems.
Preliminaries on Optimal Transportation (OT) Theory
The Monge problem aims to find the optimal transport map $\mathrm{T}$ that pushes a probability measure $\mu$ on $\mathbb{R}^n$ forward to a probability measure $\nu$ on $\mathbb{R}^n$, i.e. \(\mathrm{T}_{\mathrm{\#}} \mu=\nu\). With a quadratic cost, the optimization problem reads:
\[\min_{\mathrm{T}\in \mathcal{T}(\mu, \nu)} \mathbb{E}_{\mathrm{Z}\sim \mu} \bigg[\frac{1}{2} \| \mathrm{T}(\mathrm{Z}) - \mathrm{Z}\|_2^2\bigg],\]where $\mathcal{T}(\mu, \nu)$ denotes the set of all maps that push $\mu$ forward to $\nu$. The dual formulation of this problem is known as the Monge-Kantorovich (MK) dual problem
\[\mathrm{\min_{f\in \text{CVX}(\mu)} \mathbb{E}_{\mathrm{Z} \sim \mu}[f(\mathrm{Z})] + \mathbb{E}_{\mathrm{Z} \sim \nu}[f^*(\mathrm{Z})]},\]where \(\mathrm{f^*}\) denotes the convex conjugate of $\mathrm{f}$, i.e. \(\mathrm{f^*(x)}=\mathrm{\sup_{z\in \mathbb{R}^n} z^T x - f(z)}\), and \(\text{CVX}(\mu)\) denotes the set of all convex, $\mathrm{\mu}$-integrable functions on $\mathrm{\mathbb{R}^n}$. When $\mu$ is absolutely continuous with respect to the Lebesgue measure, the MK dual problem above has a unique minimizer $\bar f$ (up to an additive constant). The map $\mathrm{\overline T}=\nabla \bar f$ is the solution to the Monge problem and is often referred to as the Brenier transport map (Peyré & Cuturi, 2019).
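As a concrete example, when both measures are Gaussian, $\mu=\mathcal{N}(m_\mu, \Sigma_\mu)$ and $\nu=\mathcal{N}(m_\nu, \Sigma_\nu)$ with $\Sigma_\mu \succ 0$, the Brenier map is affine (Peyré & Cuturi, 2019):
\[\begin{align} \overline{\mathrm{T}}(z) = m_\nu + A(z - m_\mu), \qquad A = \Sigma_\mu^{-\frac{1}{2}}\big(\Sigma_\mu^{\frac{1}{2}} \Sigma_\nu \Sigma_\mu^{\frac{1}{2}}\big)^{\frac{1}{2}} \Sigma_\mu^{-\frac{1}{2}}. \notag \end{align}\]The same matrix structure reappears below in the connection to the ensemble Kalman filter.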
A Variational Formulation of Bayes’ Law
The conditional distribution of the state given new observations follows from Bayes’ law:
\[\begin{align} \mathrm{P}_{\mathrm{X|Y}}(\mathrm{x}|\mathrm{y})= \dfrac{\mathrm{P_X}(\mathrm{x}) \mathrm{P}_{\mathrm{Y|X}}(\mathrm{y|x})}{\mathrm{P}_{\mathrm{Y}}{(\mathrm{y})}},\notag \end{align}\]where $\mathrm{x}\in\mathbb{R}^n$ and $\mathrm{y}\in \mathbb{R}^m$. Alternatively, the posterior distribution can be viewed as the pushforward of the prior under an observation-dependent map. More specifically, we consider a map \(\mathrm{T}\) that, for each observation \(\mathrm{y}\), transports the prior \(\mathrm{P}_{\mathrm{X}}\) to the posterior \(\mathrm{P}_{\mathrm{X|Y}}(\cdot|\mathrm{y})\); equivalently, the map \(\mathrm{S}\) transports the independent coupling \(\mathrm{P}_{\mathrm{X}} \otimes \mathrm{P}_{\mathrm{Y}}\) to the joint distribution \(\mathrm{P}_{\mathrm{XY}}\) and takes the form
\[\begin{align} (\mathrm{x,y}) \mapsto \mathrm{S}(\mathrm{x, y})=(\mathrm{T}(\mathrm{x,y}), \mathrm{y}). \end{align}\]The corresponding OT problem with quadratic cost reads
\[\begin{align} \mathrm{\min_{S\in \mathcal{T}(P_X \otimes P_Y, P_{XY})} \mathbb{E}_{(X, Y)\sim P_X \otimes P_Y}\bigg[\| S(X,Y) - (X, Y) \|_2^2 \bigg]=\mathbb{E}_{(X, Y) \sim P_X \otimes P_Y} \bigg[\| T(X,Y) - X \|_2^2 \bigg]}.\notag \end{align}\]The equality holds because the second component of $\mathrm{S}$ leaves $\mathrm{y}$ unchanged. This optimal transportation problem is difficult to solve numerically in this form; instead, we solve the MK dual
\[\begin{align} \mathrm{\min_{f(\cdot, y)\in CVX(P_X)} J(f):=\mathbb{E}_{(X,Y)\sim P_X\otimes P_Y}[f(X, Y)] + \mathbb{E}_{(X,Y)\sim P_{XY}}[f^*(X, Y)],}\label{cost_func} \end{align}\]where the constraint \(\mathrm{f(\cdot, y) \in \text{CVX}(P_X)}\) means that, for every $\mathrm{y\in \mathbb{R}^m}$, $\mathrm{x \mapsto f(x, y)}$ is convex and $\mathrm{P_X}$-integrable. Here, $\mathrm{f^*(x, y)=\sup_z z^T x - f(z, y)}$ is the convex conjugate of $\mathrm{f(\cdot, y)}$ in its first argument for any fixed $\mathrm{y}$.
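Under appropriate conditions on $\mathrm{P_{XY}}$ (see Taghvaei & Hosseini, 2022), the minimizer $\bar f$ of this problem yields, for each observation, a map that pushes the prior forward to the corresponding posterior:
\[\begin{align} \mathrm{\big(\nabla_x \bar f(\cdot, y)\big)_{\#} P_X = P_{X|Y}(\cdot|y)} \quad \text{for } \mathrm{P_Y}\text{-almost every } \mathrm{y}, \notag \end{align}\]so sampling from the posterior amounts to evaluating $\mathrm{\nabla_x \bar f(\cdot, y)}$ at samples from the prior.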
Computational Algorithms
Given particles \(\mathrm{X_0^i}\sim \mathrm{P_X}\) and the corresponding observations \(\mathrm{Y_0^i \sim P_{Y|X}(\cdot | X_0^i)}\) for $\mathrm{i=1,2,\cdots, N}$, the collections \(\mathrm{\{(X_0^i, Y_0^j)\}_{i,j=1}^N}\) and \(\mathrm{\{(X_0^i, Y_0^i)\}_{i=1}^N}\) represent samples from the independent coupling $\mathrm{P_X\otimes P_Y}$ and the joint distribution $\mathrm{P_{XY}}$, respectively. We can now define an empirical approximation of the cost function \eqref{cost_func} as follows:
\[\begin{align} \mathrm{J^{(N)}(f):=\frac{1}{N^2} \sum_{i,j=1}^N f(X_0^i, Y_0^j) + \frac{1}{N} \sum_{i=1}^N f^*(X_0^i, Y_0^i)}.\notag \end{align}\]The empirical approximation is then utilized to formulate optimization problems of the form:
\[\mathrm{\min_{f\in \mathcal{F}} J^{(N)}(f)},\]where $\mathcal{F}$ is a set of functions \(\mathrm{f: \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}}\) such that \(\mathrm{x \mapsto f(x, y)}\) is convex for every $\mathrm{y}$.
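To illustrate how $\mathrm{J^{(N)}}$ is evaluated in practice, here is a minimal Python/NumPy sketch (not the implementation from the referenced papers) for the quadratic function family introduced in the next section, for which the convex conjugate is available in closed form; the function names and the toy data at the end are illustrative assumptions.

```python
import numpy as np

def make_quadratic_f(A, K, b):
    """f(x, y) = 0.5 x^T A x + x^T (K y + b), with A symmetric positive definite.
    For this family the convex conjugate in x has the closed form
    f*(x, y) = 0.5 (x - K y - b)^T A^{-1} (x - K y - b)."""
    A_inv = np.linalg.inv(A)

    def f(x, y):
        return 0.5 * x @ A @ x + x @ (K @ y + b)

    def f_star(x, y):
        r = x - K @ y - b
        return 0.5 * r @ A_inv @ r

    return f, f_star

def empirical_cost(f, f_star, X, Y):
    """J^(N)(f): the first term averages over all (i, j) pairs, i.e. samples from
    the independent coupling; the second term averages over matched pairs, i.e.
    samples from the joint distribution."""
    N = X.shape[0]
    term_indep = np.mean([f(X[i], Y[j]) for i in range(N) for j in range(N)])
    term_joint = np.mean([f_star(X[i], Y[i]) for i in range(N)])
    return term_indep + term_joint

# Toy usage: X in R^2, Y = first coordinate of X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X[:, :1] + 0.5 * rng.normal(size=(200, 1))
f, f_star = make_quadratic_f(A=np.eye(2), K=np.zeros((2, 1)), b=np.zeros(2))
print(empirical_cost(f, f_star, X, Y))  # value of J^(N) at this (A, K, b)
```

Minimizing this objective over $(A, K, b)$, e.g. with a standard optimizer while keeping $A$ positive definite, recovers the closed-form solution derived in the next section.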
Connections to Ensemble Kalman Filter
Consider a family of quadratic functions in $\mathrm{x}$ as follows:
\[\begin{align} \mathcal{F}_Q=\bigg\{\mathrm{(x,y)\rightarrow \frac{1}{2} x^\intercal A x + x^\intercal (Ky + b) \;\Big|\; A\in S_+^n, K\in \mathbb{R}^{n\times m}, b \in \mathbb{R}^n}\bigg\}, \notag \end{align}\]where $\mathrm{S^n_+}$ denotes the set of symmetric positive-definite matrices. The associated transport map is the gradient in $\mathrm{x}$:
\[\begin{align} \mathrm{\nabla_x f(x, y) = Ax + Ky + b.}\notag \end{align}\]Computing the conjugate $\mathrm{f^*(x, y)=\frac{1}{2}(x - Ky - b)^\intercal A^{-1}(x - Ky - b)}$ and repeatedly applying the cyclic property of the trace, Eq. \eqref{cost_func} reduces, up to additive constants independent of the parameters, to the formulation below:
\[\begin{align} &\mathrm{\min_{\theta \in \Theta} \frac{1}{2} Tr(A\Sigma_x) + \frac{1}{2} Tr(A^{-1}\Sigma_x) + \frac{1}{2} Tr(A^{-1} K\Sigma_y K^\intercal)} \notag\\ & \qquad\qquad\qquad\mathrm{- Tr(A^{-1} \Sigma_{xy} K^\intercal) + \frac{1}{2}(\tilde b - m_x)^\intercal A^{-1} (\tilde b - m_x)},\notag \end{align}\]where the minimization is over $\mathrm{\theta=(A, K, \tilde b)}$ with $\mathrm{A \in S_+^n}$, $\mathrm{m_x=E[X]}$, $\mathrm{m_y=E[Y]}$, and $\mathrm{\tilde b=b + Am_x + K m_y}$. $\mathrm{\Sigma_{x}, \Sigma_{y}, \Sigma_{xy}}$ denote the covariance matrices of $\mathrm{X}$ and $\mathrm{Y}$ and the cross-covariance between $\mathrm{X}$ and $\mathrm{Y}$, respectively.
Setting the gradients with respect to $\mathrm{\tilde b}$, $\mathrm{K}$, and $\mathrm{A}$ to zero (the stationarity condition for $\mathrm{A}$ reduces to $\mathrm{A\Sigma_x A = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^\intercal}$, whose unique solution in $\mathrm{S^n_+}$ appears below), we obtain the solutions
\[\begin{align} \mathrm{\tilde b} &=\mathrm{m_x} \notag \\ \mathrm{K} &= \mathrm{\Sigma_{xy} \Sigma_{y}^{-1}} \notag \\ \mathrm{A} &= \mathrm{\Sigma_x^{-\frac{1}{2}} \big(\Sigma_x^{\frac{1}{2}} (\Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^\intercal) \Sigma_x^{\frac{1}{2}}\big)^{\frac{1}{2}} \Sigma_x^{-\frac{1}{2}}}.\notag \end{align}\]The resulting optimal transport map is
\[\begin{align} \mathrm{\nabla_x \bar f(x, y)=m_x + A(x-m_x) + K(y - m_y)}. \notag \end{align}\]Given samples $\mathrm{(X_0^i, Y_0^i)\sim P_{XY}}$, the above transport map is approximated by replacing the exact moments with their empirical estimates, denoted by the superscript $\mathrm{(N)}$:
\[\begin{align} \mathrm{X_1^i = \nabla_x \bar f^{(N)}(X_0^i, y) = m_x^{(N)} + A^{(N)}(X_0^i - m_x^{(N)}) + K^{(N)}(y - m_y^{(N)})}. \label{OT_EnKF} \end{align}\]In contrast, the classical (stochastic) EnKF update is \(\begin{align} \mathrm{X_1^i = X_0^i + K^{(N)}(y - Y_0^i)}. \label{EnKF} \end{align}\)
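For completeness, here is a minimal Python/NumPy sketch of the OT-EnKF update \eqref{OT_EnKF} alongside the classical stochastic EnKF update \eqref{EnKF}. It is a simplified illustration rather than the authors' implementation, and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def ot_enkf_update(X, Y, y):
    """OT-EnKF analysis step. X: (N, n) prior particles, Y: (N, m) simulated
    observations Y^i ~ P_{Y|X}(.|X^i), y: (m,) realized observation."""
    N = X.shape[0]
    m_x, m_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - m_x, Y - m_y
    Sigma_x = Xc.T @ Xc / N
    Sigma_y = Yc.T @ Yc / N
    Sigma_xy = Xc.T @ Yc / N

    K = Sigma_xy @ np.linalg.inv(Sigma_y)
    Sigma_post = Sigma_x - K @ Sigma_xy.T           # Sigma_x - Sigma_xy Sigma_y^{-1} Sigma_xy^T
    S = np.real(sqrtm(Sigma_x))                     # Sigma_x^{1/2}
    S_inv = np.linalg.inv(S)
    A = S_inv @ np.real(sqrtm(S @ Sigma_post @ S)) @ S_inv

    # X_1^i = m_x + A (X_0^i - m_x) + K (y - m_y), applied to all particles at once.
    return m_x + Xc @ A.T + K @ (y - m_y)

def enkf_update(X, Y, y):
    """Classical stochastic EnKF analysis step: X_1^i = X_0^i + K (y - Y_0^i)."""
    N = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    K = (Xc.T @ Yc / N) @ np.linalg.inv(Yc.T @ Yc / N)
    return X + (y - Y) @ K.T
```

Both routines shift the ensemble mean by $\mathrm{K^{(N)}(y - m_y^{(N)})}$; the OT update additionally applies $\mathrm{A^{(N)}}$ so that the particle motion is optimal with respect to the quadratic transport cost.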
We observe that both \eqref{OT_EnKF} and \eqref{EnKF} result in similar dynamics for the empirical mean and covariance. The difference lies in the presence of \(\mathrm{A^{(N)}}\) in \eqref{OT_EnKF}, which ensures that the evolution of the particles is optimal in terms of the quadratic transportation cost.
- Taghvaei, A., & Hosseini, B. (2021). An Optimal Transport Formulation of the Ensemble Kalman Filter. IEEE Transactions on Automatic Control, 66(7), 3197–3212.
- Taghvaei, A., & Hosseini, B. (2022). An Optimal Transport Formulation of Bayes’ Law for Nonlinear Filtering Algorithms. Proceedings of the IEEE Conference on Decision and Control (CDC).
- Al-Jarrah, M., Jin, N., Hosseini, B., & Taghvaei, A. (2024). Nonlinear Filtering with Brenier Optimal Transport Maps. Proceedings of the International Conference on Machine Learning (ICML).
- Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5–6), 355–607.