A coupling of two random variables (or vectors) $X$ and $Y$ is a joint distribution whose marginal distributions are $\mu$ and $\nu$, respectively. Any joint distribution of $X$ and $Y$ defines a valid coupling. Although there are infinitely many of them, some are quite special and may facilitate our analysis.
Maximal Coupling
Total variation distance between two probability measures $\mu$ and $\nu$ is defined as follows
$$d_{\mathrm{TV}}(\mu, \nu) = \sup_{A \in \mathcal{F}} |\mu(A) - \nu(A)|,$$
where $X \sim \mu$ and $Y \sim \nu$ are random variables defined on the state space $\Omega$ with measurable sets $\mathcal{F}$. Note that for any measurable set $A$ and any coupling $(X, Y)$, the following inequality always holds
$$|\mu(A) - \nu(A)| = |\mathbb{P}(X \in A) - \mathbb{P}(Y \in A)| \leq \mathbb{P}(X \neq Y).$$
There exists one pair $(X, Y)$ such that the above inequality holds with equality, i.e., $d_{\mathrm{TV}}(\mu, \nu) = \mathbb{P}(X \neq Y)$. Any coupling that satisfies this property is known as a maximal coupling.
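For discrete distributions, a maximal coupling can be constructed explicitly: with probability equal to the overlap mass, draw both samples from the normalized overlap measure $\min(p, q)$ (so they agree); otherwise, draw the two samples independently from the normalized residuals. A minimal sketch in Python, where the distributions `p` and `q` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def maximal_coupling(p, q, n_samples, rng):
    """Sample pairs (X, Y) with X ~ p and Y ~ q such that
    P(X != Y) equals the total variation distance."""
    overlap = np.minimum(p, q)           # common mass min(p, q)
    alpha = overlap.sum()                # probability that X == Y
    # residual distributions, used when the samples do not match
    p_res = (p - overlap) / (1 - alpha)
    q_res = (q - overlap) / (1 - alpha)
    xs, ys = [], []
    for _ in range(n_samples):
        if rng.random() < alpha:         # draw from the overlap: X == Y
            x = rng.choice(len(p), p=overlap / alpha)
            xs.append(x)
            ys.append(x)
        else:                            # draw independently from residuals
            xs.append(rng.choice(len(p), p=p_res))
            ys.append(rng.choice(len(q), p=q_res))
    return np.array(xs), np.array(ys)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.4])
tv = 0.5 * np.abs(p - q).sum()           # d_TV(p, q) = 0.2
xs, ys = maximal_coupling(p, q, 20000, rng)
mismatch = (xs != ys).mean()             # should be close to tv
```

The empirical mismatch frequency concentrates around $d_{\mathrm{TV}}(p, q)$, which is exactly the defining property of a maximal coupling.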
Denoting by $\mu - \nu$ the zero-mass signed measure (its total mass is $(\mu - \nu)(\Omega) = 0$), there is a set $A$ according to the Hahn–Jordan decomposition such that $(\mu - \nu)(\cdot \cap A)$ and $(\nu - \mu)(\cdot \cap A^c)$ are both positive measures and their difference recovers $\mu - \nu$.
Notably, it is clear that $\mu(A) - \nu(A) = \nu(A^c) - \mu(A^c)$. It follows that $d_{\mathrm{TV}}(\mu, \nu) = \mu(A) - \nu(A)$. Writing $p$ and $q$ for the densities of $\mu$ and $\nu$ with respect to a common dominating measure, so that $A = \{x : p(x) > q(x)\}$, we have
$$\mu(A) - \nu(A) = \int_A \big(p(x) - q(x)\big)\, \mathrm{d}x, \qquad \nu(A^c) - \mu(A^c) = \int_{A^c} \big(q(x) - p(x)\big)\, \mathrm{d}x.$$
Summing up the above equations and combining the two integrals, we can obtain half the area between the two pdf curves (Jacob, 2021)
$$d_{\mathrm{TV}}(\mu, \nu) = \frac{1}{2} \int \big|p(x) - q(x)\big|\, \mathrm{d}x.$$
In the end, we can summarize the different formulations of total variation
$$d_{\mathrm{TV}}(\mu, \nu) = \sup_{A} |\mu(A) - \nu(A)| = \frac{1}{2} \int |p - q| = 1 - \int \min(p, q) = \inf_{(X, Y)} \mathbb{P}(X \neq Y),$$
where the infimum runs over all couplings of $\mu$ and $\nu$ and is attained by a maximal coupling.
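For discrete distributions these formulations can be checked against each other directly; the two pmfs below are arbitrary example values:

```python
import numpy as np

# Two discrete distributions on {0, ..., 4} (arbitrary example values)
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
q = np.array([0.3, 0.1, 0.1, 0.25, 0.25])

# (1) supremum over events: maximized by the set A = {p > q}
tv_sup = (p - q)[p > q].sum()
# (2) half the L1 distance between the pmfs
tv_l1 = 0.5 * np.abs(p - q).sum()
# (3) one minus the overlap mass
tv_overlap = 1.0 - np.minimum(p, q).sum()
```

All three quantities coincide (here they equal $0.3$), illustrating that the supremum, the $L_1$, and the overlap formulations are the same number.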
Applications
Suppose $X_1, \ldots, X_N$ follow an (independent) Bernoulli distribution $\mathrm{Bernoulli}(p)$ and $Y_1, \ldots, Y_N$ follow an (independent) Poisson distribution $\mathrm{Poisson}(p)$, so that $X = \sum_i X_i \sim \mathrm{Binomial}(N, p)$ and $Y = \sum_i Y_i \sim \mathrm{Poisson}(Np)$. Next, we couple each pair $(X_i, Y_i)$ maximally. The overlap mass is equal to $(1 - p) + p e^{-p}$, so $\mathbb{P}(X_i \neq Y_i) = p(1 - e^{-p}) \leq p^2$. That means
$$d_{\mathrm{TV}}\big(\mathrm{Binomial}(N, p),\, \mathrm{Poisson}(Np)\big) \leq \mathbb{P}(X \neq Y) \leq \sum_{i=1}^N \mathbb{P}(X_i \neq Y_i) \leq N p^2.$$
Replacing $p$ with $\lambda / N$, we have
$$d_{\mathrm{TV}}\big(\mathrm{Binomial}(N, \lambda/N),\, \mathrm{Poisson}(\lambda)\big) \leq \frac{\lambda^2}{N},$$
where $\lambda^2 / N \to 0$ implies that given a large $N$ and a fixed $\lambda$, the Poisson distribution approximates the binomial distribution.
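The bound is easy to verify numerically by computing the overlap of the two pmfs directly; the values of $\lambda$ and $N$ below are arbitrary, and the tail is truncated at a point where the remaining mass is negligible:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, N = 2.0, 100
p = lam / N

# d_TV = 1 - overlap mass; truncate the sum at a large K
K = 60
overlap = sum(min(binom_pmf(k, N, p), poisson_pmf(k, lam))
              for k in range(K + 1))
tv = 1.0 - overlap
bound = lam**2 / N   # the coupling bound N * p^2 = lambda^2 / N
```

The exact total variation distance comes out well below the coupling bound $\lambda^2 / N = 0.04$, consistent with the coupling argument being a (crude but simple) upper bound.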
Convergence rates
By assuming $(Y_t)$ is simulated from the stationary distribution $\pi$ in the beginning, we can introduce a coupling inequality such that
$$d_{\mathrm{TV}}\big(\mathcal{L}(X_t), \pi\big) \leq \mathbb{P}(X_t \neq Y_t) \leq \mathbb{P}(\tau > t),$$
where $\tau = \inf\{t \geq 0 : X_t = Y_t\}$ is a random variable that enables $X_s = Y_s$ for all $s \geq \tau$ and is also known as the ``meeting time''. The equality can be achieved given the maximal coupling, but the construction may differ across problems. In addition, meeting exactly is sometimes quite restrictive; it is enough to consider chains that get close enough.
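The coupling inequality can be checked numerically on a small example: two copies of a lazy random walk on a cycle, moved independently until they meet and together afterwards. The chain, the cycle size, and the number of replications below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                       # lazy random walk on a cycle of n states
P = np.zeros((n, n))        # transition matrix; stationary dist is uniform
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

def step(i, rng):
    # stay w.p. 1/2, move left or right w.p. 1/4 each
    return (i + rng.choice([-1, 0, 0, 1])) % n

t_max, n_rep = 20, 4000
exceed = np.zeros(t_max)    # empirical estimate of P(tau > t)
for _ in range(n_rep):
    x, y = 0, rng.integers(n)    # Y_0 drawn from stationarity
    tau = 0
    while x != y:                # move independently until meeting,
        x, y = step(x, rng), step(y, rng)   # then stick together
        tau += 1
    exceed[:min(tau, t_max)] += 1
exceed /= n_rep

# exact d_TV(law of X_t, pi) for the chain started at state 0
pi = np.full(n, 1.0 / n)
mu = np.zeros(n)
mu[0] = 1.0
tvs = []
for t in range(t_max):
    tvs.append(0.5 * np.abs(mu - pi).sum())
    mu = mu @ P
# coupling inequality: tvs[t] <= exceed[t] (up to Monte Carlo error)
```

Even this naive "move independently until you meet" coupling gives a valid upper bound $\mathbb{P}(\tau > t)$ on the exact total variation distance at every $t$; a maximal coupling would make the bound tight.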
Synchronous Coupling
Synchronous coupling models the contraction of a pair of trajectories and can be used to prove the convergence of stochastic gradient Langevin dynamics for strongly log-concave distributions.
Let $x_k$ be the $k$-th iterate of the following stochastic gradient Langevin algorithm
$$x_{k+1} = x_k - \eta \nabla \widetilde{f}(x_k) + \sqrt{2 \eta \tau}\, \xi_k,$$
where $\eta$ is the learning rate, $\tau$ is the temperature, $\xi_k$ is a standard $d$-dimensional Gaussian vector, and $\nabla \widetilde{f}$ is an unbiased estimate of the exact gradient $\nabla f$.
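A single iteration of this update is straightforward to implement. In the sketch below, the quadratic target $f(x) = \|x\|^2 / 2$ (so the exact gradient is $x$), the artificial gradient noise, and all step sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sgld_step(x, grad_est, eta, tau, rng):
    """One stochastic gradient Langevin step:
    x_{k+1} = x_k - eta * grad_est(x_k) + sqrt(2 * eta * tau) * xi_k."""
    xi = rng.standard_normal(x.shape)
    return x - eta * grad_est(x) + np.sqrt(2 * eta * tau) * xi

# hypothetical target f(x) = ||x||^2 / 2: exact gradient is x;
# additive Gaussian noise mimics an unbiased stochastic gradient
def grad_est(x):
    return x + 0.1 * rng.standard_normal(x.shape)

x = np.full(2, 5.0)          # arbitrary starting point
for _ in range(2000):
    x = sgld_step(x, grad_est, eta=0.05, tau=1.0, rng=rng)
```

For this target with $\tau = 1$, the iterates approximately sample from $\mathcal{N}(0, I)$ (up to discretization error), so after many steps the chain wanders near the origin rather than converging to a point.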
Assumptions
Smoothness. We say $f$ is $L$-smooth if for some $L > 0$
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$$
Strong convexity
We say $f$ is $m$-strongly convex if for some $m > 0$
$$\langle \nabla f(x) - \nabla f(y),\, x - y \rangle \geq m \|x - y\|^2 \quad \text{for all } x, y \in \mathbb{R}^d.$$
Bounded variance. The variance of the stochastic gradient is upper bounded such that
$$\mathbb{E}\big[\|\nabla \widetilde{f}(x) - \nabla f(x)\|^2\big] \leq \sigma^2 \quad \text{for all } x \in \mathbb{R}^d.$$
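For a quadratic $f(x) = \frac{1}{2} x^\top A x$ with symmetric positive definite $A$, the constants $m$ and $L$ are exactly the extreme eigenvalues of $A$, which makes the first two assumptions easy to check numerically; the matrix below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # example SPD matrix
grad = lambda x: A @ x                   # gradient of f(x) = x^T A x / 2

eigs = np.linalg.eigvalsh(A)             # ascending eigenvalues
m, L = eigs[0], eigs[-1]                 # strong convexity / smoothness constants

# check the two inequalities on random pairs of points
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    d = x - y
    g = grad(x) - grad(y)
    assert np.linalg.norm(g) <= L * np.linalg.norm(d) + 1e-9  # L-smoothness
    assert g @ d >= m * (d @ d) - 1e-9                        # m-strong convexity
```

Since $\nabla f(x) - \nabla f(y) = A(x - y)$, both inequalities reduce to bounds on the Rayleigh quotient of $A$, so they hold with $m = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$.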
A Crude Estimate
Assume the smoothness, strong convexity, and bounded variance assumptions hold, and let $x^\star$ be a stationary point of $f$. Then, for any sufficiently small learning rate $\eta$, the iterates satisfy a crude non-asymptotic convergence bound,
where $\mu_k$ denotes the probability measure of $x_k$ and $\pi$ denotes the stationary distribution.
Proof
Denote by $x(t)$ the continuous-time interpolation of the stochastic gradient Langevin dynamics as follows
$$\mathrm{d}x(t) = -\nabla \widetilde{f}\big(x(k\eta)\big)\, \mathrm{d}t + \sqrt{2\tau}\, \mathrm{d}W(t), \quad t \in [k\eta, (k+1)\eta),$$
where $W(t)$ is a standard $d$-dimensional Brownian motion. For any $k$ and a time $t$ that satisfies $t = k\eta$, it is apparent that $\mathcal{L}(x(t))$ is the same as $\mathcal{L}(x_k)$, where $\mathcal{L}(\cdot)$ denotes the distribution of a random variable. In addition, we define an auxiliary process $y(t)$ that starts from the stationary distribution
$$\mathrm{d}y(t) = -\nabla f\big(y(t)\big)\, \mathrm{d}t + \sqrt{2\tau}\, \mathrm{d}B(t), \qquad y(0) \sim \pi.$$
Consider Itô’s formula for the process $\|x(t) - y(t)\|^2$.
Taking the same Brownian motion $B(t) = W(t)$ for both processes defines a synchronous coupling. Arranging the terms,
where the first inequality follows from the strong-convexity property; in particular, we do not attempt to optimize the constants for this term; the second and third inequalities follow from the bounded variance assumption and the smoothness assumption, respectively.
Now apply Grönwall’s inequality to the preceding inequality and take the expectation with respect to the coupling.
Plugging an estimate of the remaining error term into the resulting inequality, we have
Recall that $y(t)$ follows the stationary distribution $\pi$ for all $t$, and that $\mathcal{L}(x(k\eta)) = \mu_k$. Since $(x(t), y(t))$ is one particular coupling, the definition of the $W_2$ distance as an infimum over couplings gives $W_2(\mu_k, \pi)^2 \leq \mathbb{E}\|x(k\eta) - y(k\eta)\|^2$. We have
where the first inequality follows by applying the preceding bound, the second one follows from an estimate of the initial distance, and the last step follows from the initialization condition.
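The contraction that drives this proof is easy to observe numerically: running two SGLD chains with shared Gaussian noise (the synchronous coupling) on a strongly convex quadratic makes the distance between them shrink, since the shared noise cancels in the difference. The sketch below uses the exact gradient for clarity; the target $f(x) = \|x\|^2/2$ and the step size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
eta, tau = 0.05, 1.0
grad = lambda x: x                 # f(x) = ||x||^2 / 2, so m = L = 1

x = np.full(2, 10.0)               # algorithm chain, arbitrary start
y = rng.standard_normal(2)         # auxiliary chain ("from stationarity")
dist0 = np.linalg.norm(x - y)
for _ in range(200):
    xi = rng.standard_normal(2)    # shared noise: the synchronous coupling
    noise = np.sqrt(2 * eta * tau) * xi
    x = x - eta * grad(x) + noise
    y = y - eta * grad(y) + noise
dist = np.linalg.norm(x - y)
```

Because the noise terms cancel in $x - y$, the difference contracts deterministically by a factor $(1 - \eta m)$ per step, so `dist` equals `dist0 * (1 - 0.05) ** 200` up to floating-point error. With an independent (rather than synchronous) coupling, the noise would not cancel and no such contraction would occur.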