Couplings and Monte Carlo Methods (I)

A family of techniques to understand the convergence of random variables.

A coupling of two random variables/ vectors represents a joint distribution, where the marginal distributions are denoted by p and q, respectively. Any joint distributions of p and q define a valid coupling. Although there are infinitely many of them, some of them are quite special and may facilitate our analysis.

Maximal Coupling

Total variation distance is defined as follows

AX,|pq|TV=supXAP(XA)P(YA), where X and Y are random variables defined on the state space X with measurable set X. Note that for any measurable set AX, the follow inequality always holds

P(XA)P(YA)=P(XAXY)P(YAXY)P(XY).

There exists one pair of (X,Y) such that the above inequality holds. Any coupling that satisfies this property is known as maximal coupling.

Denoting by ν=pq the zero-mass measure, there is a set D according to the Hahn-Jordan decomposition such that ν+=(pq)(D) and ν=(qp)(Dc) are both positive measures and ν=ν+ν.

Notably, it is clear that D={x: p(x)q(x)}. It follows that supAXP(XA)P(YA)=supAXν(A)=ν(D)=P(XD)P(YD). We have

P(XD)P(YD)={x: p(x)q(x)}p(x)dx{x: p(x)q(x)}q(x)dx={x: p(x)<q(x)}q(x)dx{x: p(x)<q(x)}p(x)dx.

Summing up the above equations and combining |ab|=a+b2ab, we can obtain the area between the two pdf curves (Jacob, 2021)

|pq|TV=12|p(x)q(x)|dx=1min{p(x),q(x)}dx.

In the end, we can summarize the different formulations of total variation

pqTV=supP(XA)P(YA)=12|p(x)q(x)|dx=1min{p(x),q(x)}dx=inf(X,Y)coupling(p,q)P(XY).

Applications

Suppose (Xi) follow an (independent) binomial distribution Binomial(λ,N) and (Yi) follow an (independent) Poisson distribution Poisson(λ). Next, we couple Xi and Yi maximally. The overlap mass is equal to (1λ)eλ+λλeλ=1λ+λeλ. That means P(XiYi)=λ(1eλ)λ2.

Replace λ with λ/N, we have P(i=1NXii=1NYiN)λ2N, where implies that given a large N and a fixed λ, Poisson distribution approximates the binomial distribution.

Convergence rates

By assuming (Yt) is simulated from the stationary distribution in the beginning, we can introduce a coupling inequality such that

L(Xt)L(Yt)|P(τ>t), where τ is a random variable that enables Xt=Yt and is also known as ‘‘meeting time’’. The equality can be achieved given the maximal coupling but the contruction may differ in different problems. In addition, meeting exactly sometimes is quite restricted; it is enough to consider close enough chains.

Synchronous Coupling

Synchronous coupling models the contraction of a pair of trajectories and can be used to prove the convergence of stochastic gradient Langevin dynamics for strongly log-concave distributions.

Let θkRd be the k-th iterate of the following stochastic gradient Langevin algorithm. θk+1=θkηf~(θk)+2τηξk, where η is the learning rate, τ is the temperature, ξk is a standard d-dimensional Gaussian vector, and f~(θ) is an unbiased estimate of the exact gradient f(θ).

Assumptions

Smoothness We say f is L-smooth if for some L>0 (1)f(y)f(x)+f(x),yx+L2yx22x,yRd.

Strong convexity We say f is m-strongly convex if for some m>0 (2)f(x)f(y)+f(y),xy+m2yx22x,yRd.

Bounded variance The variance of stochastic gradient f~(x) is upper bounded such that (3)E[f~(x)f(x)2]σ2d,xRd.

A Crude Estimate

Assume assumptions 1, 2, and 3 hold. For any learning rate η(0,1m/L2) and ||θ0θ||dD, where θ is a stationary point. Then

W2(μk,π)emkη/22(dD+d/m)+2d(σ2+L2Gη)/m2,

where μk denotes the probability measure of θk and G:=25(τ+mD2+σ2).

Proof

Denote θt as the continuous-time interpolation of the stochastic gradient Langevin dynamics as follows

dθt=f~(θηtη)dt+2τdW~t,

where θ0=θ0. For any kN+ and a time t that satisfies t=kη, it is apparent that μ^t=L(θt) is the same as μk=L(θk), where L() denotes a distribution of a random variable. In addition, we define an auxiliary process (βt) that starts from the stationary distribution π

dβt=f(βt)dt+2τdWt.

Consider Itô’s formula for the sequence of 12θtβt2

   12dθtβt2=θtβt,dθtdβt+Tr[d2θtd2βt]=θtβt,(f(βt)f~(θηtη))dt+2τ(dW~tdWt)+2τTr[d2W~td2Wt].

Taking W~t=Wt defines a synchronous coupling. Arrange the terms

   12dθtβt2=θtβt,f(βt)f(θt)dt+θtβt,f(θt)f(θηtη)dt+θtβt,f(θηtη)f~(θ^ηtη)dtmθtβt2dt+m4θtβt2dt+1mf(θt)f(θηtη)2dt+m4θtβt2dt+1mf(θηtη)f~(θ^ηtη)2dtm2θtβt2dt+dσ2mdt+1mf(θt)f(θηtη)2dtm2θtβt2dt+dσ2mdt+L2mθtθηtη2dt

where the first inequality follows from ab(m2a)2+(1mb)2 and the strong-convexity property 1; in particular, we don’t attempt to optimize the constants of m2 for the item ||θtβt||2; the second and third inequalities follow by bounded variance assumption 3 and the smoothness assumption 2, respectively.

Now apply Grönwall’s inequality to the preceding inequality and take expectation respect to a coupling (θt,βt)Γ(μ^t,π) (4)Eθtβt2Eθ0β02emt+2m0tg(dσ2+L2Eθsθηsη2I:Discretization errorg)e(ts)mds.

Plugging some estimate of I into Eq.(4), we have    Eθtβt2Eθ0β02emt+2dm(σ2+L2Gη)0te(ts)mdsEθ0β02emt+2dm2(σ2+L2Gη).

Recall that θk and θ^tη have the same distribution μk. By the definition of W2 distance, we have

   W2(μk,π)emkη/2W2(μ0,π)+2d(σ2+L2Gη)/m2emkη/22(θ0θ+d/m)+2d(σ2+L2Gη)/m2emkη/22(dD+d/m)+2d(σ2+L2Gη)/m2,

where the first inequality follows by applying |a+b||a|+|b|, the second one follows by an estimate of W2(μ0,π), and the last step follows from the initialization condition.

Remark: How to achieve a sharper upper bound?

  1. Jacob, P. E. (2021). Couplings and Monte Carlo. Slides.