(W*, H*) = arg max_{W,H} log p(W, H | V). (2)
The point estimates are obtained via optimization methods applied to the posterior density Eq. 1. The optimization procedures used to solve Eq. 2 have various theoretical guarantees which hold under appropriate conditions on the prior and likelihood functions [1, 5–8]. On the other hand, we might be interested in other quantities based on the posterior distribution, such as the moments or the normalizing constants. These quantities are useful in various applications such as model selection [9] (i.e. estimating the 'rank' K of the model) or estimating the Bayesian predictive densities that would be useful for active learning. Markov Chain Monte Carlo (MCMC) algorithms, which aim to generate samples from the posterior distribution of interest, are one of the most popular approaches for estimating these quantities. However, these methods have received less attention, mainly due to their computational complexity and the rather slow convergence of standard methods, e.g. the Gibbs sampler.
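As a concrete instance of such a point estimate for Eq. 2, the sketch below uses assumptions we pick for illustration (Gaussian likelihood, flat priors, so MAP reduces to maximum likelihood) and the classical Lee-Seung multiplicative updates, one of the optimization schemes with the kind of monotone-descent guarantees alluded to above:

```python
import numpy as np

# Minimal sketch of a point estimate for the NMF model V ≈ WH of Eq. 2,
# under assumptions chosen for illustration (Gaussian likelihood, flat
# priors): Lee-Seung multiplicative updates, which keep W, H nonnegative
# and do not increase the squared reconstruction error.
def nmf_point_estimate(V, K, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)   # update H; error non-increasing
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)   # update W; error non-increasing
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((20, 30)))
W, H = nmf_point_estimate(V, K=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```

Such an optimizer returns only (W, H) and gives no access to posterior moments or normalizing constants, which is what motivates the MCMC methods discussed next.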

{maxime.vono,nicolas.dobigeon}@irit.fr, pierre.chainais@centralelille.fr
Abstract—Variable splitting is an old but widely used technique which aims at dividing an initial complicated optimization problem into simpler sub-problems. In this work, we take inspiration from this variable splitting idea to build efficient Markov chain Monte Carlo (MCMC) algorithms. Starting from an initial complex target distribution, auxiliary variables are introduced such that the marginal distribution of interest matches the initial one asymptotically. In addition to having theoretical guarantees, the benefits of such an asymptotically exact data augmentation (AXDA) are fourfold: (i) easier-to-sample full conditional distributions, (ii) the possibility to embed and accelerate state-of-the-art MCMC approaches, (iii) the possibility to distribute the inference and (iv) to respect data privacy issues. The proposed approach is illustrated on classical image processing and statistical learning problems.
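The splitting idea can be sketched on a toy Gaussian model we assume purely for illustration: the target pi(x) ∝ exp(-(x-y)²/(2 s2) - x²/(2 t2)) is split with an auxiliary variable z into pi_rho(x, z) ∝ exp(-(x-y)²/(2 s2) - z²/(2 t2) - (x-z)²/(2 rho2)), whose two full conditionals are simple Gaussians, and whose x-marginal approaches the target as rho2 → 0:

```python
import numpy as np

# Toy AXDA / variable-splitting sketch on an assumed Gaussian model:
# Gibbs sampling alternates two Gaussian full conditionals of the
# split target; the exact posterior mean of x is 1.6 here, and the
# splitting introduces only an O(rho2) perturbation of the marginal.
rng = np.random.default_rng(0)
y, s2, t2, rho2 = 2.0, 1.0, 4.0, 0.1
x, z, xs = 0.0, 0.0, []
for _ in range(50_000):
    prec_x = 1.0 / s2 + 1.0 / rho2
    x = rng.normal((y / s2 + z / rho2) / prec_x, prec_x ** -0.5)  # x | z
    prec_z = 1.0 / t2 + 1.0 / rho2
    z = rng.normal((x / rho2) / prec_z, prec_z ** -0.5)           # z | x
    xs.append(x)
print(np.mean(xs[5_000:]))  # close to the exact posterior mean 1.6
```

Note the trade-off the rho2 parameter encodes: smaller values tighten the marginal approximation but couple x and z more strongly, slowing the Gibbs sweep.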


B. Approximation of C(β)
Another possibility is to approximate the normalizing constant C(β). Existing approximations can be classified into three categories: based on analytical developments, on sampling strategies or on a combination of both. A survey of the state-of-the-art approximation methods up to 2004 has been presented in [20]. The methods considered in [20] are the mean field, the tree-structured mean field and the Bethe energy (loopy Metropolis) approximations, as well as two sampling strategies based on Langevin MCMC algorithms. It is reported in [20] that mean field type approximations, which have been successfully used within EM [24], [25] and stochastic EM algorithms [26], generally perform poorly in MCMC algorithms. More recently, exact recursive expressions have been proposed to compute C(β) analytically [11]. However, to our knowledge, these recursive methods have only been successfully applied to small problems (i.e., for MRFs of size smaller than 40×40) with reduced spatial correlation β < 0.5. Another sampling-based approximation consists in estimating C(β) by Monte Carlo integration [27, Ch. 3], at the expense of very substantial computation and possibly biased estimations (bias arises from the estimation error of C(β)). Better results can be obtained by using importance sampling or path sampling methods [28]. These methods have been applied to the estimation of β within an MCMC image processing algorithm in [19]. Although more precise than Monte Carlo integration, approximating C(β) by importance sampling or path sampling still requires substantial computation and is generally unfeasible for large fields. This has motivated recent works that reduce computation by combining importance sampling with analytical approximations.
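Path sampling in its simplest form can be checked on a model small enough for brute force. The toy sketch below (our own, with an assumed 3×3 two-state Potts/Ising model) uses the identity d log C(β)/dβ = E_β[S], with S the number of agreeing neighbour pairs, so log C(β) = N log 2 + ∫₀^β E_t[S] dt, estimated by Gibbs sampling on a grid of t:

```python
import itertools
import numpy as np

# Path-sampling sketch for C(beta) on a tiny 3x3 Ising model (2-state
# Potts): estimate E_t[S] by Gibbs sampling on a grid of t, integrate
# by the trapezoidal rule, and compare with brute-force enumeration
# (feasible only because the lattice has 2^9 = 512 configurations).
sites = [(i, j) for i in range(3) for j in range(3)]
edges = [(s, t) for s in sites for t in sites
         if (t[0] - s[0], t[1] - s[1]) in [(0, 1), (1, 0)]]

def S(x):  # number of agreeing neighbour pairs in configuration x
    return sum(x[a] == x[b] for a, b in edges)

def gibbs_mean_S(beta, n_sweeps=3000, burn=500, seed=0):
    rng = np.random.default_rng(seed)
    x = {s: int(rng.integers(2)) for s in sites}
    total = 0.0
    for sweep in range(n_sweeps):
        for s in sites:
            nb = [x[t] for (a, t) in edges if a == s] + \
                 [x[a] for (a, t) in edges if t == s]
            w1 = np.exp(beta * sum(v == 1 for v in nb))
            w0 = np.exp(beta * sum(v == 0 for v in nb))
            x[s] = int(rng.random() < w1 / (w0 + w1))
        if sweep >= burn:
            total += S(x)
    return total / (n_sweeps - burn)

betas = np.linspace(0.0, 0.5, 6)
means = [gibbs_mean_S(b, seed=k) for k, b in enumerate(betas)]
log_C = 9 * np.log(2) + sum(
    0.5 * (means[k] + means[k + 1]) * (betas[k + 1] - betas[k])
    for k in range(len(betas) - 1))

exact = np.log(sum(np.exp(0.5 * S(dict(zip(sites, cfg))))
                   for cfg in itertools.product([0, 1], repeat=9)))
print(log_C, exact)  # path-sampling estimate vs brute force at beta = 0.5
```

The cost grows with the lattice size and the fineness of the β grid, which is exactly why the text calls these approximations substantial for large fields.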
More precisely, approximation methods that combine importance sampling with extrapolation schemes have been proposed for the Ising model (i.e., a 2-state Potts model) in [9] and for the 3-state Potts model in [10]. However, we have found that this extrapolation technique introduces significant bias [29].
C. Auxiliary Variables and Perfect Sampling

As mentioned above, the realization that Markov chains could be used in a wide variety of situations only came (to mainstream statisticians) with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970), Geman and Geman (1984) and Tanner and Wong (1987). Several reasons can be advanced: lack of computing machinery (think of the computers of 1970!), lack of background on Markov chains, lack of trust in the practicality of the method... It thus required visionary researchers like Alan Gelfand and Adrian Smith to spread the good news, backed up with a collection of papers that demonstrated, through a series of applications, that the method was easy to understand, easy to implement and practical (Gelfand et al. 1990, 1992, Smith and Gelfand 1992, Wakefield et al. 1994). The rapid emergence of the dedicated BUGS (Bayesian inference Using Gibbs Sampling) software as early as 1991 (when a paper on BUGS was presented at the Valencia meeting) was another compelling argument for adopting MCMC algorithms at large.

After briefly reviewing the limitations of MCMC for tall data and introducing our notation and two running examples in Section 2, we first review the divide-and-conquer literature in Section 3. The rest of the paper is devoted to subsampling approaches. In Section 4, we discuss pseudo-marginal MH algorithms. These approaches are exact in the sense that they target the right posterior distribution. In Section 5, we review other exact approaches, before relaxing exactness in Section 6. Throughout, we focus on the assumptions and guarantees of each method. We also illustrate key methods on two running examples. Finally, in Section 7, we improve over our so-called confidence sampler in (Bardenet et al., 2014), which samples from a controlled approximation of the target. We demonstrate in Section 8 that these improvements yield significant reductions in computational complexity at each iteration. In particular, our improved confidence sampler can break the O(n) barrier on the number of individual data point likelihood evaluations per iteration in favourable cases. Its main limitation is the requirement for cheap-to-evaluate proxies for the log-likelihood, with a known error. We provide examples of such proxies relying on Taylor expansions.
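The basic subsampling idea that this literature refines can be sketched on an assumed toy model (data N(theta, 1), flat prior): the MH log-likelihood ratio over all n points is replaced by a rescaled estimate from a random minibatch of size m. The estimator is unbiased but noisy, so the resulting chain only approximates the posterior, which is precisely the kind of error the confidence sampler controls:

```python
import numpy as np

# Naive subsampling MH sketch (toy model, not the authors' confidence
# sampler): the full-data log-likelihood ratio is estimated from a
# random minibatch of size m, rescaled by n/m. The chain approximately
# targets the posterior, with variance inflated by the estimator noise.
rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100_000)
n, m = len(data), 1_000

def loglik(theta, x):
    return -0.5 * (x - theta) ** 2

theta, chain = 0.0, []
for _ in range(2_000):
    prop = theta + 0.1 * rng.normal()
    batch = data[rng.integers(0, n, size=m)]               # random subsample
    ratio = (n / m) * np.sum(loglik(prop, batch) - loglik(theta, batch))
    if np.log(rng.random()) < ratio:                       # approximate MH test
        theta = prop
    chain.append(theta)
print(np.mean(chain[500:]))  # close to the posterior mean, ~1.0
```

With n = 100,000 the true posterior standard deviation is about 0.003, while this naive chain fluctuates on a much wider scale: the noise of the subsampled ratio dominates the accept/reject decision, illustrating why exactness guarantees require more careful constructions.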


MCMC samplers are based on two original versions: the Bouncy Particle Sampler (BPS) of Bouchard-Côté et al. (2018) and the Zigzag Sampler of Bierkens et al. (2016). Bouchard-Côté et al. (2018) exhibits that the BPS can provide state-of-the-art performance compared with the reference HMC for high-dimensional distributions, while Bierkens et al. (2016) shows that PDMP-based samplers are easier to scale in big-data settings without introducing bias, and Bierkens et al. (2018) considers the application of PDMPs to distributions on restricted domains. Fearnhead et al. (2016) unifies the BPS and Zigzag samplers in the framework of PDMPs, choosing the process velocity at event times over the unit sphere, based on the inner product between this velocity and the gradient of the potential function. (This perspective relates to the transition dynamics used in our paper.) To overcome the main difficulty in PDMP-based samplers, which is the simulation of a time-inhomogeneous Poisson process, Sherlock and Thiery (2017) and Vanetti et al. (2017) resort to a discretisation of such continuous-time samplers. Furthermore, pre-conditioning the velocity set accelerates the algorithms, as shown by Pakman et al. (2016). The outline of this chapter is as follows. Section 2 introduces the necessary background of PDMP-based MCMC samplers, the techniques used in their implementation, and two specific samplers, the BPS and the Zigzag sampler. In Section 3, we describe the methodology behind the coordinate sampler, provide some theoretical validation along with a proof of geometric ergodicity, obtained under quite mild conditions, and we compare this proposal with the Zigzag sampler in an informal analysis. Section 4 further compares the efficiency of both approaches on banana-shaped distributions, multivariate Gaussian distributions and a Bayesian logistic model, with effective sample size measuring efficiency.
Section 5 concludes by pointing out further research directions about this special MCMC sampler.
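A one-dimensional Zigzag run illustrates the mechanics of these samplers. For a standard Gaussian target, U(x) = x²/2, the event rate max(0, v U'(x + vt)) integrates in closed form, so event times can be drawn exactly by inverting the integrated rate, without the thinning needed in general (this minimal example is ours, not taken from the chapter):

```python
import numpy as np

# 1D Zigzag sketch for a standard Gaussian target, U(x) = x^2 / 2.
# Between events the particle moves linearly with velocity v = ±1;
# at each event the velocity flips. Expectations are computed as
# exact integrals along the piecewise-linear trajectory.
rng = np.random.default_rng(0)
x, v = 0.0, 1.0
T, sum_x2 = 0.0, 0.0
for _ in range(50_000):
    E = rng.exponential()
    # invert the integrated rate Lambda(t) to get the next event time
    tau = -v * x + np.sqrt(max(v * x, 0.0) ** 2 + 2.0 * E)
    sum_x2 += ((x + v * tau) ** 3 - x ** 3) / (3.0 * v)  # integral of x^2 on segment
    T += tau
    x += v * tau   # deterministic linear flight
    v = -v         # velocity flip at the event
print(sum_x2 / T)  # estimates E[x^2] = 1 under N(0, 1)
```

For general targets the integrated rate cannot be inverted in closed form, which is exactly the simulation difficulty that the thinning and discretisation strategies cited above address.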


PACS numbers: 02.70.Tt, 05.10.Ln, 05.10.-a, 64.60.De, 75.10.Hk, 75.10.Nr
Keywords: Monte Carlo methods; Metropolis algorithm; factorized Metropolis filter; long-range interactions; spin glasses
Markov-chain Monte Carlo methods (MCMC) are powerful tools in many branches of science and engineering [1–8]. For instance, MCMC plays a crucial role in the recent success of AlphaGo [9], and appears as a keystone of the potential next deep learning revolution [10, 11]. To estimate high-dimensional integrals, MCMC generates a chain of random configurations, called samples. The stationary distribution is typically a Boltzmann distribution and the successive moves depend on the induced energy changes. Despite a now long history, the most successful and influential MCMC algorithm remains the founding Metropolis algorithm [12], for its generality and ease of use, ranked as one of the top 10 algorithms of the 20th century [13].
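The founding algorithm cited above fits in a few lines. This is a textbook single-spin-flip Metropolis sketch for a small 2D Ising model with periodic boundaries (our own minimal example, with an assumed lattice size and temperature): propose flipping one spin and accept with probability min(1, exp(-β ΔE)):

```python
import numpy as np

# Single-spin-flip Metropolis for a 16x16 Ising model with periodic
# boundaries, sampling the Boltzmann distribution exp(-beta * E) with
# E = -sum of neighbouring spin products.
rng = np.random.default_rng(0)
L, beta = 16, 0.3              # beta below the critical value ~0.4407
s = rng.choice([-1, 1], size=(L, L))
for _ in range(200):           # sweeps
    for _ in range(L * L):
        i, j = rng.integers(L), rng.integers(L)
        nb = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
              + s[i, (j + 1) % L] + s[i, (j - 1) % L])
        dE = 2 * s[i, j] * nb          # energy change of flipping spin (i, j)
        if rng.random() < np.exp(-beta * dE):
            s[i, j] *= -1              # Metropolis acceptance step
print(abs(s.mean()))  # magnetization stays small in the disordered phase
```

The ease of use praised in the text is visible here: only energy differences of local moves are needed, never the normalizing constant of the Boltzmann distribution.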

The literature on AIS is vast, including methods based on sequential moment matching such as AMIS [7], which comprises a Rao-Blackwellization of the temporal estimators, and APIS, which incorporates multiple proposals [8]. Other recent methods have introduced Markov chain Monte Carlo (MCMC) mechanisms for the adaptation of the IS proposals [9], [10], [11]. The family of population Monte Carlo (PMC) methods also falls within AIS. Its key feature is arguably the use of resampling steps in the adaptation of the location parameters of the proposals [12], [13]. The seminal paper [14] introduced the PMC framework. Since then, other PMC algorithms have been proposed, increasing the resulting performance by the incorporation of stochastic expectation-maximization mechanisms [15], non-linear transformations of the importance weights [16], or better weighting and resampling schemes [17]. The method we propose in this paper falls within the PMC framework.
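The resampling-based adaptation that characterizes PMC can be sketched on an assumed toy target (N(3, 1), our own illustration, not one of the cited schemes): each Gaussian proposal with its own location generates one draw, importance weights compare target and proposal densities, and multinomial resampling moves the locations toward high-probability regions:

```python
import numpy as np

# Minimal PMC sketch: 200 Gaussian proposals, one draw each; weights
# are target density over proposal density; resampling the locations
# according to the weights is the adaptation step of PMC.
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * (x - 3.0) ** 2     # unnormalized N(3, 1)
mu = rng.normal(0.0, 5.0, size=200)              # initial proposal locations
sig = 1.0
for _ in range(10):
    x = mu + sig * rng.normal(size=mu.size)      # one draw per proposal
    logw = log_target(x) + 0.5 * ((x - mu) / sig) ** 2  # log target - log proposal
    w = np.exp(logw - logw.max())
    w /= w.sum()
    mu = x[rng.choice(mu.size, size=mu.size, p=w)]      # resample the locations
print(mu.mean())  # locations concentrate around the target mean, ~3
```

Because the samples are weighted by importance sampling at every iteration, each population yields valid estimates regardless of how the locations were adapted, which is the point developed in the PMC excerpt further below.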

Abstract
Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias of SG-MCMC while keeping the variance at a reasonable level. Our approach is based on a numerical sequence acceleration method, namely the Richardson-Romberg extrapolation, which simply boils down to running almost the same SG-MCMC algorithm twice in parallel with different step sizes. We illustrate our framework on the popular Stochastic Gradient Langevin Dynamics (SGLD) algorithm and propose a novel SG-MCMC algorithm referred to as Stochastic Gradient Richardson-Romberg Langevin Dynamics (SGRRLD). We provide a formal theoretical analysis and show that SGRRLD is asymptotically consistent, satisfies a central limit theorem, and that its non-asymptotic bias and mean squared error can be bounded. Our results show that SGRRLD attains higher rates of convergence than SGLD both in finite time and asymptotically, and that it achieves the theoretical accuracy of methods based on higher-order integrators. We support our findings with both synthetic and real-data experiments.
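The two-step-size construction can be sketched on a toy Gaussian model we assume for illustration (not the paper's experiments): theta has a N(0, 1) prior, the data are N(theta, 1), and the stochastic gradient uses minibatches of size m; the two ergodic averages are combined as 2·avg(h/2) − avg(h):

```python
import numpy as np

# Richardson-Romberg sketch on SGLD: run the same SGLD chain with step
# sizes h and h/2 and combine the ergodic averages, which cancels the
# leading O(h) bias of each individual chain.
rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=1000)
n, m = len(data), 50

def sgld_avg(h, n_iter=50_000, seed=1):
    rng = np.random.default_rng(seed)
    theta, acc = 0.0, 0.0
    for _ in range(n_iter):
        batch = data[rng.integers(0, n, size=m)]
        grad = -theta + (n / m) * np.sum(batch - theta)  # stoch. grad of log posterior
        theta += 0.5 * h * grad + np.sqrt(h) * rng.normal()  # SGLD step
        acc += theta
    return acc / n_iter

h = 0.0005
est = 2 * sgld_avg(h / 2, seed=2) - sgld_avg(h, seed=3)
print(est)  # ~ posterior mean n*mean(data)/(n+1), close to mean(data)
```

SGRRLD as described in the abstract additionally correlates the Brownian increments of the two chains to keep the variance of the combination under control; the sketch above uses independent chains for simplicity.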

As mentioned above, the rate of convergence obtained for the total variation distance is practically useless to analyze MCMC algorithms when the dimension of the state space becomes large. Despite the fact that the bounds are explicit, these results say in practice little more than 'the chain converges for large n'; see [JH01] and [RR04]. On the contrary, as observed in several recent works in this direction (see [Cot+13] and the references therein), Wasserstein bounds are much more informative, at least for appropriately designed MCMC algorithms. One of the keys to the success of our approach for the Wasserstein distance (under, of course, appropriate assumptions on the transition kernel and a particular choice of the distance) is the ability to couple MCMC algorithms "naturally" by simply running two versions of the algorithm with the same random numbers. This is in contrast with the "general" coupling construction used for total variation convergence, where an attempt to make the two components equal is made only when the two versions of the chain meet in a coupling set (the probability of meeting in a coupling set is typically not large, and the probability of successfully coupling the chains is, on top of this, vanishingly small in large dimension). We provide an illustration of this fact on a version of the Metropolis Adjusted Langevin Algorithm using an exponential integrator (EI-MALA) originally proposed by [Ebe14].
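The "same random numbers" coupling is simple to demonstrate. The toy below (our own, using plain ULA for a standard Gaussian target rather than the EI-MALA of the text) starts two chains far apart and drives them with identical Gaussian innovations; the noise cancels in their difference, which then contracts deterministically:

```python
import numpy as np

# Common-random-numbers coupling of two ULA chains for a standard
# Gaussian target, U(x) = x^2 / 2. Sharing the innovations makes the
# difference x - y contract by a factor (1 - h/2) at every step,
# giving a direct handle on Wasserstein convergence bounds.
rng = np.random.default_rng(0)
h = 0.1
x, y = -10.0, 10.0
for _ in range(200):
    xi = rng.normal()                      # one noise shared by both chains
    x += -0.5 * h * x + np.sqrt(h) * xi    # ULA step, grad U(x) = x
    y += -0.5 * h * y + np.sqrt(h) * xi
print(abs(x - y))  # = 20 * (1 - h/2)**200, about 7e-4
```

Every step shrinks the distance, whereas the total-variation coupling described above only has a chance of succeeding on the rare occasions when the two chains meet in a coupling set.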


Irreversible or non-reversible MCMC samplers have attracted a lot of attention over the last two decades. They break the detailed-balance condition while still obeying the global-balance one and leaving π invariant and, by doing so, have often been shown to converge faster than their reversible counterparts. It is however still challenging to develop a construction methodology for irreversible kernels that displays the generality of reversible schemes such as the Metropolis-Hastings algorithm while improving the convergence. Indeed, non-reversible MCMC algorithms can be directly built from the composition of reversible MCMC kernels (e.g. deterministic-scan Gibbs samplers, Gelfand and Smith (1990a)), but it is well known that such a strategy can be relatively inefficient, in particular since it does not prevent diffusive behavior and backtracking in the resulting process. To circumvent this issue, popular solutions consist in extending the state space by introducing an additional variable and targeting the extended probability distribution.
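A minimal instance of this lifting idea, on an assumed toy problem of our own: for a uniform target on a ring of K states, extend the state with a direction variable v in {-1, +1}, move persistently in direction v, and flip v only rarely. This leaves the uniform distribution invariant while breaking detailed balance, and it suppresses exactly the diffusive backtracking described above:

```python
import numpy as np

# Lifted (non-reversible) walk vs reversible random walk on a ring of
# K states with uniform target: both leave the uniform distribution
# invariant, but the lifted walk's persistent moves explore the ring
# ballistically instead of diffusively. We compare the number of steps
# needed to visit every state.
rng = np.random.default_rng(0)
K = 50

def steps_to_cover(lifted):
    i, v, seen = 0, 1, {0}
    for t in range(1, 200_000):
        if lifted:
            if rng.random() < 0.02:     # rare direction flip
                v = -v
            i = (i + v) % K             # persistent, non-reversible move
        else:
            i = (i + rng.choice([-1, 1])) % K  # reversible random walk
        seen.add(i)
        if len(seen) == K:
            return t                    # first time every state was visited
    return t

t_lift, t_rev = steps_to_cover(True), steps_to_cover(False)
print(t_lift, t_rev)  # the lifted walk covers the ring far sooner
```

The extra variable v plays the role of the velocity in the PDMP samplers discussed elsewhere in this collection; the flip probability trades persistence against the ability to reverse course.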

π(x|y) ∝ exp(−‖y − Hx‖²/(2σ²) − α‖∇d x‖₁). (19)
Image processing methods using (19) are almost exclusively based on MAP estimates of x that can be efficiently computed using proximal optimisation algorithms [Green et al., 2015]. Here we consider the problem of computing credibility regions for x, which we use to assess the confidence in the restored image. Precisely, we use MYULA to compute approximately the marginal 90% credibility interval for each image pixel, where we note that (19) is log-concave and admits the decomposition U(x) = f(x) + g(x), with f(x) = ‖y − Hx‖²/(2σ²) convex and Lipschitz
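A MYULA iteration for such a decomposition U = f + g can be sketched in a deliberately simplified setting we assume for illustration: H = I (so f(x) = ‖y − x‖²/(2σ²)) and g(x) = α‖x‖₁, whose prox is soft-thresholding, instead of the TV prior of Eq. (19). MYULA runs ULA on f plus the Moreau envelope of g:

```python
import numpy as np

# MYULA sketch: x_{k+1} = (1 - gam/lam) x_k - gam * grad_f(x_k)
#               + (gam/lam) * prox_{lam g}(x_k) + sqrt(2 gam) * noise,
# i.e. ULA applied to f plus the lam-Moreau envelope of g. Marginal
# 90% credibility intervals are read off the retained samples.
rng = np.random.default_rng(0)
y = np.array([3.0, -0.2, 0.1, -4.0])          # a tiny "image" of 4 pixels
sigma2, alpha, lam, gam = 1.0, 1.0, 0.5, 0.1

def prox_g(x, t):                             # prox of t*alpha*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t * alpha, 0.0)

x, samples = y.copy(), []
for k in range(50_000):
    grad_f = (x - y) / sigma2
    x = ((1 - gam / lam) * x - gam * grad_f + (gam / lam) * prox_g(x, lam)
         + np.sqrt(2 * gam) * rng.normal(size=x.size))
    if k >= 5_000:                            # discard burn-in
        samples.append(x.copy())
S = np.array(samples)
lo, hi = np.percentile(S, 5, axis=0), np.percentile(S, 95, axis=0)
print(np.round(lo, 2), np.round(hi, 2))       # marginal 90% credibility intervals
```

With the TV term of (19) the prox has no closed form and is computed by an inner solver, but the outer iteration is exactly the one above.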

unordered but they can be sorted again. Nonetheless, we did not impose ordering when simulating the parameters with fixed k's and, more importantly, did not restrict ourselves to implement combine moves only on adjacent components as in Richardson and Green (1997). Hence, the above item that we find most important is B., namely whether the missing data z is kept track of in all moves or not. It would indeed be interesting to compare the performance of two algorithms, in discrete or continuous time, that are identical except for this aspect. (We recall that Robert et al. (2000) did resort to completion in their implementation of RJMCMC.)

that a Metropolis-Hastings scheme based on the same proposal does not work well, while a PMC algorithm produces correct answers. The PMC framework thus allows for a much easier construction of adaptive schemes, i.e. of proposals that correct themselves against past performances, than in MCMC setups. Indeed, while adaptive importance sampling strategies had already been considered in the pre-MCMC era, as in, e.g., Oh and Berger (1992, 1993), the MCMC environment is much harsher for adaptive algorithms, because the adaptivity cancels the Markovian nature of the sequence and thus calls for more elaborate convergence studies to establish ergodicity. See, e.g., Andrieu and Robert (2001) and Haario et al. (1999, 2001) for recent developments in this area. For PMC methods, ergodicity is not an issue since the validity is obtained via importance sampling justifications.

6 Conclusion
Nested Monte-Carlo Search can be used for problems that do not have good heuristics to guide the search. The use of random games implies that nested calls are not guaranteed to improve the search. On simple abstract problems, a theoretical analysis of nested Monte-Carlo search was presented. It was also shown that memorizing the best sequence greatly improves the mean result of the search. Experiments on three different games gave very good results, finding a new world record of 80 moves at Morpion Solitaire, improving on previous algorithms at SameGame, and being more than 200,000 times faster than depth-first search at 16x16 Sudoku modeled as a Constraint Satisfaction Problem.

nested calls [7, 1]. These algorithms use a base heuristic which is improved with nested calls, whereas Nested Monte-Carlo Search uses random moves at the base level instead. Nested Monte-Carlo Search is an algorithm that uses no domain-specific knowledge and is widely applicable. However, adding domain-specific knowledge will probably improve it; for example, at Kakuro, pruning more values using stronger consistency checks would certainly improve both the Forward Checking algorithm and the Nested Monte-Carlo search algorithm.
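The algorithm itself is compact. The sketch below uses a toy problem of our own (not one of the paper's games): build a 10-bit string left to right, the score being its value as a binary number. Level 0 is a uniformly random playout; level n tries every move, scores each with a level n-1 search, plays the move of the best result, and memorizes the best sequence found so far:

```python
import random

# Toy Nested Monte-Carlo Search: random playouts at the base level,
# nested evaluation of each candidate move above it, and memorization
# of the best complete sequence seen so far.
random.seed(0)
N = 10

def score(bits):
    return int("".join(map(str, bits)), 2)

def playout(bits):  # level 0: complete the sequence with random moves
    return bits + [random.randint(0, 1) for _ in range(N - len(bits))]

def nested(bits, level):
    best = None
    while len(bits) < N:
        for move in (0, 1):
            cand = (playout(bits + [move]) if level == 1
                    else nested(bits + [move], level - 1))
            if best is None or score(cand) > score(best):
                best = cand                 # memorize the best sequence
        bits = best[:len(bits) + 1]         # play the move of the best sequence
    return best

print(score(playout([])), score(nested([], 1)))  # random playout vs level 1 = 1023
```

On this toy problem a level-1 search already finds the optimum 1023 every time, because a playout starting with the better move always outscores any completion of the worse one; real games need higher nesting levels precisely because their playouts are far noisier.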

4. Reversible Jump MCMC Algorithm
4.1. General MCMC Method
The Markov chain Monte Carlo (MCMC) method is a powerful tool for Bayesian estimation. MCMC sampling was first introduced by Metropolis et al. [1953] to integrate over high-dimensional probability distributions to make inference about model parameters. In Bayesian inference we are interested in finding the joint posterior distribution of the parameters. The difficulty is that the posterior distribution is typically found by multidimensional integration, which is only feasible for small-scale problems, and hence many problems become intractable. When the full-conditional
