Efficiently Vectorized MCMC on Modern Accelerators

With the advent of automatic vectorization tools (e.g., JAX's vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem: loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
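The synchronization problem described above can be illustrated with a toy cost model. The sketch below is not the paper's derivation or implementation. It assumes each chain needs a geometrically distributed number of inner-loop iterations per sample (as in a simple rejection-style step), and it idealizes the FSM approach as letting every chain advance without waiting at sample boundaries, ignoring any per-step overhead. The function names and the parameter `p_accept` are hypothetical.

```python
import random


def draw_work(rng, num_chains, num_samples, p_accept):
    """Return work[c][t]: inner-loop iterations chain c needs for sample t.

    Assumption: iteration counts are geometric with success probability
    p_accept, a stand-in for any variable-length MCMC step.
    """
    work = []
    for _ in range(num_chains):
        row = []
        for _ in range(num_samples):
            n = 1
            while rng.random() >= p_accept:
                n += 1
            row.append(n)
        work.append(row)
    return work


def synchronized_cost(work):
    """Cost when every chain waits for the slowest chain at each sample.

    This mimics vectorizing a per-sample while loop: each sample costs
    the maximum iteration count across chains.
    """
    num_samples = len(work[0])
    return sum(max(row[t] for row in work) for t in range(num_samples))


def unsynchronized_cost(work):
    """Idealized cost when chains never wait at sample boundaries.

    The run finishes when the chain with the most total work finishes.
    """
    return max(sum(row) for row in work)


if __name__ == "__main__":
    rng = random.Random(0)
    work = draw_work(rng, num_chains=64, num_samples=200, p_accept=0.2)
    sync = synchronized_cost(work)
    unsync = unsynchronized_cost(work)
    print(f"synchronized: {sync}, unsynchronized: {unsync}, "
          f"ratio: {sync / unsync:.2f}")
```

Because a sum of per-sample maxima is never smaller than the maximum of per-chain sums, the synchronized cost is always at least the unsynchronized one in this model. The gap widens as the number of chains grows or as the per-sample work becomes more variable.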
@article{dance2025_2503.17405,
  title={Efficiently Vectorized MCMC on Modern Accelerators},
  author={Hugh Dance and Pierre Glaser and Peter Orbanz and Ryan Adams},
  journal={arXiv preprint arXiv:2503.17405},
  year={2025}
}