Stencil computations occur in a multitude of scientific simulations and have therefore been the target of many domain-specific languages (DSLs), including OPS (Oxford Parallel library for Structured meshes), a DSL embedded in C/C++/Fortran. OPS is currently used in several large partial differential equation (PDE) applications, and has served as a vehicle to experiment with, and deploy, performance-improving optimisations. The key common bottleneck in most stencil codes is data movement, and other research has shown that optimisations which improve data locality by scheduling across loops are particularly effective. However, in many large PDE applications it is not possible to apply such optimisations through a compiler: larger-scale codes have a huge number of options, execution paths and data per grid point, many of them dependent on run-time parameters, and the code is distributed across a number of different compilation units. In this paper, we adapt the data-locality-improving optimisation called iteration space slicing for use in large OPS applications, relying on run-time analysis and delayed execution. We observe speedups of 2x on the Cloverleaf 2D and 3D proxy applications, which contain 83 and 141 loops respectively. The approach is generally applicable to any stencil DSL that provides per-loop data access information.