Lachesis: Automated Generation of Persistent Partitionings for Big Data Applications

Persistent partitioning is effective at improving performance by avoiding expensive shuffle operations, while incurring relatively small overhead. However, automating this process remains a significant challenge for Big Data analytics workloads that extensively use user-defined functions (UDFs). UDFs coded in an object-oriented language such as Python, Scala, or Java can contain arbitrary code that is opaque to the system, which makes it hard to extract and reuse sub-computations for optimizing data placement. In addition, it is challenging to predict the future workloads that may utilize the partitionings. We propose the Lachesis system, which decomposes UDFs into analyzable and reusable sub-computations and relies on a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then used to automatically optimize the storage of the data across applications.
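To make the core idea concrete, the following is a minimal sketch (not code from the paper) of why persistent partitioning avoids a shuffle: if two datasets are hash-partitioned in advance on the same key sub-computation (here a hypothetical `key_subcomputation` standing in for a fragment extracted from an opaque UDF), a join can proceed partition-by-partition with no cross-partition data movement.

```python
# Illustrative sketch only: persistent co-partitioning lets a join run
# locally per partition, avoiding the shuffle phase entirely.

def key_subcomputation(record):
    # Hypothetical sub-computation extracted from a UDF: here it just
    # reads the customer-id field of the record.
    return record["cust_id"]

def partition(dataset, key_fn, n_parts):
    """Hash-partition records into n_parts buckets by key_fn."""
    parts = [[] for _ in range(n_parts)]
    for rec in dataset:
        parts[hash(key_fn(rec)) % n_parts].append(rec)
    return parts

def local_join(parts_a, parts_b, key_fn):
    """Join two co-partitioned datasets partition-by-partition;
    matching keys are guaranteed to land in the same bucket, so no
    cross-partition (shuffle) traffic is needed."""
    out = []
    for pa, pb in zip(parts_a, parts_b):
        index = {}
        for rec in pb:
            index.setdefault(key_fn(rec), []).append(rec)
        for rec in pa:
            for match in index.get(key_fn(rec), []):
                out.append((rec, match))
    return out

orders = [{"cust_id": 1, "total": 10}, {"cust_id": 2, "total": 5}]
custs = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]

# Both datasets are persistently partitioned on the same key function,
# so the join below never moves records between partitions.
pa = partition(orders, key_subcomputation, 4)
pb = partition(custs, key_subcomputation, 4)
joined = local_join(pa, pb, key_subcomputation)
```

The challenge Lachesis addresses is that, in real workloads, `key_subcomputation` is buried inside an arbitrary UDF rather than declared explicitly, and the system must decide which extracted sub-computation is worth partitioning on.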