Cutting Down Training Memory by Re-forwarding

Deep Neural Networks (DNNs) require large amounts of GPU memory when trained on modern image and video datasets. Unfortunately, the GPU memory of off-the-shelf devices is finite, which limits the image resolutions and batch sizes that could otherwise be used to improve DNN performance. In this paper, we propose a novel training approach, called Re-forwarding, that substantially reduces memory usage during training. Our approach automatically selects a subset of vertices in a DNN's computation graph and stores tensors only at these vertices during the first forward pass. During the backward pass, extra local forward passes (called the Re-forwarding operations) are conducted to recompute the missing tensors. The total training memory cost then becomes the sum of (1) the memory cost of the selected subset of vertices and (2) the maximum memory cost among the local forwards. Re-forwarding trades time overhead for memory savings and does not compromise test-time performance. We present theories and algorithms that achieve optimal memory solutions for DNNs with both linear and arbitrary computation graphs. Experiments show that Re-forwarding cuts down up to 80% of training memory with a moderate time overhead (around 40%) on popular DNNs such as AlexNet, VGG, ResNet, DenseNet and Inception.
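The store-and-recompute pattern described above is closely related to activation checkpointing. As a rough illustration only (not the paper's algorithm for selecting an optimal vertex subset), the minimal sketch below assumes PyTorch and its torch.utils.checkpoint.checkpoint_sequential utility on a hypothetical linear chain of layers: tensors are kept only at segment boundaries during the first forward pass, and each segment is re-forwarded locally during backward to recover the dropped tensors.

```python
# Minimal sketch of the store-and-recompute idea, assuming PyTorch.
# Not the authors' implementation; segment boundaries here are chosen
# uniformly rather than by the paper's optimal algorithm.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical linear computation graph: a plain chain of 16 blocks.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
)

x = torch.randn(32, 1024, requires_grad=True)

# Split the chain into 4 segments: activations are stored only at the
# segment boundaries (the "subset of vertices"); each segment is
# re-forwarded when its gradients are needed during backward.
y = checkpoint_sequential(model, 4, x)
y.sum().backward()
```

With this segmentation, peak activation memory is roughly the boundary tensors plus the largest single segment's activations, at the cost of one extra forward pass per segment, which mirrors the memory/time trade-off stated in the abstract.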