Deep learning has been successfully applied to various tasks, but its underlying mechanism remains unclear. Neural networks map input data to hidden states in deep layers. Because deeper layers have fewer degrees of freedom, subsets of the data are mapped to identical states there. In this sense, deep learning can be regarded as a hierarchical data grouping process. In this Letter, we discover that deep learning forces the size distributions of the data clusters to follow power laws, with a different power-law exponent in each layer. In particular, we identify a critical layer where the cluster size distribution obeys a reciprocal relationship between rank and frequency, also known as Zipf's law. Deep learning thereby ensures balanced data grouping by extracting similarities and differences among the data. Furthermore, we verify that the data structure in the critical layer is the most informative for reliably generating patterns of the training data. This criticality can therefore explain the operational excellence of deep learning and provides a useful concept for probing optimal network architectures.
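To make the grouping procedure concrete, below is a minimal sketch of how one might extract cluster size distributions from a layer's hidden activations and estimate the rank-frequency exponent. The discretization rule (sign of each activation) and the least-squares fit in log-log space are illustrative assumptions, not necessarily the method used in the Letter.

```python
# Sketch: cluster inputs by discretized hidden states and check for Zipf-like behavior.
# Assumption: states are discretized by the sign of each activation; the Letter's
# actual grouping rule may differ.
import numpy as np
from collections import Counter

def cluster_size_distribution(hidden_states):
    """Group samples whose discretized hidden states coincide and return
    cluster sizes sorted from largest to smallest (rank-frequency form)."""
    codes = (hidden_states > 0).astype(np.int8)   # discretize each unit (assumption)
    keys = [c.tobytes() for c in codes]           # hashable key per sample
    sizes = sorted(Counter(keys).values(), reverse=True)
    return np.array(sizes)

def rank_frequency_exponent(sizes):
    """Fit s(r) ~ r^(-alpha) by least squares in log-log space.
    Zipf's law corresponds to alpha close to 1."""
    ranks = np.arange(1, len(sizes) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(sizes), 1)
    return -slope

# Toy usage with random "activations": 10,000 samples, 12 hidden units.
rng = np.random.default_rng(0)
h = rng.normal(size=(10_000, 12))
sizes = cluster_size_distribution(h)
print("estimated rank-frequency exponent:", rank_frequency_exponent(sizes))
```

In this picture, applying the same analysis layer by layer and locating the layer whose exponent is closest to 1 would identify the critical layer described in the abstract.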