Understanding the implicit bias of gradient-based methods is crucial in elucidating the operation of deep neural networks. We study the role that the scale of initialization plays in the solution selected by gradient descent, through a detailed analysis of diagonal linear networks. For regression with square loss, we show how the scale of initialization controls the transition between the "kernel" and non-kernel ("rich") regimes, where the inductive bias is L2 max-margin and L1 max-margin, respectively. For classification with exponential loss, we show how the transition between the regimes is controlled by the relationship between the initialization scale and the training accuracy. We also show how increasing the depth of the network can push the selected solution closer to the "rich" regime, which usually has better generalization properties.
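The phenomenon described above can be seen in a toy experiment (not from the talk itself; the data, initialization scales, and step sizes below are illustrative assumptions). A depth-2 diagonal linear network parameterizes the regressor as w = u^2 - v^2 and is trained by plain gradient descent on the square loss for one underdetermined sample. A small initialization scale drives the solution toward the sparse (L1-type) interpolator, while a large scale stays in the kernel regime and recovers the minimum-L2-norm interpolator:

```python
import numpy as np

def train_diag_net(X, y, alpha, lr, steps):
    """GD on a depth-2 diagonal linear network w = u**2 - v**2,
    with both factors initialized at scale alpha (illustrative setup)."""
    d = X.shape[1]
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        w = u ** 2 - v ** 2
        g = X.T @ (X @ w - y)   # gradient of the square loss w.r.t. w
        u -= lr * 2 * g * u     # chain rule: dw/du = 2u
        v += lr * 2 * g * v     # chain rule: dw/dv = -2v
    return u ** 2 - v ** 2

# One sample, two features: every w with w[0] + 2*w[1] = 1 interpolates.
# The min-L1 interpolator is [0, 0.5]; the min-L2 one is [0.2, 0.4].
X = np.array([[1.0, 2.0]])
y = np.array([1.0])

w_rich = train_diag_net(X, y, alpha=0.01, lr=0.1, steps=5000)     # small init
w_kernel = train_diag_net(X, y, alpha=3.0, lr=0.005, steps=20000)  # large init
print(w_rich)    # close to the sparse solution [0, 0.5]
print(w_kernel)  # close to the min-L2 solution [0.2, 0.4]
```

Only the initialization scale changes between the two runs; the resulting implicit bias switches from L1-like to L2-like, which is the regression-side transition the abstract refers to.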
Edward Moroshko is a PhD student under the supervision of Prof. Daniel Soudry and Prof. Koby Crammer.
Zoom link: https://technion.zoom.us/j/92101344462