Before training a deep neural network (DNN), one must set the values of many hyper-parameters, such as the weight decay, the momentum, or the choice of loss function. These values have a significant impact on DNN performance, yet there is no practical mechanism for finding their optimal values. Moreover, some of these hyper-parameters are not independent, and changing one often requires adjusting others as well. We study how different hyper-parameters interact with each other and how they affect performance. For example, we show that when Batch Normalization is used, weight decay is equivalent to learning-rate scaling. As another example, in a continual learning scenario we suggest a simple adjustment to the loss function that significantly reduces ``catastrophic forgetting''.
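The intuition behind the weight-decay claim can be illustrated with a small sketch (my own illustration, not code from the talk): Batch Normalization makes a layer's output invariant to the scale of its weights, so shrinking the weights (as weight decay does) leaves the network function unchanged and only alters the effective step size of subsequent gradient updates.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension (inference-style BN,
    # no learned scale/shift, for illustration only).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))   # a batch of inputs
W = rng.normal(size=(10, 5))    # weights of a linear layer feeding into BN

out = batch_norm(X @ W)
out_scaled = batch_norm(X @ (0.1 * W))  # weight decay shrinks W toward zero

# The BN output is (up to eps) invariant to the scale of W, so weight decay
# does not change the function the network computes; it only rescales the
# weights, which in turn rescales the gradients -- i.e., it acts like a
# change of the effective learning rate.
print(np.allclose(out, out_scaled, atol=1e-3))  # True
```

This is why, in BN networks, weight decay and the learning rate cannot be tuned independently: changing one can be compensated by changing the other.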
* Itay Golan is an MSc student under the supervision of Professor Daniel Soudry.
Zoom link: https://technion.zoom.us/j/93952269086