Yuandong Tian
Facebook AI Research
http://openreview.net/pdf?id=Hk85q85ee
In this paper, we use dynamical systems to analyze the nonlinear weight dynamics of two-layered bias-free networks of the form $g(x; w) = \sum_{j=1}^{K} \sigma(w_j^\top x)$, where σ(·) is the ReLU nonlinearity. We assume that the input x follows a Gaussian distribution. The network is trained using gradient descent to mimic the output of a teacher network of the same size with fixed parameters w∗ using l2 loss.
We first show that when K = 1, the nonlinear dynamics can be written in closed form and converges to w∗ with probability at least (1 − ε)/2 if a random weight initialization with proper standard deviation (∼ 1/√d) is used, supporting empirical practice [Glorot & Bengio (2010); He et al. (2015); LeCun et al. (2012)].
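The K = 1 setting can be explored numerically with a minimal sketch such as the one below. It is not the closed-form dynamics of the paper: it assumes a Monte Carlo estimate of the l2 loss on fresh Gaussian batches, and the names `train_single_relu` and `random_init`, the step size, the batch size, and the convergence tolerance are all illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_single_relu(w0, w_star, steps=300, lr=0.1, batch=4096, seed=0):
    """Gradient descent on a sampled l2 loss 0.5 * (relu(w.x) - relu(w*.x))^2.

    Assumption: the expectation over x ~ N(0, I) is approximated by a fresh
    mini-batch at every step (the paper works with the exact expectation).
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    d = w.shape[0]
    for _ in range(steps):
        x = rng.standard_normal((batch, d))          # Gaussian inputs
        residual = relu(x @ w) - relu(x @ w_star)    # student minus teacher
        active = (x @ w > 0).astype(float)           # ReLU gate of the student
        grad = (residual * active) @ x / batch       # sampled gradient w.r.t. w
        w -= lr * grad
    return w

def random_init(d, rng):
    """Random initialization with standard deviation 1/sqrt(d) per coordinate."""
    return rng.standard_normal(d) / np.sqrt(d)

if __name__ == "__main__":
    d = 10
    w_star = np.zeros(d)
    w_star[0] = 1.0                                  # hypothetical unit-norm teacher
    rng = np.random.default_rng(123)
    trials = 20
    hits = 0
    for t in range(trials):
        w = train_single_relu(random_init(d, rng), w_star, seed=t)
        hits += np.linalg.norm(w - w_star) < 0.05    # illustrative tolerance
    print(f"{hits}/{trials} random initializations ended near the teacher")
```

Running the script reports how many of the random starts end close to w∗; the fraction depends on the seed and the tolerance, so it is only a rough empirical check of the probability statement above.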
For networks with many ReLU nodes (K ≥ 2), we apply our closed-form dynamics and prove that when the teacher parameters $\{w^*_j\}_{j=1}^{K}$ form an orthonormal basis, (1) a symmetric weight initialization yields convergence to a saddle point and (2) a certain symmetry-breaking weight initialization yields global convergence to w∗ without local minima.
To our knowledge, this is the first proof of global convergence for a nonlinear neural network that does not rely on unrealistic assumptions about the independence of ReLU activations. In addition, we give a concise gradient update formulation for a multilayer ReLU network that follows a teacher of the same size under l2 loss. Simulations verify our theoretical analysis.
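The contrast between symmetric and symmetry-breaking initialization for K ≥ 2 can be illustrated with a small simulation. The sketch below rests on several assumptions not stated in the abstract: K = d = 3, the teacher weights are the standard basis, "symmetric" is read as all student nodes starting from the same vector, and the symmetry-breaking start is a hypothetical diagonal-dominant matrix. The function name `train_two_layer` and all hyperparameters are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_two_layer(W0, W_star, steps=500, lr=0.1, batch=4096, seed=0):
    """Gradient descent for g(x; W) = sum_j relu(w_j . x) against a teacher.

    W has shape (K, d), one row per ReLU node. The l2 loss is estimated on a
    fresh Gaussian mini-batch at each step (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    W = np.array(W0, dtype=float)
    d = W.shape[1]
    for _ in range(steps):
        x = rng.standard_normal((batch, d))
        pre = x @ W.T                                     # (batch, K) pre-activations
        residual = relu(pre).sum(axis=1) - relu(x @ W_star.T).sum(axis=1)
        gate = (pre > 0).astype(float)                    # per-node ReLU gates
        grad = (gate * residual[:, None]).T @ x / batch   # (K, d) sampled gradient
        W -= lr * grad
    return W

K = 3
W_star = np.eye(K)                                        # orthonormal teacher (assumed)
W_sym = np.full((K, K), 0.3)                              # every node starts identical
W_break = 0.8 * np.eye(K) + 0.05 * (1.0 - np.eye(K))      # hypothetical asymmetric start

if __name__ == "__main__":
    for name, W0 in [("symmetric", W_sym), ("symmetry-breaking", W_break)]:
        W = train_two_layer(W0, W_star)
        print(name, "distance to teacher:", np.linalg.norm(W - W_star))
```

With identical rows, every node receives the same gradient, so the rows stay identical and the student can never match K distinct teacher vectors; its Frobenius distance to the teacher is at least √(K − 1). The asymmetric start has no such constraint and is free to move toward the teacher.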