Notes on Radford M. Neal's PhD thesis. *(His writing style is like a physicist's.)*
Reading progress: 46/195
Main Contribution:
1st PART
In Section 2.1, Neal argues that under the following conditions:
1. A Bayesian setting, where we have a prior over parameters and a posterior after seeing data
2. A network with a single hidden layer ("two-layer" NN)
3. Independent Gaussian priors on the weights & biases (can be generalized; see below)
4. The prior standard deviation of the hidden-to-output weights scaled inversely proportional to the square root of the number of hidden units H (equivalently, the variance scales as 1/H)
Then, as the number of hidden units goes to infinity (a numerical sketch follows this list):
1. The output dimensions become independent of one another. For any single output dimension:
2. The prior over the functions represented by the NN converges to a Gaussian process with zero mean and an input-dependent variance.
3. The joint distribution of the outputs at any finite set of inputs converges to a multivariate Gaussian with zero mean and a nontrivial covariance determined by the inputs.
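
A minimal sketch of this convergence (not Neal's code; the prior scales `sigma_u`, `sigma_b`, `omega` and the inputs are hypothetical, chosen for illustration): draw one-hidden-layer tanh networks from the prior with the hidden-to-output standard deviation scaled as omega/sqrt(H), and check that the outputs at a few fixed inputs have roughly zero mean and an input-dependent covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10_000                                 # number of hidden units
n_samples = 5_000                          # draws from the prior
sigma_u, sigma_b, omega = 1.0, 1.0, 1.0    # hypothetical prior scales
x = np.array([-1.0, 0.0, 1.0])             # fixed 1-D inputs

def sample_prior_output():
    u = rng.normal(0.0, sigma_u, size=H)   # input-to-hidden weights
    a = rng.normal(0.0, sigma_b, size=H)   # hidden biases
    h = np.tanh(np.outer(x, u) + a)        # hidden activations, shape (len(x), H)
    # Hidden-to-output std dev scaled as omega / sqrt(H), i.e. variance ~ 1/H,
    # so the output variance stays finite as H grows.
    v = rng.normal(0.0, omega / np.sqrt(H), size=H)
    b = rng.normal(0.0, sigma_b)           # output bias
    return h @ v + b                       # network output at each input

f = np.array([sample_prior_output() for _ in range(n_samples)])
print("empirical mean:\n", f.mean(axis=0))      # approximately zero
print("empirical covariance:\n", np.cov(f.T))   # depends on the inputs x
```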
Generalization: the Gaussian prior on the hidden-to-output weights is not essential; by the Central Limit Theorem, the same limit holds for any i.i.d. weight distribution with zero mean and finite variance.
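
To illustrate the generalization, here is a variation of the sketch above (again with hypothetical scales): swapping the Gaussian hidden-to-output weights for Rademacher (±1) weights with the same 1/sqrt(H) scaling should leave the large-H covariance essentially unchanged.

```python
import numpy as np

# Same sketch as above, but with non-Gaussian hidden-to-output weights:
# Rademacher (+1/-1) weights are i.i.d. with zero mean and finite variance,
# so by the CLT the large-H output distribution is essentially unchanged.
rng = np.random.default_rng(0)
H, n_samples = 10_000, 5_000
x = np.array([-1.0, 0.0, 1.0])

def sample_prior_output():
    u = rng.normal(0.0, 1.0, size=H)
    a = rng.normal(0.0, 1.0, size=H)
    h = np.tanh(np.outer(x, u) + a)
    v = rng.choice([-1.0, 1.0], size=H) / np.sqrt(H)  # non-Gaussian i.i.d. weights
    return h @ v + rng.normal(0.0, 1.0)

f = np.array([sample_prior_output() for _ in range(n_samples)])
print(np.cov(f.T))   # matches the Gaussian-weight covariance for large H
```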
2nd PART
1. A tanh activation gives a prior over smooth functions.
2. A step-function activation (outputs in {+1, -1}) gives a locally Brownian prior: sample functions are continuous but not smooth, behaving like Brownian motion at small scales. (A rough numerical check follows this list.)
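
A rough numerical check of the distinction (not from the thesis; scales and inputs are chosen for illustration): under a locally Brownian prior, the expected squared increment E[(f(x+d) - f(x))^2] grows roughly linearly in |d| at small separations, whereas under a smooth prior it grows roughly like d^2. So doubling d should roughly double the increment variance for the step network and roughly quadruple it for tanh.

```python
import numpy as np

# Monte Carlo estimate of E[(f(x+d) - f(x))^2] under the prior,
# for tanh vs. step hidden units. All scales are illustrative.
rng = np.random.default_rng(1)
H, n_samples = 5_000, 2_000
ds = np.array([0.01, 0.02, 0.04, 0.08])    # small input separations

def increment_var(act):
    xs = np.concatenate(([0.0], ds))       # evaluate f at 0 and at each d
    sq = np.zeros(len(ds))
    for _ in range(n_samples):
        u = rng.normal(0.0, 1.0, size=H)               # input-to-hidden weights
        a = rng.normal(0.0, 1.0, size=H)               # hidden biases
        v = rng.normal(0.0, 1.0, size=H) / np.sqrt(H)  # hidden-to-output weights
        f = act(np.outer(xs, u) + a) @ v               # network output at each x
        sq += (f[1:] - f[0]) ** 2
    return sq / n_samples

for name, act in [("tanh", np.tanh), ("step", np.sign)]:
    # step: variance roughly doubles as d doubles; tanh: roughly quadruples
    print(name, increment_var(act))
```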