Lecture 5 | Convergence in Neural Networks

accuracy (counting correct predictions) is not differentiable! Cross-entropy error is just a differentiable approximation of the accuracy.
Sometimes, minimizing the cross entropy does not maximize the accuracy.
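A minimal numerical sketch of that gap (the labels and predicted probabilities below are made-up toy values): on the same four samples, the second model attains a lower mean cross-entropy yet a lower accuracy.

```python
import numpy as np

y = np.array([1, 1, 1, 0])                # true labels (toy data)

p_a = np.array([0.51, 0.51, 0.51, 0.49])  # model A: barely confident, all 4 correct
p_b = np.array([0.99, 0.99, 0.99, 0.60])  # model B: very confident, 1 wrong

def cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    return np.mean((p > 0.5) == y)

print(cross_entropy(y, p_a), accuracy(y, p_a))  # ~0.67, 1.00
print(cross_entropy(y, p_b), accuracy(y, p_b))  # ~0.24, 0.75 -> lower CE but lower accuracy
```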

Both the perceptron and a sigmoid NN can find the decision boundary successfully.

Now one more point. The perceptron reaches 100% accuracy, while the sigmoid NN cannot reach 100% accuracy (assuming the NN's weights are bounded, e.g. the weight vector has length 1).
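A small illustration of why the weight bound matters (the weight and input values below are assumed toy numbers): with a unit-norm weight vector and a bounded input, the sigmoid can never saturate to exactly 0 or 1, so every sample keeps a strictly positive cross-entropy even when it is classified correctly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.6, 0.8])   # unit-norm weight vector, ||w|| = 1
x = np.array([2.0, 1.0])   # a correctly classified positive sample (toy values)

p = sigmoid(w @ x)         # |w.x| <= ||w|| * ||x||, so p stays strictly below 1
print(p, -np.log(p))       # ~0.88, per-sample cross-entropy ~0.13 > 0 despite a correct prediction
```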

high dimensions -> no one really knows what the loss surface looks like -> only hypotheses

saddle point -> some eigenvalues of the Hessian matrix are positive, and some are negative
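A quick numerical check on an assumed toy function f(x, y) = x² − y²: its Hessian has one positive and one negative eigenvalue, which makes the origin a saddle point.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a constant Hessian: d2f/dx2 = 2, d2f/dy2 = -2, mixed terms 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

print(np.linalg.eigvalsh(H))  # [-2.  2.] -> eigenvalues of mixed sign => saddle point at (0, 0)
```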

R => how fast it converges (see the definition below the list)

R > 1 => getting worse
R = 1 => no better, no worse
R < 1 => getting better
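A hedged write-up of the ratio these notes refer to (the exact quantity the lecture tracks is an assumption; it may be stated on the loss values or on the iterates themselves):

```latex
% convergence-rate ratio between successive iterates w^{(k)} approaching the optimum w^{*}
R = \frac{\left|f\big(w^{(k+1)}\big) - f\big(w^{*}\big)\right|}
         {\left|f\big(w^{(k)}\big) - f\big(w^{*}\big)\right|}
```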


First, consider the quadratic case.

Newton's method: see https://zhuanlan.zhihu.com/p/83320557, chapter 4.1.
Note the difference: in chapter 4.1 Newton's method finds a root of the function itself, whereas here we need a root of the derivative, so applying one more derivative makes the two forms match. The optimal step size for gradient descent is the inverse of the second-order derivative (the Hessian matrix).
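A minimal sketch of that correspondence on an assumed 1D quadratic f(w) = ½aw² + bw: Newton root finding applied to f'(w) uses the step −f'(w)/f''(w), which is gradient descent with the optimal step size η = 1/f''(w), and it lands on the minimum in one update.

```python
# toy 1D quadratic f(w) = 0.5*a*w^2 + b*w (a and b chosen arbitrarily)
a, b = 4.0, -2.0
f_prime  = lambda w: a * w + b   # gradient
f_second = lambda w: a           # second derivative (the 1D "Hessian")

w = 10.0                          # arbitrary starting point
w = w - f_prime(w) / f_second(w)  # Newton step = gradient step with eta = 1/f''(w)
print(w, f_prime(w))              # w = 0.5 (= -b/a), gradient is 0 -> converged in one step
```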

Different dimensions may have different optimal η -> a single learning rate may converge in one direction but diverge in another -> we have to take the minimum of all the per-dimension optimal η values.
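A quick numerical illustration under assumed curvatures (1, 100): the step size that is optimal for the flat direction makes the steep direction blow up, while a step size bounded by the steepest curvature keeps both directions stable.

```python
import numpy as np

# f(w) = 0.5 * (a1*w1^2 + a2*w2^2): an uncoupled quadratic with very different curvatures
a = np.array([1.0, 100.0])

def run_gd(eta, steps=50):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * a * w      # gradient of f is a * w
    return w

print(run_gd(eta=1.0))    # optimal for dimension 1, but dimension 2 diverges (|1 - 100*eta| > 1)
print(run_gd(eta=0.01))   # optimal for dimension 2, both dimensions converge (dimension 1 slowly)
```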

coupled dimensions -> solution: normalization of the data
the quadratic term (of the local Taylor expansion) is captured by the Hessian matrix
if η = 1, with the step scaled by the inverse Hessian, this equals Newton's method
the full Hessian suffers from the curse of dimensionality
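A minimal multi-dimensional sketch of that Newton step on an assumed coupled quadratic f(w) = ½wᵀAw − bᵀw: scaling the gradient by the inverse Hessian and using η = 1 reaches the minimum in a single step.

```python
import numpy as np

# coupled quadratic f(w) = 0.5 * w^T A w - b^T w (A and b are made-up toy values)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # Hessian (constant for a quadratic)
b = np.array([1.0, 1.0])

w = np.array([5.0, -3.0])         # arbitrary starting point
grad = A @ w - b
w = w - np.linalg.solve(A, grad)  # Newton step: eta = 1, step scaled by the inverse Hessian
print(w, A @ w - b)               # gradient is now ~0 -> minimum reached in one step
```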

But we don't need to capture the whole Hessian matrix, right?

The Hessian matrix / quadratic approximation may not point in the right (descent) direction.
There are a number of methods to approximate the Hessian.
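As one concrete example of such a method (a usage sketch, not necessarily what the lecture demonstrates): BFGS builds an estimate of the inverse Hessian from successive gradient differences, e.g. via SciPy.

```python
from scipy.optimize import minimize, rosen, rosen_der

# BFGS approximates the inverse Hessian from gradients instead of forming the full matrix.
x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
print(res.x, res.nit)  # converges to the all-ones minimizer of the Rosenbrock test function
```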

All of these second-order methods fail in high dimensions.


Do BFGS and LM solve the stability issue?

Why not use multi-step information?

inverse of the Hessian -> inverse of the second partial derivatives