accuracy (counting) is not differentiable! cross-entropy error is just a differentiable approximation (proxy) of the accuracy
sometimes, minimizing the cross-entropy does not maximize the accuracy (see the sketch below)
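A tiny numerical sketch of this mismatch (the probabilities are made up, not from the notes): classifier B gets a much lower cross-entropy than A, yet a much worse accuracy:

```python
import numpy as np

y = np.ones(4)                             # true labels: all class 1

def cross_entropy(p):
    return -np.mean(np.log(p))             # CE when every true label is 1

def accuracy(p):
    return np.mean((p > 0.5) == y)         # count correct after thresholding

p_a = np.array([0.51, 0.51, 0.51, 0.01])   # 3/4 correct, one confidently wrong
p_b = np.array([0.49, 0.49, 0.49, 0.99])   # 1/4 correct, one confidently right

print(accuracy(p_a), cross_entropy(p_a))   # 0.75  ~1.66
print(accuracy(p_b), cross_entropy(p_b))   # 0.25  ~0.54  <- lower CE, worse accuracy
```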
both the perceptron and a sigmoid NN can find the decision boundary successfully
Now one more point. Perceptron -> 100% accuracy on separable data, while the sigmoid NN may not reach 100% accuracy when its weights are bounded (e.g. the weight vector is constrained to length 1)
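Why bounded weights matter, in a minimal sketch (the weights and inputs here are hypothetical): with ||w|| = 1 the sigmoid can never output exactly 0 or 1 on bounded inputs, so the cross-entropy stays strictly positive even when every point is on the correct side of the boundary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.6, 0.8])                   # ||w|| = 1 (bounded weights)
x = np.array([[1.0, 2.0],                  # a class-1 point (hypothetical)
              [-1.0, -2.0]])               # a class-0 point (hypothetical)
p = sigmoid(x @ w)
print(p)                                   # [0.90..., 0.09...]: correct side of 0.5,
                                           # but never exactly 1 or 0, so CE > 0 always
```

Because the loss can always be decreased a little more, its minimizer may sit at a boundary that trades a misclassified outlier for more confident predictions on the bulk of the data.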
high dim -> no one knows what the loss surface really looks like -> only hypotheses
saddle point -> some eigenvalues of the Hessian matrix are positive, and some are negative
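Concretely, on the textbook saddle f(x, y) = x^2 - y^2 (my example, not from the notes):

```python
import numpy as np

# f(x, y) = x**2 - y**2: Hessian is constant, saddle at the origin
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])
print(np.linalg.eigvalsh(H))               # [-2.  2.]: mixed signs -> saddle point
```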
R => convergence ratio: (roughly) the ratio of successive errors, i.e. how fast it converges (measured empirically in the sketch below)
R > 1 => getting worse (diverging)
R = 1 => no better, no worse
R < 1 => getting better (converging)
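A sketch measuring R for gradient descent on the 1-D quadratic f(x) = x^2 / 2 (so grad f(x) = x and x* = 0); the step sizes are made up to hit the three regimes:

```python
def ratio_R(lr, x0=1.0, steps=4):
    """Print |x_{k+1} - x*| / |x_k - x*| for gradient descent on f(x) = x**2 / 2."""
    x = x0
    for _ in range(steps):
        x_new = x - lr * x                 # grad f(x) = x, minimum at x* = 0
        print(abs(x_new) / abs(x))         # equals |1 - lr| = R on this quadratic
        x = x_new

ratio_R(0.5)   # R = 0.5 < 1 -> better each step
ratio_R(2.0)   # R = 1.0     -> oscillates, no better no worse
ratio_R(2.5)   # R = 1.5 > 1 -> getting worse
```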
First consider the quadratic case
Newton's method: see https://zhuanlan.zhihu.com/p/83320557, chapter 4.1
Note the difference: in 4.1 Newton's method finds a root of the function itself, while here we want a root of the derivative (a stationary point), so taking one more derivative makes the two forms match. The optimal step size for gradient descent is the inverse of the second derivative (in higher dimensions, the inverse of the Hessian matrix)
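A one-line sanity check of that claim on a 1-D quadratic f(x) = a x^2 / 2 (a is arbitrary here): a gradient step with step size 1/f''(x) = 1/a lands exactly on the minimum:

```python
a = 4.0                        # f(x) = a * x**2 / 2, so f'(x) = a*x, f''(x) = a
x = 3.0                        # arbitrary starting point
x = x - (1.0 / a) * (a * x)    # gradient step with step size 1/f''
print(x)                       # 0.0: the minimum, reached in a single step
```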
different dims may have different optimal step sizes -> a single step size may converge in one direction but diverge in another -> have to take the min of all the per-direction optimal steps
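A 2-D sketch with hypothetical curvatures lam = (1, 10), i.e. f(z) = 0.5 * sum(lam * z**2): the per-direction optimal steps are 1/lam_i; the larger one (1.0) diverges along the stiff direction, while the smaller one (0.1, the min of the optima) converges in both:

```python
import numpy as np

lam = np.array([1.0, 10.0])                # curvatures: f(z) = 0.5 * sum(lam * z**2)

def run(lr, steps=20):
    z = np.array([1.0, 1.0])
    for _ in range(steps):
        z = z - lr * lam * z               # grad f(z) = lam * z
    return z

print(run(1.0))   # optimal for dim 1: x -> 0, but y blows up (factor |1 - 10| = 9)
print(run(0.1))   # optimal for dim 2 = min of both: converges in every direction
```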
but do we really need to capture the whole Hessian matrix?
all these 2nd-order methods fail in high dim: the Hessian is d x d, too big to store, let alone invert
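Back-of-the-envelope arithmetic (the parameter count is hypothetical):

```python
d = 10**7                           # hypothetical number of parameters
hessian_bytes = d * d * 4           # d x d entries in float32
print(hessian_bytes / 1e12, "TB")   # 400.0 TB: hopeless to store, let alone invert
```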
why not use information from multiple past steps??
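One standard answer in this direction is momentum, which folds past gradients into a running velocity; a minimal sketch on the same ill-conditioned quadratic as above (the hyperparameters are made up):

```python
import numpy as np

lam = np.array([1.0, 10.0])    # same ill-conditioned quadratic as above
z = np.array([1.0, 1.0])
v = np.zeros(2)
lr, beta = 0.1, 0.9            # made-up hyperparameters

for _ in range(200):
    g = lam * z                # current gradient
    v = beta * v + g           # running combination of past gradients
    z = z - lr * v             # step along the accumulated direction

print(z)                       # both coordinates near 0: converged to the minimum
```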