http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
Sigmoid
- Squashes numbers to range [0,1]
- Saturated neurons “kill” the gradients
- Sigmoid outputs are not zero-centered
- exp() is somewhat computationally expensive
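A minimal NumPy sketch (my addition, not from the slides) of the saturation point: the local gradient is sigma'(x) = sigma(x)(1 - sigma(x)), which is essentially zero for large |x|, so almost no gradient flows back through a saturated neuron.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}), squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:+6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
# At x = +/-10 the local gradient is ~4.5e-05: the neuron is saturated
# and upstream gradients are effectively "killed" during backprop.
```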
tanh
- Squashes numbers to range [-1,1]
- Zero-centered (nice)
- Saturated neurons “kill” the gradients
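A quick check of the zero-centering point (my own sketch, assuming zero-mean inputs): tanh outputs are roughly zero-mean, while sigmoid outputs are always positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)           # zero-mean inputs

sigmoid_out = 1.0 / (1.0 + np.exp(-x))     # all outputs in (0, 1)
tanh_out = np.tanh(x)                      # outputs in (-1, 1), zero-centered

print("mean of sigmoid outputs:", sigmoid_out.mean())  # ~0.5, never negative
print("mean of tanh outputs:   ", tanh_out.mean())     # ~0.0
```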
ReLU
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
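A small sketch (my addition) of ReLU and its local gradient: the forward pass is just an elementwise max, so it is cheap, and for x > 0 the gradient is exactly 1, so there is no saturation in the positive region (for x < 0 the gradient is 0).

```python
import numpy as np

def relu(x):
    # forward pass: elementwise max, no exp() needed -> very cheap
    return np.maximum(0.0, x)

def relu_grad(x):
    # local gradient: 1 for x > 0 (no saturation), 0 for x < 0
    return (x > 0).astype(x.dtype)

x = np.array([-10.0, -1.0, 0.5, 10.0, 100.0])
print("relu:", relu(x))
print("grad:", relu_grad(x))
# Unlike sigmoid/tanh, the gradient stays at 1 no matter how large x gets,
# which is one reason training tends to converge faster in practice.
```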
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture7.pdf
- Adam is a good default choice in most cases (see the update sketch below)
- If you can afford to do full-batch updates, then try out L-BFGS (and don’t forget to disable all sources of noise)
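For reference, a minimal NumPy sketch of the Adam update (first and second moment estimates with bias correction). The hyperparameters below (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used defaults, not something prescribed by these notes.

```python
import numpy as np

def adam_update(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum + per-parameter scaling, with bias correction."""
    m = beta1 * m + (1 - beta1) * dw           # first moment (moving avg of grads)
    v = beta2 * v + (1 - beta2) * dw * dw      # second moment (moving avg of squared grads)
    m_hat = m / (1 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 5001):
    dw = 2 * w
    w, m, v = adam_update(w, dw, m, v, t)
print(w)  # all entries near 0 (oscillating within roughly +/- lr)
```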