- quadratic cost function:
- cross-entropy cost function: We human beings often learn fastest when we’re badly wrong about something. But an artificial neuron has far more difficulty learning when it’s badly wrong than when it’s just a little wrong. The cross-entropy avoids this slowdown in learning because the rate at which the weights learn is controlled by σ(z)−y, the error in the output. In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons.
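A minimal NumPy sketch of the point above: for a saturated sigmoid output neuron that is badly wrong, the output-layer error under the quadratic cost carries a σ′(z) factor that kills the gradient, while under the cross-entropy it is simply σ(z)−y. (The z and y values are illustrative.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# A badly-wrong, saturated neuron: target y = 0, but z is large and
# positive, so the output sigma(z) is close to 1.
z, y = 5.0, 0.0
a = sigmoid(z)

# Output-layer error under each cost function:
delta_quadratic = (a - y) * sigmoid_prime(z)  # dampened by sigma'(z): tiny
delta_cross_entropy = a - y                   # proportional to the error: large

print(delta_quadratic)      # learning slows down
print(delta_cross_entropy)  # learning stays fast
```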
Activation function (squashing function):
- tanh: It may be a better activation function than the sigmoid function. Under the same conditions, you’ll find the tanh networks train a little faster, but the final accuracies are very similar.
- rectified linear units: Experiments show that networks based on rectified linear units consistently outperform networks based on sigmoid activation functions. There appears to be a real gain in moving to rectified linear units for many problems.
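The three activation functions above, side by side in NumPy. One useful fact behind the tanh comparison: tanh is just a rescaled sigmoid, tanh(z) = 2σ(2z) − 1, which is why the two networks behave so similarly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)  # rescaled sigmoid: tanh(z) = 2*sigmoid(2z) - 1

def relu(z):
    return np.maximum(0.0, z)  # rectified linear unit: max(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # squashes into (0, 1)
print(tanh(z))     # squashes into (-1, 1)
print(relu(z))     # unbounded above; exactly zero for negative inputs
```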
Methods for initializing the weights
- Initializing with normalized Gaussians
- 1/sqrt(n_in) approach: Gaussians with mean 0 and standard deviation 1/sqrt(n_in), where n_in is the number of input weights to the neuron [2012, Yoshua Bengio]
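A quick sketch of why the 1/sqrt(n_in) scaling helps (n_in = 1000 and the all-ones input are illustrative): with standard N(0, 1) weights, the weighted input z = w·x has standard deviation around sqrt(n_in), so the sigmoid saturates and learning slows; scaling by 1/sqrt(n_in) keeps z of order 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000                # number of input connections to one neuron
x = np.ones(n_in)          # imagine all inputs active

# Naive approach: weights ~ N(0, 1).  z = w.x then has std ~ sqrt(n_in),
# so |z| is typically large and the sigmoid neuron saturates.
w_naive = rng.standard_normal(n_in)

# Normalized approach: weights ~ N(0, 1/n_in), i.e. std 1/sqrt(n_in),
# so z stays of order 1 and the neuron does not saturate.
w_scaled = rng.standard_normal(n_in) / np.sqrt(n_in)

print(abs(w_naive @ x))    # typically around sqrt(1000), i.e. ~30
print(abs(w_scaled @ x))   # typically of order 1
```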
Hyper-parameters:
- Mini-batch size
- Eta: learning rate η
- Lambda: The regularization parameter λ
Overfitting is a major problem in neural networks. Early stopping, increasing the amount of training data, and regularization are approaches to reduce overfitting. Below are regularization techniques:
- L1 (lasso): L1 norm
- L2 (ridge): L2 norm (Euclidean distance)
- Dropout: Unlike L1 and L2 regularization, dropout doesn’t rely on modifying the cost function; instead, it modifies the network itself, randomly (and temporarily) deleting hidden neurons, over and over. It’s rather like we’re training many different neural networks, so the dropout procedure is like averaging the effects of a very large number of different networks. Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.
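A sketch of the dropout idea for one layer of hidden activations. This uses the "inverted dropout" variant (scale the survivors at training time so no rescaling is needed at test time); the drop probability of 0.5 is the commonly used illustrative value.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: during training, randomly zero a fraction
    p_drop of the hidden activations, and scale the survivors by
    1/(1 - p_drop) so the expected activation is unchanged.
    At test time the activations pass through untouched."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # True = neuron kept
    return activations * mask / (1.0 - p_drop)

a = np.ones(10)
print(dropout(a))                  # roughly half zeroed, survivors scaled to 2.0
print(dropout(a, training=False))  # unchanged at test time
```

Each training mini-batch draws a fresh mask, which is what makes repeated dropout resemble training (and averaging) many different thinned networks.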
Expanding the training data(for MNIST):
- A simple way of expanding the training data is to displace each training image by a single pixel. This almost trivial change gives a substantial improvement in classification accuracy.
- [2003 Simard, Steinkraus and Platt] improved MNIST performance to 99.6 percent using two convolutional-pooling layers, followed by a fully-connected hidden layer with 100 neurons. They didn’t have the advantage of rectified linear units; the key to their improved performance was expanding the training data, which they did by rotating, translating, and skewing the MNIST training images. They also developed a process of “elastic distortion”, a way of emulating the random oscillations hand muscles undergo when a person is writing.
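The one-pixel displacement idea can be sketched in a few lines of NumPy. The 5×5 toy image stands in for a 28×28 MNIST digit; `np.roll` wraps pixels around, so the wrapped edge is blanked to mimic the background.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Displace a 2-D image by (dy, dx) pixels, padding with zeros."""
    shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    if dy > 0:
        shifted[:dy, :] = 0    # blank the row(s) that wrapped around
    elif dy < 0:
        shifted[dy:, :] = 0
    if dx > 0:
        shifted[:, :dx] = 0    # blank the column(s) that wrapped around
    elif dx < 0:
        shifted[:, dx:] = 0
    return shifted

# A toy 5x5 "image" with one bright pixel (MNIST images are 28x28).
img = np.zeros((5, 5))
img[2, 2] = 1.0

# Four one-pixel displacements -> four extra training examples per image.
augmented = [shift_image(img, dy, dx)
             for dy, dx in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
print(augmented[0])  # bright pixel moved down one row
```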
The origin of the term “softmax”: a “softened” version of the maximum function.
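A short illustration of the "softened maximum" idea: softmax turns a vector of inputs into a smooth probability distribution that favours the largest input, and sharpening the inputs makes it approach a hard (one-hot) maximum. The max-subtraction is the standard numerical-stability trick.

```python
import numpy as np

def softmax(z):
    """Softmax with max-subtraction for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))        # smooth distribution favouring the largest input

# Sharpening the inputs pushes softmax toward a hard maximum (one-hot):
print(softmax(100 * z))  # nearly [0, 0, 1]
```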
Using an ensemble of networks:
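One common form of ensembling is a plain majority vote over several independently trained networks. A minimal sketch, assuming each "network" is any callable mapping an input to a predicted class index (a hypothetical interface; the three toy voters stand in for trained models):

```python
import numpy as np

def ensemble_predict(networks, x):
    """Majority vote: each network votes for a class index, and the
    most common vote wins.  Ties go to the lower class index."""
    votes = np.array([net(x) for net in networks])
    return int(np.bincount(votes).argmax())

# Three toy "networks" that disagree on one input:
nets = [lambda x: 7, lambda x: 7, lambda x: 1]
print(ensemble_predict(nets, x=None))  # -> 7 (two votes against one)
```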
Other models of artificial neuron:
- RBF: Radial Basis Function [Broomhead and Lowe, 1988]
- ART: Adaptive Resonance Theory [Carpenter and Grossberg, 1987]
- SOM: Self-Organizing Map [Kohonen, 1982]
- Cascade-Correlation: [Fahlman and Lebiere, 1990]
- Elman: [Elman, 1990]
- Boltzmann: [Ackley et al., 1985]
Deep Neural Network:
- DBN-deep belief network: [Hinton et al., 2006]
- CNN-convolutional neural network: LeNet-5 [LeCun et al., 1998]
- 《机器学习》 (Machine Learning), by Zhou Zhihua, Tsinghua University Press.
References:
- [2012, Yoshua Bengio] *Practical Recommendations for Gradient-Based Training of Deep Architectures*, by Yoshua Bengio (2012).
- [2003 Simard, Steinkraus and Platt] *Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis*, by Patrice Simard, Dave Steinkraus, and John Platt (2003).
- [LeCun et al., 1998] *Gradient-based learning applied to document recognition*, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).