Lec 06 Training Neural Networks I

Activation Functions

Sigmoid: squashes numbers to range [0,1].

\[\sigma(x)=\frac{1}{1+e^{-x}}\]

3 problems:

  1. Saturated neurons "kill" the gradients. (When x is very negative or very positive, the sigmoid is flat, so the local gradient on x is nearly zero, which kills the upstream gradient; see the sketch after this list.)
  2. Sigmoid outputs are not zero-centered. (When the inputs to a neuron are always positive, the gradients on w are either all positive or all negative, because the local gradient on each weight is its input; this causes inefficient zig-zag updates.)
  3. exp() is a bit computationally expensive.
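
A minimal NumPy sketch (the helper names are my own) showing how flat the sigmoid becomes at large |x|, and therefore how small its local gradient gets:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:+5.1f}  sigmoid={sigmoid(x):.4f}  grad={sigmoid_grad(x):.6f}")
# At x = +/-10 the local gradient is ~4.5e-5, so almost no signal flows backward.
```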

tanh: squashes numbers to range [-1,1].

tanh(x) is zero-centered, but still kills gradients when saturated.

ReLU:

\[f(x)=\max(0,x)\]

ReLU does not saturate in the +region, is very computationally efficient, converges much faster than sigmoid/tanh in practice, and is more biologically plausible.

2 problems:

  1. Not zero-centered.
  2. Kills the gradients in the -region.

Active ReLU: its hyperplane crosses the data cloud, so some data falls in the +region and produces a positive output (and a gradient).
Dead ReLU: its hyperplane lies off the data cloud, so the data always falls in the -region; the neuron never activates and never updates.

In practice, people like to initialize ReLU neurons with slightly positive bias (e.g. 0.01).
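
A small NumPy sketch (my own helper names) of the ReLU forward pass and its gradient, illustrating why a neuron whose inputs always land in the -region never receives a gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_backward(x, upstream):
    # the gradient passes through only where x > 0; it is zero in the -region
    return upstream * (x > 0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                            # [0.  0.  0.  0.5 3. ]
print(relu_backward(x, np.ones_like(x)))  # [0. 0. 0. 1. 1.] -> no update where x <= 0
```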

Leaky ReLU:

\[f(x)=\max(0.01x, x)\]

Compared to ReLU, leaky ReLU will not "die" in the negative region.

Parametric Rectifier (PReLU):

\[f(x)=\max(\alpha x, x)\]

Treat \(\alpha\) as a parameter that we can backprop and learn.

Exponential Linear Units (ELU):

\[ f(x)=\begin{cases}x &\text{if}\, x>0 \\ \alpha (\text{exp}(x)-1) &\text{if}\, x\le 0\end{cases} \]

ELU has all the benefits of ReLU and gives outputs closer to zero mean; compared with Leaky ReLU, its negative saturation regime adds some robustness to noise.

Problem: computation requires exp().

Maxout "Neuron": contains multiple linear transformations. Its output is the maximum among them.

\[\max(w_1^Tx+b_1,w_2^Tx+b_2)\]

Problem: doubles the number of parameters per neuron.
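
A sketch of these ReLU variants in NumPy (the function names and toy inputs are mine, not from the lecture):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small negative slope keeps a nonzero gradient for x < 0
    # (PReLU is the same formula, but alpha is learned by backprop)
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smoothly saturates toward -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two linear transforms; twice the parameters per neuron
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(elu(x))
```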

In practice:

  • Use ReLU. Be careful with your learning rates.
  • Try out Leaky ReLU / Maxout / ELU.
  • Try out tanh but don't expect much.
  • Don't use sigmoid.

Data Preprocessing

Data preprocessing mainly aims for zero-mean data: subtract the per-dimension mean, and optionally divide by the per-dimension standard deviation to normalize.

original data → zero-centered data → normalized data

May also see PCA and Whitening of the data: original data → decorrelated data → whitened data.
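
A NumPy sketch of these preprocessing steps on toy data (the variable names are mine):

```python
import numpy as np

X = np.random.randn(100, 3) * 5.0 + 2.0              # toy data of shape (N, D)

X_centered = X - X.mean(axis=0)                      # zero-centered data
X_normalized = X_centered / X_centered.std(axis=0)   # normalized data (unit std per dimension)

# PCA and whitening (less common for images)
cov = X_centered.T @ X_centered / X.shape[0]         # data covariance matrix
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X_centered @ U                      # decorrelated data
X_whitened = X_decorrelated / np.sqrt(S + 1e-5)      # whitened data
```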

Weight Initialization

What happens when W=0 initialization is used?
All neurons compute the same output, receive the same gradients, and update in the same way, so nothing breaks the symmetry.

First idea: set all weights to small random numbers (Gaussian with zero mean and 1e-2 standard deviation): W = 0.01 * np.random.randn(D, H)
Problem: in deeper networks, the signal shrinks at every layer, so eventually all activations become zero (and the gradients on the weights become tiny as well).

Second idea: use 1.0 instead of 0.01. But then almost all neurons are completely saturated at either 1 or -1 (using tanh), so the gradients are all zero.

Xavier initialization:

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

This is reasonable when using tanh, but breaks when using ReLU (which zeroes out half of the inputs and halves the variance). Dividing by np.sqrt(fan_in / 2) instead of np.sqrt(fan_in) compensates for this.
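
A sketch of the kind of experiment shown in the lecture: push unit-Gaussian data through a stack of layers and watch the activation statistics (the helper function is my own):

```python
import numpy as np

def activation_stats(scale_fn, nonlin, num_layers=10, dim=512):
    """Propagate unit-Gaussian data through a stack of layers and print activation std."""
    x = np.random.randn(1000, dim)
    for i in range(num_layers):
        W = np.random.randn(dim, dim) * scale_fn(dim)
        x = nonlin(x @ W)
        print(f"layer {i + 1}: std = {x.std():.4f}")

relu = lambda z: np.maximum(0, z)

activation_stats(lambda fan_in: 1.0 / np.sqrt(fan_in), np.tanh)   # Xavier + tanh: roughly stable
activation_stats(lambda fan_in: 1.0 / np.sqrt(fan_in), relu)      # Xavier + ReLU: std shrinks layer by layer
activation_stats(lambda fan_in: 1.0 / np.sqrt(fan_in / 2), relu)  # the /2 correction keeps it stable
```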

Batch Normalization

Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:

\[\hat{x}^{(k)}=\frac{x^{(k)}-\mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}\]

Consider a mini-batch of N inputs, each of dimension D. First compute the empirical mean and variance independently for each dimension (over the batch), then normalize.

BN is usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity.

The network can squash or shift the range if it wants to, via a learnable scale and shift:

\[y^{(k)}=\gamma^{(k)}\hat{x}^{(k)}+\beta^{(k)}\]

The network can learn \(\gamma^{(k)}=\sqrt{\mathrm{Var}[x^{(k)}]}\), \(\beta^{(k)}=\mathrm{E}[x^{(k)}]\) to recover the identity mapping.
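
A minimal training-time forward pass for batch normalization in NumPy (a sketch with my own function name; it omits the running averages used at test time):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (N, D): N examples in the batch, D features per example
    mean = x.mean(axis=0)                     # empirical mean, per dimension
    var = x.var(axis=0)                       # empirical variance, per dimension
    x_hat = (x - mean) / np.sqrt(var + eps)   # unit Gaussian per dimension
    return gamma * x_hat + beta               # learnable scale and shift

N, D = 32, 100
x = np.random.randn(N, D) * 3.0 + 5.0
out = batchnorm_forward(x, gamma=np.ones(D), beta=np.zeros(D))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # ~0 mean and ~1 std per dimension
```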

Babysitting the Learning Process

  1. Preprocess the data.
  2. Choose the architecture.
  3. Double check that the loss is reasonable (see the sketch after this list).
  4. Try training. Start with small regularization and find learning rate that makes the loss go down.
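
A quick sanity check for step 3, assuming a softmax classifier: with small random weights and no regularization, the initial loss should be about log(C) for C classes (the numbers below are toy values):

```python
import numpy as np

num_classes, batch_size = 10, 64
scores = np.random.randn(batch_size, num_classes) * 0.01   # near-zero scores from tiny weights
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y = np.random.randint(num_classes, size=batch_size)
loss = -np.log(probs[np.arange(batch_size), y]).mean()
print(loss, np.log(num_classes))   # both should be close to ~2.3 for 10 classes
# Adding regularization should make this initial loss go up slightly.
```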

A rough range of learning rates we should be cross-validating over is somewhere in [1e-3 ... 1e-5].

Hyperparameter Optimization

Hyperparameters to play with: network architecture, learning rate, decay schedule, update type, regularization...

Cross-validation strategy: coarse → fine cross-validation in stages.

Note it's best to optimize in log space.

In practice, it's better to sample with random layout than with grid layout.
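
A sketch of random search with log-space sampling (train_and_evaluate is a hypothetical placeholder for your training loop):

```python
import numpy as np

# Sample hyperparameters uniformly in the exponent (log space), not on a grid.
for _ in range(10):
    lr = 10 ** np.random.uniform(-5, -3)    # learning rate in [1e-5, 1e-3]
    reg = 10 ** np.random.uniform(-4, 0)    # regularization strength in [1e-4, 1]
    print(f"lr = {lr:.2e}, reg = {reg:.2e}")
    # acc = train_and_evaluate(lr, reg)     # hypothetical: train briefly, record validation accuracy
```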