Lec 07 Training Neural Networks II
Improve Optimization
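For reference, a minimal sketch of the vanilla SGD update loop; the toy quadratic loss and its `compute_gradient` helper are assumptions for illustration:

```python
import numpy as np

# Toy loss f(x) = 0.5 * (x0^2 + 50 * x1^2): much more sensitive along x1 than x0.
def compute_gradient(x):
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
learning_rate = 1e-2
for _ in range(100):
    dx = compute_gradient(x)
    x -= learning_rate * dx   # step along the negative gradient
```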
Problem:
- The loss function is sensitive in one direction but not in another; SGD will exhibit zigzag behavior.
- The loss function has a local minimum or saddle point (in high-dimensional spaces, the loss goes up in some directions and down in others); SGD will get stuck.
- Our gradients come from minibatches so they can be noisy.
SGD + Momentum: build up a velocity as a running mean of gradients, and step in the direction of the velocity instead of the raw gradient.
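A sketch of the momentum update on the same toy setup; `rho` (the friction/momentum coefficient) and the helper are assumed values:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)                     # velocity
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx = compute_gradient(x)
    vx = rho * vx + dx                    # running mean of gradients
    x -= learning_rate * vx               # step in the direction of the velocity
```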
Nesterov Momentum: first step in the direction of the velocity, then compute the gradient at that look-ahead position.
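A sketch of Nesterov momentum written as an update on `x` directly (the common change-of-variables form); the hyperparameters are assumed values:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    old_v = v
    v = rho * v - learning_rate * compute_gradient(x)
    x += -rho * old_v + (1 + rho) * v     # look-ahead update expressed in terms of x
```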
AdaGrad: add element-wise scaling of the gradient based on the historical sum of squares in each dimension.
The sum keeps growing, so the steps keep shrinking. This is good in the convex case but problematic in the non-convex case, where progress can stop too early.
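A sketch of AdaGrad on the same assumed toy gradient; the `1e-7` term avoids division by zero:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1e-2
for _ in range(100):
    dx = compute_gradient(x)
    grad_squared += dx * dx               # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```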
RMSProp: let the squared-gradient estimate decay (a "leaky" AdaGrad), so the step size does not shrink toward zero.
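A sketch of RMSProp; the decay rate of 0.99 is an assumed typical value:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 1e-2, 0.99
for _ in range(100):
    dx = compute_gradient(x)
    # "leaky" sum of squares: the old estimate decays instead of growing forever
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```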
Adam: combine momentum and AdaGrad/RMSProp.
To avoid a very large step at the beginning (because beta2 is close to one and second_moment is small in the first iterations), add bias-correction terms.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models.
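A sketch of the full Adam update with bias correction, using the hyperparameters above on the same assumed toy gradient:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
learning_rate, beta1, beta2 = 1e-3, 0.9, 0.999
for t in range(1, 101):                   # t starts at 1 so the corrections are defined
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx     # RMSProp-style
    first_unbias = first_moment / (1 - beta1 ** t)                    # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```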
Learning rate can decay over time, which is especially common with SGD+Momentum. E.g., exponential decay: \(\alpha=\alpha_0e^{-kt}\).
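A tiny sketch of the exponential schedule; \(\alpha_0\) and \(k\) are assumed values:

```python
import math

alpha_0, k = 1e-3, 0.05                   # assumed initial rate and decay constant
for t in range(10):                       # t = epoch index
    learning_rate = alpha_0 * math.exp(-k * t)   # alpha = alpha_0 * e^(-k t)
```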
Second-order optimization:
- First-order optimization: use the gradient to form a linear approximation; step to minimize the approximation.
- Second-order optimization: use the gradient and the Hessian to form a quadratic approximation; step to the minimum of the approximation.
Second-order Taylor expansion: \(J(\theta)\approx J(\theta_0)+(\theta-\theta_0)^T\nabla_\theta J(\theta_0)+\frac{1}{2}(\theta-\theta_0)^TH(\theta-\theta_0)\).
Solving for the critical point we obtain the Newton parameter update: \(\theta^*=\theta_0-H^{-1}\nabla_\theta J(\theta_0)\).
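A minimal sketch of one Newton step on an assumed toy quadratic loss (for a quadratic, a single step lands on the minimum); solving the linear system avoids forming the inverse explicitly:

```python
import numpy as np

# Toy quadratic loss f(x) = 0.5 * x^T A x, so gradient = A x and Hessian = A.
A = np.diag([1.0, 50.0])
x = np.array([1.0, 1.0])

g = A @ x                                 # gradient at x
H = A                                     # Hessian at x
x = x - np.linalg.solve(H, g)             # Newton step: x <- x - H^{-1} g
```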
Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (\(O(n^3)\)), approximate the inverse Hessian with rank-1 updates over time (\(O(n^2)\) each).
L-BFGS: does not form/store the full inverse Hessian.
Improve Performance
Model ensembles: train multiple independent models and average their results at test time. Instead of training independent models, we can also use multiple snapshots of a single model during training.
Polyak averaging: keep a moving average of the parameter vector and use that at test time.
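A minimal sketch of keeping such a moving average; the decay of 0.999 and the toy parameter vector are assumptions:

```python
import numpy as np

decay = 0.999
params = np.random.randn(100)             # current model parameters (toy)
params_avg = params.copy()                # smoothed copy used at test time

# call after every optimization step
params_avg = decay * params_avg + (1 - decay) * params
```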
Regularization: add a term to the loss.
Dropout: in each forward pass, randomly set some neurons to zero. Probability of dropping is a hyperparameter.
Dropout amounts to training a large ensemble of models that share parameters; each binary mask is one model. At test time, multiply the activations by the keep probability (1 minus the drop probability) so their expected values match training.
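A sketch of a (non-inverted) dropout forward pass for one layer of activations; the drop probability and toy activations are assumptions:

```python
import numpy as np

p_drop = 0.5                              # assumed probability of dropping a unit

def dropout_forward(x, train=True):
    if train:
        mask = np.random.rand(*x.shape) >= p_drop   # binary mask: True = keep
        return x * mask                   # each mask is one "model" of the ensemble
    return x * (1 - p_drop)               # test time: scale by the keep probability

h = np.random.randn(4, 8)                 # toy activations
out_train = dropout_forward(h, train=True)
out_test = dropout_forward(h, train=False)
```

In practice, "inverted dropout" divides by the keep probability at training time instead, which leaves the test-time forward pass untouched.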
Data augmentation: random flips, crops, scales, color jitters of images.
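A sketch of such a pipeline with torchvision; the specific transforms and parameters are assumptions:

```python
from torchvision import transforms

# Assumed augmentation pipeline for ImageNet-style training images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                                    # random flips
    transforms.RandomResizedCrop(224),                                    # crops + scales
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4), # color jitter
    transforms.ToTensor(),
])
```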
DropConnect: randomly zero out some of the values of the weight matrix.
Fractional max pooling, stochastic depth, ...
Transfer Learning
Situation: pretrain on the large ImageNet dataset, then reuse the model on a small target dataset.
Freeze the weights of the earlier layers, and reinitialize (and train) only the last layer's weight matrix.
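A sketch of this freeze-and-replace setup in PyTorch, assuming a ResNet-18 pretrained on ImageNet and a 10-class target dataset (newer torchvision versions prefer the `weights=` argument over `pretrained=True`):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # ImageNet-pretrained backbone
for param in model.parameters():
    param.requires_grad = False                   # freeze all pretrained weights

# Reinitialize only the last layer; only its parameters will be trained.
model.fc = nn.Linear(model.fc.in_features, 10)
```

With more target data, unfreeze and finetune a few of the later layers as well, typically with a smaller learning rate.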
|  | very similar dataset | very different dataset |
|---|---|---|
| very little data | Use Linear Classifier on top layer | You're in trouble... Try linear classifier from different stages |
| quite a lot of data | Finetune a few layers | Finetune a larger number of layers |