Lec 07 Training Neural Networks II
Improve Optimization
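For reference, a minimal sketch of the vanilla SGD update loop; the toy quadratic loss and its `compute_gradient` helper are assumptions for illustration:

```python
import numpy as np

# Toy loss f(x) = 0.5 * (x0^2 + 50 * x1^2): much more sensitive along x1 than x0.
def compute_gradient(x):
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
learning_rate = 1e-2
for _ in range(100):
    dx = compute_gradient(x)
    x -= learning_rate * dx   # step along the negative gradient
```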
Problem:
- The loss function is sensitive in one direction but not in another; SGD will exhibit zigzag behavior.
- The loss function has a local minimum or saddle point (in high-dimensional spaces, the loss goes up in some directions and down in others); SGD will get stuck.
- Our gradients come from minibatches so they can be noisy.
SGD + Momentum: build up a velocity as a running mean of gradients, and step in the direction of the velocity instead of the raw gradient.
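A sketch of the momentum update on the same toy setup; `rho` (the friction/momentum coefficient) and the helper are assumed values:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)                     # velocity
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx = compute_gradient(x)
    vx = rho * vx + dx                    # running mean of gradients
    x -= learning_rate * vx               # step in the direction of the velocity
```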
Nesterov Momentum: first step in the direction of the velocity, then compute the gradient at that look-ahead position.
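A sketch of Nesterov momentum written as an update on `x` directly (the common change-of-variables form); the hyperparameters are assumed values:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    old_v = v
    v = rho * v - learning_rate * compute_gradient(x)
    x += -rho * old_v + (1 + rho) * v     # look-ahead update expressed in terms of x
```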
AdaGrad: add element-wise scaling of the gradient based on the historical sum of squares in each dimension.
The sum keeps growing, so the steps keep shrinking. This is good in the convex case but problematic in the non-convex case, where progress can stop too early.
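A sketch of AdaGrad on the same assumed toy gradient; the `1e-7` term avoids division by zero:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1e-2
for _ in range(100):
    dx = compute_gradient(x)
    grad_squared += dx * dx               # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```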
RMSProp: let the squared-gradient estimate decay (a "leaky" AdaGrad), so the step size does not shrink toward zero.
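A sketch of RMSProp; the decay rate of 0.99 is an assumed typical value:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 1e-2, 0.99
for _ in range(100):
    dx = compute_gradient(x)
    # "leaky" sum of squares: the old estimate decays instead of growing forever
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```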
Adam: combine momentum and AdaGrad/RMSProp.
To avoid a very large step at the beginning (because beta2 is close to one and second_moment is small in the first iterations), add bias-correction terms.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models.
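A sketch of the full Adam update with bias correction, using the hyperparameters above on the same assumed toy gradient:

```python
import numpy as np

def compute_gradient(x):                  # same toy gradient as above
    return np.array([x[0], 50.0 * x[1]])

x = np.array([1.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
learning_rate, beta1, beta2 = 1e-3, 0.9, 0.999
for t in range(1, 101):                   # t starts at 1 so the corrections are defined
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            # momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx     # RMSProp-style
    first_unbias = first_moment / (1 - beta1 ** t)                    # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```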
Learning rate can decay over time, which is especially common with SGD+Momentum. E.g., exponential decay: \(\alpha=\alpha_0e^{-kt}\).
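A tiny sketch of the exponential schedule; \(\alpha_0\) and \(k\) are assumed values:

```python
import math

alpha_0, k = 1e-3, 0.05                   # assumed initial rate and decay constant
for t in range(10):                       # t = epoch index
    learning_rate = alpha_0 * math.exp(-k * t)   # alpha = alpha_0 * e^(-k t)
```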
Second-order optimization:
- First-order optimization: use the gradient to form a linear approximation; step to minimize the approximation.
- Second-order optimization: use the gradient and the Hessian to form a quadratic approximation; step to the minimum of the approximation.
Second-order Taylor expansion: \(J(\theta)\approx J(\theta_0)+(\theta-\theta_0)^T\nabla_\theta J(\theta_0)+\frac{1}{2}(\theta-\theta_0)^TH(\theta-\theta_0)\).
Solving for the critical point we obtain the Newton parameter update: \(\theta^*=\theta_0-H^{-1}\nabla_\theta J(\theta_0)\).
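A minimal sketch of one Newton step on an assumed toy quadratic loss (for a quadratic, a single step lands on the minimum); solving the linear system avoids forming the inverse explicitly:

```python
import numpy as np

# Toy quadratic loss f(x) = 0.5 * x^T A x, so gradient = A x and Hessian = A.
A = np.diag([1.0, 50.0])
x = np.array([1.0, 1.0])

g = A @ x                                 # gradient at x
H = A                                     # Hessian at x
x = x - np.linalg.solve(H, g)             # Newton step: x <- x - H^{-1} g
```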
Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (\(O(n^3)\)), approximate the inverse Hessian with rank-1 updates over time (\(O(n^2)\) each).
L-BFGS: does not form/store the full inverse Hessian.
Improve Performance
Model ensembles: train multiple independent models and average their results at test time. Instead of training independent models, we can also use multiple snapshots of a single model during training.
Polyak averaging: keep a moving average of the parameter vector and use that at test time.
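A minimal sketch of keeping such a moving average; the decay of 0.999 and the toy parameter vector are assumptions:

```python
import numpy as np

decay = 0.999
params = np.random.randn(100)             # current model parameters (toy)
params_avg = params.copy()                # smoothed copy used at test time

# call after every optimization step
params_avg = decay * params_avg + (1 - decay) * params
```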
Regularization: add a term to the loss.
Dropout: in each forward pass, randomly set some neurons to zero. Probability of dropping is a hyperparameter.
Dropout amounts to training a large ensemble of models that share parameters; each binary mask is one model. At test time, multiply the activations by the keep probability (1 minus the drop probability) so their expected values match training.
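A sketch of a (non-inverted) dropout forward pass for one layer of activations; the drop probability and toy activations are assumptions:

```python
import numpy as np

p_drop = 0.5                              # assumed probability of dropping a unit

def dropout_forward(x, train=True):
    if train:
        mask = np.random.rand(*x.shape) >= p_drop   # binary mask: True = keep
        return x * mask                   # each mask is one "model" of the ensemble
    return x * (1 - p_drop)               # test time: scale by the keep probability

h = np.random.randn(4, 8)                 # toy activations
out_train = dropout_forward(h, train=True)
out_test = dropout_forward(h, train=False)
```

In practice, "inverted dropout" divides by the keep probability at training time instead, which leaves the test-time forward pass untouched.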
Data augmentation: random flips, crops, scales, color jitters of images.
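A sketch of such a pipeline with torchvision; the specific transforms and parameters are assumptions:

```python
from torchvision import transforms

# Assumed augmentation pipeline for ImageNet-style training images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                                    # random flips
    transforms.RandomResizedCrop(224),                                    # crops + scales
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4), # color jitter
    transforms.ToTensor(),
])
```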
DropConnect: randomly zero out some of the values of the weight matrix.
Fractional max pooling, stochastic depth, ...
Transfer Learning
Situation: pretrain on the large ImageNet dataset, then reuse the model on a small target dataset.
Freeze the weights of the earlier layers, and reinitialize (and train) only the last layer's weight matrix.
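A sketch of this freeze-and-replace setup in PyTorch, assuming a ResNet-18 pretrained on ImageNet and a 10-class target dataset (newer torchvision versions prefer the `weights=` argument over `pretrained=True`):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # ImageNet-pretrained backbone
for param in model.parameters():
    param.requires_grad = False                   # freeze all pretrained weights

# Reinitialize only the last layer; only its parameters will be trained.
model.fc = nn.Linear(model.fc.in_features, 10)
```

With more target data, unfreeze and finetune a few of the later layers as well, typically with a smaller learning rate.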
|  | very similar dataset | very different dataset |
|---|---|---|
| very little data | Use Linear Classifier on top layer | You're in trouble... Try linear classifier from different stages |
| quite a lot of data | Finetune a few layers | Finetune a larger number of layers |