Lec 03 Loss Functions and Optimization


Loss Function: SVM Loss

For the \(N\) training samples \((x_i, y_i)\), define the per-sample loss \(L_i\) and the total loss as the average

\[L = \frac{1}{N} \sum_i L_i(f(x_i, W), y_i)\]

Our goal is to minimize this \(L\).

SVM loss: each class gets a score. The correct class score should be higher than all others by at least a margin; outside the margin the loss is zero, inside the margin the loss grows with the violation.

E.g., with a margin of 1 (\(s\) denotes the vector of scores):

\[L_i = \displaystyle \sum_{j \neq y_i} \begin{cases} 0, & s_{y_i} \ge s_j + 1 \\ s_j - s_{y_i} + 1, & \text{otherwise} \end{cases}\]

Early in training, scores are near 0, so loss is roughly (#classes - 1). The minimum loss is 0; if a \(W\) achieves zero loss then any scaled version like \(2W\) also does.
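As a minimal sketch of the formula above (assuming NumPy; the function name and the example scores are made up for illustration):

Python

import numpy as np

def svm_loss_single(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one sample: sum over j != y of max(0, s_j - s_y + margin)."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the j == y_i term is excluded from the sum
    return margins.sum()

# Made-up scores for a 3-class problem with correct class 0:
# only the second class violates the margin, by 5.1 - 3.2 + 1 = 2.9
print(svm_loss_single(np.array([3.2, 5.1, -1.7]), y=0))  # ≈ 2.9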

We do not only care about fitting the training labels; we want good test performance. A model that fits the training data too aggressively may generalize poorly. Therefore we add a regularization term \(\lambda R(W)\) to the loss, where \(\lambda\) trades off fitting the training data against keeping the model simple, so that \(W\) does not become arbitrarily complex.

Common regularizers:

  • L2: \(R(W) = \sum_k \sum_l W_{k,l}^2\)
  • L1: \(R(W) = \sum_k \sum_l |W_{k,l}|\)
  • Elastic net (L1 + L2): \(R(W) = \sum_k \sum_l \left(\beta W_{k,l}^2 + |W_{k,l}|\right)\)
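As an illustrative sketch of these penalties (the function names and the \(\beta\) default are mine, assuming NumPy):

Python

import numpy as np

def l2_reg(W):
    return np.sum(W ** 2)              # sum_k sum_l W_{k,l}^2

def l1_reg(W):
    return np.sum(np.abs(W))           # sum_k sum_l |W_{k,l}|

def elastic_net_reg(W, beta=0.5):
    return np.sum(beta * W ** 2 + np.abs(W))

# The full objective is then the average data loss plus the penalty,
# e.g. total_loss = data_loss + lam * l2_reg(W)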

Loss Function: Softmax Loss

This is the loss for the Softmax classifier (multinomial logistic regression). In SVM we only enforce that the correct class out-scores others by a margin. In Softmax, scores are mapped to class probabilities.

Definition of probability:

\[P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}\]

Where \(s = f(x_i; W)\) and exponentiation makes every entry positive.
We want the probability of the correct class to be high; the loss is the negative log-likelihood:

\[L_i = -\log P(Y = y_i \mid X = x_i)\]

Exponentiated scores define a distribution; minimizing the loss maximizes the probability of the correct class.
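A minimal sketch of this loss (assuming NumPy; the function name and example scores are made up). Subtracting the maximum score before exponentiating does not change the probabilities but avoids overflow:

Python

import numpy as np

def softmax_loss_single(scores, y):
    """Cross-entropy loss L_i = -log P(Y = y | X = x) for one sample."""
    shifted = scores - np.max(scores)                  # for numerical stability; probabilities are unchanged
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax over the class scores
    return -np.log(probs[y])

print(softmax_loss_single(np.array([3.2, 5.1, -1.7]), y=0))  # ≈ 2.04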

SVM loss only pushes the correct score above others by the margin; Softmax loss not only wants it higher but as high as possible.
In practice, Softmax often works better.

Finding W via Gradient Descent

Different \(W\) give different loss; optimization searches for \(W\) that minimizes loss.

Gradient of \(f\): \(\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)\).
Gradient descent moves against the gradient of the loss with respect to \(W\) to approach a minimum.
To estimate the gradient numerically, perturb each element of \(W\) by a small \(h\), measure the resulting change in loss, and take the ratio as an approximation of that partial derivative of \(L\) with respect to \(W\); this numerical gradient is slow and approximate, so it is mainly used to verify the analytic gradient (gradient checking). Deriving the exact derivative expression gives the analytic gradient, which is what training actually uses.
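A minimal sketch of this finite-difference estimate (the function name is mine; a centered difference is used, assuming NumPy). It needs two loss evaluations per parameter, so it is only practical for checking the analytic gradient:

Python

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Estimate dL/dW one element at a time with a centered finite difference."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        loss_plus = loss_fn(W)          # L at W + h on this element
        W[idx] = old - h
        loss_minus = loss_fn(W)         # L at W - h on this element
        W[idx] = old                    # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad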

Gradient descent pseudocode:

Python
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient of the loss w.r.t. the weights
    weights += -step_size * weights_grad                       # step against the gradient; step_size is the learning rate

Stochastic Gradient Descent (SGD): computing the full loss gradient can be expensive; when loss is an average we can approximate its gradient using only part of the data.

\[\nabla_W L(W) = \frac{1}{N} \sum_{i=1}^N \nabla_W L_i(x_i, y_i, W) + \lambda \nabla_W R(W)\]

Vanilla minibatch gradient descent: instead of all data, randomly pick a small set (e.g., 256 samples) as a minibatch to estimate the gradient.
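In the same pseudocode style as the loop above, only the data fed to the gradient changes; sample_training_data is a hypothetical helper that draws a random subset of the training set:

Python

while True:
    data_batch = sample_training_data(data, 256)                     # hypothetical helper: 256 random examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient estimated on the minibatch only
    weights += -step_size * weights_grad                             # same parameter update as before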

Feature Extraction for Image Matching

In practice, raw pixels vary a lot; we often extract features before matching.
If two images contain the same object but at different positions, a linear classifier on raw pixels fails because translation moves the pattern across the grid.
Better features include color histograms and Histogram of Oriented Gradients (HoG).
Another approach: sample patches to build a codebook (visual words); for a new image, encode patches by the nearest codebook words and pool those counts into features.
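A small sketch of the simplest of these features, a color histogram (the function name is mine, assuming NumPy and an H x W x 3 uint8 image):

Python

import numpy as np

def color_histogram_feature(image, bins=8):
    """Concatenate one intensity histogram per color channel."""
    feats = []
    for c in range(3):  # R, G, B channels
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist)
    return np.concatenate(feats).astype(np.float32)

Because the histogram discards pixel positions, the same object shifted within the frame yields a similar feature vector, which is exactly what a linear classifier on raw pixels lacks.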