Lec 03 Loss Functions and Optimization
Loss Function: SVM Loss
For the \(N\) training samples \((x_i, y_i)\), define a per-sample loss \(L_i\) and take the total loss as the average over the dataset:

\[
L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i; W), y_i\big)
\]

Our goal is to find the \(W\) that minimizes this \(L\).
SVM loss: each class gets a score. The correct class score should be higher than all others by at least a margin; outside the margin the loss is zero, inside the margin the loss grows with the violation.
E.g., with a margin of 1 (\(s = f(x_i; W)\) denotes the vector of class scores):

\[
L_i = \sum_{j \neq y_i} \max\left(0,\ s_j - s_{y_i} + 1\right)
\]
Early in training, scores are near 0, so loss is roughly (#classes - 1). The minimum loss is 0; if a \(W\) achieves zero loss then any scaled version like \(2W\) also does.
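As a sanity check on the definition, here is a minimal NumPy sketch of the per-sample hinge loss (the names `svm_loss_single`, `scores`, and `y` are illustrative, not from the lecture):

```python
import numpy as np

def svm_loss_single(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one sample.

    scores: 1-D array of class scores s = f(x_i; W)
    y: index of the correct class
    """
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class contributes nothing
    return margins.sum()

print(svm_loss_single(np.array([3.2, 1.3, 2.2]), y=0))  # 0.0: class 0 wins every margin
print(svm_loss_single(np.array([1.3, 4.9, 2.0]), y=0))  # 6.3: two margins violated
```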
We do not only care about fitting the training labels; we want good test performance, and a model that fits the training data too aggressively may generalize poorly. Therefore we add a regularization term to the loss, giving the full objective

\[
L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)
\]

where \(\lambda\) trades off fitting the training data against keeping \(W\) simple, so that \(W\) does not become arbitrarily complex.
Common regularizers (sketched in code after this list):
- L2: \(R(W) = \sum_k \sum_l W_{k,l}^2\)
- L1: \(R(W) = \sum_k \sum_l |W_{k,l}|\)
- Elastic net (L1 + L2): \(R(W) = \sum_k \sum_l \left(\beta W_{k,l}^2 + |W_{k,l}|\right)\)
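A minimal sketch of these three penalties in NumPy (the function names and the elastic-net weight `beta` are illustrative; `beta` corresponds to \(\beta\) above):

```python
import numpy as np

def l2_reg(W):
    return np.sum(W ** 2)        # sum of squared entries of W

def l1_reg(W):
    return np.sum(np.abs(W))     # sum of absolute entries of W

def elastic_net_reg(W, beta=0.5):
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([[0.5, -1.0], [2.0, 0.0]])
print(l2_reg(W), l1_reg(W), elastic_net_reg(W))  # 5.25 3.5 6.125
```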
Loss Function: Softmax Loss
This is the loss for the Softmax classifier (multinomial logistic regression). In SVM we only enforce that the correct class out-scores others by a margin. In Softmax, scores are mapped to class probabilities.
The class probabilities are defined with the softmax function:

\[
P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}
\]

where \(s = f(x_i; W)\) is the score vector and exponentiation makes every entry positive.
We want the probability of the correct class to be high; the loss is the negative log-likelihood of that class:

\[
L_i = -\log P(Y = y_i \mid X = x_i) = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}
\]
Exponentiated and normalized scores define a probability distribution over the classes; minimizing the loss maximizes the probability assigned to the correct class.
SVM loss only pushes the correct score above others by the margin; Softmax loss not only wants it higher but as high as possible.
In practice, Softmax often works better.
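A minimal NumPy sketch of the softmax loss for one sample (shifting the scores by their maximum is the standard trick to keep the exponentials from overflowing; the names are illustrative):

```python
import numpy as np

def softmax_loss_single(scores, y):
    """Softmax (cross-entropy) loss for one sample.

    scores: 1-D array of class scores s = f(x_i; W)
    y: index of the correct class
    """
    shifted = scores - np.max(scores)  # stability: avoids exp() overflow
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])

print(softmax_loss_single(np.array([3.2, 5.1, -1.7]), y=0))  # ~2.04 (natural log)
```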
Finding W via Gradient Descent
Different \(W\) give different loss; optimization searches for \(W\) that minimizes loss.
Gradient of \(f\): \(\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)\).
Gradient descent moves against the gradient of the loss with respect to \(W\) to approach a minimum.
To estimate the gradient numerically, add a small \(h\) to each element of \(W\) in turn, measure the change in the loss, and approximate the corresponding entry of \(\frac{\partial L}{\partial W}\) by a finite difference; this is the numerical gradient. Deriving the exact derivative formula instead gives the analytic gradient, which is fast and exact; in practice we use the analytic gradient and verify it against the numerical one (a gradient check).
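A sketch of the centered finite-difference version (the toy loss and the step size `h` are illustrative):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Approximate dL/dW by perturbing each entry of W in turn."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        fp = f(W)                        # loss with this entry nudged up
        W[idx] = old - h
        fm = f(W)                        # loss with this entry nudged down
        W[idx] = old                     # restore
        grad[idx] = (fp - fm) / (2 * h)  # centered difference
        it.iternext()
    return grad

# Toy check: L(W) = sum(W^2) has exact gradient 2W.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
print(numerical_gradient(lambda w: np.sum(w ** 2), W))  # ~= 2 * W
```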
Gradient descent pseudocode:
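A minimal sketch of the loop (here `evaluate_gradient`, `loss_fun`, `data`, `weights`, and `step_size` are placeholders, so this is pseudocode rather than runnable code):

```python
# Vanilla gradient descent: repeatedly step against the gradient.
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # step opposite the gradient
```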
Stochastic Gradient Descent (SGD): computing the full loss gradient can be expensive; when loss is an average we can approximate its gradient using only part of the data.
Vanilla minibatch gradient descent: instead of all data, randomly pick a small set (e.g., 256 samples) as a minibatch to estimate the gradient.
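The same loop with a minibatch (again placeholders; `sample_training_data` stands for drawing a random batch):

```python
# Minibatch SGD: estimate the gradient from a random subset of the data.
while True:
    data_batch = sample_training_data(data, 256)  # 256 random samples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad
```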
Feature Extraction for Image Classification
In practice, raw pixel values vary a lot; it often helps to extract features first and feed those to the classifier.
If two images contain the same object but at different positions, a linear classifier on raw pixels fails because translation moves the pattern across the grid.
Better features include color histograms and the Histogram of Oriented Gradients (HOG).
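As one concrete instance, a per-channel color histogram feature in NumPy, a sketch assuming RGB values in \([0, 1]\) (the bin count is arbitrary):

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel color histogram as a global image feature.

    image: H x W x 3 array with values in [0, 1]
    Returns a length 3*bins vector: one histogram per channel.
    """
    feats = [np.histogram(image[..., c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    return np.concatenate(feats).astype(np.float64)

img = np.random.rand(32, 32, 3)    # stand-in for a real image
print(color_histogram(img).shape)  # (24,)
```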
Another approach: randomly sample image patches and cluster them (e.g., with k-means) to build a codebook of visual words; a new image is then encoded by assigning each of its patches to the nearest word and pooling those counts into a feature vector.
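A sketch of just the encoding step, assuming a codebook has already been built (all names are illustrative):

```python
import numpy as np

def encode_bag_of_words(patches, codebook):
    """Encode an image as counts of nearest codebook words.

    patches:  (num_patches, patch_dim) features sampled from one image
    codebook: (num_words, patch_dim) cluster centers, e.g. from k-means
    """
    # Distance from every patch to every word, then nearest-word assignment.
    dists = np.linalg.norm(patches[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    return np.bincount(nearest, minlength=len(codebook))  # pooled counts

# Toy usage with random stand-ins for real patches and a real codebook.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 27))   # 16 words over 3x3x3 patches
patches = rng.standard_normal((100, 27))
print(encode_bag_of_words(patches, codebook))  # histogram over 16 words
```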