
Loss Functions


Parametric Approach

\[ \text{input } x \rightarrow f(x, W[, b]) \rightarrow \text{10 numbers giving class scores} \]

e.g.

\[ f(x, W, b) = W \text{vec}(x) + b \]
  • Hard cases: data that cannot be separated by linear decision boundaries.
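
As a concrete illustration, here is a minimal numpy sketch of this score function. The shapes are assumptions (CIFAR-10-like: 10 classes, 32×32×3 images), not taken from the notes:

```python
import numpy as np

# Assumed shapes (illustrative): 10 classes, 32x32x3 images.
num_classes, dim = 10, 32 * 32 * 3
rng = np.random.default_rng(0)

W = 0.01 * rng.standard_normal((num_classes, dim))  # weight matrix
b = np.zeros(num_classes)                           # bias vector
x = rng.standard_normal((32, 32, 3))                # one input image

scores = W @ x.reshape(-1) + b                      # f(x, W, b) = W vec(x) + b
print(scores.shape)                                 # (10,) class scores
```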

Loss function

Given a dataset of examples \(\{(x_i, y_i)\}_{i = 1}^N\), the loss over the dataset is the average of the losses over examples:

\[ L = \frac{1}{N}\sum_{i}L_i(f(x_i, W), y_i) \]

Multiclass SVM Loss (Hinge loss)

\(y_i\) is an integer label, and \(s_j\) is the score of the \(j\)-th class for example \(x_i\) (s → score). The per-example loss is \(L_i = \sum_{j \neq y_i}\max \{0, s_j - s_{y_i} + 1\}\). The \(1\) here is a margin threshold; its exact value is arbitrary, since rescaling \(W\) can absorb any fixed margin.


```python
import numpy as np

def L_i_vectorized(x, y, W):
    # Scores for all classes: s = W x
    scores = W.dot(x)
    # Hinge margins against the correct class (threshold 1)
    margins = np.maximum(0, scores - scores[y] + 1)
    # The correct class contributes no loss
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i
```
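
For a whole batch, a loop-free version is common. This is a sketch, not from the notes; it assumes `X` stores examples as columns and `y` is a vector of integer labels:

```python
def svm_loss_batch(X, y, W):
    # X: (dim, N) with examples as columns; y: (N,) integer labels
    N = X.shape[1]
    scores = W.dot(X)                              # (C, N) class scores
    correct = scores[y, np.arange(N)]              # (N,) correct-class scores
    margins = np.maximum(0, scores - correct + 1)  # hinge margins, broadcast over rows
    margins[y, np.arange(N)] = 0                   # correct class contributes no loss
    return margins.sum() / N                       # mean loss over the batch
```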
Why not use an L2 (squared-error) loss on the scores instead of the hinge? In practice, the softmax loss introduced below works better for classification.

Regularization

The model should be simple, so that it generalizes to test data (i.e., to prevent overfitting).

\[ L(W) = \frac{1}{N}\sum_{i = 1}^{N}L_i(f(x_i, W), y_i) + \lambda R(W) \]

\(\lambda\): regularization strength. \(R(W)\) biases the optimizer toward simpler models (it can be understood as reducing the degree of the fitted curve). Common choices are listed below, with a code sketch after the list:

  • L2 regularization: \(R(W) = \Vert W \Vert_F^2 = \sum_{k,l} W_{k,l}^2\)
  • L1 regularization: \(R(W) = \Vert W \Vert_1\) (may be better)
  • Elastic net: \(R(W) = \beta \Vert W \Vert_F^2 + \Vert W \Vert_1\)
  • Max norm regularization
  • Dropout
  • Fancier
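
Putting the data loss and the penalty together, a minimal sketch of the full objective with L2 regularization, reusing `svm_loss_batch` from the sketch above (the function name and the default `lam` value are illustrative):

```python
import numpy as np

def full_loss(X, y, W, lam=1e-3):
    # Data loss: average multiclass SVM loss over the batch
    data_loss = svm_loss_batch(X, y, W)
    # Regularization loss: squared Frobenius norm of W, weighted by lambda
    reg_loss = lam * np.sum(W * W)
    return data_loss + reg_loss
```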

Bias Variance Tradeoff

Let \(L(W) = E\left[(\hat{y} - y)^2\right]\), where \(y\) is the fixed target and the expectation is over the randomness of the estimator \(\hat{y}\). Then we have

\[ L(W) = \underbrace{\left(E[\hat{y}] - y\right)^2}_{\text{bias}^2} + \underbrace{E\left[(\hat{y} - E[\hat{y}])^2\right]}_{\text{variance}} \]

Controlling the variance term helps reduce overfitting.
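
A quick numerical check of this decomposition (a synthetic sketch; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
y = 2.0                                               # fixed target
y_hat = y + 0.5 + 0.3 * rng.standard_normal(100_000)  # estimator with bias 0.5, std 0.3

mse = np.mean((y_hat - y) ** 2)
bias2 = (np.mean(y_hat) - y) ** 2
var = np.var(y_hat)
print(mse, bias2 + var)   # both close to 0.5**2 + 0.3**2 = 0.34
```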

Softmax Loss (Cross-entropy loss)

\[ P(Y = k | X = x_i) = \frac{\mathrm{e}^{s_k}}{\sum_j \mathrm{e}^{s_j}} \]

where \(s = f(x_i, W)\); the softmax function normalizes the score vector \(s\) into a probability distribution. We want to minimize the negative log likelihood of the correct class, so the loss function is

\[ L_i = -\log P(Y = y_i | X = x_i) \]
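
A minimal implementation of this loss (a sketch; shifting the scores by their maximum is a standard numerical-stability trick, not mentioned in the notes, and leaves \(P\) unchanged):

```python
import numpy as np

def softmax_loss(x, y, W):
    scores = W.dot(x)
    scores = scores - scores.max()   # stability shift; cancels in the softmax ratio
    p = np.exp(scores) / np.sum(np.exp(scores))
    return -np.log(p[y])             # negative log likelihood of the correct class
```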

In fact this is the Kullback-Leibler divergence between the one-hot ground-truth distribution \(Q\) and the predicted distribution \(P\):

\[ D_{KL}(Q \Vert P) = \sum_k Q(k)\log\frac{Q(k)}{P(k)} = -\log P(Y = y_i | X = x_i) \]

where the equality holds because \(Q(k) = 1\) for \(k = y_i\) and \(0\) otherwise (with the convention \(0 \log 0 = 0\)).
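
A small check of this equivalence (the distributions are illustrative):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # predicted distribution P
q = np.array([1.0, 0.0, 0.0])   # one-hot ground truth Q (correct class is 0)

# KL(Q || P), summing only where Q(k) > 0 (convention 0 * log 0 = 0)
mask = q > 0
kl = np.sum(q[mask] * np.log(q[mask] / p[mask]))
print(kl, -np.log(p[0]))        # both ≈ 0.357
```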