Lec 04 Introduction to Neural Networks

Backpropagation

The forward pass uses a computation graph to store intermediate values. The backward pass starts from the end node and uses those saved values to compute the gradient of the loss with respect to each node by applying the chain rule.
Each node performs a simple operation, so for a node with inputs (x, y) and output z, the local gradients \(\frac{\partial z}{\partial x}\) and \(\frac{\partial z}{\partial y}\) are easy to compute. Once we have \(\frac{\partial L}{\partial z}\) from downstream, the chain rule gives \(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}\), which is then passed on to the previous node.
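
For concreteness, here is a minimal sketch of one forward/backward pass on a tiny graph \(f = (x + y) \cdot z\) (the graph and the values are illustrative, not taken from the lecture):

```python
# Tiny computation graph: q = x + y, f = q * z (values are arbitrary).
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute and store intermediate values.
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass: multiply local gradients by the upstream gradient (chain rule).
df_dq = z             # local gradient of the * node w.r.t. q
df_dz = q             # local gradient of the * node w.r.t. z
df_dx = df_dq * 1.0   # + node: local gradient is 1 for each input
df_dy = df_dq * 1.0
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```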

Each node computes its local gradient, multiplies it by the incoming upstream gradient, and passes the result on.
For a max node, the local gradient is 1 for the input that was largest and 0 for the others, so the upstream gradient is routed entirely to the winning input; which way it flows depends on the values seen in the forward pass.
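
As a hypothetical snippet (not from the notes), gradient routing through \(z = \max(x, y)\) looks like:

```python
# Gradient routing through z = max(x, y); values and upstream gradient are arbitrary.
x, y = 2.0, -1.0
z = max(x, y)                              # forward: z = 2.0 (x wins)

dL_dz = 5.0                                # upstream gradient from downstream nodes
dL_dx = dL_dz * (1.0 if x >= y else 0.0)   # the winning input receives the full gradient
dL_dy = dL_dz * (1.0 if y > x else 0.0)    # the other input receives 0
print(dL_dx, dL_dy)                        # 5.0 0.0
```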

Why is backprop more efficient than directly differentiating \(L\) with respect to every parameter? Because the chain rule lets downstream gradients be computed once and reused by every node upstream; each node only needs its own small local derivative, instead of re-deriving a full expression for every parameter.

You can treat several simple units as one larger operation, e.g., bundling nodes into a sigmoid \(\sigma(x) = \frac{1}{1 + e^{-x}}\).
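
The payoff is that the bundled gate has a simple closed-form local gradient, \(\frac{d\sigma}{dx} = \sigma(x)\,(1 - \sigma(x))\), so the whole group is backpropped in one step; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.5
s = sigmoid(x)                 # forward through the bundled sigmoid gate

dL_ds = 2.0                    # upstream gradient (arbitrary value)
dL_dx = dL_ds * s * (1.0 - s)  # one local gradient instead of several primitive gates
```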

When inputs are high-dimensional, the local gradient (Jacobian) is often sparse.
Example: a ReLU layer \(f(x) = \max(0, x)\) on a 4096-D input produces a 4096-D output. Its Jacobian is 4096×4096, where entry (i, j) says how output i changes with input j. Since each ReLU output depends only on its matching input, the Jacobian is diagonal: 1 where the input is positive and 0 otherwise.
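
In practice the 4096×4096 Jacobian is never materialized; the backward pass just applies the diagonal elementwise, as in this sketch:

```python
import numpy as np

x = np.random.randn(4096)        # layer input
y = np.maximum(0, x)             # ReLU forward pass

dL_dy = np.random.randn(4096)    # upstream gradient (arbitrary here)
dL_dx = dL_dy * (x > 0)          # elementwise mask = multiplying by the diagonal Jacobian
```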

Backprop illustration (figure 4-1).

Neural Networks

Single-layer linear model: \(f = W x\).
Two-layer example: \(f = W_2 \max(0, W_1 x)\); stacking layers with a nonlinearity in between increases expressive power, since \(W_2\) can recombine, with different weights, the templates learned by the rows of \(W_1\).
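
A minimal sketch of the two-layer form, with illustrative sizes (3072-D input, 100 hidden units, 10 scores; these numbers are assumptions, not from the notes), including the gradient flowing back through both layers:

```python
import numpy as np

# Illustrative sizes: 3072-D input, 100 hidden units, 10 class scores.
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)
x = np.random.randn(3072, 1)

# Forward: f = W2 * max(0, W1 * x)
h = np.maximum(0, W1.dot(x))
f = W2.dot(h)

# Backward, given some upstream gradient df on the scores.
df = np.random.randn(10, 1)
dW2 = df.dot(h.T)
dh = W2.T.dot(df)
dh[h <= 0] = 0          # ReLU passes gradient only where the unit was active
dW1 = dh.dot(x.T)
```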

Neural-network terminology comes from biology, but the analogy is weak: inputs arriving on the dendrites are scaled by weights \(w_i x_i\), the cell body sums them, and the axon fires through a nonlinearity, \(f\!\left(\sum_i w_i x_i + b\right)\).
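
In code, a single "neuron" is just a weighted sum plus a bias pushed through a nonlinearity; a sketch using a sigmoid:

```python
import numpy as np

def neuron(x, w, b):
    # f(sum_i w_i * x_i + b) with a sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, -2.0, 3.0])   # inputs arriving along the "dendrites"
w = np.array([0.5, 0.1, -0.3])   # synaptic weights
b = 0.2                          # single bias term
print(neuron(x, w, b))
```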

Common activation functions: Sigmoid, Leaky ReLU, tanh, Maxout, ReLU, ELU, etc.
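
For reference, the standard definitions of most of these fit in a few lines (Maxout is omitted since it combines several linear pieces rather than transforming a single pre-activation):

```python
import numpy as np

def sigmoid(x):                 return 1.0 / (1.0 + np.exp(-x))
def tanh(x):                    return np.tanh(x)
def relu(x):                    return np.maximum(0, x)
def leaky_relu(x, alpha=0.01):  return np.where(x > 0, x, alpha * x)
def elu(x, alpha=1.0):          return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```
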
Architectures have an input layer, an output layer, and hidden layers in between. Example:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid here; any nonlinearity works)

# Weight/bias shapes are illustrative: a 3-4-4-1 fully-connected network.
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x = np.random.randn(3, 1)        # random 3-D input (column vector)
h1 = f(np.dot(W1, x) + b1)       # first hidden layer (4-D)
h2 = f(np.dot(W2, h1) + b2)      # second hidden layer (4-D)
out = np.dot(W3, h2) + b3        # output layer (1-D score)
```