Lec 05 Convolutional Neural Networks
Convolutional Neural Network
Fully connected layer: take a 32x32x3 image as an example. Stretch it into a 3072x1 vector, multiply it by a 10x3072 weight matrix (one dot product per row), and get a 10x1 activation.
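The shapes in the fully connected example can be checked with a minimal NumPy sketch. The random `image` and weight matrix `W` are stand-ins chosen for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((32, 32, 3))  # stand-in for a 32x32x3 image
W = rng.standard_normal((10, 3072))       # hypothetical weights, one row per output

x = image.reshape(3072)                   # stretch the image into a 3072-vector
scores = W @ x                            # 10 dot products, one per row of W

print(scores.shape)  # (10,)
```

Each entry of `scores` is the dot product of one row of `W` with the whole flattened image, so the spatial layout of the pixels is discarded.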
Convolution layer: preserve the spatial structure. Keep the image as 32x32x3 and convolve a 5x5x3 filter with it.
The depths of filter and image are always the same (3 here).
Convolve: overlay the filter on a spatial location in the image, multiply corresponding elements, sum the products, and add the bias. Each location yields a single number.
How to slide the filter? Center the filter on every pixel where it fits entirely inside the input volume. Without padding, the activation map is smaller than the input: for a 32x32x3 image and a 5x5x3 filter, the activation map is 28x28x1.
Each filter looks for a specific type of template or concept and produces its own activation map. E.g., when applying 6 filters, the stacked activation maps form a 28x28x6 output.
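The sliding procedure can be sketched as a naive loop, assuming stride 1 and no padding. The function name `conv2d_valid` and the random inputs are hypothetical; as is usual in deep learning code, the filter is not flipped, so this is strictly a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, filters, biases):
    """Naive convolution layer with stride 1 and no padding.

    image:   (H, W, C) input volume
    filters: (K, F, F, C) bank of K filters, each as deep as the input
    biases:  (K,) one bias per filter
    returns: (H - F + 1, W - F + 1, K) stacked activation maps
    """
    H, W, C = image.shape
    K, F, _, _ = filters.shape
    out_h, out_w = H - F + 1, W - F + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + F, j:j + F, :]  # F x F x C region under the filter
                # multiply element-wise, sum everything, add the bias
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))  # 6 filters of size 5x5x3
biases = np.zeros(6)

maps = conv2d_valid(image, filters, biases)
print(maps.shape)  # (28, 28, 6)
```

Real libraries compute the same result far more efficiently; the triple loop is only meant to make the definition explicit.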
- CONV: extract features with filters (as above)
- RELU: nonlinear activation applied after CONV
- POOL: downsample the activation map, applied after RELU.
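The notes do not spell out the nonlinearity. Assuming RELU refers to the standard rectified linear unit, max(0, x) applied element-wise, a minimal sketch looks like this (the `activation` values are made up for illustration):

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): negatives become 0, positives pass through."""
    return np.maximum(0, x)

activation = np.array([[-2.0, 0.5],
                       [3.0, -0.1]])
rectified = relu(activation)
print(rectified)
```

The shape of the activation map is unchanged; only the values are clipped at zero.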
ConvNet is a sequence of convolutional layers. Each output is the input of the next layer.
What do the filter images represent?
If we treat a filter as a grayscale image, each filter can be visualized as a small patch with properties like stripes, edges or textures. A filter with vertical stripes responds to vertical edges (intensity changes along the horizontal direction), a filter with oriented stripes responds to edges at that orientation, etc.
These images show the input pattern that maximizes the activation of the filter.
Other options: slide the filter with a larger stride. The larger the stride, the smaller the output.
If the stride doesn't fit the input (the filter positions do not tile the input evenly), don't use that stride.
In practice, pad the input image with zeros (or other values) so that the output keeps the full spatial size of the input.
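The effect of stride and padding on the output size follows the standard relation (N + 2P - F) / S + 1, for input size N, filter size F, padding P and stride S. The helper below is a small sketch of that formula; the name `conv_output_size` is hypothetical, and raising an error is just one way to express "this stride doesn't fit".

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size for an n x n input and f x f filter.

    Raises ValueError when the stride does not fit the padded input.
    """
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError(
            f"stride {stride} does not fit input {n} with filter {f}, pad {pad}"
        )
    return span // stride + 1

print(conv_output_size(32, 5))                   # 28: no padding shrinks the map
print(conv_output_size(32, 5, stride=1, pad=2))  # 32: pad 2 keeps the full size
print(conv_output_size(7, 3, stride=2))          # 3: larger stride, smaller output
```

For example, a 7x7 input with a 3x3 filter and stride 3 gives (7 - 3) / 3, which is not an integer, so that stride is rejected.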
Calculate parameters
Input volume: 32x32x3, with 10 5x5 filters, stride 1, pad 2. What's the number of parameters in this layer?
For each filter, \(5\times 5\times 3=75\) weights and \(1\) bias.
Total parameters: \((75+1)\times 10=760\).
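The same count can be written as a small helper, which makes it clear that stride and padding do not affect the number of parameters. The function name `conv_param_count` is hypothetical.

```python
def conv_param_count(in_depth, filter_size, num_filters):
    """Parameters of a conv layer: weights plus one bias per filter."""
    weights_per_filter = filter_size * filter_size * in_depth
    return (weights_per_filter + 1) * num_filters

print(conv_param_count(in_depth=3, filter_size=5, num_filters=10))  # 760
```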
Pooling layer: make the activation map smaller and more manageable. Common methods include max pooling (take the maximum of each region) and average pooling (take the average of each region).
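Both pooling methods can be sketched in NumPy, assuming non-overlapping square windows (stride equal to the window size), which the notes do not specify. The function name `pool2d` and the toy input are illustrative.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Pool each channel of x (H, W, C) over non-overlapping size x size windows."""
    H, W, C = x.shape
    out_h, out_w = H // size, W // size
    # drop leftover rows/columns, then split into size x size blocks
    blocks = x[:out_h * size, :out_w * size, :].reshape(out_h, size, out_w, size, C)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

x = np.arange(16, dtype=float).reshape(4, 4, 1)
max_pooled = pool2d(x, size=2, mode="max")
avg_pooled = pool2d(x, size=2, mode="avg")

print(max_pooled[:, :, 0].tolist())  # [[5.0, 7.0], [13.0, 15.0]]
print(avg_pooled[:, :, 0].tolist())  # [[2.5, 4.5], [10.5, 12.5]]
```

Pooling acts on each channel independently, so a 28x28x6 activation map becomes 14x14x6 with 2x2 windows: the spatial size halves and the depth is unchanged.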
Do people now use strided convolutions more often instead of pooling?
Fully connected layer (FC layer): aggregate the result into a 1D vector.
Depth of the activation map?
If the depth of the input image is n, each element of the output is the sum of the n per-channel convolution results. Therefore each filter produces a 2D map, and the depth of the activation map equals the number of filters.
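This can be verified numerically: convolving each channel with the matching slice of the filter and summing the n results gives the same 2D map as applying the full 3D filter at once. The sketch below uses a small 8x8x3 input and a 3x3x3 filter chosen only for illustration, with stride 1, no padding and no bias.

```python
import numpy as np

def correlate2d_valid(channel, kernel):
    """Slide a 2D kernel over a 2D channel, stride 1, no padding."""
    H, W = channel.shape
    F = kernel.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            out[i, j] = np.sum(channel[i:i + F, j:j + F] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))  # input with depth n = 3
filt = rng.standard_normal((3, 3, 3))   # one filter, same depth as the input

# n separate 2D results, one per channel, summed into a single 2D map
per_channel = [correlate2d_valid(image[:, :, c], filt[:, :, c]) for c in range(3)]
summed = np.sum(per_channel, axis=0)

# the same map computed directly with the full 3D filter
direct = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        direct[i, j] = np.sum(image[i:i + 3, j:j + 3, :] * filt)

print(summed.shape)                 # (6, 6)
print(np.allclose(summed, direct))  # True
```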