Henry Kautz henry.kautz@gmail.com
The goal of machine learning is to learn a function $f: X \rightarrow Y$ that maps each input in $X$ to an output in $Y$. Let
$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$
be a set of training instances, where each $x^{(i)} \in X$ and $y^{(i)} \in Y$. Now let us assume that $f$ is drawn from a family of functions $f_\theta$ parameterized by a vector of real numbers $\theta$. Our goal, then, is to find parameters that minimize the total distance between the function's predictions and the targets:
$$\theta^* = \arg\min_\theta \sum_{i=1}^{n} d\big(f_\theta(x^{(i)}),\, y^{(i)}\big)$$
Now let us take the output space to be the reals or a vector of reals; the distance measure to be the squared Euclidean norm; and the loss to be the sum of the distances over the training instances:
$$L(\theta) = \sum_{i=1}^{n} \big\|f_\theta(x^{(i)}) - y^{(i)}\big\|^2$$
In the case that the output is a single real number, the squared Euclidean norm is simply the squared error $\big(f_\theta(x^{(i)}) - y^{(i)}\big)^2$.
Now consider one element of the vector inside the summation above. We can use the chain rule to calculate that
$$\frac{\partial}{\partial \theta_j}\big(f_\theta(x)_k - y_k\big)^2 = 2\,\big(f_\theta(x)_k - y_k\big)\,\frac{\partial f_\theta(x)_k}{\partial \theta_j}$$
This expression can be interpreted as twice the influence of the parameter on the results times the difference between the current candidate and the target. (The difference is also called the error or the residual.) Gradient descent minimizes the loss by repeatedly updating the parameters by a small constant $\eta$, called the learning rate, times the negative of the gradient:
$$\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$$
We see now that we can compute the gradient of the loss function as long as the family of parameterized functions $f_\theta$ is differentiable with respect to the parameters $\theta$.
In gradient descent, the parameters are adjusted many times. Each pass over the training data is called an epoch. A stopping rule specifies when training stops. A simple rule is to stop when all the elements of the gradient are smaller in absolute value than some given small constant. Another stopping rule is to hold out some of the data in what is called a validation set. The validation data is not used in computing the gradient. Instead, after some number of descent steps, the network is tested on the validation data. If the loss on the validation set stops decreasing, then training stops, because the network has stopped generalizing to data not used in training. This stopping rule prevents the network from overfitting to the training data and failing to generalize to new data.
The general purpose of regularization is to prefer simpler functions, in the hope that they generalize better to unseen data. This is a way to implement Ockham's Razor, the principle that the simplest explanation is usually the best.
A commonly used regularization function is the sum of squares of the parameters, which is called the L2 regularizer:
$$R(\theta) = \lambda \sum_j \theta_j^2$$
where the constant $\lambda$ controls the strength of the regularization. This will prefer functions where many of the parameters are zero or near zero. (It is often combined with a heuristic step where near-zero parameters are rounded to zero.) It is easy to compute the gradient of this function:
$$\frac{\partial R}{\partial \theta_j} = 2\lambda\,\theta_j$$
A standard artificial neuron takes the weighted sum of its inputs plus a constant and applies a non-linear function to produce the output. The constant can be handled by considering it to be a weight applied to a special input that is always 1. Many non-linear output functions have been studied, but the most useful turns out to be a simple threshold function that outputs zero if the sum is negative and outputs the sum otherwise. This is called a "rectified linear unit" (ReLU). We assume that input $x_0$ is always 1, so the neuron computes
$$f_w(x) = \mathrm{ReLU}\Big(\sum_j w_j x_j\Big), \qquad \mathrm{ReLU}(z) = \max(0, z)$$
The fact that this function is not differentiable at 0 can be ignored by simply defining its derivative to be 0 at 0. (Although the standard definition of the derivative does not allow this trick, a more generalized notion of derivative from convex analysis allows it as a "choice of subgradient".)
When taking the partial derivative of the loss with respect to a single weight $w_j$, the chain rule passes through the ReLU, whose derivative is 1 when the weighted sum is positive and 0 otherwise. Note that because our neuron has a single output, its partial derivative is a single real number rather than a vector of real numbers. We can write down and simplify the gradient as follows:
$$\frac{\partial L}{\partial w_j} = 2\,\big(f_w(x) - y\big)\,\mathbb{1}\Big[\sum_k w_k x_k > 0\Big]\, x_j$$
The reader will note that when the weighted sum is negative, the unit outputs zero and the gradient is zero, so the training instance causes no change to the weights.
Single neurons or even a single layer of neurons can only learn linear functions. Once multiple layers are introduced, with enough neurons any function can be learned. There are theoretical results on the width and depth necessary to learn various classes of functions. For example, any continuous function can be approximated arbitrarily well by a network of depth 2 (a single hidden layer), given sufficient width.
Consider a layered neural network, where the outputs of the units at each layer serve as the inputs to the units at the next layer. Following common practice, we will henceforth use $\mathrm{ReLU}'$ to denote the derivative of the ReLU function:
$$\mathrm{ReLU}'(z) = \mathbb{1}[z > 0]$$
The derivative uses indicator function notation, where $\mathbb{1}[\phi]$ is 1 if the condition $\phi$ is true and 0 otherwise.
We begin by considering a three-unit network where there are two inputs in layer 0, two ReLU units in layer 1, one ReLU unit in the output layer 2, and no regularization. There are a total of 9 weights in the network. (Recall that three of the weights are for the special always-1 inputs that implement the constant terms.)
We will see that the calculation of the gradient uses the error signal of the entire network in computing the derivatives for the weights in the output layer, and then propagates the error signal back to the previous layer in proportion to the weights on the outputs of that layer, and so on. This intuitive description of the calculation is why the method is named "back propagation". It is important to understand, however, that it is simply an application of the chain rule.
For a given training instance $(x, y)$, the forward pass computes the following quantities. To handle the constant terms, let $x_0 = 1$ and $a^{(1)}_0 = 1$.
Layer 1 (two units):
$$z^{(1)}_i = \sum_{j=0}^{2} w^{(1)}_{ij}\, x_j, \qquad a^{(1)}_i = \mathrm{ReLU}\big(z^{(1)}_i\big), \qquad i = 1, 2$$
Layer 2 (one unit / output):
$$z^{(2)} = \sum_{i=0}^{2} w^{(2)}_i\, a^{(1)}_i, \qquad \hat{y} = \mathrm{ReLU}\big(z^{(2)}\big)$$
Loss:
$$L = (\hat{y} - y)^2$$
We define the following shorthand for the partial derivative of the loss with respect to the pre-activation of unit $i$ in layer $\ell$:
$$\delta^{(\ell)}_i = \frac{\partial L}{\partial z^{(\ell)}_i}$$
The $\delta$ values are the error signals; we compute them from the output layer backward in six steps.
Step 1: Top derivative
$$\frac{\partial L}{\partial \hat{y}} = 2\,(\hat{y} - y)$$
Step 2: Through the ReLU
$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = 2\,(\hat{y} - y)\,\mathbb{1}\big[z^{(2)} > 0\big]$$
Step 3: Gradients for layer-2 weights
$$\frac{\partial L}{\partial w^{(2)}_i} = \delta^{(2)}\, a^{(1)}_i, \qquad i = 0, 1, 2$$
Step 4: Send error to hidden activations
$$\frac{\partial L}{\partial a^{(1)}_i} = \delta^{(2)}\, w^{(2)}_i, \qquad i = 1, 2$$
Step 5: Through hidden ReLUs to pre-activations
$$\delta^{(1)}_i = \delta^{(2)}\, w^{(2)}_i\, \mathbb{1}\big[z^{(1)}_i > 0\big], \qquad i = 1, 2$$
Step 6: Gradients for layer-1 weights
$$\frac{\partial L}{\partial w^{(1)}_{ij}} = \delta^{(1)}_i\, x_j, \qquad i = 1, 2, \quad j = 0, 1, 2$$
Many applications today use neural networks that are thousands of units wide and hundreds of layers deep. We now show the general form of the feed-forward and back-propagation calculations for a network of depth $D$.
Forward pass (any unit $i$ in any layer). Let $a^{(0)} = x$, with the dummy activation $a^{(\ell)}_0 = 1$ at every layer. For each layer $\ell = 1, \ldots, D$:
$$z^{(\ell)}_i = \sum_j w^{(\ell)}_{ij}\, a^{(\ell-1)}_j, \qquad a^{(\ell)}_i = \mathrm{ReLU}\big(z^{(\ell)}_i\big)$$
Output and loss:
$$\hat{y} = a^{(D)}, \qquad L = \|\hat{y} - y\|^2$$
Backward pass (any unit $i$ in any layer). Top layer errors:
$$\delta^{(D)}_i = 2\,(\hat{y}_i - y_i)\,\mathbb{1}\big[z^{(D)}_i > 0\big]$$
Hidden layers $\ell = D-1, \ldots, 1$:
$$\delta^{(\ell)}_i = \Big(\sum_k w^{(\ell+1)}_{ki}\, \delta^{(\ell+1)}_k\Big)\, \mathbb{1}\big[z^{(\ell)}_i > 0\big]$$
Gradients with respect to weights (including the bias weights $w^{(\ell)}_{i0}$ on the dummy inputs):
$$\frac{\partial L}{\partial w^{(\ell)}_{ij}} = \delta^{(\ell)}_i\, a^{(\ell-1)}_j$$
We have been using the loss without a regularization term; we now add the L2 regularizer back in.
We make one small change in the regularization function by not including the weights on the constant 1 dummy inputs - that is, the bias weights. This is because the purpose of regularization is to push the network toward sparsity by having many zero weights on connections between neurons. Thus, we write:
$$R(w) = \lambda \sum_{\ell} \sum_{i} \sum_{j \geq 1} \big(w^{(\ell)}_{ij}\big)^2$$
where index $j = 0$ (the bias weight) is excluded from each sum.
The order of the partial derivatives in the gradient follows the order of the weights in the vector $w$: if $w = (w_1, w_2, \ldots, w_m)$, the gradient is
$$\nabla_w L = \Big(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_m}\Big)$$
Although we represented the gradient as a single long vector, in machine learning it is common to organize it as a set of 2-dimensional matrices, one per layer, because the width of the network can vary by layer. Thus you will see expressions like the following, where each $\nabla_{W^{(\ell)}} L$ is a matrix with the same shape as the layer's weight matrix $W^{(\ell)}$:
$$\nabla_{W^{(\ell)}} L = \left[\frac{\partial L}{\partial w^{(\ell)}_{ij}}\right]_{ij}, \qquad \ell = 1, \ldots, D$$
In the previous section, we used only basic calculus and linear algebra notation. We can more compactly represent the gradient by using an operator named the Jacobian that combines notions from calculus and linear algebra. For a vector-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all first-order partial derivatives:
$$J_f = \left[\frac{\partial f_i}{\partial x_j}\right]_{ij}$$
In a neural network, the weights at layer $\ell$ can be collected into a matrix $W^{(\ell)}$, so that the layer computes
$$a^{(\ell)} = \mathrm{ReLU}\big(W^{(\ell)} a^{(\ell-1)}\big)$$
where the ReLU is applied element-wise.
Let the dataset loss be $L(w)$, a scalar-valued function of the weight vector. Because its output has dimension 1, its Jacobian $J_L$ is a single row. The gradient vector is then the transpose of that row:
$$\nabla_w L = J_L^{\top}$$
We next want to calculate the mean loss over the dataset. Since differentiation is linear, the Jacobian of the mean loss is simply the mean of the per-instance Jacobians:
$$J_{\bar L} = \frac{1}{n}\sum_{i=1}^{n} J_{L_i}$$
Now we come to the chain rule in Jacobian form. Let $h(x) = f(g(x))$ be the composition of two functions. Then the Jacobian of the composition is the matrix product of the Jacobians:
$$J_h(x) = J_f\big(g(x)\big)\, J_g(x)$$
Several other kinds of units and layers are used in deep neural networks.
Softmax takes a vector of pre-activations as input, computes their exponentials, and then normalizes them so they sum to one. The result has several interesting properties. First, if one of the inputs is much larger than the others, the corresponding output is near 1 and the others are near 0. Softmax is thus a smooth relative of the argmax operator. Second, the output can be viewed as a probability distribution. For example, if the network is being used to classify inputs and each of the inputs to the softmax is a score computed for a particular label, then the output can be interpreted as the probability that the corresponding label is the correct one.
Suppose the input to a softmax layer is the vector of activations $a$ from the previous layer. The pre-activation vector is
$$z = W a$$
The softmax output is the vector $s$ with components
$$s_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$
This ensures that each $s_i$ is between 0 and 1 and that $\sum_i s_i = 1$. Now we consider the derivative of softmax. If $i = j$:
$$\frac{\partial s_i}{\partial z_j} = s_i\,(1 - s_i)$$
If $i \neq j$:
$$\frac{\partial s_i}{\partial z_j} = -\,s_i\, s_j$$
A key operation in applications of neural networks to computer vision is convolution. A 2-D convolution takes a small filter (also called a kernel) and slides it across an input grid, such as an image. At each position, the filter values are multiplied with the overlapping input values and summed together, producing a single number in the output grid. Repeating this process across all positions produces a new 2-D array where each entry reflects how strongly that local region of the input “matches” the filter. In short: A 2-D convolution is a way of scanning a small pattern across an image to detect features like edges, corners, or textures, turning local structure in the input into structured signals in the output.
As you read the previous paragraph, you may be puzzled by the idea of a filter "sliding" across an image. Neurons are simply wired from one to another in a fixed pattern (even if the weights change) and cannot slide connections from one neuron to another. Sliding is a metaphor for weight sharing, where we deliberately reuse the same parameter values across multiple connections instead of learning a separate weight for each one. In other words, different parts of the network are “tied together” so they use the same weights. Thus there are many instances of the filter, implemented by different neurons across the 2-D image array, each separated by a few pixels (a quantity called the stride). All these instances, however, use the same values for their weights.
Let the input be an image $X$ with entries $X[i, j]$, and let the convolution kernel (shared weights) be a small matrix $K$ of size $k \times k$ with entries $K[u, v]$. The convolution output at spatial position $(i, j)$ is
$$Y[i, j] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K[u, v]\; X[i + u,\, j + v]$$
Here, the same kernel weights $K[u, v]$ are used at every position $(i, j)$: this is the weight sharing described above.
Convolution layers are often used in conjunction with pooling layers, also called downsampling layers. While a filter activates at the position in the image where it matches, a pooling layer determines whether a filter activates anywhere in a rectangular region of the image of a specified size. For example, you might have a high-level filter that detects cats, and you want to know if a cat appears anywhere in the image.
Let the input feature map be $A$ with entries $A[i, j]$. Choose a pooling window of size $p \times p$ and slide it across $A$ with stride $p$, producing an output map $P$. Each entry is defined as
$$P[i, j] = \max_{0 \le u,\, v < p}\; A[i p + u,\; j p + v]$$
The three concepts in this section title are crucial for making it practical to train neural networks. The formalization of gradient descent presented above computed the gradient across the entire training set for each weight update. This would be too slow if there were thousands of training instances. The idea of mini-batches (or just batches) is to update the weights after a given number of instances are processed during an epoch. The batch size is often chosen to be 128: small enough for efficient processing, but large enough to prevent the update from being swayed by a few anomalous data points.
The second crucial concept is using cross-entropy rather than the L2 norm to measure the loss. The network is trying to classify the input, that is, assign one of 10 labels to it. Given a predicted probability distribution over the outputs (e.g. from a softmax layer)
$$s = (s_1, \ldots, s_K)$$
and a true one-hot encoded label vector
$$y = (y_1, \ldots, y_K)$$
the cross-entropy loss for a single example is defined as
$$L = -\sum_{k=1}^{K} y_k \log s_k$$
For one-hot targets this simplifies to
$$L = -\log s_c$$
where $c$ is the index of the correct label.
The third concept is dropout. Dropout sets a random fraction of the activations to zero during a training epoch. Dropout is a kind of regularization because it reduces overfitting: it forces neurons to learn more robust, distributed representations by randomly removing them during training, effectively ensembling many models into one.
As an example of these concepts in practice, here is a complete Python program for recognizing handwritten digits. The dataset, MNIST, was used in one of the first successful neural network programs, LeNet-5, created in 1998 by Yann LeCun. MNIST is still a starter baseline for work in machine learning and is included in PyTorch.
```python
# mnist_cnn_with_dropout.py
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ------------------------
# Data: MNIST
# ------------------------
# Transform: convert to tensor + normalize with dataset mean/std
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Train and test datasets
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Dataloaders
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)

# ------------------------
# Model: small CNN
# ------------------------
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # Conv layer (28x28 -> 28x28)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # Pool (28x28 -> 14x14)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # Conv (14x14 -> 14x14)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # Pool (14x14 -> 7x7)
            nn.Flatten(),                                 # Flatten to vector
            nn.Linear(64 * 7 * 7, 128),                   # Fully connected layer
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),                              # Dropout regularization
            nn.Linear(128, num_classes)                   # Output logits (10 classes)
        )

    def forward(self, x):
        y = self.classifier(x)
        return y

# Instantiate model
model = SmallCNN().to(device)

# Loss function: CrossEntropyLoss combines softmax + negative log-likelihood
criterion = nn.CrossEntropyLoss()

# Optimizer: Adam with learning rate 1e-3.
# L2 regularization is set by the weight_decay parameter.
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ------------------------
# Training loop
# ------------------------
for epoch in range(5):  # Train for 5 epochs
    # Put PyTorch into training mode
    model.train()
    running_loss, running_correct, running_total = 0.0, 0, 0

    # Loop over the training instances
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)  # Reset gradients
        logits = model(images)                 # Forward pass
        loss = criterion(logits, targets)      # Compute loss
        loss.backward()                        # Backprop
        optimizer.step()                       # Update weights

        # Accumulate stats over this epoch
        running_loss += loss.item() * targets.size(0)
        running_correct += (logits.argmax(1) == targets).sum().item()
        running_total += targets.size(0)

    # final stats for this epoch
    train_loss = running_loss / running_total
    train_acc = running_correct / running_total

    # ------------------------
    # Evaluation
    # ------------------------
    # Put PyTorch in evaluation mode
    model.eval()
    test_loss, test_correct, test_total = 0.0, 0, 0
    with torch.no_grad():
        for images, targets in test_loader:  # compute loss over the test data
            images, targets = images.to(device), targets.to(device)
            logits = model(images)
            loss = criterion(logits, targets)
            test_loss += loss.item() * targets.size(0)
            test_correct += (logits.argmax(1) == targets).sum().item()
            test_total += targets.size(0)

    print(f"Epoch {epoch+1}/5 | "
          f"train loss {train_loss:.4f} acc {train_acc:.4f} | "
          f"test loss {test_loss/test_total:.4f} acc {test_correct/test_total:.4f}")
```
Understanding the concepts presented in this tutorial should give you a good foundation for reading papers and building applications with neural networks. The most important concepts not covered here that you should learn about are:
Neural methods for natural language processing: how to convert words to real-number vectors in a manner that captures word meaning. Once so converted, a neural network architecture called the transformer can handle a wide range of language tasks - and might even be the basis for artificial general intelligence.
Unsupervised learning: how to learn when there are no labels by discovering regularities in raw data.
Reinforcement learning: learning how to act in a world on the basis of receiving rewards that are delayed in time from when the system must begin to act.
Non-neural-net methods of machine learning. For many practical problems, neural networks are overkill and require too much data. Important other methods are regression and its variants, decision trees, nearest-neighbor methods, clustering, and what are called support vector machines (SVMs). Until deep neural networks became practical by using GPUs, SVMs were the most powerful form of machine learning - and their theory also involves much calculus and linear algebra.