Henry Kautz henry.kautz@gmail.com
The goal of machine learning is to learn a function $f: X \rightarrow Y$ that maps each input in $X$ to an output in $Y$. Let
$$T = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$$
be a set of training instances, where each $x^{(i)} \in X$ and $y^{(i)} \in Y$. Now let us assume that $f$ is drawn from a family of functions $f_\theta$ parameterized by a vector of real numbers $\theta$. Our goal, then, is to find parameters that minimize the total distance between the function's predictions and the targets:
$$\theta^* = \arg\min_\theta \sum_{i=1}^{n} d\big(f_\theta(x^{(i)}),\, y^{(i)}\big)$$
Now let us take the output space to be the reals or a vector of reals; the distance measure to be the squared Euclidean norm; and the loss to be the sum of the distances over the training instances:
$$L(\theta) = \sum_{i=1}^{n} \big\|f_\theta(x^{(i)}) - y^{(i)}\big\|^2$$
In the case that the output is a single real number, the squared Euclidean norm is simply the squared error $\big(f_\theta(x^{(i)}) - y^{(i)}\big)^2$.
Now consider one element of the vector inside the summation above. We can use the chain rule to calculate that
$$\frac{\partial}{\partial \theta_j}\big(f_\theta(x)_k - y_k\big)^2 = 2\,\big(f_\theta(x)_k - y_k\big)\,\frac{\partial f_\theta(x)_k}{\partial \theta_j}$$
This expression can be interpreted as twice the influence of the parameter on the results times the difference between the current candidate and the target. (The difference is also called the error or the residual.) Gradient descent minimizes the loss by repeatedly updating the parameters by a small constant $\eta$, called the learning rate, times the negative of the gradient:
$$\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$$
We see now that we can compute the gradient of the loss function as long as the family of parameterized functions $f_\theta$ is differentiable with respect to the parameters $\theta$.
In gradient descent, the parameters are adjusted many times. Each pass over the training data is called an epoch. A stopping rule specifies when training stops. A simple rule is to stop when all the elements of the gradient are smaller in absolute value than some given small constant. Another stopping rule is to hold out some of the data in what is called a validation set. The validation data is not used in computing the gradient. Instead, after some number of descent steps, the network is tested on the validation data. If the loss on the validation set stops decreasing, then training stops, because the network has stopped generalizing to data not used in training. This stopping rule prevents the network from overfitting to the training data and failing to generalize to new data.
The general purpose of regularization is to prefer simpler functions, in the hope that they generalize better to unseen data. This is a way to implement Ockham's Razor, the principle that the simplest explanation is usually the best.
A commonly used regularization function is the sum of squares of the parameters, which is called the L2 regularizer:
$$R(\theta) = \lambda \sum_j \theta_j^2$$
where the constant $\lambda$ controls the strength of the regularization. This will prefer functions where many of the parameters are zero or near zero. (It is often combined with a heuristic step where near-zero parameters are rounded to zero.) It is easy to compute the gradient of this function:
$$\frac{\partial R}{\partial \theta_j} = 2\lambda\,\theta_j$$
A standard artificial neuron takes the weighted sum of its inputs plus a constant and applies a non-linear function to produce the output. The constant can be handled by considering it to be a weight applied to a special input that is always 1. Many non-linear output functions have been studied, but the most useful turns out to be a simple threshold function that outputs zero if the sum is negative and outputs the sum otherwise. This is called a "rectified linear unit" (ReLU). We assume that input $x_0$ is always 1, so the neuron computes
$$f_w(x) = \mathrm{ReLU}\Big(\sum_j w_j x_j\Big), \qquad \mathrm{ReLU}(z) = \max(0, z)$$
The fact that this function is not differentiable at 0 can be ignored by simply defining its derivative to be 0 at 0. (Although the standard definition of the derivative does not allow this trick, a more generalized notion of derivative from convex analysis allows it as a "choice of subgradient".)
When taking the partial derivative of the loss with respect to a single weight $w_j$, the chain rule passes through the ReLU, whose derivative is 1 when the weighted sum is positive and 0 otherwise. Note that because our neuron has a single output, its partial derivative is a single real number rather than a vector of real numbers. We can write down and simplify the gradient as follows:
$$\frac{\partial L}{\partial w_j} = 2\,\big(f_w(x) - y\big)\,\mathbb{1}\Big[\sum_k w_k x_k > 0\Big]\, x_j$$
The reader will note that when the weighted sum is negative, the unit outputs zero and the gradient is zero, so the training instance causes no change to the weights.
Single neurons or even a single layer of neurons can only learn linear functions. Once multiple layers are introduced, with enough neurons any function can be learned. There are theoretical results on the width and depth necessary to learn various classes of functions. For example, any continuous function can be approximated arbitrarily well by a network of depth 2 (a single hidden layer), given sufficient width.
Consider a layered neural network, where the outputs of the units at each layer serve as the inputs to the units at the next layer. Following common practice, we will henceforth use $\mathrm{ReLU}'$ to denote the derivative of the ReLU function:
$$\mathrm{ReLU}'(z) = \mathbb{1}[z > 0]$$
The derivative uses indicator function notation, where $\mathbb{1}[\phi]$ is 1 if the condition $\phi$ is true and 0 otherwise.
We begin by considering a three-unit network where there are two inputs in layer 0, two ReLU units in layer 1, one ReLU unit in the output layer 2, and no regularization. There are a total of 9 weights in the network. (Recall that three of the weights are for the special always-1 inputs that implement the constant terms.)
We will see that the calculation of the gradient uses the error signal of the entire network in computing the derivatives for the weights in the output layer, and then propagates the error signal back to the previous layer in proportion to the weights on the outputs of that layer, and so on. This intuitive description of the calculation is why the method is named "back propagation". It is important to understand, however, that it is simply an application of the chain rule.
For a given training instance $(x, y)$, the forward pass computes the following quantities. To handle the constant terms, let $x_0 = 1$ and $a^{(1)}_0 = 1$.
Layer 1 (two units):
$$z^{(1)}_i = \sum_{j=0}^{2} w^{(1)}_{ij}\, x_j, \qquad a^{(1)}_i = \mathrm{ReLU}\big(z^{(1)}_i\big), \qquad i = 1, 2$$
Layer 2 (one unit / output):
$$z^{(2)} = \sum_{i=0}^{2} w^{(2)}_i\, a^{(1)}_i, \qquad \hat{y} = \mathrm{ReLU}\big(z^{(2)}\big)$$
Loss:
$$L = (\hat{y} - y)^2$$
We define the following shorthand for the partial derivative of the loss with respect to the pre-activation of unit $i$ in layer $\ell$:
$$\delta^{(\ell)}_i = \frac{\partial L}{\partial z^{(\ell)}_i}$$
The $\delta$ values are the error signals; we compute them from the output layer backward in six steps.
Step 1: Top derivative
$$\frac{\partial L}{\partial \hat{y}} = 2\,(\hat{y} - y)$$
Step 2: Through the ReLU
$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = 2\,(\hat{y} - y)\,\mathbb{1}\big[z^{(2)} > 0\big]$$
Step 3: Gradients for layer-2 weights
$$\frac{\partial L}{\partial w^{(2)}_i} = \delta^{(2)}\, a^{(1)}_i, \qquad i = 0, 1, 2$$
Step 4: Send error to hidden activations
$$\frac{\partial L}{\partial a^{(1)}_i} = \delta^{(2)}\, w^{(2)}_i, \qquad i = 1, 2$$
Step 5: Through hidden ReLUs to pre-activations
$$\delta^{(1)}_i = \delta^{(2)}\, w^{(2)}_i\, \mathbb{1}\big[z^{(1)}_i > 0\big], \qquad i = 1, 2$$
Step 6: Gradients for layer-1 weights
$$\frac{\partial L}{\partial w^{(1)}_{ij}} = \delta^{(1)}_i\, x_j, \qquad i = 1, 2, \quad j = 0, 1, 2$$
Many applications today use neural networks that are thousands of units wide and hundreds of layers deep. We now show the general form of the feed-forward and back-propagation calculations for a network of depth $D$.
Forward pass (any unit $i$ in any layer). Let $a^{(0)} = x$, with the dummy activation $a^{(\ell)}_0 = 1$ at every layer. For each layer $\ell = 1, \ldots, D$:
$$z^{(\ell)}_i = \sum_j w^{(\ell)}_{ij}\, a^{(\ell-1)}_j, \qquad a^{(\ell)}_i = \mathrm{ReLU}\big(z^{(\ell)}_i\big)$$
Output and loss:
$$\hat{y} = a^{(D)}, \qquad L = \|\hat{y} - y\|^2$$
Backward pass (any unit $i$ in any layer). Top layer errors:
$$\delta^{(D)}_i = 2\,(\hat{y}_i - y_i)\,\mathbb{1}\big[z^{(D)}_i > 0\big]$$
Hidden layers $\ell = D-1, \ldots, 1$:
$$\delta^{(\ell)}_i = \Big(\sum_k w^{(\ell+1)}_{ki}\, \delta^{(\ell+1)}_k\Big)\, \mathbb{1}\big[z^{(\ell)}_i > 0\big]$$
Gradients with respect to weights (including the bias weights $w^{(\ell)}_{i0}$ on the dummy inputs):
$$\frac{\partial L}{\partial w^{(\ell)}_{ij}} = \delta^{(\ell)}_i\, a^{(\ell-1)}_j$$
We have been using the loss without a regularization term; we now add the L2 regularizer back in.
We make one small change in the regularization function by not including the weights on the constant 1 dummy inputs - that is, the bias weights. This is because the purpose of regularization is to push the network toward sparsity by having many zero weights on connections between neurons. Thus, we write:
$$R(w) = \lambda \sum_{\ell} \sum_{i} \sum_{j \geq 1} \big(w^{(\ell)}_{ij}\big)^2$$
where index $j = 0$ (the bias weight) is excluded from each sum.
The order of the partial derivatives in the gradient follows the order of the weights in the vector $w$: if $w = (w_1, w_2, \ldots, w_m)$, the gradient is
$$\nabla_w L = \Big(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_m}\Big)$$
Although we represented the gradient as a single long vector, in machine learning it is common to organize it as a set of 2-dimensional matrices, one per layer, because the width of the network can vary by layer. Thus you will see expressions like the following, where each $\nabla_{W^{(\ell)}} L$ is a matrix with the same shape as the layer's weight matrix $W^{(\ell)}$:
$$\nabla_{W^{(\ell)}} L = \left[\frac{\partial L}{\partial w^{(\ell)}_{ij}}\right]_{ij}, \qquad \ell = 1, \ldots, D$$
In the previous section, we used only basic calculus and linear algebra notation. We can more compactly represent the gradient by using an operator named the Jacobian that combines notions from calculus and linear algebra. For a vector-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all first-order partial derivatives:
$$J_f = \left[\frac{\partial f_i}{\partial x_j}\right]_{ij}$$
In a neural network, the weights at layer $\ell$ can be collected into a matrix $W^{(\ell)}$, so that the layer computes
$$a^{(\ell)} = \mathrm{ReLU}\big(W^{(\ell)} a^{(\ell-1)}\big)$$
where the ReLU is applied element-wise.
Let the dataset loss be $L(w)$, a scalar-valued function of the weight vector. Because its output has dimension 1, its Jacobian $J_L$ is a single row. The gradient vector is then the transpose of that row:
$$\nabla_w L = J_L^{\top}$$
We next want to calculate the mean loss over the dataset. Since differentiation is linear, the Jacobian of the mean loss is simply the mean of the per-instance Jacobians:
$$J_{\bar L} = \frac{1}{n}\sum_{i=1}^{n} J_{L_i}$$
Now we come to the chain rule in Jacobian form. Let $h(x) = f(g(x))$ be the composition of two functions. Then the Jacobian of the composition is the matrix product of the Jacobians:
$$J_h(x) = J_f\big(g(x)\big)\, J_g(x)$$
Several other kinds of units and layers are used in deep neural networks.
Softmax takes a vector of pre-activations as input, computes their exponentials, and then normalizes them so they sum to one. The result has several interesting properties. First, if one of the inputs is much larger than the others, the corresponding output is near 1 and the others are near 0. Softmax is thus a smooth relative of the argmax operator. Second, the output can be viewed as a probability distribution. For example, if the network is being used to classify inputs and each of the inputs to the softmax is a score computed for a particular label, then the output can be interpreted as the probability that the corresponding label is the correct one.
Suppose the input to a softmax layer is the vector of activations $a$ from the previous layer. The pre-activation vector is
$$z = W a$$
The softmax output is the vector $s$ with components
$$s_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$
This ensures that each $s_i$ is between 0 and 1 and that $\sum_i s_i = 1$. Now we consider the derivative of softmax. If $i = j$:
$$\frac{\partial s_i}{\partial z_j} = s_i\,(1 - s_i)$$
If $i \neq j$:
$$\frac{\partial s_i}{\partial z_j} = -\,s_i\, s_j$$
A key operation in applications of neural networks to computer vision is convolution. A 2-D convolution takes a small filter (also called a kernel) and slides it across an input grid, such as an image. At each position, the filter values are multiplied with the overlapping input values and summed together, producing a single number in the output grid. Repeating this process across all positions produces a new 2-D array where each entry reflects how strongly that local region of the input “matches” the filter. In short: A 2-D convolution is a way of scanning a small pattern across an image to detect features like edges, corners, or textures, turning local structure in the input into structured signals in the output.
As you read the previous paragraph, you may be puzzled by the idea of a filter "sliding" across an image. Neurons are simply wired from one to another in a fixed pattern (even if the weights change) and cannot slide connections from one neuron to another. Sliding is a metaphor for weight sharing, where we deliberately reuse the same parameter values across multiple connections instead of learning a separate weight for each one. In other words, different parts of the network are “tied together” so they use the same weights. Thus there are many instances of the filter, implemented by different neurons across the 2-D image array, each separated by a few pixels (a quantity called the stride). All these instances, however, use the same values for their weights.
Let the input be an image $X$ with entries $X[i, j]$, and let the convolution kernel (shared weights) be a small matrix $K$ of size $k \times k$ with entries $K[u, v]$. The convolution output at spatial position $(i, j)$ is
$$Y[i, j] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K[u, v]\; X[i + u,\, j + v]$$
Here, the same kernel weights $K[u, v]$ are used at every position $(i, j)$: this is the weight sharing described above.
Convolution layers are often used in conjunction with pooling layers, also called downsampling layers. While a filter activates at the position in the image where it matches, a pooling layer determines whether a filter activates anywhere in a rectangular region of the image of a specified size. For example, you might have a high-level filter that detects cats, and you want to know if a cat appears anywhere in the image.
Let the input feature map be $A$ with entries $A[i, j]$. Choose a pooling window of size $p \times p$ and slide it across $A$ with stride $p$, producing an output map $P$. Each entry is defined as
$$P[i, j] = \max_{0 \le u,\, v < p}\; A[i p + u,\; j p + v]$$
The three concepts in this section title are crucial for making it practical to train neural networks. The formalization of gradient descent presented above computed the gradient across the entire training set for each weight update. This would be too slow if there were thousands of training instances. The idea of mini-batches (or just batches) is to update the weights after a given number of instances are processed during an epoch. The batch size is often chosen to be 128: small enough for efficient processing, but large enough to prevent the update from being swayed by a few anomalous data points.
The second crucial concept is using cross-entropy rather than the L2 norm to measure the loss. The network is trying to classify the input, that is, assign one of 10 labels to it. Given a predicted probability distribution over the outputs (e.g. from a softmax layer)
$$s = (s_1, \ldots, s_K)$$
and a true one-hot encoded label vector
$$y = (y_1, \ldots, y_K)$$
the cross-entropy loss for a single example is defined as
$$L = -\sum_{k=1}^{K} y_k \log s_k$$
For one-hot targets this simplifies to
$$L = -\log s_c$$
where $c$ is the index of the correct label.
The third concept is dropout. Dropout sets a random fraction of the activations to zero during a training epoch. Dropout is a kind of regularization because it reduces overfitting: it forces neurons to learn more robust, distributed representations by randomly removing them during training, effectively ensembling many models into one.
As an example of these concepts in practice, here is a complete Python program for recognizing handwritten digits. The dataset, MNIST, was used in one of the first successful neural network programs, LeNet-5, created in 1998 by Yann LeCun. MNIST is still a starter baseline for work in machine learning and is included in PyTorch.
```python
# mnist_cnn_with_dropout.py
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ------------------------
# Data: MNIST
# ------------------------
# Transform: convert to tensor + normalize with dataset mean/std
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Train and test datasets
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Dataloaders
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)

# ------------------------
# Model: small CNN
# ------------------------
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # Conv layer (28x28 -> 28x28)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # Pool (28x28 -> 14x14)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # Conv (14x14 -> 14x14)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # Pool (14x14 -> 7x7)
            nn.Flatten(),                                 # Flatten to vector
            nn.Linear(64 * 7 * 7, 128),                   # Fully connected layer
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),                              # Dropout regularization
            nn.Linear(128, num_classes)                   # Output logits (10 classes)
        )

    def forward(self, x):
        y = self.classifier(x)
        return y

# Instantiate model
model = SmallCNN().to(device)

# Loss function: CrossEntropyLoss combines softmax + negative log-likelihood
criterion = nn.CrossEntropyLoss()

# Optimizer: Adam with learning rate 1e-3.
# L2 regularization is set by the weight_decay parameter.
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# ------------------------
# Training loop
# ------------------------
for epoch in range(5):  # Train for 5 epochs
    # Put PyTorch into training mode
    model.train()
    running_loss, running_correct, running_total = 0.0, 0, 0

    # Loop over the training instances
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)  # Reset gradients
        logits = model(images)                 # Forward pass
        loss = criterion(logits, targets)      # Compute loss
        loss.backward()                        # Backprop
        optimizer.step()                       # Update weights

        # Accumulate stats over this epoch
        running_loss += loss.item() * targets.size(0)
        running_correct += (logits.argmax(1) == targets).sum().item()
        running_total += targets.size(0)

    # final stats for this epoch
    train_loss = running_loss / running_total
    train_acc = running_correct / running_total

    # ------------------------
    # Evaluation
    # ------------------------
    # Put PyTorch in evaluation mode
    model.eval()
    test_loss, test_correct, test_total = 0.0, 0, 0
    with torch.no_grad():
        for images, targets in test_loader:  # compute loss over the test data
            images, targets = images.to(device), targets.to(device)
            logits = model(images)
            loss = criterion(logits, targets)
            test_loss += loss.item() * targets.size(0)
            test_correct += (logits.argmax(1) == targets).sum().item()
            test_total += targets.size(0)

    print(f"Epoch {epoch+1}/5 | "
          f"train loss {train_loss:.4f} acc {train_acc:.4f} | "
          f"test loss {test_loss/test_total:.4f} acc {test_correct/test_total:.4f}")
```
Understanding the concepts presented in this tutorial should give you a good foundation for reading papers and building applications with neural networks. The most important concepts not covered here that you should learn about are:
Neural methods for natural language processing: how to convert words to real-number vectors in a manner that captures word meaning. Once so converted, a neural network architecture called the transformer can handle a wide range of language tasks - and might even be the basis for artificial general intelligence.
Unsupervised learning: how to learn when there are no labels by discovering regularities in raw data.
Reinforcement learning: learning how to act in a world on the basis of receiving rewards that are delayed in time from when the system must begin to act.
Non-neural-net methods of machine learning. For many practical problems, neural networks are overkill and require too much data. Important other methods are regression and its variants, decision trees, nearest-neighbor methods, clustering, and what are called support vector machines (SVMs). Until deep neural networks became practical by using GPUs, SVMs were the most powerful form of machine learning - and their theory also involves much calculus and linear algebra.