Machine Learning is Calculus Plus Linear Algebra

Henry Kautz henry.kautz@gmail.com

Loss Functions

The goal of machine learning is to learn a function f from some input space X to some output space Y from training data D. This may sound abstract, but it is general enough to cover anything we might call learning. The algorithm is given the form of the input and output spaces and the space of functions F from which f is drawn. In supervised learning, the learning algorithm is given a set of pairs [x,y]. In unsupervised learning, the algorithm is given only a set of instances x. We consider only supervised learning here.

Let f be a candidate function. The loss L measures the errors f makes on the training data.

(1)   $L_f = \sum_{[x,y] \in D} \lvert f(x), y \rvert + \lambda\, r(f)$

where |f(x),y| is a measure of distance in the output space, λ is a small real number, and r(f) is a "regularizer", that is, some a priori measure of preference over learned functions. The goal of machine learning is to find an f that minimizes this loss function.

Parameterized Families of Functions

Now let us assume that f comes from a class of functions parameterized by a vector of real numbers Θ. For example, the class could be neural networks of a certain shape, and Θ could be a vector of the weights of the network. To simplify notation, we will write $L_{f_\Theta}$ as simply $L_\Theta$, and $r(f_\Theta)$ as $r(\Theta)$.

Our goal, then, is to find Θ that minimizes $L_\Theta$. A convex function of a single real variable can be minimized by starting at a random point, computing the derivative of the function at that point, taking a small step in the direction opposite to the derivative, and repeating until the derivative is zero. A convex function of several real variables can be minimized by following the negative of its "gradient" downhill, where the gradient is the vector of partial derivatives of the function with respect to each of its parameters. Although in general the loss $L_\Theta$ is not provably convex, in practice for many useful classes of functions this approach finds a good approximation to the minimum. The gradient is defined as follows, where n is the number of parameters:

(2)   $\nabla_\Theta L_\Theta = \left[ \dfrac{\partial L_\Theta}{\partial \theta_1}, \dfrac{\partial L_\Theta}{\partial \theta_2}, \ldots, \dfrac{\partial L_\Theta}{\partial \theta_n} \right]$

Now let us take the output space to be the reals or a vector of reals; the distance measure to be the squared Euclidean norm; and r to be a real-valued function:

(3)   $L_\Theta = \sum_{[x,y] \in D} \lVert f_\Theta(x) - y \rVert_2^2 + \lambda\, r(\Theta)$

In the case that the output is a single real number, the squared Euclidean norm is simply $(f_\Theta(x) - y)^2$. If the output is a vector, it is the dot product $(f_\Theta(x) - y)^T (f_\Theta(x) - y)$, and thus is a single real number. We can simplify the gradient using the fact that the derivative of a sum is a sum of derivatives as follows:

(4)   $\nabla_\Theta L_\Theta = \sum_{[x,y] \in D} \left[ \dfrac{\partial (f_\Theta(x) - y)^2}{\partial \theta_1}, \dfrac{\partial (f_\Theta(x) - y)^2}{\partial \theta_2}, \ldots, \dfrac{\partial (f_\Theta(x) - y)^2}{\partial \theta_n} \right] + \lambda\, \nabla_\Theta r(\Theta)$

Now consider one element of the vector inside the summation above. We can use the chain rule to calculate that

(5)   $\dfrac{\partial (f_\Theta(x) - y)^2}{\partial \theta_i} = 2\, \dfrac{\partial f_\Theta(x)}{\partial \theta_i}\, (f_\Theta(x) - y)$

This expression can be interpreted as twice the influence of the parameter on the results times the difference between the current candidate and the target. (The difference is also called the error or the residual.) Gradient descent minimizes the loss by repeatedly updating the parameters by a small constant η times the negative of the gradient.

(6)   $\Theta \leftarrow \Theta - \eta\, \nabla_\Theta L_\Theta$

We see now that we can compute the gradient of the loss function as long as the family of parameterized functions $f_\Theta$ is differentiable with respect to Θ and the regularization function is likewise differentiable.

In gradient descent, the parameters are adjusted many times. Each pass over the training data is called an epoch. A stopping rule specifies when training stops. A simple rule is to stop when all the elements of the gradient are smaller in absolute value than some given small constant. Another stopping rule is to hold out some of the data in what is called a validation set. The validation data is not used in computing the gradient. Instead, after some number of descent steps, the network is tested on the validation data. If the loss on the validation set stops decreasing, then training stops, because the network has stopped generalizing to data not used in training. This stopping rule prevents the network from overfitting to the training data and failing to generalize to new data.
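To make the procedure concrete, here is a minimal sketch of full-batch gradient descent with an L2 regularizer and both stopping rules, applied to a synthetic linear-regression problem. The data, the model, and the hyperparameters (the step size eta, the regularization constant lambda, the gradient tolerance, and the patience) are illustrative assumptions.

```python
# A minimal sketch of gradient descent with the two stopping rules described
# above, on a synthetic linear-regression problem. The data, the model, and
# the hyperparameters (eta, lam, tol, patience) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 instances, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

# Hold out part of the data as a validation set.
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def loss(theta, X, y, lam):
    resid = X @ theta - y
    return np.sum(resid ** 2) + lam * np.sum(theta ** 2)

def grad(theta, X, y, lam):
    resid = X @ theta - y
    return 2 * X.T @ resid + 2 * lam * theta   # gradient of squared error + L2

eta, lam, tol, patience = 1e-3, 0.01, 1e-6, 5
theta = rng.normal(size=3)
best_val, bad_epochs = np.inf, 0
for epoch in range(10_000):
    g = grad(theta, X_train, y_train, lam)
    theta -= eta * g                           # step against the gradient
    if np.max(np.abs(g)) < tol:                # rule 1: the gradient is tiny
        break
    val = loss(theta, X_val, y_val, lam)       # rule 2: validation loss has
    if val < best_val:                         # stopped decreasing
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print(epoch, theta)
```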

Regularization Functions

The general purpose of regularization is to prefer simpler functions, in the hope that they generalize better to unseen data. This is a way to implement Ockham’s Razor, the principle that the simplest explanation is usually the best.

A commonly used regularization function is the sum of squares of the parameters, which is called the L2 regularizer.

(7)   $r(\Theta) = \sum_i \theta_i^2$

This will prefer functions where many of the parameters are zero or near zero. (It is often combined with a heuristic step where near-zero parameters are rounded to zero.) It is easy to compute the gradient of this function:

(8)   $\nabla_\Theta \left( \lambda \sum_j \theta_j^2 \right) = 2 \lambda \Theta$

Artificial Neurons

A standard artificial neuron takes the weighted sum of its inputs plus a constant and applies a non-linear function to produce the output. The constant can be handled by considering it to be a weight applied to a special input that is always 1. Many non-linear output functions have been studied, but the most useful turns out to be a simple threshold function that outputs zero if the sum is negative and outputs the sum otherwise. This is called a "rectified linear unit" (ReLU). We assume that input $x_0$ is fixed to 1. The parameterized function $f_\Theta$ is thus

(9)   $f_\Theta(x) = \begin{cases} 0 & \text{if } \Theta^T x \le 0 \\ \Theta^T x & \text{otherwise} \end{cases}$

The fact that this function is not differentiable at 0 can be ignored by simply defining its derivative to be 0 at 0. (Although the standard definition of the derivative does not allow this trick, more general notions of derivative from convex analysis allow it as a "choice of subgradient".)

When taking the partial derivative of $f_\Theta(x)$ with respect to a weight $\theta_i$, all of the terms in the sum $\Theta^T x$ not involving $\theta_i$ become zero, and the one that does involve $\theta_i$ contributes its coefficient $x_i$. Therefore

(10)   $\dfrac{\partial f_\Theta(x)}{\partial \theta_i} = \begin{cases} 0 & \text{if } \Theta^T x \le 0 \\ x_i & \text{otherwise} \end{cases}$

Note that because our neuron has a single output, its partial derivative is a single real number rather than a vector of real numbers. We can write down and simplify the gradient as follows.

(11)   $\nabla_\Theta L_\Theta = \sum_{[x,y] \in D} \left[ 2\, \dfrac{\partial f_\Theta(x)}{\partial \theta_0} (f_\Theta(x) - y),\; 2\, \dfrac{\partial f_\Theta(x)}{\partial \theta_1} (f_\Theta(x) - y),\; \ldots,\; 2\, \dfrac{\partial f_\Theta(x)}{\partial \theta_n} (f_\Theta(x) - y) \right] + 2 \lambda \Theta$
(12)   $= 2 \sum_{[x,y] \in D} (f_\Theta(x) - y) \left[ \dfrac{\partial f_\Theta(x)}{\partial \theta_0},\; \dfrac{\partial f_\Theta(x)}{\partial \theta_1},\; \ldots,\; \dfrac{\partial f_\Theta(x)}{\partial \theta_n} \right] + 2 \lambda \Theta$
(13)   $= 2 \left( \sum_{[x,y] \in D} (f_\Theta(x) - y) \begin{cases} 0 & \text{if } \Theta^T x \le 0 \\ x^T & \text{otherwise} \end{cases} \right) + 2 \lambda \Theta$

The reader will note that when x is such that the output of the unit is zero, the first term of the gradient becomes zero and only the regularization function can move the gradient away from zero. This fact can cause rectified linear units to become stuck at zero during optimization. There are variations of the ReLU that include a slight slope in the negative-sum region to prevent this from happening. The issue of dead units is less of a problem once we move to large networks with thousands or millions of units, because having some fraction of the units "die" during training can make the network smaller and faster.
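Here is a minimal sketch that puts equations (9) through (13) to work, training a single ReLU neuron by gradient descent with an L2 regularizer. The synthetic data, learning rate, and regularization constant are illustrative assumptions.

```python
# A minimal sketch of equations (9)-(13): training a single ReLU neuron by
# gradient descent with an L2 regularizer. The synthetic data, learning rate,
# and regularization constant are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), X])          # x_0 is the constant-1 input
target_theta = np.array([0.5, 1.5, -2.0])
y = np.maximum(0.0, X @ target_theta)          # targets from a "true" ReLU

def f(theta, X):
    return np.maximum(0.0, X @ theta)          # equation (9)

theta = rng.normal(size=3)
eta, lam = 1e-3, 1e-4
for step in range(5000):
    z = X @ theta
    err = f(theta, X) - y                      # the residual f_Theta(x) - y
    active = (z > 0).astype(float)             # ReLU derivative, equation (10)
    # Equation (13): 2 * sum over the data of residual * x (where active),
    # plus the gradient of the L2 regularizer.
    g = 2 * X.T @ (err * active) + 2 * lam * theta
    theta -= eta * g

print(theta)       # typically close to target_theta (up to the effect of lam)
```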

Layered Neural Networks

Single neurons or even a single layer of neurons can only learn linear functions. Once multiple layers are introduced, with enough neurons any function can be approximated. There are theoretical results on the width and depth necessary to approximate various classes of functions. For example, any continuous function can be approximated to arbitrary accuracy by a network of depth 2 and sufficiently high width. Likewise, any function can be approximated by a network of width n+1, where n is the number of inputs, and sufficiently high depth. In practice, of course, both the width and depth must be bounded, and finding a sufficient width and depth for a particular task is a matter of trial and error.

Consider a layered neural network where L is the depth. Each unit in a layer takes inputs from all of the neurons in the previous layer. By convention the inputs are layer 0, and the input in position 0 in each layer is fixed to 1. The following notation is commonly used, where l is the layer and i is the position of the neuron in the layer.

(14)
$a_i^{(l)}$: output of neuron $i$ in layer $l$
$\hat{y}$: output (vector) of the neuron(s) in the highest layer
$z_i^{(l)}$: input sum (pre-activation) of neuron $i$ in layer $l$
$w_{ij}^{(l)}$: weight from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$

Following common practice, we will henceforth use W for the parameters instead of Θ. We will also use explicit notation for the nonlinear activation function σ of a ReLU and its derivative:

(15)   $\sigma(z) = \max(0, z)$
(16)   $\sigma'(z) = \mathbf{1}\{z > 0\}$

The derivative uses indicator function notation, where 1{P} means "1 if the statement P is true, 0 otherwise".

Example: a 3-Unit Network

We begin by considering a three-unit network where there are two inputs in layer 0, two ReLU units in layer 1, one ReLU unit in the output layer 2, and no regularization. There are a total of 9 weights in the network. (Recall that three of the weights are for the special $x_0^{(l)}$ constant-1 inputs. Some formalizations of neural networks use a separate vector of so-called bias weights instead of using special fixed inputs.)

We will see that the calculation of the gradient uses the error signal of the entire network in computing the derivatives for the weights in the output layer, and then propagates the error signal back to the previous layer in proportion to the weights on the outputs of that layer, and so on. This intuitive description of the calculation is why the method is named "back propagation". It is important to understand, however, that it is simply an application of the chain rule.

For a given training instance $((x_1, x_2), y)$, we begin by running the neural network forward from x to $\hat{y}$ and storing the intermediate values. This is called the "forward pass".

Layer 1 (two units):

(17)
$z_1^{(1)} = w_{10}^{(1)} + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2$
$a_1^{(1)} = \sigma(z_1^{(1)}) = \max(0, z_1^{(1)})$
$z_2^{(1)} = w_{20}^{(1)} + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2$
$a_2^{(1)} = \sigma(z_2^{(1)}) = \max(0, z_2^{(1)})$

Layer 2 (one unit / output):

(18)   $z_1^{(2)} = w_{10}^{(2)} + w_{11}^{(2)} a_1^{(1)} + w_{12}^{(2)} a_2^{(1)}$
(19)   $\hat{y} = a_1^{(2)} = \sigma(z_1^{(2)}) = \max(0, z_1^{(2)})$

Loss:

(20)   $\ell = (\hat{y} - y)^2$

We define the following shorthand for the partial derivative of the loss with respect to the pre-activation of unit i in layer l:

(21)   $\delta_i^{(l)} \equiv \dfrac{\partial \ell}{\partial z_i^{(l)}}$

The δ are called "error terms" and measure how much the loss would change if you nudged the corresponding pre-activations a bit. We can now calculate the gradient of the loss function by backpropagation via the chain rule.

Step 1: Top derivative

(22)   $\dfrac{\partial \ell}{\partial \hat{y}} = 2 (\hat{y} - y)$

Step 2: Through the ReLU

(23)   $\delta_1^{(2)} \equiv \dfrac{\partial \ell}{\partial z_1^{(2)}} = \dfrac{\partial \ell}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial z_1^{(2)}} = 2 (\hat{y} - y)\, \sigma'(z_1^{(2)}) = 2 (\hat{y} - y)\, \mathbf{1}\{z_1^{(2)} > 0\}$

Step 3: Gradients for layer-2 weights w(2)

(24)   $\dfrac{\partial \ell}{\partial w_{10}^{(2)}} = \delta_1^{(2)}, \qquad \dfrac{\partial \ell}{\partial w_{11}^{(2)}} = \delta_1^{(2)} a_1^{(1)}, \qquad \dfrac{\partial \ell}{\partial w_{12}^{(2)}} = \delta_1^{(2)} a_2^{(1)}$

Step 4: Send error to hidden activations

(25)   $\dfrac{\partial \ell}{\partial a_1^{(1)}} = \delta_1^{(2)} w_{11}^{(2)}, \qquad \dfrac{\partial \ell}{\partial a_2^{(1)}} = \delta_1^{(2)} w_{12}^{(2)}$

Step 5: Through hidden ReLUs to pre-activations

(26)
$\delta_1^{(1)} \equiv \dfrac{\partial \ell}{\partial z_1^{(1)}} = \left( \delta_1^{(2)} w_{11}^{(2)} \right) \sigma'(z_1^{(1)}) = \delta_1^{(2)} w_{11}^{(2)}\, \mathbf{1}\{z_1^{(1)} > 0\}$
$\delta_2^{(1)} \equiv \dfrac{\partial \ell}{\partial z_2^{(1)}} = \left( \delta_1^{(2)} w_{12}^{(2)} \right) \sigma'(z_2^{(1)}) = \delta_1^{(2)} w_{12}^{(2)}\, \mathbf{1}\{z_2^{(1)} > 0\}$

Step 6: Gradients for layer-1 weights w(1)

(27)
$\dfrac{\partial \ell}{\partial w_{10}^{(1)}} = \delta_1^{(1)}, \qquad \dfrac{\partial \ell}{\partial w_{11}^{(1)}} = \delta_1^{(1)} x_1, \qquad \dfrac{\partial \ell}{\partial w_{12}^{(1)}} = \delta_1^{(1)} x_2$
$\dfrac{\partial \ell}{\partial w_{20}^{(1)}} = \delta_2^{(1)}, \qquad \dfrac{\partial \ell}{\partial w_{21}^{(1)}} = \delta_2^{(1)} x_1, \qquad \dfrac{\partial \ell}{\partial w_{22}^{(1)}} = \delta_2^{(1)} x_2$
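The six steps above can be traced with a short script. The following is a minimal sketch of the forward pass (17)-(20) and backward pass (22)-(27) for one training instance; the weights, inputs, and target are arbitrary illustrative values.

```python
# A minimal sketch of the forward pass (17)-(20) and the backward pass
# (22)-(27) for the 3-unit network, on one training instance. The weights,
# inputs, and target below are arbitrary illustrative values.
import numpy as np

relu = lambda z: max(0.0, z)
relu_prime = lambda z: 1.0 if z > 0 else 0.0

# w1[i] = [bias, weight from x1, weight from x2] for hidden unit i+1.
w1 = np.array([[0.1, 0.5, -0.3],
               [-0.2, 0.4, 0.8]])
w2 = np.array([0.05, 1.2, -0.7])     # output unit: [bias, from a1, from a2]
x1, x2, y = 0.6, -0.4, 1.0

# Forward pass (17)-(20)
z1_1 = w1[0] @ np.array([1.0, x1, x2]);  a1_1 = relu(z1_1)
z2_1 = w1[1] @ np.array([1.0, x1, x2]);  a2_1 = relu(z2_1)
z1_2 = w2 @ np.array([1.0, a1_1, a2_1]); y_hat = relu(z1_2)
loss = (y_hat - y) ** 2

# Backward pass
delta1_2 = 2 * (y_hat - y) * relu_prime(z1_2)              # steps 1-2, eq (23)
grad_w2 = delta1_2 * np.array([1.0, a1_1, a2_1])           # step 3, eq (24)
delta1_1 = delta1_2 * w2[1] * relu_prime(z1_1)             # steps 4-5, eq (26)
delta2_1 = delta1_2 * w2[2] * relu_prime(z2_1)
grad_w1 = np.array([delta1_1 * np.array([1.0, x1, x2]),    # step 6, eq (27)
                    delta2_1 * np.array([1.0, x1, x2])])

print(loss, grad_w1, grad_w2)
```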

Deep Networks

Many applications today use neural networks that are thousands of units wide and hundreds of layers deep. We now show the general form of the feed-forward and back-propagation calculations for a depth-L fully connected ReLU network of width m at every layer.

Forward pass (any L, width m):

(28)   $a_0^{(0)} \equiv 1$ (and likewise $a_0^{(l)} \equiv 1$ in every layer, for the constant-1 bias input)
(29)   $a_j^{(0)} = x_j \;\; (j = 1, \ldots, m)$
(30)   $\sigma(u) = \max(0, u)$
(31)   $\sigma'(u) = \mathbf{1}\{u > 0\}$

For each layer $l = 1, \ldots, L$ and unit $i = 1, \ldots, m$:

(32)   $z_i^{(l)} = \sum_{j=0}^{m} w_{ij}^{(l)} a_j^{(l-1)}$
(33)   $a_i^{(l)} = \sigma(z_i^{(l)})$

Output and loss:

(34)   $\hat{y}_i \equiv a_i^{(L)}$
(35)   $\ell = \sum_{i=1}^{m} (a_i^{(L)} - y_i)^2 = \lVert a^{(L)} - y \rVert_2^2$

Backward pass (any L, width m):

Top layer errors:

(36)   $\delta_i^{(L)} = \dfrac{\partial \ell}{\partial z_i^{(L)}} = 2 (a_i^{(L)} - y_i)\, \sigma'(z_i^{(L)}), \qquad i = 1, \ldots, m.$

Hidden layers $l = L-1, \ldots, 1$:

(37)   $\delta_j^{(l)} = \dfrac{\partial \ell}{\partial z_j^{(l)}} = \left( \sum_{i=1}^{m} w_{ij}^{(l+1)} \delta_i^{(l+1)} \right) \sigma'(z_j^{(l)}), \qquad j = 1, \ldots, m.$

Gradients with respect to weights (including the bias weights j=0):

(38)   $\dfrac{\partial \ell}{\partial w_{i0}^{(l)}} = \delta_i^{(l)}$
(39)   $\dfrac{\partial \ell}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}, \qquad j = 1, \ldots, m, \;\; i = 1, \ldots, m, \;\; l = 1, \ldots, L$

We have been using $\ell$ to represent the loss on a particular training instance. We use $L_{\text{mean}}$ to represent the mean loss over the entire training set. The mean loss is more useful than the total loss because it does not depend on the size of the data set and its gradient is in the same direction as that of the total loss.

We make one small change in the regularization function by not including the weights on the constant 1 dummy inputs - that is, the bias weights. This is because the purpose of regularization is to push the network toward sparsity by having many zero weights on connections between neurons. Thus, we write:

(40)   $L_{\text{mean}}(W) = \dfrac{1}{|D|} \sum_{(x,y) \in D} \lVert a^{(L)}(x) - y \rVert_2^2 + \lambda \sum_{l=1}^{L} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( w_{ij}^{(l)} \right)^2$
(41)   $\dfrac{\partial L_{\text{mean}}}{\partial w_{i0}^{(l)}} = \dfrac{1}{|D|} \sum_{(x,y) \in D} \delta_i^{(l)}(x)$
(42)   $\dfrac{\partial L_{\text{mean}}}{\partial w_{ij}^{(l)}} = \dfrac{1}{|D|} \sum_{(x,y) \in D} \delta_i^{(l)}(x)\, a_j^{(l-1)}(x) + 2 \lambda w_{ij}^{(l)}$

The order of the partial derivatives in the gradient follows the order of weights in the vector W. So, where

(43)   $W = \left( w_{10}^{(1)}, w_{11}^{(1)}, \ldots, w_{1m}^{(1)}, w_{20}^{(1)}, w_{21}^{(1)}, \ldots, w_{2m}^{(1)}, \ldots, w_{m0}^{(1)}, \ldots, w_{mm}^{(1)}, w_{10}^{(2)}, \ldots, w_{mm}^{(L)} \right)$

the gradient is

(44)   $\nabla L(W) = \left( \dfrac{\partial L}{\partial w_{10}^{(1)}}, \dfrac{\partial L}{\partial w_{11}^{(1)}}, \ldots, \dfrac{\partial L}{\partial w_{1m}^{(1)}}, \dfrac{\partial L}{\partial w_{20}^{(1)}}, \ldots, \dfrac{\partial L}{\partial w_{mm}^{(L)}} \right)$

Although we represented the gradient as a single long vector, in machine learning it is common to organize it as a set of 2-dimensional matrices, one per layer, because the width of the network can vary by layer. Thus you will see expressions like the following, where each $G^{(l)}$ is the gradient of one layer of weights.

(45)   $G^{(l)} \equiv \dfrac{\partial L_{\text{mean}}}{\partial W^{(l)}} \in \mathbb{R}^{m \times (m+1)}, \qquad \left[ G^{(l)} \right]_{ij} = \dfrac{\partial L_{\text{mean}}}{\partial w_{ij}^{(l)}}$
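As a minimal sketch of the general recipe, the following code implements the forward pass (28)-(35), the backward pass (36)-(39), and the mean-loss gradients (40)-(42), with the gradients organized as the per-layer matrices of equation (45). The network size, the synthetic data, and the regularization constant are illustrative assumptions.

```python
# A minimal sketch of the forward pass (28)-(35), the backward pass (36)-(39),
# and the mean-loss gradients (40)-(42) for a fully connected ReLU network of
# depth L and width m, with the gradients organized as the per-layer matrices
# G^(l) of equation (45). Network size, data, and lambda are illustrative.
import numpy as np

rng = np.random.default_rng(2)
L, m, lam = 3, 4, 1e-3
# W[l] has shape (m, m+1): column 0 holds the bias weights w_i0.
W = [rng.normal(scale=0.5, size=(m, m + 1)) for _ in range(L)]

def forward(x, W):
    """Return per-layer pre-activations z and activations a (with a_0 = 1)."""
    a = [np.concatenate(([1.0], x))]                             # a^(0)
    zs = []
    for Wl in W:
        z = Wl @ a[-1]                                           # equation (32)
        zs.append(z)
        a.append(np.concatenate(([1.0], np.maximum(0.0, z))))    # equation (33)
    return zs, a

def backward(zs, a, y, W):
    """Return per-layer gradient matrices for one training instance."""
    deltas = [None] * len(W)
    deltas[-1] = 2 * (np.maximum(0.0, zs[-1]) - y) * (zs[-1] > 0)     # eq (36)
    for l in range(len(W) - 2, -1, -1):                               # eq (37)
        deltas[l] = (W[l + 1][:, 1:].T @ deltas[l + 1]) * (zs[l] > 0)
    # Equations (38)-(39): outer product of delta with the previous activations.
    return [np.outer(deltas[l], a[l]) for l in range(len(W))]

# Mean-loss gradient over a small synthetic dataset, equations (40)-(42).
D = [(rng.normal(size=m), rng.normal(size=m)) for _ in range(10)]
G = [np.zeros_like(Wl) for Wl in W]
for x, y in D:
    zs, a = forward(x, W)
    for Gl, gl in zip(G, backward(zs, a, y, W)):
        Gl += gl / len(D)
for Gl, Wl in zip(G, W):
    Gl[:, 1:] += 2 * lam * Wl[:, 1:]            # regularize all but bias weights
```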

Jacobian Notation

In the previous section, we used only basic calculus and linear algebra notation. We can more compactly represent the gradient by using an operator named the Jacobian that combines notions from calculus and linear algebra. For a vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ with components $f(x) = (f_1(x), \ldots, f_m(x))$, the Jacobian at x is the $m \times n$ matrix:

(46)   $J_f(x) \equiv \dfrac{\partial f(x)}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}$

In a neural network, the weights at layer l are naturally organized as a matrix W(l). If we collect all entries of all layer weight matrices into a single long vector, we get the stacked parameter vector

(47)   $W = \operatorname{vec}(W^{(1)}, W^{(2)}, \ldots, W^{(L)}) \in \mathbb{R}^P$

where vec() stacks all entries, and P is the total number of scalar weights in the network.

Let the dataset loss be $L : \mathbb{R}^P \to \mathbb{R}$. Its Jacobian with respect to the stacked weights W is a $1 \times P$ row matrix:

(48)   $J_L(W) = \dfrac{\partial L(W)}{\partial W} \;\in\; \mathbb{R}^{1 \times P}$

The gradient vector is then the transpose of that row:

(49)   $\nabla_W L(W) = J_L(W)^T \;\in\; \mathbb{R}^P$

We next want to calculate the mean loss over the dataset. Since $L(W) = \frac{1}{|D|} \sum_{t=1}^{|D|} \ell_t(W)$ and Jacobians add linearly, the gradient is simply the average of the per-instance gradients:

(50)   $\nabla_W L(W) = \dfrac{1}{|D|} \sum_{t=1}^{|D|} \nabla_W \ell_t(W)$

Now we come to the chain rule in Jacobian form. Let $\phi$ be the final scalar loss function that takes the network's output and compares it to the target label; for L2 loss, $\phi(\hat{y}, y) = \lVert \hat{y} - y \rVert_2^2$. For a composition $L = \phi \circ a^{(L)} \circ \cdots \circ a^{(1)}$ depending on W,

(51)   $J_L(W) = J_\phi(a^{(L)})\, J_{a^{(L)}}(z^{(L)}) \cdots J_{a^{(1)}}(z^{(1)})\, J_{z^{(1)}}(W)$
(52)   $\nabla_W L(W) = J_L(W)^T$
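Here is a small sketch of the chain rule in Jacobian form for a one-layer ReLU network with L2 loss: each stage's Jacobian is formed explicitly, their product gives the 1 × P row matrix of equation (51), and its transpose is the gradient of equation (52). The weights, input, and target are illustrative, and the finite-difference check at the end is only approximate.

```python
# A small sketch of the chain rule in Jacobian form, equations (48)-(52), for
# a one-layer ReLU network with squared-error loss. The weights, input, and
# target are illustrative; the finite-difference check is only approximate.
import numpy as np

rng = np.random.default_rng(3)
m, d = 3, 4
W = rng.normal(size=(m, d))
x, y = rng.normal(size=d), rng.normal(size=m)

def loss(W):
    a = np.maximum(0.0, W @ x)
    return np.sum((a - y) ** 2)

# Jacobians of each stage, evaluated at the current point.
z = W @ x
a = np.maximum(0.0, z)
J_phi = 2 * (a - y).reshape(1, m)              # d(phi)/d(a),     shape 1 x m
J_a = np.diag((z > 0).astype(float))           # d(a)/d(z),       shape m x m
J_z = np.kron(np.eye(m), x.reshape(1, d))      # d(z)/d(vec W),   shape m x (m*d)

J_L = J_phi @ J_a @ J_z                        # equation (51): 1 x P row matrix
grad = J_L.T.reshape(m, d)                     # equation (52): gradient = transpose

# Finite-difference check of one entry of the gradient.
eps, (i, j) = 1e-6, (1, 2)
W_plus = W.copy()
W_plus[i, j] += eps
print(grad[i, j], (loss(W_plus) - loss(W)) / eps)   # should nearly agree
```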

Other Kinds of Units and Layers

Several other kinds of units and layers are used in deep neural networks.

Softmax

Softmax takes a vector of pre-activations as input, computes their exponentials, and then normalizes them so they sum to one. The result has several interesting properties. First, if one of the inputs is much larger than the others, the corresponding output is near 1 and the others are near 0. Softmax is thus a smooth relative of the argmax operator. Second, the output can be viewed as a probability distribution. For example, if the network is being used to classify inputs and each of the inputs to the softmax is a score computed for a particular label, then the output can be interpreted as the probability that the corresponding label is the correct one.

Suppose the input to a softmax layer is $x \in \mathbb{R}^d$, and the weight matrix is

(53)   $W \in \mathbb{R}^{m \times d}$

The pre-activation vector is

(54)   $z = W x \in \mathbb{R}^m$

The softmax output is the vector $a = \sigma(z) \in \mathbb{R}^m$, with components

(55)   $a_i = \dfrac{e^{z_i}}{\sum_{k=1}^{m} e^{z_k}}, \qquad i = 1, \ldots, m.$

This ensures $a_i > 0$ for all i and $\sum_{i=1}^{m} a_i = 1$.

Now we consider the derivative of softmax. If i=j:

(56)   $\dfrac{\partial a_i}{\partial z_i} = a_i (1 - a_i)$

If ij:

(57)   $\dfrac{\partial a_i}{\partial z_j} = -a_i a_j.$
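Here is a minimal sketch of the softmax output (55) and its derivatives (56)-(57), assembled into a Jacobian matrix; the input vector is an illustrative value.

```python
# A minimal sketch of the softmax output (55) and its derivatives (56)-(57),
# assembled as a Jacobian matrix. The input vector z is an illustrative value.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / np.sum(e)

def softmax_jacobian(a):
    # [J]_ij = a_i (1 - a_i) when i == j and -a_i a_j when i != j,
    # which is diag(a) - a a^T.
    return np.diag(a) - np.outer(a, a)

z = np.array([2.0, 1.0, -1.0, 0.5])
a = softmax(z)
print(a, a.sum())                      # entries are positive and sum to 1
print(softmax_jacobian(a))
```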

Convolution and Weight Sharing

A key operation in applications of neural networks to computer vision is convolution. A 2-D convolution takes a small filter (also called a kernel) and slides it across an input grid, such as an image. At each position, the filter values are multiplied with the overlapping input values and summed together, producing a single number in the output grid. Repeating this process across all positions produces a new 2-D array where each entry reflects how strongly that local region of the input “matches” the filter. In short: A 2-D convolution is a way of scanning a small pattern across an image to detect features like edges, corners, or textures, turning local structure in the input into structured signals in the output.

As you read the previous paragraph, you will likely be puzzled by the idea of a filter "sliding" across an image. Neurons are simply wired from one to another in a fixed pattern (even if the weights change) and cannot slide connections from one neuron to another. Sliding is a metaphor for weight sharing, where we deliberately reuse the same parameter values across multiple connections instead of learning a separate weight for each one. In other words, different parts of the network are “tied together” so they use the same weights. Thus there are many instances of the filter implemented by different neurons across the 2-D image array, each separated by a few pixels (a quantity called the stride). All these instances, however, use the same values for their weights.

Let the input be an image

(58)   $x \in \mathbb{R}^{H \times W}$

and let the convolution kernel (shared weights) be

(59)   $F = (w_{i,j}) \in \mathbb{R}^{r \times s}$

The convolution output at spatial position (u,v) is

(60)   $z_{u,v} = \sum_{i=1}^{r} \sum_{j=1}^{s} w_{i,j}\, x_{u+i-1,\, v+j-1}.$

Here, the same kernel weights wi,j are reused at every location (u,v). This is weight sharing: the filter F detects the same pattern (for example, an edge) regardless of where it appears in the image.
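Here is a minimal sketch of the convolution of equation (60), written as an explicit double loop so that the weight sharing is visible; the example image and kernel values are illustrative.

```python
# A minimal sketch of the 2-D convolution of equation (60): the same r x s
# kernel (shared weights) is applied at every position of the input. The
# example image and kernel values are illustrative.
import numpy as np

def conv2d(x, F):
    H, W = x.shape
    r, s = F.shape
    z = np.zeros((H - r + 1, W - s + 1))
    for u in range(H - r + 1):
        for v in range(W - s + 1):
            # Multiply the kernel against the overlapping patch and sum.
            z[u, v] = np.sum(F * x[u:u + r, v:v + s])
    return z

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, -1.0],    # responds to left-to-right changes
                        [1.0, -1.0]])
print(conv2d(image, edge_filter))
```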

Pooling Layers

Convolution layers are often used in conjunction with pooling layers, also called downsampling layers. While a filter activates at the position in the image where it matches, a pooling layer determines whether a filter activates anywhere in a rectangular region of the image of a specified size. For example, you might have a high-level filter that detects cats, and you want to know if a cat appears anywhere in the image.

Let the input feature map be

(61)   $z \in \mathbb{R}^{H \times W}$

Choose a pooling window of size r×s and a stride t. The max pooling output is

(62)   $p \in \mathbb{R}^{\left( \frac{H - r}{t} + 1 \right) \times \left( \frac{W - s}{t} + 1 \right)}$

Each entry is defined as

(63)   $p_{u,v} = \max_{\substack{1 \le i \le r \\ 1 \le j \le s}} \; z_{t(u-1)+i,\; t(v-1)+j}$
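Here is a minimal sketch of max pooling as defined by equations (62) and (63); the sizes below assume the windows tile the input evenly, and the example values are illustrative.

```python
# A minimal sketch of max pooling, equations (62)-(63): the maximum over each
# r x s window, with the window moved by stride t. The sizes below assume the
# windows tile the input evenly; the example values are illustrative.
import numpy as np

def max_pool(z, r, s, t):
    H, W = z.shape
    out_h, out_w = (H - r) // t + 1, (W - s) // t + 1
    p = np.zeros((out_h, out_w))
    for u in range(out_h):
        for v in range(out_w):
            p[u, v] = np.max(z[t * u : t * u + r, t * v : t * v + s])
    return p

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(feature_map, r=2, s=2, t=2))    # 2 x 2 grid of window maxima
```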

Mini-Batches, Cross-Entropy, and Dropout

The three concepts in this section title are crucial for making it practical to train neural networks. The formalization of gradient descent presented above computed the gradient across the entire training set for each weight update. This would be too slow if there were thousands of training instances. The idea of mini-batches (or just batches) is to update the weights after a given number of instances are processed during an epoch. The batch size is often chosen to be around 128: small enough for efficient processing, but large enough to prevent the update from being swayed by a few anomalous data points.

The second crucial concept is using cross-entropy rather than the L2 norm to measure the loss. Suppose the network is trying to classify the input, that is, assign one of a fixed set of labels to it. Given a predicted probability distribution over the outputs (e.g. from a softmax layer)

(64)   $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m), \qquad \sum_{i=1}^{m} \hat{y}_i = 1, \;\; \hat{y}_i \ge 0,$

and a true one-hot encoded label vector

(65)   $y = (y_1, y_2, \ldots, y_m), \qquad y_c = 1 \text{ for the correct class } c \text{ and } 0 \text{ otherwise},$

the cross-entropy loss for a single example is defined as

(66)   $\ell(\hat{y}, y) \equiv -\sum_{i=1}^{m} y_i \log \hat{y}_i.$

For one-hot targets this simplifies to

(67)   $\ell(\hat{y}, y) = -\log \hat{y}_c,$

where c is the index of the correct class. In n-way classification problems, cross-entropy leads to faster convergence than L2 loss because it optimizes the log-probability of the true class directly and provides stronger corrective gradients when the model is wrong or uncertain.
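Here is a minimal sketch of equations (66) and (67) for a single example, with an illustrative predicted distribution and one-hot target.

```python
# A minimal sketch of cross-entropy for one example, equations (66)-(67),
# with an illustrative predicted distribution and one-hot target.
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])       # predicted distribution (sums to 1)
y = np.array([0.0, 1.0, 0.0])           # one-hot target: class c = 1 is correct

loss_general = -np.sum(y * np.log(y_hat))     # equation (66)
loss_one_hot = -np.log(y_hat[1])              # equation (67)
print(loss_general, loss_one_hot)             # both equal -log(0.7)
```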

The third concept is dropout. Dropout sets a random fraction of the activations to zero on each training forward pass. Dropout is a kind of regularization because it reduces overfitting: it forces neurons to learn more robust, distributed representations by randomly removing them during training, effectively ensembling many models into one.

As an example of these concepts in practice, the listing below shows a Python program for recognizing handwritten digits. The dataset, MNIST, was used in one of the first successful neural network programs, LeNet-5, created in 1998 by Yann LeCun. MNIST is still a standard starter baseline for work in machine learning and is included in PyTorch's torchvision package.
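What follows is a minimal sketch of such a program, using PyTorch with torchvision's MNIST dataset; the architecture, learning rate, batch size, dropout rate, and number of epochs are illustrative choices rather than the original LeNet-5.

```python
# A minimal sketch of an MNIST classifier using mini-batches, cross-entropy,
# and dropout. The architecture and hyperparameters are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# MNIST: 28x28 grayscale images of handwritten digits, labels 0-9.
transform = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)  # mini-batches
test_loader = DataLoader(test_set, batch_size=1000)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.5)       # dropout regularization

    def forward(self, x):
        x = x.view(x.size(0), -1)              # flatten each image
        x = F.relu(self.fc1(x))
        x = self.dropout(x)                    # active only in training mode
        return self.fc2(x)                     # raw scores for the 10 classes

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                # softmax + cross-entropy

for epoch in range(5):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()                        # backpropagation
        optimizer.step()                       # gradient descent update

    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    print(f"epoch {epoch}: test accuracy {correct / len(test_set):.4f}")
```

Note that nn.CrossEntropyLoss applies the softmax internally, so the network outputs raw scores, and that dropout is active only while the model is in training mode (model.train()) and is disabled during evaluation (model.eval()).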

What Else?

Understanding the concepts presented in this tutorial should give you a good foundation for reading papers and building applications with neural networks. The most important concepts not covered here that you should learn about are:

  1. Neural methods for natural language processing: how to convert words to real-number vectors in a manner that captures word meaning. Once so converted, a neural network architecture called the transformer can handle a wide range of language tasks - and might even be the basis for artificial general intelligence.

  2. Unsupervised learning: how to learn when there are no labels by discovering regularities in raw data.

  3. Reinforcement learning: learning how to act in a world on the basis of receiving rewards that are delayed in time from when the system must begin to act.

  4. Non-neural net methods of machine learning. For many practical problems, neural networks are overkill and require too much data. Important other methods are regression and its variants, decision trees, nearest-neighbor clustering, and what are called support vector machines (SVMs). Until deep neural networks became practical by using GPUs, SVMs were the most powerful form of machine learning - and their theory also involves much calculus and linear algebra.