
# 2.4. Calculus¶

Finding the area of a polygon had remained mysterious until at least
2,500 years ago, when ancient Greeks divided a polygon into triangles
and summed their areas. To find the area of curved shapes, such as a
circle, ancient Greeks inscribed polygons in such shapes. An inscribed
polygon with more sides of equal length better approximates the circle.
This process is also known as the *method of exhaustion*.

In fact, the method of exhaustion is where *integral calculus*
(described in `sec_integral_calculus`) originates from. More
than 2,000 years later, the other branch of calculus, *differential
calculus*, was invented. Among the most critical applications of
differential calculus, optimization problems consider how to do
something *the best*. As discussed in
Section 2.3.10.1, such problems are ubiquitous in
deep learning.

In deep learning, we *train* models, updating them successively so that
they get better and better as they see more and more data. Usually,
getting better means minimizing a *loss function*, a score that answers
the question “how *bad* is our model?” This question is more subtle than
it appears. Ultimately, what we really care about is producing a model
that performs well on data that we have never seen before. But we can
only fit the model to data that we can actually see. Thus we can
decompose the task of fitting models into two key concerns: i)
*optimization*: the process of fitting our models to observed data; ii)
*generalization*: the mathematical principles and practitioners’ wisdom
that guide us in producing models whose validity extends beyond the
exact set of data examples used to train them.

To help you understand optimization problems and methods in later chapters, here we give a very brief primer on differential calculus that is commonly used in deep learning.

## 2.4.1. Derivatives and Differentiation¶

We begin by addressing the calculation of derivatives, a crucial step in
nearly all deep learning optimization algorithms. In deep learning, we
typically choose loss functions that are differentiable with respect to
our model’s parameters. Put simply, this means that for each parameter,
we can determine how rapidly the loss would increase or decrease, were
we to *increase* or *decrease* that parameter by an infinitesimally
small amount.

Suppose that we have a function
\(f: \mathbb{R} \rightarrow \mathbb{R}\), whose input and output are
both scalars. The *derivative* of \(f\) is defined as

\[f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h} \tag{2.4.1}\]

if this limit exists. If \(f'(a)\) exists, \(f\) is said to be
*differentiable* at \(a\). If \(f\) is differentiable at every
number of an interval, then it is differentiable on that
interval. We can interpret the derivative \(f'(x)\) in
(2.4.1) as the *instantaneous* rate of change of
\(f(x)\) with respect to \(x\). The so-called instantaneous rate
of change is based on the variation \(h\) in \(x\), which
approaches \(0\).

To illustrate derivatives, let us experiment with an example. Define \(u = f(x) = 3x^2-4x\).

*Note: We will be using `Double` in this section to avoid incorrect
results, since `Double` provides more decimal precision. Generally though,
we would use `Float`, as deep learning frameworks use `Float` by default.*

```
// %mavenRepo snapshots https://oss.sonatype.org/content/repositories/snapshots/
%maven ai.djl:api:0.9.0
%maven org.slf4j:slf4j-api:1.7.26
%maven org.slf4j:slf4j-simple:1.7.26
%maven ai.djl.mxnet:mxnet-engine:0.9.0
%maven ai.djl.mxnet:mxnet-native-auto:1.7.0-backport
```

```
%load ../utils/plot-utils
%load ../utils/Functions.java
```

```
import ai.djl.ndarray.*;
import tech.tablesaw.plotly.traces.*;
import tech.tablesaw.plotly.components.*;
import ai.djl.ndarray.types.DataType;
```

```
NDManager manager = NDManager.newBaseManager();
```

```
Function<Double, Double> f = x -> 3 * Math.pow(x, 2) - 4 * x;
```

By setting \(x=1\) and letting \(h\) approach \(0\), the numerical result of \(\frac{f(x+h) - f(x)}{h}\) in (2.4.1) approaches \(2\). Though this experiment is not a mathematical proof, we will see later that the derivative \(u'\) is \(2\) when \(x=1\).

```
public Double numericalLim(Function<Double, Double> f, double x, double h) {
    return (f.apply(x + h) - f.apply(x)) / h;
}

double h = 0.1;
for (int i = 0; i < 5; i++) {
    System.out.println("h=" + String.format("%.5f", h) + ", numerical limit="
        + String.format("%.5f", numericalLim(f, 1, h)));
    h *= 0.1;
}
```

```
h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003
```
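Incidentally, the earlier note about `Double` versus `Float` matters precisely in computations like this one: the difference \(f(x+h) - f(x)\) is tiny, so single precision loses most of its significant digits. Below is a standalone sketch of that effect (plain Java, separate from the notebook cells; the class and method names are ours):

```java
// Repeat the smallest-h step of the experiment above in both double
// and float precision, to illustrate why this section uses Double.
public class PrecisionCheck {
    static double fDouble(double x) { return 3 * x * x - 4 * x; }
    static float fFloat(float x)    { return 3 * x * x - 4 * x; }

    public static double limDouble(double x, double h) {
        return (fDouble(x + h) - fDouble(x)) / h;
    }

    public static float limFloat(float x, float h) {
        return (fFloat(x + h) - fFloat(x)) / h;
    }

    public static void main(String[] args) {
        // Double precision recovers 2.00003 to many digits...
        System.out.println(limDouble(1.0, 1e-5));
        // ...while float keeps only ~7 significant digits, so the
        // small difference f(x+h) - f(x) loses most of its accuracy.
        System.out.println(limFloat(1.0f, 1e-5f));
    }
}
```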

Let us familiarize ourselves with a few equivalent notations for derivatives. Given \(y = f(x)\), where \(x\) and \(y\) are the independent variable and the dependent variable of the function \(f\), respectively, the following expressions are equivalent:

\[f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),\]

where symbols \(\frac{d}{dx}\) and \(D\) are *differentiation
operators* that indicate the operation of *differentiation*. We can use the
following rules to differentiate common functions:

- \(DC = 0\) (\(C\) is a constant),
- \(Dx^n = nx^{n-1}\) (the *power rule*, \(n\) is any real number),
- \(De^x = e^x\),
- \(D\ln(x) = 1/x.\)

To differentiate a function that is formed from a few simpler functions
such as the above common functions, the following rules can be handy for
us. Suppose that functions \(f\) and \(g\) are both
differentiable and \(C\) is a constant. We have the *constant
multiple rule*

\[\frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x),\]

the *sum rule*

\[\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x),\]

the *product rule*

\[\frac{d}{dx} [f(x)g(x)] = f(x) \frac{d}{dx} [g(x)] + g(x) \frac{d}{dx} [f(x)],\]

and the *quotient rule*

\[\frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{g(x) \frac{d}{dx} [f(x)] - f(x) \frac{d}{dx} [g(x)]}{[g(x)]^2}.\]

Now we can apply a few of the above rules to find \(u' = f'(x) = 3 \frac{d}{dx} x^2-4\frac{d}{dx}x = 6x-4\). Thus, by setting \(x = 1\), we have \(u' = 2\): this is supported by our earlier experiment in this section where the numerical result approaches \(2\). This derivative is also the slope of the tangent line to the curve \(u = f(x)\) when \(x = 1\).
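These closed forms are easy to sanity-check numerically. The sketch below compares a central-difference estimate (a slightly more accurate variant of the `numericalLim` idea above) against \(u'(1) = 2\) and against two of the common rules; it is plain Java, independent of the notebook cells, and the class and helper names are ours:

```java
import java.util.function.DoubleUnaryOperator;

public class DerivativeCheck {
    // Central difference (f(x+h) - f(x-h)) / (2h): errors cancel to O(h^2)
    public static double derivative(DoubleUnaryOperator f, double x) {
        double h = 1e-6;
        return (f.applyAsDouble(x + h) - f.applyAsDouble(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        // u' = 6x - 4, so u'(1) = 2
        System.out.println(derivative(x -> 3 * x * x - 4 * x, 1.0));
        // De^x = e^x: at x = 1.5 both sides are e^{1.5}
        System.out.println(derivative(Math::exp, 1.5));
        // D ln(x) = 1/x: at x = 1.5 both sides are 1/1.5
        System.out.println(derivative(Math::log, 1.5));
    }
}
```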

To visualize such an interpretation of derivatives, we will use
`plotly`, a popular plotting library. `Tablesaw` has implemented a
Java wrapper for `plotly`, so we will be using that framework. To
configure the properties and plot the figures produced by `plotly`, we
will define one function.

We define `plotLineAndSegment`, which takes as input three arrays.
The first array holds the data for the x axis, and the next two arrays
contain the values of the two functions that we want to plot on the y axis. In
addition to this data, the function requires us to specify the names of
the two lines we will be plotting, the labels of both axes, and the width
and height of the figure. This function, or a modified version of it,
will allow us to plot multiple curves succinctly, since we will need to
visualize many curves throughout the book.

```
public Figure plotLineAndSegment(double[] x, double[] y, double[] segment,
                                 String trace1Name, String trace2Name,
                                 String xLabel, String yLabel,
                                 int width, int height) {
    ScatterTrace trace = ScatterTrace.builder(x, y)
        .mode(ScatterTrace.Mode.LINE)
        .name(trace1Name)
        .build();

    ScatterTrace trace2 = ScatterTrace.builder(x, segment)
        .mode(ScatterTrace.Mode.LINE)
        .name(trace2Name)
        .build();

    Layout layout = Layout.builder()
        .height(height)
        .width(width)
        .showLegend(true)
        .xAxis(Axis.builder().title(xLabel).build())
        .yAxis(Axis.builder().title(yLabel).build())
        .build();

    return new Figure(layout, trace, trace2);
}
```

Now we can plot the function \(u = f(x)\) and its tangent line \(y = 2x - 3\) at \(x=1\), where the coefficient \(2\) is the slope of the tangent line.

```
NDArray X = manager.arange(0f, 3f, 0.1f, DataType.FLOAT64);
double[] x = X.toDoubleArray();

double[] fx = new double[x.length];
for (int i = 0; i < x.length; i++) {
    fx[i] = f.apply(x[i]);
}

double[] fg = new double[x.length];
for (int i = 0; i < x.length; i++) {
    fg[i] = 2 * x[i] - 3;
}

plotLineAndSegment(x, fx, fg, "f(x)", "Tangent line (x=1)", "x", "f(x)", 700, 500)
```
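Where does the intercept \(-3\) come from? At \(x = a\) the tangent is \(y = f(a) + f'(a)(x - a)\); with \(a = 1\), \(f(1) = -1\) and \(f'(1) = 2\), this rearranges to \(y = 2x - 3\). A small plain-Java sketch of that computation (the class and helper names are ours):

```java
import java.util.function.DoubleUnaryOperator;

public class Tangent {
    // Returns {slope, intercept} of the tangent to f at x = a,
    // given f and its derivative fPrime
    public static double[] tangentAt(DoubleUnaryOperator f,
                                     DoubleUnaryOperator fPrime, double a) {
        double slope = fPrime.applyAsDouble(a);
        // y = f(a) + slope * (x - a)  =>  intercept = f(a) - slope * a
        double intercept = f.applyAsDouble(a) - slope * a;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        double[] line = tangentAt(x -> 3 * x * x - 4 * x, x -> 6 * x - 4, 1.0);
        // slope = 2, intercept = -3, i.e. y = 2x - 3
        System.out.println("y = " + line[0] + "x + (" + line[1] + ")");
    }
}
```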

## 2.4.2. Partial Derivatives¶

So far we have dealt with the differentiation of functions of just one
variable. In deep learning, functions often depend on *many* variables.
Thus, we need to extend the ideas of differentiation to these
*multivariate* functions.

Let \(y = f(x_1, x_2, \ldots, x_n)\) be a function with \(n\)
variables. The *partial derivative* of \(y\) with respect to its
\(i^\mathrm{th}\) parameter \(x_i\) is

\[\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.\]

To calculate \(\frac{\partial y}{\partial x_i}\), we can simply treat \(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\) as constants and calculate the derivative of \(y\) with respect to \(x_i\). For notation of partial derivatives, the following are equivalent:

\[\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f.\]
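The "hold the other variables fixed" recipe can be mechanized with finite differences. The sketch below (plain Java; the function \(f(x_1, x_2) = x_1^2 x_2\) and the helper names are ours, chosen for illustration) estimates both partials, whose exact values are \(\partial f/\partial x_1 = 2x_1 x_2\) and \(\partial f/\partial x_2 = x_1^2\):

```java
import java.util.function.DoubleBinaryOperator;

public class PartialCheck {
    // Perturb x1 only, treating x2 as a constant
    public static double partialX1(DoubleBinaryOperator f, double x1, double x2) {
        double h = 1e-6;
        return (f.applyAsDouble(x1 + h, x2) - f.applyAsDouble(x1 - h, x2)) / (2 * h);
    }

    // Perturb x2 only, treating x1 as a constant
    public static double partialX2(DoubleBinaryOperator f, double x1, double x2) {
        double h = 1e-6;
        return (f.applyAsDouble(x1, x2 + h) - f.applyAsDouble(x1, x2 - h)) / (2 * h);
    }

    public static void main(String[] args) {
        DoubleBinaryOperator f = (x1, x2) -> x1 * x1 * x2;
        // Analytically at (2, 3): df/dx1 = 2*2*3 = 12, df/dx2 = 2^2 = 4
        System.out.println(partialX1(f, 2, 3));
        System.out.println(partialX2(f, 2, 3));
    }
}
```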

## 2.4.3. Gradients¶

We can concatenate partial derivatives of a multivariate function with
respect to all its variables to obtain the *gradient* vector of the
function. Suppose that the input of function
\(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is an
\(n\)-dimensional vector
\(\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top\) and the output is a
scalar. The gradient of the function \(f(\mathbf{x})\) with respect
to \(\mathbf{x}\) is a vector of \(n\) partial derivatives:

\[\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top,\]

where \(\nabla_{\mathbf{x}} f(\mathbf{x})\) is often replaced by \(\nabla f(\mathbf{x})\) when there is no ambiguity.

Let \(\mathbf{x}\) be an \(n\)-dimensional vector. The following rules are often used when differentiating multivariate functions:

- For all \(\mathbf{A} \in \mathbb{R}^{m \times n}\), \(\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top\),
- For all \(\mathbf{A} \in \mathbb{R}^{n \times m}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}\),
- For all \(\mathbf{A} \in \mathbb{R}^{n \times n}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}\),
- \(\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}\).

Similarly, for any matrix \(\mathbf{X}\), we have \(\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}\). As we will see later, gradients are useful for designing optimization algorithms in deep learning.
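As a quick check of the identity \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}\), we can estimate the gradient of \(\|\mathbf{x}\|^2\) coordinate by coordinate with finite differences and compare it against \(2\mathbf{x}\) (a plain-Java sketch, not part of the book's utilities):

```java
public class GradCheck {
    // ||x||^2 = x^T x
    public static double squaredNorm(double[] x) {
        double s = 0;
        for (double v : x) s += v * v;
        return s;
    }

    // Numerical gradient: perturb each coordinate in turn
    public static double[] numericalGradient(double[] x) {
        double h = 1e-6;
        double[] grad = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double[] plus = x.clone(), minus = x.clone();
            plus[i] += h;
            minus[i] -= h;
            grad[i] = (squaredNorm(plus) - squaredNorm(minus)) / (2 * h);
        }
        return grad;
    }

    public static void main(String[] args) {
        double[] x = {1.0, -2.0, 3.0};
        double[] grad = numericalGradient(x);
        for (int i = 0; i < x.length; i++) {
            // each component should match 2 * x[i]
            System.out.printf("%.4f vs %.4f%n", grad[i], 2 * x[i]);
        }
    }
}
```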

## 2.4.4. Chain Rule¶

However, such gradients can be hard to find. This is because
multivariate functions in deep learning are often *composite*, so we may
not apply any of the aforementioned rules to differentiate these
functions. Fortunately, the *chain rule* enables us to differentiate
composite functions.

Let us first consider functions of a single variable. Suppose that functions \(y=f(u)\) and \(u=g(x)\) are both differentiable, then the chain rule states that

\[\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.\]

Now let us turn our attention to a more general scenario where functions have an arbitrary number of variables. Suppose that the differentiable function \(y\) has variables \(u_1, u_2, \ldots, u_m\), where each differentiable function \(u_i\) has variables \(x_1, x_2, \ldots, x_n\). Note that \(y\) is a function of \(x_1, x_2, \ldots, x_n\). Then the chain rule gives

\[\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}\]

for any \(i = 1, 2, \ldots, n\).
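For instance, taking \(y = \sin(u)\) and \(u = x^2\), the single-variable chain rule gives \(\frac{dy}{dx} = \cos(x^2) \cdot 2x\). The sketch below (plain Java, chosen for illustration) confirms this against a direct numerical derivative of the composite function:

```java
import java.util.function.DoubleUnaryOperator;

public class ChainCheck {
    // Central-difference approximation of f'(x)
    public static double derivative(DoubleUnaryOperator f, double x) {
        double h = 1e-6;
        return (f.applyAsDouble(x + h) - f.applyAsDouble(x - h)) / (2 * h);
    }

    public static void main(String[] args) {
        DoubleUnaryOperator composite = x -> Math.sin(x * x);  // y as a function of x
        double x = 0.8;
        double byChainRule = Math.cos(x * x) * 2 * x;          // dy/du * du/dx
        double numeric = derivative(composite, x);
        System.out.println(byChainRule + " vs " + numeric);
    }
}
```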

## 2.4.5. Summary¶

- Differential calculus and integral calculus are two branches of calculus, where the former can be applied to the ubiquitous optimization problems in deep learning.
- A derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable. It is also the slope of the tangent line to the curve of the function.
- A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all its variables.
- The chain rule enables us to differentiate composite functions.

## 2.4.6. Exercises¶

1. Plot the function \(y = f(x) = x^3 - \frac{1}{x}\) and its tangent line when \(x = 1\).
2. Find the gradient of the function \(f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}\).
3. What is the gradient of the function \(f(\mathbf{x}) = \|\mathbf{x}\|_2\)?
4. Can you write out the chain rule for the case where \(u = f(x, y, z)\) and \(x = x(a, b)\), \(y = y(a, b)\), and \(z = z(a, b)\)?