
6.2. Convolutions for Images

Now that we understand how convolutional layers work in theory, we are ready to see how they work in practice. Building on our motivation of convolutional neural networks as efficient architectures for exploring structure in image data, we stick with images as our running example.

6.2.1. The Cross-Correlation Operator

Recall that, strictly speaking, convolutional layers are a (slight) misnomer, since the operations they express are more accurately described as cross-correlations. In a convolutional layer, an input array and a correlation kernel array are combined to produce an output array through a cross-correlation operation. Let’s ignore channels for now and see how this works with two-dimensional data and hidden representations. In Fig. 6.2.1, the input is a two-dimensional array with a height of 3 and a width of 3. We write the shape of the array as \(3 \times 3\) or (\(3\), \(3\)). The height and width of the kernel are both \(2\). Note that in the deep learning research community, this object may be referred to as a convolutional kernel, a filter, or simply the layer’s weights. The shape of the kernel window is given by the height and width of the kernel (here it is \(2 \times 2\)).

Fig. 6.2.1 Two-dimensional cross-correlation operation. The shaded portions are the first output element and the input and kernel array elements used in its computation: \(0\times0+1\times1+3\times2+4\times3=19\).

In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the top-left corner of the input array and slide it across the input array, both from left to right and top to bottom. When the convolution window slides to a certain position, the input subarray contained in that window and the kernel array are multiplied (elementwise) and the resulting array is summed up yielding a single scalar value. This result gives the value of the output array at the corresponding location. Here, the output array has a height of 2 and width of 2 and the four elements are derived from the two-dimensional cross-correlation operation:

(6.2.1)\[\begin{split}0\times0+1\times1+3\times2+4\times3=19,\\ 1\times0+2\times1+4\times2+5\times3=25,\\ 3\times0+4\times1+6\times2+7\times3=37,\\ 4\times0+5\times1+7\times2+8\times3=43.\end{split}\]

Note that along each axis, the output is slightly smaller than the input. Because the kernel has width and height greater than one, we can only properly compute the cross-correlation for locations where the kernel fits wholly within the image. The output size is therefore determined by the input size \(H \times W\) and the size of the convolutional kernel \(h \times w\) as \((H-h+1) \times (W-w+1)\). This is the case since we need enough space to ‘shift’ the convolutional kernel across the image (later we will see how to keep the size unchanged by padding the image with zeros around its boundary such that there is enough space to shift the kernel). Next, we implement this process in the corr2d function, which accepts the input array X and kernel array K and returns the output array Y.

But first we will import the relevant libraries.

%load ../utils/djl-imports
public NDArray corr2d(NDArray X, NDArray K){
    // Compute 2D cross-correlation.
    int h = (int) K.getShape().get(0);
    int w = (int) K.getShape().get(1);

    // The output has shape (H - h + 1, W - w + 1).
    // manager is the notebook-level NDManager (NDManager.newBaseManager()).
    NDArray Y = manager.zeros(new Shape(X.getShape().get(0) - h + 1, X.getShape().get(1) - w + 1));

    for(int i=0; i < Y.getShape().get(0); i++){
        for(int j=0; j < Y.getShape().get(1); j++){
            // Multiply the h x w window at (i, j) elementwise with K and sum the result
            Y.set(new NDIndex(i + "," + j), X.get(i + ":" + (i+h) + "," + j + ":" + (j+w)).mul(K).sum());
        }
    }

    return Y;
}

We can construct the input array X and the kernel array K from the figure above to validate the output of the above implementation of the two-dimensional cross-correlation operation.

NDManager manager = NDManager.newBaseManager();
NDArray X = manager.create(new float[]{0,1,2,3,4,5,6,7,8}, new Shape(3,3));
NDArray K = manager.create(new float[]{0,1,2,3}, new Shape(2,2));
System.out.println(corr2d(X, K));
ND: (2, 2) gpu(0) float32
[[19., 25.],
 [37., 43.],
]
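
As a cross-check that does not depend on DJL, the same computation can be sketched with plain Java arrays. This is only an illustrative sketch: the class name Corr2dPlain and its method are hypothetical and not part of the notebook's utilities. It reproduces the four values of (6.2.1) and the output shape \((H-h+1) \times (W-w+1)\).

```java
// Plain-Java sketch (no DJL). Corr2dPlain is a hypothetical name used for illustration.
class Corr2dPlain {

    // Two-dimensional cross-correlation of input x with kernel k,
    // evaluated only where the kernel fits wholly inside the input.
    static float[][] corr2d(float[][] x, float[][] k) {
        int h = k.length;
        int w = k[0].length;
        // Output shape is (H - h + 1) x (W - w + 1).
        float[][] y = new float[x.length - h + 1][x[0].length - w + 1];
        for (int i = 0; i < y.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                float sum = 0f;
                // Multiply the window starting at (i, j) elementwise with k and sum up.
                for (int a = 0; a < h; a++) {
                    for (int b = 0; b < w; b++) {
                        sum += x[i + a][j + b] * k[a][b];
                    }
                }
                y[i][j] = sum;
            }
        }
        return y;
    }

    public static void main(String[] args) {
        float[][] x = {{0, 1, 2}, {3, 4, 5}, {6, 7, 8}};
        float[][] k = {{0, 1}, {2, 3}};
        // Prints [[19.0, 25.0], [37.0, 43.0]]
        System.out.println(java.util.Arrays.deepToString(corr2d(x, k)));
    }
}
```

The explicit inner loops play the role of the slice, mul, and sum calls in the DJL version, which makes the sliding-window structure easy to see at the cost of speed.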

6.2.2. Convolutional Layers

A convolutional layer cross-correlates the input and kernels and adds a scalar bias to produce an output. The two parameters of the convolutional layer are the kernel and the scalar bias. When training models based on convolutional layers, we typically initialize the kernels randomly, just as we would with a fully connected layer.

We are now ready to implement a two-dimensional convolutional layer based on the corr2d function defined above. In the ConvolutionalLayer constructor, we declare the weight w and the bias b as the two parameters of the class. The forward computation function forward calls the corr2d function and adds the bias. As with \(h \times w\) cross-correlation, we also refer to convolutional layers as \(h \times w\) convolutions.

public class ConvolutionalLayer{

    private NDArray w;
    private NDArray b;

    public NDArray getW(){
        return w;
    }

    public NDArray getB(){
        return b;
    }

    public ConvolutionalLayer(Shape shape){
        NDManager manager = NDManager.newBaseManager();
        // Initialize the kernel randomly, as described above
        w = manager.randomNormal(shape);
        b = manager.randomNormal(new Shape(1));
    }

    public NDArray forward(NDArray X){
        // Cross-correlate the input with the kernel, then add the scalar bias
        return corr2d(X, w).add(b);
    }

}

6.2.3. Object Edge Detection in Images

Let’s take a moment to parse a simple application of a convolutional layer: detecting the edge of an object in an image by finding the location of the pixel change. First, we construct an ‘image’ of \(6\times 8\) pixels. The middle four columns are black (0) and the rest are white (1).

X = manager.ones(new Shape(6,8));
X.set(new NDIndex(":" + "," + 2 + ":" + 6), 0f);
ND: (6, 8) gpu(0) float32
[[1., 1., 0., 0., 0., 0., 1., 1.],
 [1., 1., 0., 0., 0., 0., 1., 1.],
 [1., 1., 0., 0., 0., 0., 1., 1.],
 [1., 1., 0., 0., 0., 0., 1., 1.],
 [1., 1., 0., 0., 0., 0., 1., 1.],
 [1., 1., 0., 0., 0., 0., 1., 1.],
]

Next, we construct a kernel K with a height of \(1\) and width of \(2\). When we perform the cross-correlation operation with the input, if the horizontally adjacent elements are the same, the output is 0. Otherwise, the output is non-zero.

K = manager.create(new float[]{1, -1}, new Shape(1,2));

We are ready to perform the cross-correlation operation with arguments X (our input) and K (our kernel). As you can see, we detect 1 for the edge from white to black and -1 for the edge from black to white. All other outputs take value \(0\).

NDArray Y = corr2d(X, K);
ND: (6, 7) gpu(0) float32
[[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
 [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
 [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
 [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
 [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
 [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
]

We can now apply the kernel to the transposed image. As expected, the output vanishes: the kernel K detects only vertical edges, and the transposed image contains only horizontal ones.

corr2d(X.transpose(), K);
ND: (8, 5) gpu(0) float32
[[0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0.],
]
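
To make the directional behavior concrete, here is a small plain-Java sketch (no DJL; the class EdgeDirectionSketch and its helper methods are hypothetical names). It applies the \(1 \times 2\) kernel to the striped image and to its transpose, and then applies the transposed \(2 \times 1\) kernel to the transposed image, which should recover the edges.

```java
// Plain-Java sketch (no DJL). All names here are hypothetical and for illustration only.
class EdgeDirectionSketch {

    // Two-dimensional cross-correlation on plain arrays.
    static float[][] corr2d(float[][] x, float[][] k) {
        int h = k.length;
        int w = k[0].length;
        float[][] y = new float[x.length - h + 1][x[0].length - w + 1];
        for (int i = 0; i < y.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                float sum = 0f;
                for (int a = 0; a < h; a++) {
                    for (int b = 0; b < w; b++) {
                        sum += x[i + a][j + b] * k[a][b];
                    }
                }
                y[i][j] = sum;
            }
        }
        return y;
    }

    // Swap rows and columns.
    static float[][] transpose(float[][] m) {
        float[][] t = new float[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // 6 x 8 image: white (1) everywhere except black (0) in columns 2 to 5.
    static float[][] stripeImage() {
        float[][] x = new float[6][8];
        for (float[] row : x) {
            java.util.Arrays.fill(row, 1f);
            for (int j = 2; j < 6; j++) {
                row[j] = 0f;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        float[][] x = stripeImage();
        float[][] k = {{1f, -1f}};
        // First row of the edge map: [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
        System.out.println(java.util.Arrays.toString(corr2d(x, k)[0]));
        // Same kernel on the transposed image: every entry is zero.
        System.out.println(java.util.Arrays.deepToString(corr2d(transpose(x), k)));
        // Transposed kernel on the transposed image: the edges reappear along the rows.
        System.out.println(java.util.Arrays.deepToString(corr2d(transpose(x), transpose(k))));
    }
}
```

The last call returns a \(7 \times 6\) array whose rows 1 and 5 hold 1 and -1 respectively, mirroring the columns of the original edge map.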

6.2.4. Learning a Kernel

Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what we are looking for. However, as we look at larger kernels and consider successive layers of convolutions, it might be impossible to manually specify precisely what each filter should be doing.

Now let us see whether we can learn the kernel that generated Y from X by looking at the (input, output) pairs only. We first construct a convolutional layer and initialize its kernel as a random array. Next, in each iteration, we will use the squared error to compare Y to the output of the convolutional layer. We can then calculate the gradient to update the weight. For the sake of simplicity, in this convolutional layer, we will ignore the bias.

This time, we will use the built-in Block and Conv2d classes from DJL.

X = X.reshape(1,1,6,8);
Y = Y.reshape(1,1,6,7);

Loss l2Loss = Loss.l2Loss();
// Construct a two-dimensional convolutional layer with 1 output channel and a
// kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
Block block = Conv2d.builder()
                .setKernelShape(new Shape(1, 2))
                .optBias(false)
                .setFilters(1)
                .build();

block.setInitializer(new NormalInitializer(), Parameter.Type.WEIGHT);
block.initialize(manager, DataType.FLOAT32, X.getShape());

// The two-dimensional convolutional layer uses four-dimensional input and
// output in the format of (example, channel, height, width), where the batch
// size (number of examples in the batch) and the number of channels are both 1

ParameterList params = block.getParameters();
NDArray wParam = params.get(0).getValue().getArray();

ParameterStore parameterStore = new ParameterStore(manager, false);

NDArray lossVal = null;

for (int i = 0; i < 10; i++) {
    // Make sure a gradient buffer is attached to the kernel
    wParam.setRequiresGradient(true);

    try (GradientCollector gc = Engine.getInstance().newGradientCollector()) {
        NDArray yHat = block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
        NDArray l = l2Loss.evaluate(new NDList(Y), new NDList(yHat));
        lossVal = l;
        // Backpropagate to obtain the gradient of the loss with respect to the kernel
        gc.backward(l);
    }
    // Update the kernel with one gradient-descent step
    // (the step size 0.40f is an assumed learning rate)
    wParam.subi(wParam.getGradient().mul(0.40f));

    if((i+1)%2 == 0){
        System.out.println("batch " + (i+1) + " loss: " + lossVal.sum().getFloat());
    }
}
batch 2 loss: 0.12571818
batch 4 loss: 0.09935227
batch 6 loss: 0.07851635
batch 8 loss: 0.062050212
batch 10 loss: 0.049037326

Note that the error has dropped to a small value after 10 iterations. Now we will take a look at the kernel array we learned.

// wParam refers to the layer's kernel, which the training loop updates in place
System.out.println(wParam);
weight: (1, 1, 1, 2) gpu(0) float32 hasGradient
[[[[ 0.4475, -0.4477],
  ],
 ],
]

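Indeed, the learned kernel array is moving toward the kernel array K = [1, -1] we defined earlier, although after only 10 iterations its entries (about 0.45 and -0.45) are still some way from the target.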
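
The learning procedure itself can also be sketched without DJL. The following plain-Java example is a minimal illustration under stated assumptions: it uses the mean squared error, a hand-derived gradient, a fixed starting point instead of a random one, and a step size of 0.5 with 100 iterations, none of which are taken from the DJL code above. The class name KernelFitSketch and its methods are hypothetical.

```java
// Plain-Java sketch (no DJL). KernelFitSketch and its methods are hypothetical names.
class KernelFitSketch {

    // 6 x 8 image: white (1) everywhere except black (0) in columns 2 to 5.
    static double[][] stripeImage() {
        double[][] x = new double[6][8];
        for (double[] row : x) {
            java.util.Arrays.fill(row, 1.0);
            for (int j = 2; j < 6; j++) {
                row[j] = 0.0;
            }
        }
        return x;
    }

    // Target output: cross-correlation of x with the kernel [1, -1].
    static double[][] edges(double[][] x) {
        double[][] y = new double[x.length][x[0].length - 1];
        for (int i = 0; i < y.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                y[i][j] = x[i][j] - x[i][j + 1];
            }
        }
        return y;
    }

    // Fit a 1 x 2 kernel w by gradient descent on the mean squared error
    // between the cross-correlation of x with w and the target y.
    static double[] fit(double[][] x, double[][] y, double lr, int steps) {
        double[] w = {0.1, 0.2}; // arbitrary fixed start (assumption)
        int n = y.length * y[0].length;
        for (int s = 0; s < steps; s++) {
            double g0 = 0.0;
            double g1 = 0.0;
            for (int i = 0; i < y.length; i++) {
                for (int j = 0; j < y[0].length; j++) {
                    double err = w[0] * x[i][j] + w[1] * x[i][j + 1] - y[i][j];
                    // Derivative of err^2 / n with respect to each kernel entry.
                    g0 += 2.0 * err * x[i][j] / n;
                    g1 += 2.0 * err * x[i][j + 1] / n;
                }
            }
            w[0] -= lr * g0;
            w[1] -= lr * g1;
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] x = stripeImage();
        double[][] y = edges(x);
        // The result should be close to [1, -1].
        System.out.println(java.util.Arrays.toString(fit(x, y, 0.5, 100)));
    }
}
```

With more iterations than the 10 used above, the learned entries approach 1 and -1, which suggests that the gap seen in the DJL run reflects the short training schedule rather than a limit of the method.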

6.2.5. Cross-Correlation and Convolution

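Recall our observation from the previous section of the correspondence between the cross-correlation and convolution operators. The figure above makes this correspondence apparent: to perform a strict convolution, simply flip the kernel both horizontally and vertically (so that its bottom-right element moves to the top left) and then apply cross-correlation. The order of indexing in the sum is reversed, yet the operation keeps the same form. Moreover, since kernels are learned from data, a layer that performed strict convolution would simply learn the flipped kernel and produce the same output. In keeping with standard terminology in the deep learning literature, we will continue to refer to the cross-correlation operation as a convolution even though, strictly speaking, it is slightly different.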
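
The relationship between the two operators can be illustrated with a short plain-Java sketch (no DJL; ConvolutionFlipSketch and its methods are hypothetical names). Here a strict convolution is computed as a cross-correlation with the kernel flipped along both axes, using the input and kernel from Fig. 6.2.1.

```java
// Plain-Java sketch (no DJL). ConvolutionFlipSketch is a hypothetical name.
class ConvolutionFlipSketch {

    // Two-dimensional cross-correlation on plain arrays.
    static float[][] corr2d(float[][] x, float[][] k) {
        int h = k.length;
        int w = k[0].length;
        float[][] y = new float[x.length - h + 1][x[0].length - w + 1];
        for (int i = 0; i < y.length; i++) {
            for (int j = 0; j < y[0].length; j++) {
                float sum = 0f;
                for (int a = 0; a < h; a++) {
                    for (int b = 0; b < w; b++) {
                        sum += x[i + a][j + b] * k[a][b];
                    }
                }
                y[i][j] = sum;
            }
        }
        return y;
    }

    // Reverse the kernel along both axes (a 180-degree rotation).
    static float[][] flip(float[][] k) {
        int h = k.length;
        int w = k[0].length;
        float[][] f = new float[h][w];
        for (int a = 0; a < h; a++) {
            for (int b = 0; b < w; b++) {
                f[a][b] = k[h - 1 - a][w - 1 - b];
            }
        }
        return f;
    }

    // Strict convolution (no padding) expressed through cross-correlation.
    static float[][] conv2d(float[][] x, float[][] k) {
        return corr2d(x, flip(k));
    }

    public static void main(String[] args) {
        float[][] x = {{0, 1, 2}, {3, 4, 5}, {6, 7, 8}};
        float[][] k = {{0, 1}, {2, 3}};
        // Cross-correlation: [[19.0, 25.0], [37.0, 43.0]]
        System.out.println(java.util.Arrays.deepToString(corr2d(x, k)));
        // Strict convolution: [[5.0, 11.0], [23.0, 29.0]]
        System.out.println(java.util.Arrays.deepToString(conv2d(x, k)));
    }
}
```

For a fixed kernel the two outputs differ, but flipping the kernel twice returns the original, so conv2d(x, flip(k)) equals corr2d(x, k). This is why a learned layer produces the same outputs under either convention.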

6.2.6. Summary

  • The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation. In its simplest form, this performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.

  • We can design a kernel to detect edges in images.

  • We can learn the kernel’s parameters from data.

6.2.7. Exercises

  1. Construct an image X with diagonal edges.

    • What happens if you apply the kernel K to it?

    • What happens if you transpose X?

    • What happens if you transpose K?

  2. When you try to automatically find the gradient for the ConvolutionalLayer class we created, what kind of error message do you see?

  3. How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel arrays?

  4. Design some kernels manually.

    • What is the form of a kernel for the second derivative?

    • What is the kernel for the Laplace operator?

    • What is the kernel for an integral?

    • What is the minimum size of a kernel to obtain a derivative of degree \(d\)?