
# 6.5. Pooling

Often, as we process images, we want to gradually reduce the spatial resolution of our hidden representations, aggregating information so that the higher up we go in the network, the larger the receptive field (in the input) to which each hidden node is sensitive.

Our ultimate task usually asks some global question about the image, e.g.,
*does it contain a cat?* So typically the nodes of our final layer
should be sensitive to the entire input. By gradually aggregating
information, yielding coarser and coarser maps, we accomplish this goal
of ultimately learning a global representation, while keeping all of the
advantages of convolutional layers at the intermediate layers of
processing.

Moreover, when detecting lower-level features, such as edges (as
discussed in Section 6.2), we often want our
representations to be somewhat invariant to translation. For instance,
if we take the image `X` with a sharp delineation between black and
white and shift the whole image by one pixel to the right, i.e.,
`Z[i, j] = X[i, j+1]`, then the output for the new image `Z` might
be vastly different. The edge will have shifted by one pixel and with it
all the activations. In reality, objects hardly ever occur exactly at
the same place. In fact, even with a tripod and a stationary object,
vibration of the camera due to the movement of the shutter might shift
everything by a pixel or so (high-end cameras are loaded with special
features to address this problem).

This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity of convolutional layers to location and of spatially downsampling representations.

## 6.5.1. Maximum Pooling and Average Pooling

Like convolutional layers, pooling operators consist of a fixed-shape
window that is slid over all regions in the input according to its
stride, computing a single output for each location traversed by the
fixed-shape window (sometimes known as the *pooling window*). However,
unlike the cross-correlation computation of the inputs and kernels in
the convolutional layer, the pooling layer contains no parameters (there
is no *filter*). Instead, pooling operators are deterministic, typically
calculating either the maximum or the average value of the elements in
the pooling window. These operations are called *maximum pooling* (*max
pooling* for short) and *average pooling*, respectively.

In both cases, as with the cross-correlation operator, we can think of
the pooling window as starting from the top left of the input array and
sliding across the input array from left to right and top to bottom. At
each location that the pooling window hits, it computes the maximum or
average value of the input subarray in the window (depending on whether
*max* or *average* pooling is employed).

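The output array in Fig. 6.5.1 has a height of 2 and a width of 2. For the \(3\times 3\) input holding the values 0 through 8 in row-major order and a \(2\times 2\) pooling window, the four output elements are the maxima of the four windows:

\[
\begin{aligned}
\max(0, 1, 3, 4) &= 4,\\
\max(1, 2, 4, 5) &= 5,\\
\max(3, 4, 6, 7) &= 7,\\
\max(4, 5, 7, 8) &= 8.
\end{aligned}
\]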

A pooling layer with a pooling window shape of \(p \times q\) is called a \(p \times q\) pooling layer. The pooling operation is called \(p \times q\) pooling.

Let us return to the object edge detection example mentioned at the
beginning of this section. Now we will use the output of the
convolutional layer as the input for \(2\times 2\) maximum pooling.
Set the convolutional layer input as `X` and the pooling layer output
as `Y`. Whether the values of `X[i, j]` and `X[i, j+1]` differ, or
those of `X[i, j+1]` and `X[i, j+2]` differ, the pooling layer always
outputs `Y[i, j]=1`. That is to say, using the
\(2\times 2\) maximum pooling layer, we can still detect the
pattern recognized by the convolutional layer as long as it moves no
more than one element in height and width.

In the code below, we implement the forward computation of the pooling
layer in the `pool2d` function. This function is similar to the
`corr2d` function in Section 6.2. However, here we have
no kernel, computing the output as either the max or the average of each
region in the input.

```
%load ../utils/djl-imports
```

```
NDManager manager = NDManager.newBaseManager();

public NDArray pool2d(NDArray X, Shape poolShape, String mode) {
    long poolHeight = poolShape.get(0);
    long poolWidth = poolShape.get(1);

    // With stride 1 and no padding, each axis shrinks by (window size - 1)
    NDArray Y = manager.zeros(new Shape(X.getShape().get(0) - poolHeight + 1,
                                        X.getShape().get(1) - poolWidth + 1));
    for (int i = 0; i < Y.getShape().get(0); i++) {
        for (int j = 0; j < Y.getShape().get(1); j++) {
            // Slice out the pooling window whose top-left corner is at (i, j)
            NDArray window = X.get(new NDIndex(i + ":" + (i + poolHeight) + ", " + j + ":" + (j + poolWidth)));
            if ("max".equals(mode)) {
                Y.set(new NDIndex(i + "," + j), window.max());
            } else if ("avg".equals(mode)) {
                Y.set(new NDIndex(i + "," + j), window.mean());
            }
        }
    }
    return Y;
}
```

We can construct the input array `X` in the above diagram to validate
the output of the two-dimensional maximum pooling layer.

```
NDArray X = manager.arange(9f).reshape(3,3);
pool2d(X, new Shape(2,2), "max");
```

```
ND: (2, 2) gpu(0) float32
[[4., 5.],
[7., 8.],
]
```

We can also experiment with the average pooling layer.

```
pool2d(X, new Shape(2,2), "avg");
```

```
ND: (2, 2) gpu(0) float32
[[2., 3.],
[5., 6.],
]
```
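
Returning to the edge detection example, the following is a minimal plain-Java sketch (it does not use DJL) of why \(2\times 2\) maximum pooling tolerates a one-pixel shift. The class `EdgePoolingSketch` and its helpers `stepImage`, `edgeDetect`, and `maxPool2x2` are hypothetical names introduced only for illustration, and the \(1\times 2\) edge kernel `[1, -1]` is assumed here as the detector.

```java
import java.util.Arrays;

final class EdgePoolingSketch {
    // Image with ones in the first `edge` columns and zeros elsewhere.
    static double[][] stepImage(int rows, int cols, int edge) {
        double[][] x = new double[rows][cols];
        for (double[] row : x) {
            Arrays.fill(row, 0, edge, 1.0);
        }
        return x;
    }

    // Cross-correlation with the assumed 1x2 kernel [1, -1]:
    // c[i][j] = x[i][j] - x[i][j + 1].
    static double[][] edgeDetect(double[][] x) {
        int h = x.length, w = x[0].length;
        double[][] c = new double[h][w - 1];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w - 1; j++) {
                c[i][j] = x[i][j] - x[i][j + 1];
            }
        }
        return c;
    }

    // 2x2 max pooling with stride 1 and no padding.
    static double[][] maxPool2x2(double[][] c) {
        int h = c.length - 1, w = c[0].length - 1;
        double[][] y = new double[h][w];
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w; j++) {
                y[i][j] = Math.max(Math.max(c[i][j], c[i][j + 1]),
                                   Math.max(c[i + 1][j], c[i + 1][j + 1]));
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // Edge between columns 1 and 2, then shifted to between columns 2 and 3
        double[][] a = maxPool2x2(edgeDetect(stepImage(3, 6, 2)));
        double[][] b = maxPool2x2(edgeDetect(stepImage(3, 6, 3)));
        System.out.println(Arrays.toString(a[0])); // [1.0, 1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(b[0])); // [0.0, 1.0, 1.0, 0.0]
    }
}
```

In both cases the pooled output at column 1 equals 1, even though the raw edge response moved by one column.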

## 6.5.2. Padding and Stride

As with convolutional layers, pooling layers can also change the output
shape. And as before, we can alter the operation to achieve a desired
output shape by padding the input and adjusting the stride. We can
demonstrate the use of padding and strides in pooling layers via the
two-dimensional maximum pooling layer `maxPool2dBlock` shipped in
DJL’s `Pool` module. We first construct an input of shape
`(1, 1, 4, 4)`, where the first two dimensions are batch and channel.

```
X = manager.arange(16f).reshape(1, 1, 4, 4);
X
```

```
ND: (1, 1, 4, 4) gpu(0) float32
[[[[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.],
],
],
]
```

Below, we use a pooling window of shape `(3, 3)` and a stride shape
of `(3, 3)`.

```
// defining block specifying kernel and stride
Block block = Pool.maxPool2dBlock(new Shape(3, 3), new Shape(3, 3));
block.initialize(manager, DataType.FLOAT32, new Shape(1,1,4,4));
ParameterStore parameterStore = new ParameterStore(manager, false);
// The pooling layer has no model parameters to learn; the ParameterStore is
// only passed because the forward method requires one
block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
```

```
ND: (1, 1, 1, 1) gpu(0) float32
[[[[10.],
],
],
]
```
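
The \(1\times 1\) output above agrees with the usual output-size rule \(\lfloor (n + 2p - k)/s \rfloor + 1\) applied to each spatial axis, where \(n\) is the input length, \(k\) the window length, \(s\) the stride, and \(p\) the padding on each side. The following plain-Java sketch evaluates that rule. The class `PoolShapeSketch` and method `outputSize` are hypothetical names, and the floor rounding is an assumption that happens to be consistent with the outputs shown in this section.

```java
final class PoolShapeSketch {
    // Output length along one axis: floor((n + 2 * pad - window) / stride) + 1.
    // Integer division floors here because the numerator is non-negative.
    static long outputSize(long n, long window, long stride, long pad) {
        return (n + 2 * pad - window) / stride + 1;
    }

    public static void main(String[] args) {
        // Height (and width) 4, window 3, stride 3, no padding
        System.out.println(outputSize(4, 3, 3, 0)); // 1
    }
}
```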

The stride and padding can be manually specified.

```
// redefining block shapes for kernel shape, stride shape and pad shape
block = Pool.maxPool2dBlock(new Shape(3,3), new Shape(2,2), new Shape(1,1));
// block forward method
block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
```

```
ND: (1, 1, 2, 2) gpu(0) float32
[[[[ 5., 7.],
[13., 15.],
],
],
]
```

Of course, we can specify an arbitrary rectangular pooling window and specify the padding and stride for height and width, respectively.

```
// redefining block shapes for kernel shape, stride shape and pad shape
block = Pool.maxPool2dBlock(new Shape(2,3), new Shape(2,3), new Shape(1,2));
block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
```

```
ND: (1, 1, 3, 2) gpu(0) float32
[[[[ 0., 3.],
[ 8., 11.],
[12., 15.],
],
],
]
```
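
To make the effect of padding and stride concrete, here is a minimal plain-Java sketch (without DJL) of maximum pooling on a single two-dimensional array. The class `PaddedMaxPoolSketch` and its method `maxPool2d` are hypothetical names. The sketch assumes that padded positions never win the maximum (they are simply skipped) and that output sizes are rounded down; both assumptions are consistent with the outputs shown above, but this is not DJL's implementation.

```java
import java.util.Arrays;

final class PaddedMaxPoolSketch {
    // kh, kw: window shape; sh, sw: stride; ph, pw: padding on each side.
    static double[][] maxPool2d(double[][] x, int kh, int kw,
                                int sh, int sw, int ph, int pw) {
        int h = x.length, w = x[0].length;
        int outH = (h + 2 * ph - kh) / sh + 1;
        int outW = (w + 2 * pw - kw) / sw + 1;
        double[][] y = new double[outH][outW];
        for (int i = 0; i < outH; i++) {
            for (int j = 0; j < outW; j++) {
                double best = Double.NEGATIVE_INFINITY;
                // The window starts at (i * sh - ph, j * sw - pw) in input coordinates
                for (int r = i * sh - ph; r < i * sh - ph + kh; r++) {
                    for (int c = j * sw - pw; c < j * sw - pw + kw; c++) {
                        // Skip positions that fall into the padding
                        if (r >= 0 && r < h && c >= 0 && c < w) {
                            best = Math.max(best, x[r][c]);
                        }
                    }
                }
                y[i][j] = best;
            }
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] x = new double[4][4];
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                x[i][j] = 4 * i + j; // same values as arange(16) reshaped to 4x4
            }
        }
        System.out.println(Arrays.deepToString(maxPool2d(x, 3, 3, 2, 2, 1, 1)));
        // [[5.0, 7.0], [13.0, 15.0]]
        System.out.println(Arrays.deepToString(maxPool2d(x, 2, 3, 2, 3, 1, 2)));
        // [[0.0, 3.0], [8.0, 11.0], [12.0, 15.0]]
    }
}
```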

## 6.5.3. Multiple Channels

When processing multi-channel input data, the pooling layer pools each
input channel separately, rather than summing the inputs over the
channels as in a convolutional layer. This means that the number of
output channels for the pooling layer is the same as the number of input
channels. Below, we will concatenate arrays `X` and `X+1` on the
channel dimension to construct an input with 2 channels.

```
X = X.concat(X.add(1), 1);
X
```

```
ND: (1, 2, 4, 4) gpu(0) float32
[[[[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.],
],
[[ 1., 2., 3., 4.],
[ 5., 6., 7., 8.],
[ 9., 10., 11., 12.],
[13., 14., 15., 16.],
],
],
]
```

As we can see, the number of output channels is still 2 after pooling.

```
block = Pool.maxPool2dBlock(new Shape(3,3), new Shape(2,2), new Shape(1,1));
block.forward(parameterStore, new NDList(X), true).singletonOrThrow();
```

```
ND: (1, 2, 2, 2) gpu(0) float32
[[[[ 5., 7.],
[13., 15.],
],
[[ 6., 8.],
[14., 16.],
],
],
]
```
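
The per-channel behavior can be sketched in plain Java (without DJL) as a loop that pools each channel on its own and never mixes values across channels. The class `ChannelwisePoolSketch` and its methods `maxPool` and `maxPoolChannels` are hypothetical names; for brevity the sketch assumes a square window with equal stride and padding on both axes.

```java
import java.util.Arrays;

final class ChannelwisePoolSketch {
    // Max pooling on one channel with a k x k window, stride s, and padding p.
    static double[][] maxPool(double[][] x, int k, int s, int p) {
        int h = x.length, w = x[0].length;
        int outH = (h + 2 * p - k) / s + 1;
        int outW = (w + 2 * p - k) / s + 1;
        double[][] y = new double[outH][outW];
        for (int i = 0; i < outH; i++) {
            for (int j = 0; j < outW; j++) {
                double best = Double.NEGATIVE_INFINITY;
                // Clip the window to the valid input region (padding is skipped)
                for (int r = Math.max(0, i * s - p); r < Math.min(h, i * s - p + k); r++) {
                    for (int c = Math.max(0, j * s - p); c < Math.min(w, j * s - p + k); c++) {
                        best = Math.max(best, x[r][c]);
                    }
                }
                y[i][j] = best;
            }
        }
        return y;
    }

    // Pools every channel independently, so the channel count is unchanged.
    static double[][][] maxPoolChannels(double[][][] x, int k, int s, int p) {
        double[][][] y = new double[x.length][][];
        for (int ch = 0; ch < x.length; ch++) {
            y[ch] = maxPool(x[ch], k, s, p);
        }
        return y;
    }

    public static void main(String[] args) {
        double[][][] x = new double[2][4][4];
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                x[0][i][j] = 4 * i + j;     // channel 0: values 0..15
                x[1][i][j] = 4 * i + j + 1; // channel 1: channel 0 plus 1
            }
        }
        System.out.println(Arrays.deepToString(maxPoolChannels(x, 3, 2, 1)));
        // [[[5.0, 7.0], [13.0, 15.0]], [[6.0, 8.0], [14.0, 16.0]]]
    }
}
```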

## 6.5.4. Summary

Taking the input elements in the pooling window, the maximum pooling operation assigns the maximum value as the output and the average pooling operation assigns the average value as the output.

One of the major functions of a pooling layer is to alleviate the excessive sensitivity of the convolutional layer to location.

We can specify the padding and stride for the pooling layer.

Maximum pooling, combined with a stride larger than 1, can be used to reduce the spatial resolution.

The pooling layer’s number of output channels is the same as the number of input channels.

## 6.5.5. Exercises

Can you implement average pooling as a special case of a convolution layer? If so, do it.

Can you implement max pooling as a special case of a convolution layer? If so, do it.

What is the computational cost of the pooling layer? Assume that the input to the pooling layer is of size \(c\times h\times w\), the pooling window has a shape of \(p_h\times p_w\) with a padding of \((p_h, p_w)\) and a stride of \((s_h, s_w)\).

Why do you expect maximum pooling and average pooling to work differently?

Do we need a separate minimum pooling layer? Can you replace it with another operation?

Is there another operation between average and maximum pooling that you could consider (hint: recall the softmax)? Why might it not be so popular?