Q17 – ConvNet Invariances

Question 1: A convolutional neural network (CNN) has the ability to be “insensitive” to some slight spatial variations in the input data, such as translation. In comparison with the regular feed-forward neural networks, the CNN architecture has two components responsible for providing this kind of insensitivity. Explain which are those components and how a CNN can ignore small translations in the input data.


5 thoughts on “Q17 – ConvNet Invariances”

  1. The two mechanisms responsible for translation invariance are the convolution itself and the pooling operation. Since the convolution shifts a kernel over the whole image, a single kernel is enough to detect an edge anywhere in the image, whereas an MLP would need as many edge detectors as there are positions in the image.

    On the other hand, (max) pooling keeps only part of its input information, which provides great tolerance to input distortions and makes the receptive fields of the neurons in the next layer bigger.

    Do you agree? Did I miss something?
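    A tiny numpy sketch of the weight-sharing point above (the 1×2 edge kernel and image sizes are made up for illustration): one shared kernel detects the same edge wherever it appears, whereas an MLP would need separate weights for each position.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A hypothetical vertical-edge detector.
kernel = np.array([[1.0, -1.0]])

# The same bright-to-dark edge placed at two different columns.
img_a = np.zeros((4, 6)); img_a[:, :2] = 1.0   # edge after column 1
img_b = np.zeros((4, 6)); img_b[:, :4] = 1.0   # edge after column 3

# The single shared kernel fires at both locations.
print(np.argmax(conv2d_valid(img_a, kernel), axis=1))  # [1 1 1 1]
print(np.argmax(conv2d_valid(img_b, kernel), axis=1))  # [3 3 3 3]
```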


    • I think it depends on what we mean by “slight spatial variations”. If we mean a translation of one or two pixels, then I think the answer would be max pooling and the stride, since, as you pointed out, the loss of information is very local.

      I think the fact that there are shared parameters doesn’t really come into play here. I think it matters when we have “big spatial variations”, i.e. when a feature can appear anywhere in the image.


      • “A convolutional neural network (CNN) has the ability to be “insensitive” to some slight spatial variations in the input data, such as translation.”

        Given the phrase “such as translation” at the end, I think the question was intended to highlight the invariance of CNNs in general, not only a 1–2 pixel shift. My interpretation of “slight spatial variations” was that we’re not talking about big spatial distortions, but rather gentle transformations through which the CNN will still detect the features.

        With that interpretation in mind, I’d go with the shared parameters as the main mechanism in play.

        As described by Hubel &amp; Wiesel (neuro/vision), some of your neurons are sensitive (and respond) to a specific orientation of stimuli. All you need is for your receptive field, i.e. the signal coming from your eye, to be connected to that neuron so that it can fire. In the case of CNNs we ensure that the receptive field is connected to the neuron responsible for a specific orientation (or feature) by convolving over the image, meaning that we connect that neuron to receptive fields at various locations (simultaneously), covering the image in full (given the kernel size, strides and padding). Therefore, no matter where in the image the feature appears (from low-level features as in V1 to higher-level features as in V4), the specialized neuron will fire, making the network invariant to translation. If the weights were different from location to location (i.e. not shared across the image), we’d be looking for different features at different locations, going back to the complexity of a fully connected layer.


  2. I think I agree with what Massimiliano has suggested. I did want to ask, or maybe comment on, the fact that the convolution component appears to me to be equivariant, since applying a slight translation to a subset of pixels in the input would result in a shifted feature map after the convolutional layer, but before the pooling layer. I believe this equivariance is a consequence of using the same kernel (parameter sharing) in the convolution.

    Therefore, wouldn’t the pooling component be mostly responsible for the ‘invariance’ to translation, because a small translation yields the same output as long as the translation is small compared to the pooling layer’s ‘receptive field’ (the maximum is still present in the input even after the translation)?

    Could it be that the learned features are invariant to small translations only thanks to pooling, while the convolution itself is globally equivariant to translation?

    (The textbook, section 9.3, is fairly direct in describing pooling as making a representation approximately invariant to small translations. I make the point about convolution because section 9.2 (pp. 329–330) mentions the convolution’s equivariance.)
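    The equivariance-vs-invariance distinction above can be checked numerically. A minimal numpy sketch (the line image, 1×2 kernel, and 2×2 pooling window are made-up examples): shifting the input shifts the convolution’s feature map (equivariance), while max pooling absorbs the shift (approximate invariance).

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' cross-correlation with a single shared kernel."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool(x, size=2):
    """Non-overlapping size x size max pooling (trims any ragged edge)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

kernel = np.array([[1.0, -1.0]])

img = np.zeros((4, 8)); img[:, 2] = 1.0    # a vertical line
shifted = np.roll(img, 1, axis=1)          # the same line, shifted 1 px right

fa = conv2d_valid(img, kernel)
fb = conv2d_valid(shifted, kernel)

# Equivariance: the feature map is shifted, not unchanged.
print(np.array_equal(fa, fb))                      # False
print(np.array_equal(np.roll(fa, 1, axis=1), fb))  # True

# Invariance: after 2x2 max pooling the small shift disappears.
print(np.array_equal(maxpool(fa), maxpool(fb)))    # True
```

    So the convolution output moves with the input, and only the pooled representation is (approximately) unchanged, matching the textbook’s section 9.2 vs 9.3 distinction.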


    • I would add to my answer that the first lecture on convnets states, on slide #23:

      “Invariances built-in in convolutional networks:
      -small translations: due to convolution and max pooling “

