
I am trying to understand how an nn.Conv1d layer processes its input, for a specific example related to audio processing in a WaveNet model.

I have input data of shape (1, 1, 8820), which passes through an input layer whose weights have shape (16, 1, 1), producing an output of shape (1, 16, 8820).

That part I understand, because you can just multiply the two matrices. The next layer is a Conv1d with kernel size=3, input channels=16, and output channels=16, so the state dict shows a weight tensor of shape (16, 16, 3). When the input of shape (1, 16, 8820) goes through that layer, the result is another (1, 16, 8820).

What multiplication steps occur within the layer to apply the weights to the audio data? In other words, if I wanted to apply the layer (forward calculations only) using only the input matrix, the state_dict matrix, and numpy, how would I do that?

This example uses the nn.Conv1d layer from PyTorch. Also, if the same layer had dilation=2, how would that change the operations?
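For reference, here is a minimal PyTorch sketch that reproduces the shapes described above (the layer names are my own, not the actual WaveNet code; padding=1 on the second layer is an assumption, since that is what keeps the output length at 8820 with kernel size 3):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8820)                  # (batch, channels, samples)

# "Input layer": 1 -> 16 channels, kernel size 1; weight shape (16, 1, 1)
input_layer = nn.Conv1d(1, 16, kernel_size=1)
h = input_layer(x)                           # shape (1, 16, 8820)

# Second layer: 16 -> 16 channels, kernel size 3; weight shape (16, 16, 3).
# padding=1 is assumed so the length stays 8820.
conv = nn.Conv1d(16, 16, kernel_size=3, padding=1)
y = conv(h)                                  # shape (1, 16, 8820)
print(h.shape, y.shape)
```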

– Keith (edited by Bedir Yilmaz)

1 Answer


A convolution is a specific type of "sliding window operation": the same function/operation is applied to overlapping sliding windows of the input.
In your example, each window of 3 consecutive temporal samples (each with 16 channels) is fed to 16 filters. Therefore, you have a weight tensor of shape 16x16x3 (out_channels x in_channels x kernel_size).

You can think of it as "unfolding" the (1, 16, 8820) signal into (1, 16*3, 8820) sliding windows (with padding=1 so the length stays 8820), then multiplying by a 16 x 16*3 weight matrix to get an output of shape (1, 16, 8820).
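The unfold-then-matmul view can be sketched in numpy like this (a minimal sketch, assuming padding=1; `w` and `b` stand in for the state_dict weight of shape (16, 16, 3) and bias of shape (16,)):

```python
import numpy as np

def conv1d_numpy(x, w, b, padding=1):
    batch, c_in, length = x.shape             # e.g. (1, 16, 8820)
    c_out, _, k = w.shape                     # e.g. (16, 16, 3)
    # Zero-pad the time axis so the output keeps the same length
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = length + 2 * padding - k + 1
    # "Unfold": stack the k shifted views -> (batch, c_in, k, out_len),
    # then flatten (c_in, k) into one axis -> (batch, c_in*k, out_len)
    cols = np.stack([xp[:, :, i:i + out_len] for i in range(k)], axis=2)
    cols = cols.reshape(batch, c_in * k, out_len)
    # One matmul with the flattened weights: (c_out, c_in*k) @ (c_in*k, L)
    return np.einsum('of,bfl->bol', w.reshape(c_out, c_in * k), cols) \
        + b[None, :, None]

x = np.random.randn(1, 16, 8820)
w = np.random.randn(16, 16, 3)    # stands in for conv.weight
b = np.random.randn(16)           # stands in for conv.bias
y = conv1d_numpy(x, w, b)         # shape (1, 16, 8820)
```

Note that PyTorch's Conv1d actually computes a cross-correlation (no kernel flip), which is what the shifted views above implement.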

Padding, dilation and strides affect the way the "sliding windows" are formed.
See nn.Unfold for more information.
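For the dilation=2 part of the question: the three taps of each window are taken 2 samples apart (positions t, t+2, t+4) instead of adjacent, so only the way the windows are gathered changes. A numpy sketch under the same assumptions as above (padding=2 keeps the length at 8820; `w` and `b` again stand in for the state_dict tensors):

```python
import numpy as np

def dilated_conv1d_numpy(x, w, b, dilation=2, padding=2):
    batch, c_in, length = x.shape
    c_out, _, k = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = length + 2 * padding - dilation * (k - 1)
    # Taps are spaced `dilation` samples apart: t, t+d, t+2d, ...
    cols = np.stack([xp[:, :, i * dilation:i * dilation + out_len]
                     for i in range(k)], axis=2)
    cols = cols.reshape(batch, c_in * k, out_len)
    return np.einsum('of,bfl->bol', w.reshape(c_out, c_in * k), cols) \
        + b[None, :, None]
```

With dilation=1 this reduces to the ordinary convolution above; WaveNet stacks such layers with growing dilation to widen the receptive field without adding weights.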

– Shai
    Your explanation on unfolding the matrix really cleared that up for me, much appreciated! – Keith Aug 10 '20 at 13:21