
I am working from the network architecture in the paper "Learning Fine-grained Image Similarity with Deep Ranking", and I cannot figure out how the outputs of the three parallel networks are merged by the linear embedding layer. The only information the paper gives on this layer is:

Finally, we normalize the embeddings from the three parts, and combine them with a linear embedding layer. The dimension of the embedding is 4096.

Can anyone help me figure out what exactly the author means by this layer?

Snympi
A. Sam
  • It's unfortunate for me that there is no answer to this question, because I'm stuck on the exact same issue. Did you figure it out? – LKM Oct 10 '17 at 16:51
  • I did not figure out the answer, but I just concatenated the outputs of the three parts and passed them through a dense layer with 4096 nodes. – A. Sam Oct 12 '17 at 08:53
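The commenter's workaround can be sketched in a few lines of numpy. This is only an illustration of "concatenate, then one dense layer with 4096 units"; the branch output sizes and the random weights are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed branch output sizes (the paper does not spell these out here).
e1 = rng.standard_normal(4096)   # deep ConvNet branch
e2 = rng.standard_normal(1024)   # first shallow branch
e3 = rng.standard_normal(1024)   # second shallow branch

concat = np.concatenate([e1, e2, e3])            # shape (6144,)
W = rng.standard_normal((4096, concat.size)) * 0.01
b = np.zeros(4096)

embedding = W @ concat + b                        # dense layer, no activation
print(embedding.shape)                            # (4096,)
```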

2 Answers


"Linear embedding layer" is most likely just a fancy name for a dense layer with no activation: "linear" means the activation is the identity. "Embedding" is the general concept of a vector representation of the input data (as in word embeddings). As for the combination, I believe the elements of the second vector are simply added to the first one element-wise.
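The element-wise reading above can be sketched as follows. The dimensions are assumptions for illustration; the point is that summing equally sized parts is itself a special case of a linear layer applied to their concatenation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume each normalized part already has the target dimension (4096).
p1, p2, p3 = (rng.standard_normal(4096) for _ in range(3))

combined = p1 + p2 + p3        # element-wise sum, still 4096-dimensional

# Equivalent linear-layer view: weight matrix [I | I | I] applied to
# the concatenation of the three parts (no bias, no activation).
W = np.hstack([np.eye(4096)] * 3)
combined_linear = W @ np.concatenate([p1, p2, p3])
print(np.allclose(combined, combined_linear))
```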

Dmytro Prylipko

It's mentioned in the paper:

A local normalization layer normalizes the feature map around a local neighborhood to have unit norm and zero mean. It leads to feature maps that are robust to the differences in illumination and contrast.

They take each part of the model and normalize it separately.
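A minimal numpy sketch of that per-part normalization, under the simple reading of the quoted passage (zero mean, then unit norm); the vector size and the `eps` guard are illustrative choices, not details from the paper.

```python
import numpy as np

def normalize(v, eps=1e-12):
    """Zero-mean, unit-norm normalization of one branch's embedding."""
    v = v - v.mean()                       # zero mean
    return v / (np.linalg.norm(v) + eps)   # unit norm

rng = np.random.default_rng(2)
part = rng.standard_normal(1024)           # one branch's raw embedding
n = normalize(part)
print(np.linalg.norm(n), n.mean())
```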

As for combining them: as you commented, a plain dense layer works. Since the goal is to capture the most salient features with an under-complete representation, there is no need for a non-linearity.

Fadi Bakoura