I'm following the Bi-Encoder architecture (see here) to build a dense retrieval (search) system. Formally, my network encodes a query q and an item description d starting from fixed Sentence Transformers representations, denoted SBERT(q) and SBERT(d) respectively.
It then learns a transformation (the 'pooling' in the picture below) that maximizes the cosine similarity for positive examples (where the query and item description match) and minimizes it for negative examples (randomly paired queries and descriptions). I use an MSE loss against {0, 1} labels.
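In case it helps, here is a minimal sketch of what I'm doing (the layer sizes, the 384-dimensional embeddings, and the single shared head are simplifications of my actual code; the SBERT vectors are precomputed offline):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingHead(nn.Module):
    """Learned transformation applied on top of the frozen SBERT embeddings."""
    def __init__(self, dim_in=384, dim_out=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_out),
            nn.ReLU(),
            nn.Linear(dim_out, dim_out),
        )

    def forward(self, x):
        return self.net(x)

head = PoolingHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# sbert_q, sbert_d: precomputed SBERT(q) / SBERT(d) batches, shape (B, 384)
# labels: 1.0 for matching query/description pairs, 0.0 for random pairs
sbert_q = torch.randn(32, 384)   # placeholders for the offline embeddings
sbert_d = torch.randn(32, 384)
labels = torch.randint(0, 2, (32,)).float()

optimizer.zero_grad()
q = head(sbert_q)                # transformed query representation
d = head(sbert_d)                # transformed description representation
sim = F.cosine_similarity(q, d, dim=-1)   # one similarity score per pair, in [-1, 1]
loss = loss_fn(sim, labels)      # regress similarity onto the {0, 1} label
loss.backward()
optimizer.step()
```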
Now, when I train the network, I observe that it (always) converges to a cosine similarity of about 0.5 for every example, as long as my {0, 1} labels are balanced. If I change the ratio of positive to negative examples, it converges to whatever constant minimizes the MSE loss, still producing (almost) the same output for positive and negative examples.
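To make "whatever minimizes the MSE loss" concrete: if the network collapses to a constant similarity $c$ for every pair, the loss it sees is

$$\frac{1}{N}\sum_{i=1}^{N}(c - y_i)^2,$$

which is minimized at $c = \bar{y}$, the fraction of positive labels: 0.5 for a balanced set, and the label mean otherwise. So the network appears to learn nothing beyond the base rate of the labels.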
What could be going wrong? My dataset isn't large (only a few thousand examples). The queries are fairly semantically related to the descriptions, so the mapping shouldn't be too hard to learn. The offline-computed sentence representations for the queries and descriptions also look reasonable. I have tried both smaller and larger networks for the pooling transformation, all with the same effect.
