7

I would like to know if exists a rule of thumb to set the $\gamma$ parameter in Focal Loss when we have very imbalanced classes.

The focal loss first appeared in Focal Loss for Dense Object Detection by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

Sycorax
  • 90,934
Simone
  • 311
  • (for anyone who's stumbled upon this thread) I'm doing some time series anomaly detection and focal loss wasn't working for me until I set gamma=20, so don't shy away from trying super high values, sometimes that's what you need to do! – Bruno E Feb 27 '24 at 10:09

2 Answers2

4

The parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted and that is quite dataset and application dependent. In the paper, the focal loss is actually given as: $−\alpha (1 − p_t)^\gamma \log(p_t)$ which is a reformulated view of the standard cross-entropy loss and the class imbalance itself is "controlled" by $\alpha$ rather than $\gamma$. People often treat the focal loss as a tool to primarily address class imbalance whether it is actually a tool to primarily address information asymmetry during learning; i.e. focal loss can very relevant when training with a balanced set where one of the two classes is easy to distinguish. But to bring us back: there is no good rule of thumb aside setting $\gamma=2$ (as the paper suggests) and then adjusting it based on our evaluation criteria. This point highlights the difference between a loss function and an evaluation metric; CV.SE has thread on Loss function and evaluation metric if you want to explore this distinction further but the main point to carry here is that $\gamma$ (i.e. a hyper-parameter of our loss) needs to be adjust in relation to our evaluation criteria and "on it's own" is often meaningless.

usεr11852
  • 44,125
3

On page 3, the authors write "we found $\gamma = 2$ to work best in our experiments." This tells us that they chose $\gamma$ experimentally by training models with different values and then choosing the $\gamma$ from the best model. Because this value was chosen experimentally, it may not be the best choice across all modeling tasks or datasets.

You might use $\gamma =2$, with the reasoning that the paper authors found that to be the best value. Alternatively, you can tune $\gamma$ experimentally (assuming you have the resources to do so).

Sycorax
  • 90,934