In problems where exactly one of $3+$ categories can be observed and we predict the probability of each category being observed, it is known that the Brier score is a strictly proper scoring rule, meaning its expected value is uniquely optimized by the true probability values [1, 2, 3, 4]. The machine learning community often refers to these problems with $3+$ categories as "multiclass" problems.
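Concretely, one standard way to write this (using notation introduced here for illustration: $K$ categories, a one-hot outcome vector $y \in \{0,1\}^K$ with $\sum_k y_k = 1$, and a forecast $p$ on the probability simplex) is

$$\mathrm{BS}(p, y) = \sum_{k=1}^{K} (p_k - y_k)^2,$$

and strict propriety means that for any true distribution $q$, the expected score $\mathbb{E}_{y \sim q}[\mathrm{BS}(p, y)]$ is uniquely minimized at $p = q$.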
In contrast, "multi-label" problems allow for all, none, or any combination of categorical outcomes to be observed, and we model the probability of each individual outcome, possibly with relationships between the outcomes (e.g., if there is a horse in a photo, there probably isn't an airplane).
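In the multi-label setting, one natural extension (using the same illustrative notation) drops the one-hot constraint: the outcome is any vector $y \in \{0,1\}^K$, the forecast $p \in [0,1]^K$ gives a marginal probability for each label, and the score decomposes into a sum of per-label binary Brier scores,

$$\mathrm{BS}(p, y) = \sum_{k=1}^{K} (p_k - y_k)^2.$$

Note that this form scores only the per-label marginals, not the joint dependence between labels.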
For a multi-label outcome, is the Brier score still a strictly proper scoring rule?