I am running a supervised regression with cross-validation, and I wish to use $R^2$ as my performance metric. I am using leave-p-out cross-validation with $p=2$, which gives me approximately 4500 folds, and I obtain an $R^2$ statistic for each fold.
I want to find a sensible way to generalize the explained variance over the entire model. Initially, I had assumed the average over all folds would be sensible. However, the issue is that $R^2$ is unbounded below: in some folds I get a reasonable level of performance ($R^2 = 0.6$), while in others I get enormously negative values ($R^2 = -1900$), no doubt because each fold is tested on only 2 instances. As a result, the average $R^2$ over all 4500 folds is a negative number.
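To illustrate why a 2-instance test fold can blow up like this, here is a minimal sketch (the numbers are made up, not from my data): when the two test targets happen to be nearly equal, the total sum of squares in the denominator of $R^2 = 1 - SS_{res}/SS_{tot}$ is tiny, so even a modest prediction error produces a huge negative score.

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, computed on the test fold only
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Two test instances with nearly identical targets:
# SS_tot is tiny, so a small prediction error explodes R^2 downward.
y_true = np.array([5.00, 5.01])
y_pred = np.array([4.80, 5.20])
print(r2_score(y_true, y_pred))  # roughly -1521
```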
My question is: can I sensibly justify representing this in some other way, to demonstrate that "when the model works, $R^2 = X$"?
Given that a model which always predicts the expected value of the target would yield $R^2 = 0$, is there any practical difference between $R^2 = 0$ and $R^2 = -1000$? If not, is it (in)appropriate to treat all negative $R^2$ scores as 0, and thus avoid washing out the positive values from folds where the model did work?
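Concretely, the clipping I am asking about would look like this (the per-fold scores here are hypothetical stand-ins for my actual results):

```python
import numpy as np

# Hypothetical per-fold R^2 scores of the kind described above
fold_r2 = np.array([0.6, 0.5, -1900.0, 0.4, -3.0])

plain_mean = fold_r2.mean()                       # dominated by extreme negatives
clipped_mean = np.clip(fold_r2, 0, None).mean()   # treat every negative score as 0

print(plain_mean)    # strongly negative
print(clipped_mean)  # 0.3 for these example scores
```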