RMSE is nice because it relates to the Brier score, which is just the name sometimes given to square loss in classification settings.
Depending on how $R^2$ is calculated, it might or might not relate to the RMSE. In such a situation, I would calculate $R^2$ by comparing the Brier score of your model to the Brier score of a naïve model that predicts the prior probability for each category every time, and I would take the stance I discuss here when it comes to an out-of-sample $R^2$. However, not everyone calculates $R^2$ the same way, and there are serious flaws in just calculating the correlation between the predictions and the observed outcomes. (I also do not know how that would work when there are multiple categories.)
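A minimal sketch of that Brier-based $R^2$ for a binary outcome, with the baseline model predicting the prior (in-sample) probability of the positive class every time (the function names here are my own, not standard library calls):

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return np.mean((np.asarray(y_prob, dtype=float) - np.asarray(y_true, dtype=float)) ** 2)

def brier_r2(y_true, y_prob):
    """Brier-based R^2: one minus the ratio of the model's Brier score to the
    Brier score of a baseline that always predicts the prior probability."""
    y = np.asarray(y_true, dtype=float)
    baseline = brier_score(y, np.full_like(y, y.mean()))
    return 1.0 - brier_score(y, y_prob) / baseline
```

A perfect model scores $1$, the naïve baseline scores $0$, and a model worse than the baseline goes negative, mirroring how out-of-sample $R^2$ behaves in regression.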
Perhaps even better than such an approach is to compare using the log-likelihoods. This would be akin to McFadden’s $R^2$.
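A sketch of that log-likelihood comparison for a binary outcome, again with the null model predicting the prior probability for every observation (function names are mine; the clipping of probabilities away from $0$ and $1$ is a practical guard, not part of the definition):

```python
import numpy as np

def log_likelihood(y_true, y_prob, eps=1e-15):
    """Bernoulli log-likelihood of observed 0/1 outcomes under predicted probabilities."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y_true, y_prob):
    """McFadden's R^2: 1 - logL(model) / logL(null), where the null model
    predicts the prior probability of the positive class every time."""
    y = np.asarray(y_true, dtype=float)
    ll_null = log_likelihood(y, np.full_like(y, y.mean()))
    return 1.0 - log_likelihood(y, y_prob) / ll_null
```

As with the Brier-based version, the null model scores $0$ and a perfect model approaches $1$, though McFadden's $R^2$ tends to run lower than the familiar OLS $R^2$ for comparably good fits.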
UCLA has a nice page discussing metrics for logistic regression. Since a neural network classifier is, in some regards, just an amplified logistic regression, there is useful content there (especially where degrees of freedom are not involved, since those could be tough to calculate for a neural network). The last two metrics on the page have some flaws, but I do like my interpretation of their adjusted count, though I take issue with their assertion that "count" (what they call classification accuracy) is a reasonable analogue of the usual $R^2$.
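As I read the UCLA page, the adjusted count measure is the fraction of errors eliminated relative to always guessing the most frequent observed category, i.e. $(n_{\text{correct}} - n_{\text{majority}})/(N - n_{\text{majority}})$. A sketch under that reading (function name is mine):

```python
import numpy as np

def adjusted_count_r2(y_true, y_pred):
    """Adjusted count "R^2": how many classification errors the model eliminates
    relative to a baseline that always guesses the most frequent category."""
    y_true = np.asarray(y_true)
    n = len(y_true)
    n_correct = int(np.sum(y_true == np.asarray(y_pred)))
    # The majority-class guesser gets exactly the count of the modal category right.
    _, counts = np.unique(y_true, return_counts=True)
    n_majority = int(counts.max())
    return (n_correct - n_majority) / (n - n_majority)
```

Unlike raw accuracy, this is $0$ for the majority-class guesser rather than for some model near the prior, which is what makes it a closer (if still imperfect) analogue of $R^2$ than plain "count".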
EDIT
Note that predicting a probability would typically be regarded as a "classification" task, as the measured outcomes are categories. This terminology is at odds with the usual English meaning of "classification," yet it has become standard in much of machine learning. A "regression" neural network, in most machine learning circles, would be a model of a numerical outcome, such as the fastest wind gust during the day.