I have a learning algorithm $A$, which is a neural network, and two datasets, $D_1$ and $D_2$, that share the same set of features. I don't know if it's relevant, but the datasets have different numbers of examples: $D_1$ has 1280 examples whereas $D_2$ has 546.
My goal is to compare the performance (i.e. accuracy, since it's a binary classification problem) of the learning algorithm on the two datasets, so that I can say with some stated confidence things like "the learning algorithm $A$ performs better on $D_1$ than on $D_2$", or "the learning algorithm performs similarly on both datasets".
My current approach would be to perform K-fold cross-validation on both datasets, obtaining two accuracy values, one for each dataset, $a_1$ and $a_2$. But I'm not sure I can compare $a_1$ and $a_2$ directly in a statistically meaningful way to conclude which dataset the algorithm performs better on. Could I instead compute confidence intervals for $a_1$ and $a_2$ and compare them, for example declaring no significant difference if the two intervals overlap and a significant difference if they don't?
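To make the question concrete, here is a minimal sketch of what I have in mind. It is not my actual setup: scikit-learn's `MLPClassifier` stands in for $A$, synthetic data stands in for $D_1$ and $D_2$, and the 95% intervals are a simple normal approximation over the per-fold accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for D_1 (1280 examples) and D_2 (546 examples).
X1, y1 = make_classification(n_samples=1280, n_features=20, random_state=0)
X2, y2 = make_classification(n_samples=546, n_features=20, random_state=1)

def cv_accuracy_with_ci(X, y, n_splits=10, z=1.96):
    """K-fold CV accuracy with a normal-approximation 95% interval."""
    model = MLPClassifier(max_iter=1000)  # stand-in for the actual network A
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(n_splits)
    return mean, (mean - half_width, mean + half_width)

a1, ci1 = cv_accuracy_with_ci(X1, y1)
a2, ci2 = cv_accuracy_with_ci(X2, y2)
print(f"D_1: a_1 = {a1:.3f}, CI = {ci1}")
print(f"D_2: a_2 = {a2:.3f}, CI = {ci2}")
overlap = ci1[0] <= ci2[1] and ci2[0] <= ci1[1]
print("intervals overlap" if overlap else "intervals do not overlap")
```

The overlap check at the end is exactly the comparison rule I described above, which is the part I'm unsure is statistically justified.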
Or are there better approaches to reach my goal?