Both situations are specific cases of test-retest, except that the recall period is null in the first case you described. I would also expect larger agreement in the former case, but that may be confounded with a learning or memory effect. A chance-corrected measure of agreement, like Cohen's kappa, can be used with binary variables, and bootstrapped confidence intervals can be compared between the two situations (this is better than using the sampling variance of $\kappa$ directly). This should give an indication of the reliability of your measures -- or, in this case, of diagnostic agreement -- at the two occasions. A McNemar test, which tests for marginal homogeneity in matched pairs, can also be used.
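As a rough numerical sketch (the data below are simulated and all names are made up for the example), here is one way to compute $\kappa$, a bootstrap percentile CI, and the exact McNemar test in Python with scikit-learn and statsmodels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(42)

# Hypothetical paired binary diagnoses (1 = case, 0 = non-case)
# at two occasions; replace with your own data.
occ1 = rng.integers(0, 2, size=100)
occ2 = np.where(rng.random(100) < 0.8, occ1, 1 - occ1)  # mostly agreeing

# Chance-corrected agreement
kappa = cohen_kappa_score(occ1, occ2)

# Bootstrap percentile CI: resample subject pairs, not individual ratings
n = len(occ1)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(cohen_kappa_score(occ1[idx], occ2[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.3f}, 95% bootstrap CI [{ci_low:.3f}, {ci_high:.3f}]")

# McNemar test for marginal homogeneity on the 2x2 table of paired outcomes
table = np.array(
    [[np.sum((occ1 == 0) & (occ2 == 0)), np.sum((occ1 == 0) & (occ2 == 1))],
     [np.sum((occ1 == 1) & (occ2 == 0)), np.sum((occ1 == 1) & (occ2 == 1))]]
)
res = mcnemar(table, exact=True)
print(f"McNemar exact p-value = {res.pvalue:.3f}")
```

Note that the resampling is done over subject pairs (rows), not over individual ratings, so the within-pair dependence is preserved in the bootstrap distribution.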
An approach based on the intraclass correlation is still valid and, provided your prevalence is not too extreme, should be close to
- a simple Pearson correlation (which, for binary data, is also called a phi coefficient) or the tetrachoric version suggested by @Skrikant,
- the aforementioned kappa (for a large sample, and assuming that the marginal distributions of cases at the two occasions are the same, $\kappa\approx\text{ICC}$ from a one-way ANOVA); a small numerical check follows this list.
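Under the conditions above (non-extreme prevalence, similar margins), these near-equivalences can be checked directly, again on simulated data; the one-way ICC formula below is the standard ANOVA estimator, written out by hand:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
occ1 = rng.integers(0, 2, size=200)
occ2 = np.where(rng.random(200) < 0.85, occ1, 1 - occ1)

# Phi coefficient: simply the Pearson correlation of the two 0/1 vectors
phi = np.corrcoef(occ1, occ2)[0, 1]

# One-way ICC: treat each subject as a "group" with k=2 ratings
data = np.column_stack([occ1, occ2]).astype(float)
n, k = data.shape
grand = data.mean()
ms_between = k * np.sum((data.mean(axis=1) - grand) ** 2) / (n - 1)
ms_within = np.sum((data - data.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

kappa = cohen_kappa_score(occ1, occ2)
print(f"phi = {phi:.3f}, one-way ICC = {icc:.3f}, kappa = {kappa:.3f}")
```

With similar margins at the two occasions the three values land close together, as suggested above; they drift apart as prevalence becomes extreme or the margins diverge.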
About your bonus question, you generally need at least 3 time points to separate a lack of (temporal) stability -- which can occur if the latent class or trait you are measuring evolves over time -- from a lack of reliability (for an illustration, see the model proposed by Wiley and Wiley, 1970, American Sociological Review 35).
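For reference, their three-wave model can be sketched roughly as follows (written from memory, so treat the notation as an approximation of the original paper):

$$
x_t = \tau_t + \epsilon_t, \qquad \tau_t = \beta_t\,\tau_{t-1} + \zeta_t \quad (t = 2, 3),
$$

with the error variance $\sigma^2_\epsilon$ assumed constant across waves. Reliability at wave $t$ is then $\operatorname{var}(\tau_t)/\operatorname{var}(x_t)$, while the $\beta_t$ carry the stability of the latent trait. With only two waves there are more free parameters than observed variances and covariances (4 vs. 3), so reliability and stability cannot both be identified -- hence the need for a third time point.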