A first clarification: identification refers to structural parameters; it means the ability to recover these parameters from observed quantities. If several parameter values are observationally equivalent, then the parameters are not identified. I think the initial discussion in this paper by Pearl might help with some of the ideas. For now, let's assume $n = \infty$ to set aside sampling issues, which are not the main concern here.
The linear projection of $y$ on $X$ is always "identified," since it is a property of the joint distribution of the data. You don't need any assumptions about the error term to "identify" the linear projection; it's the other way around: the linear projection defines its own error term. Hence $\beta^{OLS} = E[X'X]^{-1}E[X'y]$ (the population analogue of the familiar $(X'X)^{-1}X'y$) is what it is regardless of what assumptions you make about the structural process. And you can also write $y = X\beta^{OLS} + \epsilon^{OLS}$ with $E[X'\epsilon^{OLS}] = 0$, which holds by construction of the projection. So you can always get $\beta^{OLS}$ (provided $E[X'X]$ is invertible), but that doesn't mean it represents anything structurally meaningful.
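To make this concrete, here is a minimal simulation sketch (the setup and numbers are my own, not part of the question): even with an endogenous regressor, OLS recovers the linear projection coefficient; it just isn't the structural $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # large n stands in for the n = infinity assumption above

# Structural model: y = 1.0 * x + eps, with x correlated with eps (endogeneity)
u = rng.normal(size=n)
eps = 0.8 * u + rng.normal(size=n)   # E[x * eps] != 0 by construction
x = u + rng.normal(size=n)
y = 1.0 * x + eps                    # structural beta = 1.0

X = np.column_stack([np.ones(n), x])          # include an intercept
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y
resid = y - X @ beta_ols

print(beta_ols[1])      # ~1.4: the projection coefficient, not the structural 1.0
print(X.T @ resid / n)  # ~0: the projection defines its own orthogonal error
```

The second print confirms that the projection error is orthogonal to $X$ by construction, whatever the structural model happens to be.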
Likewise, the conditional expectation $E[y|X]$ is always "identified," since it's just a property of the data. You may misspecify the conditional expectation and estimate it incorrectly (for example, assuming it's linear when it's not), but there's no identification problem. For instance, given enough data, we nowadays have universal approximation algorithms that can, in theory, estimate this quantity (on a bounded domain).
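For illustration, a minimal sketch (a made-up nonlinear example): simple local averaging, one of the crudest flexible estimators, recovers a nonlinear $E[y|x]$ that a linear specification would miss.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(size=n)   # true E[y|x] = sin(x), clearly nonlinear

# Local averaging: estimate E[y|x] by the mean of y within narrow bins of x
bins = np.linspace(-3, 3, 61)        # 60 bins of width 0.1
idx = np.digitize(x, bins) - 1
centers = (bins[:-1] + bins[1:]) / 2
cond_mean = np.array([y[idx == k].mean() for k in range(60)])

print(np.max(np.abs(cond_mean - np.sin(centers))))  # small: E[y|x] is recovered
```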
Thus, identification is about neither the best linear approximation nor the "true" conditional expectation; it is about structural quantities. The question here is: assuming $y = X\beta + \epsilon$ is structural, what do we have to assume about $\epsilon$ to be able to identify $\beta$?
In your case, both assumptions (1) and (3) work for identification. For (1), premultiply each side of the equation by $X'$ and take expectations to obtain:
$$
E[X'y] = E[X'X]\beta + E[X'\epsilon] = E[X'X]\beta
$$
Then, just solve for $\beta$ to get $\beta = E[X'X]^{-1}E[X'y]$, which is precisely the linear projection coefficient from before (again, provided $E[X'X]$ is invertible). The case of (3) is also straightforward. Taking the expectation conditional on $X$:
$$
E[y|X] = X\beta + E[\epsilon|X] = X\beta
$$
Thus, we see that the structural conditional expectation equals the observed conditional expectation. Furthermore, by the law of iterated expectations, $E[X'\epsilon] = E[E[X'\epsilon|X]] = E[X'E[\epsilon|X]] = 0$, so you can actually solve for $\beta$ as before. Assumption (2) is not enough, since it imposes only the single restriction $E[y] = E[X]\beta$: if $X$ has more than one dimension, many different parameter vectors satisfy that one equation, and that is why the structural parameters are not identified.
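Here is a minimal sketch of that observational equivalence (with assumed numbers): given the observed data, any $\beta$ satisfying $E[X]\beta = E[y]$ produces a mean-zero implied error, so assumption (2) alone cannot distinguish among them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(n, 2))  # E[X] = (1, 2)
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)       # "true" beta = (2, 1), so E[y] = 4

# Both vectors satisfy E[X] beta = E[y] = 4, so both implied errors are mean zero
for beta in (np.array([2.0, 1.0]), np.array([4.0, 0.0])):
    eps = y - X @ beta
    print(beta, eps.mean())  # ~0 in both cases: indistinguishable under (2) alone
```

Assumptions (1) or (3) do rule this out: for $\beta = (4, 0)$ above, the implied error is correlated with the first regressor, so $E[X'\epsilon] \neq 0$.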