(This answer will set aside the issues related to PCA or other dimension-reduction techniques.)
It is possible that your classes simply are not separable on your features. For instance, suppose category $1$ has the bivariate distribution $N\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}1&0\\0&1\end{pmatrix}\right)$ and category $2$ has the distribution $N\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}1&0\\0&2\end{pmatrix}\right)$, but the only feature you observe is the first one, in which both categories have standard normal distributions. Then you cannot distinguish between the classes at all. In more formal language, the posterior probability of either category, given an observation of that first feature, equals the prior probability of that category: $P(\text{category } c \mid \text{feature } 1) = P(\text{category } c)$.
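To make this concrete, here is a minimal simulation sketch, assuming NumPy and scikit-learn and using quadratic discriminant analysis as a stand-in for whatever classifier you prefer. Trained on feature $1$ alone, it performs at chance; given both features, it can exploit the variance difference in feature $2$.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 50_000  # large samples so the accuracy estimates are stable

# Category 1: N((0,0)', I); category 2: same mean, but variance 2 in feature 2
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=n),
    rng.multivariate_normal([0, 0], [[1, 0], [0, 2]], size=n),
])
y = np.repeat([0, 1], n)

# Shuffle and hold out half the data for testing
idx = rng.permutation(2 * n)
train, test = idx[:n], idx[n:]

for cols, label in [([0], "feature 1 only"), ([0, 1], "both features")]:
    model = QuadraticDiscriminantAnalysis()
    model.fit(X[train][:, cols], y[train])
    print(label, "test accuracy:", round(model.score(X[test][:, cols], y[test]), 3))
```

On feature $1$ alone, the test accuracy hovers around $0.5$, i.e., chance; with both features it should land near $0.58$, roughly the Bayes rate for this particular variance difference.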
In a more realistic scenario, the feature distributions of the two categories will not be exactly the same, but they might differ only slightly. In such a scenario there is some ability to distinguish between the two classes, but only a slight one.
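To see how slight differences yield only slight separability, consider two univariate class-conditional distributions $N(0, 1)$ and $N(\delta, 1)$ with equal priors: the Bayes-optimal rule classifies by thresholding at $\delta/2$, and its accuracy is $\Phi(\delta/2)$. A short sketch, assuming SciPy, tabulates this ceiling:

```python
from scipy.stats import norm

# Classes: feature ~ N(0, 1) vs. N(delta, 1), with equal priors.
# The Bayes-optimal rule thresholds at delta/2; its accuracy is Phi(delta/2).
for delta in [0.0, 0.1, 0.25, 0.5, 1.0, 2.0]:
    print(f"mean gap {delta:4.2f}: best achievable accuracy {norm.cdf(delta / 2):.3f}")
```

A gap of a tenth of a standard deviation buys an accuracy ceiling of about $52\%$; no model, however flexible, can do better on that feature.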
If your classes have considerable overlap that precludes strong performance, yet you force a model to separate them by fitting to coincidences in the training data, that absolutely can result in overfitting and terrible out-of-sample performance. A picture in the Wikipedia article on overfitting shows this to some extent.

Imagine if the red and blue dots overlapped considerably. Yes, it would be possible to snake a decision boundary around so that most of the points land on the correct side (you could even get every point right if you dropped the requirement that the decision boundary be continuous), but such a boundary would likely have fit coincidences in the data rather than the real trend.
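The same point can be made numerically. A $1$-nearest-neighbor classifier is the extreme case of a boundary that snakes around every training point; on heavily overlapping classes it is perfect in-sample and roughly at chance out-of-sample. A sketch, again assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def draw(n=200):
    # Heavily overlapping classes: same mean, only a small variance difference
    X = np.vstack([rng.normal(0, 1.0, (n, 2)), rng.normal(0, 1.2, (n, 2))])
    y = np.repeat([0, 1], n)
    return X, y

X_train, y_train = draw()
X_test, y_test = draw()

# 1-NN puts every training point on its own side of the boundary
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))  # 1.0 by construction
print("test accuracy: ", knn.score(X_test, y_test))    # near 0.5: chance level
```

The gap between those two numbers is the overfitting: the model memorized the coincidences, and the coincidences did not replicate.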