In my field, it is very common to work with datasets on the order of a few million rows and a couple hundred features. Commonly used dimensionality reduction techniques include:
- PCA. The first thing in the book, indeed; it has a ton of drawbacks but works well if you do not have to deal with covariate shift when transferring your model to new data and you plan on using something like an SVM later. PCA + SVM with an RBF kernel is a very dirty solution which tends to produce pretty good results that no one can reproduce on new data nor interpret. On the same note, the PCA cutoff is more often than not a bit speculative. I would suggest using it at the exploratory stage: throw away the components that are supposedly wiggling around the noise threshold and look only at the first few, which are undoubtedly important (the first sketch right after this part of the list shows what I mean). Once you get a feel for the dataset, move on to something else.
- GMM and the ICA family, projection pursuit. These can help if you know you are dealing with latent variables linearly mixed together, but I do not think they apply well to categorical data, and I am not the only one. You could always split your features into continuous and categorical and deal with them separately for a start. If all your categorical data is ordinal, especially binary, you can treat it as continuous, albeit with some caveats. It may make sense to treat a variable representing, e.g., income brackets as a continuous variable with very poor resolution; this is routinely done, and it is also low-hanging fruit for the domain experts: come up with a better transformation of the categorical data into numbers and, boom, the model has improved (the second sketch after this part of the list illustrates both points).
Personally, among exploratory methods I probably prefer spectral clustering, but it does not scale very well.
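
To make the PCA point concrete, here is a minimal sketch of that exploratory-then-dirty-baseline workflow in scikit-learn. The synthetic data, the 0.95 variance cutoff and the SVM settings are stand-ins, not recommendations:

```python
# Exploratory PCA pass followed by the "dirty" PCA + RBF-SVM baseline.
# Synthetic data stands in for the real dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=200, n_informative=20, random_state=0)

# Look at how fast the explained variance decays before trusting any cutoff.
cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1   # speculative cutoff; eyeball the curve too
print("first components:", np.round(cumvar[:5], 3), "-> keeping", n_keep)

# Scale, project, RBF-SVM. Kernel SVMs will not scale to millions of rows,
# so in practice this runs on a subsample.
clf = make_pipeline(StandardScaler(), PCA(n_components=n_keep), SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
```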
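
And a rough illustration of the continuous/categorical split from the GMM/ICA point. The column names and the bracket-to-number mapping are invented for the example, and that mapping is exactly the part a domain expert can improve:

```python
# Split continuous and categorical columns, run FastICA on the continuous block,
# and map an ordinal variable (income brackets) onto a crude numeric scale.
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 1000),
    "spend": rng.lognormal(3, 1, 1000),
    "income_bracket": rng.choice(["<30k", "30-60k", "60-100k", ">100k"], 1000),
})

# ICA only sees the continuous block.
sources = FastICA(n_components=2, random_state=0).fit_transform(df[["age", "spend"]].to_numpy())

# Ordinal categorical treated as continuous with very poor resolution:
# a hand-picked monotone mapping (rough bracket midpoints).
bracket_value = {"<30k": 15e3, "30-60k": 45e3, "60-100k": 80e3, ">100k": 150e3}
df["income_numeric"] = df["income_bracket"].map(bracket_value)
```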
- Supervised learning. This enables you to use a wide range of tools, such as metric-based clustering (the Mahalanobis distance works well for high-dimensional continuous data; GMMs use it under the hood) or running something like random forests over your data and looking at the feature importances afterwards (the first sketch after this list). If you are dealing with anomaly detection or otherwise important minority classes, sampling becomes a hard problem in its own right.
- (bonus) "Manually" tweaking some of the above: if you can not fit the entire dataset in memory, you could still potentially use memmaps, factorize matrices, build estimators approximating the "true values" stochastically, treating incoming samples as a stream, and so forth. A good and valuable approach, but typically reserved for when you are reasonably sure this solution works.
Feature embedding is an entire field of study, and it is heavily data-dependent. I would argue that blind data mining is largely a thing of the past, and you need to bring some domain knowledge along, after all.