The key thing to do in this case is to properly account for the feature selection as part of the modeling process. That means that inside your CV loop, you fit the feature selection procedure on the training fold only, and then apply that same, fixed selection to the held-out fold. This minimizes information leakage between training and testing; failing to do so can produce overly optimistic results and spurious findings.
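Here is a minimal sketch of what that looks like, assuming a hypothetical numeric matrix `X` and a 0/1 response `y`, with a crude univariate screen standing in for whatever selection procedure you actually use:

```r
## Feature selection performed *inside* each CV fold.
## `X` (numeric matrix) and `y` (0/1 numeric vector) are assumed to exist.
set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(X)))
acc   <- numeric(k)

for (i in 1:k) {
  train <- folds != i

  ## 1. Fit the selection on the TRAINING fold only
  ##    (here: keep the 50 features with the largest |correlation| with y).
  scores   <- abs(apply(X[train, , drop = FALSE], 2, function(col) cor(col, y[train])))
  selected <- order(scores, decreasing = TRUE)[1:50]

  Xtr <- X[train,  selected, drop = FALSE]
  Xte <- X[!train, selected, drop = FALSE]
  colnames(Xtr) <- colnames(Xte) <- paste0("f", seq_along(selected))

  ## 2. Fit the model on the training fold, restricted to the selected features.
  fit <- glm(y ~ ., data = data.frame(y = y[train], Xtr), family = binomial())

  ## 3. Apply the SAME selection to the held-out fold; never re-select there.
  pred   <- predict(fit, newdata = data.frame(Xte), type = "response")
  acc[i] <- mean((pred > 0.5) == y[!train])
}
mean(acc)
```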
Elastic net regression allows one to fit a linear model, perform feature selection, and control for correlation among variables all at once. The R package glmnet supports GLMs (including logistic regression) on sparse matrices, so it makes a good first-pass linear model of the data. However, be aware that the features selected by a linear model may have little to do with what's important in a nonlinear model (cf. this answer of mine).
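A short sketch of that first pass, assuming a hypothetical sparse design matrix `X_sparse` (a `dgCMatrix` from the Matrix package) and a 0/1 response `y`:

```r
## Elastic-net logistic regression on a sparse design matrix with glmnet.
library(Matrix)
library(glmnet)

## alpha = 1 is the lasso, alpha = 0 is ridge; values in between mix the two.
cvfit <- cv.glmnet(X_sparse, y, family = "binomial", alpha = 0.5)

## Features with nonzero coefficients at the CV-chosen penalty are "selected".
coefs    <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(coefs)[coefs != 0], "(Intercept)")
```

Remember that this selection step is itself part of the modeling process, so it belongs inside the outer CV loop described above.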
Next steps are models that are particularly adept at handling sparse feature vectors. Factorization machines were conceived for exactly this task (see the model written out below), though they are naturally a bit more complicated than linear models.(1)
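For reference, the second-order factorization machine model from (1), writing $p$ for the number of features and $k$ for the latent dimension, is

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,$$

where each feature $i$ gets a latent vector $\mathbf{v}_i \in \mathbb{R}^k$. Because pairwise interaction weights are factorized through these latent vectors, interactions can be estimated even between features that rarely (or never) co-occur in a sparse training set.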
Linear SVMs are another option. Inner products of sparse vectors are cheap to compute. There are even packages like sofia that solve the SVM problem in the primal, which means that the complexity is driven by the number of features rather than the number of observations. This can be a nice property, but keep in mind that just because a tool is efficient doesn't mean it's the right one for the job.
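A minimal sketch of a primal linear SVM in R, using the LiblineaR package as a stand-in (sofia-ml itself is a separate command-line tool); `X` and `y` are again assumed to exist:

```r
## Linear SVM via LiblineaR (an interface to LIBLINEAR).
library(LiblineaR)

## type = 2 is L2-regularized L2-loss support vector classification,
## solved in the primal.
fit  <- LiblineaR(data = X, target = y, type = 2, cost = 1)
pred <- predict(fit, newx = X)$predictions
```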
I'm often hesitant to assume that a method like PCA will solve my problems, because it isn't "$y$-aware": the projection may not contain any useful information about the target of the modeling process, and in some cases it will even discard or destroy the only useful information. On the other hand, some people insist that its application is usually good enough for their needs, so your mileage may vary. PLS (partial least squares) is the "$y$-aware" counterpart: it constructs projections that are informative about the response.
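A hedged sketch of a "$y$-aware" projection with the pls package, assuming a numeric matrix `X` and a numeric response `y` (for a binary outcome you would want a PLS-DA variant instead):

```r
## Partial least squares regression: components are chosen to explain y,
## not just the variance of X.
library(pls)

df   <- data.frame(y = y)
df$X <- X    # the pls package accepts a predictor matrix stored as one column

fit <- plsr(y ~ X, ncomp = 10, data = df, validation = "CV")
summary(fit)           # cross-validated error by number of components
scores(fit)[, 1:2]     # the first two y-aware components
```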
Instead of PCA for dimension reduction, non-negative matrix factorization might be useful. Suppose you use binary (one-hot) encodings for all of your categories. The resulting matrix is non-negative, so non-negative matrix factorization will often be able to discover some latent structure in that data: a small number of vectors, each coinciding with a particular pattern of association among your different categories. For example, the same people who enjoy the movies Ant-Man and Avengers probably also enjoy Iron Man; "superhero movies" would represent a latent factor that NMF could discover.
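A small sketch with the NMF package, assuming a hypothetical non-negative (binary-encoded) matrix `X_binary`:

```r
## Non-negative matrix factorization of a binary-encoded design matrix.
## (NMF may complain about all-zero rows or columns, which may need dropping first.)
library(NMF)

res <- nmf(X_binary, rank = 20)   # rank = number of latent factors

W <- basis(res)   # observations x factors: each row's loadings on the factors
H <- coef(res)    # factors x original columns: which categories define each factor

## W can then serve as a compact, non-negative feature representation.
```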
Boosted trees are a workhorse model; some people consider them the "model to beat" when starting a new project.
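As a sketch, xgboost accepts the same sparse `dgCMatrix` used above (again assuming `X_sparse` and a 0/1 `y`):

```r
## Gradient-boosted trees on a sparse design matrix.
library(xgboost)

fit <- xgboost(data = X_sparse, label = y, objective = "binary:logistic",
               nrounds = 200, max_depth = 4, eta = 0.1, verbose = 0)

pred <- predict(fit, X_sparse)   # predicted probabilities
xgb.importance(model = fit)      # which features the trees actually used
```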
(1) Steffen Rendle, "Factorization Machines," Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM), 2010.