
I have a data set that is a mix of ordinal and numerical variables.

The problem is that the ordinal variables are on different scales, such as 0-2 and 1-4. The data set has 35 variables. I want to perform some type of data reduction on it, but I am unsure what to do given the different scales.

I am considering techniques such as:

  1. Random Forest
  2. Principal Component Analysis

However, I am not sure whether these are applicable, given the different scales.

I am using pandas.

3 Answers

0

Categorical variables don't have a scale as such. In your case, the categorical variables are binned into different groups. Calculate the value counts of these categories; you can then merge the smaller categories into one group, which reduces the number of input features. Otherwise, encoding every category adds more sparse input features to your data and leads to the curse of dimensionality.
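As a minimal sketch of the counting-and-merging step in pandas: the helper name `merge_rare_levels`, the `min_share` threshold, the `"other"` label and the toy `grade` column are all made up for illustration, not part of any library.

```python
import pandas as pd


def merge_rare_levels(s: pd.Series, min_share: float = 0.05,
                      other_label: str = "other") -> pd.Series:
    """Replace levels whose relative frequency is below min_share."""
    # Relative frequency of each level (hypothetical threshold rule).
    shares = s.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent levels, overwrite the rare ones with a single label.
    return s.where(~s.isin(rare), other_label)


# Toy data: two frequent levels and two rare ones.
df = pd.DataFrame({"grade": ["a"] * 10 + ["b"] * 8 + ["c"] + ["d"]})
print(df["grade"].value_counts())

df["grade_binned"] = merge_rare_levels(df["grade"], min_share=0.10)
print(df["grade_binned"].value_counts())
```

For a truly ordinal variable you would probably merge a rare level into its neighbouring level instead of a catch-all group, so that the order is preserved.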

If you feel that all 35 variables are important, you can try one of the following strategies:

  • Principal component analysis
  • t-SNE
  • UMAP etc.
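For the PCA option, one common way to deal with the different scales is to standardize every column first. The sketch below uses scikit-learn and synthetic data; the column names are invented, and treating the ordinal codes as plain numbers assumes the levels are roughly equally spaced, which may not hold for your data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real data: two ordinal codes on
# different scales (0-2 and 1-4) plus two numerical columns.
df = pd.DataFrame({
    "ord_0_2": rng.integers(0, 3, size=n),
    "ord_1_4": rng.integers(1, 5, size=n),
    "income": rng.normal(50_000, 12_000, size=n),
    "age": rng.normal(40, 10, size=n),
})

# Standardize each column to mean 0 and variance 1, then project
# onto the first two principal components.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipe.fit_transform(df)

explained = pipe.named_steps["pca"].explained_variance_ratio_
print(scores.shape)
print(explained)
```

The same standardized matrix could also be passed to `sklearn.manifold.TSNE` if you want a nonlinear embedding for visualization.
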
0

There is a paper that discusses a Bayesian factor analysis for mixed ordinal and continuous variables:

Kevin M. Quinn. 2004. “Bayesian Factor Analysis for Mixed Ordinal and Continuous Responses.” Political Analysis. 12: 338-353.

This is implemented in the MCMCpack package in R: MCMCmixfactanal().

Another option would be the generalized low rank model (GLRM) which is implemented in h2o. This is similar in spirit to PCA, but without the linearity assumption (and it also permits missing data).

0

There is an R package called PCAmixdata which is described in the preprint Chavent, M., Kuentz-Simonet, V., Labenne, A., Saracco, J., Multivariate analysis of mixed data: The PCAmixdata R package.

They handle categorical and continuous variables together in the algorithm, mainly by multiplying by a matrix that allows different metrics. I wonder if there could be an analogous extension for ordinal variables.
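To give a rough idea of what "different metrics" can mean in practice, here is a Python sketch of a FAMD-style weighting: numerical columns are standardized, indicator columns are divided by the square root of the level frequency, and an ordinary SVD is run on the result. This is my own simplified illustration, not the PCAmixdata implementation; the function name `mixed_pca` and the toy columns are hypothetical, and the ordinal column is treated as unordered categories here.

```python
import numpy as np
import pandas as pd


def mixed_pca(df, num_cols, cat_cols, n_components=2):
    """PCA on a weighted matrix built from numeric and categorical columns."""
    # Numeric block: z-scores using the population standard deviation.
    num = df[num_cols].astype(float)
    z_num = (num - num.mean()) / num.std(ddof=0)

    # Categorical block: one indicator column per level, divided by
    # sqrt(level proportion) so rare levels are not drowned out, then centered.
    ind = pd.get_dummies(df[cat_cols].astype(str)).astype(float)
    z_cat = ind / np.sqrt(ind.mean())
    z_cat = z_cat - z_cat.mean()

    z = np.hstack([z_num.to_numpy(), z_cat.to_numpy()])
    u, s, _ = np.linalg.svd(z, full_matrices=False)

    scores = u[:, :n_components] * s[:n_components]
    eigvals = s ** 2 / len(df)  # all eigenvalues, largest first
    return scores, eigvals


rng = np.random.default_rng(1)
n = 150
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(10, 3, size=n),
    "level": np.array(["low", "mid", "high"])[np.arange(n) % 3],
})

scores, eigvals = mixed_pca(toy, ["x1", "x2"], ["level"])
print(scores.shape)
print(eigvals)
```

With this weighting each numerical variable contributes a variance of 1 and a categorical variable with m levels contributes m - 1, so no single variable dominates just because of its coding.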

Update: looking around I found this interesting question in which someone is trying to use PCA on ordinal data. There are some interesting suggestions, especially regarding the R homals package.

Here is the abstract from their software paper (in the link above):

Homogeneity analysis combines the idea of maximizing the correlations between variables of a multivariate data set with that of optimal scaling. In this article we present methodological and practical issues of the R package homals which performs homogeneity analysis and various extensions. By setting rank constraints nonlinear principal component analysis can be performed. The variables can be partitioned into sets such that homogeneity analysis is extended to nonlinear canonical correlation analysis or to predictive models which emulate discriminant analysis and regression models. For each model the scale level of the variables can be taken into account by setting level constraints. All algorithms allow for missing values.

jmarkov