19

What methods/implementations are available in R/Python to discard unimportant features or select important features in data? My data does not have labels (unsupervised).

The data has ~100 features with mixed types. Some are numeric while others are binary (0/1).

Jeromy Anglim
  • 44,984
learner
  • 915
  • What kind of unsupervised learning algorithm are you using? What does your data look like? – Max Candocia Jul 21 '14 at 17:55
  • @user1362215, Before applying any unsupervised algorithm, I am trying to find a way to perform feature removal. – learner Jul 21 '14 at 18:23
  • Have you seen this scikit-learn cheatsheet before? It may help you get started... – Steve S Jul 21 '14 at 19:15
  • Why not use an unsupervised method that performs feature selection by itself, like random forest in unsupervised mode? – JEquihua Jul 21 '14 at 19:20
  • @SteveS, The cheatsheet is really helpful. Thanks. – learner Jul 21 '14 at 20:17
  • @JEquihua, Considering that the data I am exploring is for anomaly detection (most of the data is normal, ~5-10% anomalous), is it okay to apply RandomForest? – learner Jul 21 '14 at 20:18
  • 1
    I'm not completely sure. I mean, random forest is completely non-parametric, so don't worry about assumptions. What I'm not sure about is whether it will serve your purpose. What I CAN say is that there is a version of Random Forest just for "anomaly detection" called isolation forests: http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation There was an implementation in R, but I'm not sure if it's up and running as of now. – JEquihua Jul 22 '14 at 15:13
  • One technique that might be helpful is 'One-Class SVM' (i.e., OCSVM). It looks like there's an example in [this previous stackexchange post](http://stackoverflow.com/questions/27375517/one-class-classification-with-svm-in-r); a minimal sketch also follows these comments. I think the original reference for it is here and Google results for 'OCSVM' point to anomaly detection examples – mtreg Oct 21 '15 at 20:20
  • The focus on R and Python seems to me to be off-topic by present standards (indeed by those of 2014). Stripped of R or Python, the question becomes "What are the available methods to discard/select unimportant/important features in data?", which is very broad indeed. Still, neither of those seems to have stopped the thread being interesting or useful. – Nick Cox Mar 15 '18 at 15:25
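
A minimal R sketch of the one-class SVM idea from the comments above (the e1071 package, the simulated matrix x, and the nu value are illustrative choices, not something given in the thread):

library(e1071)

# x: numeric data matrix; in practice, scale/encode the ~100 mixed-type features first
x <- matrix(rnorm(1000 * 10), ncol = 10)

# nu roughly bounds the fraction of training points treated as outliers
# (the comments above mention ~5-10% anomalies)
oc <- svm(x, type = "one-classification", nu = 0.05, kernel = "radial")

# TRUE = predicted "normal", FALSE = flagged as anomalous
inliers <- predict(oc, x)
table(inliers)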

6 Answers

11

It's a year old, but I still feel it is relevant, so I just wanted to share my Python implementation of Principal Feature Analysis (PFA), as proposed in the paper that Charles linked to in his answer.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]

        # Standardize the features
        sc = StandardScaler()
        X = sc.fit_transform(X)

        # PCA: row i of A_q holds feature i's loadings on the first q components
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T

        # Cluster the features in the subspace spanned by those components
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # Within each cluster, keep the single feature closest to the centroid
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

You can use it like this:

import numpy as np
X = np.random.random((1000,1000))

pfa = PFA(n_features=10)
pfa.fit(X)

# To get the transformed matrix
X = pfa.features_

# To get the column indices of the kept features
column_indices = pfa.indices_

This strictly follows the algorithm described in the article. I think the method has promise, but honestly I don't think it's the most robust approach to unsupervised feature selection. I'll post an update if I come up with something better.

Nick Cox
  • 56,404
  • 8
  • 127
  • 185
Ulf Aslak
  • 396
  • In the method described in the paper that you link to, Step 1 is to calculate the covariance matrix and step 2 is to calculate PCA on the covariance matrix from Step 1. I believe your fit function skips Step 1, and performs PCA on the original dataset. – user35581 Oct 01 '18 at 15:49
  • @user35581 good point. However, what they do is (1) produce the covariance matrix from the original data and then (2) compute eigenvectors and eigenvalues of the covariance matrix using the SVD method. Those two steps combined are what you call PCA. The Principal Components are the eigenvectors of the covariance matrix of the original data. – Ulf Aslak Oct 02 '18 at 07:39
  • @Ulf Aslak can you elaborate why you think it is not the most robust approach to unsupervised feature selection? – hipoglucido Dec 19 '19 at 09:31
  • 1
    @hipoglucido honestly, I can't account for my thoughts when I wrote that. It's three years ago. Reviewing the code again, I'm led to believe it has something to do with the use of KMeans (which is non-deterministic). Also, I would like to see how this compares to just clustering the non-PCA-transformed features. – Ulf Aslak Dec 22 '19 at 04:26
  • @UlfAslak. Thank you very much. Do you think it works for time-series data that has lots of 0s and 1s? Have you come up with an update to your implementation? – Avv Jul 09 '21 at 21:49
  • 1
    @Avra well, you need to have a data matrix of samples (rows) and features (columns). The method is very simple if you take a couple of minutes to study the code. In brief, it just sorts features by their distance to the cluster means of clustered features in PC space. What I don't like about it is that it uses KMeans. That's a terrible algorithm which gives weird results in the real world. Anyway, I don't think time-series data is well suited for this method. You could perform this at each timestep, but why? – Ulf Aslak Jul 11 '21 at 06:20
  • Instead of KMeans, HDBSCAN could be a good solution I suppose. I will post the code if I can implement it. – banikr Jul 08 '22 at 15:24
3

The sparcl package in R performs sparse hierarchical and sparse K-means clustering. This may be useful. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2930825/
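
A minimal sketch of how sparcl's sparse k-means could be used to drop features (the simulated matrix x, the choice of K = 4, and the wbounds grid are illustrative, not taken from the paper):

library(sparcl)

# Simulated numeric data matrix for illustration
x <- matrix(rnorm(200 * 20), ncol = 20)

# Choose the L1 bound on the feature weights via the permutation (gap) approach
perm <- KMeansSparseCluster.permute(x, K = 4, wbounds = seq(1.5, 6, by = 0.5), nperms = 10)

# Sparse k-means with the chosen bound; each feature gets a weight in ws
sk <- KMeansSparseCluster(x, K = 4, wbounds = perm$bestw)

# Features with zero weight play no role in the clustering and can be discarded
selected <- which(sk[[1]]$ws > 0)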

charles
  • 2,651
1

Principal Feature Analysis looks to be a solution to unsupervised feature selection. It's described in this paper.

0

I've found a link which could be useful: http://www.cad.zju.edu.cn/home/dengcai/Data/MCFS.html These are Matlab implementations of a multi-cluster feature selection (MCFS) method, and they may work out for you. You can find a strong foundation for it in recent papers. Let me know if it works for you.

0

There are many options available in R. A good place to look is the caret package, which provides a nice interface to many other packages and options. You can take a look at the website here. Of the many options out there, I will illustrate one.

Here is an example of using a simple filter on the built-in R "mtcars" dataset (first rows shown below).

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Now some code setup (loading packages, etc.):

# setup a parallel environment
library(doParallel)
cl <- makeCluster(2) # number of cores to use
registerDoParallel(cl)
library(caret)

And we can fit a simple model to select variables:

fit1 <- sbf(mtcars[, -1], mtcars[, 1],
  sbfControl =
    sbfControl(functions = rfSBF, method = "repeatedcv", repeats = 10)
)

Viewing the results, we get:

fit1
Selection By Filter

Outer resampling method: Cross-Validated (10 fold, repeated 10 times) 

Resampling performance:

  RMSE Rsquared RMSESD RsquaredSD
 2.266   0.9224 0.8666     0.1523

Using the training set, 7 variables were selected:
   cyl, disp, hp, wt, vs...

During resampling, the top 5 selected variables (out of a possible 9):
   am (100%), cyl (100%), disp (100%), gear (100%), vs (100%)

On average, 7 variables were selected (min = 5, max = 9)

Finally, we can plot the selected variables (in fit1$optVariables) against the outcome, mpg:

library(ggplot2)
library(gridExtra)
do.call(grid.arrange,
  lapply(fit1$optVariables, function(v) {
    ggplot(mtcars, aes_string(x = v, y = "mpg")) +
      geom_jitter()
  }))

Resulting in this graph: [scatter plots of each selected variable against mpg]

Joshua
  • 2,103
  • 1
    This isn't unsupervised learning as the OP requested, since your model uses mpg as an outcome. Is there a way to use methods like these in unsupervised models? – Max Ghenis Aug 09 '15 at 04:33
0

The nsprcomp R package provides methods for sparse Principal Component Analysis, which could suit your needs.

For example, if you believe your features are generally correlated linearly, and want to select the top five, you could run sparse PCA with a max of five features, and limit to the first principal component:

library(nsprcomp)

# k = 5 nonzero loadings, restricted to the first principal component
m <- nsprcomp(x, scale. = TRUE, k = 5, ncomp = 1)
m$rotation[, 1]

Alternatively, if you want to capture the orthogonal nature of features, you can select the top feature from each of the top five PCs, limiting each PC to one feature:

# one nonzero loading per component, across the first five components
m <- nsprcomp(x, scale. = TRUE, k = 1, ncomp = 5)
m$rotation

An ensemble of these could be useful too; i.e., features that consistently come to the top across different methods are likely to explain a large amount of variance in the feature space. Having played around with nsprcomp a bit, it seems the two approaches above surface roughly half of the same features. That said, optimizing this process may be an empirical effort.
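
As a rough sketch of that ensemble idea (the simulated matrix x and the feature count are illustrative; the comparison simply checks which features the two calls above agree on):

library(nsprcomp)

# Illustrative data: 100 observations of 8 named numeric features
x <- matrix(rnorm(100 * 8), ncol = 8,
            dimnames = list(NULL, paste0("f", 1:8)))

m1 <- nsprcomp(x, scale. = TRUE, k = 5, ncomp = 1)  # five features on the first PC
m2 <- nsprcomp(x, scale. = TRUE, k = 1, ncomp = 5)  # one feature per PC, five PCs

sel1 <- names(which(m1$rotation[, 1] != 0))  # nonzero loadings on PC1
sel2 <- rownames(m2$rotation)[apply(m2$rotation != 0, 2, which)]  # one feature per column (k = 1)

# Features that both runs put at the top
intersect(sel1, sel2)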