4

I have standard data, where rows are observations and columns are features.

          target colum_1 colum_2 colum_10 colum_100110 colum_499999999
    [1,]       1   -0.35   -1.58     1.26         1.08            0.30
    [2,]       1   -1.21    2.05    -0.95         1.59           -0.59
    [3,]       1   -0.15   -1.63     0.63        -0.74            0.60
    [4,]       0    0.78    0.55    -1.31         0.24           -0.22
    [5,]       0    0.68    0.36     0.25        -0.23            1.73
    [6,]       1   -0.32    1.07    -0.13        -0.31           -1.26
    [7,]       1   -0.37    0.47     1.11        -1.14           -0.43
    [8,]       1   -0.85    0.96    -1.61         0.62            0.06
    [9,]       1    0.19    0.62    -1.28         1.31            0.30
   [10,]       1    0.16    1.35    -0.11         1.14           -2.03

The problem is that there are so many features that I cannot run any learning algorithm. I'm familiar with dimensionality reduction algorithms, but for some reason I don't want to use them.

Suppose I convert my data to a more compact (long) form and create a new feature, n_colum, that holds the original column id, something like this:

   target         n_colum   val
1       1         colum_1 -0.35
2       1         colum_2 -1.58
3       1        colum_10  1.26
4       1    colum_100110  1.08
5       1 colum_499999999   0.3
6       1         colum_1 -1.21
7       1         colum_2  2.05
8       1        colum_10 -0.95
9       1    colum_100110  1.59
10      1 colum_499999999 -0.59
11      1         colum_1 -0.15
12      1         colum_2 -1.63
13      1        colum_10  0.63
14      1    colum_100110 -0.74
15      1 colum_499999999   0.6
16      0         colum_1  0.78
17      0         colum_2  0.55
18      0        colum_10 -1.31
19      0    colum_100110  0.24
20      0 colum_499999999 -0.22
21      0         colum_1  0.68
22      0         colum_2  0.36
23      0        colum_10  0.25
24      0    colum_100110 -0.23
25      0 colum_499999999  1.73
26      1         colum_1 -0.32
27      1         colum_2  1.07
28      1        colum_10 -0.13
29      1    colum_100110 -0.31
30      1 colum_499999999 -1.26
31      1         colum_1 -0.37
32      1         colum_2  0.47
33      1        colum_10  1.11
34      1    colum_100110 -1.14
35      1 colum_499999999 -0.43
36      1         colum_1 -0.85
37      1         colum_2  0.96
38      1        colum_10 -1.61
39      1    colum_100110  0.62
40      1 colum_499999999  0.06
41      1         colum_1  0.19
42      1         colum_2  0.62
43      1        colum_10 -1.28
44      1    colum_100110  1.31
45      1 colum_499999999   0.3
46      1         colum_1  0.16
47      1         colum_2  1.35
48      1        colum_10 -0.11
49      1    colum_100110  1.14
50      1 colum_499999999 -2.03

If I train an algorithm on the compact data, will I lose performance, or is it equivalent to the first option?

================UPD===================

I wrote a small prototype of the idea. Sorry for the messy code, I haven't had coffee yet.

So I remade the dataset into the compact form:

ir <- iris[sample(1:150, 150), ]

# make data in the compact (long) form
dat <- as.data.frame(matrix(ncol = 3, nrow = 0))
for (i in 1:nrow(ir)) {
  nc <- ncol(ir) - 1
  tmp_dat <- cbind.data.frame(
    target  = rep(ir$Species[i], nc),
    n_colum = 1:nc,
    value   = unlist(ir[i, -5])
  )
  dat <- rbind.data.frame(dat, tmp_dat)
}
row.names(dat) <- NULL
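As an aside, the same long format can be produced without the row-by-row rbind loop, for example with data.table::melt. This is only a sketch of an alternative, not part of the original prototype; note that the row order differs (melt groups by variable rather than by observation) and n_colum ends up holding the original column names instead of the integers 1:4.

# Equivalent reshape using data.table::melt instead of the rbind loop above
library(data.table)
dat2 <- melt(as.data.table(ir),
             id.vars       = "Species",
             variable.name = "n_colum",
             value.name    = "value")
setnames(dat2, "Species", "target")
head(dat2)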

head(ir)
head(dat, 20)

...

       target n_colum value
1      setosa       1   4.8
2      setosa       2   3.4
3      setosa       3   1.6
4      setosa       4   0.2
5      setosa       1   5.7
6      setosa       2   4.4
7      setosa       3   1.5
8      setosa       4   0.4
9  versicolor       1   5.6
10 versicolor       2   3.0
11 versicolor       3   4.5
12 versicolor       4   1.5
13  virginica       1   6.9
14  virginica       2   3.1
15  virginica       3   5.1
16  virginica       4   2.3

Next, I train the model:

Y <- dat$target
X <- dat[,-1]

# train model
tr <- 1:500
ts <- 501:nrow(X)
table(Y[tr])

library(randomForest)
rf <- randomForest(Y[tr] ~ ., X[tr, ])

Then I make predictions. Each prediction is made on four rows at once, because the original data has four feature columns:

# predict
result <- as.data.frame(matrix(ncol = 2, nrow = 0))
for (i in ts) {
  if (X$n_colum[i] == 4) {
    X_rows <- X[(i - 3):i, ]

    # X_rows
    #      n_colum value
    # 597       1   6.7
    # 598       2   3.3
    # 599       3   5.7
    # 600       4   2.1

    pr <- predict(rf, X_rows, type = "prob")

    pr <- cbind.data.frame(
      predicted = as.factor(names(which.max(colMeans(pr)))),
      original  = Y[i]
    )

    result <- rbind(result, pr)
  }
}

print(result)

    predicted   original
1  versicolor versicolor
2  versicolor versicolor
3  versicolor versicolor
4      setosa     setosa
5  versicolor versicolor
6      setosa     setosa
7  versicolor versicolor
8   virginica  virginica
9   virginica  virginica
10  virginica  virginica
11 versicolor versicolor
12     setosa     setosa
13  virginica  virginica
14  virginica  virginica
15  virginica  virginica
16 versicolor versicolor
17 versicolor versicolor
18  virginica  virginica
19  virginica  virginica
20     setosa     setosa
21  virginica  virginica
22     setosa     setosa
23  virginica  virginica
24  virginica  virginica
25     setosa     setosa

If I didn't make any mistakes, then it turns out that you can train a model this way, which is very good.

mr.T
  • Have you tried this and observed what happens? – mkt May 10 '23 at 10:50
  • Why is it that you "cannot run" your algorithms? – Tim May 10 '23 at 10:55
  • @mkt no haven't tried it yet – mr.T May 10 '23 at 11:14
  • @Tim there are too many columns (features), one observation (one line) is about half a gigabyte – mr.T May 10 '23 at 11:15
  • Do you mean having one column of the target, one column of the feature value, and a third column giving the original feature to which the value corresponds? // If so, does that really reduce your data size to something manageable? Also, what kind of modeling would you do with your two features (one of values, one of a categorical variable with a bunch of categories)? – Dave May 10 '23 at 11:30
  • @Dave this will not reduce the size of the data in principle, but in this form I can train the algorithm in parts or retrain. And in the first case, I can’t even load a small part of the dataset lines into memory, let alone run the learning algorithm – mr.T May 10 '23 at 11:36
  • How many rows and columns do you have? In any event, SGD can do gradient updates with one row at a time, the smallest possible amount of data to use; if only loading 1 row isn't possible, then the problem is that your computer is too small. – Sycorax May 10 '23 at 12:16
  • 250,000 rows and 50,000,000 columns – mr.T May 10 '23 at 12:52
  • Can you say more about this data? What are these 50 million (!) columns? – mkt May 10 '23 at 13:06
  • It is some numerical result from sequences. Alone, each column may mean nothing, but in big groups (that are not known) it should give a result. That's why I have to submit all the columns at once. – mr.T May 10 '23 at 13:26
  • As an educated guess, I think OP means that this is "some numerical result from gene sequences." – Sycorax May 10 '23 at 13:32
  • "Alone, each column may mean nothing, but in big groups (that are not known) it should give a result." If I understand that statement and the compact form correctly, using the compact form will make it impossible to find anything useful. – Solomon Ucko May 11 '23 at 19:59
  • "I'm familiar with dimensionality reduction algorithms, but for some reason I don't want to use them." Can you elaborate on that reason? Also, interestingly, a predictor which must predict column target based on all the other columns can be seen as a dimensionality-reduction algorithm, which reduces from all-the-feature-columns to just-column-target. – Stef May 12 '23 at 10:27

3 Answers

14

Reshaping the data doesn't solve the problem because at the end of it, you have the same amount of data, plus an additional "index" column. If loading 1 row is expensive, then loading 1 row plus some more data must also be expensive. Moreover, using the index column to "decode" the new format isn't free, so any comparison would need to account for the costs (memory, compute) of how you decode this format.

If you're not making any effort to decode the new indexing column, and are just treating it as data, then obviously that's not going to work, because (1) the index values change with the arbitrary choice of how the columns are ordered, and (2) the original data are treated as a single feature, so you can't model the outcome as arising from interactions among different features.

However, there are some extremely simple steps that you can take to reduce the number of features you have.

  • Remove any columns that consist entirely of a single value. I know this is obvious, but it bears mentioning because it's easy & simple.

  • Screen out perfectly correlated features. There are ways to compute correlation coefficients one observation at a time (see, e.g., the online update of the Pearson coefficient); then screen out features with perfect correlation. The reasoning here is that perfectly correlated features add no new information to the model, so removing one of them saves memory and compute. You could relax "perfect correlation" to also exclude highly correlated features, but then you'd have to choose which member of each highly correlated pair to keep, which could matter for inference or explanatory modeling. Moreover, whereas one feature in a perfectly correlated pair provides no new information beyond the other, that is not necessarily true for highly-but-not-perfectly-correlated features. (A minimal sketch of both screens follows this list.)
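Here is a minimal sketch of both screens, assuming a plain numeric CSV named big_data.csv (a hypothetical file) and a number of columns small enough that a p × p cross-product matrix fits in memory; for truly huge p you would only track these statistics for blocks or candidate pairs of columns.

# One streaming pass: keep running sums so only one chunk is ever in memory.
chunk_size <- 1000
con    <- file("big_data.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]
p  <- length(header)
n  <- 0
s1 <- numeric(p)         # running column sums
s2 <- numeric(p)         # running column sums of squares
cp <- matrix(0, p, p)    # running cross-products (feasible only for moderate p)

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  X  <- do.call(rbind, lapply(strsplit(lines, ","), as.numeric))
  n  <- n + nrow(X)
  s1 <- s1 + colSums(X)
  s2 <- s2 + colSums(X^2)
  cp <- cp + crossprod(X)
}
close(con)

# Screen 1: drop constant (zero-variance) columns
col_var  <- s2 / n - (s1 / n)^2
constant <- col_var < 1e-12

# Screen 2: drop the later member of every perfectly correlated pair
cov_mat <- cp / n - tcrossprod(s1 / n)
cor_mat <- cov_mat / sqrt(tcrossprod(col_var))
cor_mat[constant, ] <- 0
cor_mat[, constant] <- 0
dup  <- apply(upper.tri(cor_mat) & abs(cor_mat) > 1 - 1e-8, 2, any)
keep <- header[!(constant | dup)]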

I've used these two extremely simple feature screening methods to dramatically cut down the number of features in real data sets.

(As an aside, some people use univariate feature screens that test for association between the target variable and a single feature. I don't recommend these because they can discard features that are predictive only in conjunction with other features. You've stated that this is the case for your data, so I think such screens would not help solve your problem.)

However, these two steps are mostly preludes that help economize a more advanced analysis. Because you have 200 times as many columns as rows, it makes sense to re-express the data as a full-rank matrix (in your case, at most $250{,}000 \times 250{,}000$). One way to do this is principal component analysis (PCA), for which there are online algorithms. The following is a review article I found as the first hit on Google:

"Online Principal Component Analysis in High Dimension: Which Algorithm to Choose?" by Hervé Cardot and David Degras

In the current context of data explosion, online techniques that do not require storing all data in memory are indispensable to routinely perform tasks like principal component analysis (PCA). Recursive algorithms that update the PCA with each new observation have been studied in various fields of research and found wide applications in industrial monitoring, computer vision, astronomy, and latent semantic indexing, among others. This work provides guidance for selecting an online PCA algorithm in practice. We present the main approaches to online PCA, namely, perturbation techniques, incremental methods, and stochastic optimization, and compare their statistical accuracy, computation time, and memory requirements using artificial and real data. Extensions to missing data and to functional data are discussed. All studied algorithms are available in the R package onlinePCA on CRAN.

Of course, PCA is just one example of a dimension-reduction algorithm that can be carried out in an online fashion; there are many more. Other examples would be incremental SVD (the gold-standard rank-revealing decomposition) or QR decomposition; both have incremental/online algorithms. Depending on the qualities of the data (are some columns sparse or binary?), you may wish to use methods tailored to those qualities.
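For illustration, here is a minimal sketch of the stochastic-optimization flavour of online PCA (Oja's subspace rule) in base R; read_next_row() is a hypothetical, user-supplied function that returns the next centred feature vector, or NULL when the data are exhausted.

# Oja's subspace rule: update the top-k principal directions one row at a time,
# so the full data matrix never has to be held in memory. Note that even the
# p x k basis U is sizeable when p is 50 million, so k must be chosen accordingly.
online_pca_oja <- function(read_next_row, p, k = 10, lr = 1e-3) {
  U <- qr.Q(qr(matrix(rnorm(p * k), p, k)))           # random orthonormal start
  repeat {
    x <- read_next_row()
    if (is.null(x)) break
    y <- crossprod(U, x)                              # k-dimensional projection of this row
    U <- U + lr * (x %*% t(y) - U %*% tcrossprod(y))  # Oja update
    U <- qr.Q(qr(U))                                  # re-orthonormalize (in practice, only occasionally)
  }
  U                                                   # columns approximate the top-k PCs
}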

If you don't want to do dimensionality reduction, then you can use machine learning models that are trained with online algorithms. For example, models trained using stochastic gradient descent (such as neural networks or regression) can work while loading only one example into memory at a time.
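As a minimal sketch of the idea (not a tuned implementation), and again assuming a plain numeric CSV big_data.csv whose first field is the 0/1 target, a hand-rolled SGD logistic regression only ever holds the coefficient vector and the current row in memory.

# SGD logistic regression reading one observation at a time from disk
p  <- 50000000        # number of features (from the comments above)
w  <- numeric(p)      # coefficients
b  <- 0               # intercept
lr <- 0.01            # learning rate

con <- file("big_data.csv", open = "r")
invisible(readLines(con, n = 1))     # skip header
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break
  row <- as.numeric(strsplit(line, ",")[[1]])
  y   <- row[1]
  x   <- row[-1]
  p_hat <- 1 / (1 + exp(-(sum(w * x) + b)))  # predicted probability
  grad  <- p_hat - y                         # d(log-loss) / d(linear predictor)
  w <- w - lr * grad * x                     # one SGD update
  b <- b - lr * grad
}
close(con)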

Finally, a literature review will be extremely enlightening! From one of OP's comments, I infer that these are data about gene sequences. If this surmise is correct, then there is a wealth of published academic literature where researchers have encountered this exact problem and devised different approaches to solving it. Even if my guess is not correct, and the problem is not exactly the same as gene sequencing, a literature review will still reveal numerous examples of researchers having large datasets & reducing them to smaller data sets.

Sycorax
  • no, these are not genes, it is close to sequential association rules but with already calculated results – mr.T May 10 '23 at 13:46
  • Regardless, I'll bet $1 that someone, somewhere has published an academic article about this exact topic. And even if they haven't, gene sequencing researchers often encounter this same problem! – Sycorax May 10 '23 at 13:48
  • Quite possible. In the end everything repeats itself – mr.T May 10 '23 at 13:51
5

This is an interesting idea. However, I think you are doing less data reduction than you think (or perhaps none at all). For instance, by making a categorical feature with each original variable as a category, you now have a data set with $50{,}000{,}000 \times 250{,}000 = 12.5$ trillion rows. That is with three columns, so you have $37.5$ trillion values to handle in the new data set, as opposed to just $12.5$ trillion in the original data.

Okay, but your goal is to pass subsets of the data into an algorithm and train in batches, either explicitly with a neural network or at least in a manner that is evocative of using batches to optimize neural network parameters. Then it would seem that you can pass batches into your training, maybe $12{,}500$ rounds of passing data of size $1$ billion $\times 3$.

However...

...you have to do something with your column that has $50$ million categories. A common way to handle categorical data is to make indicator variables, where the indicator takes $1$ if that category is represented and $0$ otherwise, meaning that you get $50$ million new variables. Thus, while you think your data set is $12.5$ trillion $\times 3$, your data set is actually more like $12.5$ trillion $\times 50$ million, which has roughly $50$ million times as many values as the original $250{,}000 \times 50{,}000{,}000$ matrix, so you have made the problem dramatically worse.
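To see the blow-up concretely, here is a small illustration using the dat data frame from the iris prototype in the question, where n_colum has only 4 categories instead of 50 million.

# Treating n_colum as categorical turns one column into one indicator per
# original feature column (4 here, 50 million in the real data).
head(model.matrix(~ value + factor(n_colum), data = dat))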

Dave
  • Thank you all so much for your help and new ideas. I need to comprehend your answers – mr.T May 10 '23 at 13:41
  • Reading your answer, I think: what if we replace the categorical feature "n_colum" with a regular integer? Maybe this would solve the problem of a huge number of categories for the model. – mr.T May 10 '23 at 16:54
  • That introduces an arbitrary order to the categories. (In light of the discussion about gene sequences, however, is it arbitrary?) – Dave May 10 '23 at 16:59
  • The columns can be in any order and it does not matter, right? By coding the columns with integers, you impose an order that probably does not make sense. (If the columns are about values in a sequence, then it is a different story, though that seems not to be the case here.) – Dave May 10 '23 at 17:09
  • yes the columns can be in any order – mr.T May 10 '23 at 17:15
  • Then assigning them numbers imposes an order that probably makes no sense. – Dave May 10 '23 at 17:18
  • first I need to do a prototype test on small data to make sure the model learns something from the data in this format... Maybe I'm just wasting your time – mr.T May 10 '23 at 17:21
5

Unfortunately, I do not think your "compact" data format would be any more beneficial computationally (at least as in the example code you showed), and it is not the same representation as the original.

You could think of the original (intractable) problem as estimating the model $$ Y_i = m(X_i) + \epsilon_i \qquad (i=1,\dots,n)$$ where $X_i \in \mathbb{R}^{50{,}000{,}000}$. What you are proposing is to define a new categorical variable $Z_{ij}$ that indicates column $j$ of the $i$th row, with value $V_{ij}$, and then estimate the model $$ Y_{ij} = m(Z_{ij}, V_{ij}) + \xi_{ij} \qquad (i=1,\dots,n;\; j=1,\dots,50{,}000{,}000).$$


Firstly, the number of "samples" in the "compact" dataset is now 50M times larger. While it's true that you could train a model in mini-batches (say 50000 rows in each batch), why not just train on the original dataset with a much smaller batch size (say 1-2 rows per batch)? Note that you do not need to load the whole dataset into memory at once, just keep it on disk and read in 1-2 rows and train via an incremental ML algorithm.

A much more important point, however, is that most standard implementations of ML algorithms assume that the rows are i.i.d. realizations from some data-generating process. Using the "compact" representation, we lose all of the information related to the joint distribution of $X_i$, for example, correlations between the variables and (possibly complicated) combinations of them that are predictive of $Y_i$.

Now as for the fact that you have 50M features...I would strongly suggest you explore the data a bit more before considering any actual modeling. Are the features highly correlated (this may be tough to compute)? Are they sparse? Are there duplicates? Dimensionality reduction can be very useful here both in terms of making it computationally feasible and likely improving the performance of the model.

If you really do not want to do any dimensionality reduction for whatever reason (though you should), there are certain ML models that can work well while only ever seeing a subset of the features. Random forests come to mind: you could consider only a small random subset of the features at each split, although you will likely have to code this up yourself. A rough sketch of a closely related idea follows.
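The sketch below sidesteps custom tree code by fitting an ensemble of small random forests, each on its own random subset of columns (a random-subspace ensemble, not the per-split scheme described above), so only those columns are ever loaded. It assumes a hypothetical CSV big_data.csv with a header whose first column is target, and uses data.table::fread(select = ...) to read just the chosen columns; n_members and cols_per_m are arbitrary choices.

library(data.table)
library(randomForest)

all_cols   <- names(fread("big_data.csv", nrows = 0))   # header only
feat_cols  <- setdiff(all_cols, "target")
n_members  <- 100     # ensemble size (assumption)
cols_per_m <- 1000    # features seen by each member (assumption)

members <- lapply(seq_len(n_members), function(k) {
  cols <- sample(feat_cols, cols_per_m)
  d    <- as.data.frame(fread("big_data.csv", select = c("target", cols)))
  randomForest(x = d[, cols, drop = FALSE], y = as.factor(d$target), ntree = 50)
})

# Average the class-probability votes of all members
predict_ensemble <- function(members, newdata) {
  probs <- lapply(members, function(m) {
    feats <- rownames(m$importance)            # columns this member was trained on
    predict(m, as.data.frame(newdata)[, feats, drop = FALSE], type = "prob")
  })
  Reduce(`+`, probs) / length(probs)
}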

Adam