
I am looking for a very large classification dataset, with approx. 1e9 samples and 1e4 classes. However, I cannot find any. Could you recommend some?

Orophile

1 Answer


1B rows is easy to reach. 10K labels can be achieved by concatenating simpler labels.

For example, take Wikipedia in 100 languages and split it into rows: by sentence, or by n-gram.

You now have enough rows, but still only 100 labels.
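As a rough sketch of the splitting step (the file name, whitespace tokenization, and n-gram size here are illustrative assumptions, not part of the answer):

```python
# Sketch: turn a plain-text Wikipedia dump for one language into n-gram rows.
# Assumes the dump has already been extracted to one text file per language
# (e.g. a hypothetical co.txt for Corsican).
def ngram_rows(text, n=3):
    """Yield word n-grams from a text, one per row."""
    tokens = text.split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

with open("co.txt", encoding="utf-8") as f:
    for row in ngram_rows(f.read(), n=3):
        print(f"{row}\tco")  # each row carries its source-language label
```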

Then run some library over each row, and concatenate its output with the existing language label, yielding a new, larger set of labels.

The library could be a parsing lib like spaCy (POS-tag sequences), a language-identification lib like langid.py, a sentiment model, or a combination of these. Whichever you pick, it should also have ~100 possible outputs, so that the concatenation yields 100^2 == 10^4 classes.

For example, you might have a row with the 3-gram va a Milano from the Corsican Wikipedia (co). For that sequence langid.py returns something like [('es', 0.5), ('it', 0.4), ...]. So the generated row can be:

va a Milano \t co_es_it
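A minimal sketch of generating such a row with langid.py's rank function (the label format mirrors the example above; note that langid's raw scores are model log-probabilities, not the rounded values shown, and the actual top-2 languages depend on the model):

```python
import langid  # pip install langid

def composite_label(row, source_lang):
    """Concatenate the source language with langid's top-2 guesses."""
    ranked = langid.rank(row)               # [('es', score), ('it', score), ...]
    top2 = [lang for lang, _ in ranked[:2]]
    return "_".join([source_lang] + top2)   # e.g. 'co_es_it'

row = "va a Milano"
print(f"{row}\t{composite_label(row, 'co')}")  # e.g.: va a Milano	co_es_it
```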

Note that the classes will not be balanced, although you could balance them by generating at a larger scale and then discarding rows as needed.
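One way to do that discarding, as a sketch (the per-class cap is an arbitrary assumption, and rows are taken as (text, label) pairs):

```python
from collections import defaultdict

def balance(rows, cap=100_000):
    """Keep at most `cap` rows per label, discarding the overflow."""
    seen = defaultdict(int)
    for text, label in rows:
        if seen[label] < cap:
            seen[label] += 1
            yield text, label
```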

philshem