So, I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of these categorical variables have a very large number of distinct values.

For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into multiple other columns using domain knowledge, e.g., into network class, host type, and so on. However, wouldn't that make my dataset lose some information? What if I wanted to deal with IP addresses as they are?
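
For concreteness, here is a minimal sketch of what that suggested split might look like; the derived feature names are purely illustrative, not a recommendation:

    import ipaddress

    def ip_features(ip_string):
        """Derive a few domain-knowledge features from an IPv4 address string."""
        ip = ipaddress.ip_address(ip_string)
        octets = ip_string.split(".")
        return {
            "first_octet": int(octets[0]),      # rough "network class" proxy
            "last_octet": int(octets[-1]),      # host part for /24 networks
            "is_private": int(ip.is_private),   # RFC 1918 / special-use ranges
            "prefix_16": ".".join(octets[:2]),  # coarser /16 grouping
        }

    print(ip_features("192.168.1.77"))
    # {'first_octet': 192, 'last_octet': 77, 'is_private': 1, 'prefix_16': '192.168'}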

Granted, the domain-knowledge solution might work for the IP column. However, I've got other columns with more than 100K distinct values, where each value is a fixed-length random string.

I have worked with embedding layers before, but always with at most a few thousand distinct values; I've never worked with 10K+ of them, so I'm not sure whether that approach would work with millions.
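
For reference, here is a minimal Keras sketch of the kind of embedding layer I mean, assuming the IP column has already been integer-encoded; the vocabulary size and embedding dimension are illustrative:

    import tensorflow as tf

    NUM_IPS = 1_000_000   # illustrative: one row per distinct IP
    EMBED_DIM = 32        # learned dense representation per IP

    ip_input = tf.keras.Input(shape=(1,), dtype="int64", name="ip_id")
    ip_vec = tf.keras.layers.Embedding(input_dim=NUM_IPS, output_dim=EMBED_DIM)(ip_input)
    x = tf.keras.layers.Flatten()(ip_vec)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(ip_input, output)
    model.summary()  # the embedding table alone has NUM_IPS * EMBED_DIM parameters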

Regards

  • Can you explain more about the problem you are trying to solve? – Alireza Zolanvari Mar 14 '19 at 11:07
  • Mainly, I'm trying to classify data according to some inputs; the inputs mostly consist of categorical data, and each categorical variable has a great many distinct values. One of the independent variables is the IP address, which is essential for my classification problem. What I'm trying to do is binary classification based on the (mostly categorical) inputs. Does that help? Let me know if you need more details. – Abdullah Mohamed Mar 14 '19 at 11:51
  • Embeddings and domain-based features are the most promising options here. For IP, that would be subnet ID, geo-location, etc. Embeddings work for large numbers of values (such as word embeddings over 10 million+ words). – Shamit Verma Mar 14 '19 at 11:59
  • What kind of information are you trying to extract from the IP? – Alireza Zolanvari Mar 14 '19 at 11:59
  • @ShamitVerma My dataset already contains countries; however, the country variable might differ from the IP's country (due to VPNs/proxies, for instance). I didn't know that embeddings work for data with millions of distinct values; in that case, that would be a reasonable solution to my question. – Abdullah Mohamed Mar 14 '19 at 12:04
  • @alirezazolanvari Not only the IP, but multiple other values as well. Actually, it's an old Kaggle problem that I'm trying to solve. I don't want to look at kernels (at least not yet) before trying to deal with the problem exactly as if it were a real-world scenario. – Abdullah Mohamed Mar 14 '19 at 12:04
  • Let me put my question another way: if you know what information you want to extract from each categorical feature, there are some well-developed solutions. But if you don't know, and you're just looking for an algorithm that can extract the information in these features, it's a bit hard. – Alireza Zolanvari Mar 14 '19 at 12:07
  • @alirezazolanvari Are you referring to the domain-knowledge approach? To answer your question: what I'm looking for is the representation of the given data that correlates as strongly as possible with the desired result, without over- or under-fitting. Aren't the categorical features themselves the information? Don't they already describe the dataset, and shouldn't they correlate with the dependent variable in order to make sense? In other words, how can we extract information from an independent variable, and why would I do so? – Abdullah Mohamed Mar 14 '19 at 12:16
  • I suggest word embedding algorithms, because you should have an insight into how each feature can help you toward the desired result. – Alireza Zolanvari Mar 14 '19 at 12:28

2 Answers

Have you heard of CatBoostClassifier?

https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/

It is a type of boosting classifier developed specifically to deal with categorical features. It has achieved state-of-the-art results, and the package developed by the authors has excellent support and even GPU capability. Take a look; this could be your solution.
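
A minimal usage sketch, with toy data and illustrative parameter values:

    from catboost import CatBoostClassifier
    import pandas as pd

    # Toy data: two high-cardinality categorical columns plus a binary label.
    df = pd.DataFrame({
        "ip":    ["10.0.0.1", "192.168.1.5", "10.0.0.2", "172.16.0.9"],
        "token": ["a1b2", "c3d4", "a1b2", "e5f6"],
        "label": [0, 1, 0, 1],
    })
    X, y = df[["ip", "token"]], df["label"]

    # Declare which columns are categorical; CatBoost encodes them internally
    # (ordered target statistics), so no manual one-hot encoding is needed.
    model = CatBoostClassifier(iterations=100, verbose=False)
    model.fit(X, y, cat_features=["ip", "token"])
    print(model.predict(X))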

Victor Oliveira

There are multiple options; you can try them and decide which suits your data best:

  • Word embeddings:
    • You can use pre-trained models.
    • You can train your own Word2Vec model on your domain-specific data.
  • Group different values together:
    • Rare values with very low statistical significance can be collapsed into an "other" bucket (see the sketch after this list).
    • Try different clustering algorithms, depending on your data.

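As a quick illustration of the grouping idea, here is a pandas sketch that collapses rare values into an "other" bucket; the threshold is arbitrary:

    import pandas as pd

    def collapse_rare(series, min_count=10, other_label="other"):
        """Replace values that occur fewer than min_count times with other_label."""
        counts = series.value_counts()
        rare = counts[counts < min_count].index
        return series.where(~series.isin(rare), other_label)

    # Toy example: 'a' is common; 'b' and 'c' are rare.
    col = pd.Series(["a"] * 20 + ["b"] * 2 + ["c"])
    print(collapse_rare(col, min_count=5).value_counts())
    # a        20
    # other     3
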
There might be other, more efficient methods; I will add those here as well if I find any.

Thanks

Amandeep