I am currently working on a project that involves a data processing pipeline, in which I might come across all sorts of data. I would like to know whether there are any references on an automatic way to detect that a numerical column is actually categorical (e.g. store IDs, ZIP codes, etc.), so that it is not mistaken for a continuous variable.

Is there any way to look at the numbers in a particular column and, based solely on those numbers, tell whether the column is categorical or numerical? I have already read a couple of other posts about this problem here, but most of the answers either don't cite any scientific studies or simply say it's impossible.

Glorfindel
  • Could you explain why this might matter for your processing pipeline? – whuber Jul 27 '22 at 18:28
  • We want to automate everything. We want a machine that automatically identifies the variable type (categorical, etc.) without our having to do it all manually, because at times we might receive dozens of datasets with hundreds of columns and we won't have the time to analyse them column by column. – Ysgramor 500 Jul 27 '22 at 18:55
  • That isn't what I was asking. What is the point of characterizing columns as categorical or numerical? How is that relevant for data processing? – whuber Jul 27 '22 at 20:20
  • Should I use one-hot encoding on them or not? Maybe using one-hot encoding for numerical variables might reduce the accuracy significantly... – Ysgramor 500 Jul 27 '22 at 20:51
  • That's not a data processing decision, nor is it wise to automate it: it's a statistical modeling decision. Using one-hot encoding for numerical variables will increase the accuracy of any prediction based on them but potentially at a cost of (severe) overfitting. Using that coding for a response variable will usually not work at all and will always create a complicated model. – whuber Jul 27 '22 at 22:33
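
A minimal sketch of the trade-off described in the last comment, assuming a single integer-valued predictor and synthetic data (the helper names, sample sizes, and noise level are all illustrative and not from the thread). In a linear model, one-hot coding a predictor amounts to fitting a separate mean for each distinct value, so its training error can never exceed that of a straight-line fit, while its error on new data may well be worse.

```python
import random
from collections import defaultdict
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_one_hot(xs, ys):
    """One-hot coding of x in a linear model: one fitted mean per distinct value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    level_means = {x: mean(v) for x, v in groups.items()}
    overall = mean(ys)  # fallback for values never seen in training
    return lambda x: level_means.get(x, overall)

def mse(predict, xs, ys):
    """Mean squared prediction error."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

def make_data(rng, n):
    """Synthetic data: y depends linearly on an integer x, plus Gaussian noise."""
    xs = [rng.randint(1, 30) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 5) for x in xs]
    return xs, ys

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(rng, 60)
test_x, test_y = make_data(rng, 60)

line = fit_line(train_x, train_y)
one_hot = fit_one_hot(train_x, train_y)

print("train MSE  line:", mse(line, train_x, train_y),
      " one-hot:", mse(one_hot, train_x, train_y))
print("test MSE   line:", mse(line, test_x, test_y),
      " one-hot:", mse(one_hot, test_x, test_y))
```

Only the training-set comparison is guaranteed; how the held-out errors compare depends on the data, which is why the choice of coding is a modeling decision rather than a preprocessing step.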

1 Answer

I found this helpful: "A key distinction between ordinal categorical variables and discrete quantitative variables is that there is a uniform degree of difference within discrete quantitative variables."

It's from this resource:

https://www.codecademy.com/learn/stats-variable-types/modules/stats-variable-types/cheatsheet

Hope it helps.

SR1
  • My question was more along the lines of: if I have many datasets with a large number of columns, is there any strategy I could use to figure out which columns are ordinal categorical, nominal categorical, or quantitative when all the columns are already represented as numbers? Like correlation, number of unique values, or anything else? – Ysgramor 500 Jul 27 '22 at 18:04
  • This will mistakenly identify ZIP codes as not being discrete. And with small datasets it will often be mistaken, no matter what the variables may be. The question really is unanswerable because it's predicated on a fallacy: neither "categorical" nor "numerical" refers to an inherent property of data. They reflect a combination of the measurement process and how the data are modeled. – whuber Jul 27 '22 at 18:04
  • Found something similar here; they used Python to identify categorical variables in a dataset. https://stackoverflow.com/q/29803093 – SR1 Jul 27 '22 at 19:26
  • That's a solution to a different problem. There, "categorical" refers to how data are represented in a computer. – whuber Jul 27 '22 at 20:21
  • Okay, thanks @whuber. – SR1 Jul 27 '22 at 20:28
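
The "number of unique values" idea raised in the comments can be sketched as a simple screening rule. This is only an illustration: the function name `flag_possible_categorical` and both thresholds are hypothetical choices, not an established method, and the examples deliberately include cases where the rule gives a questionable answer, in line with the objection that the numbers alone cannot settle the matter. At best, a rule like this flags columns for a person to review.

```python
def flag_possible_categorical(values, max_unique=20, max_unique_ratio=0.05):
    """Return True if a numeric column looks like it might hold category codes.

    Both thresholds are arbitrary assumptions chosen for illustration.
    """
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return False
    # Any non-integer value is treated as evidence of a measured quantity.
    if any(not float(v).is_integer() for v in non_missing):
        return False
    n_unique = len(set(non_missing))
    # Few distinct values, in absolute terms or relative to the row count.
    return n_unique <= max_unique or n_unique / len(non_missing) <= max_unique_ratio

store_ids = [101, 102, 103] * 50          # codes repeated many times
prices = [19.99, 5.49, 102.5]             # non-integer measurements
zip_codes = list(range(10000, 10500))     # 500 distinct codes, one row each
children = [0, 1, 2, 3] * 50              # a genuine count with few values

print(flag_possible_categorical(store_ids))  # True
print(flag_possible_categorical(prices))     # False
print(flag_possible_categorical(zip_codes))  # False, although ZIP codes are labels
print(flag_possible_categorical(children))   # True, although this is a quantity
```

The last two cases show both failure modes: high-cardinality codes are missed, and low-cardinality counts are flagged.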