
I was just thinking about what the properties of an ideal data set $X \in \mathbb{R}^{n \times d}$ would be, where $n$ is the sample size and $d$ is the number of features. I think (or at least this is what I understood from reading textbooks) that there are two things that have to be satisfied:

  1. Each feature ($d_{i}$) must be independent of the others, so that $X^{T}X$ is positive definite and, in turn, has a condition number close or equal to 1 (see the sketch after this list).
  2. There have to be enough samples ($n$) to protect the data from the curse of dimensionality.
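
For concreteness, here is a minimal sketch of point 1 (Python/NumPy; the synthetic data and variable names are purely illustrative, not from any real experiment) showing how independent versus nearly collinear features show up in the positive definiteness and condition number of $X^{T}X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5

# Independent features: X^T X is well conditioned.
X_good = rng.standard_normal((n, d))

# Nearly collinear features: the last column almost duplicates the first.
X_bad = X_good.copy()
X_bad[:, -1] = X_bad[:, 0] + 1e-3 * rng.standard_normal(n)

for name, X in [("independent", X_good), ("collinear", X_bad)]:
    G = X.T @ X
    eigvals = np.linalg.eigvalsh(G)  # real eigenvalues of the symmetric Gram matrix
    print(name,
          "| positive definite:", eigvals.min() > 0,
          "| condition number: %.2e" % (eigvals.max() / eigvals.min()))
```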

Is this enough?

Once these requirements are satisfied, could we directly infer the distribution of the data? Is there any rule of thumb? For instance, if those two (or more) conditions are satisfied, then the data must be Gaussian or follow some other particular distribution.

I am trying to fill the gap between the statistical properties and the algebraic properties of the data, and I am a little confused about how to build the relationship between them. Could someone point me to materials on this relationship, or take the time to explain it to me?

User1865345
  • "Ideal" in what sense? Being independent or the number of samples has nothing to do with distribution. Why does finding "ideal" data bothers you? – Tim Mar 23 '22 at 08:37
  • 1 . "Being independent or the number of samples has nothing to do with distribution." I was wondering if there is a relationship in between. 2. " Why does finding "ideal" data bothers you?" Because, I am/was thinking that when we apply preprocessing or post processing to any data, we actually transform it into a more proper form which results a different shape of that distirbution or may be a different kind of distribution. And in some cases, transforming the data and giving it to the model increases the overall performance, and in other cases it decreases. – Kadir Gunel Mar 23 '22 at 08:50
  • So the transformation process on the input data affects the shape/distribution of the data and hence gives better performance. I accept that the model used is also important. – Kadir Gunel Mar 23 '22 at 08:53
  • What kind of transformations? – Tim Mar 23 '22 at 08:56
  • From the experiments I am doing, I observe that if we apply centering and unit normalization, I get two things compared to the original input: 1. faster convergence, 2. higher accuracy. These two operations (centering, unit scaling) affect the shape of the data and also the condition number, which makes me want to build a relationship between statistics and algebra (a sketch after these comments illustrates the condition-number effect). – Kadir Gunel Mar 23 '22 at 08:57
  • And the model is least squares. – Kadir Gunel Mar 23 '22 at 08:58
  • I'm lost. What is your question? Could you summarize it in one sentence? It sounds like you are asking "how to preprocess data for machine learning/statistical inference" but this would need a whole book rather than Q&A answer. – Tim Mar 23 '22 at 09:01
  • Is there any way to infer the distribution of the data by just checking its feature space ($X^{T}X$)? Of course after applying the normalization steps, so that it is actually the covariance matrix. – Kadir Gunel Mar 23 '22 at 09:10
  • As far as I understand the question, it really depends on what you are doing. For prediction with xgboost, for example, a positive definite feature space is not particularly ideal (if I am not misleading myself). – MrSmithGoesToWashington Mar 23 '22 at 10:03
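
To illustrate the normalization effect mentioned in the comments above, here is a minimal sketch (Python/NumPy; the synthetic two-feature data is my own illustration, not the actual experiment) of how centering and unit-norm scaling change the condition number of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two features on very different scales, one with a large offset.
X = np.column_stack([rng.normal(100.0, 1.0, n),    # mean 100, sd 1
                     rng.normal(0.0, 0.01, n)])    # mean 0, sd 0.01

def cond(A):
    s = np.linalg.svd(A, compute_uv=False)  # singular values of A
    return s.max() / s.min()

# Center each column, then scale it to unit Euclidean norm.
Xc = X - X.mean(axis=0)
Xn = Xc / np.linalg.norm(Xc, axis=0)

print("condition number, raw data:   %.2e" % cond(X))
print("condition number, normalized: %.2e" % cond(Xn))
```

A smaller condition number of $X$ (equivalently of $X^{T}X$) is exactly what makes iterative least-squares solvers converge faster, consistent with the observation in the comments.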

1 Answer


Answering the clarified question from the comment

Is there any way to infer about the distribution of the data by just checking its feature space ($X^{T}X$)?

No. Let's start from the very beginning. Say you have a dataset with $n$ rows (samples) and $k$ columns; in statistics, you would think of this dataset as a random matrix of $n \times k$ random variables. The shape of the dataset tells you nothing about the distributions of the random variables, because each of the $n \times k$ points is assumed to be a realization of a separate random variable. To simplify things, in many cases you would assume the samples (e.g. the rows, treated as $n$ random vectors of size $k$) to be independent and identically distributed multivariate random variables, but this has nothing to do with the shape of the dataset. You can also have a random matrix consisting of $n \times k$ random variables that follow different distributions and are not independent.
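
As a minimal sketch of this point (Python/NumPy; the synthetic datasets are purely illustrative): the two datasets below have essentially the same Gram matrix $X^{T}X$, yet follow completely different distributions, so $X^{T}X$ alone cannot identify the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 10000, 3

# Entries drawn from N(0, 1) ...
X_gauss = rng.standard_normal((n, k))
# ... versus entries that are only ever -1 or +1 (Rademacher).
X_sign = rng.choice([-1.0, 1.0], size=(n, k))

# Both normalized Gram matrices X^T X / n are close to the identity:
print(np.round(X_gauss.T @ X_gauss / n, 2))
print(np.round(X_sign.T @ X_sign / n, 2))

# Yet one dataset is continuous while the other takes only two values:
print(np.unique(X_sign))
```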

Tim