Background

Suppose there is a list of values that represents a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.

Then, what is the problem?

If I want to calculate the skewness and kurtosis of that list of values, I will get undefined values, since the standard deviation is 0 and both statistics divide by a power of it. In addition, $p(x)$ will be 0 and thus, in the entropy calculation, I will have to multiply 0 by an undefined value, $p(x) \log p(x)$. However, as far as I understand, undefined values do not have any meaningful interpretation in explainable machine learning or in statistical analysis (e.g., in correlation analysis).
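
To make the issue concrete, here is a minimal sketch in plain Python. It assumes population (biased) moments and a non-empty list; the function name `sample_moments` is made up for illustration and is not taken from any library. It returns NaN explicitly when the variance is 0, which is the case described above.

```python
import math

def sample_moments(values):
    """Return (skewness, excess kurtosis) from population (biased) moments.

    Hypothetical helper: both statistics divide by a power of the standard
    deviation, so NaN is returned when the variance is 0.
    """
    n = len(values)  # assumes a non-empty list
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        return math.nan, math.nan
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# All-zero activity: both statistics are undefined.
skew_zero, kurt_zero = sample_moments([0, 0, 0, 0])

# A list with some variability, for comparison.
skew_sym, kurt_sym = sample_moments([1, 2, 3, 4, 5])
```

For the all-zero list both results are NaN, while the second list gives finite values.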

Thus, my question

How can I quantify the skewness, kurtosis, and entropy when all the values in a list are 0?

Note: I have checked several Q&As on this site, StackOverflow, and others (e.g., this one), but I could not find an answer. I think these statistics could simply be set to 0, but that does not seem logical.


Update

Adding a new variable is one approach (please see the insightful comments of whuber). However, if I add a new variable, it needs to be added for every such case: for example, for data from weekdays, weekends, holidays, different semesters, and different time periods based on different intervals (e.g., 1, 2, 3, 4, ... hours). Therefore, many new variables will be created. The problem is that most feature selection approaches do not work when the number of features is larger than the number of samples (Reference). The number of participants in our study is around 100.
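
The indicator-variable approach suggested by whuber could be sketched as follows. This is only an illustration under stated assumptions: population (biased) moments, a non-empty list per time window, and hypothetical names (`features_with_indicator`, `moments_undefined`) that do not come from any library.

```python
def features_with_indicator(values):
    """Return skewness, excess kurtosis, and a flag for undefined moments.

    Hypothetical helper: when the variance is 0, the moments are set to 0
    and the flag `moments_undefined` is set to 1, so no NaN reaches the model.
    """
    n = len(values)  # assumes a non-empty list
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        return {"skewness": 0.0, "kurtosis": 0.0, "moments_undefined": 1}
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
        "moments_undefined": 0,
    }

# One window with no movement, one with some variability.
constant_window = features_with_indicator([0, 0, 0])
active_window = features_with_indicator([1, 2, 3, 4, 5])
```

Note that a single flag covers both skewness and kurtosis for a given window, since both become undefined under the same condition (zero variance).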

  • It depends on your purpose. What's even the point of quoting such statistics? Won't it suffice to state all the results are the same? By the way, $p\log(p)=0$ when $p=0$ as you can check by taking the limit as $p\to 0$ from above. – whuber Jan 13 '22 at 18:40
  • Thank you so much @whuber for your kind response. Actually, we will use those as features of machine learning models, where the values of a feature or a participant cannot be NaN. – Md Sabbir Ahmed Jan 13 '22 at 18:51
  • The resolution depends, then, on how those statistics are used in the model. One simple approach is to introduce a new variable that indicates whether all the moments are undefined and to set those moments to zero. That works perfectly for linear models, so there's some hope it will be effective in other forms of models, too. – whuber Jan 13 '22 at 18:57
  • @whuber Thank you again for your insightful response. Actually, there are thousands of variables. Thus, creating a new variable may not be a good choice. – Md Sabbir Ahmed Jan 14 '22 at 06:13
  • How could one more variable affect the complexity of a model with thousands of variables already? But if this is really a problem, then please edit your post to explain it. – whuber Jan 14 '22 at 14:18
  • @whuber I have updated the question. Please, check. Thanks. – Md Sabbir Ahmed Jan 14 '22 at 14:37
  • Sounds like a classic XY problem (https://xyproblem.info/). Tell us what you actually want to do/study/model. Here "Y" is how to calculate skew/kurt, etc., for constant data. – bdeonovic Jan 14 '22 at 15:27
  • I don't know why you need kurtosis, but if the true variance is not zero while there is zero variability in the sample data (both of which seem to be the case here), then the true kurtosis is probably a large number. The reason is that any different values that are eventually observed will be rare, hence outliers. (Kurtosis measures the outlier character of the distribution.) – BigBendRegion Jan 14 '22 at 19:09
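
whuber's remark that $p\log(p)$ tends to 0 as $p\to 0$ from above can be checked numerically. The sketch below is plain Python with hypothetical helper names (`plogp`, `empirical_entropy`); it assumes the entropy is computed from the empirical frequencies of the observed values, using the natural logarithm.

```python
import math
from collections import Counter

def plogp(p):
    """Return p * log(p), using the limiting value 0 at p == 0."""
    return 0.0 if p == 0 else p * math.log(p)

def empirical_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    n = len(values)  # assumes a non-empty list
    counts = Counter(values)
    return sum(-plogp(c / n) for c in counts.values())

# p * log(p) shrinks toward 0 as p approaches 0 from above.
near_zero = plogp(1e-12)

# All-zero activity: the single observed value has frequency 1, so entropy is 0.
entropy_constant = empirical_entropy([0, 0, 0, 0])

# Two equally frequent values, for comparison: entropy is log(2).
entropy_two_values = empirical_entropy([0, 1])
```

Under these assumptions the entropy of a constant list is well defined and equals 0, unlike the skewness and kurtosis.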
