my density/kde plots don't show what I expected

Question

I thought I understood KDE/density plots until I ran into this problem. I have a dataset with two columns, Diff and Var, and approximately 5 million rows. This is the header:

About the data:

Var: Only takes values Var_A or Var_B
Diff: Can take integer values from -100 to 100
70% of observations are from Var_A and 30% from Var_B.
Given that Var=Var_A, 98% of Diff values are equal to 0.
Given that Var=Var_B, the largest proportions are these:

Given that information, if I plot density plots of Var_A and Var_B, I would expect a higher peak for Var_A, since it has a larger proportion of values equal to 0 (98%) than Var_B (50%). However, when I plot them I see the opposite, and by such a large margin that the height of Var_B visually flattens the density plot of Var_A.

Why is that? Why does Var_B appear to have a higher concentration of 0 values in the plot, when in the actual numbers it has only 50% compared with 98% for Var_A? I thought it could be because Var_B had more observations than Var_A, but as stated at the beginning, Var_A has more than twice as many observations as Var_B (70%/30% = 2.33).

So, where is my misunderstanding?

And just in case you are wondering, this is the code I used for those plots:

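import matplotlib.pyplot as plt
import seaborn as sns

# data: DataFrame with the columns 'Diff' and 'Var' described above
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
mask_a = data['Var'] == 'Var_A'
mask_b = data['Var'] == 'Var_B'
# Left: both groups together; middle: Var_A only; right: Var_B only
sns.kdeplot(data=data, x='Diff', hue='Var', ax=ax[0])
sns.kdeplot(data=data[mask_a], x='Diff', hue='Var', ax=ax[1])
sns.kdeplot(data=data[mask_b], x='Diff', hue='Var', ax=ax[2])
plt.show()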

EDIT: I went deeper trying to explore/find some answers

I made the same plots as above, but with Diff standardized (subtracting the mean and dividing by the standard deviation). This gave the same shapes (with different Y-axis values, since the data is standardized):

It seems to me that at Diff=0 the density function for Var_B is much higher than the one for Var_A. But why is this the case? If the proportion of Diff=0 for Var_A (98%) is much higher than for Var_B (50%), shouldn't the kernel density around 0 be higher for Var_A?

The only plot that makes sense to me is the one where I plot the actual proportion (percentage) of each discrete value on the Y-axis, but it's very different from the grouped KDE plot:

As far as I can tell, the problem lies in the Var_A density plot, given that its density around 0 is very low even though 98% of its Diff values are equal to 0. My question then is: why?

Tim · Answer 1 · 2022-11-26T07:43:55.117

1

You cannot compare the heights of two probability density functions (which is what kernel density estimates are) because they are relative. Those heights are not on the same scale. You can learn more from the "Can a probability distribution value exceeding 1 be OK?" thread.
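A minimal numeric sketch of this point, assuming two zero-mean normal densities as a hypothetical stand-in (not the asker's data, and not what seaborn computes internally). Both curves enclose an area of 1, yet the narrow one peaks ten times higher than the wide one, because the peak of a normal density scales as 1/sigma:

```python
import math


def normal_pdf(x, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def area_under_pdf(sigma, lo=-50.0, hi=50.0, steps=100_000):
    """Midpoint-rule approximation of the integral of normal_pdf over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, sigma) for i in range(steps)) * dx


# Hypothetical widths, chosen only for illustration.
narrow_sigma, wide_sigma = 0.5, 5.0

for sigma in (narrow_sigma, wide_sigma):
    print(f"sigma={sigma}: peak height={normal_pdf(0.0, sigma):.4f}, "
          f"area={area_under_pdf(sigma):.4f}")
```

The height at a point only says how tightly the curve is squeezed around that point. It is not the probability of that point.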

If you have discrete values, why not plot the values against the empirical probabilities that you already calculated? Then, you could compare the probabilities.
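One way to sketch the empirical-probability approach, using only the standard library. The two samples are hypothetical toy data that mimic the 98% and 50% shares of zeros from the question, and `empirical_pmf` is an illustrative helper name, not a library function:

```python
from collections import Counter


def empirical_pmf(values):
    """Return {value: share of observations equal to value}, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}


# Hypothetical toy samples (not the asker's data).
diff_a = [0] * 98 + [-3, 5]
diff_b = [0] * 50 + [-1] * 20 + [1] * 20 + [2] * 10

pmf_a = empirical_pmf(diff_a)
pmf_b = empirical_pmf(diff_b)

# These shares are probabilities, so they can be compared directly.
print("P(Diff = 0 | Var_A) =", pmf_a[0])
print("P(Diff = 0 | Var_B) =", pmf_b[0])
```

With the asker's DataFrame, something like `data.groupby('Var')['Diff'].value_counts(normalize=True)` should give the same per-group shares, which can then be drawn as a bar chart.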

Finally, if you only need it for visualization, you can always divide each kernel density by its maximum value, so that the peaks of the curves align.
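A rough sketch of that rescaling, assuming a hand-rolled Gaussian KDE with a fixed, hypothetical bandwidth (seaborn chooses its own bandwidth from the data, so the raw heights will differ from what it draws). The samples are toy data, and the helper names are illustrative:

```python
import math


def gaussian_kde(sample, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian kernel density estimate on grid."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm
        for x in grid
    ]


def scale_to_unit_peak(density):
    """Divide a density curve by its maximum so that its peak equals 1."""
    peak = max(density)
    return [d / peak for d in density]


# Hypothetical toy samples and bandwidth (not the asker's data).
sample_a = [0] * 98 + [-3, 5]
sample_b = [0] * 50 + [-1] * 20 + [1] * 20 + [2] * 10
grid = [i / 10 for i in range(-100, 101)]  # -10.0 to 10.0 in steps of 0.1

kde_a = gaussian_kde(sample_a, grid, bandwidth=1.0)
kde_b = gaussian_kde(sample_b, grid, bandwidth=1.0)

scaled_a = scale_to_unit_peak(kde_a)
scaled_b = scale_to_unit_peak(kde_b)

print("raw peaks:   ", max(kde_a), max(kde_b))
print("scaled peaks:", max(scaled_a), max(scaled_b))
```

After rescaling, the curves no longer integrate to 1, so the Y-axis is no longer a density. Only the shapes remain comparable.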

answered Nov 26 '22 at 07:31 · edited Nov 26 '22 at 07:43 · Tim (138,066)

Can you elaborate a bit more on why I cannot compare them? I didn't understand your point. In the meantime, I will edit the question to elaborate, with more examples I just made. – Chris Nov 27 '22 at 22:50
To @Tim's point, I recommend plotting the distributions on a cumulative scale; then they are comparable (and generally more informative for heavily skewed data than histograms / density estimates). See seaborn ecdfplot for details. – Georg M. Goerg Nov 28 '22 at 04:35
@Chris did you read the linked thread? – Tim Nov 28 '22 at 07:46
@Tim yeah I did! But I don't understand how to apply it to my problem. Let me explain. A KDE's height can exceed 1 in certain ranges and still have an integral over the domain that adds up to 1. So in Var_A I have a very small height around 0 because other ranges "take" a lot of the area, but the problem is that in Var_A the height around 0 should be close to 1, as 98% of the data is there. So the integral from -0.5 to 0.5 should be near 0.98, but it isn't, right? – Chris Nov 28 '22 at 17:45
@Chris say I tell you that I have two figures with the same area: a square and a triangle. Does this imply that they have the same (or comparable) height? Here you have two areas under the curve that each integrate to one, but you have no reason whatsoever to expect them to have the same height. – Tim Nov 28 '22 at 18:09
Yeah, I understand your point, but I have always compared distributions with KDEs and I have never encountered this kind of problem, where I could not make a visual comparison between two samples or distributions using a KDE. So I'm a bit shocked, and that's why I'm posting this question: I somehow expected to come up with a KDE plot where I could compare those distributions. If I can't accomplish that (as seems to be the case here), what would be the technical justification? "In this joint plot we can see Var_B but we can't see Var_A because...?" – Chris Nov 28 '22 at 19:03
@Chris there is no justification beyond the fact that such a comparison does not make sense. Probability densities are relative and don't have a universal meaning. If you want to visually compare them, you can always rescale them. – Tim Nov 28 '22 at 19:37
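
A minimal sketch of the cumulative-scale comparison suggested in the comments, using only the standard library. The samples are hypothetical toy data, and `ecdf` is an illustrative helper name. On this scale the Y-axis is a probability, so the size of the jump at 0 is exactly the share of zeros in each group:

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative_share(x):
        return bisect_right(ordered, x) / n

    return cumulative_share


# Hypothetical toy samples (not the asker's data).
diff_a = [0] * 98 + [-3, 5]
diff_b = [0] * 50 + [-1] * 20 + [1] * 20 + [2] * 10

ecdf_a = ecdf(diff_a)
ecdf_b = ecdf(diff_b)

# Diff is integer-valued, so F(0) - F(-1) is the share of values equal to 0.
jump_a = ecdf_a(0) - ecdf_a(-1)
jump_b = ecdf_b(0) - ecdf_b(-1)
print("share of zeros, Var_A:", jump_a)
print("share of zeros, Var_B:", jump_b)
```

With the asker's DataFrame, a call such as `sns.ecdfplot(data=data, x='Diff', hue='Var')` should draw the same step functions, one per group.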
