I thought I understood KDE/density plots until I ran into this problem. I have a dataset with two columns, Diff and Var, and approximately 5 million rows. This is the header:
About the data:
- Var: Only takes values Var_A or Var_B
- Diff: Can take integer values from -100 to 100
- 70% of observations are from Var_A and 30% from Var_B.
- Given that Var=Var_A, 98% of Diff values are equal to 0.
- Given that Var=Var_B, the largest proportions are these:
Given that information, if I plot density plots of Var_A and Var_B, I would expect a taller peak for Var_A, since a larger proportion of its values are equal to 0 (98%) than for Var_B (50%). However, when I plot them I see the opposite, and by such a large margin that the height of the Var_B curve visually flattens the density plot of Var_A.
Why is that? Why does Var_B visually show a higher concentration of 0 values in the plot, when in the actual numbers only 50% of its values are 0, compared to 98% for Var_A? I thought it could be because Var_B had more observations than Var_A, but as stated at the beginning, Var_A has more than twice as many observations as Var_B (70%/30% = 2.33).
So, where is my misunderstanding?
And just in case you are wondering, this is the code I used for those plots:
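import matplotlib.pyplot as plt
import seaborn as sns

# data: DataFrame with the columns 'Diff' and 'Var' described above
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
mask_a = data['Var'] == 'Var_A'
mask_b = data['Var'] == 'Var_B'
# Left: both groups on one axis; middle: Var_A only; right: Var_B only
sns.kdeplot(data=data, x='Diff', hue='Var', ax=ax[0])
sns.kdeplot(data=data[mask_a], x='Diff', hue='Var', ax=ax[1])
sns.kdeplot(data=data[mask_b], x='Diff', hue='Var', ax=ax[2])
plt.show()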
EDIT: I dug deeper to try to find some answers.
I made the same plots as above, but with Diff standardized (subtracting the mean and dividing the result by the standard deviation). This gave the same distributions (but with different Y-axis values, because of the standardization):
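For reference, a minimal sketch of what such a standardization might look like. It assumes the standardization is done separately within each Var group (the post does not say whether it was per group or over the whole column), and it uses a small toy DataFrame with made-up values rather than the real data. The column name `Diff_std` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real data (made-up values, not the original dataset).
data = pd.DataFrame({
    "Var": ["Var_A"] * 5 + ["Var_B"] * 5,
    "Diff": [0, 0, 0, 0, 10, 0, 0, 1, -1, 2],
})

# Standardize Diff within each group: subtract the group mean and
# divide by the group standard deviation (pandas uses ddof=1 by default).
grouped = data.groupby("Var")["Diff"]
data["Diff_std"] = (data["Diff"] - grouped.transform("mean")) / grouped.transform("std")

# After standardizing, each group has mean ~0 and standard deviation ~1.
summary = data.groupby("Var")["Diff_std"].agg(["mean", "std"])
print(summary)
```

Since this is only a linear rescaling of each group, it would be expected to change the axis values but not the shape of each curve, which seems consistent with what the plots showed.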
It seems to me that the density at Diff=0 is much higher for Var_B than for Var_A. But why is this the case? If the proportion of Diff=0 for Var_A (98%) is much higher than for Var_B (50%), shouldn't the kernel density around 0 be higher for Var_A?
The only plot that makes sense to me is the one where I plot on the Y-axis the actual proportion (percentage) of each discrete value, but it's very different from the grouped KDE plot:
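A sketch of how those per-value proportions could be computed, assuming `pd.crosstab` with row-wise normalization is an acceptable way to get them (the post does not show how the proportion plot was actually produced). The toy data below is made up so that Var_A has 90% zeros and Var_B has 50% zeros.

```python
import pandas as pd

# Toy stand-in for the real data (made-up values, not the original dataset).
data = pd.DataFrame({
    "Var": ["Var_A"] * 10 + ["Var_B"] * 4,
    "Diff": [0] * 9 + [5] + [0, 0, 1, -1],
})

# Rows are Var groups, columns are the distinct Diff values.
# normalize="index" makes each row sum to 1, so each cell is the share
# of that Diff value within its group.
proportions = pd.crosstab(data["Var"], data["Diff"], normalize="index")
print(proportions)
```

Unlike a KDE, these values are plain relative frequencies, so they do not depend on any smoothing bandwidth. Something like `proportions.T.plot(kind="bar")` could then be used to draw them side by side.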
As far as I can tell, the problem lies in the Var_A density plot, given that its density around 0 is very low even though 98% of its Diff values are equal to 0. My question, then, is: why?




