How to plot binary vs. categorical (nominal) data?

Question

I am building a machine learning model for a binary classification task in Python (Jupyter Notebook). I am currently in the exploratory data analysis phase and am trying to create several plots/graphs for my data set.

My data set consists of 20 columns (19 features and 1 labeled target). Each row in my data set represents a person. There are many categorical/nominal features in my data set and only a few numerical/continuous ones. Unfortunately, I cannot upload the real data set, so I have created a dummy one.

personID	age	car	TARGET_happiness
1	27	ford	0
2	41	tesla	1
3	55	bmw	0
4	34	tesla	1
5	62	ford	1
6	38	ford	1
7	51	bmw	0
8	46	tesla	1
9	72	bmw	0
10	59	tesla	0
11	48	ford	0
12	51	bmw	1

My aim is to create a plot/graph that visualizes the relationship between the binary variable TARGET_happiness (meaning "is the person happy?") and the categorical variable car (meaning "which car does this person own?").

The plot I've used for binary TARGET_happiness vs. continuous age is a box plot, see:

This seems fine. I also tried a box plot for binary TARGET_happiness vs. categorical car:

I'm not sure if this plot is useful / appropriate. Sure, you can see that Tesla owners seem to be happier than BMW owners. But the box for Ford owners looks strange.

Which type of plot/graph can I use to better visualize the relationship between binary and categorical data?

Despite the different title, just about every idea in the above thread carries over to this case. — Nick Cox, Apr 21 '21 at 08:31
Box plots are generally useless for binary data or ordered data with only a few distinct values. If more than 25% of values are equal to the lowest value recorded then that value is both the minimum and the lower quartile and any whisker collapses to a line of zero length you can't see. A similar story applies at the other end of the distribution with the maximum and upper quartile. If less than 25% are values equal to the minimum or to the maximum, you may see a point symbol for each but the box plot alone won't tell you if that represents one data point or (almost) a quarter of them. — Nick Cox, Apr 23 '21 at 10:11
Thinking through the definitions shows that other weird-looking box plots can arise. For example, suppose 20% of values are 1 or 2, 60% of values are 3 and 20% of values are 4 or 5. Then 3 is at once the median and both quartiles and the box collapses to a line of zero length and it's a moot point whether your software will show it. Simple bar charts will work better. — Nick Cox, Apr 23 '21 at 10:16
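
The quartile argument in these comments can be checked numerically. The following is a minimal sketch, assuming the dummy data from the question and pandas' default (linearly interpolated) quantile definition; other software may define quartiles slightly differently. The ten-value series is a made-up instance of the 20% / 60% / 20% example, not data from the question.

```python
import pandas as pd

# Dummy data from the question (car and target columns only).
df = pd.DataFrame({
    'car': ['ford', 'tesla', 'bmw', 'tesla', 'ford', 'ford',
            'bmw', 'tesla', 'bmw', 'tesla', 'ford', 'bmw'],
    'TARGET_happiness': [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1],
})

# Quartiles of the binary target within each car brand.
# These are the numbers a box plot would draw as box edges and median line.
quartiles = (
    df.groupby('car')['TARGET_happiness']
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
print(quartiles)
# For ford (two 0s and two 1s) the lower quartile is 0 and the upper
# quartile is 1, so the box spans the whole range and leaves no whiskers.

# Hypothetical series matching the second comment:
# 20% of values are 1 or 2, 60% are 3, 20% are 4 or 5.
values = pd.Series([1, 2, 3, 3, 3, 3, 3, 3, 4, 5])
collapsed = values.quantile([0.25, 0.5, 0.75])
print(collapsed)
# Median and both quartiles coincide at 3, so the box has zero height.
```

Printing the quartiles next to a box plot makes it easier to see why the Ford box looks odd: it reflects how quartiles behave for 0/1 data rather than anything specific about Ford owners.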

score 4 · Answer 1 · answered Apr 25 '21 at 08:42

It makes more sense to count the 0s and 1s within each category, for example:

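import pandas as pd
import seaborn as sns

# Dummy data from the question (car and target columns only)
df = pd.DataFrame({'car': ['ford', 'tesla', 'bmw', 'tesla', 'ford', 'ford',
                           'bmw', 'tesla', 'bmw', 'tesla', 'ford', 'bmw'],
                   'TARGET_happiness': [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]})

# Grouped bar chart: one bar per target value (0/1) within each car brand
sns.catplot(x='car', hue='TARGET_happiness', data=df, kind='count')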

Or directly using the plot method in pandas:

pd.crosstab(df['car'],df['TARGET_happiness']).plot.bar(stacked=True)
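
When the categories have different sample sizes, raw counts can be hard to compare. One possible variation, sketched here as an assumption rather than as part of the original answer, is to normalize each row of the cross-tabulation so that every bar shows the share of 0s and 1s within a car brand. The helper name plot_share is hypothetical.

```python
import pandas as pd

# Dummy data from the question (car and target columns only).
df = pd.DataFrame({
    'car': ['ford', 'tesla', 'bmw', 'tesla', 'ford', 'ford',
            'bmw', 'tesla', 'bmw', 'tesla', 'ford', 'bmw'],
    'TARGET_happiness': [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1],
})

# Row-normalized cross-tabulation: each row (car brand) sums to 1.
share = pd.crosstab(df['car'], df['TARGET_happiness'], normalize='index')
print(share)
# Share of happy owners (column 1): bmw 0.25, ford 0.50, tesla 0.75


def plot_share(table):
    """Draw a stacked bar chart of within-category proportions.

    Hypothetical helper for illustration; requires matplotlib.
    """
    ax = table.plot.bar(stacked=True)
    ax.set_ylabel('share of owners')
    return ax


if __name__ == '__main__':
    plot_share(share)
```

Each bar then has height 1, so the split between 0 and 1 can be compared directly across brands. With only four owners per brand in the dummy data, these proportions are of course very noisy.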

This is clear and straightforward. It should be the accepted answer. — Andrew Staroscik, Mar 29 '23 at 14:28
