
I have a pandas dataframe with an "hours" column, which consists only of integer values from 1 to 24, with multiple entries for each hour. Another column, "counts", gives a count for each of those entries.

I would like to create a histogram, preferably with seaborn's distplot, that shows the total count for each integer hour against the hour. I could sum up the counts for each distinct hour separately and then plot the result, but I was wondering whether there is an automatic way to do this.

For example, consider that I have two entries for hour 1, [1, 20], [1, 50]. For hour 1 on the histogram, I want it to plot 20 + 50 = 70 and not 2.

student010101
  • `distplot` is deprecated and replaced by `histplot` – Trenton McKinney Apr 20 '21 at 23:12
  • Always provide a complete [mre] with code, **data, errors, current output, and expected output**, as **[formatted text](https://stackoverflow.com/help/formatting)**. If relevant, only plot images are okay. Please see [How to ask a good question](https://stackoverflow.com/help/how-to-ask). Provide data with [How to provide a reproducible copy of your DataFrame using `df.to_clipboard(sep=',')`](https://stackoverflow.com/q/52413246/7758804), then **[edit] your question**, and paste the clipboard into a code block. – Trenton McKinney Apr 20 '21 at 23:12
  • @TrentonMcKinney Does `histplot` do everything that `distplot` does? – student010101 Apr 20 '21 at 23:50
  • See the documentation https://seaborn.pydata.org/generated/seaborn.distplot.html – Trenton McKinney Apr 21 '21 at 00:11

1 Answer


Seaborn's histplot accepts a weights= parameter (similar to weights in plt.hist), which gives the relative weight of each x-value. Setting discrete=True ensures the bin boundaries are placed neatly around these discrete values, one bin per integer.

Alternatively, sns.barplot can be used with a sum as the estimator. This automatically sets an x-tick for every value and chooses a color scheme. ci=None is needed to suppress the error bars (seaborn 0.12 and later use errorbar=None for this and deprecate ci).

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Example data: three rows for each hour 0..23, with arbitrary counts
df = pd.DataFrame({'hour': np.repeat(np.arange(24), 3),
                   'count': np.arange(72)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))

# Left: histogram where each row contributes its 'count' instead of 1
sns.histplot(data=df, x='hour', weights='count', discrete=True, color='darkturquoise', ax=ax1)
ax1.set_xticks(np.arange(24))
ax1.set_title('sns.histplot')

# Right: bar plot that sums 'count' per hour; ci=None removes the error bars
sns.barplot(data=df, x='hour', y='count', estimator=np.sum, ci=None, ax=ax2)
ax2.set_title('sns.barplot')
plt.tight_layout()
plt.show()

Figure: histplot with weights vs barplot using sums
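As a sanity check on what the weights do, here is a minimal sketch that compares the manual route from the question (summing the counts per hour) with a weighted histogram. The small dataframe is made up for illustration and reuses the question's two entries for hour 1; `np.histogram` with unit-wide bins centered on each integer stands in for `histplot(..., discrete=True)` and is only an approximation of what seaborn does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical data: hour 1 appears twice (20 and 50), as in the question.
df = pd.DataFrame({'hour': [1, 1, 2, 3, 3],
                   'count': [20, 50, 5, 7, 3]})

# Manual route: sum the counts for each distinct hour.
totals = df.groupby('hour')['count'].sum()

# Weighted route: one unit-wide bin centered on each integer hour,
# with every row contributing its 'count' rather than 1.
edges = np.arange(df['hour'].min() - 0.5, df['hour'].max() + 1.5)
heights, _ = np.histogram(df['hour'], bins=edges, weights=df['count'])

print(totals.to_dict())   # {1: 70, 2: 5, 3: 10}
print(heights.tolist())   # [70, 5, 10]
```

Both routes give 70 for hour 1, which is the bar height the question asks for, rather than the row count of 2.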

JohanC
  • When should I use barplot over histplot? There seem to be so many options for histogram plots. And then there's also displot, countplot, distplot (which the user above said is deprecated) – student010101 Apr 21 '21 at 12:50
  • With seaborn for this case you can choose between `histplot` and `barplot`. Here `histplot` is handy because of the `discrete` and `weights` options. `barplot` is more general. Matplotlib's `plt.hist` is a bit tricky to use with discrete data. `displot` is useful when you need multiple subplots with comparable settings. `countplot` is similar to `histplot` for discrete data, but doesn't have a `weights` option. `distplot` is just the old version of `histplot` and `kdeplot`. – JohanC Apr 21 '21 at 14:02
  • If I convert the hour variable to a categorical variable, it seems to no longer work with `histplot`. Is one of the other options more suitable for categorical variables? – student010101 Apr 21 '21 at 17:19
  • `histplot` really needs numbers for the x-axis, it can't work with categorical variables. – JohanC Apr 21 '21 at 17:27
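For the categorical case raised in the last two comments, two possible workarounds are sketched below, assuming the categories are the integer hours themselves. The dataframe and the variable names `totals` and `hour_as_int` are illustrative only. Option A aggregates per category, which suits a bar-style plot; Option B casts the column back to integers so that a numeric x-axis is available again.

```python
import pandas as pd

# Hypothetical data with 'hour' stored as a categorical; hour 3 has no rows.
df = pd.DataFrame({'hour': [1, 1, 2, 4], 'count': [20, 50, 5, 8]})
df['hour'] = pd.Categorical(df['hour'], categories=[1, 2, 3, 4])

# Option A: sum the counts per category. observed=False keeps the
# empty category (hour 3) in the result with a total of 0.
totals = df.groupby('hour', observed=False)['count'].sum()

# Option B: convert the categorical back to plain integers.
hour_as_int = df['hour'].astype(int)

print(totals.tolist())       # [70, 5, 0, 8]
print(hour_as_int.tolist())  # [1, 1, 2, 4]
```

The aggregated `totals` could then be drawn as bars, while `hour_as_int` could be assigned back to the dataframe and used as the `x` column together with `weights='count'`.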