
I've got a data set of 84,529 entries, each giving the number of times a particular item is cited in a database. The set is extremely skewed, ranging from entries with 0 citations to a single entry with over 45,000 citations. The median of the set is 16 citations, and it can be shown that a relatively small number of entries account for the vast majority of citations. The data set is here and a log(x+1) histogram is shown below.

[Figure: histogram of log(x+1) citation counts]
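To put a number on how concentrated the citations are, one option is to compute the share of all citations held by the most-cited fraction of entries. The sketch below is only an illustration: `top_share` is a hypothetical helper and `toy_counts` is made-up data, not the real data set.

```python
def top_share(counts, fraction):
    """Share of all citations held by the most-cited `fraction` of entries."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(fraction * len(ranked)))  # number of top entries to keep
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Toy counts, NOT the real data: many low counts plus a few very large ones.
toy_counts = [0] * 20 + [1] * 30 + [5] * 30 + [20] * 15 + [500] * 4 + [45000]
print(top_share(toy_counts, 0.01))  # share held by the top 1% of entries
print(top_share(toy_counts, 0.05))  # share held by the top 5% of entries
```

Reporting a few such shares alongside the median gives a compact description of the skew that does not depend on any distributional assumption.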

My perhaps stupid question is whether there's a standard way to describe this particular distribution. Naively, I figured it might be a generalised Pareto distribution, given the dominance of a small number of terms. Toying with log-log plots didn't convince me of this, but I don't know enough about them to assert anything confidently.
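For reference, the usual log-log check looks at the empirical complementary CDF, P(X >= x): a power-law tail shows up as a roughly straight line on log-log axes, while faster-decaying tails bend downward. A minimal sketch, assuming integer counts and using a hypothetical helper `empirical_ccdf` on toy data (zeros are counted in the total but skipped as points, since they cannot be placed on a log axis):

```python
import math
from collections import Counter


def empirical_ccdf(counts):
    """Return sorted (x, P(X >= x)) pairs for the distinct positive values."""
    n = len(counts)
    freq = Counter(counts)
    points = []
    remaining = n  # number of observations >= the current x
    for x in sorted(freq):
        if x > 0:  # zeros cannot be shown on a log axis
            points.append((x, remaining / n))
        remaining -= freq[x]
    return points


toy_counts = [0, 0, 1, 1, 1, 2, 2, 4, 8, 100]  # toy data, not the real set
for x, p in empirical_ccdf(toy_counts):
    print(f"log10(x) = {math.log10(x):.2f}, log10(P) = {math.log10(p):.2f}")
```

Plotting the second column against the first gives the log-log picture without any binning choices.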

DRG
  • A semiparametric regression model would work well here. See this. – Frank Harrell Apr 27 '22 at 11:39
  • The use of a bar chart to display this distribution is deceptive and might be causing you to assess its properties incorrectly. Combine the smaller bins (values less than 2.5 or so) into wider bins if necessary so there are no gaps between bars and create a true histogram. Alternatively, examine a probability plot. Finally, you can usually do better than arbitrarily adding $1$ to the counts: see https://stats.stackexchange.com/questions/30728. – whuber Apr 27 '22 at 15:04
  • @whuber These counts are number of citations, so naturally integers. Is it possible to do better than adding 1? – dipetkov Apr 28 '22 at 19:47
  • @dipetkov Yes, as I describe in my answer to the linked thread. – whuber Apr 28 '22 at 20:56

1 Answer


You should consider comparing candidate distributions. One handy Python library for comparing how well different heavy-tailed distributions describe empirical data is powerlaw. It has a paper that explains how to use it:

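import powerlaw
import pandas as pd

# Load the counts from the first column (the file has no header row).
data = pd.read_csv("sampledata.txt", header=None)

# Fit the tail of the data; with no xmin given, the library estimates it.
fit = powerlaw.Fit(data[0].values)

# Plot the empirical CCDF and overlay the fitted candidate distributions.
ax = fit.plot_ccdf(marker="o", ls="", ms=2, color="k")
fit.power_law.plot_ccdf(ax=ax, label="power law")
fit.truncated_power_law.plot_ccdf(ax=ax, label="truncated power law")
fit.lognormal.plot_ccdf(ax=ax, label="lognormal")
ax.legend()
ax.set_ylabel(r"$P(X\geq x)$")
ax.set_xlabel(r"$x$")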

[Figure: empirical CCDF of the data with the fitted power law, truncated power law, and lognormal curves]
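Beyond eyeballing the curves, two candidate tails can be compared through their log-likelihood ratio. The powerlaw library offers its own comparison through `fit.distribution_compare`; the sketch below is not that implementation. It is a standard-library illustration, assuming a fixed `xmin`, a continuous approximation, and synthetic Pareto data, and it pits a power law against an exponential only because both have closed-form maximum-likelihood estimates. The function name `power_law_vs_exponential` is hypothetical.

```python
import math
import random


def power_law_vs_exponential(values, xmin):
    """Fit both models to the tail x >= xmin and return (alpha, rate, R).

    R is the log-likelihood ratio; R > 0 favours the power law.
    """
    tail = [x for x in values if x >= xmin]
    n = len(tail)
    sum_log = sum(math.log(x / xmin) for x in tail)
    sum_excess = sum(x - xmin for x in tail)

    # Closed-form maximum-likelihood estimates (continuous approximation).
    alpha = 1.0 + n / sum_log  # density ((alpha-1)/xmin) * (x/xmin)**(-alpha)
    rate = n / sum_excess      # density rate * exp(-rate * (x - xmin))

    loglik_power = n * math.log((alpha - 1.0) / xmin) - alpha * sum_log
    loglik_expon = n * math.log(rate) - rate * sum_excess
    return alpha, rate, loglik_power - loglik_expon


# Synthetic heavy-tailed sample standing in for the citation counts.
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(5000)]  # true alpha = 2.5
alpha_hat, rate_hat, ratio = power_law_vs_exponential(sample, xmin=1.0)
print(f"alpha = {alpha_hat:.2f}, log-likelihood ratio = {ratio:.1f}")
```

The sign of the ratio says which model fits the tail better, but not whether the difference is significant. Citation counts are integers, so a discrete likelihood would be more appropriate for the real data.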

mjjjj
  • Could you spell out how this procedure handles the spike at 0? – Nick Cox Apr 27 '22 at 10:28
  • @Nick I believe that spike may be illusory, due to the use of a bar chart to display the frequencies. Eyeballing it, I believe a histogram (or probability plot) might suggest a mixture model with two components (one supported between 0 and 4 and the other between 2 and 8, roughly). – whuber Apr 27 '22 at 18:17