Given these values, is it possible to generate random values that conform to this distribution (using Python, but preferably without the SciPy package)?

Statistic   Value
Mean        1.518
Std Dev     24.827
Skew        140.770
Kurtosis    25342.612
Mode        1
Min         1
Max         5735
Count       182557

Purpose: I'm trying to generate dummy / fake / synthetic data based on production data.

I've tried finding several solutions, but most of them seem to use SciPy.

Reason for avoiding SciPy: I don't want to include ~200 MB of dependency just to use a single function.

EDIT

Old title: How to generate random values based on mean, standard deviation, skew and kurtosis in Python without SciPy?

I gave up avoiding installation of yet another package. I've included SciPy in our dependencies, but I'm still stuck with finding a good solution for this.

The solutions I've tried generate either too many zeros or all ones. This is confusing to me because I don't have a statistical background. I know that it's against SE's rules to simply ask for working code as a solution, so I'm only asking for some tips or nudges in the right direction.

I've tried a few functions (skewnorm, pearson3, etc.) without understanding what they're doing and merely checking if the output looks as I expect. Of course, I couldn't find a satisfactory solution.

m01010011
  • Which SciPy function were you planning to use? What you describe here is a very skewed distribution. Can you give more details? Maybe you can cannibalise the package? Also remember that we can always remove README/documentation files without impeding our software's functionality. Finally, do you build from source or use a precompiled wheel? The latter is often bigger. – usεr11852 Jul 29 '23 at 22:21
  • @usεr11852 (1) Not sure. I'm not from a statistical background. (2) More details: a column I'm trying to simulate data for has values from 1-5735. 1 has relatively higher chances of appearing (~98-99%), hence the skewness. I want to create a set of dummy values that also form this type of distribution; it doesn't have to generate the same statistics, approximate is enough. – m01010011 Aug 02 '23 at 11:19
  • Those moments alone do not determine the distribution. However, there is a well-established family of distributions that was designed specifically to provide one distribution for most mathematically allowed combinations of those moments: the Pearson system. – whuber Aug 02 '23 at 14:04
  • (+1) to what @whuber said: having those four moments does not uniquely define a distribution. We need more information to help you. Theoretically, we can estimate the eCDF from the data and save its values in a file, then ship that file as part of the binary and sample from that eCDF using only NumPy functions. – usεr11852 Aug 02 '23 at 22:19
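
The eCDF idea from the last comment could be sketched roughly as follows, using only NumPy. This is a minimal illustration, not a definitive implementation: the function names `build_ecdf` and `sample_ecdf` are hypothetical, and the `production` array is a made-up stand-in (about 98.5% ones plus a sparse tail up to 5735) because the real column is not available here.

```python
import numpy as np


def build_ecdf(values):
    """Return the sorted distinct values and their cumulative probabilities."""
    support, counts = np.unique(np.asarray(values), return_counts=True)
    cdf = np.cumsum(counts) / counts.sum()
    cdf[-1] = 1.0  # guard against floating-point round-off in the last entry
    return support, cdf


def sample_ecdf(support, cdf, size, rng):
    """Draw `size` values by inverse-transform sampling from the eCDF."""
    u = rng.random(size)  # uniform draws on [0, 1)
    # Index of the first cumulative probability strictly greater than u.
    idx = np.searchsorted(cdf, u, side="right")
    return support[idx]


rng = np.random.default_rng(0)

# ASSUMPTION: hypothetical stand-in for the production column.
production = np.ones(10_000, dtype=np.int64)
production[:150] = rng.integers(2, 5736, size=150)  # tail values in 2..5735

support, cdf = build_ecdf(production)
synthetic = sample_ecdf(support, cdf, size=182_557, rng=rng)

print("share of ones:", np.mean(synthetic == 1))
print("min, max:", synthetic.min(), synthetic.max())
```

Only `support` and `cdf` need to be stored (for example with `np.savez`), not the raw column. Note that this sampler can only return values that actually occurred in the original data, so it reproduces the heavy concentration at 1 and the observed tail, but it will not invent new tail values.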

0 Answers