
I would like to generate synthetic data by specifying their mean, variance, skew, and kurtosis. However, I only know how to generate synthetic data with mean and var.

Here is an example with mean and var: $$ p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} } $$

Which can be synthesized in python as:

import numpy as np
import scipy.stats

np.random.seed(0)

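# Specify Data
n = 10000
m1 = 0  # mean
m2 = 1  # variance
# m3 =
# m4 =

# Generate Data (np.random.normal expects the standard deviation, not the variance)
x = np.random.normal(m1, np.sqrt(m2), n)

# Confirm Consistency
M1 = np.mean(x)
M2 = np.var(x)
M3 = scipy.stats.skew(x)
M4 = scipy.stats.kurtosis(x)  # excess (Fisher) kurtosis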

print("M1 {:.4f}\nM2 {:.4f}\nM3 {:.4f}\nM4 {:.4f}".format(M1, M2, M3, M4))

Outputs:

M1 -0.0184
M2 0.9753
M3 0.0266
M4 -0.0310

How do we generate data if we also want to specify skew (m3) and kurtosis (m4)?

Joseph
  • You seem to be generating data from a Normal distribution, which by definition has skewness of 0 and (excess) kurtosis of 0 (for any $\mu,\sigma$). The reason you are not getting those exact values is because you are calculating the sample statistics from a random sample from that distribution. It's not 100% clear what you're trying to achieve here. – statsplease Apr 10 '22 at 02:50
  • My question is not why the results are not exact. The example demonstrated is a sufficient generation of synthetic data given mean and variance. My question is how to (possibly with different distributions) generate synthetic data with specific mean, variance, skew, and kurtosis specified. – Joseph Apr 10 '22 at 03:28
  • As you point out, skew and kurtosis are 0 in this example. If I want to make them non-zero, what equation and implementation could achieve this? – Joseph Apr 10 '22 at 03:31
  • My point was that you have chosen a Gaussian distribution. There are no degrees of freedom left to choose what the skewness and kurtosis are. They are zero.

    Probability distributions are defined by parameters. The skewness and kurtosis of a random variable are just functions of those parameters.

    Take a gamma distribution with a mean and variance already set by you (this means the two parameters $(\alpha,\beta)$ are determined). Once those parameters are fixed, the skewness and kurtosis are fixed too. You cannot change them.

    – statsplease Apr 10 '22 at 05:09
  • It is not clear why you would want to do this. The exact method of simulation would depend on the type of population distribution. As mentioned, for normal distribution you can specify $\mu, \sigma$ and higher order moments are then already determined. // For other distribution families, you might have to solve several equations in several unknowns for parameter values that yield allowable moments. – BruceET Apr 10 '22 at 07:43
  • The Pearson family of distributions was designed specifically for problems like this. – whuber Apr 10 '22 at 12:23
  • Yes, it is a fact that (by definition) you cannot specify m3 !=0 or m4 != 0 for a normal distribution. It is provided here as an example of what I mean by synthetic data generation given m1 and m2. – Joseph Apr 11 '22 at 00:43
  • I would like to use a distribution where I can specify m1, m2, m3, and m4. The correct answer will look something like x = sample_from_some_distribution(m1, m2, m3, m4, n). – Joseph Apr 11 '22 at 00:43
  • The comment by @whuber is closest to the answer, though I am having trouble figuring out how to define a sampling method for the Pearson distribution. – Joseph Apr 11 '22 at 00:52
  • Maybe the answer at https://stats.stackexchange.com/questions/141652/constructing-a-continuous-distribution-to-match-m-moments might help you – kjetil b halvorsen Apr 11 '22 at 02:40
  • There is no one sampling method for a Pearson distribution, because that family comprises many distinct kinds of distribution. First identify which member of the family might be a realistic model for your data and then use that. – whuber Apr 11 '22 at 11:46
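
The point made in the comments about the gamma distribution can be checked numerically. This is a minimal sketch, assuming a target mean of 2 and variance of 1 (illustrative values, not from the thread). Once the shape and scale are solved from the mean and variance, the skew and excess kurtosis follow automatically and cannot be chosen.

```python
import numpy as np
import scipy.stats

# Illustrative targets (assumed values, not taken from the thread)
target_mean, target_var = 2.0, 1.0

# A gamma with shape k and scale theta has mean k*theta and variance k*theta**2,
# so the two targets determine both parameters.
k = target_mean**2 / target_var
theta = target_var / target_mean

# Theoretical mean, variance, skew, and excess kurtosis of that gamma
mean, var, skew, exkurt = (
    float(v) for v in scipy.stats.gamma.stats(k, scale=theta, moments="mvsk")
)
print(mean, var, skew, exkurt)

# Skew and excess kurtosis depend on the shape alone: 2/sqrt(k) and 6/k
print(2 / np.sqrt(k), 6 / k)
```

Changing either target changes `k`, and with it the skew and kurtosis; there is no third or fourth knob to turn.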

1 Answer


I think what you are asking is this: given a set of four numbers, you want to generate data whose mean, variance, skew and kurtosis are (approximately) those four numbers.

A Python solution is given in this great answer on Stack Overflow.

The solution linked above may help. For example, if the four target numbers are (0,1,-1,3), it works quite well. I ran the linked code, generated a sample, and obtained [-0.012865049352413248 0.9854758243633666 -1.0212929276152714 2.8670673702318163] for the mean, variance, skew and kurtosis respectively, which is quite close to (0,1,-1,3).

It is worth noting that this won't work for every arbitrarily chosen set of four numbers. For example, running it with (50,4,-1,12) I got [49.827147444660625 5.961898303494179 -0.29452022237194114 3.1324664040689862], which is far from the target. In the plot, the PDF (in blue) takes values below 0 and the CDF (in orange) decreases in places, so this is not a valid distribution.

[Figure: plot of the fitted PDF (blue) and CDF (orange) for the target (50,4,-1,12)]

The following note is from the statsmodels documentation for pdf_mvsk:

In the Gram-Charlier distribution it is possible that the density becomes negative. This is the case when the deviation from the normal distribution is too large.

Code: I took my code directly from the answer above. To produce the plots I just plotted (x,y) and (x,yy).
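
The linked code is not reproduced here. As a rough sketch of the same idea, the block below writes the Gram-Charlier (type A) density out by hand with NumPy instead of calling statsmodels' pdf_mvsk, then samples from it by inverting a numerical CDF on a grid. The function names, the grid width and size, and the target (0, 1, 0.3, 0.5) are illustrative assumptions, not taken from the linked answer. Negative density values are clipped to zero, which is a workaround rather than a fix: when clipping kicks in, the sample moments drift away from the targets.

```python
import numpy as np

def gram_charlier_pdf(x, mean, var, skew, exkurt):
    """Gram-Charlier type A density, truncated after the 4th-order term.

    exkurt is the excess kurtosis (0 for a normal distribution).
    The returned values can be negative for extreme skew or kurtosis.
    """
    sd = np.sqrt(var)
    z = (x - mean) / sd
    he3 = z**3 - 3 * z             # Hermite polynomial He_3
    he4 = z**4 - 6 * z**2 + 3      # Hermite polynomial He_4
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return phi / sd * (1 + skew / 6 * he3 + exkurt / 24 * he4)

def sample_gram_charlier(mean, var, skew, exkurt, n, rng,
                         grid_size=20001, width=10.0):
    """Draw n samples by inverse-CDF lookup on a grid (hypothetical helper)."""
    sd = np.sqrt(var)
    x = np.linspace(mean - width * sd, mean + width * sd, grid_size)
    # Clip negative "density" to zero so the cumulative sum is non-decreasing
    pdf = np.clip(gram_charlier_pdf(x, mean, var, skew, exkurt), 0.0, None)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    # Map uniform draws through the tabulated CDF
    return np.interp(rng.uniform(size=n), cdf, x)

rng = np.random.default_rng(0)
sample = sample_gram_charlier(0.0, 1.0, 0.3, 0.5, 200_000, rng)

# Sample mean, variance, skew, and excess kurtosis
z = (sample - sample.mean()) / sample.std()
print(sample.mean(), sample.var(), (z**3).mean(), (z**4).mean() - 3)
```

For mild targets like the one above the density stays positive and the sample moments land near the requested values. For targets such as (50,4,-1,12) the clipping is substantial and the result should not be trusted.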


One suggestion in the comments was to look at the Pearson distributions. You may have some success adapting the following. With scipy's pearson3 we can specify the mean, the standard deviation (and hence the variance) and the skew. Suppose we want to achieve [50,4,-1] for the mean, variance and skew.

import scipy.stats as ss
import numpy as np
# Note: scale is the standard deviation in scipy.stats, so a variance of 4 means scale=2
sample = ss.pearson3.rvs(loc=50, scale=2, skew=-1,  size=1000)
print(np.mean(sample),
np.var(sample),
ss.skew(sample),
ss.kurtosis(sample))

Output: 49.96935451394343 4.268526621056802 -1.005394332195325 1.3883750132508892

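Close to what we were hoping for in the mean, variance and skew. The kurtosis, however, is not a free parameter here: for this distribution it is determined by the skew.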
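
To see which moments this distribution actually lets us choose, the theoretical values can be queried with the generic `stats` method that scipy.stats distributions provide. This is a small sketch using the same parameters as the sample above. For Pearson type III the excess kurtosis equals 1.5 * skew**2, so it is fixed once the skew is chosen.

```python
import scipy.stats as ss

# Theoretical mean, variance, skew, and excess kurtosis for the same
# parameters used to draw the sample (skew=-1, loc=50, scale=2)
mean, var, skew, exkurt = (
    float(v) for v in ss.pearson3.stats(-1, loc=50, scale=2, moments="mvsk")
)
print(mean, var, skew, exkurt)

# Excess kurtosis implied by the skew alone
print(1.5 * skew**2)
```

The theoretical excess kurtosis is 1.5, which is consistent with the sample value of about 1.39 and shows that pearson3 on its own cannot hit an arbitrary fourth moment.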

If you use R, I think more options are available to you in the PearsonDS package. If you must use Python, it is possible to use R within Python to generate your data samples, using the rpy2 module.