We are currently adding some basic functionality for directional statistics to SciPy. Directional statistics deals with data whose magnitude does not matter, such as unit vectors.

While implementing the directional variance, we noticed that there are competing definitions in the literature. We would like some more opinions on this to see which functionality would be most useful for SciPy users.

Directional variance is always related to the mean resultant length $R$, which is elegantly illustrated for circular data here. For circular data, the canonical definition of variance is then $1-R$. For general directional data, though, sources disagree and define it as either

  • $1-R$
  • $2(1-R)$
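For concreteness, here is a minimal sketch of the quantities involved, using plain NumPy. It assumes the samples are stored one vector per row and are normalised to unit length first; the helper name `mean_resultant_length` is only illustrative and is not meant to suggest an existing SciPy function.

```python
import numpy as np

def mean_resultant_length(samples):
    """Length of the mean of the unit vectors in `samples` (one vector per row).

    Hypothetical helper for illustration only.
    """
    samples = np.asarray(samples, dtype=float)
    # Discard magnitudes: only the directions matter.
    unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    return np.linalg.norm(unit.mean(axis=0))

# Four directions on the unit circle, clustered around the positive x-axis.
angles = np.deg2rad([-20.0, -5.0, 10.0, 15.0])
data = np.column_stack([np.cos(angles), np.sin(angles)])

R = mean_resultant_length(data)
variance_a = 1 - R        # first candidate definition
variance_b = 2 * (1 - R)  # second candidate definition
print(R, variance_a, variance_b)
```

Tightly clustered directions give $R$ close to 1 and both candidate variances close to 0. Directions that cancel out give $R$ close to 0. The two definitions differ only by the factor 2.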

The question is now whether SciPy should directly implement one of those or just expose the mean resultant length. The choices are:

  • scipy.stats.directional_variance : $1-R$
  • scipy.stats.directional_variance : $2(1-R)$
  • scipy.stats.mean_resultant_length
  • scipy.stats.directional_mean, returning both the directional mean and the mean resultant length

We would appreciate some thoughts from the statistics community. Thanks in advance!

Tyrion
  • Thanks for the compliment on my plot, but I would recommend the more elegant updated version which can be found as Figure 3 in here – Kees Mulder Sep 18 '22 at 09:52
  • Another small remark: I feel like it's slightly more conventional to write the length of the resultant vector as $R$, and the mean resultant length as $\bar{R}$. – Kees Mulder Sep 18 '22 at 09:54

1 Answer

In my opinion, the only quantity related to directional variance that is in some sense 'natural' or 'canonical' is the mean resultant length. I would argue the same for circular data, as I don't find $(1-R)$ particularly appealing even in that setting.

What is 'natural' or 'canonical' is (for me) mostly determined by whether the definition leads to simple equations. From this perspective, the mean resultant length wins by a landslide, as it is ubiquitous in formulae relating to the distribution of the parameters (e.g. for the von Mises-Fisher distribution), as well as in many other computations in directional statistics. Situations where the directional variance (in whichever formulation) would simplify the equations are really rare.

From that perspective, the directional variances, as well as the circular standard deviation (which also doesn't have a 'natural' definition), are simply arbitrary measures of spread. They can sometimes be useful to compare the spread in different samples. However, their interpretation requires the user to be familiar with the scale and behaviour of these measures (e.g. what does it mean if the value doubles?).

As a result, I recommend thinking about it as follows:

  • Measure of location:
    • Linear: mean
    • Directional: mean direction
  • Measure of dispersion:
    • Linear: variance/standard deviation measures spread
    • Directional: mean resultant length measures precision

So to map this to the direct question, my two cents would be:

  • Directly expose the mean resultant length as scipy.stats.mean_resultant_length.
  • Directly expose the mean direction as scipy.stats.mean_direction, which returns a unit vector in the mean direction (arguably you could simplify this to a scalar angle in the circular case).
  • Don't return the mean resultant length from the mean direction function. Instead, implement scipy.stats.resultant_vector, which is $R\boldsymbol{\mu}$, and/or scipy.stats.mean_resultant_vector, which is $\bar{R}\boldsymbol{\mu} = \frac{R}{n} \boldsymbol{\mu}.$
  • Optionally provide 'additional' measures of spread such as both the directional variances, but don't focus on these in formulations or documentation, and don't use them in calculations in directional statistics functions in scipy.
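To make the relationship between these quantities concrete, here is a rough NumPy sketch. The function names mirror the ones proposed above but are stand-alone illustrations, not SciPy code; the one-sample-per-row layout and the normalisation step are assumptions.

```python
import numpy as np

def resultant_vector(samples):
    """Sum of the normalised samples, i.e. R * mu (one sample per row)."""
    samples = np.asarray(samples, dtype=float)
    unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    return unit.sum(axis=0)

def mean_resultant_vector(samples):
    """Resultant vector divided by the sample size, i.e. (R / n) * mu."""
    return resultant_vector(samples) / len(samples)

def mean_resultant_length(samples):
    """Length of the mean resultant vector, between 0 and 1."""
    return np.linalg.norm(mean_resultant_vector(samples))

def mean_direction(samples):
    """Unit vector mu; undefined when the resultant vector is zero."""
    r = resultant_vector(samples)
    return r / np.linalg.norm(r)

# Three orthogonal directions on the unit sphere.
data = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])

print(resultant_vector(data))
print(mean_resultant_vector(data))
print(mean_resultant_length(data))
print(mean_direction(data))
```

Everything is derived from the resultant vector, so the mean direction and the mean resultant length can stay separate functions without duplicating much work.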
Kees Mulder