fitting non-normal multivariate distributions in R

Question

I have many (n=317,823) observations on two variables. I want to fit a bivariate distribution to my observations, in order to identify descriptive features of the distribution (quantiles). However, my data do not appear normal or log-normal and I haven't found a package on the relevant CRAN task view that can help me. I am hoping to learn:

if there is an existing workflow to fit a somewhat idiosyncratic pdf like the one below
whether I am 'asking the wrong question' given my weak math skills. Maybe there is an easier way to approach my problem.

Context: my data originate from satellite observations of forest harvest around the world. For a randomly selected subset of these observations (1 pixel/100 km²) I have sampled two global raster layers, one showing forest canopy height and one showing time to access cities. I believe that these are two observable aspects (height, accessibility) of a multidimensional distribution of 'forest quality', which contains many other unobservable aspects (e.g., species composition). I am trying to characterize the joint distribution so that I may categorize subsequent observations of forest harvest as falling within particular (joint) quantiles.

Sampling data plotted in base via table() and persp():

Data are strictly positive by construction...and that's a funny-looking (asymmetrical) pdf. Clearly the long tail is important. I used the excellent R package {fitdistrplus} to look at each variable separately, following the vignette.

Some diagnostic results for dimension 'height':

...and for dimension 'accessibility to cities':

The first looks like it could be described by a log-normal or gamma, and the second by a gamma...possibly a bivariate gamma distribution could fit the joint density well?
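For reference, the per-variable comparison can be sketched roughly as below. This is a minimal sketch rather than the exact code behind the diagnostics: the `height` and `access` vectors are simulated stand-ins with assumed parameters, and only `fitdist()` and `gofstat()` from {fitdistrplus} are used.

```r
library(fitdistrplus)

set.seed(42)
# Simulated stand-ins for the two strictly positive variables (assumed values,
# not the real satellite data)
height <- rlnorm(2000, meanlog = 2.5, sdlog = 0.5)
access <- rgamma(2000, shape = 1.5, rate = 0.5)

# Fit both candidate families to each variable by maximum likelihood
fits_height <- list(lnorm = fitdist(height, "lnorm"),
                    gamma = fitdist(height, "gamma"))
fits_access <- list(lnorm = fitdist(access, "lnorm"),
                    gamma = fitdist(access, "gamma"))

# Compare the candidates by AIC (lower is better)
aic_height <- sapply(fits_height, function(f) f$aic)
aic_access <- sapply(fits_access, function(f) f$aic)
print(aic_height)
print(aic_access)

# Goodness-of-fit statistics for the 'height' candidates
print(gofstat(fits_height, fitnames = names(fits_height)))
```

Comparing AIC across families only ranks the candidates; it does not say that either one fits well, so the diagnostic plots still matter.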

Current approach (my best option) is to accept the (large) inaccuracy and model the joint distribution as log-normal, e.g. by taking logs of both variables and fitting with package fMultivar, and then attempting to work out isolines as described in this post.
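To make the log-normal idea concrete, here is a minimal base-R sketch (it does not use fMultivar, and the data are simulated stand-ins with assumed parameters). If the logged pair were bivariate normal, the squared Mahalanobis distance of each point would follow a chi-square distribution with 2 degrees of freedom, so each observation can be assigned to the smallest elliptical region, in log space, that contains it. Note that these regions are defined on the log scale; they are not highest-density regions of the original variables.

```r
set.seed(1)
# Simulated stand-in for the real data (assumed values, strictly positive)
n <- 5000
height <- rlnorm(n, meanlog = 2.5, sdlog = 0.5)
access <- rgamma(n, shape = 1.2, rate = 0.01)

# Work on the log scale and estimate the bivariate normal parameters
logxy <- cbind(log_height = log(height), log_access = log(access))
mu <- colMeans(logxy)
Sigma <- cov(logxy)

# Squared Mahalanobis distance of every observation from the fitted centre
d2 <- mahalanobis(logxy, center = mu, cov = Sigma)

# Under bivariate normality d2 ~ chi-square(2): p_region is the probability
# content of the smallest ellipse (isoline) that contains the observation
p_region <- pchisq(d2, df = 2)

# Label observations by joint region, from "common" to "rare"
band <- cut(p_region,
            breaks = c(0, 0.5, 0.9, 0.99, 1),
            labels = c("core 50%", "50-90%", "90-99%", "outer 1%"),
            include.lowest = TRUE)

# Observed share per band; large departures from 0.50/0.40/0.09/0.01
# indicate how poorly the log-normal assumption holds
print(table(band) / n)
```

A new observation can be classified the same way by passing its logged values to `mahalanobis()` with the stored `mu` and `Sigma`.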

Here are two downsampled versions of the dataset, obtained as data[sample(nrow(data), round(nrow(data)/scale)), ], with scale=10 ("medium") or scale=100 ("small").
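For reproducibility, the downsampling step looks roughly like this. The data frame here is a simulated placeholder with assumed column names, and `downsample()` is a hypothetical helper wrapping the one-liner above.

```r
set.seed(2019)
# Placeholder for the full dataset (assumed column names and values)
data <- data.frame(height = rlnorm(317823, meanlog = 2.5, sdlog = 0.5),
                   access = rgamma(317823, shape = 1.2, rate = 0.01))

# Hypothetical helper: keep a simple random sample of nrow(data)/scale rows
downsample <- function(d, scale) {
  d[sample(nrow(d), round(nrow(d) / scale)), ]
}

medium <- downsample(data, scale = 10)
small  <- downsample(data, scale = 100)

print(c(medium = nrow(medium), small = nrow(small)))
```

Because `sample()` draws without replacement by default, no row appears twice in either subset.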

here is a related-but-not-useful post in which OP was told to take a different approach — antifrax, Mar 26 '19 at 12:59
There's the mixsmsn package for fitting skew normal, Student $t$, skew $t$ etc. distributions. Might be worth a try. — corey979, Mar 26 '19 at 17:12
My approach would be to re-sample down to a 50x50 grid (2500 total data points) and perform an equation search, then fit all of the data to that equation. I would down-sample as an equation search with 300,000 data points would be too much for the equipment I personally have available, and a 50x50 re-sample grid visually looks like it would be quite sufficient for a good equation search based on your plots. — James Phillips, Mar 26 '19 at 20:06
@corey979, those distributions are defined over negative values: my values (height, travel time) are strictly positive (0s were dropped as bad data). I've dialed down the bin width on the accessibility histogram to show that distribution better. I don't think the smallest bin is important (can explain) so that pretty much looks like decreasing exponential to me. — antifrax, Mar 26 '19 at 22:59
@JamesPhillips but what equations (== distributions?) would you use? I can post downsampled data. — antifrax, Mar 26 '19 at 23:03
My open source curve and surface fitting web site, zunzun.com, has hundreds of known, named equations and a "function finder" to perform equation searches using them. Post a link to the downsampled data and I will see if it might suggest any candidate surface equations of the form "z = f(x,y)". It's worth a try to see what turns up. — James Phillips, Mar 27 '19 at 00:41
posts about bivariate gamma. You could also try copulas, [tag:copula]. The R package VGAM does have a bivariate gamma implementation. Can you post a link to the data? — kjetil b halvorsen, Mar 27 '19 at 09:17
Thanks for these replies! Downsampled data has been added. @JamesPhillips, neat site! just a reminder that my problem treats the joint distribution as a PDF (I am interested in analyzing percentiles/quantiles, e.g., "rare" and "common" intervals). — antifrax, Mar 27 '19 at 20:47
Do you mean that a fitted surface equation of the form "z=f(x,y)" would not be of any use to you? — James Phillips, Mar 27 '19 at 22:40
I'm not sure. In this question I ask about fitting a bivariate Probability Density Function. This reflects a "theoretical" position (I'm sampling from an underlying distribution of which I can observe only two dimensions) and a "practical" concern (I want to talk about how rare or common observations are, <=> how probable they are). "z=f(x,y)" can describe the surface of a PDF only...I suppose one could integrate over a surface equation and call that a PDF? But this is beyond me, & I am hoping to stick to well-trod ground. — antifrax, Mar 28 '19 at 01:26
Thank you for clarifying. My now-understood-to-be-incorrect approach will not work for you here, my apology. — James Phillips, Mar 28 '19 at 01:49
Not at all; thank you for your attention/suggestions. @kjetilbhalvorsen, 'VGAM::bigamma.mckay' apparently requires Pr(X<Y)=1 (as I guess you know). This is not the case in ~5% of my data...and the generation procedure referenced in that link ('X is sum of a subset of squares of Y') doesn't make sense with my data...also wouldn't the X<Y condition be kind of nonsensical when X and Y are measured in different units? (as here). Will post bigamma.mckay fits if I can wrangle nice plots & will try copulas next. — antifrax, Mar 28 '19 at 14:52
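As a starting point for the copula route raised in the comments, here is a minimal base-R sketch of its first step: separating the marginals from the dependence. It does not fit any copula family. The data are simulated with assumed parameters, each variable is mapped to (0, 1) through its own ranks (pseudo-observations), and dependence is summarised with Kendall's tau, which is unaffected by the units of either variable.

```r
set.seed(3)
n <- 2000
# Simulated dependent, strictly positive pair (assumed parameters)
z <- rnorm(n)
w <- -0.6 * z + 0.8 * rnorm(n)        # standard normal, correlated with z
height <- exp(2.5 + 0.5 * z)          # log-normal marginal
access <- qgamma(pnorm(w), shape = 1.2, rate = 0.01)  # gamma marginal

# Pseudo-observations: rank-transform each margin to the open interval (0, 1)
u <- rank(height) / (n + 1)
v <- rank(access) / (n + 1)

# Rank-based dependence measure on the pseudo-observations
tau <- cor(u, v, method = "kendall")

# Changing units (a monotone increasing rescaling) leaves tau unchanged
tau_rescaled <- cor(height * 100, access / 60, method = "kendall")

print(c(tau = tau, tau_rescaled = tau_rescaled))
```

A parametric copula would then be fitted to the pairs `(u, v)`, with the marginals (for example the log-normal and gamma fits above) modelled separately.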
