Log transformation, normalization and skewness

Question

I'm a French resident in haematology working on myeloma.

I have a continuous variable (MRD) measured in 360 patients, ranging from 0 to 7%. MRD is a measure of residual disease, obtained with an assay whose sensitivity is 10^-5, so any value below 10^-5 is reported as 0.00000.

I want to analyse this continuous variable in a regression model, so I think I need it to fit a normal distribution.

The series contains plenty of values equal to 0 because of the sensitivity limit of the assay (10^-5). How would you transform the variable, given these 0 values, to obtain a normally distributed variable?

A log transform seems like a good idea, but I'm not sure how to replace the 0 values in a way that remains meaningful.

Thank you very much for your help!

Edit - 02/02/2023 -

The aim is to:

estimate the MRD distribution in each subgroup (n=2 or 3) of my population (these subgroups have well-known, different prognoses)

I also want to compare the MRD distributions of the subgroups in order to "estimate disease clearance kinetics" for each.

MRD would then be more a predictor than an outcome. Using a Cox model and hazard ratios, I wish to provide information such as:

"In this subgroup... this hazard ratio represents the decrease in risk of an event that is associated with each log-fold reduction in MRD."

NB: The outcomes are survival outcomes: overall survival, progression-free survival and relapse risk.

There's generally no need for any variable in a model to have a normal distribution. Is MRD an outcome variable or do you intend to use it as a predictor for some other variable? There are ways to use this type of variable successfully in either case, but the best choice depends on those details. Please edit the question to provide those details about your situation, as comments are easy to overlook and can be deleted. — EdM, Jan 28 '23 at 14:13

score 2 · Answer 1 · answered Feb 03 '23 at 03:00

There is no reason to force a set of predictor values into a normal distribution. Some ways of teaching least-squares regression (e.g., starting from correlations of normally distributed variables) might seem to begin with such an assumption, but there is absolutely no normality assumption about either the predictors or the outcome variables themselves in regression.

Your data are left censored at an MRD of $10^{-5}$. That means the actual MRD is somewhere between 0 and $10^{-5}$, but you don't know exactly where. This answer suggests a simple way to start. Add a separate binary predictor variable indicating whether the MRD value was observed or was set to 0 due to left censoring. For data points set to 0 due to left censoring, re-set their MRD values to the lower detection limit of $10^{-5}$. You might then transform all the MRD values to $\log_{10}(\text{MRD}) +5$ so that the left-censoring value is transformed to 0 and you get log-transformed values above that.
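As a rough illustration of the recoding described above, here is a minimal Python sketch. It assumes MRD is stored as a fraction (so 7% is 0.07) and that values below the detection limit were recorded as 0; the function name `transform_mrd` and the example values are hypothetical.

```python
import numpy as np

DETECTION_LIMIT = 1e-5  # assay sensitivity stated in the question


def transform_mrd(mrd):
    """Return (censoring indicator, log10(MRD) + 5) for an array of MRD fractions."""
    mrd = np.asarray(mrd, dtype=float)
    # Values below the detection limit were reported as 0 by the assay.
    below_limit = mrd < DETECTION_LIMIT
    # Reset left-censored values to the detection limit before taking logs.
    floored = np.where(below_limit, DETECTION_LIMIT, mrd)
    # Shift by 5 so that the detection limit maps to 0 on the new scale.
    log_mrd = np.log10(floored) + 5.0
    return below_limit.astype(int), log_mrd


# Hypothetical example values (fractions, not percentages).
mrd_fraction = np.array([0.0, 1e-5, 1e-4, 1e-3, 0.07])
censored, log_mrd = transform_mrd(mrd_fraction)
```

Both `censored` and `log_mrd` would then enter the survival model as predictors.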

The regression coefficient for the binary predictor variable gives an overall estimate of the association of "0 MRD" in the original scale with outcome. The coefficient(s) for the log-transformed data (after resetting the 0 values to $10^{-5}$) provide the extra association with outcome as MRD values increase above the detection limit.

It's unlikely that there will be an exactly linear association between the log-transformed MRD and outcome. Modeling with a flexible regression spline will let the data tell you the nature of the actual association. With what seem to be many MRD values at or near the left-censoring limit, you might want to specify the knots of those regression splines yourself instead of relying on defaults that simply use quantiles of values.
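To show what placing knots by hand can look like, here is a small Python sketch. It deliberately uses a piecewise-linear spline basis rather than the restricted cubic splines that a statistics package would normally provide, simply because it is easy to write out in full. The knot positions and the helper name `linear_spline_basis` are hypothetical, chosen for illustration on the $\log_{10}(\text{MRD}) + 5$ scale.

```python
import numpy as np


def linear_spline_basis(x, knots):
    """Piecewise-linear spline basis: x itself plus one hinge term per knot."""
    x = np.asarray(x, dtype=float)
    # Each hinge term max(x - k, 0) lets the slope change at knot k.
    columns = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(columns)


# Hypothetical transformed MRD values, with several at the censoring limit (0).
log_mrd = np.array([0.0, 0.0, 0.5, 1.0, 2.0, 3.5])
# Hand-picked knots above the censoring limit; quantile-based defaults could
# pile up at 0 when many values are censored.
knots = [1.0, 2.0, 3.0]
basis = linear_spline_basis(log_mrd, knots)
```

The columns of `basis` would replace the single log-transformed MRD term in the model, alongside the binary censoring indicator.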

Also, recognize that the MRD values themselves are subject to substantial measurement error, particularly near the left-censoring limit. The MRD values in myeloma are typically determined by flow cytometry, with markers to distinguish normal from abnormal cells. The MRD is the fraction of total cells that are deemed abnormal. If $10^5$ cells are analyzed and none are deemed abnormal, the upper 95% confidence limit for the actual MRD value is still about $3/10^5$ (the "rule of three"). A more sophisticated analysis might incorporate that type of uncertainty into a Bayesian model evaluating survival as a function of measured MRD. That's outside my expertise, but this page points the way to that type of analysis with left-censored predictor values.
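The figure of $3/10^5$ can be checked with a short calculation. The sketch below assumes a simple binomial model in which each analyzed cell is independently abnormal with probability $p$ (an assumption made here for illustration); with zero abnormal cells out of $n$, the upper bound solves $(1-p)^n = 0.05$. The function name is hypothetical.

```python
import math


def upper_bound_zero_events(n_cells, confidence=0.95):
    """Upper confidence bound on p when 0 of n_cells are abnormal (binomial model)."""
    alpha = 1.0 - confidence
    # Solve (1 - p)^n = alpha for p; expm1 keeps precision for very small p.
    return -math.expm1(math.log(alpha) / n_cells)


exact = upper_bound_zero_events(100_000)
rule_of_three = 3 / 100_000  # the usual approximation, since -ln(0.05) is about 3
```

Under this model the exact bound comes out slightly below the rule-of-three approximation, so $3/10^5$ is a mildly conservative figure.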

bison2178 · Answer 2 · 2023-02-03T06:22:24.820

I am most likely echoing what @EdM mentioned above. For negative values, I would suggest adding a small constant offset, chosen so that the lowest negative value becomes 0.001; all the other values are then shifted up by the same offset.

There is no rule that a continuous outcome should be normally distributed; only the residuals after the model fit should be normally distributed. You could also try a truncated normal distribution to ignore values below the threshold.
