
Disclaimer: I'm fascinated by statistics but it is not my greatest field. I am a rookie who has taken a few basic stats classes and majored in data science. I have the opportunity to use stats more in my career due to my current project and I'm very excited, but I want to seek guidance from experts instead of trying to figure it out by myself.

I am not asking anybody to solve this problem for me, I'm looking for advice on how to approach it.

I have two distributions of numbers. They are NOT normally distributed. They represent the signal strength of two different radio transmitters. One of the transmitters is more powerful than the other and the signal strengths in that distribution are higher by a statistically significant margin. I need those two distributions to look the same. I need to adjust the signal strengths sent by the weaker transmitter to match the equivalent signal sent by the stronger transmitter.

I started by standardizing these distributions, but I'm wondering if there's more that can be done. I have this concept of a single variable curve, something that knows how much weaker the signal is at that given signal level and adjusts accordingly. Maybe when the signal strength is 10, the stronger device is only 1-2 points higher, when the strength is 50 it may average 4-5 points higher, and when it reaches 80 say they are the same. Can this be accomplished with any statistical methods?

Thanks for reading and I'll appreciate any feedback.

2 Answers


I am not sure exactly what your goal is, nor whether transformations of your data would be helpful in reaching that goal. However, some of what you want to do can be done, and (with some reservations) I'll show you a method for transforming a sample from a continuous distribution to be approximately normal.

Consider samples of size $n=1000$ from each of three non-normal distributions, as follows:

set.seed(2022)
w = rexp(1000, 1/20)
x = rgamma(1000, 4, 1/10)
y = rbeta(1000, 10, 3)

par(mfrow=c(1,3))
hdr1 = "Exponential, mean 20"
hist(w, prob=T, col="skyblue2", main=hdr1)
hdr2 = "Gamma(shape=4, rate=0.1)"   # matches rgamma(1000, 4, 1/10) above
hist(x, prob=T, col="skyblue2", main=hdr2)
hdr3 = "Beta(10, 3)"                # matches rbeta(1000, 10, 3) above
hist(y, prob=T, col="skyblue2", main=hdr3)
par(mfrow=c(1,1))


Next, rank transformations (divided by a little more than the sample size) can be used to make the samples approximately $\mathsf{Unif}(0,1).$

wr = rank(w)/1010
xr = rank(x)/1010
yr = rank(y)/1010

par(mfrow=c(1,3))
hdr1 = "Rank transform: Exponential"
hist(wr, prob=T, col="skyblue2", main=hdr1)
hdr2 = "Rank transform: Gamma"
hist(xr, prob=T, col="skyblue2", main=hdr2)
hdr3 = "Rank transform: Beta"
hist(yr, prob=T, col="skyblue2", main=hdr3)
par(mfrow=c(1,1))


Finally, transforming by an appropriate normal quantile function (inverse CDF) will give an approximately normal sample, with approximately the same means and SDs as the original samples. [If you were to omit the original sample means and SDs from the calls, the default mean would be $0$ and the default SD $1$.]

wz = qnorm(wr, mean(w), sd(w))
xz = qnorm(xr, mean(x), sd(x))
yz = qnorm(yr, mean(y), sd(y))

par(mfrow=c(1,3))
hdr1 = "Quantile transform: Exponential"
hist(wz, prob=T, col="skyblue2", main=hdr1)
hdr2 = "Quantile transform: Gamma"
hist(xz, prob=T, col="skyblue2", main=hdr2)
hdr3 = "Quantile transform: Beta"
hist(yz, prob=T, col="skyblue2", main=hdr3)
par(mfrow=c(1,1))


A considerable reservation is that you would have to take care doing statistical analysis on the final "normal" samples, unless you are sure they make sense in the context of your data and objectives. (For example, two of my three normal samples take negative values; what would a negative signal strength mean?) Perhaps not everything that can be done, should be done. Also, a graphic display of any transformed data must include an explanation of what transformations were made and why.
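For reuse, the three steps above (rank, rescale to the unit interval, apply the normal quantile function) can be wrapped in one helper. This is only a sketch: `to_normal` is a hypothetical name, and dividing by $n+1$ is an assumed choice in place of the 1010 used above; it keeps the rescaled ranks strictly inside $(0,1)$ and symmetric about $1/2$.

```r
# Hypothetical helper (not part of the code above): rank-based transform
# of a numeric sample to approximate normality.
to_normal <- function(v) {
  u <- rank(v) / (length(v) + 1)   # approximately Unif(0,1), never exactly 0 or 1
  qnorm(u, mean(v), sd(v))         # map to a normal with the sample's mean and SD
}

set.seed(2022)
w  <- rexp(1000, 1/20)
wz <- to_normal(w)

c(mean(w), mean(wz))   # the two means agree up to rounding
c(sd(w), sd(wz))       # the two SDs are close, not identical
```

Because the transform is monotone, the ordering of the observations is unchanged; only their spacing changes.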

BruceET

I need those two distributions to look the same. I need to adjust the signal strengths sent by the weaker transmitter to match the equivalent signal sent by the stronger transmitter.

@BruceET shows how to turn an arbitrary distribution into a parametric one, like the normal. An alternative, which you could perhaps call non-parametric, is to make the final distribution a "merge" of the two (or more) input ones. That way you don't need to decide which final distribution to aim for, and the normalized distribution retains some characteristics of the inputs. This technique, quantile normalization, has been used for normalizing gene expression arrays.

The idea is:

  • Sort each input distribution by signal value - you obtain a table with as many rows as observations and each column is a radio transmitter (two columns in your case, there could be more). Note that each column is sorted

  • Take the average of each row - this is the normalized value

  • Re-order the input distributions to return to the original order. Use the averaged value for further analysis.

Here's an example implementation in R to show the idea:

set.seed(2022)

# Original distributions
w <- rexp(1000, 1/20)
x <- rgamma(1000, 4, 1/10)

boxplot(cbind(w, x))


# Store original order in columns ow, ox 
wdat <- data.frame(w, ow= 1:length(w))
xdat <- data.frame(x, ox= 1:length(x))

# Sort by signal strength and average
dat <- cbind(wdat[order(wdat$w), ], xdat[order(xdat$x), ])
dat$qq <- rowMeans(dat[, c('w', 'x')])

dat[1:10,]
             w  ow        x  ox       qq
859 0.03692582 859 4.556609 793 2.296767
925 0.06606661 925 4.700764  74 2.383415
162 0.07348524 162 6.246338 290 3.159911
404 0.11605468 404 7.248184 639 3.682119
242 0.12467654 242 7.378862 162 3.751770
527 0.13326337 527 7.542257 535 3.837760
308 0.16164975 308 7.640755 960 3.901203
418 0.17949846 418 7.734142 665 3.956820
103 0.19492898 103 7.907784 311 4.051356
526 0.20554043 526 7.919662 853 4.062601

Return to the original order:

qwdat <- dat[order(dat$ow), c('w', 'qq')]
qxdat <- dat[order(dat$ox), c('x', 'qq')]

Plot normalized values:

boxplot(data.frame(qw= qwdat$qq, qx= qxdat$qq))


In practice, you should use an implementation that nicely handles NAs and ties like the function I linked above, assuming you use R.
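As a rough sketch of the same three steps in function form, generalized to any number of transmitters: `quantile_normalize` below is a hypothetical name, not the linked function. It assumes a numeric matrix with one column per transmitter, equal sample sizes, and no NAs, and it breaks ties by position rather than handling them carefully. The third column `y` is made-up data added only to show more than two inputs.

```r
# Hypothetical helper: quantile-normalize the columns of a numeric matrix.
# Assumes no NAs and equal-length columns; ties are broken by position.
quantile_normalize <- function(m) {
  sorted <- apply(m, 2, sort)                         # sort each column independently
  target <- rowMeans(sorted)                          # average across columns, rank by rank
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank of each value within its column
  out    <- apply(ranks, 2, function(r) target[r])    # put averaged values back in original order
  dimnames(out) <- dimnames(m)
  out
}

set.seed(2022)
m <- cbind(w = rexp(1000, 1/20),
           x = rgamma(1000, 4, 1/10),
           y = 100 * rbeta(1000, 10, 3))   # illustrative third transmitter
qn <- quantile_normalize(m)

boxplot(qn)   # all columns now share one distribution
```

Plotting `qn[, "w"] - m[, "w"]` against `m[, "w"]` shows how much each original value was shifted at each signal level, which is the kind of level-dependent adjustment curve described in the question.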

dariober