1

I am studying statistics and am working with R for the first time. I have some data that I am trying to represent it graphically. I have plotted a histogram and am trying to model it with gamma distribution. I have calculated the coefficients $\lambda$ and $n$ of this gamma distribution. I defined gamma_pdf as

gamma_pdf <- function(x, lambda, n) {
  area*lambda^(n)*x^(n - 1) * exp(-lambda * x) / gamma(n)
}

I have then plotted this histogram and the pdf of this gamma distribution (scaled by the area of the histogram) and gotten the following image.

enter image description here

I did this using ggplot with the following code:

ggplot(data, aes(x = cas)) +
geom_histogram() +
stat_function(fun = gamma_pdf, args = list(lambda = lambda, n = n), color = "red",linewidth = 1)

Now I am trying to transform (only) the x-scale logarithmically (with base 10). Just applying function scale_x_log10() doesn't work, as it yields:

enter image description here

I believe I have to apply the transformation formula for the random variable $Y = \log_{10} X$. Since $$ f_X(x) = \frac{\lambda^a}{\Gamma(a)} x^{a-1} e^{- \lambda x}, $$ we get that $$ f_Y(y) = 10^y \ln(10) \frac{\lambda^a}{\Gamma(a)}10^{y(a-1)} e^{- \lambda 10^y}. $$

So I defined

log_gamma_pdf <- function(x, lambda, n) {
  newarea*10^(x)*log(10) * lambda^(n) * 10^(x*(n-1)) * exp(-lambda*10^(x)) / gamma(n)
}

and plotted the following image,

enter image description here

using the following code

ggplot(data, aes(x = cas)) +
  geom_histogram() +
  stat_function(fun = log_gamma_pdf, args = list(lambda = lambda, n = n), color = "red",linewidth = 1)+
  scale_x_log10()

The shape appears to be better, but something still doesn't seem to go right. I have spent the entire day playing with this, but I have no idea what I'm doing wrong. Does anyone have any advice?


Edit: I came to this problem by trying to solve last year's exam problems. It specifically asks to transform the x-scale logarithmically and comment if this distribution is consistent with Poisson model (exponential distribution).

I am almost certain that the graph should fit nicely, because a friend who was taking this exam last year told me that he got the 2nd graph in my question, and stated that it did not seem consistent, but was then told by the professor that had he transformed the plot correctly, he should get that it was consistent.

Also, how could the original plot seem fully consistent, but when I transform it logarithmically, it seems completely off?

Jesus
  • 121
  • 1
    Greetings! Could you be more specific about why you think it may be off? – Shawn Hemelstrand Jul 02 '23 at 23:51
  • @ShawnHemelstrand ok I edited the question. How can I reopen it? – Jesus Jul 03 '23 at 06:52
  • 1
    As per CV rules, you need to add the self-study tag here. – utobi Jul 03 '23 at 13:22
  • Use the correct formula, as explained at https://stats.stackexchange.com/questions/14483/intuitive-explanation-for-density-of-transformed-variable/14490#14490. – whuber Jul 03 '23 at 13:30
  • @whuber I used the transformation formula in what appears to be a correct way. I don't see what on the link you said indicates otherwise. The only difference is that the function in your link was $x^2$, while the function I am working with is $\log_{10}(x)$. – Jesus Jul 03 '23 at 14:29
  • 1
    @Jesus I think I know what's going on. Basically scale_x_log10 doesn't change the underlying x values, i.e. y(x) = x would look like 10^x, and the histogram has increasing binwidths. If you make your code reproducible, as in define cas, area, ... I will gladly demonstrate what i mean exactly. – Lukas Lohse Jul 03 '23 at 15:23
  • @LukasLohse I don't know how to send the data to you, since it is in a CSV file, which consists of measurements of cas. As for area, this is always a factor which I get by multiplying the number of measurements by the binwidth of the histogram. If you have any idea how I can share all the files, I will. – Jesus Jul 03 '23 at 15:29
  • 1
    Ah, I'd assumed simulated data. Then I'm gonna fix something up myself later. – Lukas Lohse Jul 03 '23 at 15:36
  • @LukasLohse ok, thanks a lot :) – Jesus Jul 03 '23 at 15:40

1 Answers1

4

scale_x_log10 does not transform the x-variable but only the scale of the x-axis. We can see that if we plot $f(x) = x$ with it and the result looks like the graph of $10^x$:

ggplot() + 
  stat_function(fun = function(x){x}, color = "red",linewidth = 1) +
  scale_x_log10()

enter image description here

Now if you use geom_histogram with scale_x_log10 it does something interesting and keeps the graphical width of all the bins the same, which of means that the real width of the bins grows exponentially with every centimeter across the x-axis, but since the true $x$ values are maintained, the used binwidth is proportional to $x$, with a factor of $\log(10)$.

library(tidyverse)
n <- 10000
bw <- 0.1
# I used Rs internal Gamma-distribution which has a slightly different parametriaztion see help("rgamma")
df <- data.frame(x = rgamma(n, shape = 4, scale = 2))

gamma_pdf <- function(x){ dgamma(x, shape = 4, scale = 2) * bw * n }

untransformed

ggplot(df, aes(x = x)) + geom_histogram(binwidth = bw) + stat_function(fun = gamma_pdf, color = "red",linewidth = 1)

log10 fail

ggplot(df, aes(x = x))+ stat_function(fun = function(x){x}, color = "red",linewidth = 1) + scale_x_log10()

correct function in blue

ggplot(df, aes(x = x)) + geom_histogram(binwidth = bw) + stat_function(fun = gamma_pdf, color = "red",linewidth = 1) + stat_function(fun = function(x){gamma_pdf(x) * x * log(10)}, color = "blue",linewidth = 1) + scale_x_log10()

enter image description here

Of course if you actually transform and make a plot $\log_{10}(x)$ then you need the full transformed pdf:

ggplot(df, aes(x = log10(x))) +
  geom_histogram(binwidth = bw) +
  stat_function(fun = gamma_pdf, color = "red",linewidth = 1) +
  stat_function(fun = function(x){gamma_pdf(x) * x * log(10)}, color = "blue",linewidth = 1) +
  stat_function(fun = function(x){gamma_pdf(10^x) * 10^x * log(10)}, color = "green",linewidth = 1) 

enter image description here

If you look at the formulas you can see that they are almost the same, just replace $10^x$ in the transformed PDF with $x$.

Lukas Lohse
  • 2,482
  • 1
    Thank you a lot! I'm just not really sure how and why we get the factor with which we scale the graph to fit the distribution nicely. – Jesus Jul 03 '23 at 18:51
  • 1
    It comes from the derivative of $\log_{10}(x)$, just like in the pdf of the transformed variable. The derivative of a transformation can be interpreted as the local stretching factor. – Lukas Lohse Jul 03 '23 at 18:56
  • Ohhh I think I see exactly what scale_log does now! Thank you so much! – Jesus Jul 03 '23 at 18:58