Finding the distribution of my data is a crucial part of my thesis. I have to do this step in R, even though other tools can produce this information quickly. I searched for how to determine which distribution best fits a given variable, and the following instructions guided me a bit.

Instructions (via Stack Overflow): how-to-determine-which-distribution-fits-my-data-best

However, I am lost on how to obtain the distributions of all my variables, since I have about 18.

For example:

http://www.filedropper.com/samplest

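    library(fitdistrplus)

    # Read a semicolon-separated file; empty strings are treated as NA
    importeddata <- read.csv(file.choose(), sep = ";", na.strings = "",
                             stringsAsFactors = FALSE, header = TRUE)

    # Replace decimal commas with decimal points in every column
    for (i in seq_len(ncol(importeddata))) {
      importeddata[, i] <- gsub(",", ".", importeddata[, i])
    }

    # Convert every column to numeric and collect the result in a matrix
    xx <- as.matrix(as.data.frame(lapply(importeddata, as.numeric)))

    # Skewness-kurtosis (Cullen and Frey) plot for the first variable
    descdist(xx[, 1])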

[Image: Cullen and Frey graph produced by descdist]

It looks as if this variable might follow a uniform, beta, or normal distribution. Let's see.

    fit.norm <- fitdist(xx[,1], "norm")
    fit.norm
         Fitting of the distribution ' norm ' by maximum likelihood 
         Parameters:
              estimate Std. Error
         mean 13.428316  0.3652664
         sd    7.120353  0.2582823

    plot(fit.norm)

[Image: diagnostic plots produced by plot(fit.norm)]

However, fitting a beta distribution causes an error, because the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] and parametrized by two positive shape parameters, α and β, which appear as exponents of the random variable and control the shape of the distribution. The values of this variable lie outside that interval.

    fitdist(xx[,1], "beta")

    Error in start.arg.default(data10, distr = distname) : values must be in [0-1] to fit a beta distribution
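
One commonly used workaround, sketched here under assumptions, is to rescale the values into the open interval (0, 1) before trying a beta fit. The helper name `rescale_to_unit` is hypothetical, the three input values are only illustrative (they are numbers that appear in this post, not the full variable), and the slight shrinkage away from exactly 0 and 1 is a choice made for this sketch so that no value sits on the boundary.

```r
# Hypothetical helper: map a numeric vector onto the open interval (0, 1).
rescale_to_unit <- function(x, lower = min(x), upper = max(x)) {
  x <- x[!is.na(x)]                    # drop missing values first
  y <- (x - lower) / (upper - lower)   # linear map onto [0, 1]
  n <- length(y)
  (y * (n - 1) + 0.5) / n              # pull exact 0 and 1 slightly inward
}

# Illustrative values only, not the real data
x <- c(3.12, 12.48, 29.64)
u <- rescale_to_unit(x)
range(u)

# With fitdistrplus loaded, fitdist(u, "beta") could then be tried on u.
```

A beta distribution fitted this way describes the rescaled values, so its parameters are not directly comparable with a fit made on the original scale.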

    fit.uni <- fitdist(xx[,1], "unif")

    Fitting of the distribution ' unif ' by maximum likelihood 
    Parameters:
        estimate Std. Error
    min     3.12         NA
    max    29.64         NA

    plot(fit.uni)

    fit.uni$aic
    [1] NA

    fit.norm$aic
    [1] 2574.241
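
For reference, AIC is defined as 2k − 2 log L, where k is the number of estimated parameters and L is the maximized likelihood. The base-R sketch below computes it by hand for the normal case. The vector `x` is simulated and only stands in for `xx[,1]`, and all variable names are assumptions made for illustration.

```r
# Simulated stand-in for one variable (not the data from the question)
set.seed(123)
x <- rnorm(200, mean = 13.4, sd = 7.1)

# Maximum likelihood estimates for a normal distribution
mu_hat <- mean(x)
sd_hat <- sqrt(mean((x - mu_hat)^2))   # MLE divides by n, not n - 1

# Maximized log-likelihood and AIC with k = 2 parameters (mean and sd)
loglik   <- sum(dnorm(x, mean = mu_hat, sd = sd_hat, log = TRUE))
k        <- 2
aic_norm <- 2 * k - 2 * loglik
aic_norm
```

Computed on the same data, this value should agree closely with `fit.norm$aic`. A lower AIC indicates a better trade-off between fit and number of parameters among the candidates being compared.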

I have two questions:

  1. Can I say directly that this variable is normally distributed with mean 13.43 and standard deviation 7.12? How can I better compare the candidate distributions?
  2. Is there an alternative way to get this information? I will have to repeat this 18 times.
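
To make the second question concrete, the repetition itself could be automated roughly as follows. This is only a sketch: the matrix `xx` is simulated here with three columns instead of the real 18, the candidate list is an arbitrary assumption, and `aic_table` is a hypothetical name. It automates the mechanics of fitting and does not show that any of the candidates is appropriate for the data.

```r
library(fitdistrplus)

# Simulated stand-in for the imported numeric matrix `xx`
set.seed(42)
xx <- cbind(A = rnorm(100, mean = 13, sd = 3),
            B = rgamma(100, shape = 4, rate = 0.3),
            C = runif(100, min = 3, max = 30))

# Candidate distributions to try on every column (an assumption)
candidates <- c("norm", "gamma", "lnorm")

# One AIC per column and candidate; NA where a fit fails
aic_table <- sapply(colnames(xx), function(col) {
  x <- xx[, col]
  x <- x[!is.na(x)]
  sapply(candidates, function(d) {
    fit <- tryCatch(fitdist(x, d), error = function(e) NULL)
    if (is.null(fit)) NA_real_ else fit$aic
  })
})

aic_table
```

Each column of `aic_table` corresponds to one variable and each row to one candidate distribution, so the smallest value in a column marks the candidate with the lowest AIC for that variable.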
  • 1. Why is it necessary to identify a distribution? What are you using this to do? 2. Why consider only those particular distributions and not others? – Glen_b Sep 21 '16 at 11:56
  • Your data look to be distinctly discrete. Have they been binned? What do the numbers represent? – Glen_b Sep 21 '16 at 12:48
  • @Glen_b This data was gathered for market research; it includes the duration and the participants' answers to the questions asked. I wanted to analyze the normal, uniform, and gamma distributions, since the observations look close to them. I do not know exactly how to find the distribution of raw data, which is why I may seem lost. – can.u Sep 21 '16 at 18:54
  • @Glen_b As you said, I need to evaluate the data against other distributions. Should I follow the code at the following link? http://stackoverflow.com/questions/2661402/given-a-set-of-random-numbers-drawn-from-a-continuous-univariate-distribution-f – can.u Sep 21 '16 at 19:05
  • You don't explain how your data come to be discrete (yet somehow not integer); this discreteness may be an issue for all of your above choices. I don't want to recommend anything without understanding in detail how that discreteness arises for each variable. It's also still not clear what you need to fit a distributional form for. What are you going to do with it? What do you need a named distribution for that you couldn't get from the ECDF? – Glen_b Sep 21 '16 at 23:23
  • In response to your second comment: I did not say you need to evaluate other distributions. I asked why you chose the ones you did rather than something else. I sought insight into your reasons for choosing those vs some other possibility as a way of trying to get some idea what you're trying to achieve with all this. You should not follow the code at that link. Thinking is required here, not throwing a bunch of formulas at your data and trying to find something that sticks. – Glen_b Sep 21 '16 at 23:27
  • I'd like to understand how series A, for example, has a lot of "12.48" values. (Also, unless I am confused, your P-P plot looks like your axes are swapped.) – Glen_b Sep 21 '16 at 23:53