This is the data set I am working on. I am trying to predict count (the last column):

    datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
    2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0,16
    2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0,40
    2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0,32
    ...

The distribution of count decays roughly exponentially. I tried basic linear regression, but the results were poor. So I suspect that count has an exponential relationship with at least one of its predictors, and a linear relationship with some of the others.

How can I combine multiple linear and exponential regression? I am working with the Anaconda distribution of Python, but I'd also like to understand the theory behind the model if possible. Thanks!

Moebius
  • This may be relevant: http://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 – kjetil b halvorsen Apr 13 '16 at 15:32

1 Answer

You give very little to go on (what is being counted? why do you expect some relationships to be linear? what kinds of variable are they?), so some of this may not be completely appropriate.

The most obvious first thing to try would be a Poisson GLM with a log link, which would model such exponential relationships and be suitable for at least some count data. There are other possible alternatives.
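As a sketch of the theory behind that suggestion, a Poisson GLM with a log link can be written as below. The notation is introduced here for illustration and is not from the question: $y_i$ is the count in row $i$, $x_{ij}$ is the value of predictor $j$ in that row, and $\beta_j$ are the coefficients to estimate.

```latex
y_i \sim \operatorname{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j} \beta_j x_{ij}
\quad\Longleftrightarrow\quad
\mu_i = e^{\beta_0} \prod_{j} e^{\beta_j x_{ij}}
```

Under this model a one-unit increase in $x_j$ multiplies the expected count by $e^{\beta_j}$, which is the exponential relationship. If a predictor is entered as $\log x_j$ instead, its effect becomes the power law $\mu_i \propto x_{ij}^{\beta_j}$, which is proportional to $x_{ij}$ when $\beta_j = 1$. That is one way a single model can hold both kinds of relationship.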

You can find many posts relating to Poisson GLMs or Poisson regression on this site.
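A minimal sketch of such a fit in Python, using only NumPy, may make the idea concrete. Everything in it is an assumption made for illustration: the data are simulated rather than read from the asker's file, the predictor names `temp` and `workingday` are borrowed from the column header in the question, the coefficients in `true_beta` are made up, and `fit_poisson_glm` is a hypothetical helper that uses iteratively reweighted least squares (the standard way to fit a GLM).

```python
import numpy as np


def fit_poisson_glm(X, y, n_iter=25, tol=1e-8):
    """Fit log E[y] = X @ beta by iteratively reweighted least squares.

    Assumes the first column of X is a column of ones (the intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    # Start at the log of the overall mean so exp() stays well behaved.
    beta[0] = np.log(y.mean() + 1e-12)
    for _ in range(n_iter):
        eta = X @ beta                # linear predictor
        mu = np.exp(eta)              # expected count under the log link
        z = eta + (y - mu) / mu       # working response
        XtW = X.T * mu                # working weights equal mu for Poisson
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Simulated stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
n = 5000
temp = rng.uniform(0, 35, n)
workingday = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), temp, workingday])
true_beta = np.array([1.0, 0.05, 0.3])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_glm(X, y)
print("estimated coefficients:", beta_hat)
print("multiplicative effect per unit:", np.exp(beta_hat[1:]))
```

In practice a library routine is usually preferable to a hand-written loop. For example, statsmodels (included in Anaconda) offers `sm.GLM(y, X, family=sm.families.Poisson()).fit()`, where the Poisson family uses the log link by default.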

Glen_b
  • Thanks! What kind of exploratory data analysis should I do to answer those questions? I made a scatter matrix, which plots each variable against every other, but it did not show much. – Moebius Apr 13 '16 at 12:28
  • @Moebius None of the questions I asked should be answered from looking at the data. "What is being counted" you should already know; "why do you expect some relationships to be linear?" is a question about what caused you to say something you said, which you should already know; "what kinds of variable are they?" is an attempt to find out what type of variable the ones you expect to have linear relationships are (is it continuous? a count? categorical? ...) because that can impact how you deal with them -- and again, this is something you should already know. – Glen_b Apr 13 '16 at 14:50
  • The variable being counted is the variable count (last column of my data). I expect some of the relationships to be linear... in fact I don't have a clue how the variables interact with each other, so I just expect some of them to be linear. That's the point of regression, knowing how variables interact with each other, no? – Moebius Apr 13 '16 at 15:29
  • Let me try again. 1. What was counted? 2. What makes you think some variables will have linear relationships and some will not? Why is there a difference? 3. Where did mention of interactions come from? I didn't ask anything about them and you didn't previously mention interactions. 4. I don't think you can characterize the point of regression as knowing how variables interact. It may be some of it, but many regression models don't have interactions, for example, and even when they do, sometimes the point isn't to know about them per se. – Glen_b Apr 13 '16 at 15:50
  • What is counted is the number of bikes rented in a given hour (edited to add the datetime column in the sample data). By interacting, I mean that the variable could predict the count variable. – Moebius Apr 13 '16 at 15:55
  • "Interaction" has a particular meaning in statistics that differs from this sense of the word. I'd avoid using it to mean that. Is there a known limit to the number of rented bikes? (e.g. if this is data for a bike rental place which has 80 bikes and rents by the hour, they have a known limit of 80) -- this would perhaps suggest a binomial rather than a Poisson model. – Glen_b Apr 13 '16 at 22:41
  • No, there is no defined limit (the number of bikes is not infinite, but we don't know what it is). – Moebius Apr 14 '16 at 14:08