0

I am trying to do Poisson regression from scratch (only using numpy). I have the following code, but the output makes no sense to me.

import numpy as np

First I generate some data:

# generate data
x1 = np.random.normal(10, 5, 1000)
x2 = np.random.normal(20, 5, 1000)
error = np.random.normal(0, 1, 1000)
y = 3*x1 + 4*x2 + error
X = np.stack((x1,x2)).T

From this data generating process it is clear that the correct values are close to 3 and 4. Next I define my class to perform the regression:

class GLM:
def __init__(self):
    self.weights = np.ones(np.shape(X)[1])

def logLikelihoodDerivative(self, X, y):
    derivative = np.dot(y - np.exp(np.dot(X, self.weights)), X)

    print(self.weights) # to observe how weights are updated
    return derivative

def fit(self, X, y, numberOfIterations=10000):
    for _ in range(0, numberOfIterations):
        self.gradientAscent(X, y)

def gradientAscent(self, X, y, learningRate=0.0001):
    gradient = self.logLikelihoodDerivative(X, y)
    self.weights = self.weights + learningRate * gradient
    return self.weights

Now I apply my model:

test = GLM()
test.fit(X, y)
print("My output:     ", test.weights[0], " and ", test.weights[1])

Finally I try to check my outcome with the Python library statsmodels:

import statsmodels.api as sm
check = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print("Correct output:", check.params[0], " and ", check.params[1])

The results are:

My output:      nan  and  nan
Correct output: 2.9999999999999982  and  4.000000000000005

(Note: I am assuming a Poisson distribution).

In the code above the weights explode and become inf/-inf very quickly. I do not understand why.

What is going wrong in my code?

Xtiaan
  • 171
  • 5
  • You seem to confuse poisson and linear regression. If you want linear regression, then apply max like hood based on linear regression. Otherwise generate your data based on a poisson model, using $\lambda=\exp(3x_1+4x_2)$ – seanv507 Jun 13 '23 at 13:25
  • I want to do Poisson regression, a regression where the assumed distribution is the Poisson distribution. – Xtiaan Jun 13 '23 at 13:29
  • So first fix your data generation using eg https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.poisson.html – seanv507 Jun 13 '23 at 13:32
  • Add a log likelihood function to your code. And confirm you get a minimum at 3,4. (eg by fixing one coordinate at correct value and sweeping the other coordinate around the optimal value) – seanv507 Jun 13 '23 at 13:39
  • Whilst doing the gradient ascent, print out coordinates and log likelihood – seanv507 Jun 13 '23 at 13:40
  • @seanv507 Do you mean define x1 and x2 as x1 = np.random.poisson(10, 1000) and x2 = np.random.poisson(20, 1000)? Note that the data generating process is not very important, because I am comparing the outcome also with statsmodels and not just with the parameters of the data generating process – Xtiaan Jun 13 '23 at 13:43
  • No, your error and y assignment lines are wrong. Delete them. Define lambda as I specified, then generate random y by calling the numpy function I linked to using lam parameter= lambda – seanv507 Jun 13 '23 at 13:47
  • But I want to find regression coefficients using maximum likelihood, not lambda. – Xtiaan Jun 13 '23 at 13:50
  • I don't know why you are getting the correct value using stats models. I can only assume you are making further mistakes. So please fix your data generation first – seanv507 Jun 13 '23 at 13:51
  • I am explaining that you have to fix your data generation. ( you are still using max likelihood to find the correct parameters) – seanv507 Jun 13 '23 at 13:54

1 Answers1

3

You are doing multiple things wrong.

First, as noticed in the comments, your data-generating process is incorrect. If you want to do Poisson regression, the conditional distribution of your data should be Poisson, not Gaussian. The code should be something like

x1 = np.random.normal(10, 5, 1000)
x2 = np.random.normal(20, 5, 1000)
y = np.random.poisson(np.exp(3*x1 + 4*x2), 1000)
X = np.stack((x1,x2)).T

But this would throw

ValueError: lam value too large

Because np.exp(3*x1 + 4*x2) returns huge numbers.

The second problem is your choice of the optimization algorithm. Gradient descent is a rather poor algorithm. Both R and Python's statmodels use the iteratively reweighted least squares algorithm. There are also other alternatives like the Newton-Raphson algorithm. You could also use a general-purpose optimizer, but if you use a different algorithm than implemented in the software you are comparing to, you should not expect to see the same results.

Finally, gradient descent is sensitive to the choice of the learning rate. In the code you showed us, you are using a hard-coded learning rate. With different learning rates, you could see better results.

Tim
  • 138,066
  • I do not want to generate y, but define it such that I exactly know the true values of the regression coefficients. – Xtiaan Jun 13 '23 at 14:22
  • 1
    @Xtiaan nothing prevents you from doing this correctly. How you are doing it right now gives you data that does not follow Poisson distribution, so you should not expect Poisson regression to find the parameters. – Tim Jun 13 '23 at 14:24
  • How do you recommend I generate the data? I will look into other optimization algorithms. – Xtiaan Jun 13 '23 at 14:25
  • 1
    @Xtiaan see https://stats.stackexchange.com/questions/27443/generate-data-samples-from-poisson-regression – Tim Jun 13 '23 at 14:32