I am trying to implement Poisson regression from scratch (using only NumPy). I have the following code, but the output makes no sense to me.
import numpy as np
First I generate some data:
# generate data
x1 = np.random.normal(10, 5, 1000)
x2 = np.random.normal(20, 5, 1000)
error = np.random.normal(0, 1, 1000)
y = 3*x1 + 4*x2 + error
X = np.stack((x1,x2)).T
From this data generating process it is clear that the correct values are close to 3 and 4. Next I define my class to perform the regression:
class GLM:
    def __init__(self):
        self.weights = np.ones(np.shape(X)[1])

    def logLikelihoodDerivative(self, X, y):
        derivative = np.dot(y - np.exp(np.dot(X, self.weights)), X)
        print(self.weights)  # to observe how the weights are updated
        return derivative

    def fit(self, X, y, numberOfIterations=10000):
        for _ in range(0, numberOfIterations):
            self.gradientAscent(X, y)

    def gradientAscent(self, X, y, learningRate=0.0001):
        gradient = self.logLikelihoodDerivative(X, y)
        self.weights = self.weights + learningRate * gradient
        return self.weights
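For what it's worth, the update formula itself seems right: logLikelihoodDerivative computes the score of the Poisson log-likelihood with a log link, i.e. X^T (y - exp(Xw)). Here is a minimal finite-difference check of that formula on small, well-scaled synthetic data so that exp() stays in range (the seed, data sizes, and helper names are my own, not part of the code above):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
Xs = rng.normal(0, 0.5, size=(50, 2))        # small values keep exp() stable
w_true = np.array([0.3, -0.2])
ys = rng.poisson(np.exp(Xs @ w_true))        # Poisson counts with log link

def log_likelihood(w):
    eta = Xs @ w
    return np.sum(ys * eta - np.exp(eta))    # log(y!) term dropped (constant in w)

def score(w):
    return (ys - np.exp(Xs @ w)) @ Xs        # same form as logLikelihoodDerivative

w0 = np.zeros(2)
eps = 1e-6
numeric = np.array([
    (log_likelihood(w0 + eps * e) - log_likelihood(w0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(numeric, score(w0), atol=1e-4))  # True
```

So the analytic gradient matches a numerical one on tame data, which suggests the problem is elsewhere.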
Now I apply my model:
test = GLM()
test.fit(X, y)
print("My output: ", test.weights[0], " and ", test.weights[1])
Finally I try to check my outcome with the Python library statsmodels:
import statsmodels.api as sm
check = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print("Correct output:", check.params[0], " and ", check.params[1])
The results are:
My output: nan and nan
Correct output: 2.9999999999999982 and 4.000000000000005
(Note: I am assuming a Poisson distribution).
In the code above the weights explode and become inf/-inf within a few iterations, and I do not understand why.
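The blow-up can be reproduced in isolation. With the initial weights of all ones, the linear predictor np.dot(X, weights) averages around 10 + 20 = 30 for this data, so its exponential is astronomically large and the very first gradient step is huge. A standalone sketch (the seed is arbitrary and mine):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
x1 = rng.normal(10, 5, 1000)
x2 = rng.normal(20, 5, 1000)
X = np.stack((x1, x2)).T

w = np.ones(X.shape[1])          # the starting weights in GLM.__init__
eta = X @ w                      # linear predictor; mean is roughly 30
print(eta.mean())
print(np.exp(eta).max())         # enormous, so the first gradient is enormous too
```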
What is going wrong in my code?
x1 = np.random.poisson(10, 1000) and x2 = np.random.poisson(20, 1000)? Note that the data generating process is not very important, because I am comparing the outcome also with statsmodels and not just with the parameters of the data generating process – Xtiaan Jun 13 '23 at 13:43