
I'm having some trouble optimising softmax regression via Newton's method. I'm not sure whether the problem lies in my equations for the gradient and Hessian or in the code I've written. For a single sample $\mathbf{x} \in \mathbb{R}^{n_\text{features}}$, I define the probability vector $\mathbf{p} \in \mathbb{R}^{n_\text{classes}}$. Its $m$th component models the probability of class $m$ and is defined as

$$ p_m(\mathbf{x}) = \frac{e^{\mathbf{w}_m^{\mathrm{T}}\mathbf{x}}}{\sum_{n=1}^{n_{\text{classes}}}e^{\mathbf{w}_n^{\mathrm{T}}\mathbf{x}}},$$

where $\mathbf{w}_n\in \mathbb{R}^{n_\text{features}}$ is the weight vector for class $n$. The class weight vectors are stacked to give the vector $\mathbf{w}\in \mathbb{R}^{n_\text{features} n_\text{classes}}$. The cost for a single sample is given as a function of this vector by the cross-entropy $$J(\mathbf{w}) = -\log p_{y}(\mathbf{x};\mathbf{w}),$$ where $y \in \lbrace 1, \ldots, n_\text{classes}\rbrace$ is the observed class for $\mathbf{x}$.

I have the gradient of the cost function as [1] $$ \nabla J(\mathbf{w}) = (\mathbf{p} -\mathbf{y}) \otimes \mathbf{x},$$ which is a vector of dimension $D = n_\text{features} \cdot n_\text{classes}$, and where $\mathbf{y}\in \mathbb{R}^{n_\text{classes}}$ is a one-hot vector with all components equal to zero except the $y$th, which is equal to one. The Hessian is a $D \times D$ matrix given by $$ \mathbf{H} \equiv \nabla^2 J(\mathbf{w}) = \left(\mathbf{\Lambda}_{\mathbf{p}} -\mathbf{p} \mathbf{p}^{\mathrm{T}} \right)\otimes \mathbf{x} \mathbf{x}^{\mathrm{T}},$$ where $\mathbf{\Lambda}_\mathbf{p} \in \mathbb{R}^{n_{\text{classes}}\times n_{\text{classes}}}$ is a diagonal matrix with the components of $\mathbf{p}$ along the diagonal.
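
To sanity check these formulas, the per-sample gradient and Hessian can be built directly with np.kron and the gradient compared against finite differences of the cost. This is only a verification sketch under the definitions above; the helper names and random test values are illustrative.

import numpy as np

def cost(w_flat, x, y_idx, n_classes):
    #cross entropy J(w) for a single sample, with w flattened class by class
    W = w_flat.reshape(n_classes, -1)
    p = np.exp(W @ x - np.max(W @ x))
    p /= p.sum()
    return -np.log(p[y_idx])

def grad_hess(w_flat, x, y_idx, n_classes):
    #analytic gradient and Hessian from the formulas above
    W = w_flat.reshape(n_classes, -1)
    p = np.exp(W @ x - np.max(W @ x))
    p /= p.sum()
    y = np.zeros(n_classes)
    y[y_idx] = 1.0
    grad = np.kron(p - y, x)                                    #(p - y) kron x
    H = np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))    #(Lambda_p - p p^T) kron (x x^T)
    return grad, H

rng = np.random.default_rng(0)
n_classes, n_features = 3, 4
x = rng.normal(size=n_features)
w = rng.normal(size=n_classes*n_features)
g, H = grad_hess(w, x, y_idx=1, n_classes=n_classes)
eps = 1e-6
g_fd = np.array([(cost(w + eps*e, x, 1, n_classes) - cost(w - eps*e, x, 1, n_classes))/(2*eps)
                 for e in np.eye(len(w))])
print(np.allclose(g, g_fd, atol=1e-5))    #True if the gradient formula matches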

In the code below I have defined a Python class that applies Newton's method to a data matrix $\mathbf{X}$, with sample vectors along the rows, and a vector $\mathbf{y}$ of classes for the samples. The fit method of this class averages the gradient and Hessian over samples and then applies the update rule $$\mathbf{w}_{t+1} = \mathbf{w}_t -\mathbf{H}^{-1} \nabla J(\mathbf{w}_t).$$
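
For reference, a single iteration of this update on the flattened weight vector, written with a least-squares solve instead of an explicit inverse, would look like the sketch below; avg_grad and avg_H stand for the sample-averaged gradient and Hessian that the fit method computes.

#one Newton step on the flattened weight vector W (sketch only)
#avg_grad has shape (D,), avg_H has shape (D, D)
step = np.linalg.lstsq(avg_H, avg_grad, rcond=None)[0]
W = W - step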

import numpy as np
class Logistic:
    '''
    Logistic regression using softmax and Newton's method

    Parameters
    ----------
    max_epochs : int
        maximum number of Newton iterations

    Attributes
    ----------
    W : numpy 2d-array, shape = (n_classes, n_features + 1)
        Matrix of weights; the extra column holds the bias terms

    '''
    def __init__(self, max_epochs = 20):
        self.max_epochs = max_epochs

    @staticmethod
    def softmax(X, W):
        #exponentiate the class scores for each sample
        P = np.exp(np.matmul(X, W.T))
        #normalise
        norms = np.diag(1/P.sum(1))
        P = np.matmul(norms, P)
        return P

    def fit(self, X, y):
        '''
        Apply logistic regression to input data using Newton's method


        Parameters
        ----------

        X : numpy 2d-array, shape = (n_samples, n_features)

        y : numpy 1d-array, shape = (n_samples, )
            Integer class indices, used to build the one-hot encoding below

        '''
        #add bias column to data matrix
        X0 = np.expand_dims(np.full(len(X), 1), 1)
        X = np.concatenate((X0, X), axis = 1)

        self.n_samples, self.n_features = X.shape
        self.n_classes = len(set(y))

        #create one-hot encoded matrix of classes for samples
        Y = np.zeros((self.n_samples, self.n_classes))
        np.put_along_axis(Y, np.expand_dims(y, 1), 1, axis = 1)    

        #flattened weight vector initialised to zeros
        W = np.zeros((self.n_classes, self.n_features)).flatten()

        epoch = 0

        while epoch < self.max_epochs:
            #matrix of probabilities
            P = self.softmax(X, W.reshape(self.n_classes, self.n_features))
            grad = self.update_grad(X, Y, P)
            inv_H = self.update_inv_hessian(X, P)
            #Apply Newton's method
            v = np.dot(inv_H, grad)
            W -= v
            epoch += 1
        self.W = W.reshape(self.n_classes, self.n_features)

    def update_grad(self, X, Y, P):
        #per-sample outer products of (p - y) with x, stacked along axis 0
        grad = np.einsum('...i, ...j->...ij', P-Y, X)
        avg_grad = (1/self.n_samples)*(grad.sum(0)).flatten()
        return avg_grad

    def update_inv_hessian(self, X, P):
        #array of Lambda_p matrices
        A = np.zeros((self.n_samples, self.n_classes, 
                      self.n_classes))
        np.einsum('ijj->ij', A)[...] = P
        #array of outer products of probability vectors
        B = np.einsum('...i, ...j->...ij', P, P)
        #array of matrices on LHS of outer product
        l_matrices = A-B
        #array of matrices on RHS of outer product
        r_matrices =  np.einsum('...i, ...j->...ij', X, X)
        #per-sample Kronecker products of l_matrices and r_matrices, giving per-sample Hessians
        H = np.einsum("ijk,ilm->ijlkm", l_matrices, r_matrices)
        H = H.reshape(self.n_samples, 
                      self.n_classes*self.n_features, 
                      self.n_classes*self.n_features)
        avg_H = (1/self.n_samples)*H.sum(0)
        inv_H = np.linalg.pinv(avg_H)
        return inv_H
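
For completeness, the class is run on the Iris data roughly as follows; load_iris comes from sklearn.datasets, and the snippet is a minimal usage sketch rather than the exact script behind the plot.

from sklearn.datasets import load_iris

#minimal usage sketch for the class above
iris = load_iris()
X_iris, y_iris = iris.data, iris.target    #(150, 4) feature matrix, integer labels 0-2

model = Logistic(max_epochs=20)
model.fit(X_iris, y_iris)
print(model.W.shape)    #(3, 5): one row per class, n_features + 1 columns including the bias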

In the image below I show the trajectory of the last two components of the weight vector, using Newton's method from the code above and using gradient descent (similar code), for the Iris dataset. I also calculate the sklearn solution, and then plot a contour plot of the cost function varying just the last two components of the weight vector, with the rest fixed at the sklearn output. My gradient-descent method seems to converge, very slowly, to the sklearn solution, but the Newton method completely overshoots it and then seems to converge to some completely different weight vector.

[Figure: Trajectory of weight vector components]
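
The sklearn reference solution can be obtained with something like the snippet below; this is a sketch rather than the exact settings used for the plot, and penalty=None requires a reasonably recent scikit-learn release (older versions use penalty='none').

from sklearn.linear_model import LogisticRegression

#unregularised multinomial logistic regression as the reference solution
clf = LogisticRegression(penalty=None, solver='lbfgs', max_iter=1000)
clf.fit(X_iris, y_iris)
W_sklearn = np.column_stack((clf.intercept_, clf.coef_))    #bias column first, matching W above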

[1] Böhning, D. (1992). Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44, 197–200.

Comments:

  • @SextusEmpiricus I hope my edits make it clearer. The derivation is based on the one shown in reference [1], but the definition of the cost there is slightly different. – oweydd, Sep 25 '23 at 09:07
  • 1. I realise that I have more parameters than necessary, but that doesn't make it wrong. 2. I mention before the code example that I average the given expressions for the gradient and Hessian over multiple samples. – oweydd, Sep 25 '23 at 11:01