Background:
I am making a convolutional neural network (CNN) to try to classify cytometric data. This data has a shape (num_cells, num_markers). Additional information on using CNNs for cytometric data can be found here (CellCNN, Deep Cytometric CNN model). CNNs are supposed to work much better on this kind of data than other methods (SVM, random forest, autoencoder, ...).
Dataset & data handling:
The dataset I am currently using has shape (172791, 24). I have ground truths for these cells, which essentially tell me whether each cell is sick or healthy, so it is a binary classification problem. However, I will later need to use a similar model on a dataset with 3 classes and a different number of markers; accounting for this in the network should only require changing a few parameters.
I split my dataset into an 80% train / 20% test split. The train set is then split again into 80% train / 20% validation. I use the validation set to track the loss during each epoch, and after training I compute metrics such as accuracy and a confusion matrix on the test set.
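In code, the splitting is roughly this (a minimal sketch; X and y stand for the marker matrix and the ground-truth labels, and the random_state/stratify arguments are only what I use here for illustration):

from sklearn.model_selection import train_test_split

# X: (172791, 24) marker values, y: (172791,) labels (0 = healthy, 1 = sick)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
# split the remaining 80% again into 80% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval)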
Model Design:
I am using PyTorch to create my model. I use a CrossEntropyLoss() criterion and, as an optimizer, Adam(model.parameters(), lr=1e-4, weight_decay=1e-5) (weight decay in PyTorch's Adam is apparently implemented as L2 regularisation).
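In code that set-up is simply (sketch; model refers to the network defined further down):

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)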
I made a deep network containing 2 convolutional layers and 2 fully connected layers.
For the convolutional layers I use a 1D convolution, since the cells (each a row of 24 markers) are arranged randomly and have no correlation to each other. Each convolutional layer is followed by a ReLU activation and a batch normalization layer.
After the convolutional layers I use a pooling layer (1D mean or max pooling).
For the first fully connected layer, I use a Linear layer followed by a ReLU activation and a Dropout layer. The second fully connected layer then produces the output.
Problem:
When I train the model and then use it on the test set, I get an accuracy of around 90% regardless of what I try. Admittedly, that is not bad. However, in the literature (see the references in the background section) I find that CNNs should be a lot better at classifying cytometric data. I have also used a HistGradientBoost and an SVM model on the data; those yield accuracies of around 98%-99% after tuning their hyperparameters.
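For reference, the baselines were along these lines (sketch only; the default hyperparameters shown here are placeholders, not the tuned values):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# histogram-based gradient boosting baseline
hgb = HistGradientBoostingClassifier().fit(X_train, y_train)
print("HistGradientBoosting accuracy:", accuracy_score(y_test, hgb.predict(X_test)))

# SVM baseline (RBF kernel by default)
svm = SVC().fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))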
I have experimented with the following things:
- Different amounts of convolutional layers (1 - 3)
- Different amounts of fully connected layers (1 - 4)
- Using pooling after each convolutional layer instead of only the last
- Using no batch normalization
- Using different activation functions (this only yielded lower accuracies)
- Using different output shapes (e.g. filter sizes) of the convolutional layers
- Using different sizes of the fully connected layers
- Using different values for the dropout layers
My metrics, a learning curve (train loss vs. validation loss) and a confusion matrix, are shown below:

Now, it looks like the loss is still decreasing, which is true; however, anything between about 3 and 50 epochs gives an accuracy between 89% and 91%. After a few epochs not much changes, which would lead me to believe that the model might be too complex, but I have tried reducing the complexity and that did not help. After more than 50 epochs the validation loss starts to increase again, meaning we are definitely over-fitting by that point.
I am quite lost on this one. Does anyone have any ideas, suggestions or pointers as to what I could still try to increase the accuracy of my model? Or am I doing anything wrong? I admit that this is my first time designing a CNN (or any NN for that matter), so I might still be misunderstanding something.
Below I also provide my CNN implemented using PyTorch for those interested:
import torch.nn as nn
class CytometryCNN(nn.Module):
    def __init__(self):
        super(CytometryCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv1d(1, 24, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(24, 24, kernel_size=3, stride=1, padding=1)
        # Batch normalization layers
        self.bn1 = nn.BatchNorm1d(24)
        self.bn2 = nn.BatchNorm1d(24)
        self.relu = nn.ReLU()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        # Fully connected layers
        self.fc1 = nn.Linear(24 * 12, 64)
        self.fc2 = nn.Linear(64, 2)
        self.dropout1 = nn.Dropout(0.5)

    def forward(self, x):
        # Add a channel dimension: (batch_size, n_features) -> (batch_size, 1, n_features)
        x = x.unsqueeze(1)
        # First convolutional layer
        x = self.conv1(x)
        x = self.relu(x)
        x = self.bn1(x)
        # Second convolutional layer
        x = self.conv2(x)
        x = self.relu(x)
        x = self.bn2(x)
        x = self.pool(x)
        # Flatten: (batch_size, 24, 12) -> (batch_size, 288)
        x = x.view(x.size(0), -1)
        # First fully connected layer
        x = self.fc1(x)
        x = self.dropout1(x)
        x = self.relu(x)
        # Second fully connected layer (logits for the 2 classes)
        x = self.fc2(x)
        return x
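And this is roughly how I train and evaluate it (a minimal sketch; train_loader and val_loader are assumed to be DataLoaders built from the splits above, and num_epochs is illustrative):

import torch
from sklearn.metrics import accuracy_score, confusion_matrix

model = CytometryCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

num_epochs = 30  # illustrative value
for epoch in range(num_epochs):
    model.train()
    for xb, yb in train_loader:  # xb: (batch_size, 24) floats, yb: (batch_size,) class indices
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    # validation loss after each epoch (used for the learning curve)
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)

# test-set metrics after training
model.eval()
with torch.no_grad():
    preds = model(torch.tensor(X_test, dtype=torch.float32)).argmax(dim=1).numpy()
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))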
Thanks in advance!