
We need to compute covariance matrices with sizes ranging from $10000\times10000$ to $100000\times100000$. We have access to GPUs and clusters, and we wonder what the best parallel approach is for speeding up these computations.

– nicoguaro, Open the way

2 Answers


The first thing is to recognize that you can do this entirely with BLAS. If your data matrix is $X = [x_1\; x_2\; x_3\; \ldots] \in \mathbb{R}^{n\times m}$ (each $x_i$ is a column vector holding one measurement across all $n$ trials; rows are trials), then you can write the covariance as: $$ C_{ij} = E[x_i x_j] - E[x_i] E[x_j] = \frac{1}{n} \sum_k x_{ik} x_{jk} - \frac{1}{n^2} \left(\sum_k x_{ik} \right) \left(\sum_k x_{jk}\right) $$ In matrix form this is: $$C = \frac{1}{n} X^T X - \frac{1}{n^2} (1^T X)^T (1^T X) $$ where $1^T$ is the row vector with all elements equal to 1, so $(1^T X)$ is the row vector of column sums of $X$. Every piece of this maps onto BLAS: $X^T X$ is either a GEMM or, better yet, a SYRK/HERK; you can get $b = 1^T X$ with a GEMV against a vector of ones; $b^T b$ is a GEMM or SYRK/HERK again; and the prefactors are SCAL calls.
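To make that mapping concrete, here is a minimal single-node sketch, assuming a CBLAS interface (e.g. OpenBLAS); the function name is illustrative. The $1/n$ and $1/n^2$ prefactors are folded into the alpha arguments, and a SYR rank-1 update stands in for the second SYRK, so only the upper triangle of $C$ is formed:

// Minimal sketch of the BLAS mapping above (assumed CBLAS interface).
// X is n trials x m measurements, column-major; on return C holds the
// upper triangle of the m x m covariance matrix.
#include <cblas.h>
#include <vector>

void covarianceBlas(int n, int m, const float* X, float* C)
{
    std::vector<float> ones(n, 1.0f), s(m, 0.0f);

    // C := (1/n) X^T X  -- SYRK writes only the upper triangle
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, n,
                1.0f / n, X, n, 0.0f, C, m);

    // s := 1^T X, the column sums, via GEMV with a vector of ones
    cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1.0f, X, n,
                ones.data(), 1, 0.0f, s.data(), 1);

    // C := C - (1/n^2) s s^T  -- rank-1 update in place of a second SYRK
    cblas_ssyr(CblasColMajor, CblasUpper, m, -1.0f / ((float)n * (float)n),
               s.data(), 1, C, m);
}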

Your data and result matrices can run to tens of gigabytes (a $100000\times100000$ double-precision matrix is 80 GB), so you're not going to fit on a single node or a single node's worth of GPUs. For a non-GPU cluster, you might want to look at PBLAS, the distributed BLAS layer underneath ScaLAPACK. For GPUs, multi-node libraries aren't quite there yet. MAGMA has some sort of underlying parallel BLAS implementation, but it may not be user-friendly. I don't think CULA does multi-node yet, but it's something to keep an eye on. cuBLAS is single-node.

I'd also suggest that you strongly consider implementing the parallelism yourself, especially if you are familiar with MPI and have to hook this into an existing code base. That way, you can switch between CPU and GPU BLAS easily and begin and end with the data exactly where you want it. You shouldn't need more than a few MPI_ALLREDUCE calls.
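Roughly what that could look like, as a sketch: each rank owns a block of trials, computes its local partial products, and the reductions combine them. The function name and partitioning here are illustrative assumptions, and for the largest problem sizes $C$ itself would also have to be distributed and indexed with 64-bit integers, which this sketch ignores:

#include <mpi.h>
#include <cblas.h>
#include <algorithm>
#include <vector>

// Each rank owns nLocal of the n total trials in Xlocal (nLocal x m,
// column-major). After the reductions, every rank holds the upper
// triangle of the m x m covariance matrix in C.
void distributedCovariance(int nLocal, int n, int m,
                           const float* Xlocal, float* C)
{
    std::vector<float> ones(nLocal, 1.0f), s(m, 0.0f);
    std::fill(C, C + (size_t)m * m, 0.0f); // SYRK only touches the upper triangle

    // Local contribution to X^T X (upper triangle)
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, nLocal,
                1.0f, Xlocal, nLocal, 0.0f, C, m);

    // Local column sums
    cblas_sgemv(CblasColMajor, CblasTrans, nLocal, m, 1.0f, Xlocal, nLocal,
                ones.data(), 1, 0.0f, s.data(), 1);

    // The "few MPI_ALLREDUCE calls": sum partial products and partial sums
    MPI_Allreduce(MPI_IN_PLACE, C, m * m, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, s.data(), m, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // C := (1/n) C - (1/n^2) s s^T, identical on every rank
    cblas_sscal(m * m, 1.0f / n, C, 1);
    cblas_ssyr(CblasColMajor, CblasUpper, m, -1.0f / ((float)n * (float)n),
               s.data(), 1, C, m);
}

Swapping the local SYRK/GEMV for their cuBLAS counterparts changes nothing in the reduction step, which is exactly the CPU/GPU flexibility mentioned above.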

– Max Hutchinson
  • Thank you for your analysis and list of relevant BLAS functions. After reading your answer I have used these to accelerate and optimize the computation of the covariance matrix in the development version of Scilab (www.scilab.org). – Stéphane Mottelet Apr 08 '19 at 13:25
  • However, be warned that this way of computing the covariance is subject to catastrophic cancellation when $E[x_i x_j]$ is close to $E[x_i] E[x_j]$; see e.g. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance – Stéphane Mottelet Feb 19 '20 at 21:11
  • How does the equation for $C$ change when the data matrix $X$ is a column vector of row vectors? (I.e., each row of $X$ is a row vector corresponding to one measurement, and the columns are trials.) – étale-cohomology Jul 03 '21 at 09:38
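A two-pass (centered) formulation sidesteps the cancellation flagged in the comments: compute the column means, subtract them from the data, then take a single SYRK of the centered matrix. A sketch under the same assumed CBLAS interface as above (the function name is illustrative; X is overwritten with the centered data):

#include <cblas.h>
#include <vector>

// X is n trials x m measurements, column-major, modified in place;
// C returns the upper triangle of the m x m covariance matrix.
void covarianceCentered(int n, int m, float* X, float* C)
{
    std::vector<float> ones(n, 1.0f), mean(m, 0.0f);

    // mean := (1/n) 1^T X
    cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1.0f / n, X, n,
                ones.data(), 1, 0.0f, mean.data(), 1);

    // X := X - 1 mean^T  (GER rank-1 update subtracts each column's mean)
    cblas_sger(CblasColMajor, n, m, -1.0f, ones.data(), 1,
               mean.data(), 1, X, n);

    // C := (1/n) Xc^T Xc -- no large cancelling terms remain
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, n,
                1.0f / n, X, n, 0.0f, C, m);
}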

I implemented the formula given by @Max Hutchinson with cuBLAS and CUDA Thrust and compared the results against online covariance calculators; mine seem to agree. The code below was written for a QDA Bayes classifier, so the input matrix may contain more than one class, and one covariance matrix is computed per class. I hope it will be useful for someone.

#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Minimal helper assumed by the snippet below: copy a device vector to
// the host and print it.
void printVector(const thrust::device_vector<float>& v)
{
    thrust::host_vector<float> h = v;
    for (float x : h) std::cout << x << ' ';
    std::cout << '\n';
}

//! Calculates one or more covariance matrices from the given data.
//  There can be several classes, hence several covariance matrices.
/*!
    \param inMatrix This vector contains the matrix data in column-major storage.
    For example, if inMatrix=[1 2 3 4 5 6] and trialSizes=[2], we work on the matrix
        |1 4 |
        |2 5 |
        |3 6 |  -> 2 trials, 3 features: each column is one trial (sample),
                   each row one feature.
    \param trialSizes Samples from all classes are packed into inMatrix, so we need
    to know how many trials (samples) each class has.
    For example, if inMatrix=[1 2 3 4 5 6 7 8 9 10 11 12] and trialSizes=[2,2],
    we work on the matrices
        |1 4 |  |7 10 |
        |2 5 |  |8 11 |
        |3 6 |  |9 12 |  --> Total number of trials (total column count) is 2 + 2 = 4,
                             so the feature-vector size is inMatrix.size()/4 = 3.
                         --> trialSizes has two elements, so each class has two trials.
*/
void multiQDACovianceCalculator(std::vector<float>& inMatrix, std::vector<int>& trialSizes)
{
    cublasHandle_t handle; // CUBLAS context
    int classCount = trialSizes.size();
    int rowSize = std::accumulate(trialSizes.begin(), trialSizes.end(), 0);
    int dimensionSize = inMatrix.size() / rowSize;
    float alpha = 1.0f;
    float beta = 0.0f;

    thrust::device_vector<float> d_cov1(dimensionSize * dimensionSize);
    thrust::device_vector<float> d_cov2(dimensionSize * dimensionSize);
    thrust::device_vector<float> d_covResult(dimensionSize * dimensionSize);

    thrust::device_vector<float> d_wholeMatrix(inMatrix);
    thrust::device_vector<float> d_meansVec(dimensionSize); // per-feature sums over the trials of one class
    float *meanVecPtr = thrust::raw_pointer_cast(d_meansVec.data());
    float *device2DMatrixPtr = thrust::raw_pointer_cast(d_wholeMatrix.data());
    auto maxTrialNumber = *std::max_element(trialSizes.begin(), trialSizes.end());
    thrust::device_vector<float> deviceVector(maxTrialNumber, 1.0f); // vector of ones, long enough for the largest class

    cublasCreate(&handle);
    // Each iteration of the loop computes the covariance matrix of one class
    for (int i = 0; i < classCount; i++)
    {
        beta = 0.0f; // reset: the GEAM below leaves beta at -1 after each iteration

        // X * transpose(X) / N
        alpha = 1.0f / trialSizes[i];
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, dimensionSize, dimensionSize, trialSizes[i], &alpha,
            device2DMatrixPtr, dimensionSize, device2DMatrixPtr, dimensionSize, &beta,
            thrust::raw_pointer_cast(d_cov1.data()), dimensionSize);

        // Per-feature sums over the trials: multiplying by the ones vector sums the columns
        alpha = 1.0f;
        cublasSgemv(handle, CUBLAS_OP_N, dimensionSize, trialSizes[i], &alpha, device2DMatrixPtr,
            dimensionSize, thrust::raw_pointer_cast(deviceVector.data()), 1, &beta, meanVecPtr, 1);

        // sumVec * transpose(sumVec) / (N*N), i.e. meanVec * transpose(meanVec)
        alpha = 1.0f / (trialSizes[i] * trialSizes[i]);
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, dimensionSize, dimensionSize, 1, &alpha,
            meanVecPtr, 1, meanVecPtr, 1, &beta,
            thrust::raw_pointer_cast(d_cov2.data()), dimensionSize);

        // (X * transpose(X) / N) - (meanVec * transpose(meanVec))
        alpha = 1.0f;
        beta = -1.0f;
        cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, dimensionSize, dimensionSize, &alpha,
            thrust::raw_pointer_cast(d_cov1.data()), dimensionSize, &beta, thrust::raw_pointer_cast(d_cov2.data()),
            dimensionSize, thrust::raw_pointer_cast(d_covResult.data()), dimensionSize);

        // Print this class's covariance matrix before it is overwritten,
        // then advance the data pointer to the next class
        printVector(d_covResult);
        device2DMatrixPtr += trialSizes[i] * dimensionSize;
    }
    cublasDestroy(handle);
}
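For reference, a minimal driver for the snippet above (hypothetical, matching the layout in the doc comment: two classes, two trials each, three features):

int main()
{
    // Two 3x2 class blocks packed back to back, as in the doc comment example
    std::vector<float> inMatrix = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    std::vector<int> trialSizes = {2, 2};
    multiQDACovianceCalculator(inMatrix, trialSizes); // prints one covariance matrix per class
    return 0;
}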