
We need to compute covariance matrices with sizes ranging from $10000\times10000$ to $100000\times100000$. We have access to GPUs and clusters, and we wonder what the best parallel approach is for speeding up these computations.

– nicoguaro, Open the way

2 Answers


The first thing is to recognize that you can do this entirely with BLAS. If your data matrix is $X = [x_1\; x_2\; x_3\; \ldots] \in \mathbb{R}^{n\times m}$ (each $x_i$ is a column vector holding one measurement across all $n$ trials; rows are trials), then you can write the covariance as: $$ C_{ij} = E[x_i x_j] - E[x_i] E[x_j] = \frac{1}{n} \sum_k x_{ik} x_{jk} - \frac{1}{n^2} \left(\sum_k x_{ik} \right) \left(\sum_k x_{jk}\right) $$ In matrix form this is: $$C = \frac{1}{n} X^T X - \frac{1}{n^2} (1^T X)^T (1^T X) $$ where $1^T$ is the row vector with all elements equal to 1, so $(1^T X)$ is the row vector of column sums of $X$. Every piece of this maps onto BLAS: $X^T X$ is either a GEMM or, better yet, a SYRK/HERK; you can get $b = 1^T X$ with a GEMV against a vector of ones; $b^T b$ is a GEMM or SYRK/HERK again; and the prefactors are SCAL calls.
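To make that mapping concrete, here is a minimal single-node sketch, assuming a CBLAS interface (e.g. OpenBLAS); the function name is illustrative. The $1/n$ and $1/n^2$ prefactors are folded into the alpha arguments, and a SYR rank-1 update stands in for the second SYRK, so only the upper triangle of $C$ is formed:

// Minimal sketch of the BLAS mapping above (assumed CBLAS interface).
// X is n trials x m measurements, column-major; on return C holds the
// upper triangle of the m x m covariance matrix.
#include <cblas.h>
#include <vector>

void covarianceBlas(int n, int m, const float* X, float* C)
{
    std::vector<float> ones(n, 1.0f), s(m, 0.0f);

    // C := (1/n) X^T X  -- SYRK writes only the upper triangle
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, n,
                1.0f / n, X, n, 0.0f, C, m);

    // s := 1^T X, the column sums, via GEMV with a vector of ones
    cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1.0f, X, n,
                ones.data(), 1, 0.0f, s.data(), 1);

    // C := C - (1/n^2) s s^T  -- rank-1 update in place of a second SYRK
    cblas_ssyr(CblasColMajor, CblasUpper, m, -1.0f / ((float)n * (float)n),
               s.data(), 1, C, m);
}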

Your data and result matrices can run to tens of gigabytes (a $100000\times100000$ double-precision matrix is 80 GB), so you're not going to fit on a single node or a single node's worth of GPUs. For a non-GPU cluster, you might want to look at PBLAS, the distributed BLAS layer underneath ScaLAPACK. For GPUs, multi-node libraries aren't quite there yet. MAGMA has some sort of underlying parallel BLAS implementation, but it may not be user-friendly. I don't think CULA does multi-node yet, but it's something to keep an eye on. cuBLAS is single-node.

I'd also suggest that you strongly consider implementing the parallelism yourself, especially if you are familiar with MPI and have to hook this into an existing code base. That way, you can switch between CPU and GPU BLAS easily and begin and end with the data exactly where you want it. You shouldn't need more than a few MPI_ALLREDUCE calls.
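Roughly what that could look like, as a sketch: each rank owns a block of trials, computes its local partial products, and the reductions combine them. The function name and partitioning here are illustrative assumptions, and for the largest problem sizes $C$ itself would also have to be distributed and indexed with 64-bit integers, which this sketch ignores:

#include <mpi.h>
#include <cblas.h>
#include <algorithm>
#include <vector>

// Each rank owns nLocal of the n total trials in Xlocal (nLocal x m,
// column-major). After the reductions, every rank holds the upper
// triangle of the m x m covariance matrix in C.
void distributedCovariance(int nLocal, int n, int m,
                           const float* Xlocal, float* C)
{
    std::vector<float> ones(nLocal, 1.0f), s(m, 0.0f);
    std::fill(C, C + (size_t)m * m, 0.0f); // SYRK only touches the upper triangle

    // Local contribution to X^T X (upper triangle)
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, nLocal,
                1.0f, Xlocal, nLocal, 0.0f, C, m);

    // Local column sums
    cblas_sgemv(CblasColMajor, CblasTrans, nLocal, m, 1.0f, Xlocal, nLocal,
                ones.data(), 1, 0.0f, s.data(), 1);

    // The "few MPI_ALLREDUCE calls": sum partial products and partial sums
    MPI_Allreduce(MPI_IN_PLACE, C, m * m, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, s.data(), m, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // C := (1/n) C - (1/n^2) s s^T, identical on every rank
    cblas_sscal(m * m, 1.0f / n, C, 1);
    cblas_ssyr(CblasColMajor, CblasUpper, m, -1.0f / ((float)n * (float)n),
               s.data(), 1, C, m);
}

Swapping the local SYRK/GEMV for their cuBLAS counterparts changes nothing in the reduction step, which is exactly the CPU/GPU flexibility mentioned above.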

– Max Hutchinson
  • Thank you for your analysis and list of relevant BLAS functions. After reading your answer I have used these to accelerate and optimize the computation of the covariance matrix in the development version of Scilab (www.scilab.org). – Stéphane Mottelet Apr 08 '19 at 13:25
  • However, be warned that this way of computing the covariance is subject to catastrophic cancellation when $E[x_i x_j]$ is close to $E[x_i] E[x_j]$; see e.g. https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance – Stéphane Mottelet Feb 19 '20 at 21:11
  • How does the equation for $C$ change when the data matrix $X$ is a column vector of row vectors? (I.e., each row of $X$ is a row vector corresponding to one measurement, and the columns are trials.) – étale-cohomology Jul 03 '21 at 09:38
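A two-pass (centered) formulation sidesteps the cancellation flagged in the comments: compute the column means, subtract them from the data, then take a single SYRK of the centered matrix. A sketch under the same assumed CBLAS interface as above (the function name is illustrative; X is overwritten with the centered data):

#include <cblas.h>
#include <vector>

// X is n trials x m measurements, column-major, modified in place;
// C returns the upper triangle of the m x m covariance matrix.
void covarianceCentered(int n, int m, float* X, float* C)
{
    std::vector<float> ones(n, 1.0f), mean(m, 0.0f);

    // mean := (1/n) 1^T X
    cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1.0f / n, X, n,
                ones.data(), 1, 0.0f, mean.data(), 1);

    // X := X - 1 mean^T  (GER rank-1 update subtracts each column's mean)
    cblas_sger(CblasColMajor, n, m, -1.0f, ones.data(), 1,
               mean.data(), 1, X, n);

    // C := (1/n) Xc^T Xc -- no large cancelling terms remain
    cblas_ssyrk(CblasColMajor, CblasUpper, CblasTrans, m, n,
                1.0f / n, X, n, 0.0f, C, m);
}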

I implemented the formula given by @Max Hutchinson with cuBLAS and CUDA Thrust and compared the results against online covariance calculators; mine seem to agree. The code below was written for a QDA Bayes classifier, so the input matrix may contain more than one class, and one covariance matrix is computed per class. I hope it will be useful for someone.

#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Minimal helper assumed by the snippet below: copy a device vector to
// the host and print it.
void printVector(const thrust::device_vector<float>& v)
{
    thrust::host_vector<float> h = v;
    for (float x : h) std::cout << x << ' ';
    std::cout << '\n';
}

//! Calculates one or more covariance matrices from the given data.
//  There can be several classes, hence several covariance matrices.
/*!
    \param inMatrix This vector contains the matrix data in column-major storage.
    For example, if inMatrix=[1 2 3 4 5 6] and trialSizes=[2], we work on the matrix
        |1 4 |
        |2 5 |
        |3 6 |  -> 2 trials, 3 features: each column is one trial (sample),
                   each row one feature.
    \param trialSizes Samples from all classes are packed into inMatrix, so we need
    to know how many trials (samples) each class has.
    For example, if inMatrix=[1 2 3 4 5 6 7 8 9 10 11 12] and trialSizes=[2,2],
    we work on the matrices
        |1 4 |  |7 10 |
        |2 5 |  |8 11 |
        |3 6 |  |9 12 |  --> Total number of trials (total column count) is 2 + 2 = 4,
                             so the feature-vector size is inMatrix.size()/4 = 3.
                         --> trialSizes has two elements, so each class has two trials.
*/
void multiQDACovianceCalculator(std::vector<float>& inMatrix, std::vector<int>& trialSizes)
{
    cublasHandle_t handle; // CUBLAS context
    int classCount = trialSizes.size();
    int rowSize = std::accumulate(trialSizes.begin(), trialSizes.end(), 0);
    int dimensionSize = inMatrix.size() / rowSize;
    float alpha = 1.0f;
    float beta = 0.0f;

    thrust::device_vector<float> d_cov1(dimensionSize * dimensionSize);
    thrust::device_vector<float> d_cov2(dimensionSize * dimensionSize);
    thrust::device_vector<float> d_covResult(dimensionSize * dimensionSize);

    thrust::device_vector<float> d_wholeMatrix(inMatrix);
    thrust::device_vector<float> d_meansVec(dimensionSize); // per-feature sums over the trials of one class
    float *meanVecPtr = thrust::raw_pointer_cast(d_meansVec.data());
    float *device2DMatrixPtr = thrust::raw_pointer_cast(d_wholeMatrix.data());
    auto maxTrialNumber = *std::max_element(trialSizes.begin(), trialSizes.end());
    thrust::device_vector<float> deviceVector(maxTrialNumber, 1.0f); // vector of ones, long enough for the largest class

    cublasCreate(&handle);
    // Each iteration of the loop computes the covariance matrix of one class
    for (int i = 0; i < classCount; i++)
    {
        beta = 0.0f; // reset: the GEAM below leaves beta at -1 after each iteration

        // X * transpose(X) / N
        alpha = 1.0f / trialSizes[i];
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, dimensionSize, dimensionSize, trialSizes[i], &alpha,
            device2DMatrixPtr, dimensionSize, device2DMatrixPtr, dimensionSize, &beta,
            thrust::raw_pointer_cast(d_cov1.data()), dimensionSize);

        // Per-feature sums over the trials: multiplying by the ones vector sums the columns
        alpha = 1.0f;
        cublasSgemv(handle, CUBLAS_OP_N, dimensionSize, trialSizes[i], &alpha, device2DMatrixPtr,
            dimensionSize, thrust::raw_pointer_cast(deviceVector.data()), 1, &beta, meanVecPtr, 1);

        // sumVec * transpose(sumVec) / (N*N), i.e. meanVec * transpose(meanVec)
        alpha = 1.0f / (trialSizes[i] * trialSizes[i]);
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, dimensionSize, dimensionSize, 1, &alpha,
            meanVecPtr, 1, meanVecPtr, 1, &beta,
            thrust::raw_pointer_cast(d_cov2.data()), dimensionSize);

        // (X * transpose(X) / N) - (meanVec * transpose(meanVec))
        alpha = 1.0f;
        beta = -1.0f;
        cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, dimensionSize, dimensionSize, &alpha,
            thrust::raw_pointer_cast(d_cov1.data()), dimensionSize, &beta, thrust::raw_pointer_cast(d_cov2.data()),
            dimensionSize, thrust::raw_pointer_cast(d_covResult.data()), dimensionSize);

        // Print this class's covariance matrix before it is overwritten,
        // then advance the data pointer to the next class
        printVector(d_covResult);
        device2DMatrixPtr += trialSizes[i] * dimensionSize;
    }
    cublasDestroy(handle);
}
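For reference, a minimal driver for the snippet above (hypothetical, matching the layout in the doc comment: two classes, two trials each, three features):

int main()
{
    // Two 3x2 class blocks packed back to back, as in the doc comment example
    std::vector<float> inMatrix = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    std::vector<int> trialSizes = {2, 2};
    multiQDACovianceCalculator(inMatrix, trialSizes); // prints one covariance matrix per class
    return 0;
}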