
I am trying to increase the performance of a sparse matrix-vector product using OpenMP in C++. I store the sparse matrix in COO format, that is, as a struct of 3 arrays with one entry per nonzero of the matrix. For each index into these arrays, I can find the row index, column index, and value of the corresponding nonzero entry. In addition, I can specify the number of threads the function should use via

export OMP_NUM_THREADS=n

where n is the number of threads I would like to use. Currently, my code is as follows

  void ompMatvec(const Vector& x, Vector& y) const {

    std::string envName = "OMP_NUM_THREADS";
    std::string thread_count = getenv(envName.c_str());
    int thread_count2 = atoi(thread_count.c_str());
    #pragma omp parallel num_threads(thread_count2)
    #pragma omp for
      for (size_type k = 0; k < arrayData.size(); ++k) {
        y(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
      }
  }

However, when I measure the performance I find that my speed up is not too high. I am comparing the parallelized function above with:

  void matvec(const Vector& x, Vector& y) const {
    for (size_type k = 0; k < arrayData.size(); ++k) {
      y(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
    }
  }

I would like to mention that I have created a Vector class with the private member function .numRows() which essentially provides the length of the vector. I am also running the code on 4 cores. Is there an implementation change that could increase performance using the OpenMP API? Or is it limited by the number of cores that my program is running on?

Any and all recommendations are greatly appreciated. Thank you!

Update: An attempt to avoid the race condition above:

  void ompMatvec(const Vector& x, Vector& y) const {

    std::string envName = "OMP_NUM_THREADS";
    std::string thread_count = getenv(envName.c_str());
    int thread_count2 = atoi(thread_count.c_str());
    size_type k; 
    #pragma omp parallel num_threads(thread_count2) \
    default(none) private(k) 
    Vector y2(y.numRows());
    #pragma omp for 
      for (k = 0; k < arrayData.size(); ++k) {
        y2(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
      }
    #pragma omp critical
      for(k = 0; k < y.numRows(); ++k){
        y(k) += y2(k); 
      }
  }
  • Assuming that there are duplicates within `rowIndices`, the code is outright wrong (race condition). There are certain approaches to parallelizing the problem, but I believe it will depend vastly on the sparsity of the data and your system how they perform. For any performance question you should always give your specific performance observations ("speed up is not too high" is terribly vague), measurement methods, hardware specification, compiler versions and options as well as a proper [mcve]. – Zulan May 15 '17 at 08:47
  • @Zulan Okay then, I guess I will restart from the basics. Could you explain to me why there will be a race condition? – KangHoon You May 15 '17 at 09:06
  • `y(rowIndices[k]) +=` happens concurrently on multiple threads: http://stackoverflow.com/questions/34510/what-is-a-race-condition – Zulan May 15 '17 at 09:10
  • @Zulan I tried to introduce a private variable so that each thread can alter its own copy, and their values are merged in the critical section. Does this solve the race condition? – KangHoon You May 15 '17 at 09:18
  • 1
    Yes, this is a reduction and solves the race condition. – Zulan May 15 '17 at 09:58
  • Do you need to write your own code here? Intel's MKL is now gratis (https://software.intel.com/en-us/articles/free-mkl), and can handle many different sparse matrix formats https://software.intel.com/en-us/node/471374 (FWIW, I work for Intel, but not on MKL) – Jim Cownie May 16 '17 at 08:54
  • Your last attempt is not parallel at all as the scope of the `parallel` region is the structured block that follows it. In that case the structured block is the single statement `Vector y2(y.numRows());`. In other news, parsing `OMP_NUM_THREADS` is done wrong as the standard allows it to be a list of values. Use `omp_get_max_threads()` instead. – Hristo Iliev May 16 '17 at 13:03
  • Also, sparse matrix-vector multiplication is a memory-bound problem and has poor scalability on CPUs with small caches and narrow memory interfaces, which includes many mobile and low-end desktop CPUs. Such problems scale with the memory bandwidth, e.g., with the number of sockets given modern CPUs with their own memory channels. – Hristo Iliev May 16 '17 at 13:08
