Below are two ways of writing what is seemingly (to me at least) exactly the same thing:

void do_some_work(MPI_Request* send_reqs, int* send_counter) {
    for (int i = 0; i < someNumber; ++i) {
        // working version
        MPI_Request req = send_reqs[*send_counter];
        MPI_Isend(ptr, n_elem, M_MPI_REAL, trg_rank, tag, MPI_COMM_WORLD, &req);

        // buggy version...
        MPI_Request* req = &send_reqs[*send_counter];
        MPI_Isend(ptr, n_elem, M_MPI_REAL, trg_rank, tag, MPI_COMM_WORLD, req);

        // increment the counter
        (*send_counter)++;
    }
}

// [...]

// Allocate the request array & start the sends
MPI_Request* send_reqs = (MPI_Request*)calloc(someNumber, sizeof(MPI_Request));
int send_counter = 0;
do_some_work(send_reqs, &send_counter);

// [...]

// later on, at the sync step:
MPI_Waitall(send_counter, send_reqs, MPI_STATUSES_IGNORE);

However, in practice, the second one triggers random memory access errors: the symptoms look exactly as if MPI were writing to arbitrary memory, sometimes causing segmentation faults and sometimes silently modifying values it shouldn't touch. To add to the weirdness, the code above works with 3 MPI ranks or fewer and starts misbehaving with 4 or 5 ranks (I didn't test with more).

Does anyone have an idea or explanation as to what changes, and why one works while the other does not?

  • Please add the code that passes your requests to any MPI_Wait function. – Victor Eijkhout Feb 20 '24 at 12:50
  • Done, it's a simple MPI_Waitall on all the posted send/recv though – Gilles Poncelet Feb 20 '24 at 12:58
  • The first code cannot work. req is its own object, which is initialized by copy in the first line. You put the request into req, but you later wait on the element in send_reqs, which is not the same object. The second way you show avoids this problem. – Wolfgang Bangerth Feb 20 '24 at 13:51
  • I'm also confused by how send_counter is used as an integer in the last line, but apparently as a pointer in indexing into send_reqs. I think you need to show a complete piece of code. – Wolfgang Bangerth Feb 20 '24 at 13:52
  • Yeah, apologies for the pointer confusion. I updated the code so it's (hopefully) clearer for everyone (the actual code is fairly big; the only lines of interest are the two highlighted). – Gilles Poncelet Feb 20 '24 at 14:10
  • Surprisingly, the first code is the one working... and the second one is not x) – Gilles Poncelet Feb 20 '24 at 14:11
  • OK, I'm actually just a gigantic moron: @WolfgangBangerth is right, the first way is wrong (as it should be, this is reassuring x). I didn't allocate a request array big enough, hence the errors I got with the second way (and the first way 'hid' the issue by passing a copy...) – Gilles Poncelet Feb 20 '24 at 14:37
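
To make the copy-versus-pointer point from the comments concrete, here is a minimal sketch in plain C with no MPI involved. `fake_request_t` and `fake_isend` are hypothetical stand-ins (not part of any MPI library); the only property they borrow from MPI_Isend is that the call writes the new request handle through the pointer it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_Request: just an integer handle here. */
typedef int fake_request_t;

/* Hypothetical stand-in for MPI_Isend: it writes the new request handle
 * through the pointer it receives, which is all this sketch relies on. */
void fake_isend(int handle, fake_request_t *request) {
    assert(request != NULL);
    *request = handle;
}

/* First variant from the question: the handle lands in a local copy,
 * so the array element is never updated. */
void post_by_copy(fake_request_t *reqs, int *counter, int handle) {
    fake_request_t req = reqs[*counter]; /* copy of the array element */
    fake_isend(handle, &req);            /* only the local copy changes */
    (*counter)++;
}

/* Second variant from the question: the handle lands in the array itself. */
void post_by_pointer(fake_request_t *reqs, int *counter, int handle) {
    fake_request_t *req = &reqs[*counter]; /* address of the array element */
    fake_isend(handle, req);               /* the array element changes */
    (*counter)++;
}
```

Calling `post_by_copy` on a zero-initialised array leaves the array untouched, whereas `post_by_pointer` stores the handle in the slot that a later wait would inspect. With a real MPI library, the second variant therefore also writes into whatever memory sits at that index, including memory past the end of an undersized array.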

1 Answer

As mentioned in the comments, the issue was that the request array wasn't big enough. The first two lines (the "working" version) were hiding the memory allocation issue: they copied a possibly non-existent MPI_Request out of the array and handed that copy to MPI_Isend, so nothing was ever written past the end of the array. (How it didn't fail in the MPI_Waitall, I have no idea.)
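
One way to catch this kind of under-allocation early is to hand out request slots through a bounds-checked helper. The sketch below is only an illustration under assumptions: `fake_request_t` stands in for MPI_Request, and `next_request_slot` and its `capacity` parameter are hypothetical names that do not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_Request so the sketch compiles without MPI. */
typedef int fake_request_t;

/* Hypothetical helper: returns the address of the next free slot and
 * advances the counter, or returns NULL when all `capacity` slots are
 * already in use, instead of silently indexing past the end of the array. */
fake_request_t *next_request_slot(fake_request_t *reqs, int capacity,
                                  int *counter) {
    assert(reqs != NULL && counter != NULL);
    if (*counter >= capacity) {
        return NULL; /* the array was allocated too small for this send */
    }
    return &reqs[(*counter)++];
}
```

In the real code, the returned pointer would be what gets passed as the last argument of MPI_Isend, and a NULL result would point directly at the allocation size rather than showing up later as corrupted memory.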