2

I am trying to do Matrix multiplication using SSE. I have written a simple program for 4x4 matrices. Everything seems fine but when I print result , its some garbage values. please help to figure out problem/s. Secondly program stops working when I free memory, not a proper end of program.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <float.h>
#include <xmmintrin.h>

void main() {
    float **a, **b, **c;
    int a_r = 4, a_c = 4, b_c = 4, b_r = 4;
    int i, j, k;

    /* allocate memory for matrix one */
    a = (float **)malloc(sizeof(float) * a_r);
    for (i = 0; i < a_c; i++) {
        a[i] = (float *)malloc(sizeof(float) * a_c);
    }
    /* allocate memory for matrix two */
    b = (float **)malloc(sizeof(float) * b_r);
    for (i = 0; i < b_c; i++) {
        b[i] = (float *)malloc(sizeof(float) * b_c);
    }
    /* allocate memory for sum matrix */
    c = (float **)malloc(sizeof(float) * a_r);
    for (i = 0; i < b_c; i++) {
        c[i] = (float *)malloc(sizeof(float) * b_c);
    }
    printf("Initializing matrices...\n");

    //initializing first matrix
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < a_c; j++) {
            a[i][j] = 2;
        }
    }
    // initializing second matrix
    for (i = 0; i < b_r; i++) {
        for (j = 0; j < b_c; j++) {
            b[i][j] = 2;
        }
    }
    /* initialize product matrix */
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < b_c; j++) {
            c[i][j] = 0;
        }
    }

    int count = 0;
    /* multiply matrix one and matrix two */
    for (i = 0; i < a_r; i++) {
        for (j = 0; j < a_c; j++) {
            count = 0;
            __m128 result = _mm_setzero_ps();
            for (k = 0; k < 4; k += 4) {
                __m128 row1 = _mm_loadu_ps(&a[i][k]);
                __m128 row2 = _mm_loadu_ps(&b[k][j]);
                result = _mm_mul_ps(row1, row2);

                for (int t = 1; t < 4; t++) {
                    __m128 row3=_mm_loadu_ps(&a[t * 4]);
                    __m128 row4=_mm_loadu_ps(&b[i][t]);
                    __m128 row5 = _mm_mul_ps(row3,row4);
                    result = _mm_add_ps(row5, result);
                }
                _mm_storeu_ps(&c[i][j], result);
            }
        }
    }
    printf("******************************************************\n");
    printf ("Done.\n");

    for (i = 0; i < a_r ; i++) {
        for (j = 0; j < b_c; j++) {
            printf ("%f   ", c[i][j]);   // issue here when I print results.
        }
        printf("\n");
    }     //  Here program stops working.

    /*free memory*/
    for (i = 0; i < a_r; i++) {
        free(a[i]);
    }
    free(a);
    for (i = 0; i < a_c; i++) {
        free(b[i]);
    }
    free(b);
    for (i = 0; i < b_c; i++) {
        free(c[i]);
    }
    free(c);
}

please have look at address printed for output matrix. how to get aligned addresses, I have _aligned_malloc, but still not aligned.

enter image description here

chqrlie
  • 114,102
  • 10
  • 108
  • 170
Sarmad
  • 59
  • 6
  • could be because your allocated arrays are not *aligned* – willywonkadailyblah Jul 18 '17 at 10:36
  • @spug any idea how to align or check aligment? – Sarmad Jul 18 '17 at 10:39
  • 2
    What does _stops working_ mean? Did it crash? Or freeze? What happens when you examine it in the debugger? What compiler are you using? – Useless Jul 18 '17 at 10:40
  • it may not be useful to your goals (windows platform only), but windows does have a library that uses intrinsics for linear algebra: https://msdn.microsoft.com/en-us/library/windows/desktop/ee415594(v=vs.85).aspx – stack user Jul 18 '17 at 10:41
  • 1
    Could try https://stackoverflow.com/questions/227897/how-to-allocate-aligned-memory-only-using-the-standard-library. Google is your friend – willywonkadailyblah Jul 18 '17 at 10:41
  • 1) Look up the alignment requirements for those instructions. 2) Print the pointer value and see whether it is zero modulo that alignment. – Useless Jul 18 '17 at 10:41
  • @useless yes it crashes, I am using CodeBlocks. My main issue is regarding results. (I didnt check on debugger) – Sarmad Jul 18 '17 at 10:41
  • 1
    Alignment is not required if using `_mm_loadu_ps` and `_mm_storeu_ps`. – Rotem Jul 18 '17 at 10:43
  • 1
    It's not alignment. The "u" in "loadu" means "unaligned" – Art Jul 18 '17 at 10:43
  • 2
    You're using the the wrong sizeof when allocating the arrays of pointers. – Art Jul 18 '17 at 10:44
  • 2
    Stop casting malloc. It hides errors. – David Hoelzer Jul 18 '17 at 10:45
  • To expand on @Art 's comment, you are `malloc`ing `sizeof(float)*a_r`, whereas you should be `malloc`ing `sizeof(float*)*a_r`. – Rotem Jul 18 '17 at 11:04
  • @DavidHoelzer without casting malloc, how can i dynamically allocate matrices? – Sarmad Jul 18 '17 at 11:04
  • 1
    malloc doesn't need a cast in C; pointers are interconvertible. Also, in C, the `main` function returns `int`, not `void`. – Cody Gray Jul 18 '17 at 11:49
  • Which debugging tools have you used so far? I'd expect to see you have used a heap checker such as Valgrind's memcheck, for example. – Toby Speight Jul 18 '17 at 12:04

1 Answers1

3

The allocation for the matrix indirect pointers is incorrect. it should read:

a = (float **)malloc(sizeof(float*) * a_r);

A safer way to write these allocations is this:

a = malloc(sizeof(*a) * a_r);

Note that you could allocate 2D matrices directly:

float (*a)[4][4] = malloc(sizeof(*a));

Or better, as Cody Gray suggested:

float (*a)[4][4] = _aligned_malloc(sizeof(*a));

_aligned_malloc is a non standard function that ensures proper alignment for SSE operands.

If fact you probably do not even need to allocate these matrices with malloc():

float a[4][4];

But with this latter choice, you must ensure proper alignment for the SSE operations to succeed.

The rest of the code has other problems:

  • void main() is incorrect. It should be int main(void)

  • The second matrix operand should be transposed so you can read multiple values at a time. The second load would become:

    __m128 row2 = _mm_loadu_ps(&b[j][k]);
    
  • The summation phase seems incorrect too. And the final store is definitely incorrect, should just be:

    c[i][j] = sum;
    
chqrlie
  • 114,102
  • 10
  • 108
  • 170