I am new to CUDA programming. I have to implement a recursive algorithm on the GPU that involves a varying number of matrix inversions. Currently I use the getrfBatched and getriBatched routines from the cuBLAS library to invert one matrix at a time. I think I could speed up the whole algorithm if I could perform all of those matrix inversions in parallel. Can anyone suggest a way to do this?
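For context, here is a minimal sketch of the call sequence described above. It is not my actual code: it assumes single-precision, column-major n x n matrices of equal size and a recent CUDA toolkit, and the helper name `invert_batched` and the buffer layout are hypothetical. Both routines take a `batchSize` argument, so the same sequence could in principle handle several same-sized matrices in one call; I currently pass 1. Most error checking is omitted for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: inverts `batch` n x n column-major float matrices.
// h_A holds the inputs back to back; h_Ainv receives the inverses.
// Returns false if a cuBLAS call fails or any matrix is reported singular.
bool invert_batched(const float* h_A, float* h_Ainv, int n, int batch) {
    const size_t elems = static_cast<size_t>(n) * n;
    const size_t bytes = elems * batch * sizeof(float);
    float *d_A = nullptr, *d_C = nullptr;
    float **d_Aptrs = nullptr, **d_Cptrs = nullptr;
    int *d_pivots = nullptr, *d_info = nullptr;

    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMalloc(&d_Aptrs, batch * sizeof(float*));
    cudaMalloc(&d_Cptrs, batch * sizeof(float*));
    cudaMalloc(&d_pivots, static_cast<size_t>(n) * batch * sizeof(int));
    cudaMalloc(&d_info, batch * sizeof(int));
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    // The batched routines take arrays of device pointers, and those
    // arrays must themselves live in device memory.
    std::vector<float*> hAptrs(batch), hCptrs(batch);
    for (int i = 0; i < batch; ++i) {
        hAptrs[i] = d_A + i * elems;
        hCptrs[i] = d_C + i * elems;
    }
    cudaMemcpy(d_Aptrs, hAptrs.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Cptrs, hCptrs.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Step 1: in-place LU factorisation with partial pivoting.
    bool ok = cublasSgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_info, batch)
              == CUBLAS_STATUS_SUCCESS;

    // A non-zero info entry after getrf means that matrix is singular.
    std::vector<int> info(batch);
    cudaMemcpy(info.data(), d_info, batch * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < batch; ++i) ok = ok && (info[i] == 0);

    // Step 2: out-of-place inversion from the LU factors.
    if (ok) {
        ok = cublasSgetriBatched(handle, n, d_Aptrs, n, d_pivots, d_Cptrs, n,
                                 d_info, batch) == CUBLAS_STATUS_SUCCESS;
        cudaMemcpy(h_Ainv, d_C, bytes, cudaMemcpyDeviceToHost);
    }

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_C); cudaFree(d_Aptrs); cudaFree(d_Cptrs);
    cudaFree(d_pivots); cudaFree(d_info);
    return ok;
}

int main() {
    const int n = 2, batch = 2;
    // Two 2x2 matrices stored column by column, one after the other.
    const float A[n * n * batch] = {4, 2, 7, 6,   1, 3, 2, 4};
    float Ainv[n * n * batch] = {0};

    if (!invert_batched(A, Ainv, n, batch)) {
        std::printf("inversion failed\n");
        return 1;
    }
    for (int b = 0; b < batch; ++b) {
        std::printf("inverse %d (column-major):", b);
        for (int i = 0; i < n * n; ++i) std::printf(" %g", Ainv[b * n * n + i]);
        std::printf("\n");
    }
    return 0;
}
```

This should build with something like `nvcc example.cu -lcublas`. One limitation of this layout is that every matrix in a batch must have the same dimension n, which may or may not hold at each level of my recursion.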
Thanks