According to the following table (from this paper), numpy's np.dot performance is comparable to a CUDA implementation of matrix multiplication in experiments with 320x320 matrices. I was able to replicate this speedup for np.dot on my machine with reasonable precision. Their CUDA-with-Numba code, however, ran much slower for me: a speedup of about 1200 instead of the 49258 reported.
Why is numpy's implementation so fast?
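For reference, this is roughly the kind of measurement I mean (not the paper's code, just a minimal sketch of timing np.dot on 320x320 matrices with timeit):

```python
# Minimal sketch (not the paper's code): timing np.dot on 320x320 matrices.
import numpy as np
import timeit

n = 320
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Average time per call for the BLAS-backed matrix multiply.
t_dot = timeit.timeit(lambda: np.dot(a, b), number=100) / 100
print(f"np.dot ({n}x{n}): {t_dot * 1e3:.3f} ms per call")
```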
Edit: here's the code taken from the paper; I just added the timeit calls. I ran it on the following laptop.