Based on "Is numpy.einsum efficient compared to fortran or C?" (related comparison: "Benchmarking (python vs. c++ using BLAS) and (numpy)"), my understanding is that `einsum` is implemented in C and can dispatch to BLAS for at least some tensor contractions.
I suppose that, for the contractions BLAS can handle, the time spent in C is similar to calling BLAS directly from Fortran/C/C++.
However, `einsum` still has to be called from Python, which may add some overhead. How large is that overhead per call? I can run some benchmarks, but I would appreciate higher-level answers as well.
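A minimal sketch of such a benchmark, under some assumptions: on very small arrays the arithmetic is negligible, so the time per call is mostly the fixed cost of calling into NumPy from Python, while on larger arrays the contraction itself dominates. The array sizes, repeat counts, and the helper names `per_call_seconds` and `benchmark` are arbitrary choices for illustration. `optimize=True` is included because, as far as I know, it lets `einsum` route eligible contractions through BLAS-backed routines. No timings are listed, since they depend on the machine and the BLAS build.

```python
import timeit

import numpy as np


def per_call_seconds(fn, number, repeat=3):
    """Best-of-`repeat` average wall time of one fn() call, in seconds."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


def benchmark(n, number):
    """Time an n x n matrix product written three ways."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    return {
        # Default einsum: NumPy's own C loops.
        "einsum": per_call_seconds(
            lambda: np.einsum("ij,jk->ik", a, b), number),
        # optimize=True: einsum may choose a BLAS-backed path.
        "einsum_optimize": per_call_seconds(
            lambda: np.einsum("ij,jk->ik", a, b, optimize=True), number),
        # Plain matmul as the reference BLAS call from Python.
        "matmul": per_call_seconds(lambda: a @ b, number),
    }


# Tiny arrays expose the fixed per-call cost; larger ones expose the kernel.
results = {n: benchmark(n, number) for n, number in [(2, 2000), (200, 20)]}

for n, timings in results.items():
    for name, seconds in timings.items():
        print(f"n={n:4d}  {name:16s} {seconds * 1e6:10.2f} us per call")
```

Comparing the `n=2` rows gives a rough estimate of the per-call overhead of each entry point, and comparing the `n=200` rows shows whether that overhead still matters once the contraction is large.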