NumPy uses BLAS only for the primitive operations that BLAS itself specifies. That includes dot, inner, vdot, matmul (new in 1.10), and the functions built on top of them (tensordot etc.). einsum, on the other hand, only calls BLAS for operations that it can recognize as reducible to BLAS calls (as of NumPy 1.14.0).
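To illustrate, einsum's optimize keyword lets it detect such a contraction and route it through tensordot/BLAS. A minimal sketch (the array shapes here are arbitrary):

```python
import numpy as np

a = np.random.rand(200, 300)
b = np.random.rand(300, 100)

# A plain matrix product written as an einsum contraction; with
# optimize enabled, recent NumPy versions can dispatch this to BLAS.
c = np.einsum('ij,jk->ik', a, b, optimize=True)
```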
If your problem can be decomposed into several BLAS operations, I suggest you try that first in NumPy itself. It may require some temporary arrays in between (that would still be true even if you wrote C/FORTRAN code that calls BLAS directly). You can eliminate some of the array-creation overhead by using the out= parameter of the functions.
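For instance, a chained matrix product decomposes into two dot calls, and the second can write into a preallocated buffer via out= (the array names here are illustrative):

```python
import numpy as np

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
c = np.random.rand(100, 100)

tmp = a.dot(b)            # temporary array: one BLAS GEMM call
out = np.empty((100, 100))
np.dot(tmp, c, out=out)   # second GEMM writes into out, no new allocation
```

Note that out= must match the result's shape and dtype exactly and be C-contiguous, or NumPy will raise an error.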
But most of the time, you're using einsum precisely because the operation is not expressible in BLAS. Here's a simple example:
import numpy as np

a = np.arange(60.).reshape(3, 4, 5)
b = np.arange(24.).reshape(4, 3, 2)
c = np.einsum('ijk,jil->kl', a, b)
To express the above in primitive operations, you need to swap the first two axes of b, do an element-wise multiplication over the first two dimensions, and then sum the result, once for each pair of indices k and l:
c2 = np.empty((5, 2))
b2 = np.swapaxes(b, 0, 1)

def manualeinsum(c2, a, b):
    ny, nx = c2.shape
    for k in range(ny):
        for l in range(nx):
            c2[k, l] = np.sum(a[..., k] * b[..., l])

manualeinsum(c2, a, b2)
You can't BLAS that. UPDATE: The above problem can in fact be expressed as a matrix multiplication, which BLAS can accelerate. See the comment by @ali_m. For large enough arrays, the BLAS approach is faster.
Meanwhile, note that einsum itself is written in C: it creates a dimension-specific iterator for the given indices, and it is also optimized for SSE.