9

Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).

However this evaluates the entire array of (x == y), which is usually not needed. My arrays are really large, and I have a lot of them, and the probability of two arrays being equal is small, so in all likelihood, I really only need to evaluate a very small portion of (x == y) before the all function could return False, so this is not an optimal solution for me.

I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))

However, that just seems much slower in the case that two arrays are equal, that overall, it's stil not worth using over np.all. I presume because of the builtin all's general-purposeness. And np.all doesn't work on generators.

Is there a way to do what I want in a more speedy manner?

I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise) but they specifically don't cover the case of early termination.

Community
  • 1
  • 1
acdr
  • 4,328
  • 1
  • 14
  • 41
  • what about this function: https://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.array_equal.html – Thomas Kühn May 15 '17 at 07:51
  • @Thomas: That function just calls `np.all` internally, so it's kind of useless. (I would indeed expect a function specifically for this purpose to do short-circuiting, but alas, it does not.) – acdr May 15 '17 at 07:54
  • 1
    Hmm, that's a shame. A numpy-internal function would be your only chance, I guess, as any looping outside of numpy is almost bound to be slower. Have you considered contacting the developers directly? – Thomas Kühn May 15 '17 at 08:00

7 Answers7

12

Until this is implemented in numpy natively you can write your own function and jit-compile it with numba:

import numpy as np
import numba as nb


@nb.jit(nopython=True)
def arrays_equal(a, b):
    if a.shape != b.shape:
        return False
    for ai, bi in zip(a.flat, b.flat):
        if ai != bi:
            return False
    return True


a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)


%timeit np.all(a==b)  # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a)  # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b)  # 100000 loops, best of 3: 691 ns per loop

Worst case performance (arrays equal) is equivalent to np.all and in case of early stopping the compiled function has the potential to outperform np.all a lot.

MB-F
  • 21,224
  • 4
  • 58
  • 107
  • I like it, but for my test arrays, if they are equal, it still takes about 1.6 times longer than `np.all(arr1==arr2)`. (For reference, `arr1 = np.ones((1000000,), dtype=bool)`, 'arr2 = np.ones((1000000,), dtype=bool)', 'arr2[100000] = False`). (Make sure to lower the number on your timeit to something like 1000.) – acdr May 15 '17 at 09:35
  • @acdr When I use your arrays `np.all` takes 1.8 ms and `arrays_equal` takes 183 µs. If I compare `arr1` against itself both take about 1.8 ms. Maybe this discrepancy is caused by our systems? I have Python 3.5.2, numpy 1.12.1 and numba 0.27.0. – MB-F May 15 '17 at 09:44
  • Might be. I'm running much older stuff than you in general: Python 2.7.10.2, numpy 1.9.1, numba 0.20.0 – acdr May 15 '17 at 09:54
  • `Np.all` does not have branch instructions. In the case where arrays are identical, you'd expect a function without branches to be faster than a function with branches. That's probably where the discrepancy comes from. You should look at your use cases and decide what is more likely to happen. This is still python, and not assembly, so microoptimizations may not always have the effect you want. – dangom May 15 '17 at 09:54
  • @dangom Your point can be easily verified by writing a branchless version of above's `array_compare`. I did so and found that you are right - in the worst case scenario the branchless version is slightly faster. However, `np.all` performs equivalently to the branching version. (This casts doubt on your assertion that `np.all` does not have branch instructions.) Anyway, the point here is not to optimize the worst case but the *likely* case, which allegedly can benefit from short-circuit termination. I suspect that numba 0.20 produces less optimal code than numba 0.27 in the particular situation – MB-F May 15 '17 at 10:43
  • It is in numpy now, right? https://numpy.org/doc/stable/reference/generated/numpy.array_equal.html – Jann Poppinga Aug 19 '20 at 08:56
  • 1
    @JannPoppinga unfortunately, no; at least not in this function. `array_equal` just [calls `np.all(a==b)` internally](https://github.com/numpy/numpy/blob/v1.19.0/numpy/core/numeric.py#L2378). – MB-F Aug 19 '20 at 14:08
1

Adding short-circuit logic to array comparisons is apparently being discussed on the numpy page on github, and will thus presumably be available in a future version of numpy.

acdr
  • 4,328
  • 1
  • 14
  • 41
1

Hmmm, I know it is the poor answer but it seems there is no easy way for this. Numpy Creators should fix it. I suggest:

def compare(a, b):
    if len(a) > 0 and not np.array_equal(a[0], b[0]):
        return False
    if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
        return False
    if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
        return False
    return np.array_equal(a, b)

:)

Śmigło
  • 709
  • 7
  • 12
  • 1
    because nobody else said, that it can't be done with numpy, and the problem is still open I think – Śmigło Dec 18 '17 at 21:30
  • [This answer](https://stackoverflow.com/a/43975611/501011) has been accepted and uses numpy – MechMK1 Dec 18 '17 at 21:54
  • 2
    it uses numba. If you're not happy with someone's telling honestly there is no better way to do sth, you can flag it, but my answet at least contains creative solution. – Śmigło Dec 19 '17 at 10:18
1

Well, not really an answer as I haven't checked if it break-circuits, but:

assert_array_equal.

From the documentation:

Raises an AssertionError if two array_like objects are not equal.

Try Except it if not on a performance sensitive code path.

Or follow the underlying source code, maybe it's efficient.

matanster
  • 14,583
  • 17
  • 83
  • 152
  • 1
    Thanks for the suggestion. Unfortunately, the underlying code looks to just be a wrapper for `x == y`, with a few extra steps added in for some corner cases (like `NaN`s and `Inf`s). – acdr Mar 25 '19 at 10:20
0

You could iterate all elements of the arrays and check if they are equal. If the arrays are most likely not equal it will return much faster than the .all function. Something like this:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 3, 4])

areEqual = True

for x in range(0, a.size-1):
        if a[x] != b[x]:
                areEqual = False
                break
        else:
               print "a[x] is equal to b[x]\n"

if areEqual:
        print "The tables are equal\n"
else:
        print "The tables are not equal\n"
Anoroah
  • 1,767
  • 1
  • 18
  • 29
  • This is effectively what `all(val1==val2 for val1,val2 in itertools.izip(x, y))` does: it loops through `x` and `y`, returning pairs of `val1` and `val2`, checks if they're the same, and passes the result to `all`, which will return `False` as soon as it finds a non-equal pair. – acdr May 15 '17 at 09:11
  • Oh i see, i thought it goes through all elements of the arrays. – Anoroah May 15 '17 at 09:14
  • Luckily, the builtin `all` does do circuit breaking, unlike `np.all`. :) – acdr May 15 '17 at 09:36
0

Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True

%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop

If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native python type that can be short-circuited out of the buffer.

I think.

EDIT: further testing shows it may not be as big a time-improvement as shown. previously a=b=np.eye(5)

a=np.random.randint(0,10,(100,100))

b=a.copy()

%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True
Daniel F
  • 12,948
  • 1
  • 24
  • 53
  • Doesn't this just test if the two arrays are actually different names for the same object, rather than two different objects that have the same values? – acdr May 15 '17 at 09:39
  • Not as far as I can tell. Tested using `.copy()` as above and then sequentially manipulating both random arrays above in identical ways. – Daniel F May 15 '17 at 09:44
  • Didn't work for me, with a current Anaconda version of numpy. Maybe it just doesn't like NaNs. – matanster Mar 23 '19 at 17:29
  • @matanster not sure what you tried, but in standard usage NaN != NaN – Daniel F Mar 25 '19 at 07:00
0

As Thomas Kühn wrote in a comment to your post, array_equal is a function which should solve the problem. It is described in Numpy's API reference.