85

I have a very large NumPy array

1 40 3
4 50 4
5 60 7
5 49 6
6 70 8
8 80 9
8 72 1
9 90 7
.... 

I want to check to see if a value exists in the 1st column of the array. I've got a bunch of homegrown ways (e.g. iterating through each row and checking), but given the size of the array I'd like to find the most efficient method.

Thanks!

agf
thegreatt
  • 1
    You might use binary search if the 1st column is in non-decreasing order, or consider sorting if you do more than, let's say, 10 searches – Luka Rahne Aug 17 '11 at 06:39

7 Answers

94

How about

if value in my_array[:, col_num]:
    do_whatever

Edit: I think __contains__ is implemented in such a way that this is the same as @detly's version
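As a minimal sketch of the check above, using the sample rows from the question (my_array, col_num, and the probe values are placeholders chosen for illustration, not anything defined by the original poster):

```python
import numpy as np

# Sample rows taken from the question; the real array is assumed to be much larger.
my_array = np.array([
    [1, 40, 3],
    [4, 50, 4],
    [5, 60, 7],
    [5, 49, 6],
    [6, 70, 8],
    [8, 80, 9],
    [8, 72, 1],
    [9, 90, 7],
])

col_num = 0  # first column

found = 8 in my_array[:, col_num]    # 8 appears in the first column
missing = 7 in my_array[:, col_num]  # 7 appears only in the third column

# For a numeric scalar, `in` agrees with the explicit element-wise form.
same_as_any = found == bool(np.any(my_array[:, col_num] == 8))

print(found, missing, same_as_any)
```

Slicing with my_array[:, col_num] returns a view rather than a copy, so the check itself does not duplicate the column.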

agf
  • 9
    You know, I've been using `numpy`'s `any()` function so heavily recently, I completely forgot about plain old `in`. – detly Aug 17 '11 at 06:22
  • 13
    Okay, this is (a) more readable and (b) about 40% faster than my answer. – detly Aug 17 '11 at 06:42
  • 7
    In principle, `value in …` can be faster than `any(… == value)`, because it can iterate over the array elements and stop whenever the value is encountered (as opposed to calculating whether each array element is equal to the value, and then checking whether one of the boolean results is true). – Eric O Lebigot Aug 17 '11 at 08:02
  • 1
    @EOL really? In Python, `any` is short-circuiting, is it not in `numpy`? – agf Aug 17 '11 at 08:08
  • 1
    Thanks, I've barely used `numpy`. – agf Aug 17 '11 at 09:07
  • @EricLeschinski Your edit is confusing and doesn't directly relate to the question, so I'm reverting it. Perhaps you can post it as a comment, and rewrite it to be more clear. – agf Oct 13 '17 at 16:22
  • 10
    Things have changed since then: in the future, @detly's answer will become the only working solution; currently a warning is thrown. For more, see https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur – borgr Jan 08 '18 at 14:28
56

The most obvious to me would be:

np.any(my_array[:, 0] == value)
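To make the intermediate steps of that expression visible, here is a small sketch. The array first_col stands in for my_array[:, 0], and the extra np.flatnonzero call is an addition for illustration, not part of the answer itself:

```python
import numpy as np

first_col = np.array([1, 3, 6, 2, 9])  # stands in for my_array[:, 0]
value = 3

mask = first_col == value      # element-wise comparison, one boolean per row
exists = bool(np.any(mask))    # True if at least one row matched
rows = np.flatnonzero(mask)    # positions of the matching rows

print(mask.tolist())
print(exists, rows.tolist())
```

Keeping the mask around is useful when you also need to know which rows matched, not only whether a match exists.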
eduardosufan
detly
  • 2
    Hi @detly, can you add more explanation? It seems very obvious to you, but to a beginner like me it is not. My instinct tells me that this might be the solution that I'm looking for, but I could not try it without examples :D – jameshwart lopez Apr 11 '18 at 06:46
  • 2
    @jameshwartlopez `my_array[:, 0]` gives you all the rows (indicated by `:`) and for each row the `0`th element, i.e. the first column. This is a simple one-dimensional array, for example `[1, 3, 6, 2, 9]`. If you use the `==` operator in numpy with a scalar, it will do element-wise comparison and return a boolean numpy array of the same shape as the array. So `[1, 3, 6, 2, 9] == 3` gives `[False, True, False, False, False]`. Finally, `np.any` checks, if any of the values in this array are `True`. – Kilian Batzner May 16 '18 at 14:02
45

To check multiple values, you can use numpy.in1d(), which is an element-wise function version of the python keyword in. If your data is sorted, you can use numpy.searchsorted():

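import numpy as np

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])  # must be sorted for searchsorted
values = [2, 3, 4, 6, 7]

# Element-wise membership test: one boolean per entry of values
print(np.in1d(values, data))

# Binary search for each value, then check whether that slot holds the value
index = np.searchsorted(data, values)
print(data[index] == values)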
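The searchsorted lookup indexes past the end of data when a searched value exceeds its maximum. One possible guard, sketched here under the assumption that data is sorted and non-empty (the helper name in_sorted is hypothetical), is to clamp the index before using it:

```python
import numpy as np

def in_sorted(data, values):
    """Return a boolean array: which entries of `values` occur in sorted `data`."""
    data = np.asarray(data)
    values = np.asarray(values)
    index = np.searchsorted(data, values)
    # searchsorted returns len(data) for values above the maximum,
    # so clamp the index before using it to look up candidates.
    safe_index = np.minimum(index, len(data) - 1)
    return data[safe_index] == values

data = np.array([1, 4, 5, 5, 6, 8, 8, 9])
values = [2, 3, 4, 6, 7, 10]   # 10 is larger than anything in data

print(in_sorted(data, values))
print(np.isin(values, data))   # same answer without requiring sorted data
```

The np.isin call is included only as a cross-check; it does not need sorted input.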
HYRY
  • 4
    +1 for the less well-known `numpy.in1d()` and for the very fast `searchsorted()`. – Eric O Lebigot Aug 17 '11 at 08:06
  • @eryksun: Yeah, interesting. Same observation, here… – Eric O Lebigot Aug 17 '11 at 13:12
  • 1
    Note that the final line will throw an `IndexError` if any element of `values` is larger than the greatest value of `data`, so that requires specific attention. – fuglede Jul 30 '19 at 09:11
  • @fuglede It's possible to replace `index` with `index % len(data)` or `np.append(index[:-1],0)` equivalently in this case. – mathfux Jan 03 '20 at 16:19
  • 1
    [`np.in1d()`](https://numpy.org/doc/stable/reference/generated/numpy.in1d.html) is limited to 1-d numpy arrays. If you want to check whether multiple values are in a multidimensional numpy array, use the [`np.isin()`](https://numpy.org/doc/stable/reference/generated/numpy.isin.html) function. – Aelius Jan 06 '21 at 21:42
25

Fascinating. I needed to improve the speed of a series of loops that must perform matching index determination in this same way. So I decided to time all the solutions here, along with some riffs.

Here are my speed tests for Python 2.7.10:

import timeit
timeit.timeit('N.any(N.in1d(sids, val))', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

18.86137104034424

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = [20010401010101+x for x in range(1000)]')

15.061666011810303

timeit.timeit('N.in1d(sids, val)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

11.613027095794678

timeit.timeit('N.any(val == sids)', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

7.670552015304565

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

5.610057830810547

timeit.timeit('val == sids', setup = 'import numpy as N; val = 20010401020091; sids = N.array([20010401010101+x for x in range(1000)])')

1.6632978916168213

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = set([20010401010101+x for x in range(1000)])')

0.0548710823059082

timeit.timeit('val in sids', setup = 'import numpy as N; val = 20010401020091; sids = dict(zip([20010401010101+x for x in range(1000)],[True,]*1000))')

0.054754018783569336

Very surprising! Orders of magnitude difference!

To summarize, if you just want to know whether something's in a 1D list or not:

  • 19s N.any(N.in1d(numpy array, x))
  • 15s x in (list)
  • 8s N.any(x == numpy array)
  • 6s x in (numpy array)
  • 0.05s x in (set or dictionary)

If you want to know where something is in the list as well (order is important):

  • 12s N.in1d(numpy array, x)
  • 2s x == (numpy array)
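For anyone who wants to repeat the comparison on a current setup, the following is a rough Python 3 sketch of the same measurements. It assumes the same 1000-element data as above, uses far fewer repetitions than timeit's default so that it finishes quickly, and its absolute numbers will differ from the ones reported here:

```python
import timeit
import numpy as np

sids_list = [20010401010101 + x for x in range(1000)]
sids_array = np.array(sids_list)
sids_set = set(sids_list)
val = 20010401020091  # not present, so the linear approaches scan everything

candidates = {
    "x in list": lambda: val in sids_list,
    "x in numpy array": lambda: val in sids_array,
    "np.any(x == numpy array)": lambda: bool(np.any(val == sids_array)),
    "x in set": lambda: val in sids_set,
}

timings = {}
for label, check in candidates.items():
    # Total seconds for 10,000 calls of each membership test
    timings[label] = timeit.timeit(check, number=10_000)
    print(f"{label:28s} {timings[label]:.4f} s")
```

Because val is absent from the data, this measures the worst case for the list and array scans; a value near the front of a list would be found much sooner.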
Lukas Mandrake
3

Adding to @HYRY's answer, in1d seems to be fastest for NumPy. This is using NumPy 1.8 and Python 2.7.6.

In this test in1d was fastest; however, the expression 10 in a looks cleaner:

from numpy import arange, in1d

a = arange(0, 99999, 3)
%timeit 10 in a
%timeit in1d(a, 10)

10000 loops, best of 3: 150 µs per loop
10000 loops, best of 3: 61.9 µs per loop

Constructing a set is slower than calling in1d, but once the set exists, checking whether the value is in it is roughly a thousand times faster in this test:

s = set(range(0, 99999, 3))
%timeit 10 in s

10000000 loops, best of 3: 47 ns per loop
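Whether the set is worth building depends on how many lookups follow. The sketch below is one way to estimate that trade-off; it assumes the data starts out as a NumPy array and converts it with tolist(), which is a choice made for this example, and the timings it prints are machine-dependent:

```python
import timeit
import numpy as np

a = np.arange(0, 99999, 3)

# Average seconds per operation for each step
build = timeit.timeit(lambda: set(a.tolist()), number=20) / 20
s = set(a.tolist())
lookup = timeit.timeit(lambda: 10 in s, number=100_000) / 100_000
scan = timeit.timeit(lambda: 10 in a, number=1_000) / 1_000

# Number of lookups after which building the set has paid for itself,
# if each array scan costs `scan` and each set lookup costs `lookup`.
break_even = build / (scan - lookup) if scan > lookup else float("inf")
print(f"build {build:.2e} s, set lookup {lookup:.2e} s, array scan {scan:.2e} s")
print(f"break-even after about {break_even:.0f} lookups")
```

If only a handful of checks are needed, scanning the array directly is likely cheaper than paying the one-time conversion cost.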
Joelmob
  • 2
    The comparison isn't fair. You need to count the cost of converting an array to a `set`. OP starts with a NumPy array. – jpp Aug 08 '18 at 08:39
  • I didn't mean to compare the methods like that, so I edited the post to point out the cost of creating a set. If you already have a Python set, there is no big difference. – Joelmob Feb 26 '21 at 10:59
0

In my opinion, the most convenient way is:

(Val in X[:, col_num])

where Val is the value that you want to check for and X is the array. In your example, suppose you want to check whether the value 8 exists in the third column. Simply write

(8 in X[:, 2])

This will return True if 8 is there in the third column, else False.

Loochie
0

If you are looking for a list of integers, you can let indexing do the work. This also works with n-dimensional arrays, but seems to be slower. It may pay off when you need to do the check more than once.

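import numpy as np

def valuesInArray(values, array):
    values = np.asanyarray(values)
    array = np.asanyarray(array)
    # Lookup-table indexing only works for non-negative integers
    assert np.issubdtype(array.dtype, np.integer)
    assert np.issubdtype(values.dtype, np.integer)

    # One flag per possible integer; sized to cover both inputs
    size = max(array.max(), values.max()) + 1
    matches = np.zeros(size, dtype=np.bool_)
    matches[values] = True

    # Fancy indexing maps every element of array to its flag
    res = matches[array]

    return np.any(res), res


array = np.random.randint(0, 1000, (10000, 3))
values = np.array((1, 6, 23, 543, 222))

matched, matches = valuesInArray(values, array)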

By using numba and njit, I could speed this up by a factor of about 10.