1130

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

I know that all values in the list are unique as in this example.

The first method I try is (3.8 sec in my real code):

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

The second method I try is (2x faster: 1.9 sec for my real code):

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

Proposed methods from Stack Overflow user (2.74 sec for my real code):

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

In my real code, the first method takes 3.81 sec and the second method takes 1.88 sec. It's a good improvement, but:

I'm a beginner with Python/scripting, and is there a faster way to do the same things and save more processing time?

More specific explanation for my application:

In the Blender API I can access a list of particles:

particles = [1, 2, 3, 4, etc.]

From there, I can access a particle's location:

particles[x].location = [x,y,z]

And for each particle I test if a neighbour exists by searching each particle location like so:

if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x:the particle's index
    in the array"
    particles.index([x+1,y,z])
jtlz2
  • 6,105
  • 7
  • 54
  • 98
Jean-Francois Gallant
  • 12,533
  • 6
  • 19
  • 24
  • 7
    In python the thing in square brackets is called a list, not an array. Rather than using a list use a set. Or keep your list sorted and use the `bisect` module – Steven Rumbalski Sep 27 '11 at 15:27
  • So you really need to juggle indices? Or doesn't order actually matter and you just want to do member ship tests, intersections, etc.? In order words, it depends on what you're really trying to do. Sets may work for you, and then they are a really good answer, but we can't tell from the code you showed. –  Sep 27 '11 at 15:36
  • 2
    Probably you have to specify in your question that you need not the value, but its index. – Roman Bodnarchuk Sep 27 '11 at 15:36
  • I edit my question and try to explain more clearly what I want to do ... I hope so... – Jean-Francois Gallant Sep 27 '11 at 16:05
  • 1
    @StevenRumbalski: because set cannot contain duplication content, while Jean wants to store location of particles (x,y,z could be the same), we cannot use set in this case – Hieu Vo Jun 01 '14 at 05:35
  • In an unordered list you must search all items worst case. On average the half of it. If you going to search multiple times, you may benefit from sorting the list once. Also: Did you consider asking at [Blender](https://blender.stackexchange.com/)? – U. Windl Sep 11 '21 at 21:41
  • One could consider using dictionary instead where the numbers are keys and value 1 means "is in the set" (or any other data you want to link to that number). Accessing data in dictionary is obviously fast when you have the item to search in the key. Do you already have the list at this point, or can you change the data storage format? – Mikko Rantalainen Dec 01 '21 at 17:00

12 Answers12

2046
7 in a

Clearest and fastest way to do it.

You can also consider using a set, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)

Rafe Kettler
  • 73,388
  • 19
  • 151
  • 149
320

As stated by others, in can be very slow for large lists. Here are some comparisons of the performances for in, set and bisect. Note the time (in second) is in log scale.

enter image description here

Code for testing:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time


def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time


def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time


def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time


def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]

        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))

    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()


profile()
szymmirr
  • 19
  • 7
xslittlegrass
  • 4,346
  • 4
  • 23
  • 32
  • 39
    Love cut-and-paste, executable code like this in answers. To save others a few seconds of time, you'll need 3 imports: `import random / import bisect / import matplotlib.pyplot as plt` and then call: `profile()` – kghastie Sep 01 '17 at 02:53
  • 2
    which version of python is this? – cowbert Sep 12 '17 at 20:02
  • always great to get the code but just heads up I had to import time to run – whla Feb 01 '18 at 18:22
  • 2
    And don't forget the humble `range()` object. When using `var in [integer list]`, see if a `range()` object can model the same sequence. Very close in performance to a set, but more concise. – Martijn Pieters Feb 04 '19 at 11:57
  • 1
    In my experience, converting a large list to set costs more time than searching directly in the list. – hui chen Nov 05 '21 at 09:01
  • It is probably worth mentioning that this only applies if you are looking for lots of elements in the list - in this code there is one conversion of list to set and then 1000s of membership checks so the faster lookup is more important than the conversion. If you only want to check for a single element @huichen is correct that it will take longer to do the conversion than a single `x in list` check. – Ben Rowland Dec 15 '21 at 21:42
61

You could put your items into a set. Set lookups are very efficient.

Try:

s = set(a)
if 7 in s:
  # do stuff

edit In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.

Community
  • 1
  • 1
NPE
  • 464,258
  • 100
  • 912
  • 987
34
def check_availability(element, collection: iter):
    return element in collection

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know if a chosen value is in an array.

Şafak Gezer
  • 3,697
  • 3
  • 42
  • 48
Tiago Moutinho
  • 1,344
  • 1
  • 13
  • 18
  • 4
    You need to put the code in a definition: def listValue(): a = [1,2,3,4,'a','b','c'] return 'a' in a x = listValue() print(x) – Tenzin Jun 04 '15 at 08:29
  • 12
    It's a valid Python answer it's just not good, readable code. – Rick Henderson May 12 '16 at 19:23
  • 1
    Beware ! This matches while this is very probably what you did not expect: `o='--skip'; o in ("--skip-ias"); # returns True !` – Alex F Feb 28 '18 at 10:02
  • 3
    @Alex F the `in` operator works the same way to test substring membership. The confusing part here is probably that `("hello")` is not a single-value tuple, while `("hello",)` is -- the comma makes the difference. `o in ("--skip-ias",)` is `False` as expected. – MoxieBall May 23 '18 at 21:58
  • This one was really usefull to me,but what I have to understand by "collection: iter" – Kev Aug 10 '20 at 00:26
31

The original question was:

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

Thus there are two things to find:

  1. is an item in the list, and
  2. what is the index (if in the list).

Towards this, I modified @xslittlegrass code to compute indexes in all cases, and added an additional method.

Results

enter image description here

Methods are:

  1. in--basically if x in b: return b.index(x)
  2. try--try/catch on b.index(x) (skips having to check if x in b)
  3. set--basically if x in set(b): return b.index(x)
  4. bisect--sort b with its index, binary search for x in sorted(b). Note mod from @xslittlegrass who returns the index in the sorted b, rather than the original b)
  5. reverse--form a reverse lookup dictionary d for b; then d[x] provides the index of x.

Results show that method 5 is the fastest.

Interestingly the try and the set methods are equivalent in time.


Test Code

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()
DarrylG
  • 14,084
  • 2
  • 15
  • 21
19
a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

This will only be a good idea if a doesn't change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.

Winston Ewert
  • 42,447
  • 10
  • 67
  • 81
  • It's working but not when implemented in my code: "TypeError: unhashable type:'list' – Jean-Francois Gallant Sep 27 '11 at 18:24
  • 1
    @Jean-FrancoisGallant, that's probably because you are using lists where you really ought to be using tuples. If you want comprehensive advice on how to speed up your code, you should post it at codereview.stackexchange.com. There you'll get style and performance advice. – Winston Ewert Sep 27 '11 at 18:32
  • 1
    This is a very clever solution to the problem. Instead of the try except construct, I would do: a_index = index.get(7) which will default to None if the key is not found. – murphsp1 Jul 20 '14 at 14:11
9

Be aware that the in operator tests not only equality (==) but also identity (is), the in logic for lists is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised, for example, numpy.NAN has the unusual property of being not being equal to itself:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

To distinguish between these unusual cases you could use any() like:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

Note the in logic for lists with any() would be:

any(element is target or element == target for element in lst)

However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want of course (either with a list or with a set).

Chris_Rands
  • 35,097
  • 12
  • 75
  • 106
  • 1
    NAN == NAN returning false has nothing unusual about it. It is the behavior defined in the IEEE 754 standard. – TommyD Jan 17 '19 at 16:18
8

It sounds like your application might gain advantage from the use of a Bloom Filter data structure.

In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often then the "found" result, you might see a speed up by adding a Bloom Filter.

For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for "python bloom filter library" will provide at least a couple useful implementations.

matt2000
  • 1,004
  • 9
  • 17
6

If you only want to check the existence of one element in a list,

7 in list_data

is the fastest solution. Note though that

7 in set_data

is a near-free operation, independently of the size of the set! Creating a set from a large list is 300 to 400 times slower than in, so if you need to check for many elements, creating a set first is faster.

enter image description here

Plot created with perfplot:

import perfplot
import numpy as np


def setup(n):
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)


def list_in(data):
    return 7 in data[0]


def create_set_from_list(data):
    return set(data[0])


def set_in(data):
    return 7 in data[1]


b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()
Nico Schlömer
  • 46,467
  • 24
  • 178
  • 218
5

Or use __contains__:

sequence.__contains__(value)

Demo:

>>> l = [1, 2, 3]
>>> l.__contains__(3)
True
>>> 
U12-Forward
  • 65,118
  • 12
  • 70
  • 89
  • 1
    `__contains__` is the implementation for `in`. 99 times out of 100, there's no need to call it directly. – CrazyChucky May 18 '21 at 03:03
  • 2
    @CrazyChucky Of course, I am not trying to say that my answer works the best, I am just providing a solution to the OP if maybe that 1 time he needs to use this. – U12-Forward May 18 '21 at 03:46
2

This is not the code, but the algorithm for very fast searching.

If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings: look at the bottom:

  • -Let "n" be the length of your list
  • -Optional step: if you need the index of the element: add a second column to the list with current index of elements (0 to n-1) - see later
  • Order your list or a copy of it (.sort())
  • Loop through:
    • Compare your number to the n/2th element of the list
      • If larger, loop again between indexes n/2-n
      • If smaller, loop again between indexes 0-n/2
      • If the same: you found it
  • Keep narrowing the list until you have found it or only have 2 numbers (below and above the one you are looking for)
  • This will find any element in at most 19 steps for a list of 1.000.000 (log(2)n to be precise)

If you also need the original position of your number, look for it in the second, index column.

If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.

Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.

Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
Adam
  • 304
  • 4
  • 12
2

Because the question is not always supposed to be understood as the fastest technical way - I always suggest the most straightforward fastest way to understand/write: a list comprehension, one-liner

[i for i in list_from_which_to_search if i in list_to_search_in]

I had a list_to_search_in with all the items, and wanted to return the indexes of the items in the list_from_which_to_search.

This returns the indexes in a nice list.

There are other ways to check this problem - however list comprehensions are quick enough, adding to the fact of writing it quick enough, to solve a problem.

Vaidøtas I.
  • 531
  • 5
  • 22