20

In Python you can get the intersection of two sets doing:

>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])

Anybody knows the complexity of this intersection (&) algorithm?

EDIT: In addition, does anyone know what is the data structure behind a Python set?

juliomalegria
  • 22,918
  • 12
  • 67
  • 88

3 Answers3

30

The intersection algorithm always runs at O(min(len(s1), len(s2))).

In pure Python, it looks like this:

    def intersection(self, other):
        if len(self) <= len(other):
            little, big = self, other
        else:
            little, big = other, self
        result = set()
        for elem in little:
            if elem in big:
                result.add(elem)
        return result

[Answer to the question in the additional edit] The data structure behind sets is a hash table.

Raymond Hettinger
  • 199,887
  • 59
  • 344
  • 454
  • According to the wiki I linked above, the worst case for `elem in big` in your code is O(n) (although the average is of course O(1)). That's the basis for the intersection worst case of O(len(s)*len(t)). Any idea why? – Kurt McKee Nov 12 '11 at 04:33
  • 14
    The "worst case" assumes data that is inappropriate for use in the hash table used by *dict* and *set*. The data would have to be something such that every datum had exactly the same hash value -- this would force the hash table to do something akin to a linear search to do the \_\_contains\_\_ check. IOW, I wouldn't worry about this at all. Set intersection is blindly fast -- it even reuses the internally stored hash values so it won't need to make any calls to *hash()*. – Raymond Hettinger Nov 12 '11 at 04:47
  • Link to 3.x code: [here](https://github.com/python/cpython/blob/3.9/Objects/setobject.c#L1181) it is for 3.9. – user650654 Feb 02 '21 at 19:32
18

The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:

Average:     O(min(len(s), len(t))
Worst case:  O(len(s) * len(t))

EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.

Kurt McKee
  • 1,386
  • 12
  • 16
  • 2
    that's a nasty worst case, isn't it? – juliomalegria Nov 12 '11 at 04:19
  • I was surprised by it as well! Maybe it's an issue of having different data types mixed in the two sets being intersected? – Kurt McKee Nov 12 '11 at 04:22
  • 1
    It doesn't look like it uses a sort first (which *requires the objects have an ordering*), but rather just does a "hash probe": perhaps for a better `C` and Average (and *no ordering requirement*). The maximum required complexity, AFAIK, is about `O(n log n) + O(n)`, with a sort. However, Big-O is an upper-bounds and there are practical considerations so... –  Nov 12 '11 at 04:26
  • 1
    I think the major issue here is that the set is an unordered collection. In C++, you can make an intersection (with two sorted lists) in 2*(L1+L2)-1. That's a damn good complexity! http://cplusplus.com/reference/algorithm/set_intersection/ – juliomalegria Nov 12 '11 at 04:27
  • Nice! I really appreciate the insight; hash collisions make bunches of sense. +1's all around! – Kurt McKee Nov 12 '11 at 04:42
  • 4
    This answer is somewhat misleading with respect to the "worst case" time. Don't let it lead you away from a perfectly fine algorithm. – Raymond Hettinger Nov 12 '11 at 05:09
  • I agree with you; fixed. Again, I appreciate the discussion! – Kurt McKee Nov 12 '11 at 05:23
  • Please don't say "google it" in these answers. These answers often become the results of googling something. This answer was the first search result for me. – user124384 Sep 15 '18 at 19:13
  • Likelihood of something happening is irrelevant when discussion the "worst case". The worst case is the worst case as long as there is a nonzero chance of it happening. However, if Python is using a hash table, that $n^2$ complexity should not happen ever. Therefore, it's not even the worst case at all. – user124384 Sep 15 '18 at 19:17
  • 1
    @user124384 It's funny how the first search result is this thread from that hyperlink – THIS USER NEEDS HELP Aug 01 '19 at 05:35
2

Set intersection of two sets of sizes m,n can be achieved with O(max{m,n} * log(min{m,n})) in the following way: Assume m << n

1. Represent the two sets as list/array(something sortable)
2. Sort the **smaller** list/array (cost: m*logm)
3. Do until all elements in the bigger list has been checked:
    3.1 Sort the next **m** items on the bigger list(cost: m*logm)
    3.2 With a single pass compare the smaller list and the m items you just sorted and take the ones that appear in both of them(cost: m)
4. Return the new set

The loop in step 3 will run for n/m iterations and each iteration will take O(m*logm), so you will have time complexity of O(nlogm) for m << n.

I think that's the best lower bound that exists

Elad Yehezkel
  • 111
  • 1
  • 5