
There is a contradiction in claims about whether the insert operation on dict() or the add operation on set() is O(n) or O(1), where n is the length of the string.

Suppose we have strings n_1, n_2, ..., n_x which vary in length. Then the time complexity of performing the following:

s = set()
d = dict()
for x in N:  # where N = [n_1, n_2, ..., n_x], a list of strings
    s.add(x)
    d[x] = 1

is O(len(N) * Z), where Z = len(n_1) + len(n_2) + ... + len(n_x). If we assume that add or insert is an O(1) operation, then the time complexity will be O(len(N)).

Is the above true?

From http://svn.python.org/projects/python/trunk/Objects/stringobject.c we see that the computation of the hash depends on the length of the string, which I assume is the `len` variable below:

static long string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)       /* hash already computed: return cached value */
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)           /* one iteration per byte: O(len) */
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)                 /* -1 is reserved to mean "hash not computed yet" */
        x = -2;
    a->ob_shash = x;             /* cache the hash on the string object */
    return x;
}
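
For illustration, here is a rough pure-Python transcription of that loop (my own sketch, not CPython code; the mask approximates the wrap-around of a 64-bit C long). The point is simply that the loop touches every byte, so the first computation of the hash is linear in the string's length:

def string_hash(s: bytes) -> int:
    # Illustrative transcription of the C loop above (no caching here).
    if not s:
        return 0
    x = s[0] << 7
    for byte in s:  # one iteration per byte -> O(len(s))
        x = ((1000003 * x) ^ byte) & 0xFFFFFFFFFFFFFFFF
    x ^= len(s)
    if x == -1:     # -1 is reserved as the "not computed" sentinel
        x = -2
    return x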

Here (Efficiency of long (str) keys in python dictionary) someone showed that varying the length of the string does not affect the time to compute the hash. But this contradicts the above code.

From the following link we know that the hash value, once computed, is stored in the object: Get dictionary keys hashes without recalculation. This means that lookup will be constant time, O(1). However, insertion/adding, which is when the hash is computed, should be linear in the length of the string.
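
A quick way to observe both effects (a hedged sketch; exact numbers depend on your machine and CPython version) is to time the first hash of a fresh string against the second, cached one:

import time

for n in (10**4, 10**5, 10**6, 10**7):
    s = "x" * n  # fresh string; its hash has not been cached yet
    t0 = time.perf_counter()
    hash(s)      # first call walks all n characters: O(n)
    t1 = time.perf_counter()
    hash(s)      # second call returns the cached value: O(1)
    t2 = time.perf_counter()
    print(f"len={n:>8}  first={t1 - t0:.6f}s  cached={t2 - t1:.6f}s")

The first column of timings should grow roughly linearly with n, while the cached column stays flat.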

  • The size of each string is *independent* of the number of items already in the set. `len(N) * Z` consists of two *different* `len` functions: one that computes the length of the set, and another that computes the length of a string; and the second one is *constant* in terms of the number of items in the set. – chepner May 02 '19 at 14:40
  • Wait, suppose N is the input supplied to the function and N varies; then it cannot be considered constant, can it? – block-ch May 02 '19 at 14:47
  • Nothing about the *algorithm* assumes or cares if input will be `N`; if you provide such an input, that's a *coincidence*, not something that affects the *asymptotic* worst-case running time of the operation. – chepner May 02 '19 at 14:58

1 Answer


There are a gazillion things on which the performance of insert depends. The calculation of the hash function is indeed O(k) for a string of length k, but it is just uninteresting in the general case.

If you consider string keys of only 8 bytes in length, there are 18446744073709551616 different combinations, and 8 is a constant: the calculation of the hash for an 8-byte key is O(8), which is O(1).
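
As a quick sanity check of that count (plain arithmetic, nothing more):

# 8 bytes, 256 possible values per byte:
assert 256 ** 8 == 2 ** 64 == 18446744073709551616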

But with 18446744073709551616 items, an insertion into a hash table could still take, say, 1 µs. For a list, where insertion at the beginning is O(n), even if copying a single item took only one nanosecond, inserting at the beginning of a list with that many items would take about 585 years.
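
The 585-year figure is simple arithmetic (a sketch assuming exactly 1 ns per shifted item):

# 2**64 items shifted at 1 ns each when inserting at the front of a list:
seconds = 2 ** 64 * 1e-9
years = seconds / (60 * 60 * 24 * 365.25)
print(round(years))  # -> 585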

OTOH, while it is conceivable that you might have a collection of 4294967296 or even 18446744073709551616 items, if you've got a key of 4294967296 or 18446744073709551616 bytes for your hash table, you seriously need to rethink your architecture.

  • But then why is it that here the hash computation appears to be O(1): https://stackoverflow.com/questions/28150047/efficiency-of-long-str-keys-in-python-dictionary – block-ch May 02 '19 at 14:33
  • Good point. Of course, it is a case of Martijn being wrong (again). The hash code is cached on the string. – Antti Haapala -- Слава Україні May 02 '19 at 14:36
  • Sure, the hash is cached on the string once computed using O(k) time. Also, others have been saying that computing a hash in a hash table is independent of the length of the string, i.e. some fixed number of bits is used in the computation: https://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1 . I tried to disprove that by showing the Python source code of the implementation of the hash function. The link could actually be right if Python is not using that hash function but some other one, hence my confusion over everything. – block-ch May 02 '19 at 14:41
  • Thanks. So, here https://wiki.python.org/moin/TimeComplexity, in the dict section, set item is said to be O(1). It is O(1) if we are setting a value for a key whose hash has already been computed, but if all the keys are different and being hashed for the first time, the time complexity is O(k), where k is the length of each string, which could vary. Is that correct? – block-ch May 02 '19 at 14:56
  • It is O(1) as a function of the size of the collection. – Antti Haapala -- Слава Україні Mar 06 '21 at 15:34
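
To illustrate the last comment's point, here is a minimal (and admittedly noisy) timing sketch: setting a key whose hash is already cached costs about the same no matter how many items the dict holds:

import timeit

key = "some key"
hash(key)  # ensure the hash is cached on the string object

for size in (10**3, 10**5, 10**6):
    d = {str(i): 1 for i in range(size)}
    t = timeit.timeit(lambda: d.__setitem__(key, 1), number=100_000)
    print(f"dict size {size:>7}: {t:.4f}s per 100k sets")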