Alternative to python hash function for arbitrary objects

Question

In python2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mockup code looks like this:

class PersistentDict(object):
  def __setitem__(self, key, value):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    self._store_to_bucket(bucket_index, key, value)

  def __getitem__(self, key):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    return self._fetch_from_bucket(bucket_index)[key]

In python3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:

Must be stable across interpreter invocations
May require parameters supplied at execution time, e.g. setting a salt in the call
Must support arbitrary objects (anything supported by dict/set)

I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.

[1] Using hash() to identify buckets is either:

Not reliable across interpreter invocations, if salts are random
Prevents applications from using the random salting feature, if salts are fixed
Unusable if two PersistentDicts were created with different salts

The random hash only applies to `str`, `bytes` and `datetime` objects. You'd only need an alternative for those types. — Martijn Pieters, Jun 24 '16 at 09:17
@MartijnPieters Thanks! I guess str and bytes can be covered by zlib/hashlib. I'll see whether I can find something fast for datetime. — MisterMiyagi, Jun 24 '16 at 09:26
`str(datetime)` should suffice; that gets you the ISO8601 format with microseconds and optionally the timezone if attached, and then you have a `str`... — Martijn Pieters, Jun 24 '16 at 09:46
I'd still test thoroughly and look for alternatives if timezones are involved however; I suspect that timezone subtleties could lead to equal ISO8601 representations for otherwise different timezone objects, in ways that could matter. — Martijn Pieters, Jun 24 '16 at 09:47
@MartijnPieters It seems that salting also applies to ``None``. I have yet to find a suitable solution for hashable containers, without restricting it to ``repr`` or the like. — MisterMiyagi, Dec 07 '17 at 09:50
For `None` the `__hash__` implementation is inherited from `type`, which produces the hash of the memory address of the object, since this varies from interpreter to interpreter process, the value can appear random. This means that setting `PYTHONHASHSEED` **will not affect the hash value of `None`**, like it would for types to which the seed does apply. — Martijn Pieters, Dec 07 '17 at 10:28
So I guess I have to modify my first comment to include type objects (custom classes and built-in) and `None`. — Martijn Pieters, Dec 07 '17 at 10:31

MisterMiyagi · Answer 1 · 2017-12-07T11:00:45.917

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:

def hashkey(obj, salt=0):
  """
  Create a key suitable for use in hashmaps

  :param obj: object for which to create a key
  :type: str, bytes, :py:class:`datetime.datetime`, object
  :param salt: an optional salt to add to the key value
  :type salt: int
  :return: numeric key to `obj`
  :rtype: int
  """
  if obj is None:
    return 0
  if isinstance(obj, str):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff
  elif isinstance(obj, bytes):
    return zlib.adler32(obj, salt) & 0xffffffff
  elif isinstance(obj, datetime_type):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff

With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead. 0.8 usec for bytes and 0.7 for str.

Overhead is roughly as follows:

0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
0.2 usec to 0.5 usec for selecting the hash function via isinstance
0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec)
0.15 usec for obj.encode() of str objects ("foobar")
1.5 usec for str(obj).encode() of datetime.datetime objects

The most optimization comes from ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:

def hashkey_c(obj, salt=0):
  if obj.__class__ in hashkey_c.types:
    if obj is None:
      return 0
    if obj.__class__ is str:
      return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif obj.__class__ is bytes:
      return zlib.adler32(obj, salt) & 0xffffffff
    elif obj.__class__ is datetime_type:
      return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}

Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map type to hash comparable, if one uses an explicit check on the dict keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).

Some additional notes:

hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')

As of python3.6 or so this is not working. It is not stable across different interpreter executions. E.g. try hashkey(None). — Neal Young, Dec 07 '17 at 06:24
@NealYoung Thanks for checking. It seems ``hash(None)`` is salted at least since Python3.4 - I've added this case now. I also realised that hashable containers, such as Tuple, would require special casing as well. — MisterMiyagi, Dec 07 '17 at 09:48

Alternative to python hash function for arbitrary objects

1 Answers1

Linked