1

In python2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mockup code looks like this:

class PersistentDict(object):
  def __setitem__(self, key, value):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    self._store_to_bucket(bucket_index, key, value)

  def __getitem__(self, key):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    return self._fetch_from_bucket(bucket_index)[key]

In python3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:

  • Must be stable across interpreter invocations
  • May require parameters supplied at execution time, e.g. setting a salt in the call
  • Must support arbitrary objects (anything supported by dict/set)

I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.


[1] Using hash() to identify buckets is either:

  • Not reliable across interpreter invocations, if salts are random
  • Prevents applications from using the random salting feature, if salts are fixed
  • Unusable if two PersistentDicts were created with different salts
Community
  • 1
  • 1
MisterMiyagi
  • 36,972
  • 7
  • 82
  • 99
  • 1
    The random hash only applies to `str`, `bytes` and `datetime` objects. You'd only need an alternative for those types. – Martijn Pieters Jun 24 '16 at 09:17
  • @MartijnPieters Thanks! I guess str and bytes can be covered by zlib/hashlib. I'll see whether I can find something fast for datetime. – MisterMiyagi Jun 24 '16 at 09:26
  • `str(datetime)` should suffice; that gets you the ISO8601 format with microseconds and optionally the timezone if attached, and then you have a `str`... – Martijn Pieters Jun 24 '16 at 09:46
  • 1
    I'd still test thoroughly and look for alternatives if timezones are involved however; I suspect that timezone subtleties could lead to equal ISO8601 representations for otherwise different timezone objects, in ways that could matter. – Martijn Pieters Jun 24 '16 at 09:47
  • @MartijnPieters It seems that salting also applies to ``None``. I have yet to find a suitable solution for hashable containers, without restricting it to ``repr`` or the like. – MisterMiyagi Dec 07 '17 at 09:50
  • 1
    For `None` the `__hash__` implementation is inherited from `type`, which produces the hash of the memory address of the object, since this varies from interpreter to interpreter process, the value can appear random. This means that setting `PYTHONHASHSEED` **will not affect the hash value of `None`**, like it would for types to which the seed does apply. – Martijn Pieters Dec 07 '17 at 10:28
  • 1
    So I guess I have to modify my first comment to include type objects (custom classes and built-in) and `None`. – Martijn Pieters Dec 07 '17 at 10:31

1 Answers1

0

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:

def hashkey(obj, salt=0):
  """
  Create a key suitable for use in hashmaps

  :param obj: object for which to create a key
  :type: str, bytes, :py:class:`datetime.datetime`, object
  :param salt: an optional salt to add to the key value
  :type salt: int
  :return: numeric key to `obj`
  :rtype: int
  """
  if obj is None:
    return 0
  if isinstance(obj, str):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff
  elif isinstance(obj, bytes):
    return zlib.adler32(obj, salt) & 0xffffffff
  elif isinstance(obj, datetime_type):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff

With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead. 0.8 usec for bytes and 0.7 for str.

Overhead is roughly as follows:

  • 0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
  • 0.2 usec to 0.5 usec for selecting the hash function via isinstance
  • 0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec)
  • 0.15 usec for obj.encode() of str objects ("foobar")
  • 1.5 usec for str(obj).encode() of datetime.datetime objects

The most optimization comes from ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:

def hashkey_c(obj, salt=0):
  if obj.__class__ in hashkey_c.types:
    if obj is None:
      return 0
    if obj.__class__ is str:
      return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif obj.__class__ is bytes:
      return zlib.adler32(obj, salt) & 0xffffffff
    elif obj.__class__ is datetime_type:
      return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}

Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map type to hash comparable, if one uses an explicit check on the dict keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).


Some additional notes:

  • hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
  • It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')
MisterMiyagi
  • 36,972
  • 7
  • 82
  • 99
  • As of python3.6 or so this is not working. It is not stable across different interpreter executions. E.g. try hashkey(None). – Neal Young Dec 07 '17 at 06:24
  • @NealYoung Thanks for checking. It seems ``hash(None)`` is salted at least since Python3.4 - I've added this case now. I also realised that hashable containers, such as Tuple, would require special casing as well. – MisterMiyagi Dec 07 '17 at 09:48