27

Is there a hash function which has no collisions?

To clarify: it would be some function which would produce variable-length output, and never produce the same output for differing input. It would also be computationally hard to derive the input from the output.

e-sushi
  • 17,891
  • 12
  • 83
  • 229
benj
  • 371
  • 1
  • 3
  • 3
  • There are collision free one way functions. For example $y = g^x \mod p$. But I'm pretty sure that SHA-512 would be a better choice, despite the collisions. – CodesInChaos Jun 19 '13 at 05:53
  • @CodesInChaos How is this collision free? If we consider all the possible inputs of ⌊²log(p)⌋+1 bits, then by definition there will be at least one collision amongst them? – RocketNuts Feb 05 '19 at 00:00

2 Answers2

20

Your hypothetical hash function would need to have an output length at least equal to the input length to satisfy your conditions, so it wouldn't be a hash function. See the Pigeonhole principle.

Remember an n-bit hash function is a function from $\{0,1\}^∗$ to $\{0,1\}^n$, no such function can meet both of your conditions. Essentially, if it has length $n$ bits, it can only guarantee uniqueness for inputs up to $n$ bits, and even then it would not be a good PRF as it would be a permutation - which is not what hash functions are - so you would want the output size to be longer than the input size, which is now really far from the definition of a hash function.


Now, if you are willing to call it something else than a hash function, then, yes, it is possible to construct such primitives, under the assumption that the output length must be calculated in a way that if there are $m$ possible inputs for an $n$-bit output, then $m \leq 2^n$. The obvious one is a block cipher, which satisfies your conditions except that it has the additional property that all outputs have a corresponding input, which may not be what you want.

As you can see, if you don't want a permutation, you are basically left with a function which "expands" the input pseudorandomly, such that all inputs have outputs but not all outputs have inputs. For instance, CodesInChaos's example of $y = g^x \mod{p}$ is collision-free if $|X| \leq p$ where $X$ is the set of inputs to the function and is one-way for sufficiently large prime $p$ (actually, it needs to have a subgroup of large order, generally $p = 2q + 1$ is a common choice for large prime $q$), as you would need to solve the discrete logarithm problem to reverse it.

Thomas
  • 7,478
  • 1
  • 31
  • 44
  • I understand that the output would need to be longer than the input, I was thinking of a function that one would use to prove possession of data that they would release at a later date, to prove they had the data at least that far back. – benj Jun 19 '13 at 05:31
  • @improv32 Hash functions can do that in what is called a commitment scheme. Why do you require the "no collision" property? Or did you really mean, "with a negligibly low chance of collision, which then is what cryptographic hash functions are for in that while there are infinitely many collisions, it is very hard to find one either accidentally or on purpose"? – Thomas Jun 19 '13 at 05:33
  • I was just curious if there was a function which did not have negligible chance of collision, but provably has no collisions. – benj Jun 19 '13 at 05:58
  • @improv32 I understand. No, there isn't, under the standard definition of "hash function". But there are certainly one-way, collision-free functions, as CodesInChaos mentioned in a comment. – Thomas Jun 19 '13 at 06:04
  • 3
    @improv32: Just use any encryption scheme you like with a key much shorter than the data. (You can easily remove this requirement if needed.) To later prove you had the data earlier, release the key you encrypted it with. – David Schwartz Jun 21 '13 at 09:27
  • @DavidSchwartz: I think the goal is to be a "public permanently-one-way" function. Encryption wouldn't work, since if the key is 128 bits, there would be 2^128 different files you could potentially claim to have possessed, and it would be impossible to prove that you didn't find a trio of (encryptedText, key1, key2) such that the encrypted text can be decrypted in two different valid ways. – supercat Aug 26 '14 at 21:57
  • @supercat That's just not a valid argument. Impossibility is, for practical purposes, indistinguishable from sufficiently impractical. For example, it's trivial to prove that public key cryptosystems cannot possibly make decryption by an attacker impossible (by trial encryption, for example), but such schemes are entirely practical. If you insisted an alarm system make burglary impossible, you'd use no alarm system at all, and that's foolish. You just need to make it impractical. – David Schwartz Aug 27 '14 at 05:08
  • @DavidSchwartz: The advantage of a bijection (which is what the OP asked for) is that it shows that there is only one possible original file. If one isn't concerned about the possibility of multiple files yielding the same hash, there would be no need to use a bijection. The alternative isn't between using a perfect alarm system versus none, but between using a cheaper alarm system, and a much more expensive one with a weakness that makes it not much better than the cheap one, or using a better alarm system that actually works. – supercat Aug 27 '14 at 13:22
  • @supercat All you've shown is that the OP is probably making the same mistake you are, believing that it's better to make things impossible than impractical. For practical purposes, which is all that matters, there is no difference. Bijections have huge practical disadvantages and no practical advantages. – David Schwartz Aug 28 '14 at 00:57
  • @DavidSchwartz: Suppose one wishes to perform an action equivalent to calling a coin flip by writing a file that contains a "heads" or "tails" indication along with a bunch of arbitrary data of your choosing. If one hashes the data with a pre-agreed bijection, then revealing the original message will prove what it was. Otherwise, it may be difficult to prove that one hadn't found a pair of seemingly-valid messages which could yield the same ciphertext if encrypted with different keys. If one is expected to fill much of the message with random data, making a phony pair of messages... – supercat Aug 28 '14 at 01:33
  • ...may be much easier than inverting a quality bijective function. Further, for better or for worse, the bijective function allows anyone who has the original file to determine that they have it, without giving them a means of constructing the file if they don't have it. Your encryption approach doesn't accomplish that. – supercat Aug 28 '14 at 01:36
  • @supercat and David you may want to create a chat room to discuss this (long comment chains like these tend to get automatically deleted after a bit) – Thomas Aug 28 '14 at 01:49
5

Yes they are called Perfect hash functions on wiki iv also seen them being called collision free hash functions. If you follow the link at the bottom of the page there are links to articles and source code.

vbms
  • 97
  • 1
  • 2
  • 1
    Perfect hash functions are not cryptographic hash functions as their domain is finite. – Thomas Jan 05 '15 at 15:01
  • @Thomas what does it mean "a finite domain"? – cregox Sep 18 '15 at 18:23
  • 3
    A finite domain would mean that the number of possible outputs is finite. I.e. that there is some integer N so that the function has at most N different possible outputs. Which is mutually exclusive with having no collisions, because you can always generate N+1 different inputs (for example, text files containing the numbers 1, 2, 3, ... N+1) and because there are only N possible outputs, there will be at least one collision amongst them. – RocketNuts Feb 04 '19 at 23:35
  • 2
    Actually, the right term should be codomain rather than domain. Domain refers to the input space of a function, and codomain to its output space. – RocketNuts Feb 04 '19 at 23:39