9

I just realized that implementing the following algorithm to compute the hash code for a stream is not possible using Stream.reduce(...). The problem is that the initial seed for the hash code is 1 which is not an identity for the accumulator.

The algorithm for List.hashCode() :

int hashCode = 1;
for (E e : list)
  hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());

You might be tempted to think that the following is correct but it isn't, although it will work if the stream processing is not split up.

List<Object> list = Arrays.asList(1,null, new Object(),4,5,6);
int hashCode = list.stream().map(Objects::hashCode).reduce(1, (a, b) -> 31 * a + b);

It seems that the only sensible way of doing it would be to get the Iterator of the Stream and do normal sequential processing or collect it to a List first.

Roland
  • 7,028
  • 12
  • 58
  • 109
  • 4
    The question is why you'd want to calculate the `hash code for a stream`. What are you going to do with it? – Eran Sep 08 '16 at 08:18
  • 2
    Sounds like an [XY problem](http://xyproblem.info/). Doesn't the *complex datastructure* implements `hashCode()` ? If not, aren't you allowed to implement it ? Do you really need to compute this hash in parallel on the stream so that the `reduce` isn't relevant ? Maybe as a last resort, have you considered `stream.collect(Collectors.toList()).hashCode()` ? – Spotted Sep 08 '16 at 09:15
  • 1
    So if I understand correctly, you want to know if 2 objects are equals based on only a part of their state that has been extracted/transformed by some `Functions` ? – Spotted Sep 08 '16 at 09:45
  • 1
    In the light of this new problematic, can you also update your question so that more people can more easily help you ? – Spotted Sep 08 '16 at 11:28
  • 4
    @Eran: because we can. Well, at least I enjoyed solving the problem. ;^) And, who knows, perhaps at the future, someone really has an extremely large list to hash in parallel… – Holger Sep 08 '16 at 17:09

4 Answers4

13

While, at the first glance, the hash code algorithm seems to be non-parallelizable due to its non-associativity, it is possible, if we transform the function:

((a * 31 + b) * 31 + c ) * 31 + d

to

a * 31 * 31 * 31 + b * 31 * 31 + c * 31 + d

which basically is

a * 31³ + b * 31² + c * 31¹ + d * 31⁰

or for an arbitrary List of size n:

1 * 31ⁿ + e₀ * 31ⁿ⁻¹ + e₁ * 31ⁿ⁻² + e₂ * 31ⁿ⁻³ +  …  + eₙ₋₃ * 31² + eₙ₋₂ * 31¹ + eₙ₋₁ * 31⁰

with the first 1 being the initial value of the original algorithm and eₓ being the hash code of the list element at index x. While the summands are evaluation order independent now, there’s obviously a dependency to the element’s position, which we can solve by streaming over the indices in the first place, which works for random access lists and arrays, or solve generally, with a collector which tracks the number of encountered objects. The collector can resort to the repeated multiplications for the accumulation and has to resort to the power function only for combining results:

static <T> Collector<T,?,Integer> hashing() {
    return Collector.of(() -> new int[2],
        (a,o)    -> { a[0]=a[0]*31+Objects.hashCode(o); a[1]++; },
        (a1, a2) -> { a1[0]=a1[0]*iPow(31,a2[1])+a2[0]; a1[1]+=a2[1]; return a1; },
        a -> iPow(31,a[1])+a[0]);
}
// derived from http://stackoverflow.com/questions/101439
private static int iPow(int base, int exp) {
    int result = 1;
    for(; exp>0; exp >>= 1, base *= base)
        if((exp & 1)!=0) result *= base;
    return result;
}

 

List<Object> list = Arrays.asList(1,null, new Object(),4,5,6);
int expected = list.hashCode();

int hashCode = list.stream().collect(hashing());
if(hashCode != expected)
    throw new AssertionError();

// works in parallel
hashCode = list.parallelStream().collect(hashing());
if(hashCode != expected)
    throw new AssertionError();

// a method avoiding auto-boxing is more complicated:
int[] result=list.parallelStream().mapToInt(Objects::hashCode)
    .collect(() -> new int[2],
    (a,o)    -> { a[0]=a[0]*31+Objects.hashCode(o); a[1]++; },
    (a1, a2) -> { a1[0]=a1[0]*iPow(31,a2[1])+a2[0]; a1[1]+=a2[1]; });
hashCode = iPow(31,result[1])+result[0];

if(hashCode != expected)
    throw new AssertionError();

// random access lists allow a better solution:
hashCode = IntStream.range(0, list.size()).parallel()
    .map(ix -> Objects.hashCode(list.get(ix))*iPow(31, list.size()-ix-1))
    .sum() + iPow(31, list.size());

if(hashCode != expected)
    throw new AssertionError();
Holger
  • 267,107
  • 35
  • 402
  • 715
  • Awesome!!! I like the trick of creating an `Array` to store information in the `Collector` – Roland Sep 09 '16 at 07:20
  • 2
    @Roland: the array is actually just a work-around for the absence of a `pair` or `tuple` type. It’s the general concept of collectors, to have a mutable container type. Even the built-in collectors use single-element arrays whens they need a mutable `int` or `long` container. – Holger Sep 09 '16 at 09:20
  • 1
    @Roland: these are unicode characters, easy to type if you have the right keyboard ;^), e.g. I use [NEO2](https://en.wikipedia.org/wiki/Keyboard_layout#Neo). If you don’t have such a keyboard (layout), you can simply copy & paste, e.g. from [here](https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block) – Holger Sep 09 '16 at 12:37
  • 1
    Which browser/os? – Holger Sep 09 '16 at 12:45
  • 1
    Seems to be merely a font issue. So unless there’s a simple alternative (we’re not in the TeX site), I think waiting for software/fonts to catch up is the best option. – Holger Sep 09 '16 at 13:41
  • 1
    I tested Chrome, Explorer and Edge. But it’s a newer Windows version. I also tested with Android where some characters could not be displayed in the browser window but copying the text from the browser into other input fields, even the address field of the same browser, made it look correctly. So it’s obvious that the font used in the html view lacked the characters but the other fonts had them. – Holger Sep 09 '16 at 14:52
  • 1
    Firefox under Windows 10 also displays everything correctly. – Holger Sep 09 '16 at 15:04
3

As a first approach I would use the collect-to-a-list solution as long as you don't have performance concerns. That way you avoid reimplementing the wheel and if one day the hash algorithm changes you benefit from that and you are also safe if the stream is parallelized (even if I'm not sure that's a real concern).

The way I would implement it can vary depending on how and when you need to compare your different datastructures (let's call it Foo).

If you do it manually and sparsly a simple static function may be enough:

public static int computeHash(Foo origin, Collection<Function<Foo, ?>> selectors) {
    return selectors.stream()
            .map(f -> f.apply(origin))
            .collect(Collectors.toList())
            .hashCode();
}

And use it like this

if(computeHash(foo1, selectors) == computeHash(foo2, selectors)) { ... }

However, if instances of Foo are themselves stored in Collection and you need both hashCode() and equals() (from Object) to be implemented, I would wrap it inside a FooEqualable:

public final class FooEqualable {
    private final Foo origin;
    private final Collection<Function<Foo, ?>> selectors;

    public FooEqualable(Foo origin, Collection<Function<Foo, ?>> selectors) {
        this.origin = origin;
        this.selectors = selectors;
    }

    @Override
    public int hashCode() {
        return selectors.stream()
                .map(f -> f.apply(origin))
                .collect(Collectors.toList())
                .hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof FooEqualable) {
            FooEqualable that = (FooEqualable) obj;

            Object[] a1 = selectors.stream().map(f -> f.apply(this.origin)).toArray();
            Object[] a2 = selectors.stream().map(f -> f.apply(that.origin)).toArray();

            return Arrays.equals(a1, a2);
        }
        return false;
    }
}

I'm fully aware that this solution isn't optimized (performance-wise) if multiple calls to hashCode() and equals() are made but I tend not to optimize except if it becomes a concern.

Spotted
  • 3,991
  • 16
  • 33
2

Holger wrote the right solution, if you want a simple way of doing it there are two additional possibilities:

1. collect to List and call hashCode()

Stream<? extends Object> stream;
int hashCode = stream.collect(toList()).hashCode();

2. use Stream.iterator()

Stream<? extends Object> stream;
Iterator<? extends Object> iter = stream.iterator();
int hashCode = 1;
while(iter.hasNext()) {
  hashCode = 31 *hashCode + Objects.hashCode(iter.next());
}

Just as a reminder the algorithm that List.hashCode() uses:

int hashCode = 1;
for (E e : list)
  hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
Roland
  • 7,028
  • 12
  • 58
  • 109
0

The easiest and shortest way I found was to implement a Collector using Collectors.reducing:

/**
 * Creates a new Collector that collects the hash code of the elements.
 * @param <T> the type of the input elements
 * @return the hash code
 * @see Arrays#hashCode(java.lang.Object[])
 * @see AbstractList#hashCode()
 */
public static <T> Collector<T, ?, Integer> toHashCode() {
    return Collectors.reducing(1, Objects::hashCode, (i, j) -> 31 *  i + j);
}

@Test
public void testHashCode() {
    List<?> list = Arrays.asList(Math.PI, 42, "stackoverflow.com");
    int expected = list.hashCode();
    int actual = list.stream().collect(StreamUtils.toHashCode());
    assertEquals(expected, actual);
}
simon04
  • 2,708
  • 26
  • 25
  • This is incorrect because 1 is not an identity for this reduction. It will break if the processing is parallelized. – Roland Jun 11 '20 at 11:13