
I'm trying to improve a similar-image search over pHashes stored in a MySQL database. Right now I compare pHashes by computing the Hamming distance like this:

SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4

Results of the SELECT (MyISAM engine):

  • 20,000 rows; query time < 20 ms
  • 100,000 rows; query time ~ 60 ms (this was just fine until the table reached 150,000 rows)
  • 300,000 rows; query time ~ 150 ms

So the query time grows with the number of rows in the table.


I also tried a solution found on Stack Overflow, Hamming distance on binary strings in SQL:

SELECT * FROM images WHERE 
BIT_COUNT(h1 ^ 11110011) + 
BIT_COUNT(h2 ^ 10110100) + 
BIT_COUNT(h3 ^ 11001001) + 
BIT_COUNT(h4 ^ 11010001) + 
BIT_COUNT(h5 ^ 00100011) + 
BIT_COUNT(h6 ^ 00010100) + 
BIT_COUNT(h7 ^ 00011111) + 
BIT_COUNT(h8 ^ 00001111) <= 4

300,000 rows; query time ~ 240 ms


I also changed the database engine to PostgreSQL and tried the approach from Translate this MySQL query to PyGreSQL, without success: 300,000 rows; query time ~ 18 s.
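For reference, here is a sketch of roughly how such a translation can look (an illustration, not necessarily my exact query; it assumes hash is stored as a BIGINT and counts set bits through a bit-string cast, since older PostgreSQL has no BIT_COUNT):

-- PostgreSQL sketch: XOR the stored hash with the query hash, cast to a
-- 64-bit string, and count the '1' characters. Still scans every row.
SELECT *
FROM images
WHERE length(replace(((hash # 2028359052535108275)::bit(64))::text, '0', '')) <= 4;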


Is there any way to optimize the above queries? I mean an optimization that does not depend on the number of rows.

I have a limited set of tools to solve this problem. MySQL has so far seemed like the simplest solution, but I can deploy code on any open-source database engine that works with Ruby on a dedicated machine. There is a ready-made solution for MS SQL at https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone knows how to translate it to MySQL or PostgreSQL.

Please post answers based on code or actual observations; there are already plenty of theoretical discussions of Hamming distance on stackoverflow.com.

Thanks!

mateuszdw
  • hey, I'm trying to do a similar image search just like you, but the result I get is always 0. Can you provide sample code for the hash-string search? – TomSawyer Oct 09 '13 at 08:53

2 Answers


When considering the efficiency of algorithms, computer scientists use the concept of the order, denoted O(something), where something is a function of n, the number of things being computed (in this case, rows). So we get, in increasing time:

  • O(1) - independent of the number of items
  • O(log(n)) - increases as the logarithm of the items
  • O(n) - increases in proportion to the number of items (what you have)
  • O(n^2) - increases as the square of the items
  • O(n^3) - etc
  • O(2^n) - increases exponentially
  • O(n!) - increases as the factorial of the number of items

The last two are effectively uncomputable for any reasonably large n (80+).

Only the most significant term matters, since it dominates for large n; so n^2 and 65*n^2+787*n+4656566 are both O(n^2).

Bear in mind that this is a mathematical construction, and the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).

For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4. This is an O(n) operation.
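You can check this in practice (a sketch, assuming the images table from the question): EXPLAIN on the original query will typically report a full table scan.

-- Sketch: the plan usually shows type = ALL with rows roughly equal to the
-- table size, i.e. every row is examined.
EXPLAIN SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4;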

The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.

However, because your column is wrapped in a function, an index on that column cannot be used. You have two possibilities:

  1. This is a SQL Server solution and I don't know if it is portable to MySQL: create a persisted computed column in your table with the formula BIT_COUNT(hash ^ 2028359052535108275) and put an index on it (see the sketch after this list). This will not be suitable if you need to change the bit mask.
  2. Work out a way of doing the bitwise arithmetic without using the BIT_COUNT function.
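For illustration, a minimal sketch of option 1 in MySQL terms (an assumption on my part: it relies on generated columns, available from MySQL 5.7, and only helps while the bit mask stays fixed, because the distance is precomputed per row):

-- Precompute the distance to one fixed mask and index it (MySQL 5.7+ sketch).
ALTER TABLE images
  ADD COLUMN hash_dist INT AS (BIT_COUNT(hash ^ 2028359052535108275)) STORED,
  ADD INDEX idx_hash_dist (hash_dist);

-- The search then becomes an indexed range scan instead of a full table scan.
SELECT * FROM images WHERE hash_dist <= 4;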
Dale M
  • Solution 1 cannot be used, because of course the bit mask needs to change on each request. Solution 2 is too abstract; it seems I have a solution, but I can't share it, because I want to make money :) – happy_marmoset Dec 08 '13 at 13:49
  • Writing a Postgres extension can be a solution, if you know C well. Working project: https://github.com/lalinsky/acoustid-server/blob/master/postgresql/acoustid_compare.c – mateuszdw Feb 28 '14 at 11:08
  • FWIW, you *can* use a tree structure to speed this sort of query up. You use a [BK-tree](http://en.wikipedia.org/wiki/BK-tree), which gives you O(log(n)) time (albeit with the distance affecting the value of n pretty significantly). In any event, you can reduce a full table scan to < 10% table scan [for edit distances of <= 2, in many cases](http://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/). – Fake Name Oct 09 '14 at 05:08

This solution made things a bit faster for me. It builds a derived table for each hash comparison and passes along only the rows that are still within the Hamming distance, so BIT_COUNT is never computed for a pHash that has already exceeded the limit. It returns all matches in about 2.25 seconds on 2.6 million records.

It's InnoDB, and I have very few indexes.

If somebody can make it faster, I'd appreciate it.

SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3 
FROM ( 
    SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2 
    FROM ( 
        SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1 
        FROM ( 
            SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) as BC0 
            FROM files 
            WHERE  BIT_COUNT(pHash0 ^ 1234567) <= 3 
        ) AS BCQ0 
        WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3 
    ) AS BCQ1 
    WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3 
) AS BCQ2 
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3

This is the equivalent query, but without the derived tables. Its return time is almost 3 times as long.

SELECT `Key`, pHash0, pHash1, pHash2, pHash3 
FROM Files 
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <=3

Keep in mind that the lower the Hamming threshold in the first query, the faster it will run.

alfadog67
  • I can't take credit for this but thought I'd point you to the answer here: http://stackoverflow.com/questions/35065675/how-do-i-speed-up-this-bit-count-query-for-hamming-distance – Brian F Leighty Apr 15 '16 at 15:04
  • Thanks, @Brian-F-Leighty, it's actually my own question you pointed to. And yes, the answer has shaved millennia off of my queries. – alfadog67 Apr 15 '16 at 17:50
  • Sorry, I should have looked to see if you were on the other question. I just know I'm planning to use the same approach and thought I'd share. Good to know it's working well for you. – Brian F Leighty Apr 15 '16 at 18:12
  • No apologies necessary, and my thanks are in order! I have come a long way with this project, so please feel free to use me as a resource. – alfadog67 Apr 15 '16 at 20:06