I trying to improve search similar images pHashed in MySQL database. Right now I comparing pHash counting hamming distance like this:
SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4
Results for selecting (engine MyISAM)
- 20000 rows ; query time < 20ms
- 100000 rows ; query time ~ 60ms # this was just fine, until its reached 150000 rows
- 300000 rows ; query time ~ 150ms
So query time encrease depends of the number of rows in table.
I also try solutions found on stackoverflow Hamming distance on binary strings in SQL
SELECT * FROM images WHERE
BIT_COUNT(h1 ^ 11110011) +
BIT_COUNT(h2 ^ 10110100) +
BIT_COUNT(h3 ^ 11001001) +
BIT_COUNT(h4 ^ 11010001) +
BIT_COUNT(h5 ^ 00100011) +
BIT_COUNT(h6 ^ 00010100) +
BIT_COUNT(h7 ^ 00011111) +
BIT_COUNT(h8 ^ 00001111) <= 4
rows 300000 ; query time ~ 240ms
I changed database engine to PostgreSQL. Translate this MySQL query to PyGreSQL Without success. rows 300000 ; query time ~ 18s
Is there any solution to optimize above queries? I mean optimization not depended of the number of rows.
I have limited ways (tools) to solve this problem. MySQL so far seemed to be the simplest solution but I can deploy code on every open source database engine that will work with Ruby on dedicated machine. There is some ready solutions for MsSQL https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone know how to translate it for MySQL or PostgreSQL.
Please, post answers based on some code or observations. We have a lot of theoretical issues about hamming distance on stackoverflow.com
Thanks!