
I have an image, master.png, and more than 10,000 other images (slave_1.png, slave_2.png, ...). They all have:

  • The same dimensions (e.g. 100x50 pixels)
  • The same format (png)
  • The same image background

98% of the slaves are identical to the master, but 2% of the slaves have slightly different content:

  • New colors appear
  • New small shapes appear in the middle of the image

I need to spot those different slaves. I'm using Ruby, but I have no problem using a different technology.

I tried reading both files with File.binread and comparing them with ==. That worked for 80% of the slaves. For the rest, it reported a difference even though the images were visually identical, so it doesn't work.

Alternatives are:

  1. Count the number of colors present in each slave and compare it with the master. This would work 100% of the time, but I don't know how to do it in Ruby in a "light" way.
  2. Use an image processing library such as RMagick or ruby-vips8 to compare histograms. This should also work, but I need to use as little CPU and memory as possible.
  3. Write a C++/Go/Crystal program that reads each image pixel by pixel and returns the number of colors. I think this would give the best performance, but it is certainly the hard way.

Any enlightenment? Suggestions?

sawa
fschuindt
  • Look into [this question](http://stackoverflow.com/questions/4196453/simple-and-fast-method-to-compare-images-for-similarity). Many options have been discussed there. – Uzbekjon Apr 14 '16 at 17:18
  • Another note about comparing with `File.binread`. Since you are simply comparing file contents, and resources and performance are important, it'd be better to simply use bash to do that. Look into: `diff`, `cmp` or `md5`. – Uzbekjon Apr 14 '16 at 17:50
  • Could be a job for [Tensor Flow](https://www.tensorflow.org) if you need a classifier. – tadman Apr 14 '16 at 19:08
  • Do you really mean you don't want to use much CPU when you say you want to do it in a light way? Or do you mean you want the answer fast - which may mean using all the CPU for a time? – Mark Setchell Apr 17 '16 at 20:53
  • @MarkSetchell By "light" I mean using as little CPU/RAM as possible. – fschuindt Apr 18 '16 at 13:47
  • How about showing a master and a couple of slaves plus a *different* slave? – Mark Setchell Apr 22 '16 at 11:37
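Following up on the `cmp`/`md5` suggestion in the comments: a byte-level comparison cannot prove that two PNGs differ, because the same pixels can be encoded into different bytes (different compression settings or metadata, for example). A byte-for-byte match does prove they are identical, though, so it can serve as a cheap first pass that leaves only the remaining files for a pixel-level check. The sketch below assumes that workflow; the function names `file_digest` and `needs_pixel_check` are made up for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import sys


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_pixel_check(master_path, candidate_paths):
    """Return the candidates whose bytes differ from the master.

    These are not necessarily visually different; they are only the
    files that a byte comparison cannot clear.
    """
    master = file_digest(master_path)
    return [path for path in candidate_paths if file_digest(path) != master]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python prefilter.py master.png slave_*.png
    for path in needs_pixel_check(sys.argv[1], sys.argv[2:]):
        print(path)
```

Going by the numbers in the question, this would clear roughly 80% of the slaves without decoding a single image; the printed paths are the ones that still need a histogram or color comparison.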

1 Answer


In ruby-vips, you could do it like this:

require 'vips'

# find normalised histogram of reference image
ref = VIPS::Image.new ARGV[0], :sequential => true
ref_hist = ref.hist.histnorm

# trigger a GC every few loops to keep memuse down
loop = 0

ARGV[1..-1].each do |filename|
    # find sample hist
    sample = VIPS::Image.new filename, :sequential => true
    sample_hist = sample.hist.histnorm

    # calculate sum of squares of differences, if it's over a threshold, print
    # the filename
    diff_hist = ref_hist.subtract(sample_hist).pow(2)
    diff = diff_hist.avg * diff_hist.x_size * diff_hist.y_size

    if diff > 100
        puts "#{filename}, #{diff}"
    end

    loop += 1
    if loop % 100 == 0
        GC.start
    end
end

The occasional GC.start is necessary to make Ruby free things and prevent memory from filling up. Even though it only runs once every 100 images, the script still spends a lot of time in garbage collection, sadly.

$ vips crop ~/pics/k2.jpg ref.png 0 0 100 50
$ for i in {1..10000}; do cp ref.png $i.png; done
$ time ../similarity.rb ref.png *.png
real    2m44.294s
user    7m30.696s
sys     0m20.780s
peak mem 270mb

If you're willing to consider Python, it's a lot quicker, since it does reference counting and doesn't need to scan all the time.

import sys
from gi.repository import Vips

# find normalised histogram of reference image
ref = Vips.Image.new_from_file(sys.argv[1], access = Vips.Access.SEQUENTIAL)
ref_hist = ref.hist_find().hist_norm()

for filename in sys.argv[2:]:
    # find sample hist
    sample = Vips.Image.new_from_file(filename, access = Vips.Access.SEQUENTIAL)
    sample_hist = sample.hist_find().hist_norm()

    # calculate sum of squares of difference, if it's over a threshold, print
    # the filename
    diff_hist = (ref_hist - sample_hist) ** 2
    diff = diff_hist.avg() * diff_hist.width * diff_hist.height

    if diff > 100:
        # formatted so the line works under both Python 2 and Python 3
        print("%s, %s" % (filename, diff))

I see:

$ time ../similarity.py ref.png *.png
real    1m4.001s
user    1m3.508s
sys     0m10.060s
peak mem 58mb
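To make the metric itself easier to see, here is a minimal pure-Python sketch of the comparison both scripts perform: build a histogram per image, normalise it, and sum the squared differences. It is a simplification, not what libvips does internally: it assumes single-band 8-bit pixel values in a flat list, and it assumes normalisation means scaling the fullest bin to the maximum index (255). The function names and the sample data are hypothetical.

```python
def normalised_histogram(pixels, bins=256):
    """Count 8-bit values, then scale so the fullest bin equals bins - 1."""
    hist = [0] * bins
    for value in pixels:
        hist[value] += 1
    peak = max(hist)
    if peak == 0:
        # empty input: nothing to scale
        return [0.0] * bins
    scale = (bins - 1) / float(peak)
    return [count * scale for count in hist]


def histogram_distance(ref_pixels, sample_pixels):
    """Sum of squared differences between the two normalised histograms."""
    ref_hist = normalised_histogram(ref_pixels)
    sample_hist = normalised_histogram(sample_pixels)
    return sum((r - s) ** 2 for r, s in zip(ref_hist, sample_hist))


# Hypothetical 100x50 single-band images: a flat background, and a copy
# in which 25 pixels take a value that does not occur in the reference.
background = [200] * (100 * 50)
changed = list(background)
for index in range(25):
    changed[index] = 10

print(histogram_distance(background, background))
print(histogram_distance(background, changed))
```

Identical images give a distance of exactly zero, and any new pixel value gives a positive one. How large the distance gets depends on how many pixels changed, so it is worth printing it for a few known-different slaves before settling on a threshold.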
jcupitt