
I have a numpy array consisting of 20,000 RGB images of 220x220 pixels each. The array, X_data, therefore has the shape (20000, 220, 220, 3).

I'm looking for the fastest way to find the minimum and maximum pixel values across the entire dataset. I appreciate this type of task will take time because I'm searching through approximately 3 billion pixel values, but I'm hoping to improve on the solutions I have found already, which are the following:

Option 1: Flatten the array

Use ndarray.flatten() and then np.min and np.max on the resulting array:

flat = X_data.flatten()
np.min(flat)
np.max(flat)

This method took a total of 13min 11s (wall time) to find the min and max values.

Option 2: List comprehension

Use np.amin and np.amax to find the min and max for each image, append them to a list, and then find the min and max of that list:

min_val = np.min([np.amin(X_data[i]) for i in np.arange(X_data.shape[0])])
max_val = np.max([np.amax(X_data[i]) for i in np.arange(X_data.shape[0])])

This method took a total of 8 minutes (wall time).

Are there any faster methods for completing this task?

EDIT:

I forgot to mention in the original formulation of the question that I would like this to work on image datasets that have not been rescaled, i.e. those that contain images of varying sizes. This means that calling np.min and np.max on a single array will not work, even though it is faster than the above options.
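To sketch what the varying-size case could look like (the example image sizes and variable names here are made up for illustration): ragged data can't live in one numpy array, so one option is a plain Python list of arrays with the extrema tracked in a single pass:

```python
import numpy as np

# Hypothetical ragged dataset: images of different sizes cannot share
# one numpy array, so keep them in a Python list.
images = [np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)
          for h, w in [(220, 220), (180, 240), (300, 200)]]

# Track running extrema in one pass over the list.
min_val, max_val = np.inf, -np.inf
for img in images:
    min_val = min(min_val, img.min())
    max_val = max(max_val, img.max())

print(min_val, max_val)
```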

Many thanks!

celery_gemini
  • Do you have a decent multi-core CPU? What OS do you run? – Mark Setchell Oct 23 '20 at 16:30
  • I'm just using an off the shelf MacBook Air 2020 running Catalina with 1.6 GHz Dual-Core Intel Core i5. – celery_gemini Oct 23 '20 at 16:47
  • So do you mean that your opening statement is incorrect... that *"the images are 220x220 each"*? – Mark Setchell Oct 23 '20 at 16:53
  • Well, the images are 220x220 in the test case described, but I'm hoping for a general solution that works when this is not so. – celery_gemini Oct 23 '20 at 16:57
  • What does "image datasets that have not been rescaled, ie those that contain images of varying sizes" have to do with it? Why does that exclude `np.min` and `np.max`? – Cris Luengo Oct 23 '20 at 19:29
  • I guess one could look up the sources of np.min/np.max and save another factor of 2 by calculating min and max at the same time. Just write yourself a np.minmax function. – Trilarion Oct 23 '20 at 19:36

3 Answers


The first approach is slower because of the flatten, which generates a copy of your data. Your data is large, so a large portion of the runtime will be spent on memory allocation and copying.

On the other hand, np.min by default will check along all axes, e.g.:

np.min([[1,2],[3,4]])
# out: 1

So you can just do:

min_val, max_val = np.min(X_data), np.max(X_data)

This should be faster.

Quang Hoang
  • Thanks for the reply! `np.min` is indeed faster, but the problem is that it only works if all the images are the same size. I'm looking for a general solution that works across arrays that include different image sizes, but I didn't make this clear in my question so I'll make an edit to that effect. – celery_gemini Oct 23 '20 at 16:45
  • @celery_gemini in which case, only the second option works. Note that numpy operations only work best on homogeneous data, which isn’t the case here as you mentioned. – Quang Hoang Oct 23 '20 at 16:51

If maximum speed is important, it might be worthwhile pulling in a library that can compute the maximum and minimum values at the same time. Such a routine only needs to traverse the large array once instead of twice. Since getting these values is completely memory-limited on any modern platform (the computation itself is minimal), halving the memory accesses is likely to halve the execution time. Basically, you get both values for the price of one.
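The single-traversal idea can be approximated even in plain NumPy (the helper name `minmax` and the chunk size below are my own, not from any library): reduce each chunk for both extrema while it is still cache-hot, so main memory is only read once:

```python
import numpy as np

def minmax(a, chunk=1 << 20):
    """Sketch of a one-pass min/max: reduce each chunk for both
    extrema while it is still in cache, so RAM is traversed once."""
    flat = a.reshape(-1)  # a view, not a copy, for contiguous arrays
    lo, hi = np.inf, -np.inf
    for start in range(0, flat.size, chunk):
        block = flat[start:start + chunk]
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return lo, hi

X = np.random.randn(50, 220, 220, 3)
print(minmax(X) == (X.min(), X.max()))  # both approaches agree
```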

This answer links a Fortran implementation into Python, and likewise found the NumPy functions excessively slow. But that's an 8-year-old answer; things have likely improved in the NumPy implementation since then.

DIPlib is an image processing library (I'm an author), and since I have it already installed it was easy for me to try out its function that gets both the max and min value of an array in one pass. I'm using a smaller array than in the OP, since I don't want to wait that long for the experiment to finish. Here's the code:

import numpy as np
import diplib as dip
import time

X = np.random.randn(2000, 220, 220, 3)

t = time.time()
np.min(X), np.max(X)
print(time.time() - t)

t = time.time()
dip.MaximumAndMinimum(X)
print(time.time() - t)

Median over three runs (without changing the X array) gives 0.361s for the first section, and 0.135s for the second section.

Cris Luengo

Calling np.max() and then np.min() traverses the data twice. When both extrema are needed, np.percentile can return them in a single call...

I prefer simply using this:

the_min, the_max = np.percentile(my_data, [0, 100])

which asks for both values at once. (Internally np.percentile still sorts or partitions the data, which is more work than a plain reduction, so it's worth benchmarking against separate min/max calls.)
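A quick sanity check (on a small random array of my own) that the 0th and 100th percentiles coincide exactly with the extrema:

```python
import numpy as np

# The 0th and 100th percentiles are exactly the sorted endpoints,
# so they match min and max with no interpolation error.
data = np.random.randn(1000)
lo, hi = np.percentile(data, [0, 100])
print((lo, hi) == (data.min(), data.max()))  # both methods agree
```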

Ananth
Lucien