
I had the following question relating to "Random Search Methods" in Optimization and Machine Learning. In short: are there theoretical results that formalize the intuitive idea that "Random Search Methods" are slow and inefficient, and explain why they are not favored for estimating the parameters of statistical models compared to algorithms like Gradient Descent?

Suppose you have a function "f": f(x, y, z), and you are trying to optimize "f".

To illustrate this question, consider the following. For argument's sake, let's say that we want to use "random search" to randomly query "f" at different points.

A) My (naive) understanding of "random search" is as follows: we randomly query "f" at f(x = a, y = b, z = c) and record the value of "f". We repeat this process 1000 times and keep the combination of (x, y, z) that results in the smallest value of "f". (A code sketch of this procedure is given after (B) below.)

B) However, it seems that there is an alternate way to do this: https://en.wikipedia.org/wiki/N-sphere#Generating_random_points . If I understand this correctly, Marsaglia showed an alternate way to generate random points. If the function "f" is in 3 dimensions, generate 3 random numbers between 0 and 1, take the square root of the sum of squares of these 3 numbers (call this "r"), and multiply the vector of these 3 random numbers by "r". This vector is the first point at which you evaluate "f"; now repeat this 1000 times and choose the combination that results in the smallest value of "f". (A second sketch below follows the Wikipedia description.) Note: apparently this method scales poorly when "f" is in many dimensions.
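
To make (A) concrete, here is a minimal sketch of pure random search in Python. The objective "f", the sampling box [-5, 5]^3, and the 1000-query budget are assumptions for illustration, since the question does not say how the points (a, b, c) are chosen.

```python
import random

def f(x, y, z):
    # Illustrative objective only; the question leaves "f" abstract (assumption).
    return (x - 1) ** 2 + (y + 2) ** 2 + z ** 2

def random_search(f, bounds, n_iter=1000, seed=0):
    """Pure random search as in (A): sample points uniformly inside a box
    and keep the point with the smallest value of f."""
    rng = random.Random(seed)
    best_point, best_value = None, float("inf")
    for _ in range(n_iter):
        point = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        value = f(*point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

# The box [-5, 5]^3 and the 1000-query budget are assumptions; the question
# does not specify how a, b, c are drawn.
print(random_search(f, bounds=[(-5.0, 5.0)] * 3, n_iter=1000))
```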
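
And here is a hedged sketch of (B). Note that, per the linked Wikipedia article, Marsaglia-style sphere point picking draws standard normal components and divides them by their norm "r", which produces points uniformly distributed on the surface of a sphere; only points on that sphere are ever evaluated. The objective, sphere radius, and iteration budget below are assumptions for illustration.

```python
import math
import random

def random_point_on_sphere(dim, rng):
    """Sphere point picking per the linked Wikipedia article: draw standard
    normal components and divide by their Euclidean norm, giving a point
    uniformly distributed on the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        r = math.sqrt(sum(x * x for x in v))
        if r > 0:  # guard against an all-zero draw
            return [x / r for x in v]

def sphere_random_search(f, dim=3, n_iter=1000, radius=1.0, seed=0):
    """Random search restricted to points on a sphere of the given radius
    (the radius and iteration budget are assumptions for illustration)."""
    rng = random.Random(seed)
    best_point, best_value = None, float("inf")
    for _ in range(n_iter):
        point = [radius * x for x in random_point_on_sphere(dim, rng)]
        value = f(*point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

def f(x, y, z):
    # Same illustrative objective as in the sketch of (A) (assumption).
    return (x - 1) ** 2 + (y + 2) ** 2 + z ** 2

print(sphere_random_search(f, dim=3, n_iter=1000, radius=5.0))
```

Writing it out this way also makes clear that (B) only ever searches the surface of a sphere, not a volume, so it is not a drop-in replacement for (A) unless the optimum is known to lie on that sphere.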

Question 1: I am a bit confused - when we talk about "random search", are we referring to (A) or to (B)?

Question 2: I have always been unsure whether there exist any theoretical results about the convergence of random search methods. Are there mathematical theorems that state the obvious about "random search" - that if your dataset has too many rows and too many columns, statistically speaking, random search will become highly ineffective at optimizing a loss function?

Are there mathematical results that statistically link the number of iterations required, for a given number of rows/columns, to a certain error bound on a function (e.g. a loss function) of a certain complexity? Are there results that suggest "random search" will converge after a given number of iterations?

Thus, how can we answer the obvious question: why do modern statistical models not use "Random Search Methods" for optimizing their loss functions and estimating their parameters?

Thanks!


stats_noob
  • (i) You didn't specify how a, b and c were chosen in (A), so there's no basis to compare with (B). (ii) You did not correctly reproduce how (B) works; please spend a little more time and care with your reading/thinking before spamming a large number of questions. By doing so you'll save asking trivial questions you could easily resolve on your own (you know enough to resolve some of these, with only a little thought/search/research; the latter two are usually required, per the help). (iii) Why would you regard generating points on an N-sphere as being a general algorithm for optimization ... – Glen_b Jan 13 '22 at 05:00
  • ... unless you wanted to optimize the function only on that sphere? ... 2. Why do you assume that random search is not used in statistics? (Typically it isn't, because statistical problems generally have exploitable structure, but "rarely used" is not identical to "not used".) ... 3. Please quote relevant parts of your references with enough context so we understand where your premises are coming from. – Glen_b Jan 13 '22 at 05:00