
I am working on the curse of dimensionality (theoretical models) and would like to test an algorithm on it. What I haven't found is a standard dataset that could be used to compare different approaches.

Is there such a thing? Or... related: what is the most well-known failure due to the curse of dimensionality?


1 Answer

Zimek, Schubert, and Kriegel have a good paper ("A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data", Statistical Analysis and Data Mining, 2012) on outlier detection in high-dimensional data. They use a simple method of generating sample data to illustrate the problem:

[..] we plot some characteristics of the distribution of vector length values for vectors uniformly distributed in the unit cube over increasing dimensionality d. The observed length of each vector is normalized by 1/√d. Plotted are mean and standard deviation of the observed lengths as well as the actually observed minimal (maximal) length in each dimensionality. While the observed distances now have a comparable average value, there is still a loss in relative contrast. On this dataset of a sample of 10^5 instances, the curse (more explicitly, the effect of distance concentration) is now clearly visible: the standard deviation disappears already at 10 dimensions, and the observed actual minimum and maximum shrink rapidly with increasing dimensionality.
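
If you want to reproduce that illustration, a minimal sketch along these lines (my own, not the authors' code) could look like the following; the list of dimensionalities and the reduced sample size are illustrative choices:

    import numpy as np

    # Sketch of the experiment quoted above: draw points uniformly from the
    # unit cube, normalize the vector lengths by 1/sqrt(d), and watch the
    # spread of the lengths collapse as the dimensionality d grows.
    rng = np.random.default_rng(0)
    n = 10_000  # the paper's illustration uses 10^5 points; fewer keeps this light

    for d in [1, 2, 5, 10, 50, 100, 500]:
        x = rng.uniform(size=(n, d))                      # n points in [0,1]^d
        lengths = np.linalg.norm(x, axis=1) / np.sqrt(d)  # normalized lengths
        print(f"d={d:4d}  mean={lengths.mean():.4f}  std={lengths.std():.4f}  "
              f"min={lengths.min():.4f}  max={lengths.max():.4f}")

The mean stabilizes near 1/sqrt(3) ≈ 0.577 while the standard deviation and the min-max spread shrink, which is exactly the distance concentration the quote describes.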

To study their actual problem, outlier detection, they generate two series of values: the first is uniformly distributed on the interval [0,1], while the second, which introduces the outliers they wish to study, is drawn from a normal distribution.
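
For instance, a hedged sketch of that kind of synthetic benchmark (the dimensionality, sample counts, and the mean/scale of the outlier-generating Gaussian below are my own illustrative choices, not the paper's exact parameters) might be:

    import numpy as np

    # Inliers uniform on [0,1]^d; outliers drawn from a Gaussian so that they
    # deviate from the bulk of the data. All parameters here are hypothetical.
    rng = np.random.default_rng(0)
    d = 100
    n_inliers, n_outliers = 1_000, 10

    inliers = rng.uniform(0.0, 1.0, size=(n_inliers, d))
    outliers = rng.normal(loc=0.5, scale=0.5, size=(n_outliers, d))

    data = np.vstack([inliers, outliers])
    labels = np.r_[np.zeros(n_inliers, dtype=int),
                   np.ones(n_outliers, dtype=int)]  # 1 marks an outlier

An outlier detector can then be scored against labels (e.g. by ranking points and computing ROC AUC) while d is varied, to see how the method degrades with dimensionality.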

Maybe one of these approaches can help you.
