Randomly sample bounds from many multi-dimensional points

Question

Let $\mathbb{M}$ be an $m\times n$ matrix, with $m < n$. My data consists of a large list of points $\mathbf{x}_i\in\mathbb{R}^n$, $i=1,...,N$, where each point satisfies $\mathbb{M}\cdot\mathbf{x}_i = 0$ and $\mathbf{A} < \mathbf{x} < \mathbf{B}$, where vector inequality denotes all component-wise inequality, and $\mathbf{A},\mathbf{B}$ are some vector bounds.

For each vector $\mathbf{b}\in\mathbb{R}^n$, let $\mu(\mathbf{b})$ denote the number of points $\mathbf{x}_i$ that satisfy $\mathbf{x}_i \leq \mathbf{b}$. Obviously $\mu(\mathbf{A}) = 0$ and $\mu(\mathbf{B}) = N$.

Naively taking a random vector $\mathbf{b}$ inside $\mathbf{A} < \mathbf{b} < \mathbf{B}$ will surely yield very low values of $\mu(\mathbf{b})$. Similarly, $\mu(\mathbf{x}_i) = 1$ for most points $\mathbf{x}_i$ I've tried.

So here's my problem. I need a way to randomly sample vectors $\mathbf{b}$ such that all values of $\mu$ are more or less equally represented.

I wasn't sure how to tag (or title) this question. Feel free to suggest edits. — a06e, Jun 16 '14 at 17:51
There are trivial solutions. E.g., for $x=(x_i), y=(y_i)$ in $\mathbb{R}^n$ define $$(x\vee y)i=\max(x_i,y_i)$$ and consider the set of vectors $$B={b_A=x{1_1}\vee x_{i_2}\vee \cdots\vee x_{i_m}}$$ as $A={i_1,i_2,\ldots,i_m}$ ranges over all nonempty subsets of ${1,2,\ldots,N}.$ Sampling from $B$ seems to satisfy all your requirements, because the cardinality of $b_A$ will be at least $m$ and (generically) all allowable nonzero cardinalities will appear. One would think that this set of possibilities is too limited. Assuming that's the case, please clarify your requirements. — whuber, Jun 16 '14 at 19:10
@whuber I have tried this method. The values in the middle of the range of $\mu$ are underrepresented. For example, if $m$ is 1, $\mu=1$ most of the time. But if $m=2$, $\mu$ can jump to 10, rarely touching the values in between. I have added an edit to the question to make it clear that the values in the range of $\mu$ should be more or less equally represented. — a06e, Jun 16 '14 at 19:14
So you're saying you don't care about what the values of $b$ are--it's ok if they are limited to at most $2^N-1$ discrete vectors--provided that $\mu(b)$ has a uniform distribution on ${1,2,\ldots,N}$? — whuber, Jun 16 '14 at 19:19
@whuber Yes, that's what I am saying. The distribution of $\mu(\mathbf{b})$ doesn't have to be strictly uniform. I just want all values in its range to be represented, with no values significantly less represented than other values. — a06e, Jun 16 '14 at 21:23

whuber · Accepted Answer · 2014-06-24T14:33:39.150

There is a simple way: by picking a coordinate index and sorting the $x_i$ based on their component in this index, we can easily find any specified number of the $x_i$ that are determined by an inequality of the form $x_i \le b$.

As a preliminary calculation, sort the $x_i$ on each coordinate $j$, writing $x_{[i]}^{j}$ for the $i^\text{th}$ smallest element of the data $(x_1^j, x_2^j, \ldots, x_N^j)$, $1\le j\le n$. (Because the question uses subscripts to index vectors, I will use superscripts to index their coordinates.) For convenience of notation, write $x_{[0]}^j = A^j$ and $x_{[N+1]}^j=B^j$ for all $j$, whence for each $j,$ $1\le j \le n,$

$$A^j = x_{[0]}^j \lt x_{[1]}^j \le \cdots \le x_{[i]}^j \le \cdots \le x_{[N]}^j \lt x_{[N+1]}^j = B^j.$$

Select $i \in \{0,1,\ldots,N\}$ uniformly at random and choose $j \in \{1,2,\ldots, n\}$ in any manner you like. This determines a set $E_i^j$ of vectors $b$ with

$$b^j \in [x_{[i]}^{j}, x_{[i+1]}^j) \text{ and } b^{j^\prime} \in [x_{[N]}^{j^\prime}, B^{j^{\prime}}) \text{ for } j^\prime \ne j.$$

Let $y$ be an arbitrary vector. Since $y \le b$ if and only if $y^k \le b^k$ for all $1\le k \le n$, this--by the construction of $b$--means $y^j \lt x_{[i+1]}^j$ and $y^{j^\prime} \le x_{[N]}^{j^\prime}$. Provided there are no ties for $x_{[i+1]}^j$, the former inequality has exactly $i$ solutions (by definition) and the assumptions about $B$ show that the latter set of inequalities is always satisfied. Therefore $\mu(b) = i$. Select $b\in E_i^j$ in any manner you like. Since $i$ has a uniform distribution, so does $\mu(b)$.

Two coordinates of $N=10$ vectors are shown as blue dots. The rectangle delimited by $A$ and $B$ is shown as a pale unit square. To include exactly $i=5$ vectors, pick either the fifth smallest of the first coordinates and the largest of all other coordinates (shown at the left for $j=1$) or the fifth smallest of the second coordinates and the largest of all other coordinates (shown at the right for $j=2$). A value of $b$ is shown as a red dot and the region of points it selects is depicted as a darker shaded rectangle. In this construction it does not matter whether all vectors are confined to a linear subspace, provided only that the coordinates vary over the subspace.

The only condition required for this to work was that there exist at least one coordinate $j$ for which there are no ties among the $(x_i^j)$. This method will still produce an approximately uniform distribution of $\mu(b)$ provided there is at least one coordinate where the sizes of the tied groups are all very small compared to $N$.

Sorry, "$x$" has become seriously overloaded, hasn't it? I'll see if I can fix it without introducing any errors :-). — whuber, Jun 23 '14 at 13:49
I am writing the data matrix as $(x_i^j)$. All such values are real numbers. The first index $i$ is for the vector and the second $j$ is for the coordinate. — whuber, Jun 23 '14 at 15:03
I defined that term in the first line of the answer: it's a number equal to the $i^\text{th}$ smallest among all $N$ of the $j^\text{th}$ coordinates. I hope the illustration I added will help clarify any issues related to this notation. — whuber, Jun 24 '14 at 14:04
Good point--I neglected to indicate their corresponding components. I'll fix that. I'll also make the use of superscripts for coordinates more consistent throughout. — whuber, Jun 24 '14 at 14:25
+1 Thanks. I finally got the idea. Sorry about the notation issues... Deleting comments... — a06e, Jun 24 '14 at 14:45
The notation issues were entirely my fault; I appreciate your patience with them. — whuber, Jun 24 '14 at 15:24

Randomly sample bounds from many multi-dimensional points

1 Answers1