10

I have $N$ (~a million) feature vectors. There are $M$ (~a million) binary features, but in each vector only $K$ (~a thousand) of them would be $1$, the rest are $0$. I'm looking for the pairs of vectors that have at least $L$ (~a hundred) features in common ($1$ in both). The number of such pairs is of a similar magnitude to $N$ (~a million).

I think this could be approached as looking for close point pairs in a very high-dimensional space. The distance function could be based on how many features the two vectors have in common. But a solution based on a more conventional distance metric (such as Euclidean) would probably be useful as well.

Which well-known algorithms would be useful for approaching this problem? Anything that is quadratic in $N$ or $M$ will not be practical.


An example real-world formulation of the problem is to consider $N$ people moving between a number of locations. If two people were at the same location at the same time, we say they met each other. (The number of location-time combinations with at least 1 person present is $M$.) We are looking for friends: people who met at least $L$ times.

  • 1
    If vector 1, feature 1 is $0$, & vector 2, feature 1 is also $0$, do they have that feature "in common"? – gung - Reinstate Monica Sep 10 '15 at 21:37
  • @user777, I assume no, in which case your answer is perfect, but it would be nice for this to be explicitly stated by the OP. – gung - Reinstate Monica Sep 10 '15 at 22:09
  • @gung, you assume right. I've edited the question to clarify. Thanks! – Daniel Darabos Sep 11 '15 at 10:58
  • 1
    About how many pairs of vectors have > 100 features in common -- random sample + brute force ? Are the sizes 1M x 1M a real problem, or made up ? See also the approach in bit-string-nearest-neighbour-searching on stackoverflow. – denis Jan 25 '16 at 10:38
  • @denis: The number of matching pairs is expected to be of a similar magnitude to the number of vectors, so maybe a million. Probably less. (I've added this to the question. And yes, we can and do estimate this based on a sample.) The sizes are made up. The real problem is of similar size, but more flexible. We can have more or less features if we increase or decrease the granularity at which time and space is discretized. Thanks for the link! I'll take a look. – Daniel Darabos Jan 25 '16 at 11:11
  • 1
    A possibly crazy suggestion: view your 1 Mbit long feature vectors as images of 1000 x 1000 pixels, and look up methods for image clustering, e.g. http://stackoverflow.com/search?q=[image]+clustering . Afaik you have to find good features (not single pixels) for that to work, but I'm no expert. – denis Jan 27 '16 at 14:01
  • @DanielDarabos - what do you consider a reasonable set of data? Is there something in a public repository, or could you make some pseudocode to describe an acceptable set? – EngrStudent Feb 08 '17 at 19:01
  • Sorry, I don't have an example dataset. It would be nice to make one, and in fact that would be my first step if I were to research this in more depth today. (I can't remember whether I had an example dataset at the time I wrote this question...) – Daniel Darabos Feb 10 '17 at 02:15

7 Answers

7

I'm looking for the pairs of vectors that have at least $L$ features in common.

This is just an inner product of binary feature vectors. When the inner product is greater than $L-1$, the pair has at least $L$ elements in common. This should be a relatively fast computation -- at least, faster than Euclidean distance, which would be wasteful and slow for this data. Because you stipulate that you're looking for pairs, this inherently means you have to do $\binom{N}{2}$ comparisons to check every pair of vectors.

Finding points that are close together is indeed a clustering problem. But the first step of the clustering algorithms that I'm familiar with is computing pairwise distances or similarities. I'm sure someone has developed more efficient alternatives. A point about terminology: having at least $L$ common neighbors is phrased as a similarity, not a distance! Inner products are, in this case, unnormalized cosine similarities.

You can make this more tractable by only performing the inner product computation when the sum of a feature vector (which, for binary data, is the same as its squared norm) is greater than $L-1$: if the sum is less than $L$, that binary feature vector cannot have an inner product of $L$ or more with any other binary feature vector. Obviously, computing these sums is only $O(N)$ complexity, so it's a cheap way to drive down the magnitude of the inner product step.
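Here is a rough sketch of the row-sum pre-filter and the binary inner product with SciPy sparse matrices; the random matrix, the tiny sizes, and the threshold are made up for illustration, not the OP's data:

```python
import numpy as np
from scipy import sparse

# Tiny illustrative sizes; the real problem is ~1e6 x 1e6 with ~1e3 ones per row.
rng = np.random.default_rng(0)
N, M, K, L = 1000, 5000, 60, 5

# Random sparse binary matrix: one row per observation, about K ones per row.
rows = np.repeat(np.arange(N), K)
cols = rng.integers(0, M, size=N * K)
X = sparse.csr_matrix((np.ones(N * K), (rows, cols)), shape=(N, M))
X.data[:] = 1.0  # collapse any duplicate (row, col) entries back to binary

# O(N) pre-filter: a row with fewer than L ones can never share L features.
row_sums = np.asarray(X.sum(axis=1)).ravel()
candidates = np.flatnonzero(row_sums >= L)

# Inner product of two binary rows = number of features they have in common.
i, j = candidates[0], candidates[1]
common = X[i].multiply(X[j]).sum()
print(f"rows {i} and {j} share {int(common)} features; keep pair: {common >= L}")
```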

But the classic way to reduce the scope of this problem is to do additional pre-filtering. Are you especially interested in when one, somewhat uncommon feature takes the value 1? If so, only perform the computation for those feature vectors.

Or perhaps you could benefit from re-framing your problem. For example, sampling is known to have nice properties; inferential statistics develops this idea in considerable depth. So perhaps it's infeasible to analyze the entire data set, but it's perfectly feasible to examine a small sample. I don't know what question you're trying to answer, but if you carefully design your experiment, you may get away with only looking at a few thousand observations, with more than enough data left over for validation purposes.

After some additional thought, I have a strong hunch that the data you're working with is some kind of graph $G$. It's very plausible that $G$ is composed of several connected components, in which case you can decompose $G$ into a set of graphs, with the happy side-effect of reducing the dimensionality of the data. Even if the graph is only two connected components of roughly the same size, your $O(N^2)$ pairwise comparisons cost roughly half as much in total!

If the graph is undirected (i.e., its adjacency matrix is symmetric), the following observations may be helpful:

  1. Define the Laplacian of your graph as $P=D-A$, where $D$ is a diagonal matrix of degree (the sum of each feature vector) and $A$ is the adjacency matrix (the stacking of feature vectors into a matrix).
  2. The number of times $0$ appears as an eigenvalue of $P$ is the number of connected components of $G$. Decomposing the graph into its connected components and working solely with those components will have the side-effect of reducing the dimension of your data; computing your quantity of interest will be easier. But computing the eigendecomposition would be expensive for a million vertices; in practice the components can be found with a linear-time graph search (see the sketch after this list).
  3. (After a full permutation) $P$ is a block diagonal matrix of the Laplacians of the connected components of $G$.
  4. $P$ is positive semidefinite. This is almost certainly useful somehow.
  5. The algebraic connectivity of $G$ is the value of the second-smallest eigenvalue of $P$. This tells you how well-connected $G$ is. Perhaps that will answer some of the questions you are interested in re: the vectors that have features in common. Spectral graph theory develops this idea in some more detail.
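As a practical aside, you can find the connected components with a linear-time graph search instead of an eigendecomposition; here is a minimal SciPy sketch on a tiny made-up adjacency matrix (the $5\times 5$ matrix and its two components are purely illustrative):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

# Made-up symmetric adjacency matrix with two components: {0, 1, 2} and {3, 4}.
A = sparse.csr_matrix(np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]))

# A graph search over the sparse structure is O(V + E); no eigenvalues needed.
n_components, labels = connected_components(A, directed=False)
print(n_components, labels)  # 2 [0 0 0 1 1]

# The Laplacian P = D - A from the list above, if you want to inspect its spectrum.
D = sparse.diags(np.asarray(A.sum(axis=1)).ravel())
P = D - A
print(np.linalg.eigvalsh(P.toarray()))  # the eigenvalue 0 appears twice
```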

"Is this an SNA problem?" I'm not sure. In one application the features describe behavior and we're looking to connect people with similar behaviors. Does that make this an SNA problem?

If you have a bipartite graph connecting people to behaviors, you can think of this as an affiliation network $B$, with people as rows and behaviors as columns. If you want to connect people to people via the behaviors they have in common, you can compute $BB^T=A$. $A_{ij}$ is the number of behaviors persons $i$ and $j$ have in common. Obviously, the pairs $(i,j)$ with $A_{ij}\ge L$ answer your question.
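A small sketch of this $BB^T$ projection with SciPy; the affiliation matrix `B` and the threshold are made up for illustration:

```python
import numpy as np
from scipy import sparse

L = 2  # illustrative threshold on behaviors in common

# Made-up affiliation matrix: rows are people, columns are behaviors.
B = sparse.csr_matrix(np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]))

# A[i, j] counts the behaviors that persons i and j have in common.
A = (B @ B.T).tocoo()
pairs = [(int(i), int(j), int(v))
         for i, j, v in zip(A.row, A.col, A.data) if i < j and v >= L]
print(pairs)  # [(0, 1, 2)]: persons 0 and 1 share two behaviors
```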

Sycorax
  • 90,934
  • Thanks for the excellent answer! That's a lot of things I will have to investigate further. I'm not convinced the pairwise comparisons are unavoidable, though. Isn't this a clustering problem where I'm looking for clusters of size >1? I was expecting some spatial partitioning approach could heavily cut down on the number of pairwise comparisons. – Daniel Darabos Sep 11 '15 at 11:07
  • Sorry, I don't know much about data science. But isn't it a clustering problem when we are looking to group together points that lie close to each other? I have a maximum distance (L) and want to find groups (pairs) of points that lie within that distance of each other. Is that stretching the definition of clustering too much? – Daniel Darabos Sep 11 '15 at 13:18
  • 1
    It can indeed be phrased as a graph problem. In that case we have a bipartite graph of N points and M features and want to find pairs of points that have at least L common neighbors. I'm specifically looking at the feature vector-based phrasing now, hoping that there is a clustering method that could be of use to me. K-SVD was suggested to a similar problem in http://stats.stackexchange.com/questions/93366/all-nearest-neighbors-in-high-dimensional-space, so I'm reading up on that at the moment. Thanks! – Daniel Darabos Sep 11 '15 at 13:19
  • "Is this an SNA problem?" I'm not sure. In one application the features describe behavior and we're looking to connect people with similar behaviors. Does that make this an SNA problem? Thanks for introducing me to the terminology, it's very helpful to guide my search. – Daniel Darabos Sep 11 '15 at 13:24
  • I've revised my answer. Is your ultimate goal just to enumerate the people with many behaviors in common, or is it something else? – Sycorax Sep 11 '15 at 13:42
  • Thanks, the answer is even more awesome now! I think the ultimate goal in the case of people and behaviors would be to build or augment a social network graph with edges (relationships) identified from the behavioral data. E.g. they do a lot of stuff together, so they must be friends. – Daniel Darabos Sep 11 '15 at 14:11
  • "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces." (1998) looks at a similar problem and finds space-partitioning methods ultimately useless in a high-dimensional space. I think this supports your argument that the quadratic run time may be unavoidable. – Daniel Darabos Sep 14 '15 at 12:28
  • @DanielDarabos Well, I'm not surprised by the result, although it's worth noting that the paper is quite old. – Sycorax Sep 14 '15 at 12:54
  • I think you can avoid the O(N2) complexity, but maybe I missed something important in the question. My argument is too long to be left in a comment, I posted an (updated) answer – RUser4512 Sep 18 '15 at 15:13
7

It looks like the approach you're looking for is a combination of minhash signatures and Locality Sensitive Hashing (LSH); the (freely available) pdf of Mining Massive Datasets describes this approach (and other similarity measures) in some detail in Chapter 3, but briefly:

A minhash signature is a condensed representation of your original matrix, constructed by applying some number $n$ of hash functions to the features, thereby reducing the number of features per observation. This reduces the size of your data; however, you'll probably notice that this still leaves you with an $O(N^2)$ problem.

To address this, MMDS advises that if all you want to find is pairs above a certain threshold of similarity (which would seem to apply in your case), then you can focus only on those pairs that are most likely to be similar - this approach is called Locality Sensitive Hashing, and in section 3.4 they walk through an example of how to combine the minhash signature approach with LSH.
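To make the banding idea concrete, here is a toy sketch in plain NumPy; the hash family, the band/row split, and the tiny made-up sets are illustrative choices rather than the exact construction from MMDS:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
M = 10_000            # number of binary features
n_hashes = 96         # length of each minhash signature
bands, rows = 24, 4   # n_hashes = bands * rows; tuned to the similarity threshold

# Each observation is the set of indices of its nonzero features (made-up data).
sets = {
    "a": set(rng.choice(M, 1000, replace=False).tolist()),
    "c": set(rng.choice(M, 1000, replace=False).tolist()),
}
# "b" overlaps heavily with "a", so it should surface as a candidate pair.
sets["b"] = set(list(sets["a"])[:900]) | set(rng.choice(M, 100, replace=False).tolist())

# Minhash: simulate random permutations with a universal hash family (a*x + b) % p.
p = 2**31 - 1
a, b = rng.integers(1, p, size=(2, n_hashes))

def signature(feature_set):
    feats = np.fromiter(feature_set, dtype=np.int64)
    return ((a[:, None] * feats[None, :] + b[:, None]) % p).min(axis=1)

# LSH banding: split each signature into bands; items that agree on any band collide.
buckets = defaultdict(list)
for key, s in sets.items():
    sig = signature(s)
    for band in range(bands):
        chunk = tuple(sig[band * rows:(band + 1) * rows])
        buckets[(band, chunk)].append(key)

candidate_pairs = {tuple(sorted(v)) for v in buckets.values() if len(v) > 1}
print(candidate_pairs)  # very likely {('a', 'b')}; ('a', 'c') is never even compared
```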

In addition to the text, there are also lectures available on the Coursera course of the same name.

Tchotchke
  • 1,014
2

Inverted dictionary! Represent a point $x$ as $feat_1:value_1, feat_{101}:value_{101}$, with the keys corresponding to non-zero values (i.e. the features holding true). The average storage size of an element will be $K$: indeed, I only need $K$ strings to store the features and $K$ floats to hold the values.

For each feature, build a dictionary holding the indexes that share this feature. Hopefully, this number will not be too big (if some feature is shared by all the indexes, this approach is ruined and you can stop reading here).

This dictionary looks like: $feat_1: \{1,101,202\},\ feat_2: \{7,202\},\ feat_3: \{202\}, \ldots, feat_M: \{3,45,6\}$. If I want to gain speed and save space, I can even drop the features that are only found with one element (here: $feat_3$), as they will not produce close pairs. This dictionary is built in $O(NK)$ operations.

Now, when you want to evaluate the distance of an element $x$ to the others, generate (with the dictionary) the list of indexes sharing at least one feature with $x$. You know that all the other elements are far from $x$ (they do not even share a single feature!). If the average number of "elements per feature" is low (call it $P$), you no longer need $O(N^2)$ operations.

There is another big improvement if $x$ and $y$ are represented as dictionaries as well, since $d(x,y)$ or $\langle x,y\rangle$ can be evaluated by iterating over the keys of $x$ and $y$, in $O(K)$ operations.

Your final complexity is $O(NPK)$, instead of the $O(MN^2)$ of the naive initial approach.
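Here is a minimal sketch of this inverted-dictionary approach in plain Python; the feature names, the four toy elements, and the threshold are made up for illustration:

```python
from collections import defaultdict

L = 2  # illustrative threshold on shared features

# Each element is the set of features that are 1 (about K entries each).
elements = {
    0: {"feat_1", "feat_7", "feat_9"},
    1: {"feat_1", "feat_7", "feat_2"},
    2: {"feat_3"},
    3: {"feat_7", "feat_9", "feat_2"},
}

# Build the inverted index: feature -> indexes holding that feature, O(NK).
index = defaultdict(set)
for idx, feats in elements.items():
    for f in feats:
        index[f].add(idx)

# For each element, count shared features only with elements that share at
# least one feature; everything else is known to be far away.
pairs = set()
for idx, feats in elements.items():
    counts = defaultdict(int)
    for f in feats:
        for other in index[f]:
            if other > idx:          # count each unordered pair once
                counts[other] += 1
    pairs.update((idx, other) for other, c in counts.items() if c >= L)

print(sorted(pairs))  # [(0, 1), (0, 3), (1, 3)]
```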

I applied this method to implement a KNN over a large text set (train: 2,000,000 lines, test: 35,000 lines, number of features: 10,000, average number of features per element: 20), which ran in about an hour...

RUser4512
  • 10,217
  • I don't entirely understand this approach -- this isn't because I disbelieve you, it's entirely due to my lack of familiarity with different strategies for representing data. Perhaps you could elaborate more on what you cover in the first two paragraphs? – Sycorax Sep 18 '15 at 13:56
  • 1) "this number will not be too big": the average column sum = average row sum = 1000. 2) Floats? The OP's features are binary. 3) Runtimes for 3 runs with $N$, $2N$, $4N$ would be interesting; they would show whether it's roughly $O(N^2)$. – denis Jan 26 '16 at 11:13