Searching intersection of elements in subsets with approximation

Question

Task:

Having a large number of transactions that consists of distinct elements from one large set $S$ I need to find transactions in which items have intersection with more than 20% of items in current transaction. I need to avoid unusual complexity comparing elements in each array and perform some sort of dimensionality reduction. I wonder if an array of distinct elements in each transaction could be characterized by a number of float type or some type of hash with an with an acceptable level of approximation $p$ ~0.95.

The target is reducing search algorithm complexity for finding intersected items for a new transaction.

More formal:

Having a set of named entities $S$ with $M$ distinct elements, there is a set of transactions $D$ with elements $D_0..D_n$.

Given a new transaction $D_{n+1}$ and number of past transactions $N$ there is a need to identify a set of indicies of past transactions $D$ which elements have intersection with more than $\omega=20\%$ of items in $D_{n+1}$ with probability $p>=0.95$. Search algorithm complexity should be O(N) assuming separate algorithms with any complexity of coding $S$, storing representations of $D_n$ in any form.

Search algorithm have access at no cost to any representation of $S$ and coded/transformed representations of $T$.

Searching intersection of elements in subsets with approximation

0 Answers0