
I have a list of lists such as the following:

[["This", "is","a", "test"], ["test","something", "here"], ["cat", "dog", "fish"]]

How would I get the two lists that have the most words in common? In this case, it would be the first and second lists, because they both have the word "test" in common.

I have tried solving this by finding the intersection of every combination of two lists and keeping track of the combination with the highest number of words in common. This method, however, seems inefficient with, say, 100,000 lists. It would be (100,000 choose 2), roughly 5 billion, combinations I think. Is there a faster way to do this?

This is my code attempt:

from itertools import combinations

a = [["This", "is", "a", "test"], ["test", "something", "here"], ["cat", "dog", "fish"]]
c = combinations(a, 2)

# Track the largest intersection seen so far and the pair that produced it
max_count = 0
for pair in c:
    intersection = set(pair[0]).intersection(pair[1])
    if len(intersection) > max_count:
        max_count = len(intersection)
        mostsimilar = pair

print(mostsimilar)

The output of my program is what I expected; however, it is very slow on bigger test cases.

Output:

(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])
yzx7243
  • Are the words within each sublist unique, or can e.g. `test` appear twice in a sublist? – Cleb Apr 29 '19 at 21:29
  • This was a simple test case but each list should have many words and repeats could happen. I just need the two lists that have the most in common out of all the lists – yzx7243 Apr 29 '19 at 21:35
  • @yzx7243 If repeats are allowed, then is `['a', 'b', 'b']` considered to have more in common with `['b']` than `['a', 'b']`, or the same? – blhsing Apr 29 '19 at 21:44
  • @blhsing `['a', 'b', 'b']` and `['a', 'b']` have 2 in common (a and b), so these should be the result – yzx7243 Apr 29 '19 at 21:58
  • @yzx7243 I think what @blhsing meant was - say `X = ['a', 'b'], Y = ['a', 'b', 'b']`. Which is more similar to `['b']`, `X` or `Y`? – gmds Apr 29 '19 at 22:24
  • @gmds Ah, I see. I would say they're the same in terms of similarity – yzx7243 Apr 29 '19 at 22:40
  • @yzx7243 How long will the average `list` be? – gmds Apr 29 '19 at 23:21
  • @gmds For my purposes, each sub list will have at least 100 words and the list would have thousands of sub lists – yzx7243 Apr 30 '19 at 00:27
  • @yzx7243 Based on my runtime profiling of the solutions, mine will be better if each sentence has more than 15 words or so. You should definitely do your own testing, though. – gmds Apr 30 '19 at 00:29
  • @gmds Unfortunately the program ran out of memory when I tested your solution with my dataset. (10,000 sub lists). The solution is a bit beyond the scope of my knowledge anyways haha – yzx7243 Apr 30 '19 at 00:55
  • Yes, the bottleneck is in the requirement to convert the sparse matrix to a dense one...if you have memory issues, then @blhsing's approach will be better. – gmds Apr 30 '19 at 01:02

3 Answers


Based on my understanding of the problem, I think this should work:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KDTree, DistanceMetric

data = np.array([' '.join(words) for words in [['This', 'is', 'a', 'test'], ['test', 'something', 'here'], ['cat', 'dog', 'fish']]])

# Binary bag-of-words: each row is a sentence, each column a word (1 = present, 0 = absent)
vectorised = CountVectorizer(binary=True).fit_transform(data).todense()
metric = DistanceMetric.get_metric('manhattan')
kdtree = KDTree(vectorised, metric=metric)
# k=2 because each point's nearest neighbour is itself; column 1 is the nearest other sentence
distances, indices = [result[:, 1] for result in kdtree.query(vectorised, k=2, dualtree=True)]

nearest_distances = (distances == distances.min())
print(data[nearest_distances])

Output:

['This is a test' 'test something here']

I recast the problem in the following manner:

Each list of words (or, sentence) can be represented as a row in a sparse matrix where a 1 in a particular column denotes the presence of a word, and 0 its absence, using sklearn's CountVectorizer.

Then, we can see that the dissimilarity of two sentences, as rows in the sparse matrix, can be measured by summing the element-wise differences of their rows across all columns, which is simply the Manhattan distance. This means that we have a nearest-neighbour problem.

sklearn also provides a k-dimensional tree class, which we can use to find the nearest two neighbours for each point in the dataset (since a point's nearest neighbour is itself). Then, it remains to find the neighbours with the lowest distances, which we can use to index the original array.
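To make the distance concrete, here is a small sketch (the variable names are my own) comparing the binary rows for the sample sentences. Note that CountVectorizer lowercases and drops one-letter tokens such as "a" by default:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is a test', 'test something here', 'cat dog fish']
X = CountVectorizer(binary=True).fit_transform(sentences).toarray()

# The Manhattan distance between two binary rows counts the words that appear
# in exactly one of the two sentences, so a smaller distance means more overlap.
print(np.abs(X[0] - X[1]).sum())  # 4: is/This vs. here/something differ, test is shared
print(np.abs(X[0] - X[2]).sum())  # 6: no words shared with 'cat dog fish'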

Using %%timeit, I tested the runtime of my solution vs blhsing's solution on the text on this page, leaving imports outside the timing loop:

# my solution
198 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  
# blhsing's solution
4.76 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

Restricting the length of sentences to those under 20 words:

# my solution
3.2 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# blhsing's solution
6.08 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
gmds

A more efficient approach runs in O(n x m) time (where n is the number of sub-lists and m is the average number of words per sub-list; if m is relatively small and roughly constant compared to n, the solution scales linearly with n). Iterate through the list of lists and build a seen dict that maps each word to the indices of the sub-lists found so far to contain it. Then use collections.Counter over the words of the current sub-list, with its most_common method, to find the index of the already-seen sub-list that shares the most words with it:

from collections import Counter

seen = {}  # maps each word to the indices of the sub-lists seen so far that contain it
most_common_pair = None
most_common_count = 0
for index, lst in enumerate(a):
    # count, for every earlier sub-list, how many of the current words it also contains
    for common_index, common_count in Counter(
            i for word in lst for i in seen.get(word, ())).most_common(1):
        if common_count > most_common_count:
            most_common_count = common_count
            most_common_pair = a[common_index], lst
    for word in lst:
        seen.setdefault(word, []).append(index)

Given your sample input list in variable a, most_common_pair would become:

(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])
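If you want this as a reusable function, here is a minimal sketch of the same idea (the function name is my own, and it collapses repeated words with set(), matching the "repeats count once" behaviour discussed in the comments):

from collections import Counter

def most_similar_pair(lists):
    seen = {}  # word -> indices of earlier sub-lists that contain it
    best_pair, best_count = None, 0
    for index, lst in enumerate(lists):
        words = set(lst)  # collapse repeated words within a sub-list
        counts = Counter(i for word in words for i in seen.get(word, ()))
        if counts:
            common_index, common_count = counts.most_common(1)[0]
            if common_count > best_count:
                best_count = common_count
                best_pair = lists[common_index], lst
        for word in words:
            seen.setdefault(word, []).append(index)
    return best_pair

a = [["This", "is", "a", "test"], ["test", "something", "here"], ["cat", "dog", "fish"]]
print(most_similar_pair(a))  # (['This', 'is', 'a', 'test'], ['test', 'something', 'here'])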
blhsing
  • What if each sub list was hundreds of words long? Wouldn't it matter then? – yzx7243 Apr 30 '19 at 00:25
  • Hundreds of words should still be relatively trivial compared to the size of a list that's on the order of a hundred thousand. I've updated my answer with a better description of the time complexity. – blhsing Apr 30 '19 at 04:31

My take on it. I tested it with a list of 729 lists and it still works fast. I'm not sure how much faster it is, if at all, honestly. But it doesn't use sets.

Here it is (it has a test in it; just use the function):

a = [1,2,3,4,5,6,7,8,9]
b = [9,10,11,12,13,14,15]
c = [11,3,99]
d = [9,10,11,12,50,1]
ls = [a,b,c,d]


for i in range(100,3000,2):
    ls.append([i,i+1,i+2,i+3])


def f(ls):
    wordDic = {}   # word -> list of indexes of the lists that contain it
    countDic = {}  # list index -> {other list index: number of shared words}
    ind = 0
    for t in range(len(ls)):
        countDic[t] = {}
    for l in ls:
        for word in l:
            try:
                for sharedIndex in wordDic[word]: #For every list that we already know has this word
                    try:
                        countDic[sharedIndex][ind] += 1
                        try:
                            countDic[ind][sharedIndex] += 1
                        except KeyError:
                            countDic[ind][sharedIndex] = 1
                    except KeyError:
                        countDic[sharedIndex][ind] = 1
                wordDic[word].append(ind)
            except KeyError:
                wordDic[word] = [ind]
        ind += 1

    # Scan countDic for the pair of lists with the highest shared-word count
    mx = 0
    ans = -1
    ind = 0
    for sLs in countDic.values():
        for rLr in sLs:
            if mx < sLs[rLr]:
                ans = (ls[ind], ls[rLr])
            mx = max(mx, sLs[rLr])
        ind += 1
    return ans

print(f(ls))  

What it does:

It's based on these two dictionaries: wordDic and countDic.

wordDic keys are each "word" that is used, with its value being a list of the indexes where said word was found.

countDic keys are the indexes of each list, and its values are dictionaries that hold how many words that list shares with each other list:

countDic = { listInd: {otherLists:sharedAmount , ...} , ...}

First it creates the dictionaries. Then it goes through each list once and saves the words it contains: for each word, it adds one to the "shared words" count in the second dictionary for every list already known to contain that word, and then appends its own index to that word's list of indexes.

When it finishes you'd have something like this:

{0: {1: 1, 2: 1, 3: 2}, 1: {2: 1, 3: 4}, 2: {3: 1}, 3: {1: 3, 0: 1}} ([9, 10, 11, 12, 13, 14, 15], [9, 10, 11, 12, 50, 1])

which reads as:

{List 0: shares 1 element with list 1, 1 with list 2, and 2 with list 3; List 1: shares 1 element with list 2 and 4 with list 3; ...}

In this case, list 1 shares the most elements with list 3, more than any other pair. The rest of the function simply goes through the dictionary and finds this maximum.
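For instance, calling the function on the question's original word lists (a quick check of my own, separate from the integer test above) gives:

words = [["This", "is", "a", "test"], ["test", "something", "here"], ["cat", "dog", "fish"]]
print(f(words))  # (['This', 'is', 'a', 'test'], ['test', 'something', 'here'])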

I probably messed up my explanation. I think it would be best that you check if the function works better than your own first, and then try to understand it.

I also just noticed that you probably only need to add 1 to the previously found lists, and don't need to add that other list to the one you're currently testing. I'll see if that works.

EDIT1: It seems so. The lines:

try:
    countDic[ind][sharedIndex] += 1
except KeyError:
    countDic[ind][sharedIndex] = 1

Can be commented out.

I hope this helps.

laozoka