Friends, a bit of context first. I am trying to calculate the distance between every pair of strings in a list, and to do so I have a library with a function called affineGapDistance that calculates the distance between two strings. Say I have a list of 1000 different strings: with a nested for loop that is 1 million iterations (1000 x 1000). This grows quadratically as the number of strings increases; with 20000 strings it would be 400 million iterations.
What is the best way to speed this up? I started with a simple for loop, preallocating the results array to avoid calling append:
%%time
import numpy as np

results = np.zeros(len(list_of_strings) * len(list_of_strings))
counter = 0
for string in list_of_strings:
    for string2 in list_of_strings:
        results[counter] = affineGapDistance(string, string2)
        counter += 1
Took 21.5 seconds to run.
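As an aside, affine gap distances are usually symmetric, so if affineGapDistance(a, b) == affineGapDistance(b, a) (an assumption on my part; I have not verified it for this library), computing only the unique pairs would halve the work. A rough sketch of what I mean:

from itertools import combinations

# Assumes the distance is symmetric; computes each unordered pair once,
# n*(n-1)/2 calls instead of n*n (self-distances are skipped here).
pair_distances = {
    (a, b): affineGapDistance(a, b)
    for a, b in combinations(list_of_strings, 2)
}

That only halves the runtime at best, though; the growth is still quadratic.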
Then I tried a list comprehension:

%%time
from tqdm import tqdm

distances = [(word1, word2, affineGapDistance(word1, word2))
             for word1 in tqdm(list_of_strings)
             for word2 in list_of_strings]

Took 21.4 seconds to run, a 0.1-second improvement.
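That tells me the loop construct barely matters: almost all of the time is being spent inside affineGapDistance itself. Since every pair is independent, one thing I have been considering is parallelizing the calls with the standard library's multiprocessing. A minimal sketch (untested on my real data; note that the worker function must be importable by the worker processes, so on platforms that spawn rather than fork it has to live in a module, not just a notebook cell):

from itertools import product
from multiprocessing import Pool

def pair_distance(pair):
    # Small wrapper so Pool.map can apply affineGapDistance to a (word1, word2) tuple.
    a, b = pair
    return a, b, affineGapDistance(a, b)

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        distances = pool.map(pair_distance,
                             product(list_of_strings, repeat=2),
                             chunksize=1000)

With N cores that could approach an N-fold speedup, minus the overhead of shipping the string pairs to the workers.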
And my last try was using pandas to vectorize it:
import pandas as pd

temp_df = pd.DataFrame(list_of_strings, columns=["words"])
compare = pd.MultiIndex.from_product([temp_df['words'],
                                      temp_df['words']]).to_series()

def metrics(tup):
    return affineGapDistance(*tup)

%%time
rs = compare.apply(metrics)
The result was 21.2 seconds to run, a 0.3-second improvement compared to the for loop.
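I suspect pandas does not help here because Series.apply is not vectorized in the NumPy sense: it still calls the Python function once per element, so the per-call cost of affineGapDistance dominates exactly as in the explicit loop. One more thing I thought of checking: if list_of_strings contains duplicates, memoizing the distance would avoid recomputing repeated pairs. A sketch with functools (only worthwhile if duplicates actually occur in the data):

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_distance(a, b):
    # Each distinct (a, b) pair is computed once, then served from the cache.
    return affineGapDistance(a, b)

With all-unique strings this adds overhead rather than removing it.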
My question is: am I doing this right, or am I missing something? Is there a better way to do this task?
Thanks in advance, folks.