Friends, a bit of context first. I am trying to calculate the distance between every pair of strings in a list. To do so I have a library function called affineGapDistance that calculates the distance between two strings. With a list of 1,000 strings, a nested for loop means 1,000,000 iterations (1000 × 1000). This grows quadratically as the number of strings increases: with 20,000 strings it is 400 million iterations.

What is the best way to speed this up? I started with a simple for loop, preallocating the results array to avoid append:

%%time
results = np.zeros(len(list_of_strings)*len(list_of_strings))
counter = 0
for string in list_of_strings:
    for string2 in list_of_strings:
        results[counter] = affineGapDistance(string, string2)
        counter += 1

Took 21.5 seconds to run.
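One thing worth checking: if affineGapDistance is symmetric (i.e. `d(a, b) == d(b, a)`), roughly half the calls can be skipped by iterating over unordered pairs instead of the full product. A minimal sketch of that idea, using a stand-in metric since I don't have the library here (swap in the real affineGapDistance):

```python
from itertools import combinations

# Stand-in for affineGapDistance; replace with the real library call.
def distance(a, b):
    return abs(len(a) - len(b))

def pairwise_distances(strings):
    """Compute each unordered pair exactly once.

    This assumes the metric is symmetric, so (a, b) covers (b, a) too.
    """
    results = {}
    for a, b in combinations(strings, 2):
        results[(a, b)] = distance(a, b)
    return results

pairs = pairwise_distances(["cat", "caterpillar", "dog"])
```

For n strings this does n·(n−1)/2 calls instead of n², and the diagonal (`d(s, s)`) is usually known to be zero, so it can be skipped as well.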

Then I tried list comprehension:

%%time
distances = [(word1, word2, affineGapDistance(word1, word2))
             for word1 in tqdm(list_of_strings)
             for word2 in list_of_strings]

Took 21.4 seconds to run, a 0.1-second improvement.

And my last try was using pandas to vectorize it:

temp_df = pd.DataFrame(list_of_strings, columns=["words"])

compare = pd.MultiIndex.from_product([temp_df['words'],
                                      temp_df['words']]).to_series()

def metrics(tup):
    return affineGapDistance(*tup)

%%time
rs = compare.apply(metrics)

The result was 21.2 seconds, a 0.3-second improvement compared to the for loop.
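Note that `Series.apply` still calls the Python-level function once per row, so it is not really vectorized. If the list contains duplicate strings, one cheap win is memoizing the metric so repeated pairs are computed only once. A sketch with `functools.lru_cache` and a stand-in metric (replace `distance` with the real affineGapDistance; the pair-ordering trick assumes the metric is symmetric):

```python
from functools import lru_cache

# Stand-in for the library's affineGapDistance; swap in the real function.
def distance(a, b):
    return abs(len(a) - len(b))

@lru_cache(maxsize=None)
def cached_distance(a, b):
    # Order the pair so (a, b) and (b, a) share one cache entry;
    # this assumes the metric is symmetric.
    if a > b:
        a, b = b, a
    return distance(a, b)

words = ["cat", "dog", "cat", "bird"]
dists = [cached_distance(a, b) for a in words for b in words]
```

This only pays off when duplicates (or repeated pairs) are common; for all-distinct strings the cache adds overhead without saving any calls.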

My question is: am I doing this right, or am I missing something? Is there a better way to do this task?

Thanks in advance folks.

  • The runtime difference calculations are incorrect. Please include a [mre] to make the question reproducible. You forgot to include the time to convert to `pd.DataFrame` in the last benchmark. – Michael Szczesny May 11 '22 at 08:33
  • If you can't refactor your function to rely on built-in vectorized operations, then you aren't going to be able to beat it with vanilla numpy + Python. Apparently `numpy.vectorize` has improved performance, but not by the large margin you might expect. The fundamental issue is that you have to execute a Python-level function on each element, which will always introduce a lot of overhead. So your problem is with `affineGapDistance`. Can it be rewritten in terms of built-in vectorized operations? – juanpa.arrivillaga May 11 '22 at 08:43
  • Alternatively, you can use something like `numba`, which provides JIT and AOT compilation for functions written in Python. So if your function is not easily expressed in terms of vectorized operations, you can keep expressing it in the typical imperative/procedural Python way – juanpa.arrivillaga May 11 '22 at 08:46
  • And fundamentally, here we are talking about improving constant factors, but the algorithm is required to be quadratic, so improving the constant factors will only take you so far – juanpa.arrivillaga May 11 '22 at 08:51
  • 0.1 seconds is a decisecond, not a millisecond. Odd to see that mistake from a European data scientist :-P – Kelly Bundy May 11 '22 at 08:52
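The `numba` suggestion above can be sketched as follows. Since I don't know affineGapDistance's internals, this uses plain Levenshtein distance as a stand-in dynamic-programming metric, operating on integer code-point arrays (a numba-friendly representation); the try/except fallback just makes the snippet runnable without numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # numba is optional here; fall back to plain Python
    def njit(f):
        return f

@njit
def levenshtein(a, b):
    # Classic edit-distance DP over integer arrays, used as a
    # stand-in for an affine-gap metric.
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1), dtype=np.int64)
    for i in range(n + 1):
        dp[i, 0] = i
    for j in range(m + 1):
        dp[0, j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,
                           dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + cost)
    return dp[n, m]

def encode(s):
    # Convert a string to an int64 array of code points for the JIT kernel.
    return np.array([ord(c) for c in s], dtype=np.int64)

d = levenshtein(encode("kitten"), encode("sitting"))
```

With numba present, the first call pays a compilation cost and subsequent calls run at roughly C speed, which attacks the per-call Python overhead the comments describe. As noted above, though, the algorithm stays quadratic in the number of strings, so this only improves the constant factor.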

0 Answers