  1. I have a list of 50 million search queries (each query is a string of 1-500+ words).
  2. I also have a list of ~500 words and phrases, and I need to return the indices of the queries in (1) that contain any word or phrase from (2).

The goal is to keep only the queries related to a certain topic (movies) and then use NLP to cluster the filtered queries (stemming -> tf_idf -> pca -> kmeans).
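For context, a minimal sketch of that downstream pipeline, assuming scikit-learn is available. TruncatedSVD stands in for PCA because TF-IDF output is a sparse matrix, and the "stemmer" here is a crude suffix-strip stand-in (a real run would use something like NLTK's SnowballStemmer); the query list is a made-up sample.

```python
# Sketch of the clustering pipeline (stemming -> TF-IDF -> dimensionality
# reduction -> k-means), assuming scikit-learn. TruncatedSVD replaces PCA
# because TF-IDF matrices are sparse; crude_stem is a toy stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def crude_stem(text):
    # toy stemmer: strips a trailing plural 's'; use a real stemmer in practice
    return ' '.join(w[:-1] if w.endswith('s') and len(w) > 3 else w
                    for w in text.lower().split())

queries = [  # hypothetical sample of already-filtered queries
    'the dark knight', 'watch movies', 'best oscar movies',
    'netflix new releases', 'netflix movie list', 'oscars 2019 winners',
]

stemmed = [crude_stem(q) for q in queries]
tfidf = TfidfVectorizer().fit_transform(stemmed)        # sparse (n_queries, n_terms)
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```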

I tried filtering the queries with nested loops, but it would take more than 10 hours to finish:

filtered = []
with open('search_logs.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        query, timestamp = line.strip().split('\t')
        for word in key_words:
            if word in query:
                filtered.append(i)
                break  # stop after the first hit so the same index isn't appended twice

I looked into solutions that use a single regex (word1|word2|...|wordN), but the problem is that I cannot concatenate the queries into one large string, since I need the index of each individual matching query.

UPDATE: examples of logs and keywords

search_logs.txt
'query  timestamp\n'
'the dark knight    2019-02-17 19:05:12\n'
'how to do a barrel roll    2019-02-17 19:05:13\n'
'watch movies   2019-02-17 19:05:13\n'
'porn   2019-02-17 19:05:13\n'
'news   2019-02-17 19:05:14\n'
'rami malek 2019-02-17 19:05:14\n'
'Traceback (most recent call last): File "t.py" 2019-02-17 19:05:15\n'
.......... # millions of other search queries
key_words = [
    'movie',
    'movies',
    'cinema',
    'oscar',
    'oscars',
    'george lucas',
    'ben affleck',
    'netflix',
    .... # hundreds of other words and phrases
]
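For what it's worth, the alternation regex does not require concatenating the queries: compiling it once and applying it line by line keeps each query's index. A sketch with a hypothetical subset of the keyword list (longer phrases are sorted first so that, e.g., 'movies' wins over 'movie' at the same position):

```python
import re

# Compile one alternation regex from the keywords and apply it per line,
# so every matching query keeps its original index.
key_words = ['movie', 'movies', 'cinema', 'oscar', 'george lucas']  # sample subset

pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(key_words, key=len, reverse=True)))

lines = [  # stand-in for iterating over search_logs.txt
    'the dark knight\t2019-02-17 19:05:12',
    'how to do a barrel roll\t2019-02-17 19:05:13',
    'watch movies\t2019-02-17 19:05:13',
    'george lucas interview\t2019-02-17 19:05:14',
]

filtered = [i for i, line in enumerate(lines)
            if pattern.search(line.split('\t')[0])]
# filtered -> [2, 3]
```

This avoids the nested keyword loop, though a plain union regex still backtracks over all alternatives; the trie-based variants discussed in the comments compress the pattern further.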
Superbman
  • With this much data, you should expect a long running time. – Tim Biegeleisen Apr 21 '19 at 15:08
  • True, but I suspect there are more efficient ways to do this – Superbman Apr 21 '19 at 15:12
  • 1
    You could look into multiprocessing to run the algorithm in parallel on all your available cores. Python is single-threaded and generally slow, so I'd prefer to write this sort of thing in C as a multithreaded application. Regex is probably not a performance-oriented solution, either. – ggorlen Apr 21 '19 at 15:15
  • 1
    Have you seen [this thread](https://stackoverflow.com/a/42789508/3832970)? With a regex trie, you may create a compact regex that will search exactly for your strings. – Wiktor Stribiżew Apr 21 '19 at 16:05
  • Nope, I'll give it a try. – Superbman Apr 21 '19 at 16:16
  • The regex trie can be made even faster (at the cost of O(n^2) preprocessing) by converting it to an actual FSA, minimizing it, and converting it back, which yields an even smaller regex. The minimization operation is very expensive, though. There is a great post about implementing this in Rust: https://blog.burntsushi.net/transducers/ – Tommaso Fontana Feb 03 '20 at 16:55
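The regex-trie idea from the linked thread can be sketched in a few lines: build a character trie from the keywords, then serialize it back into a regex so that shared prefixes are emitted only once. This is a simplified version of the approach in the linked answer, not its exact code:

```python
import re

# Simplified regex-trie sketch: shared keyword prefixes are factored out,
# so the resulting pattern is far more compact than word1|word2|...|wordN.
def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}  # end-of-word marker
    return trie

def trie_to_regex(node):
    if set(node) == {''}:
        return ''  # pure end of word: nothing follows
    alts = [re.escape(ch) + trie_to_regex(child)
            for ch, child in sorted(node.items()) if ch != '']
    if '' in node:
        # this prefix is itself a complete keyword, so the remainder is optional
        return '(?:' + '|'.join(alts) + ')?'
    return alts[0] if len(alts) == 1 else '(?:' + '|'.join(alts) + ')'

key_words = ['movie', 'movies', 'cinema', 'oscar', 'oscars']  # sample subset
pattern = re.compile(trie_to_regex(build_trie(key_words)))
```

For example, ['movie', 'movies'] collapses to the single pattern movie(?:s)? instead of a two-branch alternation.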

1 Answer


I would suggest FlashText, which was developed to be very efficient for exactly this kind of task. It will work as long as the keywords you are searching for are plain strings (as opposed to complicated regexes).

aab