I am currently running into a problem when running my "old" code for estimating LDA models. The code, which worked fine on my laptop, looks as follows:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
count_vectorizer = CountVectorizer(min_df=0.025, max_df=0.75)
count_data = count_vectorizer.fit_transform(my_data)
lda_ = LatentDirichletAllocation(
    n_components=20, max_iter=100, n_jobs=-1, random_state=0
).fit(count_data)
On my new computer, running exactly the same code, I get the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
Setting the n_jobs parameter to 1 seems to "fix" the problem, but training then takes much longer. Any n_jobs value other than 1 leads to the TerminatedWorkerError. my_data is a 704x3911 sparse matrix of type <class 'numpy.int64'>. I am using the same conda virtual environment, so exactly the same packages are installed. My new computer has about 30 GB of RAM and an Intel(R) Core(TM) i7-11700 with 16 logical CPU cores. It also has a GPU; could that be causing the problem?
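If it helps with diagnosis, I could run something like the following (a minimal sketch, assuming count_data from the snippet above is already in memory; the reduced max_iter is only there to make the test fast) to see at which worker count the error first appears:

from sklearn.decomposition import LatentDirichletAllocation

# Try a few explicit worker counts to see where TerminatedWorkerError
# first shows up (count_data is the sparse matrix from the snippet above).
for workers in (1, 2, 4, 8):
    try:
        LatentDirichletAllocation(
            n_components=20, max_iter=10,  # reduced iterations, test only
            n_jobs=workers, random_state=0,
        ).fit(count_data)
        print(f"n_jobs={workers}: OK")
    except Exception as exc:  # TerminatedWorkerError is raised by joblib/loky
        print(f"n_jobs={workers}: {type(exc).__name__}: {exc}")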
I have tried some solutions that I found on the Internet. First, I checked that I have the latest C++ version installed, as proposed in "How do I fix/debug this Multi-Process terminated worker error thrown in scikit learn". Second, I ran the same code from a .py file to make sure this is not a Jupyter problem.
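To rule out an environment mismatch, I could also compare the exact library versions and the CPU count that joblib detects on both machines with something like this (just a sketch using scikit-learn's and joblib's built-in helpers):

import sklearn
import joblib

# Prints Python, numpy/scipy, BLAS and OpenMP details for comparison
sklearn.show_versions()
# Number of CPUs joblib sees, i.e. how many workers n_jobs=-1 would spawn
print("joblib sees", joblib.cpu_count(), "CPUs")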
Do you have any ideas about what might be causing this problem? I would really appreciate your help!