Hi there, I have this code:

from collections import Counter
from functools import partial
import multiprocessing

def kmer_count(sequence, alphabet, k):
    """Returns a dictionary with kmers and their counts."""
    seq = sequence.upper()
    seq_len = len(seq)
    kmers = [seq[i:i+k] for i in range(0, seq_len - k + 1)]
    filtered_kmers = [kmer for kmer in kmers if all(base in set(alphabet) for base in kmer)]
    return Counter(filtered_kmers)

pool = multiprocessing.Pool(4)

args = [''.join(alphabet.iupac_dna), 4]  # in the real code it came from argparse
f = partial(kmer_count, seq)
pool.starmap(f, *args)

Result:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
TypeError: kmer_count() missing 1 required positional argument: 'k'
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-79-0f5ba8381a51> in <module>
      1 f = partial(kmer_count,seq)
----> 2 pool.starmap(f, *args)

/opt/anaconda3/lib/python3.7/multiprocessing/pool.py in starmap(self, func, iterable, chunksize)
    274         func and (a, b) becomes func(a, b).
    275         '''
--> 276         return self._map_async(func, iterable, starmapstar, chunksize).get()
    277 
    278     def starmap_async(self, func, iterable, chunksize=None, callback=None,

/opt/anaconda3/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

TypeError: kmer_count() missing 1 required positional argument: 'k'

If I call the function directly, without the pool, like this:

f(*args)

I get this:

Counter({'GTTC': 9780,
         'TTCG': 11291,
         'TCGC': 14803,
         'CGCC': 14428,
         'GCCA': 19254,
         'CCAG': 11512,
         'CAGA': 11641,
         'AGAG': 10387,
         'GAGC': 13164,
         'AGCG': 16025,
         'GCGG': 13067,
         'CGGT': 12329,
         'GGTT': 14217,
         'GTTT': 18733,
         'TTTT': 25488,
         'TTTG': 22006,
         'TTGA': 20004,
         'TGAC': 11351,
         'GACT': 7262,
         'ACTA': 5663,
         'CTAG': 4190,
         'TAGC': 8543,
         'AGCT': 12898,
         'GCTT': 19114,
         'CTTG': 16557,
         'TGAA': 17703,
         'GAAC': 9334,
         'AACA': 14050,
         'ACAC': 8812,
         'CACA': 11050,
         'ACAT': 9705,
         'CATC': 17357,
         'ATCC': 11157,
         'TCCC': 5754,
         'CCCG': 5926,
         'CCGT': 8601,
         'CGTC': 7401,
         'GTCC': 3811,
         'TCCT': 5890,
         'CCTG': 7985,
         'CTGC': 15324,
         'TGCG': 14856,
         'TTTA': 15607,
         'TTAG': 8366,
         'TAGG': 5393,
         'AGGC': 9580,
         'GGCA': 13939,
         'GCAC': 11686,
         'CACC': 14496,
         'ACCA': 16950,
         'CCAC': 13028,
         'CACT': 13194,
         'ACTT': 12585,
         'TTGC': 18877,
         'TGCA': 13770,
         'GCAT': 13307,
         'CATT': 14102,
         'ATTA': 9776,
         'TTAA': 12585,
         'TAAG': 9048,
         'AAGC': 18834,
         'AGCC': 12734,
         'CCAT': 13799,
         'CATA': 8188,
         'ATAA': 12278,
         'GCGT': 12546,
         'CGTT': 12599,
         'GTTG': 15136,
         'TTGG': 18903,
         'TGGC': 19391,
         'GGCC': 7685,
         'GCCC': 7404,
         'CCCA': 9752,
         'CCAA': 18755,
         'CAAC': 14852,
         'AACT': 11886,
         'ACTC': 10028,
         'CTCA': 13518,
         'TCAC': 15385,
         'ACCG': 12027,
         'CCGC': 13105,
         'CGCA': 14853,
         'ATAG': 7066,
         'AGGG': 5273,
         'GGGG': 4927,
         'GGGT': 7972,
         'GAAA': 16511,
         'AAAC': 18215,
         'ACTG': 12639,
         'TGCT': 16663,
         'CTTT': 18228,
         'TTTC': 17497,
         'TCGG': 10116,
         'TGAT': 20593,
         'GATG': 17301,
         'ATGA': 14218,
         'GATC': 14479,
         'ATCG': 16604,
         'CGCG': 13405,
         'CGTA': 8086,
         'GTAG': 6829,
         'TAGA': 6656,
         'AGAT': 11833,
         'ATCA': 20344,
         'TCAG': 13868,
         'ATCT': 11892,
         'TCTT': 14706,
         'CTTC': 13897,
         'TTCC': 9522,
         'TCCG': 7099,
         'CCGG': 5935,
         'CGGG': 5786,
         'GGGA': 5615,
         'GGAT': 11054,
         'GATA': 12085,
         'ATAT': 9205,
         'TATC': 11941,
         'TCGT': 8802,
         'GTAT': 8661,
         'TATT': 12620,
         'ATTT': 18149,
         'TGAG': 13150,
         'GAGA': 8658,
         'GAGG': 5644,
         'GCCT': 9673,
         'CCTT': 11377,
         'CTTA': 9100,
         'TTAC': 11418,
         'TACG': 8250,
         'ACGC': 12774,
         'CACG': 10171,
         'ACGA': 8951,
         'CGAT': 16617,
         'ATTG': 17372,
         'GGCG': 14257,
         'TTGT': 13287,
         'TGTA': 7808,
         'ACGG': 8529,
         'CGGA': 7109,
         'ATGT': 9975,
         'TGTC': 8286,
         'GTCG': 9088,
         'TCGA': 11986,
         'GATT': 16232,
         'GACC': 7608,
         'CATG': 11552,
         'ATGG': 13703,
         'TGGG': 9728,
         'CGAC': 9020,
         'GACG': 7394,
         'ACGT': 8085,
         'GTAA': 11360,
         'TAAT': 9895,
         'AATA': 12342,
         'GGCT': 12713,
         'AAAT': 17834,
         'AATC': 16220,
         'TTCA': 18526,
         'TCAA': 20304,
         'CAAA': 21752,
         'AATG': 14200,
         'TGGT': 16952,
         'GTTA': 10179,
         'GCGA': 14624,
         'AGGT': 9338,
         'GGTA': 10326,
         'CGGC': 12423,
         'TAAC': 9956,
         'AACG': 12524,
         'GCGC': 18306,
         'CTGG': 11344,
         'TGGA': 10805,
         'GCTG': 17455,
         'CTGA': 13532,
         'CGAG': 8649,
         'GAGT': 10014,
         'AGTA': 8668,
         'GTAC': 8038,
         'TACC': 10217,
         'ACCT': 9204,
         'ATAC': 8527,
         'TACA': 7453,
         'ACAA': 12591,
         'AAAA': 24650,
         'AAAG': 17848,
         'AAGG': 11240,
         'AGTT': 12431,
         'CAAT': 17456,
         'GTCT': 6235,
         'AGGA': 5937,
         'GGAG': 5294,
         'AGAA': 12516,
         'AAGA': 13980,
         'CGTG': 9948,
         'GTGC': 11762,
         'TCAT': 14388,
         'AGTC': 7451,
         'GTCA': 11557,
         'CAGT': 13149,
         'GGTC': 7553,
         'GTGA': 15043,
         'TAAA': 15634,
         'GGGC': 7200,
         'GGTG': 14495,
         'TTCT': 13230,
         'TCTA': 6621,
         'CTAC': 6510,
         'TACT': 8573,
         'CTAT': 6856,
         'TATG': 7993,
         'ATTC': 11395,
         'GAAG': 13449,
         'AAGT': 12578,
         'GCTA': 8346,
         'CTAA': 8310,
         'TAGT': 5859,
         'GACA': 8060,
         'TTAT': 12293,
         'TCCA': 11138,
         'CAGC': 17389,
         'GCTC': 13661,
         'CTCG': 8724,
         'CCCC': 5036,
         'TGCC': 13824,
         'GCCG': 12340,
         'GGAA': 9364,
         'CTGT': 9651,
         'TGTT': 14580,
         'ATGC': 13150,
         'AGTG': 13050,
         'GTGT': 9052,
         'TCTG': 11772,
         'AATT': 13738,
         'CGAA': 11026,
         'GTGG': 12927,
         'TCTC': 8866,
         'ACAG': 9505,
         'CCTC': 5686,
         'CTCC': 5391,
         'CCTA': 5236,
         'CGCT': 16302,
         'AACC': 13897,
         'ACCC': 8037,
         'GCAA': 18697,
         'CAAG': 16287,
         'AGCA': 16273,
         'GCAG': 15145,
         'CCGA': 9751,
         'CAGG': 7851,
         'TATA': 4461,
         'GGAC': 3754,
         'CTCT': 10608,
         'TGTG': 11291,
         'AGAC': 6199,
         'CCCT': 5517,
         'GAAT': 11315}

What am I doing wrong in the multiprocessing step? I want to speed up my analysis because I have a lot of genomes in a directory to process. I just need to figure out how to implement the pool, because all my other functions are probably similar.

Thanks for your time and attention

  • I would guess that kmer counting will be much faster with tools like KMC. Is there any particular reason you want your own counter in Python? – Kamil S Jaron Sep 27 '20 at 17:47
  • I know what you mean, but I am learning to write code and I just want to use Python to do that! There are many tools out there like khmer, jellyfish, kat, etc. Thank you – Paulo Sergio Schlogl Sep 27 '20 at 18:06
  • I guess that's fair. The question is related to this one about kmer counting in Python. Although it does not specifically talk about multiprocessing, I thought it's still relevant. – Kamil S Jaron Sep 27 '20 at 18:39
  • Counting the kmers is not the problem. My problem is just the multiprocessing step that's giving me a hard time. – Paulo Sergio Schlogl Sep 27 '20 at 20:11

1 Answer

This code helped me get the answer I was looking for.

r = pool.apply_async(f, a)
print(r.get())
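
For context on why the original call failed: `pool.starmap(f, *args)` unpacks `args`, so the alphabet string becomes the `iterable` and `4` becomes the `chunksize`. `starmap` then iterates over the string one character at a time and calls `f('A')`, `f('C')` and so on, which is why `k` is reported as missing. `starmap` expects an iterable of argument tuples, one tuple per task. Below is a minimal sketch of that usage, assuming the goal is to count kmers for several sequences in parallel. The helper names `count_one` and `count_many` and the toy sequences are made up for illustration, and `kmer_count` is a slightly condensed version of the function from the question.

```python
from collections import Counter
from functools import partial
import multiprocessing


def kmer_count(sequence, alphabet, k):
    """Return a Counter of the kmers made only of letters in alphabet."""
    seq = sequence.upper()
    allowed = set(alphabet)
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= allowed)


def count_one(seq, alphabet, k):
    """Hypothetical helper: the partial from the question, called correctly."""
    f = partial(kmer_count, seq)
    with multiprocessing.Pool(1) as pool:
        # The iterable holds ONE task, the (alphabet, k) tuple,
        # so starmap calls f(alphabet, k) exactly once.
        return pool.starmap(f, [(alphabet, k)])[0]


def count_many(sequences, alphabet, k, processes=2):
    """Hypothetical helper: one task per sequence, run in parallel."""
    # starmap calls kmer_count(*task) for every tuple in tasks.
    tasks = [(seq, alphabet, k) for seq in sequences]
    with multiprocessing.Pool(processes) as pool:
        return pool.starmap(kmer_count, tasks)


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ttttNttt"]  # toy stand-ins for real genomes
    for counts in count_many(seqs, "ACGT", 4):
        print(counts)
```

With a single sequence there is only one task, so a pool gives no speed-up. The gain comes from handing the pool one task per genome, as `count_many` does. The `apply_async` call above works for the same reason `f(*args)` does: it passes the argument sequence to `f` as a single call (assuming `a` there holds the alphabet and `k`).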
user438383