I am trying to write a Python script for customized filtering of FASTQ files (size > 3 GB). My proposed script is as follows:
import os
import time
import multiprocessing

import numpy as np
import HTSeq
from tqdm import tqdm

def filtering(read):
    time.sleep(0.1)
    if len(read) >= 15 and \
       len(read) <= 30 and \
       np.mean(read.qual) >= 30 and \
       sum(i < 20 for i in read.qual) <= 2:
        flag1 = True
        return flag1
pool = multiprocessing.Pool(5)
pool.ncpus = 4
for file in os.listdir('/moRNA_data/trimmed_fastq'):
    if file.endswith('.fq'):
        fastq_file = HTSeq.FastqReader(os.path.join('/moRNA_data/trimmed_fastq', file))
        name = str(file.split('.')[0]) + '.fastq'
        file_out = open(os.path.join('/moRNA_data/filtered_fastq', name), "w")
        for read in tqdm(fastq_file):
            flag1 = pool.apply_async(filtering, read)
            if flag1:
                read.write_to_fastq_file(file_out)
pool.close()
pool.join()
The problem with this script is that it works well with small files (up to 100-200 MB), but when it runs over a directory with multiple samples, each larger than 3 GB, it works fine for the first file and then hangs so severely that I had to restart the server (I can't even kill the process). I need suggestions to improve this script and make FASTQ processing faster. Platform details are as follows:
OS: Ubuntu 18.04
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Stepping: 10
CPU MHz: 800.560
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 6384.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11
Thanks.
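For reference, pool.apply_async returns an AsyncResult object, which is always truthy, so if flag1: passes every read through, and because the results are never collected the pool's internal queues keep growing, which matches the hang described above. A minimal sketch of a chunked rewrite that actually collects results, reusing the thresholds from the script above (the plain (sequence, quality-scores) tuple representation is an assumption made to keep the sketch self-contained, in place of HTSeq read objects):

```python
import numpy as np
from multiprocessing import Pool

def passes_filter(read):
    """read: a (sequence, quality_scores) tuple. Same thresholds as the
    question: length 15-30, mean quality >= 30, at most 2 bases below Q20."""
    seq, qual = read
    return (15 <= len(seq) <= 30
            and np.mean(qual) >= 30
            and sum(q < 20 for q in qual) <= 2)

def filter_chunk(reads):
    # Filtering a whole chunk per task keeps the number of results
    # travelling back through the pool small (one list per chunk,
    # not one object per read).
    return [r for r in reads if passes_filter(r)]

def parallel_filter(reads, processes=4, chunk_size=10000):
    # Split the reads into chunks and let pool.map collect every result,
    # so nothing is left queued when the pool shuts down.
    chunks = [reads[i:i + chunk_size]
              for i in range(0, len(reads), chunk_size)]
    kept = []
    with Pool(processes) as pool:
        for chunk_result in pool.map(filter_chunk, chunks):
            kept.extend(chunk_result)
    return kept
```

The chunk size and process count here are illustrative; the point is that pool.map blocks until all results are retrieved, unlike the fire-and-forget apply_async calls in the script.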
Comments:

Make sure filtering() actually returns something, though that's likely not the root of the problem. – Devon Ryan Jun 12 '19 at 08:56

tqdm only reports time in such cases; it does not do a list conversion as you indicated. You can easily test with an infinite generator: https://pastebin.com/NJvwf5DT – Chris_Rands Jun 12 '19 at 08:58

Remove the time.sleep() call for starters, and I think you're missing an else: flag1 = False clause. If you still have issues, remove the multiprocessing code to assist your debugging. – Chris_Rands Jun 12 '19 at 09:07

I tried Pool(2), but the server still hangs very severely after processing one file. – Lot_to_learn Jun 12 '19 at 11:09

I added the else statement. For the tqdm approach, my only aim is to check how fast this script is checking sequences. When I run this script, for the first file I saw that it processes 8000-10000 reads per second (very fast) and gets through a 3 GB FASTQ file in a few seconds. But after that it blocks everything; I can't even kill the process. Thanks. – Lot_to_learn Jun 12 '19 at 11:11

When I print flag1, I get the multiprocessing object details rather than its actual value. I also tried flag1.get but got the object description again. Again, on a small file this script is OK. Many thanks for the positive edit to make the question more informative. – Lot_to_learn Jun 12 '19 at 11:16

Regarding flag1 = False: it was an issue that the previous edit caused by changing the indentation. Now (after my edit) the function returns None (which is falsy anyway) without an else-clause. You should still remove the sleep() call, but I think your issue is due to the multiprocessing; strip that part out and see if it works. – Chris_Rands Jun 12 '19 at 11:27

Given the issues with the multiprocessing module, I would solve this (per file) via a pipelining wrapper like snakemake or ruffus. – Chris_Rands Jun 12 '19 at 12:28

In pool.apply_async(filtering, read), I am passing a read sequence to the function; I get the read sequence from the for loop. I may have little knowledge about the per-file application of multiprocessing. As far as I know, I can use it on a very heavy file to make the filtration task faster by splitting it into many parallel processes. – Lot_to_learn Jun 12 '19 at 12:45

str(file.split('.')[0]) is likely not doing what you want when a file name contains more than one dot. And it seems to me that str is unnecessary: you already have strings. You might be interested in os.path.splitext. – bli Jun 12 '19 at 20:33

My file names are like Sample_ID.fastq, so in all the files there is only one dot. For str you are right; I can remove that. – Lot_to_learn Jun 13 '19 at 04:05

The Pool.apply_async object may be the reason the system hangs, because this may set up an infinite loop. – Lot_to_learn Jun 13 '19 at 04:07

About bbduk: I have checked it, and I found quality trimming based on mean quality and based on length. There are still two conditions for which I have to write a script: filter reads having more than 2 bases under quality score 20, and filter reads whose unique sequence has counts less than 10 in the FASTQ file. – Lot_to_learn Jun 13 '19 at 04:11
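The two leftover conditions from the last comment (at most 2 bases under Q20, and keeping only reads whose exact sequence occurs at least 10 times in the file) don't need multiprocessing at all; the count-based filter just needs two passes. A sketch over plain (sequence, quality-scores) tuples (an assumed in-memory representation; for a real 3 GB FASTQ you would iterate the file twice rather than hold everything in a list):

```python
from collections import Counter

def low_quality_bases(qual, threshold=20):
    """Number of bases whose quality score falls below the threshold."""
    return sum(q < threshold for q in qual)

def filter_reads(reads, max_low_q=2, min_copies=10):
    """reads: iterable of (sequence, quality_scores) tuples.
    Pass 1 counts exact duplicate sequences; pass 2 keeps reads with
    at most max_low_q bases under Q20 whose sequence occurs at least
    min_copies times."""
    reads = list(reads)                       # two passes needed
    counts = Counter(seq for seq, _ in reads)
    return [(seq, qual) for seq, qual in reads
            if low_quality_bases(qual) <= max_low_q
            and counts[seq] >= min_copies]
```

Note that every copy of a sequence counts toward the occurrence threshold, including copies that later fail the quality condition; adjust the order of the two passes if that is not the intended semantics.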