I have some FASTQ files (from nanopore sequencing) whose reads each take one of these 5 forms:
- a known CDS with 3'UTR:
CDS----------------------- (seq1: original sequence)
- the same CDS with a 50 bp block (represented by |||||) that replaces part of the 3'UTR at one of 4 different locations:
CDS|||||------------------ (sub1: substitution 1)
CDS-----|||||------------- (sub2: substitution 2)
CDS----------|||||-------- (sub3: substitution 3)
CDS---------------|||||--- (sub4: substitution 4)
I need to split my FASTQ files into 5 smaller files, one per sequence type (seq1, sub1, sub2, sub3 and sub4). The problem is that all these sequences share some segments, and I don't know which method to apply. I tried finding the best match by building a BLAST database from the 5 sequences, but because BLAST performs local alignment, I get HSPs for every sequence and cannot discriminate between them.
Does anyone have an idea how I could achieve this?
Any help would be amazing.
Edit:
Here are the 5 sequences as I expect them (i.e., the FASTA reference):
>seq1 (this one does not have the 50bp block)
ACCTGCCAGTTCCTGAAGGGTGTACTGATCCTGTGGCTGAAAACTTTGATCCAACGGCTAGAAGTGACGATGGAACCTGTGTCTACAACTTTTGAGCAATATTATCCTGCTTATTAATTTGCTGTTTTACTCCTATTGTCTCTTTTGGTTTATTTTTCTCCTTTGTGTAATTGTGGATTGGATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAA
>sub1
_GTAAGTGAGGGAGTCTCGGACTCAATTAGAGGCTCTTCTTTCACA_ TTGATCCAACGGCTAGAAGTGACGATGGAACCTGTGTCTACAACTTTTGAGCAATATTATCCTGCTTATTAATTTGCTGTTTTACTCCTATTGTCTCTTTTGGTTTATTTTTCTCCTTTGTGTAATTGTGGATTGGATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAA
>sub2
ACCTGCCAGTTCCTGAAGGGTGTACTGATCCTGTGGCTGAAAACT _GTAAGTGAGGGAGTCTCGGACTCAATTAGAGGCTCTTCTTTCACA_ TTTGAGCAATATTATCCTGCTTATTAATTTGCTGTTTTACTCCTATTGTCTCTTTTGGTTTATTTTTCTCCTTTGTGTAATTGTGGATTGGATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAA
>sub3
ACCTGCCAGTTCCTGAAGGGTGTACTGATCCTGTGGCTGAAAACTTTGATCCAACGGCTAGAAGTGACGATGGAACCTGTGTCTACAACT _GTAAGTGAGGGAGTCTCGGACTCAATTAGAGGCTCTTCTTTCACA_ TTGTCTCTTTTGGTTTATTTTTCTCCTTTGTGTAATTGTGGATTGGATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAA
>sub4
ACCTGCCAGTTCCTGAAGGGTGTACTGATCCTGTGGCTGAAAACTTTGATCCAACGGCTAGAAGTGACGATGGAACCTGTGTCTACAACTTTTGAGCAATATTATCCTGCTTATTAATTTGCTGTTTTACTCCTA _GTAAGTGAGGGAGTCTCGGACTCAATTAGAGGCTCTTCTTTCACA_ GATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAA
Examples (the 50 bp block, if present, is in lowercase here):
- This read should be assigned to sub3 because of the sequences flanking the 50 bp block (...CTACAACT on the 5' side and TTGTCTCTTTT... on the 3' side)
@e213fdd5-bbfb-4dfe-9b03-29eea256da0d
TGTGTACTTCGTTCAGTTACGTATTACTAGTTATTGAGTGTCTTTGTGTTTCTGTTGGTGCGTCTTCGCACAAGGCTAATCTTACTCTCCTCTCCAATGTACTGCTTCTATGTTCATCTCAGCCGAATTCAATAAGGAGAAGAACTTTCACTAGGTTTCCCCATTCTGTTCGATTAATTGGTGATGTTAGTAAATTTCAATTTTCTGTCGGTGGAAGGTGAAGGTGATTGCTGCTGGCTTTTCATTGTTTACTTTTGGCTTCTGTTTTGTAGTAACGCATCACTACTTTCTCTTATAGTGTTCAATGCTTTCCAAGTCTGGAATCATGAAGCAGCGGCTTCTTCAAGAGCGCCATGCCTGAGGGATACGTGCAGGAGAGGACCATCTTCTTCAAGGACGACGGGAACTACAAGACGTGCTGAAGTCAAGTTTGGGGAGACACCCTCGTCAATGAGATCGAGCTTAAGGGAATCGATTTCAAGGAGGACGGAAACATCCTCGGCCACAAGTTGGAATACAACTACAACTCCCACAACGTATACATCATAGCCGACAAAACCCAAAAGACGGCATCAAGAAACCAACTTCAGACCTTACCATCAGGCGGCGGCCATCACACAGTCATTGTCAACAAATGTAATTGGCGGTGGCTATTCTTTTTACCCAGACAACCATTACCTTGTCCACACAGTCACACTGTGAAAAGATCCCAACTTTGAAAGAGAGGCCACTGGTCTTCTTGAGTTTGTAACAACTGCTGGGATTACACATGGCATGGATGAACTATACAAACATGACAGACTCTAAAGCCTGCAGTTCCTGAAGAGATTGTACTGATCCTGTGGCTGAAAACTTTGATCCAACGGCTAGAAGTGACGATGGAACCTGTGTCTACAACTgtaagtgagggagtctcggactcaattagaggctcttctttcacaTTGTCTCTTTTGGTTTATTTTCTCCTTTGTGTAATTGTGGATTGGATCTTGTCCTCTTTTGTTTTCCTTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACTGTAGGCACCATCAATGAAGATAGAGCGACAGGCAAGTCACAAAAACACCGACAACTTTCTTGTCACGGTAGGCGATAGCAATACGTAACTG
+
$&&&%234441124--585545;>B99968650-++*()'&'(;D540488?=888,++*+*)*$$$##&%$$$%)*****('%$$'1/+++-,('))'%$$%$$$&)&%%%&(*'&),/'''&&'06872..*&%)),,3/49:;<?A773)('()/066+'&&)('(+&%$&''$$&&()*--888)('()'')')'''++*+)&')&$$$%()'')*''&&))&'&$&(&$#$%&%&&&%%$%&..&%%%%&&%$%&'&$$'-)(%%%$&(*,+****&'''('+17<{..+*('((*+)*&&&'**.*%$&%&&%$$$%&&'&&(+'(&&&%%&&($$%&&'+55536988;;74632+(*)''%6<=(((&*''(68998===?ABBBB=B?<;;96667>>?>>?@?A=-+.7556<@>A<0//002(()-989::::;<<>>;:5++++->;::;<?F:7<()988;;<?>9;9)36:5-963/002B@A@@?554543,,9<=<@?@>777461166.../B88<;<@=<:888;;>42(''%%(%%+*('&%%&&**)(()*+020*'('&$$$'('''&%'((%(&$##$&)*()&%%%$$#$%'&(&'%$$%%&'%%%$%%%$&('$##$&&&&$$$%$%$$%&%%&040(((&'.023266,,,++0)++-.-*,,0+&&'$&''''%%'&-+,0((+))*+*(%&)+00.3)''%'&'''(*(%%)++++,--+.....0'(()7:<=;=<<<<<0.<;9:;<@;;;:<<=;2+)++,1212611)())*:1++*'&&%((&&056/,454))*''''(3346;<;;;;;000.0/043546??=;<*)+64335663455665333788:78779<>9755446::<><198662*+7899722248867>?8++4))+6779<<BA><9;<;:;;{<?BB-,,-13444/154///933{3==@B:9989::::::;;<<=:3335=EBB;:*)..'-:69?@CKKKDC<<?>:)()((012.,,+,55;=?>@>:98:78544344*)%$$$&)))))))))(((((''''&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&')+.1866677978:;;:::;;;99::988889653456:::9889;::8899.,+0668:;:::8436735//:632/-,+**+013459;9332/+($
- This read should be assigned to sub4 because of the sequences flanking the 50 bp block (...TTTTACTCCTA on the 5' side, even though a C is missing in the read, and GATCTTGTCCT... on the 3' side)
@107dfd6d-2101-4fd8-aeb5-2c432e176d72
ATGTACTTCGTTCAGTTACGTATTGCTATCGCCTACCGTGACAAGAAAGTTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATGAAGACTAATCTTTTCTCTTTCTCATCTTTTCACTTCTCCTATCATTATCCTCGACCGAATTCAGTAAAGGAGAAGAACTTTTCACTGGGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGTACAAATTTTCTGTCAGTGGAAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGCCAACACTTGTCACTACTTTCTCTTATGGTGTTCAATGCTTTCAAGATACCAGATCATATGAAGCGGCACGACTTCTTCAAGAGCGCCATGCCTGAGGGATACGTGCGGGAGAGGACCATCTTCTTCAAGGACGACGGGAACTACAAGACACGTGCTGAAGTCAAGTTTGAGGAGACACCCTCGTCAACAGGATCGAGCTTAAGGGAATCGATTTCAAGGAGGACGGAAACATCCTCGGCCACAAGTTGGAATACAACCGCAACTCCCACAACGTATACATCATGGCCGACAAGCAAAAGAACGGCATCAAAGCCAACTTCAAGACCCGCCACAACATCGAAGACGGCGGCGTGCAACTCGCTGATCATTATCAACAAAATACTCCAATTGGCGATGGCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCGAAAGATCCCAACGAAAAATTTTTGCATGGTCCTTCTTGAGTTTGTAACGGCTGCTGGGATTACATGGCATGGATGAAGAACTGCAACACAAACACTTGACGAACTCTAAACCTGCCAGTTCCTGAAGAAATTGTACTGATCCTGTGGCTGAAAACTTTGATCCAACGGCTAGAAAGTGACGATGGAACCTGTGTCACCAACTTTTGAGCAATATTATCCTGCTTATTAATTTGCTGTTTTACTCTAgtaagtgagggagtctcggactcaattagaggctcttctttcacaGATCTTGTCCTCTTTTGTTCCCTTTTTTTTTTTATGATGTACAACACATTGGTAATTTAAAATTGCCTTGTCATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACTGTAGGCACCATCAATGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTTTCTTGTCACGGTAGGCGATAGCAATACGTAACT
+
-0,+))),:::<??===>=<<<?@>=<<=?;:889:<;<<=;;;;CD@?:9:A>9999=999DBAA@<BE=3220111<B@@AA@A?>>>=>>;:9999:9;<E9A<;?????C8788;==@?;;;==9989@A6677?=7788+**+368>B87::==@<99;<<<<<<??>@=:9/.))*698.*,/8A=<==>>=?BAB=::::;;;<?@>@B@@?:888:>?@CCCEDB>??>===::.+)()2157=7:99<B?>???@??=;::=DGC=<6?;84568>AB@CCBCBAA====AAACEI{AA=<::;<?@=<<>=>@AAA@>>>>?<?====A?==<=@>??@>?==??CA@AAA@AB=757:965(*.;<;<::?AAA@>=>::::<=?@BCC5336899=?>>?@A??42220;9:89.-***+77;97<;:;<<??BC{878<:9:<:2112<<>===@=<;;<>???877/,.-.-///0,++,)299:=;:;;=@@<;=7778?=::;;<;=<<C>AB<BBBCEDCDC?A>;;9:655:DAA@?>>>=?=><;<<@=>><=B=@=:870++,,;=;>><<==CBA@?>=>>=667<;;7778B>B?B<<==<99:::4:659;6642212:.-<=??=6556;??@@>::;:==;8887899==;2143235<<<?>==?CEBD?>>>ADB=720/.02334<00//06568>:888:711*)-776'''&06:<=><><866;{C@A?>8897=>EK558568552-179664)(()),,--,()))3:;:<BAAA>=>>>><<>>*))+/245999;>>=96;?B9888<<77-,,,,&&%%%&&'+6;A;2111--,,.:;><<ACCA@@?=9;;;>=>==@930../+)(())07>>8<><;:;9238<>=>998878:7700016545**1236434689<9<<;.--0-,,,&'():78862237@???><;;;;<<=@>@??@@===;<=>>;:8;)(67511223/.---.2001023:;21<?=**6223;;>?AACGA=73366:;<=:766;ACC=<;86000?8=@DHEC@@=;;<9999<<==9><==8889>A>=;30266777:972+***)))'&%%%%%$$$$$$######"""""""""""""""""""""""""""""""""""""""""#&')+-1122/...46688<<=0('%%&258899;92136:999;:::9:;;--2-.343668=<?@EF????==:7668;6667;><<;;>;;8851.*
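
To make the flank-based reasoning above concrete, here is a minimal Python sketch of one possible approach; it is not a tested pipeline. It looks for the block in each read with an error-tolerant substring search, then compares the junction between the block and its 3' flank against the four references. Everything beyond the sequences copied from the FASTA above is an assumption of mine: the function names (`infix_distance`, `classify`, `split_fastq`), the 15 nt window on each side of the junction, the 20% error tolerance, the choice of a shared 3'UTR stretch as the anchor for recognising seq1, and plain uncompressed 4-line FASTQ records.

```python
from collections import defaultdict

# Inserted block and the 15 nt that follow it in each reference (copied from the FASTA above).
BLOCK = "GTAAGTGAGGGAGTCTCGGACTCAATTAGAGGCTCTTCTTTCACA"
DOWNSTREAM = {
    "sub1": "TTGATCCAACGGCTA",
    "sub2": "TTTGAGCAATATTAT",
    "sub3": "TTGTCTCTTTTGGTT",
    "sub4": "GATCTTGTCCTCTTT",
}
# Junction signature (assumption): last 15 nt of the block + first 15 nt of the 3' flank.
JUNCTIONS = {name: BLOCK[-15:] + flank for name, flank in DOWNSTREAM.items()}
# Stretch of 3'UTR present in all five references, used here to recognise seq1 (assumption).
UTR_ANCHOR = "GATCTTGTCCTCTTTTGTTCC"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def infix_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)              # a match may start anywhere in text
    for i, p in enumerate(pattern, 1):
        cur = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cur[j] = min(prev[j] + 1,          # pattern base missing from the read
                         cur[j - 1] + 1,       # extra base in the read
                         prev[j - 1] + (p != t))  # match or mismatch
        prev = cur
    return min(prev)                           # the match may also end anywhere


def found(pattern, strands):
    """True if pattern occurs on either strand with at most 20% errors."""
    return min(infix_distance(pattern, s) for s in strands) <= len(pattern) // 5


def classify(read):
    """Return 'seq1', 'sub1'..'sub4', or 'unassigned' for one read sequence."""
    read = read.upper()
    strands = (read, revcomp(read))
    if not found(BLOCK, strands):
        # No block: call it seq1 only if the read still covers the shared 3'UTR.
        return "seq1" if found(UTR_ANCHOR, strands) else "unassigned"
    scores = {name: min(infix_distance(sig, s) for s in strands)
              for name, sig in JUNCTIONS.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= len(JUNCTIONS[best]) // 5 else "unassigned"


def split_fastq(path, prefix):
    """Write each 4-line FASTQ record to <prefix>.<class>.fastq; return counts."""
    counts = defaultdict(int)
    outputs = {}
    with open(path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:                  # end of file
                break
            label = classify(record[1].strip())
            if label not in outputs:
                outputs[label] = open(f"{prefix}.{label}.fastq", "w")
            outputs[label].writelines(record)
            counts[label] += 1
    for out in outputs.values():
        out.close()
    return dict(counts)
```

Calling `split_fastq("reads.fastq", "split")` (hypothetical file names) would write files such as `split.sub3.fastq`, plus `split.unassigned.fastq` for reads that match nothing well enough. Only the 3' junction is scored because it differs between all four substitutions. Scoring the 5' junction as well would probably make the calls more robust for sub2 to sub4. The pure-Python search is slow on large runs, so a dedicated aligner library with an infix mode (edlib, for example) is likely a better fit there.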