We are building a sequence-based deep-learning model to predict the binding affinity between an antibody and an antigen, and we are training it on the AlphaSeq Antibody dataset (https://github.com/mit-ll/AlphaSeq_Antibody_Dataset). However, the dataset does not provide the sequence of the SARS-CoV-2 antigen.
So I read the paper, which says the antibodies were tested against a peptide in the HR2 region of the SARS-CoV-2 spike protein, with the sequence "PDVDLGDISGINAS" (https://www.nature.com/articles/s41597-022-01779-4#Sec3).
But this sequence is very short (14 residues). I'd like a larger protein sequence that includes non-epitope portions so that the prediction is more meaningful (e.g. the model predicts binding to the actual epitope when given a sequence that extends beyond the epitope). I searched online and only found a SARS-CoV-2 genome sequence, which isn't very helpful.
Searching the RCSB PDB website for "Spike protein HR2" returned two SARS-CoV-2 results.
One of them (entry 8CZI) contains the following two sequences:
>8CZI_1|Chains A, B, C|Scaffolded Spike protein S2' HR1|Nostoc punctiforme (strain ATCC 29133 / PCC 73102) (63737)
MSHHHHHHGSQTLLRNFGNVYDNPVLLDRSVTAPVTEGFNVVLASFQALYLQYQKHHFVVEGSEFYSLHEFFNESYNQVQDHIHEIGERLDGLGGVPVATFSKLAELTCFEQESEGVYSSRQMVENDLAAEQAIIGVIRRQAAQAESLGDRGTRYLYEKILLKTEERAYHLSHFLAKDSLTLGFAYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVE
>8CZI_2|Chains D, E, F|Spike protein S2' HR2|Severe acute respiratory syndrome coronavirus 2 (2697049)
KNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ
I am not sure which of these is the correct sequence for our task. Can someone provide some insight?
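To make the comparison concrete, here is a minimal sketch of the check I can run myself: it searches each of the two records above for the peptide from the paper. The helper names `parse_fasta` and `find_peptide` are my own, written just for this illustration, and an exact substring match is an assumption; it would miss a record that carries the peptide with a mutation.

```python
# Peptide reported in the AlphaSeq paper (copied from the text above).
EPITOPE = "PDVDLGDISGINAS"

# The two FASTA records from PDB entry 8CZI, pasted verbatim.
FASTA = """\
>8CZI_1|Chains A, B, C|Scaffolded Spike protein S2' HR1|Nostoc punctiforme (strain ATCC 29133 / PCC 73102) (63737)
MSHHHHHHGSQTLLRNFGNVYDNPVLLDRSVTAPVTEGFNVVLASFQALYLQYQKHHFVVEGSEFYSLHEFFNESYNQVQDHIHEIGERLDGLGGVPVATFSKLAELTCFEQESEGVYSSRQMVENDLAAEQAIIGVIRRQAAQAESLGDRGTRYLYEKILLKTEERAYHLSHFLAKDSLTLGFAYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVE
>8CZI_2|Chains D, E, F|Spike protein S2' HR2|Severe acute respiratory syndrome coronavirus 2 (2697049)
KNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQ
"""


def parse_fasta(text):
    """Return {record_id: sequence}; record_id is the header text before the first '|'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split("|")[0]
            records[name] = ""
        elif name is not None:
            # Sequence lines may wrap, so append to the current record.
            records[name] += line
    return records


def find_peptide(records, peptide):
    """Return {record_id: 0-based offset of peptide, or -1 if absent}."""
    return {name: seq.find(peptide) for name, seq in records.items()}


records = parse_fasta(FASTA)
hits = find_peptide(records, EPITOPE)
for name, pos in hits.items():
    status = f"found at 0-based offset {pos}" if pos >= 0 else "not found"
    print(f"{name}: {status}")
```

Only the 8CZI_2 record (the HR2 chain from SARS-CoV-2) contains the peptide; the 8CZI_1 record is the scaffolded HR1 construct and does not. That tells me which chain includes the epitope, but not whether this 45-residue fragment is the right antigen context for training.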
Also, why do so many sequences come up when I search for SARS-CoV-2? Is there a reference version that can be considered the root from which the other versions are derived?