I have posted a question related to this problem over a year ago and we still were not able to figure this out.
We have two groups, A and B that we want to train on to separate them. Both have numerous observations of "text" so for example:
group A:
- AAABBBC*CAAAAAAAC
- CCCBBBC*CAAAAAAAB
- CBBBBBC*CAAAAAAAB
group B:
- AAACCCC*CAAAAAAAA
- CCCCCCC*CAAAAAAAA
- CBBCCCC*CAAAAAAAA
Notably, our original datasets are much larger, with around 4,000 observations for A (with really specific patterns) and around 20,000 for group B. We want is a model that sees things like:
- if there is a
Cat position 1 we see aBon the end in group A (2/3), and we do not see this in group B (0/3) - That we only find the motif
AAABBBin group A - if we see
AAABBBwe also saw aCat the end in group A (1/3) but we did not see this in group B (0/3)
We tried LDA now (after converting this data to binary vectors), however, this would score each letter independently. To illustrate if group A would have two subgroups:
sub1: position1 = A + position10 = Csub2: position2 = A + position15 = B
and both are not common in group B then a method like LDA would also score position1 = A (sub1) + position15 = B (sub2) extremely high even tho they are actually part of different dependencies within group A, so we are looking for an alternative taking care of such dependencies when differentiating groups.
We really hope someone here can help us out!
EDIT
To explain the dependencies better as asked by @carlo I made a grossly simplified example:
For pink we see that:
- A at pos
1is often associated with A at pos3 - A at pos
1can also be associated with B at pos2where it is often followed by A at3and less often by C at pos3
For blue we see that:
- A at pos
1is often associated with A at pos2and then sometimes followed by C at3.
Then when we run a classifier we want it to see that a sequence such as AXA would be probably group I but that a sequence such as ABA is even more likely to be group I, and that a sequence such as AAB would be group II.