I have a data set of text strings, each of which contains a repeated pattern, and I want to deduce how many times the pattern occurs. For instance, given the input 'abcdefabcdefabcdef', the answer should be 3.
Unfortunately, my patterns will contain noise, so they will look more like: 'abcdefabcdeefbacdef'
I thought of using association rules or trying some kind of spatial analysis, but I cannot find a straightforward way to tackle the problem.
Any help/advice would be appreciated.
Edit 1
The pattern is not provided as an input and is not known beforehand. Regarding the noise, the patterns will be disturbed about 5% of the time. These disturbances will usually be either the repetition of a prior element or the skipping of an element.
My aim is to maximise a weighted mean of the Levenshtein distance and the Jaccard index over several candidate values of k (the number of times the pattern occurred), so that I can find the best k.
This is the first time I have faced a problem like this, and I have not been able to find much on the topic. I will start working with the suggestions Matthew made in his comment, but I would be grateful if someone could point me to an example or a theoretical framework on the topic.
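One way to make this search over k concrete is sketched below in Python. It rests on several assumptions that are not part of the problem statement: the candidate pattern for a given k is simply the first round(n / k) characters of the string (so the first repetition is assumed to be clean), the Levenshtein distance is converted to a similarity in [0, 1] so that "maximise" is meaningful, the Jaccard index is taken over the sets of characters, and k = 1 is excluded because it trivially scores 1. The names `score`, `estimate_k` and the weight `w` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard index of the character sets of a and b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def score(text: str, k: int, w: float = 0.5) -> float:
    """Weighted mean of Levenshtein similarity and Jaccard index for a given k."""
    # Assumption: the first repetition is clean and about len(text) / k long.
    pattern = text[:max(1, round(len(text) / k))]
    candidate = pattern * k
    lev_sim = 1 - levenshtein(text, candidate) / max(len(text), len(candidate))
    return w * lev_sim + (1 - w) * jaccard(text, candidate)


def estimate_k(text: str, w: float = 0.5) -> int:
    """Return the k >= 2 with the highest score; ties go to the larger k."""
    if len(text) < 2:
        return 1
    return max(range(2, len(text) + 1), key=lambda k: (score(text, k, w), k))


print(estimate_k('abcdefabcdefabcdef'))
print(estimate_k('abcdefabcdeefbacdef'))
```

If the first repetition itself is noisy, this prefix-based candidate will be wrong, so a more robust version would need to build a consensus pattern from all k chunks instead.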
a.b.c.d instead of abcd. The . character matches anything in a regular expression.) (2) There are various metrics for the distance between strings, such as the Levenshtein distance. – Matthew Gunn Sep 22 '16 at 15:13