
I have a dataset of text strings, each containing a repeated pattern, and I want to deduce the number of times the pattern occurs. For instance, given the input 'abcdefabcdefabcdef', the answer should be 3.

Unfortunately, my patterns will contain noise, so the strings will look more like: 'abcdefabcdeefbacdef'

I thought of using association rules or some kind of spatial analysis, but I cannot find a straightforward way to tackle the problem.

Any help/advice would be appreciated.

Edit 1

The pattern is not provided as an input and is not known beforehand. Regarding the noise, the patterns will be disturbed about 5% of the time. These disturbances will usually be the repetition of a prior element or the skipping of an element.

My aim is to optimise a weighted mean of the Levenshtein distance and the Jaccard index over several values of k (the number of times the event happened), so I can find the best k.
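To illustrate what I mean, here is a rough sketch using only the Levenshtein part (the chunking scheme and scoring are my own simplification, not a finished method): for a candidate k, cut the string into k roughly equal pieces and score how alike the pieces are with base R's adist().

```r
# Sketch: for a candidate k, cut the string into k roughly equal pieces
# and score the mean pairwise Levenshtein distance between the pieces.
# Lower scores mean the pieces are more alike, so the best k minimises it.
string <- 'abcdefabcdeefbacdef'

score_k <- function(s, k) {
  n <- nchar(s)
  cuts <- round(seq(0, n, length.out = k + 1))        # k cut points
  pieces <- substring(s, cuts[-(k + 1)] + 1, cuts[-1]) # the k pieces
  d <- adist(pieces)                                   # pairwise Levenshtein
  mean(d[upper.tri(d)])
}

sapply(2:5, function(k) score_k(string, k))
```

With equal-length cuts this is only a starting point; allowing the cut points to move would make the pieces line up better with the noisy repetitions.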

It is the first time I have faced a problem like this and I have not been able to find much on the topic. I will start working with the suggestions Matthew made in his comment, but I would be grateful if someone could point me to an example or a theoretical framework on the topic.

Jon Nagra
    Quick reactions: (1) Scanning strings for patterns can be quickly done with regular expressions. You may be able to incorporate some ad-hoc, allowable fudge factor to your regular expression. (eg. match a.b.c.d instead of abcd. The . character matches anything in a regular expression.) (2) There are various metrics for the distance between strings such as the Levenshtein distance. – Matthew Gunn Sep 22 '16 at 15:13
    Is this "pattern" given as input or does it need to be determined from the string itself? What are the forms the "noise" can take? What can you say about the probability distribution of this noise? – whuber Sep 22 '16 at 20:41
  • Thank you for the edit. But what exactly do you mean by a "pattern"? For instance, in the noisy example the substring "e" occurs four times, so why shouldn't it be identified as the "pattern" and the rest of the string as "noise"? – whuber Sep 23 '16 at 14:18
  • In principle, in my real dataset, I will be able to assess the quality of the "patterns" once given. I will likely be looking for the longest string with the fewest letter repetitions within it. However, prior to this assessment, I need to be able to partition the string into k pieces. My main concern is how to make this cut, as I want the pieces to be as similar as possible to each other. – Jon Nagra Sep 25 '16 at 17:32

1 Answer


Here is a simple approach:

> string <- 'abcdefabcdeefbacdef'
> target <- 'abc'
> token <- '@' # a character that will never appear in the string
> processed_string <- gsub(target, token, string)
> lengths(regmatches(processed_string, gregexpr(token, processed_string)))
[1] 2

If you want to account for noise, you could make target a vector of noisy variants of your pattern and loop through it. How you define that vector will depend on your application; it might be ad hoc or programmatic.
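For example (the variants below are just guesses at what noisy versions might look like; you would generate yours from the kinds of noise you actually expect):

```r
# Sketch: count occurrences of any of several noisy variants of the pattern.
# The variants are hypothetical, reflecting repeated or skipped characters.
string  <- 'abcdefabcdeefbacdef'
targets <- c('abcdef',   # clean pattern
             'abcdeef',  # one character repeated
             'abcdf',    # one character skipped
             'bacdef')   # two characters swapped
token <- '@' # a character that never appears in the string

total <- 0
for (target in targets) {
  # fixed = TRUE treats the variant as a literal string, not a regex
  processed <- gsub(target, token, string, fixed = TRUE)
  total <- total +
    lengths(regmatches(processed, gregexpr(token, processed, fixed = TRUE)))
}
total
```

One caveat: each variant is matched against the original string, so a stretch that happens to match two different variants would be counted twice; replacing matches cumulatively (carrying processed across iterations) avoids that.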

Also, this is more of a Stack Overflow question, unless you've got a statistical question about probabilistic text matching or something similar.

generic_user
    Thanks for the response. I edited the question because it was unclear that the pattern was not known beforehand. Sorry for the inconvenience. – Jon Nagra Sep 23 '16 at 06:52