-5

Consider the following problem: given a number $n$, an alphabet $\Sigma$, and a finite language $L$, how many strings of length $n$ in $\Sigma^*$ contain at least one word $w\in L$? E.g. abcgodef contains the word go.

This is a toy problem I want to use to demonstrate the power of an algorithm I developed. I implemented it as a trivial python script, and was able to solve the above problem for $n = 50, \Sigma = [a-z], |L| = 5$ on my laptop.

Does anyone know of an existing technique capable of this? That is, can anyone calculate the answer, or am I the only one? (When I finish my paper, I'll be happy to link it here if people are interested).


Specifically, $L$ was

['the', 'of', 'and', 'you', 'that']
  • Please read [tour] and [help/on-topic]. – Kaveh Sep 07 '15 at 00:36
  • 1
    I don't think your toy problem illustrates any "power". It seems you could calculate this using a fairly standard dynamic programming approach, as for every $1\leq i\leq n$ and every prefix of a word in $L$ you'd just need to store how many strings of length $i$ would contain a word of $L$ if you prepended that prefix. – Tom van der Zanden Sep 07 '15 at 07:48
  • 2
    See Noonan and Zeilberger's exposition of the Goulden-Jackson Cluster method, http://dx.doi.org/10.1080/10236199908808197 for the general method. – András Salamon Sep 10 '15 at 14:10
  • 1
    This is a special case of the problem answered in http://cstheory.stackexchange.com/questions/8200/counting-words-accepted-by-a-regular-grammar – David Eppstein Sep 12 '15 at 03:30

1 Answers1

0

This problem can be solved using a fairly standard dynamic programming approach, in time $O(n\cdot \Sigma_{w\in L} |w|)$. This is not counting the complexity of big number arithmetic, which would bring the running time to something like $O(n^2 \log |L| \cdot \Sigma_{w\in L} |w|)$.

I believe the answer to your sample problem is (barring any mistakes in my program):

4361583841553844512216859806590291394203513538220428254669943773111822

(Approximately $4.36\cdot 10^{69}$, given that the total number of strings of length $50$ is $5.61\cdot 10^{70}$ this seems reasonable; "or" should be fairly common as a substring.)

In the following $||$ denotes concatenation, $\emptyset$ the empty string and $\bar{w}$ denotes the string $w$ with its first character removed.

$f(w,i)=\begin{cases} \Sigma|^i, & \text{if $w$ contains a word of $L$}.\\ 0, & \text{if $i = 0$}.\\ f(\bar{w}, i), & \text{if $w$ is not a prefix of any word in $L$}.\\ \Sigma_{c\in \Sigma} \space f(w || c, i-1), & \text{otherwise}. \end{cases}$

$f(\emptyset, n)$ is the value you're after. It can be computed (efficiently) using dynamic programming.