A simple challenge

Question

Consider the following problem: given a number $n$, an alphabet $\Sigma$, and a finite language $L$, how many strings of length $n$ in $\Sigma^*$ contain at least one word $w\in L$? E.g. abcgodef contains the word go.

This is a toy problem I want to use to demonstrate the power of an algorithm I developed. I implemented it as a trivial python script, and was able to solve the above problem for $n = 50, \Sigma = [a-z], |L| = 5$ on my laptop.

Does anyone know of an existing technique capable of this? That is, can anyone calculate the answer, or am I the only one? (When I finish my paper, I'll be happy to link it here if people are interested).

Specifically, $L$ was

['the', 'of', 'and', 'you', 'that']

I don't think your toy problem illustrates any "power". It seems you could calculate this using a fairly standard dynamic programming approach, as for every $1\leq i\leq n$ and every prefix of a word in $L$ you'd just need to store how many strings of length $i$ would contain a word of $L$ if you prepended that prefix. — Tom van der Zanden, Sep 07 '15 at 07:48
See Noonan and Zeilberger's exposition of the Goulden-Jackson Cluster method, http://dx.doi.org/10.1080/10236199908808197 for the general method. — András Salamon, Sep 10 '15 at 14:10
This is a special case of the problem answered in http://cstheory.stackexchange.com/questions/8200/counting-words-accepted-by-a-regular-grammar — David Eppstein, Sep 12 '15 at 03:30

Tom van der Zanden · Accepted Answer · 2015-09-07T18:44:33.203

0

This problem can be solved using a fairly standard dynamic programming approach, in time $O(n\cdot \Sigma_{w\in L} |w|)$. This is not counting the complexity of big number arithmetic, which would bring the running time to something like $O(n^2 \log |L| \cdot \Sigma_{w\in L} |w|)$.

I believe the answer to your sample problem is (barring any mistakes in my program):

4361583841553844512216859806590291394203513538220428254669943773111822

(Approximately $4.36\cdot 10^{69}$, given that the total number of strings of length $50$ is $5.61\cdot 10^{70}$ this seems reasonable; "or" should be fairly common as a substring.)

In the following $||$ denotes concatenation, $\emptyset$ the empty string and $\bar{w}$ denotes the string $w$ with its first character removed.

$f(w,i)=\begin{cases} \Sigma|^i, & \text{if $w$ contains a word of $L$}.\\ 0, & \text{if $i = 0$}.\\ f(\bar{w}, i), & \text{if $w$ is not a prefix of any word in $L$}.\\ \Sigma_{c\in \Sigma} \space f(w || c, i-1), & \text{otherwise}. \end{cases}$

$f(\emptyset, n)$ is the value you're after. It can be computed (efficiently) using dynamic programming.

edited Sep 07 '15 at 18:44

answered Sep 07 '15 at 17:44

Tom van der Zanden

533
3
9

Why don't we continue this discussion in the chat? – Tom van der Zanden Sep 07 '15 at 18:07
Now it's correct... unfortunately... – Elliot Gorokhovsky Sep 07 '15 at 18:49
I'm sorry... :( – Tom van der Zanden Sep 07 '15 at 18:51
I don't understand why the time isn't exponential, though, because of the last case. Also, can you include a proof sketch? – Elliot Gorokhovsky Sep 07 '15 at 18:51
Why doesn't $|\Sigma|$ show up in the time complexity? – Elliot Gorokhovsky Sep 07 '15 at 18:59
The reason the time isn't exponential is because of \emph{dynamic programming} (this is a very important technique, you should learn it!). Intermediate values can be memorized and reused. You can download the source code here. You are right, to be completely correct $|\Sigma|$ should show up in the running time. I ignored it because it is small compared to the rest. – Tom van der Zanden Sep 07 '15 at 19:07
OK, so if it was naively implemented it would take exponential time, but with memoization it's polynomial. – Elliot Gorokhovsky Sep 07 '15 at 19:09
Last bonus question: can your method be modified to find how many strings contain exactly 2 words? – Elliot Gorokhovsky Sep 07 '15 at 19:17
Yes, but please don't ask me to implement it because it would be really annoying :-) – Tom van der Zanden Sep 07 '15 at 19:23

A simple challenge

1 Answers1