25

Given two strings x and y, I want to build a minimum size DFA that accepts x and rejects y. One way to do this is brute force search. You enumerate DFA's starting with the smallest. You try each DFA until you find one that accepts x and rejects y.

I want to know if there is any other known way of finding or building a minimum size DFA that accepts x and rejects y. In other words, can we beat brute force search?

More Detail:

(1) I really do want an algorithm to find a minimum size DFA, not a near minimum size DFA.

(2) I don't just want to know how large or small the minimum DFA is.

(3) Right here, I'm only focused on the case were you have two strings x and y.


Edit:

Additional information for the interested reader:

Suppose $x$ and $y$ are binary strings of length at most $n$. It is a known result that there is a DFA that accepts $x$ and rejects $y$ with at most $\sqrt{n}$ states. Notice that there are about $n^{\sqrt{n}}$ DFA's with a binary alphabet and at most $\sqrt{n}$ states. Therefore, the brute force approach wouldn't require us to enumerate through more than $n^{\sqrt{n}}$ DFA's. It follows that the brute force approach couldn't take much more than $n^{\sqrt{n}}$ time.

Slides that I found helpful: https://cs.uwaterloo.ca/~shallit/Talks/sep2.pdf

Michael Wehar
  • 5,127
  • 1
  • 24
  • 54
  • I think I've heard of some learning approaches, but I'm not sure if they work for this specific case. If you could explain one such approach, that would be quite helpful. – Michael Wehar Jan 12 '15 at 19:13
  • 2
    @AndrásSalamon Is it still NP-complete if the sets to be distinguished each consist of only one string? It feels to me like this should be reasonably tractable. – mhum Jan 12 '15 at 20:45
  • 1
    I didn't mention it in the question, but to the best of my knowledge, that's a possibility. However, if x and y have length n, then it is known that there is a DFA with at most $\sqrt{n}$ states or less that accepts x and rejects y. Therefore, the brute force search would give an answer in $2^{\sqrt{n}\log(\sqrt{n})}$ time or less. Link: https://cs.uwaterloo.ca/~shallit/Talks/sep2.pdf – Michael Wehar Jan 12 '15 at 22:31
  • 2
    @MichaelWehar Is it insufficient to construct a simple non-minimal DFA that separates x and y and then run a standard DFA minimization algorithm? The simple non-minimal DFA can have $O(n)$ states (e.g.: if it's just based on the first differing character) and then Hopcroft runs in $O(n log n)$. – mhum Jan 12 '15 at 23:04
  • @mhum Small correction: it runs in $O(ns\cdot logn)$ where $s$ is the alphabet size. But this shouldn't change the running time in most cases. – Ryan Dougherty Jan 13 '15 at 00:51
  • 7
    @mhum the problem that there are many different regular languages that separate the two strings — DFA minimization will find the best automaton for any one of these languages but will do nothing to compare it to automata for the other separating languages. – David Eppstein Jan 13 '15 at 07:01
  • @DavidEppstein Ah yes. Of course. – mhum Jan 13 '15 at 07:22
  • 1
    Do you want to do this in practice, or are you more interested in the theory (an algorithm with a provable bound on its worst-case running time)? If you want to do this in practice, one approach is to use a SAT solver (this is one standard approach to finding a separating DFA). If that's what you want, edit the question to say so and I'll add an answer elaborating. – D.W. Jan 13 '15 at 14:59
  • @AndrásSalamon Hi, although I am not super familiar with the constructions, I believe Robson showed that you can separate any two bit strings of length n using a DFA with at most $n^{\frac{2}{5}}\log^{\frac{3}{5}}(n)$ states. Although, this just gives us an upper bound on the size of the minimum DFA. It doesn't really help us in constructing it. That's why I didn't include references on this. – Michael Wehar Jan 13 '15 at 19:03
  • @D.W. It would be nice to know the best approach in practice. Also, it would be nice to understand the complexity of this problem. But, the question I really want to answer right now is whether or not there is a known approach that always beats brute force search. – Michael Wehar Jan 13 '15 at 19:04
  • 4
    If $x$ and $y$ are different lengths, with the larger of length $n$, it is easy to quickly find a DFA with $O(\log n)$ states that separates them: just use a cycle of length $p$, where $p$ doesn't divide $|x|-|y|$. Find $p$ by trying $2, 3, 5,\ldots$ in order until you find the appropriate $p$. If $x$ and $y$ are the same length, then the $O(\sqrt{n})$ construction of Robson, in a 1996 paper, gives a simple machine that can be found by a search of size $O(n)$. Neither construction is guaranteed to be the smallest DFA. – Jeffrey Shallit Jan 13 '15 at 21:09
  • 3
    Shallit's notes, linked above, include the useful observation that the worst case for the separation problem is when the alphabet is binary: it's always possible to partition larger alphabets into two subsets that still distinguish the two input words and search for a binary automaton that treats letters in one subset as 0's and letters in the other subset as 1's. But for seeking the minimum separating automaton this doesn't seem to help, because you might be able to use the extra information from the original alphabet to do better than you could with a mapping to a binary alphabet. – David Eppstein Jan 13 '15 at 23:14
  • 3
    a special case of this other recent question where in-set and out-set sizes equal 1. minimal finite automata given in-words and out-words. that answer lists some learning literature incl some heuristics. – vzn Jan 14 '15 at 04:05
  • @AndrásSalamon Maybe I made a mistake? But, I think there are only $2^{O(\sqrt{n}\log(\sqrt{n}))}$ DFA's with $\sqrt{n}$ states over a binary alphabet. This is how I was counting them. Label each state 1, 2, 3, ..., $\sqrt{n}$. For each state i, we have to pick somewhere to transition on a 0 and somewhere to transition on a 1. There are $\sqrt{n}$ choices for each. Therefore, we get $\sqrt{n}^{2\sqrt{n}}$ = $2^{O(\sqrt{n}\log(\sqrt{n}))}$. If there is something wrong with this, please do let me know. – Michael Wehar Jan 14 '15 at 16:56
  • @AndrásSalamon Hi again, I think you made a mistake there. We make two choices per state and there are $\sqrt{n}$ many states. Therefore, we make $2\sqrt{n}$ choices in total. For each choice, there are $\sqrt{n}$ possibilities. – Michael Wehar Jan 15 '15 at 00:09
  • 1
    @AndrásSalamon Another way of thinking about it is that for each state there are n many ways to make the two choices. Therefore, we get $n^{\sqrt{n}}$. – Michael Wehar Jan 15 '15 at 00:11
  • 1
    You are right, $n^\sqrt{n}$ makes sense. Again, please do edit the question to include these useful facts so someone finding it doesn't have to unscramble a bunch of comments. – András Salamon Jan 15 '15 at 06:57
  • @AndrásSalamon Ok, cool. Thanks again for your comments. :) – Michael Wehar Jan 16 '15 at 22:36
  • 2
    this seems not to be cited yet, the Shallit slides are connected to this paper coauthored by him. Remarks on Separating Words / Demaine, Eisenstat, Shallit, Wilson – vzn Apr 14 '15 at 15:05
  • 1
    @vzn Thanks! I also found this paper to be quite relevant and interesting: http://drops.dagstuhl.de/opus/volltexte/2014/4854/pdf/30.pdf – Michael Wehar Apr 14 '15 at 16:20
  • It's on counting which is somewhat related to separating words. :) – Michael Wehar Apr 21 '15 at 04:46

1 Answers1

11

If I had to do this in practice, I would use a SAT solver.

The question of whether there is a DFA with $k$ states that accepts $x$ and rejects $y$ can be easily expressed as a SAT instance. For instance, one way is to have $2k^2$ boolean variables: $z_{s,b,t}$ is true if the DFA transitions from state $s$ to state $t$ on input bit $b$. Then add some clauses to enforce that this is a DFA, and some variables and clauses to enforce that it accepts $x$ and rejects $y$.

Now use binary search on $k$ to find the smallest $k$ such that a DFA of this sort exists. Based on what I've read in papers on related problem, I would expect that this might be reasonably effective in practice.


Other encodings of this as SAT are possible. For instance, we can use a trace encoding:

  • If $x$ is of length $m$, you could add $m\lg k$ boolean variables: let $s_0,s_1,\dots,s_m$ be the sequence of states traversed on input $x$, and represent each $s_i$ using $\lceil \lg k \rceil$ boolean variables.

  • Now for each $i,j$ such that $x_i=x_j$, you have the constraint that $s_{i-1}=s_{j-1} \implies s_i=s_j$.

  • Next, extend this to handle $y$: let $t_0,\dots,t_n$ be the sequence of states traversed on input $y$, and represent each $t_j$ using $\lg k$ boolean variables. For each $i,j$ such that $y_i=y_j$, add the constraint that $t_{i-1}=t_{j-1} \implies t_i=t_j$.

  • Similarly, for each $i,j$ such that $x_i=y_j$, add the constraint that $s_{i-1}=t_{j-1} \implies s_i=t_j$.

  • Both traces must start from the same starting point, so add the requirement that $s_0=t_0$ (WLOG you can require $s_0=t_0=0$).

  • To ensure that the DFA uses only $k$ states, require that $0 \le s_i < k$ and $0 \le t_j <k$ for all $i,j$.

  • Finally, to encode the requirement that $x$ is accepted and $y$ is rejected, require that $s_m \ne t_n$.

All of these requirements can be encoded as SAT clauses.

As before, you'd use binary search on $k$ to find the smallest $k$ for which such a DFA exists.

D.W.
  • 12,054
  • 2
  • 34
  • 80
  • 3
    note this will actually be superior to brute force search if there are certain symmetries in the problem and they are recognized by the solver, but can currently be difficult to identify/ isolate those (either for human or machine). there is also some newer/ related "technology" of satisfiability modulo theories and answer set programming some of which has "built-in" graph predicates or can support their definitions. – vzn Jan 14 '15 at 18:25