17

Twenty years ago, I built an regular expression package that included conversions from regular expressions to a finite state machine (DFA) and supported a host of closed regular expression operations (Kleene star, concatenation, reverse, set operations, etc). I was unsure about the worst case performance of my package.

A DFA has the same expressive power as an NDFA, because an n-state NDFA can be trivially converted to a DFA having 2^n states. However, are there any lower upper bound guarantees for such a conversion that do not require an exponential explosion in state?

I was unable to come up with examples mal-behaving regular expressions or NDFAs, but I didn't spend much time thinking about it. I am guessing a regular expression like ((((e|A|B|C)*(e|D|E|F))*(e|G|H|I))*(e|J|K|L|M))* which mixes a lot of alternations and Kleene stars would have a linearly sized NDFA but an expansive DFA.

Wesner Moise
  • 311
  • 2
  • 8
  • Are there any restrictions on the class of NFAs that you would want to accept as input? Some restrictions lead to better upper bounds. – András Salamon Aug 17 '10 at 01:52
  • not a very important point, but need ndfa be its own tag? – Lev Reyzin Aug 18 '10 at 15:47
  • Yes there are restrictions. The NFAs are constructed directly from regular expressions by treating them as generalized transition graphs. http://www.seas.upenn.edu/~cit596/notes/dave/regexp-nfa4.html – Wesner Moise Aug 18 '10 at 16:37

3 Answers3

16

It is known that for every pair of naturals numbers n,a such that n <= a <= 2^n, there exists a minimal NDFA with n states whose corresponding equivalent minimal dfa has a states (over a four letter alphabet).

See the paper here: Deterministic blow-ups of minimal nondeterministic finite automata over a fixed alphabet.

Abstract of the paper:

We show that for all integers n and α such that n ≤ α ≤ 2^n, there exists a minimal nondeterministic finite automaton of n states with a four-letter input alphabet whose equivalent minimal deterministic finite automaton has exactly a states. It follows that in the case of a four-letter alphabet, there are no "magic numbers", i.e., the holes in the hierarchy. This improves a similar result obtained by Geffert for a growing alphabet of size n + 2 (Proc. 7th DCFS, Como, Italy, 23-37).

So, I suppose the answer to your question is, no.

Aryabhata
  • 1,855
  • 3
  • 21
  • 29
  • the question is asking for an "algorithm" running in subexponential time and space for converting the NFA. – Marcos Villagra Aug 17 '10 at 01:39
  • @Marcos: If your output is exponential, you cannot possibly have an algorithm that runs in sub-exponential time. – Aryabhata Aug 17 '10 at 01:40
  • 1
    This is a general result. If there are known restrictions on the class of input NFAs, then it might be possible to do better. – András Salamon Aug 17 '10 at 01:50
  • @Andras: Agreed, but given that this is probably programming related (which will support Kleen * etc), I doubt if the set of input NFA will be restricted to a proper subset. – Aryabhata Aug 17 '10 at 01:55
  • 5
    This result was recently strengthened to use a three letter alphabet, and the constructions are a bit simpler: http://portal.acm.org/citation.cfm?id=1575787.1575813&coll=Portal&dl=GUIDE&CFID=98498868&CFTOKEN=60579493 –  Aug 18 '10 at 05:11
14

The classic example for a language with an exponential separation between DFA size and NFA size is the following finite language: binary strings of length exactly 2n in which the first half is not equal to the second half. A NFA would guess an index i in which the first and second half disagree. A lower bound for a DFA follows from communication complexity, for example.

Noam
  • 9,369
  • 47
  • 58
  • Another maybe simpler example is "the n-th character before the end of the word is an $a$". Can be done with $n+1$ states with an NFA that guesses when the end of the word occurs, whereas a DFA clearly needs to have $2^n$ states as each of the possible states of the $n$ last characters define a different set of continuations that make the automaton accept. – a3nm Nov 09 '21 at 18:33
8

The minimum DFA corresponding to an NFA has 2^n states in the worst case, so you can't garantee anything. Without having a constructive example, the reasoning is that in an NFA you can be at any arbitrary subset of states after reading a certain input string, and each such subset might behave differently when observing one character. Suppose a language with two characters in the alphabet (a and b), and an NFA N with n states that starts with an accepting state at s_0. Now enumerate all subsets of states of N, and build the transition table such that observing "a" from subset S_i takes you to subset S_i+1 and observing b takes you to subset S_i-1 (this is doable for some enumerations, I think). Now this automata has n states and accepts sequences of m a's and n b's such that m-n = 0 mod 2^|N|, and cannot be expressed with a DFA that has less than 2^|N| states (since it might need to cycle through all subsets of states of the NFA N).

Alexandre Passos
  • 1,459
  • 12
  • 20
  • Can this be turned into an argument that says "if (some bad thing) is avoided in the NFA, then the DFA has subexponential number of states"? – András Salamon Aug 17 '10 at 01:54
  • 1
    @András, yes. "If nondeterminism is avoided in the NFA, then the DFA has subexponential number of states". – P Shved Aug 17 '10 at 07:12
  • 2
    Pavel, yes, obviously. Is there any non-trivial property that can be recognized efficiently which also guarantees subexponential blowup? – András Salamon Aug 17 '10 at 17:21