14

Suppose we have a regular language specified by a regex, for example, (ab|ac)* and we wish to find an equivalent regex with the minimal number of symbols, (a(b|c))*. Is there any efficient way to do this? From what I've read, minimizing the number of states in a DFA is easy, but minimizing a NFA is PSPACE-Complete, and I'm not sure which if either of these it falls under. If it's not tractable, are there any good approximation algorithms or special cases under which it can be done efficiently?

Note: By minimizing the number of symbols, I mean the number of times symbols in the alphabet (a,b,c in the example) appear in the regular expression. The number of parenthesis, | and *s don't matter.

Motivation - this idea comes from the challenge of minimizing program code size, given that if and while statements are the only allowable control flow. The possible paths through the control flow graph correspond to a regular language, with if being union, while being the Kleene star, and symbols representing primitive statements that don't transfer control.

Kaveh
  • 21,577
  • 8
  • 82
  • 183
Antimony
  • 917
  • 5
  • 15
  • what about converting some symbol strings to new symbols? does it have a meaning in your context? disallowed? it would seem to be equivalent to refactoring a particular path into a subroutine? this sounds similar to a compiler code optimization problem... – vzn Aug 20 '12 at 18:55
  • 5
    I read only the abstract, but perhaps this paper is relevant: Minimizing NFA's and Regular Expressions. In particular: "... Jiang and Ravikumar [7] show moreover that the minimization problem for nfa's or regular expressions remains PSPACE-complete, even when specifying the regular language by a dfa ..." – Marzio De Biasi Aug 20 '12 at 23:33
  • 1
    Relevant information: http://en.wikipedia.org/wiki/PSPACE-complete#Determining_if_a_regular_expression_generates_all_strings – Tsuyoshi Ito Aug 21 '12 at 00:20
  • I guess it is PSPACE complete and not approximable. Oh well, at least the question's been answered. – Antimony Aug 21 '12 at 03:47
  • the technical meaning of "approximation" may not match what you might think. if you are interested in "application of theory" it may be reasonable to approach it as a compression or optimization problem for which there are heuristics etc. see "grammar compression", it seems to be a special case where the compression is limited to regexps, optimizing by size – vzn Aug 21 '12 at 16:53
  • 1
    I think that regexes are not a suitable representation for program minimization after all. Working with the control flow graph directly is easier. – Antimony Aug 21 '12 at 21:07
  • there is likely std research & algorithms for optimizing the control flow graph as you have sketched out in compiler optimization literature. but thats another question than the one above. as you note this problem is probably closely related to your question above, finding the minimal NFA of a DFA – vzn Aug 21 '12 at 21:47
  • 1
    As you said, I think the control flow chart is more suitable for your needs. Would it work to treat the graph as a DFA, and run known DFA state minimization algorithms on it in order to identify removable states ? EDIT : sorry, this should really be a comment but, being a noob, I can only post answers – Manuel Lafond Aug 22 '12 at 19:00

1 Answers1

6

It is PSPACE-complete to decide whether an expression accepts all words, i.e. is equivalent to $(a|b|c|...)^*$. It is not hard to get convinced that in this proof of PSPACE-completeness (see e.g. https://cstheory.stackexchange.com/a/40250/8953), when the expression does not accept all words, then it cannot be equivalent to an even shorter expression. Indeed, in this case it must accept everything except a long word that represents a valid encoding of a run of a PSPACE Turing machine. Therefore, minimization of regular expressions is PSPACE-hard.

Conversely, equivalence of expressions can be tested in PSPACE, so we can guess an expression of minimal size, and verify that it is correct in PSPACE (which is the same as Nondeterministic PSPACE). Therefore, the minimization problem is in PSPACE as well, and so it is PSPACE-complete.

Denis
  • 8,843
  • 28
  • 55