Is SAT a context-free language?

Question

I am considering the language of all satisfiable propositional logic formulae, SAT (to ensure that this has a finite alphabet, we would encode propositional letters in some suitable way [edit: the replies pointed out that the answer to the question may not be robust under varying encodings, so one needs to be more specific -- see my conclusions below]). My simple question is

Is SAT a context-free language?

My first guess was that today's (early 2017) answer should be "Nobody knows, since this relates to unresolved questions in complexity theory." However, this is not really true (see answer below), though not completely false either. Here is a short summary of things we know (starting with some obvious things).

SAT is not regular (because even the syntax of propositional logic is not regular, due to matching parentheses)
SAT is context-sensitive (it is not hard to give an LBA for it)
SAT is NP-complete (Cook/Levin), and in particular decided by nondeterministic TMs in polynomial time.
SAT can also be recognized by one-way nondeterministic stack automata (1-NSA) (see W. C. Rounds, Complexity of recognition in intermediate Level languages, Switching and Automata Theory, 1973, 145-158 http://dx.doi.org/10.1109/SWAT.1973.5)
The word problem for context-free languages has its own complexity class $\textbf{CFL}$ (see https://complexityzoo.uwaterloo.ca/Complexity_Zoo:C#cfl)
$\textbf{CFL}\subseteq\textbf{LOGCFL}\subseteq\textbf{AC}^{\textbf{1}}$, where $\textbf{LOGCFL}$ is the class of problems logspace reducible to $\textbf{CFL}$ (see https://complexityzoo.uwaterloo.ca/Complexity_Zoo:L#logcfl). It is known that $\textbf{NL}\subseteq\textbf{LOGCFL}$.
It is not known whether $\textbf{NL}\subsetneq\textbf{NP}$ or $\textbf{NL}=\textbf{NP}$ (in fact, even $\textbf{NC}^{\textbf{1}}\subsetneq\textbf{PH}$ is open; I think I got this from S. Arora, B. Barak: Computational Complexity: A Modern Approach; Cambridge University Press 2009). Hence, there cannot be any $\textbf{NP}$-complete problem that is known to not be in $\textbf{LOGCFL}$. Hence, it must be unknown if SAT is in $\textbf{LOGCFL}$.

However, this last point still leaves the possibility that SAT is known to not be in $\textbf{CFL}$. In general, I could not find much about the relationship of $\textbf{CFL}$ to the $\textbf{NC}$ hierarchy that might help to clarify the epistemic status of my question.

Remark (after seeing some initial answers): I am not expecting the formula to be in conjunctive normal form. This does not make a difference to the essence of the answer, and the arguments usually still apply, since a CNF is also a formula. However, the claim that the constant-number-of-variables version of the problem is regular fails, since one needs parentheses for the syntax.

Conclusion: Contrary to my complexity-theory inspired guess, one can show directly that SAT is not context-free. The situation therefore is:

It is known that SAT is not context-free (in other words: SAT is not in $\textbf{CFL}$), under the assumption that one uses a "direct" encoding of formulae where propositional variables are identified by binary numbers (and some further symbols are used for operators and separators).
It is not known if SAT is in $\textbf{LOGCFL}$, but "most experts think" that it is not, since this would imply $\textbf{P}=\textbf{NP}$. This also means that it is unknown if other "reasonable" encodings of SAT are context-free (assuming that we would consider logspace an acceptable encoding effort for an NP-hard problem).

Note that these two points do not imply $\textbf{CFL}\subsetneq\textbf{LOGCFL}$. That strict inclusion can, however, be shown directly by exhibiting languages in $\textbf{L}$ (hence in $\textbf{LOGCFL}$) that are not context-free (e.g., $a^nb^nc^n$).
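To illustrate why $a^nb^nc^n$ sits in $\textbf{L}$, here is a minimal sketch of a one-pass recognizer that keeps only three counters, each of which needs $O(\log n)$ bits. The function name `is_anbncn` is made up for this illustration, and a Python program is of course only an informal stand-in for a logspace Turing machine.

```python
def is_anbncn(word):
    """Recognise { a^n b^n c^n : n >= 0 } in one left-to-right pass."""
    order = "abc"
    counts = {"a": 0, "b": 0, "c": 0}  # three counters, O(log n) bits each
    phase = 0  # index of the latest letter block we have entered
    for ch in word:
        if ch not in counts:
            return False
        idx = order.index(ch)
        if idx < phase:
            return False  # letters out of order, e.g. a "b" followed by an "a"
        phase = idx
        counts[ch] += 1
    return counts["a"] == counts["b"] == counts["c"]
```

For example, `is_anbncn("aabbcc")` holds while `is_anbncn("aabbc")` does not.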

If SAT were context-free, then dynamic programming (the CYK algorithm) would give a polynomial-time algorithm for testing membership in SAT, giving P=NP. Even P=NP wouldn't mean that SAT is context-free.
This encoding of variables seems like it might be more important than you're giving it credit for. I haven't worked out the details but if you added unary or binary "subscripts" to the variables I think you'd have trouble distinguishing (x and y) from (x and not x) for large enough indices. — AdamF, Jan 17 '17 at 15:35
You have to be precise about the representation before claiming P=NP conclusions. For example, factoring a number N is polynomial-time in N (the interesting question is concerning the number of bits needed to write N in binary, or about log N). — Aryeh, Jan 17 '17 at 18:17
I was aware of the P=NP conclusion and that the answer therefore was not expected to be "yes" (sorry for being a little provocative in how I phrased this ;-). I was still wondering if this is really known or merely something that "most experts believe" (answers now clearly indicate that the former is true; I will select one in due course). — mak, Jan 17 '17 at 22:56
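The CYK remark in the comments above can be made concrete with a small sketch. It assumes a grammar in Chomsky normal form, passed in as two dictionaries (`unary` and `binary`, names chosen for this illustration only), and uses a toy grammar for $a^nb^n$; nothing here is specific to SAT. The three nested loops over substring length, start position and split point give the cubic running time in the length of the word.

```python
from itertools import product

def cyk(word, unary, binary, start="S"):
    """CYK membership test for a grammar in Chomsky normal form.

    unary:  maps a terminal t to the set of nonterminals A with A -> t
    binary: maps a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # the empty word is not handled in this sketch
    # table[i][l - 1] holds the nonterminals deriving word[i:i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(unary.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for b, c in product(left, right):
                    table[i][length - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# Toy grammar for { a^n b^n : n >= 1 }:
#   S -> A B | A T,  T -> S B,  A -> a,  B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}
```

With this grammar, `cyk("aabb", UNARY, BINARY)` succeeds and `cyk("aab", UNARY, BINARY)` fails.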

score 8 · Accepted Answer · edited Nov 02 '23 at 22:18

Just an alternative proof using a mix of well known results.

Suppose that:

variables are expressed with the regular expression $d = (+|-)1(0|1)^*$
and that the (regular) language (over $\Sigma = \{0,1,+,-,\land,\lor\}$) used to represent CNF formulas is: $S = \{ d^+ (\lor d^+)^*(\land (d^+ (\lor d^+)^*))^* \}$; just note that $S$ grabs all well-formed CNF formulas up to variable renaming.

For example $\varphi = (x_1 \lor x_2) \land -x_3$ is written as: $s_{\varphi} = +1 \lor +10 \land -11 \in S$ (the $\lor$ operator has the precedence over $\land$).

Suppose that $L = \{ s_{\varphi} \in S \mid \text{the corresponding formula } \varphi \text{ is satisfiable} \}$ is CF.

If we intersect it with the regular language $R = \{ +1^a \land -1^b \land -1^c \mid a,b,c > 0 \}$, we still get a CF language. We can also apply the homomorphism $h$ with $h(+) = \epsilon$, $h(-) = \epsilon$ (and $h(\sigma) = \sigma$ for every other symbol $\sigma$), and the language remains CF.

But the language we obtain is $L' = \{ 1^a \land 1^b \land 1^c \mid a \neq b, a \neq c\}$: if $a=b$, the "source" formula is $+x_a \land -x_a \land -x_c$ (writing $x_a$ for the variable encoded by $1^a$), which is unsatisfiable (similarly if $a=c$). But $L'$ is a well-known non-CF language $\Rightarrow$ contradiction.
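The key step, that a string of $R$ encodes a satisfiable formula exactly when $a \neq b$ and $a \neq c$, can be sanity-checked by brute force for small exponents. The sketch below is only an illustration under some assumptions: it writes `&` for $\land$ and `|` for $\lor$, handles one literal per disjunct, and the names `parse`, `satisfiable`, `r_string` and `check_claim` are invented here.

```python
from itertools import product

def parse(s):
    """Parse an encoded CNF string into clauses of (is_positive, index) pairs."""
    return [[(lit[0] == "+", int(lit[1:], 2)) for lit in clause.split("|")]
            for clause in s.split("&")]

def satisfiable(s):
    """Brute-force satisfiability test over all assignments."""
    clauses = parse(s)
    variables = sorted({v for clause in clauses for _, v in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == positive for positive, v in clause)
               for clause in clauses):
            return True
    return False

def r_string(a, b, c):
    """The string +1^a & -1^b & -1^c of the regular language R."""
    return "+" + "1" * a + "&-" + "1" * b + "&-" + "1" * c

def check_claim(limit):
    """Check 'satisfiable iff a != b and a != c' for all 1 <= a, b, c <= limit."""
    return all(satisfiable(r_string(a, b, c)) == (a != b and a != c)
               for a, b, c in product(range(1, limit + 1), repeat=3))
```

Calling `check_claim(5)` confirms the characterization for all exponents up to 5; it is a finite check and not a replacement for the argument above.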

edited Nov 02 '23 at 22:18

whoisit


answered Jan 17 '17 at 21:57

Marzio De Biasi


I accepted this answer now since there is still an open issue with the other approach (see comments) and I like the somewhat more basic argument. It might be nice to emphasize that the argument is specific to the chosen encoding and it is indeed unknown if one could find another simple (logspace) encoding that leads to a context-free language. – mak Jan 22 '17 at 21:16
@mak: I suspect that any other "reasonable" encoding of SAT can be proved to be non-CF with a similar technique. Perhaps another interesting direction would be to study if we can apply some sort of diagonalization to get a more general proof: the SAT formula encodes a computation that simulates a pushdown automaton on a given input and it is satisfiable if and only if the automaton doesn't accept it. But it's only a fuzzy idea ... – Marzio De Biasi Jan 22 '17 at 22:28
Checking if a string is in a regular language is in P. Assume SAT was in Reg. Then NP=coNP. Let L be in Reg. Consider the formula that is true if it is not in L. It is in NP so it can be expressed as a SAT formula. It is in the language iff it is not. – Kaveh Jan 24 '17 at 03:27

Aryeh · Answer 2 · 2017-01-18T09:44:44.803

5

If the number of variables is finite then so is the number of satisfiable conjunctions, so your SAT language is finite (and hence regular). [Edit: this claim assumes the CNFSAT form.]

Otherwise, let's agree to encode formulae such as $(x_{17}\vee \neg x_{21})\wedge (x_{1}\vee x_{2}\vee x_3)$ by $(17+\tilde{} 21)(1+2+3)$. We will use Ogden's lemma to prove that the language of all satisfiable conjunctions is not context-free.

Let $p$ be the "marking" constant in Ogden's lemma, and consider a sat-formula $w$ whose first clause consists of $(1^p)$ -- that is, the encoding of $(x_N)$, where $N$ is the decimal number consisting of $p$ ones. We mark the $p$ ones of $1^p$ and then require that all pumpings of the appropriate decomposition of $w$ (see the conclusion of Ogden's lemma) also be satisfiable. But we can easily block this by requiring that no clause containing $x_q$, where $q$ is a sequence of $1$'s shorter than $p$, be satisfiable -- for example, by ensuring that every other clause of $w$ has a negation of every such $x_q$. This means that $w$ fails the "negative pumping" property and we conclude that the language is not context-free. [Note: I've ignored the trivial cases where the pumping produces ill-formed strings.]

edited Jan 18 '17 at 09:44

answered Jan 17 '17 at 17:03

Aryeh


Note: In my claim that for a finite number of variables the language is finite, I am implicitly disallowing repeating a variable within a clause or a clause unboundedly many times – Aryeh Jan 17 '17 at 17:56
... But I think the language is still regular, because one takes the finite collection of "essentially distinct" (i.e., without trivial repetitions) formulae and then allows the various repetitions. – Aryeh Jan 17 '17 at 17:57
The claim with regularity only works for CNFSAT (I added a clarification to my question). – mak Jan 17 '17 at 23:20
Even with arbitrary non-CNF formulas in finitely many variables, satisfiability (and any language that cannot distinguish two logically equivalent formulas for that matter) is easily seen to be context-free. However, I fail to see the relevance of this. Satisfiability of formulas in finitely many variables is a trivial problem that has nothing to do with the complexity of SAT. – Emil Jeřábek Jan 18 '17 at 09:23
@mak I edited the answer to reflect your observation. – Aryeh Jan 18 '17 at 09:45
I do not understand yet how your argument would ensure that the negative pumping does not also remove the parts of the formula that you need to ensure unsatisfiability of any positive unit clause other than $x_N$. Can you elaborate? – mak Jan 19 '17 at 10:12
The negative pumping only affects the string $1^p$ -- because of the way I've placed the marks, as per Ogden's lemma. So while I cannot ensure that a positive pumping will create an unsatisfiable assignment, I can certainly ensure that a negative one will -- since it will create literals indexed by $1^q$, where $q<p$. – Aryeh Jan 19 '17 at 10:19
I do not follow your argument here. Ogden's Lemma allows any of the strings to have any amount of unmarked positions, so negative pumping does not affect only the marked positions. – mak Jan 20 '17 at 15:35
I think my application of Ogden's Lemma works, even if I may be off by a constant here or there. I mark only the contiguous stretch of 1s in the first variable of the formula. Ogden's lemma (items 1 and 2) force the pumped region to be limited to this contiguous stretch. Hence, any negative pumping will create unsatisfiable formulae. – Aryeh Jan 21 '17 at 19:17
Item 2 (referring to the Wikipedia page) is always true if you only mark p positions in total, so this can't help much. Item 1 only requires that 1 marked position is pumped, but does not restrict how much of the unmarked string can be in xy. It seems to me that you could (un)pump, e.g., "(111)(~0)(~1)...(~110)" by picking u="(11", x="1", y="", z=")(~0)...(~110" and v=")", so as to obtain "(11)", which is satisfiable. – mak Jan 21 '17 at 20:56
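The counterexample in the comment above can be replayed mechanically. The sketch below assumes binary variable indices, `~` for negation and `+` between literals of a clause (as in the answer's encoding), and reads the comment's decomposition as pumping the pieces `x` and `z`; the helper `satisfiable` is a brute-force checker written only for this illustration.

```python
import re
from itertools import product

def satisfiable(formula):
    """Brute-force check for strings like "(111)(~0)(~1+10)"."""
    clauses = []
    for body in re.findall(r"\(([^()]*)\)", formula):
        clause = []
        for lit in body.split("+"):
            positive = not lit.startswith("~")
            clause.append((positive, int(lit.lstrip("~"), 2)))
        clauses.append(clause)
    variables = sorted({v for clause in clauses for _, v in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == pos for pos, v in clause) for clause in clauses):
            return True
    return False

P = 3  # illustrative value of the marking constant, as in the comment
# w = "(111)(~0)(~1)...(~110)": x_7 must be true, every smaller index false
w = "(" + "1" * P + ")" + "".join(f"(~{k:b})" for k in range(2 ** P - 1))

# Decomposition from the comment; x contains a marked "1", z swallows the rest
u, x, y, v = "(11", "1", "", ")"
z = w[len(u + x + y):len(w) - len(v)]
assert u + x + y + z + v == w

pumped_down = u + y + v           # drop both pumped pieces x and z
intended = "(11)" + w[len("(111)"):]  # what pumping only inside 1^p would give
```

Here `pumped_down` is just `"(11)"`, which is satisfiable, whereas `intended` keeps the clause `(~11)` and is unsatisfiable. That gap is what the comment points out.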
what is the value of p in your example? – Aryeh Jan 21 '17 at 21:28
p would be 3 here (to make the illustration short), but the example is meant to be generic. I am using binary numbers rather than decimal ones, and I am (as it was easier to write) requiring that all other variables are mapped to false rather than just the ones with a "1" in them. – mak Jan 21 '17 at 21:42
OK, I see the problem -- I was implicitly assuming that $|xyz|$ can be bounded in length (as in the classic pumping lemma), while also being able to specify something about its location in the string. I think the argument can be fixed by re-doing the pumping lemma from scratch. We'll make that first variable a really long sequence of 1's -- long enough that some sub-tree generating a contiguous substring of those 1's has to be sufficiently deep for the pigeonhole principle to apply. – Aryeh Jan 21 '17 at 22:49
Yes, something like this could work, but it might end up being somewhat more complicated than the argument given by Marzio De Biasi here so maybe not worth the effort. Thanks for the quick answer and the discussion. – mak Jan 22 '17 at 21:14
Actually, I'm quite confident that it works. The derivation tree necessarily has bounded out-degree, meaning that to generate a substring of sufficient length, you need some minimal height -- and that immediately allows for a pigeonhole argument. – Aryeh Jan 23 '17 at 07:17
