Bounds on the size of the smallest NFA for L_k-distinct

Question

Consider the language $L_{k-distinct}$ consisting of all $k$-letter strings over $\Sigma$ such that no two letters are equal:

$$ L_{k-distinct} :=\{w = \sigma_1\sigma_2...\sigma_k \mid \forall i\in[k]: \sigma_i\in\Sigma ~\text{ and }~ \forall j\ne i: \sigma_j\ne\sigma_i \}$$

This language is finite and therefore regular. Specifically, if $\left|\Sigma\right|=n$, then $\left|L_{k-distinct}\right| = \binom{n}{k} k!$.

What is the smallest non-deterministic finite automaton that accepts this language?

I currently have the following loose upper and lower bounds:

The smallest NFA I can construct has $4^{k(1+o(1))}\cdot polylog(n)$ states.
The following lemma implies a lower bound of $2^k$ states:

Let $L ⊆ Σ^*$ be a regular language. Suppose there are $n$ pairs $P = \{ (x_i, w_i) \mid 1 ≤ i ≤ n \}$ such that $x_i\cdot w_j \in L$ if and only if $i=j$. Then any NFA accepting L has at least n states.

Another (trivial) lower bound is $log$$n\choose k$, which is the log of the size of the smallest DFA for the language.

I am also interested in NFAs that accept only a fixed fraction ($0<\epsilon<1$) of $L_{k-distinct}$, if the size of the automaton is smaller than $\epsilon\cdot 4^{k(1+o(1))}\cdot polylog (n)$.

Edit: I've just started a bounty that had a mistake in the text.

I meant we may assume $k=polylog(n)$ while I wrote $k=O(log(n))$.

Edit2:

The bounty is going to end soon, so if anyone is interested in what is perhaps an easier way to earn it, consider the following language:

$L_{(r,k)-distinct} :=\{w : w$ contains $k$ distinct symbols and no symbol appear more than $r$ times$\}$.

(i.e. $L_{(1,k)-distinct} = L_{k-distinct}$).

A similar construction as the one in the comments gives $O(e^k\cdot 2^{k\cdot log(1+r)}\cdot poly(n))$ sized automaton for $L_{(r,k)-distinct}$.

Can this be improved? What's the best lower bound we can show for this language?

I can't write about it yet as we're still working on it, and haven't completed the proof.
Instead, I'll describe a much simpler automaton of size $O((2e)^k * 2^{O(log(k))} * log(n))$:

Take a $(n,k)$-perfect hash family $H$.

Every such hash is a function $h: [n] \to [k]$.

This means that for every subset of $[n]$ of size at most $k$, exists a function $h\in H$ such that it maps every item of the subset to different number.

After hashing, the resulting alphabet has $k$ letters, hence an autumaton of size $2^k$ can accept the $L_{k-distinct}$ language. — R B, Jan 07 '14 at 07:54
Now for the final automaton:
We will have an initial state $s$, from which there will be an epsilon move for every "hash automaton".

Since explicit builds of size $O(e^k\cdot k^{O(log(k))}$ for H are known, and every "hash automaton" is of size $2^k$, we get our total automaton size.The "hash automaton" with more detail:

There will be a state for every $S\in 2^{[k]}$, which symbols we've saw letters that were hashed to each of S's indices. Since our actual $\Sigma$ is of size $n$, there will be a move from $S$ to $S\cup {i}$ with for all letters hashed to $i$. — R B, Jan 07 '14 at 08:09
The "hash automaton" with more detail:
There will be a state for every $S\in 2^{[k]}$, which symbols we've saw letters that were hashed to each of S's indices. Since our actual $\Sigma$ is of size $n$, there will be a move from $S$ to $S\cup {i}$ with for all letters hashed to $i$. — R B, Jan 07 '14 at 08:10
Nitpick: I think the $k\lg n$ lower bound is not quite right. When $k=n$ it gives $n\lg n$, while the minimum DFS has $2^n$ states. — Radu GRIGore, Jan 07 '14 at 17:28
You're right @Radu.
I usually think of $k$ as a small number (otherwise the it's impractical) so I approximated $n\choose k$ as $n^k$, but in the general case, you're correct. Thanks for the comment :). — R B, Jan 07 '14 at 21:34
Why is the upper bound not $2^n$ also? The minimal DFA seems to have $2^n$ states, by keeping track of the (state space representation of the) bit array representing which of the $n$ letters has been seen so far in the input (rejecting when a letter is seen twice). Why can this DFA not be taken as the upper bound NFA? — András Salamon, Jan 09 '14 at 01:06
@AndrásSalamon - Formally, you are correct, but like I said in the comments, $k$ is assumed to be small for the automaton to be realistic.
Suppose that $n$ is large (say, $10^7$) but k is small, maybe $10-20$.

$\Sigma {n\choose i}$$ \leq 2^n$ is always a bound on a DFA, but, if $k$ is small, we can get a lot smaller DFA for the language. — R B, Jan 09 '14 at 10:39
As an aside, Theorem 30 in https://cs.uwaterloo.ca/~shallit/Papers/re3.pdf gives a $\Omega(c^k)$ lower bound for context-free grammars. This is stronger than your lower bound, though $c$ is much worse. — Yuval Filmus, Mar 12 '14 at 06:11
The size of the smallest DFA is not ${n \choose k}$ but $1+\sum_{i<k} {n\choose i}$. — domotorp, Mar 22 '14 at 19:44
In the lemma that gives the lower bound, in fact it is enough to require that $x_iw_i\in L$ and for every $i,j$ either $x_iw_j\notin L$ or $x_jw_i\notin L$. The proof that such families cannot have a cardinality bigger than $2^k$ was proved by Tuza: http://link.springer.com/article/10.1007%2FBF01788531 — domotorp, Mar 22 '14 at 20:09
The lower bound gives $(2-o(1))^k$ just counting the number of states that the NFA can be in after exactly $k/2$ steps. I don't think that I am aware of any proof method that gives significantly better bounds for the total size than what can be obtained than by just looking at what happens after $t$ steps, for some $t$. But here, for every $t$ there is an NFA that can be in only one of $(2+o(1))^k$ states after exactly $t$ states. — Noam, Mar 23 '14 at 05:51
Proof (of my previous claim): The hardest case is $t=k/2$; choose $2^k \cdot poly(k, \log n)$ different random subsets $S_i$ (of the $n$ alphabet symbols) of size exactly $t$ each and construct an NFA that has a state for each $i$ with some path leading to it iff the first $t$ symbols are all different and are contained in $S_i$, and has an accepting path from it iff the following $k-t$ symbols are all different and are contained in the complement of $S_i$. A counting argument will show that whp (over the random choice of the $S_i$'s) this NFA will indeed accept all of the desired language. — Noam, Mar 23 '14 at 05:59
In the previous construction, the simplest way to build the NFA will have a state for each possible prefix of length $j < t$ and for each possible suffix of length $j > k-t$. Instead, the prefix part and suffix part of the NFA can be built recursively using the same randomized construction (but now only within $S_i$ and its complement, respectively) and this would give a $(4+o(1))^k$ total size. — Noam, Mar 23 '14 at 06:13
@Noam: Did you want to write that the $S_i$'s have size exactly $n/2$? Where did the $\log n$ factor disappear? I think this upper bound is the same as R B's. — domotorp, Mar 23 '14 at 09:52
@domotorp: Oh, sorry, yes: for t=k/2, indeed sets of size $n/2$ (either exactly, or each item chosen independently with probability 1/2), and for general $t$ you want size $n \cdot t/k$. — Noam, Mar 23 '14 at 17:40
Also, I came up with a size $4^k \log n$ construction for $L_{k-distinct}$, presumably the same as R.B.'s. The same construction gives an NFA of size $4.74^k \log n$ for $L_{(r,k)-distinct}$ for any value of $r$. R.B., do you have this construction, or do you want some pointers? — greg, Mar 27 '14 at 14:06
@mobiusdumpling - is your build of size $4^klogn$ or $4^kpolylogn$? — R B, Mar 27 '14 at 16:07
@RB Apparently I had a mistake in my construction. I'll tell you what I was trying to do, maybe it could still give some helpful idea. Instead of using perfect hash families, I use families of functions from $[n]$ to $[t]$ such that for each $k$ elements in $[n]$, there's a function in the family that hashes them to distinct values. Then when I constructed the DFA for the already-hashed input, I thought that it's always enough to remember just $k/2$ elements in $[t]$, but I was wrong. It is enough to remember just $t/2$ elements: count up until the middle, and then flip and start (1/2) — greg, Mar 28 '14 at 11:23
(2/2) and then start counting down. Then I optimized over $t$. About $L_{(r,k)-distinct}$: all my previous comments about it were mistakes as well: I got the definition wrong. — greg, Mar 28 '14 at 11:40
One possible approach for a lower bound is to try to use techniques from information theory, such as some information theoretic arguments. Maybe something along the lines of Patrascu's information transfer method. — greg, Mar 29 '14 at 16:38

domotorp · Answer 1 · 2014-03-24T08:46:01.383

2

This is not an answer but a method which I believe would leave to an improved lower bound. Let us cut the problem after $a$ letters are read. Denote the family of $a$ element sets of $[n]$ by $\mathcal A$ and the family of $b=k-a$ element sets of $[n]$ by $\mathcal B$. Denote the states that can be reach after reading the elements of $A$ (in any order) by $S_A$ and the states from which an accepting state can be reached after reading the elements of $B$ (in any order) by $T_B$. We need that $S_A\cap T_B\ne \emptyset$ if and only if $A\cap B=\emptyset$. This already gives a lower bound for the required number of states and I think it could give something non-trivial.

This problem essentially asks for a lower bound on the number of the vertices of a hypergraph whose line graph is (partially) known. Similar problems were studied e.g., by Bollobas and there are several known proof methods that can be useful.

Update 2014.03.24: In fact if the above hypergraph can be realized on $s$ vertices, then we also get a non-deterministic communication complexity protocol of length $\log s$ for set disjointness with inputs sets of size $a$ and $b$ (in fact the two problems are equivalent). The bottleneck is of course when $a=b=k/2$, for this I could only find the following in Eyal and Noam's book: $N^1(DISJ_a)\le \log \big(2^k \log_e {n\choose a}\big)$ proved by the standard probabilistic argument. Unfortunately I could not (yet) find good enough lower bounds on this problem but assuming the above is sharp, it would give a lower bound $\Omega(2^k\log n)$ unifying the two lower bounds you have mentioned.

edited Mar 24 '14 at 08:46

answered Mar 23 '14 at 16:52

domotorp

13,931
37
93

1

Thanks @domotorp for your answer.
This seems a lot like the proof of the lemma I've used for the lower bound in the original question, but without specifying the actual $x_i$'s and $y_i$'s, and thus not a countable bound.

Your comment on the question above suggests that the $2^k$ bound can't be improved by that method, do you think this could do better?
– R B Mar 23 '14 at 17:32
4

The whole point of my comment above was that these techniques can not give a lower bound above $(2+o(1))^k$. This is really what makes this problem interesting to me. – Noam Mar 23 '14 at 17:43
1

@Noam: Let k=2, a=b=1. Already then we get a $\log n$ lower bound as every $S_A$ has to be different. – domotorp Mar 23 '14 at 17:46
@RB: I hope this can do better, I will try to see if I can make some proof work. – domotorp Mar 23 '14 at 17:48
1

@domotorp: The $o(1)$ hides a $O(k\log n)$ factor: Here is the analysis for the worst case where $a=b=k/2$: Start with a fixed $A$ and $B$ and pick at random a subset $S$ of the $n$ letters then we have $Pr[A \subseteq S :and: B \subseteq S^c]=2^{-k}$. Now pick $r2^k$ such sets at random then the probability that for at least one of them this happens is $1-exp(-r)$. If we choose $r = O(\log {n \choose k}) = O(k \log n)$ then we get that whp this is so for ALL disjoint sets $A$ and $B$ (of size $k/2$). The total number of such $S$'s in this construction is $O(2^k k \log n)$. – Noam Mar 23 '14 at 19:07
2

@Noam: I am sorry but I have never seen a $\log n$ hidden in an $o(1)$, especially as the problem is also interesting imho for $k<<\log n$. But you are right that R B asked about $k=polylog n$. – domotorp Mar 23 '14 at 22:26

score 0 · Answer 2 · answered Mar 27 '14 at 15:59

Some work in progress:

I'm trying to prove a lower bound of $4^k$. Here is a question that I'm pretty sure would give such a lower bound: find the minimum $t$ such that there exists a function $f:\{S \subseteq [n], |S|=k/2 \} \rightarrow \{0,1\}^t$ that preserves disjointness, i.e. that $S_1 \cap S_2 = \emptyset$ iff $f(S_1) \cap f(S_2) = \emptyset$. I'm pretty sure a lower bound of $t \ge 2k$ would almost immediately imply a lower bound of $2^{2k}=4k$ for our problem. $f(S)$ roughly corresponds to the set of nodes the NFA can get to after reading the first $k/2$ symbols of the input, when the set of these $k/2$ symbols is $S$.

I think the solution to this question might already be known, either in the communication complexity literature (especially in papers dealing with the disjointness problem; maybe some matrix rank arguments will help), or in literature about encodings (e.g. like this).

My comments above show that this approach cannot beat $(2+o(1))^n$ — Noam, Mar 27 '14 at 21:34

Bounds on the size of the smallest NFA for L_k-distinct

2 Answers2

Linked