Consider the language $L_{k-distinct}$ consisting of all $k$-letter strings over $\Sigma$ such that no two letters are equal:
$$ L_{k-distinct} :=\{w = \sigma_1\sigma_2...\sigma_k \mid \forall i\in[k]: \sigma_i\in\Sigma ~\text{ and }~ \forall j\ne i: \sigma_j\ne\sigma_i \}$$
This language is finite and therefore regular. Specifically, if $\left|\Sigma\right|=n$, then $\left|L_{k-distinct}\right| = \binom{n}{k} k!$.
What is the smallest non-deterministic finite automaton that accepts this language?
I currently have the following loose upper and lower bounds:
The smallest NFA I can construct has $4^{k(1+o(1))}\cdot polylog(n)$ states.
The following lemma implies a lower bound of $2^k$ states:
Let $L ⊆ Σ^*$ be a regular language. Suppose there are $n$ pairs $P = \{ (x_i, w_i) \mid 1 ≤ i ≤ n \}$ such that $x_i\cdot w_j \in L$ if and only if $i=j$. Then any NFA accepting L has at least n states.
- Another (trivial) lower bound is $log$$n\choose k$, which is the log of the size of the smallest DFA for the language.
I am also interested in NFAs that accept only a fixed fraction ($0<\epsilon<1$) of $L_{k-distinct}$, if the size of the automaton is smaller than $\epsilon\cdot 4^{k(1+o(1))}\cdot polylog (n)$.
Edit: I've just started a bounty that had a mistake in the text.
I meant we may assume $k=polylog(n)$ while I wrote $k=O(log(n))$.
Edit2:
The bounty is going to end soon, so if anyone is interested in what is perhaps an easier way to earn it, consider the following language:
$L_{(r,k)-distinct} :=\{w : w$ contains $k$ distinct symbols and no symbol appear more than $r$ times$\}$.
(i.e. $L_{(1,k)-distinct} = L_{k-distinct}$).
A similar construction as the one in the comments gives $O(e^k\cdot 2^{k\cdot log(1+r)}\cdot poly(n))$ sized automaton for $L_{(r,k)-distinct}$.
Can this be improved? What's the best lower bound we can show for this language?
Instead, I'll describe a much simpler automaton of size $O((2e)^k * 2^{O(log(k))} * log(n))$:
Take a $(n,k)$-perfect hash family $H$.
Every such hash is a function $h: [n] \to [k]$.
This means that for every subset of $[n]$ of size at most $k$, exists a function $h\in H$ such that it maps every item of the subset to different number.
After hashing, the resulting alphabet has $k$ letters, hence an autumaton of size $2^k$ can accept the $L_{k-distinct}$ language.
– R B Jan 07 '14 at 07:54We will have an initial state $s$, from which there will be an epsilon move for every "hash automaton".
Since explicit builds of size $O(e^k\cdot k^{O(log(k))}$ for H are known, and every "hash automaton" is of size $2^k$, we get our total automaton size.The "hash automaton" with more detail:
There will be a state for every $S\in 2^{[k]}$, which symbols we've saw letters that were hashed to each of S's indices. Since our actual $\Sigma$ is of size $n$, there will be a move from $S$ to $S\cup {i}$ with for all letters hashed to $i$.
– R B Jan 07 '14 at 08:09There will be a state for every $S\in 2^{[k]}$, which symbols we've saw letters that were hashed to each of S's indices. Since our actual $\Sigma$ is of size $n$, there will be a move from $S$ to $S\cup {i}$ with for all letters hashed to $i$.
– R B Jan 07 '14 at 08:10I usually think of $k$ as a small number (otherwise the it's impractical) so I approximated $n\choose k$ as $n^k$, but in the general case, you're correct. Thanks for the comment :).
– R B Jan 07 '14 at 21:34Suppose that $n$ is large (say, $10^7$) but k is small, maybe $10-20$.
$\Sigma {n\choose i}$$ \leq 2^n$ is always a bound on a DFA, but, if $k$ is small, we can get a lot smaller DFA for the language.
– R B Jan 09 '14 at 10:39