A shuffle of two strings is formed by interspersing the characters into a new string, keeping the characters of each string in order. For example, MISSISSIPPI is a shuffle of MISIPP and SSISI. Let me call a string square if it is a shuffle of two identical strings. For example, ABCABDCD is square, because it is a shuffle of ABCD and ABCD, but the string ABCDDCBA is not square.
Is there a fast algorithm to determine whether a string is square, or is it NP-hard? The obvious dynamic programming approach doesn't seem to work.
Even the following special cases appear to be hard: (1) strings in which each character appears at most four six times, and (2) strings with only two distinct characters. As Per Austrin points out below, the special case where each character occurs at most four times can be reduced to 2SAT.
Update: This problem has another formulation that may make a hardness proof easier.
Consider a graph G whose vertices are the integers 1 through n; identify each edge with the real interval between its endpoints. We say that two edges of G are nested if one interval properly contains the other. For example, the edges (1,5) and (2,3) are nested, but (1,3) and (5,6) are not, and (1,5) and (2,8) are not. A matching in G is non-nested if no pair of edges is nested. Is there a fast algorithm to determine whether G has a non-nested perfect matching, or is that problem NP-hard?
Unshuffling a string is equivalent to finding a non-nested perfect matching in a disjoint union of cliques (with edges between equal characters). In particular, unshuffling a binary string is equivalent to finding a non-nested perfect matching in a disjoint union of two cliques. But I don't even know if this problem is hard for general graphs, or easy for any interesting classes of graphs.
There is an easy polynomial-time algorithm to find perfect non-crossing matchings.
Update (Jun 24, 2013): The problem is solved! There are now two independent proofs that identifying square strings is NP-complete.
In November 2012, Sam Buss and Michael Soltys announced a reduction from 3-partition, which shows that the problem is hard even for strings over a 9-character alphabet. See "Unshuffling a Square is NP-Hard", Journal of Computer System Sciences 2014.
In June 2013, Romeo Rizzi and Stéphane Vialette published a reduction from the longest common subsequence problem. See "On Recognizing Words That Are Squares for the Shuffle Product", Proc. 8th International Computer Science Symposium in Russia, Springer LNCS 7913, pp. 235–245.
There is also a simpler proof that finding non-nested perfect matchings is NP-hard, due to Shuai Cheng Li and Ming Li in 2009. See "On two open problems of 2-interval patterns", Theoretical Computer Science 410(24–25):2410–2423, 2009.
For n=4, 10000111 is a 2*n bit binary number for which half the bits are on and half are off, but which is not a square, as defined. Following that logic, since squares are a strict subset of the set that generates A000984, the values for squares over a binary alphabet should be lower at equal indices through the sequence - no?
– Daniel Apon Aug 24 '10 at 01:45n / 2letters in polynomial time. There are a number of reductions that can be made to the brute force approach, like limiting it to only contain half of each letter. – Nick Larsen Aug 26 '10 at 18:30