How hard is unshuffling a string?

Question

A shuffle of two strings is formed by interspersing the characters into a new string, keeping the characters of each string in order. For example, MISSISSIPPI is a shuffle of MISIPP and SSISI. Let me call a string square if it is a shuffle of two identical strings. For example, ABCABDCD is square, because it is a shuffle of ABCD and ABCD, but the string ABCDDCBA is not square.
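
To make the definition concrete, here is a minimal brute-force sketch in Python. It takes exponential time in the worst case, so it is only useful for small inputs. The name `is_square` and the queue-based search are illustrative assumptions, not part of the question: each character either joins the first copy (and waits in a queue) or matches the oldest waiting character, in which case it belongs to the second copy.

```python
from functools import lru_cache

def is_square(s):
    """Return True if s is a shuffle of some string with itself (brute force)."""
    @lru_cache(maxsize=None)
    def search(i, pending):
        # pending: characters given to the first copy that the second copy
        # has not reproduced yet, oldest first.
        if len(pending) > len(s) - i:
            return False  # too few characters left to match all of them
        if i == len(s):
            return True   # pending is empty here because of the check above
        # Option 1: s[i] is the next character of the second copy.
        if pending and pending[0] == s[i] and search(i + 1, pending[1:]):
            return True
        # Option 2: s[i] is the next character of the first copy.
        return search(i + 1, pending + s[i])

    return search(0, "")

print(is_square("ABCABDCD"))  # True
print(is_square("ABCDDCBA"))  # False
```

Assigning the earlier character of every matched pair to the first copy loses no generality, which is why a single queue is enough.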

Is there a fast algorithm to determine whether a string is square, or is it NP-hard? The obvious dynamic programming approach doesn't seem to work.

Even the following special cases appear to be hard: (1) strings in which each character appears at most ~~four~~ six times, and (2) strings with only two distinct characters. As Per Austrin points out below, the special case where each character occurs at most four times can be reduced to 2SAT.

Update: This problem has another formulation that may make a hardness proof easier.

Consider a graph G whose vertices are the integers 1 through n; identify each edge with the real interval between its endpoints. We say that two edges of G are nested if one interval properly contains the other. For example, the edges (1,5) and (2,3) are nested, but (1,3) and (5,6) are not, and (1,5) and (2,8) are not. A matching in G is non-nested if no pair of edges is nested. Is there a fast algorithm to determine whether G has a non-nested perfect matching, or is that problem NP-hard?
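
As a small sketch of these definitions (the helper names `nested` and `is_non_nested_perfect_matching` are hypothetical, and the second function checks a candidate matching rather than searching for one, and does not verify that the edges belong to G):

```python
def nested(e, f):
    """True if one of the intervals e, f properly contains the other."""
    (a, b), (c, d) = sorted(e), sorted(f)
    contains = (a <= c and d <= b) or (c <= a and b <= d)
    return contains and (a, b) != (c, d)

def is_non_nested_perfect_matching(n, edges):
    """Check that edges cover vertices 1..n exactly once with no nested pair."""
    covered = sorted(v for e in edges for v in e)
    if covered != list(range(1, n + 1)):
        return False
    return not any(nested(e, f)
                   for i, e in enumerate(edges) for f in edges[i + 1:])

print(nested((1, 5), (2, 3)))  # True
print(nested((1, 3), (5, 6)))  # False
print(nested((1, 5), (2, 8)))  # False
```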

Unshuffling a string is equivalent to finding a non-nested perfect matching in a disjoint union of cliques (with edges between equal characters). In particular, unshuffling a binary string is equivalent to finding a non-nested perfect matching in a disjoint union of two cliques. But I don't even know if this problem is hard for general graphs, or easy for any interesting classes of graphs.
There is an easy polynomial-time algorithm to find perfect non-crossing matchings.

Update (Jun 24, 2013): The problem is solved! There are now two independent proofs that identifying square strings is NP-complete.

In November 2012, Sam Buss and Michael Soltys announced a reduction from 3-partition, which shows that the problem is hard even for strings over a 9-character alphabet. See "Unshuffling a Square is NP-Hard", Journal of Computer and System Sciences, 2014.
In June 2013, Romeo Rizzi and Stéphane Vialette published a reduction from the longest common subsequence problem. See "On Recognizing Words That Are Squares for the Shuffle Product", Proc. 8th International Computer Science Symposium in Russia, Springer LNCS 7913, pp. 235–245.

There is also a simpler proof that finding non-nested perfect matchings is NP-hard, due to Shuai Cheng Li and Ming Li in 2009. See "On two open problems of 2-interval patterns", Theoretical Computer Science 410(24–25):2410–2423, 2009.

FWIW, I hacked a program to count squares over binary alphabets and I got 1, 2, 6, 22, 128, 509, 2074, 4518, 20141, 33637, 155543, 210678, 1074477. (The first value, the one, counts the empty string. The third value, the six, counts 0000, 0101, 0011, and their bitwise negations.) This sequence does not appear in OEIS. Code at http://github.com/rgrig/hacks/blob/master/square.w — Radu GRIGore, Aug 19 '10 at 09:05
@rgrig If I understand, that would mean that there are 128 8 bit strings that are squares. There are only 128 8 bit strings with an even number of ones, so that would mean that all such strings are squares, but 10000001 obviously isn't. I'm missing something. — deinst, Aug 19 '10 at 18:14
@deinst, you are not missing anything, I had a typo in generating the merging patterns. The fixed numbers are 1, 2, 6, 22, 82, 320, 1268, 5102, 20632, 83972, 342468, 1399296. Still not in OEIS. The fraction of squares seems to approach 1/3. — Radu GRIGore, Aug 20 '10 at 07:53
Isn't the sequence just A000984, the "number of possible values of a 2*n bit binary number for which half the bits are on and half are off"? — Travis Brown, Aug 23 '10 at 23:24
@Travis, unless I'm misunderstanding:
For n=4, 10000111 is a 2*n bit binary number for which half the bits are on and half are off, but which is not a square, as defined. Following that logic, since squares are a strict subset of the set that generates A000984, the values for squares over a binary alphabet should be lower at equal indices through the sequence - no? — Daniel Apon, Aug 24 '10 at 01:45
@Daniel: I was taking the 2*n bit binary number as a template for making squares; i.e., replace all the zeroes with the characters in order, and repeat for the ones. So "AABCDBCD" is the square corresponding to 10000111 for the string "ABCD". But you're right: this still isn't the sequence Radu GRIGore wanted. — Travis Brown, Aug 24 '10 at 01:59
RE: My above comment, squares over binary are not a strict subset of the set of strings with bits half-on/half-off. This is pretty obvious; there are plenty of squares (in fact, the vast majority) with unequal counts of 1s and 0s. — Daniel Apon, Aug 24 '10 at 03:20
In MISSISSIPPI there are 4 "i"s, in MISIPPI 3, and in SSISI 2. Does that mean that one of the i's of MISSISSIPPI is used for both words, or that you made a typo? In the first case I am not sure I understand your question. — Arthur MILCHIOR, Aug 24 '10 at 03:26
Arthur, yes, that's a typo. Either one of the I's at the end of either MISIPPI or SSISI shouldn't be there; doesn't matter which. — Daniel Apon, Aug 24 '10 at 03:36
You could also answer the question, is it possible to enumerate all subsets with n / 2 letters in polynomial time. There are a number of reductions that can be made to the brute force approach, like limiting it to only contain half of each letter. — Nick Larsen, Aug 26 '10 at 18:30
Wait: a matching is non-nested if a pair is nested ? I am confused — Suresh Venkat, Aug 27 '10 at 22:40
Observation: Using the graph formalism, let 2n be the number of vertices in G. Let G′ be a graph obtained from the line graph of G by adding the edges between vertices corresponding to nested edges of G. The problem asks whether G′ has an independent set of size n. There are various classes of graphs where a maximum independent set can be computed in polynomial time. If we go this route, the question is: What nice properties does G′ have? (more) — Tsuyoshi Ito, Sep 11 '10 at 14:34
(cont’d) Clearly the graphs G′ arising in this way form a class of graphs closed under induced subgraphs. If we can characterize this class nicely, it might lead to a polynomial-time algorithm for the problem. — Tsuyoshi Ito, Sep 11 '10 at 14:35
@Radu: I don't think the fraction of squares to non-squares (over binary alphabets) converges to 1/3. I did some Monte Carlo simulations which indicate a slow convergence to 1/2. Hence in the limit essentially all binary strings with even numbers of 0 and 1 are squares. This is surprising to me, and may be exploitable in an algorithm. For larger alphabets the fraction of squares seems to converge to 0 rapidly. — Martin Berger, Dec 06 '10 at 22:41
I don't think the sequence 1, 2, 6, 22, 128, 509, 2074, 4518, 20141, 33637, 155543, 210678, 1074477 is correct. I get 1,2,6,22,82,320,1268,5102,20632,83972. It would be nice if someone could check to see who is right. — Jeffrey Shallit, Jun 28 '11 at 17:42
I think David's greedy algorithm (maybe with the slight variation of preserving order among two parallel solutions) always returns a string that is a substring of a solution (for example on "AABAAB" returns "AB" that is a substring of "AAB"). Possibly, there is a way to exploit this partial information, other than reducing the complexity of a brute force algorithm, of course. — nicola rebagliati, Aug 25 '10 at 08:29
Since this question is mentioned in today's blog post, let's see if we can get some renewed interest in solving this problem. It's been a year since this question has been brought forward, and we've gained a lot of new users since then. I've put up a 100 rep bounty for the question. — Alex ten Brink, Aug 16 '11 at 10:06
@ Jeffrey, I think you are right. I get 1, 2, 6, 22, 82, 320, 1268, 5102, 20632, 83972, 342468, 1399296, 5720966, 23396618, 95654386, 390868900 (there probably is an integer overflow in the last one, don't have time to check right now). — Martin Berger, Aug 16 '11 at 21:08
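
For anyone who wants to re-check the first few terms of the corrected sequence, here is a small brute-force sketch. It is not Radu's program; the helper names are hypothetical, and the search is exponential, so it is only practical for short lengths.

```python
from functools import lru_cache
from itertools import product

def is_square(s):
    """Brute-force test: is s a shuffle of some string with itself?"""
    @lru_cache(maxsize=None)
    def search(i, pending):
        # pending holds first-copy characters still waiting for a partner.
        if len(pending) > len(s) - i:
            return False
        if i == len(s):
            return True  # pending is empty here because of the check above
        if pending and pending[0] == s[i] and search(i + 1, pending[1:]):
            return True
        return search(i + 1, pending + s[i])
    return search(0, "")

def count_binary_squares(n):
    """Number of binary strings of length 2n that are squares."""
    return sum(is_square("".join(bits)) for bits in product("01", repeat=2 * n))

print([count_binary_squares(n) for n in range(5)])  # [1, 2, 6, 22, 82]
```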

score 76 · Accepted Answer · edited Dec 06 '13 at 20:25

Michael Soltys and I have succeeded in proving that the problem of determining whether a string can be written as a square shuffle is NP-complete. This applies even over a finite alphabet with only $7$ distinct symbols, although our proof is written for an alphabet with $9$ symbols. This question is still open for smaller alphabets, say with only $2$ symbols. We have not looked at the problem under the restriction that each symbol appears only $6$ times (or, more generally, a constant number of times); so that question is still open.

The proof uses a reduction from $3$-Partition. It is too long to post here, but a preprint, "Unshuffling a string is $\text{NP}$-hard", is available from our web pages at:

http://www.math.ucsd.edu/~sbuss/ResearchWeb/Shuffle/

and

http://www.cas.mcmaster.ca/~soltys/#Papers.

The paper has been published in the Journal of Computer and System Sciences:

http://www.sciencedirect.com/science/article/pii/S002200001300189X

edited Dec 06 '13 at 20:25

D.W.

answered Nov 30 '12 at 04:58

Sam Buss

Awesome!! (And to my immense relief, seriously nontrivial.) – Jeffε Nov 30 '12 at 05:21
Thanks. StackExchange was our source for this question. It's a great resource! – Sam Buss Nov 30 '12 at 16:39
@SamBuss a small request: while you cite Jeff's question, you only mention Per Austrin's solution in the text. If you look at the answers, there's a way to generate a formal citation for answers as well (click on the share button and hit the 'cite' link). In that way, you can generate a proper citation for Per's answer as well. I only mention this so that people who make formal contributions on the site can also get formal recognition. Thanks ! and congrats for cracking this problem – Suresh Venkat Nov 30 '12 at 17:20
@SureshVenkat. Thanks for the tip: this is useful. I have added this to the online version of the paper. – Sam Buss Nov 30 '12 at 18:18
The problem of recognizing a square shuffle has now been shown to be hard even on a binary alphabet: https://www.sciencedirect.com/science/article/pii/S0304397519300258 – a3nm Mar 28 '19 at 15:11

score 62 · Answer 2 · answered Aug 27 '10 at 22:10

For the special case you mention when each character appears at most four times, there is a simple reduction to 2-SAT (unless I'm missing something...), as follows:

The crucial point is that for each character, there are (at most) two valid ways of matching the occurrences of the character (the third possibility will be nesting). Use a boolean variable to represent which of the two matchings is chosen. Now an assignment to these variables gives a valid unshuffle of the string iff for every pair of edges that are nested, not both were chosen. This condition can be precisely described by a disjunction of the variables (possibly negated) corresponding to the two characters involved.
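
The reduction can be sketched in Python as follows. This is an illustration under stated assumptions, not Per Austrin's code: the names are hypothetical, a character that occurs twice is modelled as a variable whose two options coincide, and satisfiability is decided by a simple reachability test on the implication graph (polynomial, though not the linear-time SCC algorithm).

```python
from collections import defaultdict
from itertools import combinations

def nested(e, f):
    (a, b), (c, d) = e, f
    return (a < c and d < b) or (c < a and b < d)

def is_square_at_most_four(s):
    """Decide squareness when every character occurs at most four times."""
    positions = defaultdict(list)
    for i, ch in enumerate(s):
        positions[ch].append(i)
    options = []  # options[v][b] = matching used when variable v has value b
    for p in positions.values():
        if len(p) % 2:
            return False  # an odd count can never be split evenly
        if len(p) > 4:
            raise ValueError("a character occurs more than four times")
        if len(p) == 2:
            options.append([[(p[0], p[1])]] * 2)  # only one possible matching
        else:
            options.append([[(p[0], p[1]), (p[2], p[3])],
                            [(p[0], p[2]), (p[1], p[3])]])
    # Literal 2*v + b means "variable v has value b"; lit ^ 1 is its negation.
    implies = defaultdict(set)
    for u, v in combinations(range(len(options)), 2):
        for bu in (0, 1):
            for bv in (0, 1):
                if any(nested(e, f)
                       for e in options[u][bu] for f in options[v][bv]):
                    # Clause "not both", written as two implications.
                    implies[2 * u + bu].add((2 * v + bv) ^ 1)
                    implies[2 * v + bv].add((2 * u + bu) ^ 1)

    def reaches(src, dst):
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            for y in implies[x] - seen:
                seen.add(y)
                stack.append(y)
        return False

    # Unsatisfiable iff some literal and its negation imply each other.
    return not any(reaches(2 * v, 2 * v + 1) and reaches(2 * v + 1, 2 * v)
                   for v in range(len(options)))

print(is_square_at_most_four("AABAAB"))    # True
print(is_square_at_most_four("ABCDDCBA"))  # False
```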

answered Aug 27 '10 at 22:10

Per Austrin

Nice. The same idea generalizes to strings where each character occurs at most six times, but the result is an instance of 5-SAT. :-( – Jeffε Aug 29 '10 at 17:11
This answer is the favorite to win the bounty. – Jeffε Aug 29 '10 at 17:13
so this seems to prove the problem is NPC and why we need long conference and journal proofs? – Turbo Dec 07 '13 at 09:30
@Turbo Much belated, but this doesn't prove the problem to be NPC because 2-SAT is not NPC; it's in P. – Steven Stadnicki Apr 03 '15 at 22:16
Does this reduction to 2-SAT work if the Alphabet size is unbounded? – Mohammad Al-Turkistany Jun 25 '18 at 20:13

Mohammad Al-Turkistany · Answer 3 · 2019-02-25T10:45:43.203

Romeo Rizzi and Stéphane Vialette prove that recognizing square strings is NP-complete in their 2013 paper "On Recognizing Words That Are Squares for the Shuffle Product", by reduction from the longest binary subsequence problem. They state that the complexity of unshuffling a binary string is still open.

An even simpler proof that finding a non-nested perfect matching is NP-complete is given by Shuai Cheng Li and Ming Li in their 2009 paper "On two open problems of 2-interval patterns". However, they use terminology inherited from bioinformatics. Instead of "perfect non-nested matching", they call it the "2-IP-DIS-$\{<, \between\}$ problem". The equivalence between the two problems is described by Blin, Fertin, and Vialette:

The 2-IP-DIS-$\{<, \between\}$ problem has an immediate formulation in terms of constrained matchings in general graphs: Given a graph $G$ together with a linear ordering $\pi$ of the vertices of $G$, the 2-IP-DIS-$\{<, \between\}$ problem is equivalent to finding a maximum cardinality matching $M$ in $G$ with the property that for any two distinct edges $\{u, v\}$ and $\{u', v'\}$ of $M$ neither $\min \{ \pi(u), \pi(v) \} < \min \{ \pi(u'), \pi(v') \}$ and $\max \{ \pi(u'), \pi(v') \} < \max \{ \pi(u), \pi(v) \}$ nor $\min \{ \pi(u'), \pi(v') \} < \min \{ \pi(u), \pi(v) \}$ and $\max \{ \pi(u), \pi(v) \} < \max \{ \pi(u'), \pi(v') \}$ occur.

Update (February 25, 2019): Bulteau and Vialette showed that the decision problem of unshuffling a binary string is NP-complete in their paper "Recognizing binary shuffle squares is NP-hard".

I don't see the connection, and I don't see where the authors claim that unshuffling a string is equivalent to their problem. — Suresh Venkat, Jun 24 '13 at 23:45
They don't say it's equivalent to unshuffling; it's a more general problem. — Jeffε, Jun 25 '13 at 02:03
@SureshVenkat I edited my answer, I hope it is clearer. Basically, what they are saying in the footnote is that any two edges in the matching ($M$) are non-nested. — Mohammad Al-Turkistany, Jun 25 '13 at 03:15
In the actual published version, the equivalence is stated in page 320. http://books.google.com/books?id=1sIkPPUPNMcC&pg=PA311&lpg=PA311&dq=New+results+for+the+2-interval+pattern+problem,&source=bl&ots=VAi76WapuC&sig=qqCCmcy7TAm_L8tu-Mi-21jkL_Q&hl=en&sa=X&ei=BBLJUZfKIoHc8ATmhoCgAQ&ved=0CEwQ6AEwBQ#v=onepage&q=New%20results%20for%20the%202-interval%20pattern%20problem%2C&f=false — Mohammad Al-Turkistany, Jun 25 '13 at 03:51

score 12 · Answer 4 · answered Dec 05 '13 at 14:12

The solution that Sam Buss and I proposed in November 2012 (showing that unshuffling a square is NP-hard by a reduction from 3-Partition) is now a published article in the Journal of Computer and System Sciences:

http://www.sciencedirect.com/science/article/pii/S002200001300189X

answered Dec 05 '13 at 14:12

Michael Soltys

This really ought to be an edit to Sam Buss's earlier answer, rather than a separate answer. You can click "edit" to suggest an edit to someone else's answer, and your edit will be reviewed by other users of the site. – D.W. Dec 06 '13 at 01:25

score 12 · Answer 5 · answered Aug 29 '10 at 15:09

Here is an algorithm that may have some chance of being correct though it seems tricky to prove and I would not bet the house on it...

Let us say that $G$ is purged if for every edge $e$, there exists a (possibly nested) perfect matching of $G$ that uses $e$ and does not use any edge contained in or containing $e$.

It is easy to test if $G$ is purged and if not to find the violating edges. Clearly none of these violating edges can be used in a non-nesting perfect matching of $G$, so it is safe to remove them from consideration. Repeating this process, we obtain a (unique) purged subgraph of $G$ which has a non-nested perfect matching iff $G$ has.
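
Here is a rough Python sketch of the purging step. All names are hypothetical, and the perfect-matching test is a brute-force stand-in that is exponential in general; a polynomial-time version would replace it with a general-graph matching algorithm such as Edmonds' blossom algorithm.

```python
def nested(e, f):
    (a, b), (c, d) = e, f
    return (a < c and d < b) or (c < a and b < d)

def has_perfect_matching(vertices, edges):
    """Brute-force perfect matching test (small graphs only)."""
    if not vertices:
        return True
    v = min(vertices)
    for a, b in edges:
        if v in (a, b):
            rest = vertices - {a, b}
            sub = [(c, d) for c, d in edges if c in rest and d in rest]
            if has_perfect_matching(rest, sub):
                return True
    return False

def purge(vertices, edges):
    """Repeatedly delete edges that violate the 'purged' condition."""
    edges = set(edges)
    while True:
        bad = set()
        for e in edges:
            rest = vertices - set(e)
            # Edges that avoid e's endpoints and are not nested with e.
            allowed = [f for f in edges
                       if f[0] in rest and f[1] in rest and not nested(e, f)]
            if not has_perfect_matching(rest, allowed):
                bad.add(e)
        if not bad:
            return edges
        edges -= bad

# Complete graph on four vertices, i.e. the string "aaaa".
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(sorted(purge({1, 2, 3, 4}, k4)))  # [(1, 2), (1, 3), (2, 4), (3, 4)]
```

On this input the edges (1,4) and (2,3) are removed, leaving a 4-cycle.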

Now comes the leap of faith, which may or may not be correct: the hope is that in a purged graph, if there are still vertices of degree $> 1$, we can do the greedy choice and match the first such vertex to its first neighbor (or equivalently, remove the edges to all its other neighbors).

After the greedy choice we purge the graph again, and so on, and the process ends when the graph is (hopefully) a non-nesting perfect matching.

At first I thought this would be roughly like having a small look-ahead in the greedy algorithm and not really work, but I found it surprisingly difficult to come up with a counterexample.

I'm skeptical about the second greedy phase, but purging the graph seems useful. In the original string context, where the graph is the disjoint union of cliques, can you say anything about the structure of the purged graph? Is it still a disjoint union of cliques? (In other words, can you partition the occurrences of each character in the input string so that characters in different parts cannot be matched?) — Jeffε, Aug 29 '10 at 17:07
For the second question, consider the string 'aaaa'. Purging it removes the edges 1-4 and 2-3, giving a 4-cycle.
Two variations of the second greedy step that would also be sufficient and that I could not find any counterexamples to are:

A purged graph has a non-nested perfect matching iff it has a perfect matching (this seems incomparable to the greedy step).

In a purged graph with a non-nesting perfect matching, every edge is used in some non-nesting perfect matching (this is stronger than both the greedy step and the first item so it should be easier to disprove). — Per Austrin, Aug 29 '10 at 18:37

score 10 · Answer 6 · answered Aug 17 '10 at 01:32

Does this help?

http://users.soe.ucsc.edu/~manfred/pubs/J1.pdf

answered Aug 17 '10 at 01:32

Aaron Sterling

Nice reference. It's hard to see how the results would apply to my problem, but maybe the techniques would help.
It's easy to tell whether a given string X is a shuffle of two copies of another given string Y. The attached paper proves it's NP-hard to decide whether a given string X is a shuffle of any number of copies of another given string Y. I want to know whether a given string X is a shuffle of two copies of SOME UNKNOWN string Y.
– Jeffε Aug 17 '10 at 08:04

David Eppstein · Answer 7 · 2010-08-17T07:36:38.560

NEVER MIND, THIS ANSWER IS WRONG. It fails on input "AABAAB": greedily matching the first two A's with each other makes it impossible to match the remaining symbols. I'm leaving it up rather than deleting it to help others avoid making the same mistake.

It seems to me that it's always safe to match each successive character of the supposed square greedily to another equal character that is in as early a position as possible. That is, I think the following linear time algorithm should work:

Loop through each position i in the input string, i = 0, 1, 2, ..., n-1. For each position i, check whether that position has already been matched against some earlier position in the string. If not, match it against an equal character that comes after position i and after the last already matched position, and is otherwise as early as possible in the string. If no match is found for some character, declare that the input is not a square; otherwise, the square root is the string formed by the first character of each matched pair.

Here it is in Python:

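def sqrt(S):
    matches = []  # pairs (i, k): position i is matched to the later position k
    i, j = 0, 0
    while i < len(S):
        # Skip position i if it is the second half of an earlier match.
        if j < len(matches) and matches[j][1] == i:
            i += 1
            j += 1
            continue
        # Search strictly after both i and the last matched position.
        if matches:
            k = max(matches[-1][1], i) + 1
        else:
            k = i + 1
        while k < len(S) and S[k] != S[i]:
            k += 1
        if k >= len(S):
            raise ValueError("Not a square")
        matches.append((i, k))
        i += 1
    return "".join(S[a] for a, b in matches)

print(sqrt("ABCABDCD"))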

Here i is the main loop variable (the position we're trying to match), j is a pointer into the array of matched pairs that speeds up the check of whether position i is already matched, and k is an index used to search for the character that matches the one at position i. It's linear time because i, j, and k are monotonically increasing through the string and each inner loop iteration increases one of them.

Been there. Done that. :-) – Jeffε Aug 17 '10 at 07:58

Neeldhara · Answer 8 · 2010-08-29T16:22:02.340

Update: It doesn't make sense to talk about the difficulty of finding a perfect matching that is non-nesting and non-crossing, when the labels are from 1 to n, because there is only one such. (Yes, am kicking myself.) However, it would make sense given a larger range on the labels... so I still see some hope, but it might be quite pointless after all. I would certainly have to follow this up some more.

I can think of why it might be hard to find a matching that is non-nesting and non-crossing. Let me call such a matching a disjoint matching. Not sure to what extent this helps, but let me present the reasoning anyway. (I should point out that my argument, as it stands here, is not complete, and the detail I leave out is possibly crucial. However, I imagine that it might be something of a start.)

I'll begin with a slightly different problem. Given a graph $G$ whose edges are colored with $k$ colors, and the vertices are labeled from $1$ to $n$, is there a disjoint matching that contains exactly one edge of each color? This problem appears to be NP-hard (the argument for this is both complete and straightforward - unless I am missing something). The reduction spews out a graph that is a disjoint union of cliques.

The reduction is from Disjoint Factors, an NP-complete problem introduced in [1]. An instance of disjoint factors is given by a string over an alphabet of size $k$, and the question is whether there are $k$ disjoint factors, where a factor is a substring that begins and ends with the same letter; and two factors are disjoint if they don't overlap in the string (note that 'nesting', in particular, is disallowed too).

Let me denote by $a_1,\ldots, a_k$, the elements of the $k$-sized alphabet associated with the Disjoint Factors instance.

Given an instance of disjoint factors, that is, say a string of length $n$, create a graph that has $n$ vertices with vertex labels from $1$ to $n$. Add an edge between vertices $u$ and $v$ if the corresponding positions have the same letter (say $a_i$), and also color the edge $(u,v)$ with color $i$.

The proof of the reduction essentially follows from the definitions. Given $k$ disjoint factors, we clearly have a $k$-disjoint colorful matching, merely pick the edges as given by the disjoint factors, and it is easy to see that the resulting matching is both colorful and disjoint. Conversely, if there is a $k$-disjoint colorful matching, then we have k disjoint factors, one for every letter, because the matching is colorful (and hence picks one factor per letter), and is disjoint (so the corresponding factors would not overlap).
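
A small sketch of this first construction, with hypothetical helper names: it builds the colored edges from the string and then looks, by brute force, for one edge per color such that the chosen intervals are pairwise disjoint. It assumes a factor has length at least two and that exactly one factor is wanted for each letter that occurs in the string.

```python
from itertools import combinations

def colored_edges(s):
    """Edges (u, v, letter) with 1-based positions u < v holding the same letter."""
    return [(u + 1, v + 1, s[u])
            for u, v in combinations(range(len(s)), 2) if s[u] == s[v]]

def has_colorful_disjoint_matching(s):
    """Brute force: one edge per letter, intervals pairwise disjoint."""
    edges = colored_edges(s)
    letters = sorted(set(s))

    def extend(idx, chosen):
        if idx == len(letters):
            return True
        for u, v, c in edges:
            # The new interval [u, v] must lie entirely before or after
            # every interval chosen so far.
            if c == letters[idx] and all(v < a or b < u for a, b in chosen):
                if extend(idx + 1, chosen + [(u, v)]):
                    return True
        return False

    return extend(0, [])

print(has_colorful_disjoint_matching("aabb"))  # True
print(has_colorful_disjoint_matching("abba"))  # False
```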

To get rid of the colors and make the matching perfect, albeit on a possibly larger range, make the following modifications to the graph thus created:

Let $U_a$ denote the subset of vertices that have labels which are positions associated with the letter $a$. If $U_a$ has $A$ vertices, then add $(A-2)$ new vertices and induce a complete bipartite graph between $U_a$ and the newly added vertices. Repeat, of course, for every letter.

Roughly speaking, if the graph is to induce a perfect matching, the newly introduced vertices must be matched with the vertices of $U_a$, and they will saturate all but a pair of vertices, and the edge between the remaining vertices will correspond to the disjoint factor. I have not worked out the numbers that one must associate with the newly added vertices... note that they must be so that the resulting matching is disjoint. I just have a feeling (read: hope) that it 'can be done'!

[1] On problems without polynomial kernels, Hans L. Bodlaender, Rodney G. Downey, Michael R. Fellows and Danny Hermelin, J. Comput. Syst. Sci.

I'm confused. Isn't (1,2), (3,4), (5,6), ..., (n-1, n) the ONLY perfect disjoint matching? — Jeffε, Aug 29 '10 at 10:56
Once I move over to the 'perfect matching' scenario, I modify the construction and add a lot of new vertices (note that I add |U_a|-2 new vertices for every a in the alphabet). Thus, n will blow up accordingly - roughly to 2n-2k, for a k-sized alphabet. I hope I did make it clear that the reduction is incomplete in that I haven't specified what numbers are allotted to the new vertices, but I am hopeful that it can be extended without too much difficulty. However, I certainly have to think about it some before I can say anything more. — Neeldhara, Aug 29 '10 at 13:36
I think that the point of JeffE’s comment is that it is easy to find a perfect matching that is non-nesting and non-crossing (or report the absence thereof) because there is only one possibility. — Tsuyoshi Ito, Aug 29 '10 at 15:13
It's possible that I'm missing something - but how can we be sure that there is only one possibility on a graph that is not even fully specified? :) In the first part of the reduction, I am not looking for a perfect matching, but a colorful one - and thus, automatically, one of size k. In the second part, the description of the reduced instance is incomplete, so I am not sure that a statement about its perfect matchings is even verifiable. Makes sense? — Neeldhara, Aug 29 '10 at 15:29
I am not talking about the content of your proof idea, but I am talking about the first sentence of your answer: “I can think of why it might be hard to find a perfect matching that is non-nesting and non-crossing.” This task is easy for the reason JeffE wrote. — Tsuyoshi Ito, Aug 29 '10 at 15:49
Ah! So I was missing something after all. So yes, I think only the first part makes sense (that is about a k-matching, not perfect), and I obviously don't expect to overcome the further bottleneck anymore. I suppose it might continue to make sense when the labeling is not from 1 to n but from a larger range. And then, of course, the task to reduce it to the problem at hand. I am less sure now about if anything can be salvaged. blush And thanks a ton for the explanation, I was clearly being thick. — Neeldhara, Aug 29 '10 at 16:16
Without the coloring constraint imposed by the disjoint factor problem (at most one edge of each color), finding maximal disjoint matchings is also easy. — Jeffε, Aug 29 '10 at 17:03
Ah yes, that's true. I've briefly shifted my attention to Per Austrin's approach above, though, which made me rethink the possibility of hardness. I can affirm that finding a counter example is surprisingly hard! I'll come back to trying to get something out of this once I have recovered some of my conviction/intuition in favor of hardness :) — Neeldhara, Aug 29 '10 at 17:29

DaniCL · Answer 9 · 2010-12-01T13:29:30.170

EDIT: This is an incorrect answer.

Sylvain suggested an RCG which unfortunately was not appropriate for these "shuffle squares". However, I think there is one (EDIT: not an RCG, see Kurt's comments below!), which looks as follows:

$\begin{aligned} S(Y) & \rightarrow A(\epsilon,Y) & (1) \\ A(X, ZY) & \rightarrow A(XZ,Y) & (2) \\ A(aX, aY) & \rightarrow A(X,Y) \quad \text{ for every } a \in \Sigma & (3) \\ A(\epsilon,\epsilon) & \rightarrow \epsilon & (4) \end{aligned}$

Explanation: recall that we have to match symbols which can appear anywhere in the string, but once we have matched $a$ and $a'$, we can only match $b$ and $b'$ if $a \prec b$ implies $a' \prec b'$ ($\prec$ meaning linear precedence). The idea is that we split the string $(1,2)$ in order to compare prefixes of the halves. If the beginnings of two substrings match, we can reduce the problem to the remaining strings $(3)$. If not, we can transfer some part of the right hand side to the left hand side $(2)$ and see if there is a match at a later position. What's important is that this is only allowed in one direction!

Here's a derivation for $100110101010$ (a counterexample to Sylvain's RCG):

$\begin{aligned} S(100110101010) & \Rightarrow A(\epsilon,100110101010) & (1) \\ & \Rightarrow A(1001,10101010) & (2) \\ & \Rightarrow^* A(01,101010) & (3) \\ & \Rightarrow A(011,01010) & (2) \\ & \Rightarrow^* A(1,010) & (3) \\ & \Rightarrow A(10,10) & (2) \\ & \Rightarrow^* A(\epsilon, \epsilon) & (3) \\ & \Rightarrow \epsilon & (4) \end{aligned}$
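
The four rules can be explored with a direct search over the pairs $(X, Y)$. The sketch below is a plain exhaustive search with memoization, not an RCG parser, and it may visit exponentially many pairs; the function names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derives(x, y):
    """True if A(x, y) can be rewritten to the empty string by rules (2)-(4)."""
    if len(x) > len(y):
        return False  # x can only shrink via rule (3), which also consumes y
    if x == "" and y == "":
        return True   # rule (4)
    if x and y and x[0] == y[0] and derives(x[1:], y[1:]):
        return True   # rule (3)
    # Rule (2): move a non-empty prefix Z of y to the end of x.
    return any(derives(x + y[:k], y[k:]) for k in range(1, len(y) + 1))

def accepted(w):
    return derives("", w)  # rule (1)

print(accepted("100110101010"))  # True
```

The first argument behaves like a queue of characters waiting for a partner, so the search accepts the same strings as a brute-force shuffle-square test on the small cases tried.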

I haven't worked out a formal proof that this grammar indeed captures exactly the "shuffle squares" but it should not be too hard. Sylvain already mentioned that the decision problem for RCGs is polynomial.

edited Dec 01 '10 at 13:29

answered Sep 08 '10 at 14:59

DaniCL

I don't see how could this possibly be implemented in polynomial time: If you start from 000102030 then you can reach $A(x,\epsilon)$ for x equal to any one of the following $2^3$ strings 123, 01230, 01203, 0012030, 01023, 0010230, 0010203, 000102030. (Yes, I looked at the document linked by Sylvain, but it looks all French to me.) – Radu GRIGore Sep 09 '10 at 14:27
I'm not sure whether I understand you correctly, so I'm trying to summarise your arguments:
1. You construct a string of the form $a_0^n a_1 a_0 a_2 a_0 \dots a_n a_0$
2. You claim that an RCG parser has to enumerate $2^n$ steps.
Please clarify:
1. We assume a fixed alphabet. Is there a similar construction for words exceeding $3\cdot|\Sigma|$?
2. I think it is an established result that the languages defined by RCGs are equivalent to PTIME. Parsings uses sort of tabulation. (I'm no expert here.)
What I'm most interested in right now is: does my grammar define the "shuffle squares" language?
– DaniCL Sep 09 '10 at 15:13
Long example with three letters $0^{2n}(1020)^n$. Your grammar is fine. "Tabulation" is also known as dynamic programming. In fact, the pseudocode in the paper linked by Sylvain seems to use memoization. – Radu GRIGore Sep 09 '10 at 15:41
Oh, and I did not claim that "an RCG parser has to enumerate $2^n$ steps:" I simply said that I can't see how it can be done for your grammar, given that there are exponentially many potential derivations in which there isn't much to memoize. I say this simply because I tried to implement it and failed. For Sylvain's grammar, obviously, I didn't fail. :) – Radu GRIGore Sep 09 '10 at 15:55
Sorry for the misinterpretation, and thanks for the long example. I see now what you mean (but I think we can agree on RCG=PTIME?) – DaniCL Sep 09 '10 at 16:52
@DaniCL, I think you can make your grammar even simpler. Since in practice rule 2 will always be followed by rule 3, you can combine them together: A(aX,ZaY)->A(XZ,Y). This will necessitate a small change to the start rule: S(XY)->A(X,Y). (Although your version perhaps makes it clearer what's happening logically.) – Kurt Sep 09 '10 at 17:17
@DaniCL, On second thought... Do the parameters in the RHS of the production rules need to be contiguous ranges of the input? I didn't see that explicitly stated in the definition in the Boullier paper, but that does seem to be how it's being used. In the analysis of the running time of the parsing algorithm, it says that the number of possible arguments to the clauses is O(n^2h) where h is the max arity of the clauses and n is the input length. In your grammar, XZ in general will not be contiguous in the original input. – Kurt Sep 09 '10 at 20:11
@Kurt, good point. This (rule 3) is what, in my understanding, Boullier calls a "combinatorial" rule (Def. 8, page 11). A non-combinatorial RCG does not have discontiguous ranges. In Property 1 (page 11), he says: "For any RCG, there is an equivalent non-combinatorial RCG". He goes on to prove this by a somewhat convoluted construction after which it is "not difficult to see" that it results in an equivalent non-combinatorial grammar. For me at least, it is not at all straightforward, so I will probably be spending the next couple of minutes trying to apply this construction to rule 3. – DaniCL Sep 09 '10 at 21:20
@Kurt, I think you found the flaw. In another paper ("Chinese numbers, MIX, Scrambling, and Range Concatenation Grammars"), Boullier explicitly states: "Of course, only consecutive ranges can be concatenated into new ranges. In any PRCG, terminals, variables and arguments in a clause are supposed to be bound to ranges by a substitution mechanism." This probably means that my grammar is not a valid RCG, that Radu's doubt was reasonable, and that this approach doesn't work either. – DaniCL Sep 09 '10 at 21:45
@Kurt is correct. Without the contiguity restriction, I'm pretty sure I can create a set of production rules that recognize the NP-complete language UNARY 3PARTITION. Any set of non-negative integers can be encoded in unary by a string in the language (1*0)*. UNARY 3PARTITION is the set of all such strings whose encoded set can be partitioned into 3-element subsets, all with the same sum. (See http://en.wikipedia.org/wiki/3-partition_problem .) – Jeffε Sep 09 '10 at 22:37
Grammar for UNARY 3PARTITION: S(X0Y0Z)->A(e,X0,Y0,Z); A(W,1X,Y,Z),A(W,X,1Y,Z),A(W,X,Y,1Z)->A(W1,X,Y,Z); A(W,0X,0Y,0Z)->B(W,XYZ); B(W,e)->e; B(W,X0Y0Z)->C(W,W,X0,Y0,Z); C(W,1V,1X,Y,Z),C(W,1V,X,1Y,Z),C(W,1V,X,Y,1Z)->C(W,V,X,Y,Z); C(W,e,X,Y,Z)->B(W,XYZ) – Radu GRIGore Sep 10 '10 at 15:25

Sylvain · Answer 10 · 2010-08-27T08:00:31.683

The approach doesn't work: decomposing a shuffled square by taking out two matching letters does not result in shuffled squares... See Radu's comments below.

A proposal using Range Concatenation Grammars (RCGs, see http://hal.inria.fr/inria-00073347/en/): I was under the impression that the following simple RCG recognizes your "shuffled squares" language over a finite alphabet $\Sigma$, EDITED after Radu's first comment: $${\small \begin{aligned} S(XY)&\Rightarrow A(X,Y)&(1)\newline A(aX_1, aX_2Y_1Y_2)&\Rightarrow A(X_1,Y_1)\,A(X_2,Y_2)&(2)\newline A(\varepsilon,\varepsilon)&\Rightarrow\varepsilon&(3) \end{aligned}} $$ where $a$ ranges over $\Sigma$ and $\varepsilon$ denotes the empty string.

With the second clause, the grammar checks that a letter from the first word occurrence matches the same letter in the second word occurrence. It then guesses how to match the remaining first-word letters, i.e. $X_1$, with a substring of the remainder, namely $Y_1$. Everything before $Y_1$ therefore belongs to the first word instance; we call it $X_2$ and we guess that it matches some suffix starting at $Y_2$. Note that $Y_1$ and $Y_2$ might contain letters from both instances of the word, but $X_1$ and $X_2$ only contain letters from the first instance.

For example, here is a possible derivation of your string $abcabdcd$: $${\small\begin{aligned} S(abcabdcd) &\Rightarrow A(abc, abdcd) &(\text{by } 1, X=abc, Y=abdcd)\newline &\Rightarrow A(bc,bdcd)\,A(\varepsilon,\varepsilon)&(\text{by } 2, X_1=bc, Y_1=bdcd, X_2=Y_2=\varepsilon)\newline &\Rightarrow A(c,c)\,A(d,d)\,A(\varepsilon,\varepsilon)&(\text{by } 2)\newline &\Rightarrow A(\varepsilon,\varepsilon)\,A(\varepsilon,\varepsilon)\,A(d,d)\,A(\varepsilon,\varepsilon)&(\text{by } 2)\newline &\Rightarrow A(\varepsilon,\varepsilon)\,A(d,d)\,A(\varepsilon,\varepsilon)&(\text{by } 3)\newline &\Rightarrow A(d,d)\,A(\varepsilon,\varepsilon)&(\text{by } 3)\newline &\Rightarrow A(\varepsilon,\varepsilon)\,A(\varepsilon,\varepsilon)\,A(\varepsilon,\varepsilon)&(\text{by } 2)\newline &\Rightarrow^3\varepsilon&\text{i.e. success} \end{aligned}}$$

For $0011$, $${\small\begin{aligned} S(0011)&\Rightarrow A(0,011)\newline &\Rightarrow A(\varepsilon,\varepsilon)\,A(1,1)\newline &\Rightarrow A(1,1)\newline &\Rightarrow^\ast \varepsilon \end{aligned}}$$

Now, Boullier shows in the previously linked paper that there is a polynomial-time dynamic programming algorithm for RCGs, which would answer your question if the above grammar were correct. The idea is that, although I presented above the instantiations of variables $X$, $Y$, etc. as strings, they really are intervals inside the input string, which can be properly tabulated.
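
As a rough way to experiment with the grammar above, here is a minimal Python sketch that applies clauses (1) to (3) directly to strings (rather than to ranges, which is a simplification) and compares the result with an exhaustive check. The function names `derives_A`, `grammar_accepts` and `is_square_bruteforce` are made up for this illustration; the string 00000110 is the one Radu GRIGore raises in the comments.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def derives_A(x, y):
    """True if A(x, y) can be rewritten to the empty string."""
    if not x and not y:
        return True                      # clause (3)
    if not x or not y or x[0] != y[0]:
        return False                     # clause (2) needs a shared first letter
    x1, rest = x[1:], y[1:]
    # Clause (2): split rest as X2 Y1 Y2, then require A(X1, Y1) and A(X2, Y2).
    for i in range(len(rest) + 1):
        for j in range(i, len(rest) + 1):
            x2, y1, y2 = rest[:i], rest[i:j], rest[j:]
            if derives_A(x1, y1) and derives_A(x2, y2):
                return True
    return False


def grammar_accepts(s):
    """Clause (1): try every split of the input into X and Y."""
    return any(derives_A(s[:k], s[k:]) for k in range(len(s) + 1))


def is_square_bruteforce(s):
    """Exhaustive check, only meant for short strings."""
    n = len(s)
    if n % 2:
        return False
    for picked in combinations(range(n), n // 2):
        chosen = set(picked)
        first = [s[i] for i in picked]
        second = [s[i] for i in range(n) if i not in chosen]
        if first == second:
            return True
    return False


for word in ["abcabdcd", "0011", "00000110"]:
    print(word, grammar_accepts(word), is_square_bruteforce(word))
```

On 00000110 the two checks disagree, which is the over-generation pointed out in the comments.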

Is there a derivation that takes S(0011) to $\epsilon$? (There should be one.) — Radu GRIGore, Aug 26 '10 at 15:19
Also, A(10,011010)->A(0,101)A(0,0)->$\epsilon$, but I believe 10011010 is not a square. — Radu GRIGore, Aug 26 '10 at 15:54
Thanks for the feedback; I've changed the grammar a bit, and even have a small intuition of why it might work. — Sylvain, Aug 26 '10 at 16:28
You're welcome. Here's more, for the updated grammar :) A(00,000110)->A(0,011)A(0,0)->$\epsilon$, but 00000110 is not a square. Also, there seems to be no derivation for 100110101010, which is a square. — Radu GRIGore, Aug 26 '10 at 16:43
My my... both over- and under-generating! I sure was hoping to have at least one right. — Sylvain, Aug 26 '10 at 17:13

TonyK · Answer 11 · 2010-08-30T12:45:17.247

Update: As Tsuyoshi Ito points out in the comments, this algorithm has exponential running time.

Original post:

Here is how I would program this in the Real World.

We are given a string S = (S[1],...,S[n]). For each prefix S_r = (S[1],...,S[r]), there is a set {(T_i, U_i)} of pairs of strings, such that S_r is a shuffle of (T_i, U_i), and T_i is a prefix of U_i (i.e. U_i 'starts with' T_i). S_r itself is a square if and only if this set contains a pair (T_i, U_i) with T_i = U_i.

Now, we don't need to generate all of these pairs; we just need to generate the suffix V_i of each string U_i obtained by removing its copy of T_i. This will eliminate a (possibly exponential) number of irrelevant duplicates. Now S_r is a square if and only if this set of suffixes contains the empty string. So the algorithm becomes:

Initialise: SuffixSet = {<empty string>} ; r = 0
Loop: while (r < n) {
  r = r + 1
  NextSuffixSet = {}
  for each V in SuffixSet {
    if (V is non-empty and V[1] == S[r]) Add V[2...] to NextSuffixSet // Remove first character of V
    Add V||S[r] to NextSuffixSet // Append character S[r] to V
    }
  SuffixSet = NextSuffixSet
  }
Now S is a square if and only if SuffixSet contains the empty string.

For instance, if S is AABAAB:

r=0: SuffixSet = {<empty string>}
r=1: S[r] = A; SuffixSet = {A}
r=2: S[r] = A; SuffixSet = {<empty string>, AA}
r=3: S[r] = B; SuffixSet = {B, AAB}
r=4: S[r] = A; SuffixSet = {BA, AB, AABA}
r=5: S[r] = A; SuffixSet = {BAA, B, ABA, AABAA}
r=6: S[r] = B; SuffixSet = {AA, BAAB, <empty string>, BB, ABAB, AABAAB}

We can discard all suffixes that are more than half as long as the input string, so this simplifies to:

r=0: SuffixSet = {<empty string>}
r=1: S[r] = A; SuffixSet = {A}
r=2: S[r] = A; SuffixSet = {<empty string>, AA}
r=3: S[r] = B; SuffixSet = {B, AAB}
r=4: S[r] = A; SuffixSet = {BA, AB}
r=5: S[r] = A; SuffixSet = {BAA, B, ABA}
r=6: S[r] = B; SuffixSet = {AA, <empty string>, BB}
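
For anyone who wants to try it, here is a minimal Python sketch of the loop above, including the pruning of suffixes longer than half the input. It is not the C++ program mentioned below; the name `is_square` and the early exit on odd lengths are choices made for this illustration.

```python
def is_square(s):
    """Return True if s is a shuffle of some string with itself."""
    n = len(s)
    if n % 2:
        return False                 # shortcut: a square has even length
    half = n // 2
    suffixes = {""}                  # pending suffixes V of U, after removing T
    for ch in s:
        next_suffixes = set()
        for v in suffixes:
            if v and v[0] == ch:
                next_suffixes.add(v[1:])     # ch extends T: drop first char of V
            if len(v) < half:
                next_suffixes.add(v + ch)    # ch extends U: append to V
        suffixes = next_suffixes
    return "" in suffixes


for word in ["AABAAB", "ABCABDCD", "ABCDDCBA"]:
    print(word, is_square(word))
```

The set can still become very large, so this is only practical for short inputs.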

I have programmed this in C++, and it works on all the examples given here. I can post the code, if anybody's interested. The question is: can the size of SuffixSet grow faster than polynomially?

I tried this too, but experiments show that the size of SuffixSet seems to grow exponentially in n if the original string is (AB)^n. — Tsuyoshi Ito, Aug 30 '10 at 12:24
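
The growth mentioned in this comment can be probed with a small experiment. The sketch below reruns the same suffix-set loop (with the half-length pruning) and records the size of the set after each character; `suffix_set_sizes` is a hypothetical helper name, and the loop bounds are arbitrary. It only reports sizes for a few small n and does not prove anything about the asymptotics.

```python
def suffix_set_sizes(s):
    """Size of SuffixSet after each character of s (with half-length pruning)."""
    half = len(s) // 2
    suffixes = {""}
    sizes = []
    for ch in s:
        next_suffixes = set()
        for v in suffixes:
            if v and v[0] == ch:
                next_suffixes.add(v[1:])     # remove first character of V
            if len(v) < half:
                next_suffixes.add(v + ch)    # append ch to V
        suffixes = next_suffixes
        sizes.append(len(suffixes))
    return sizes


# Peak set size for the strings (AB)^n, for a few small even n.
for n in range(2, 11, 2):
    print(n, max(suffix_set_sizes("AB" * n)))
```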
