A Stack Overflow user's curious problem of maximising unsortedness

Question

User ddofborg posted on Stack Overflow a programming question which hides a combinatorial optimisation problem.

The idea is the following: given a list of URLs with their respective domain names, he wants to find the permutation which makes the domain names "as unsorted as possible". By this he means that he does not want URLs with the same domain name to appear close to each other but, rather, he would like them to be as spaced apart as possible.

For example, let the following be the input data to the problem:

('host1.com', 'https://host1.com/id/4'),
('host1.com', 'https://host1.com/id/6'),
('host2.com', 'https://host2.com/abc'),
('host3.com', 'https://host3.com/x7'),
('host3.com', 'https://host3.com/f2'),
('host3.com', 'https://host3.com/l3')

In the above list all URLs associated with the same domain are close to each other, so the user wouldn't like this arrangement. The idea is that he will go through the URLs of this list and make some network request. He wants to space requests to the same domain apart, so that the corresponding server doesn't get overloaded. A solution the user would like is, e.g., the following:

('host3.com', 'https://host3.com/x7'),
('host1.com', 'https://host1.com/id/4'),
('host2.com', 'https://host2.com/abc'),
('host3.com', 'https://host3.com/l3'),
('host1.com', 'https://host1.com/id/6'),
('host3.com', 'https://host3.com/f2')

To quantify his "unsortedness" criterion, the user proposes to assign an objective value to each possible permutation of URLs. Before introducing the objective function, however, let me propose some notation which will make the discussion proceed more quickly within our community.

Let $B = \{1, \ldots, m\}$ be the set of bases (in our case, domain names) and let $I = \{1, \ldots, n\}$ be the set of items (in our case, URLs). Denote with $\alpha_b \in \mathbb{N}$ the number of items with base $b$, and with $\beta_i$ the base of item $i$. Let $x_{bj} \in \{0,1\}$ be a binary variable with value 1 iff an item of base $b$ is placed in position $j$ in the permutation.

For each base $b$ the user proposes to measure its "unsortedness" by summing the pairwise distances between all items of base $b$, counting each pair once. For example, if the URLs associated with host3.com take positions 4, 5 and 6 (as in the first example), then the unsortedness score of host3.com is $|4-5| + |4-6| + |5-6| = 4$. If, as in the second example, they take positions 1, 4 and 6, then the score is $|1-4| + |1-6| + |4-6| = 10$.

Formally, the objective function is: $$\sum_{b = 1}^m \sum_{j_1 = 1}^{n-1} \sum_{j_2 = j_1 + 1}^n (j_2 - j_1) x_{b j_1} x_{b j_2}$$ I.e., a quadratic function of variables $x_{bj}$.
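As a quick sanity check of the objective, here is a minimal Python sketch that scores a given ordering. The function name `unsortedness` and the choice to pass only the list of bases (domain names) in permutation order are illustrative assumptions, not part of the original post.

```python
from collections import defaultdict
from itertools import combinations


def unsortedness(bases):
    """Sum, over every base, of the pairwise distances between its positions.

    `bases` lists the base (e.g. the domain name) found at positions 1..n.
    """
    positions = defaultdict(list)
    for j, base in enumerate(bases, start=1):
        positions[base].append(j)  # appended in increasing order of position
    # combinations() yields each unordered pair once, with p < q.
    return sum(q - p
               for pos in positions.values()
               for p, q in combinations(pos, 2))


sorted_order = ["host1.com", "host1.com", "host2.com",
                "host3.com", "host3.com", "host3.com"]
spread_order = ["host3.com", "host1.com", "host2.com",
                "host3.com", "host1.com", "host3.com"]
print(unsortedness(sorted_order), unsortedness(spread_order))
```

With the two example lists from the question, the first ordering scores 1 + 0 + 4 = 5 and the second scores 3 + 0 + 10 = 13.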

I would like to investigate this problem further, see whether it can be modelled as a classical OR problem, and devise an efficient algorithm for it.

Another Stack Overflow user has proposed a Constraint Programming approach (although I am not 100% convinced of its correctness).

I will mark this question as community wiki, so that everyone whose curiosity was picked by this problem can contribute.

It makes me think of the Car Sequencing Problem, maybe it would be relevant to model it the same way — fontanf, Jan 03 '22 at 08:05
This problem was recently featured in a CodeForces contest (although in their version, it is also required how many optimal orderings there are; this is the more difficult part). There is a solution in linear time in the number of bases (after parsing input). The problem is here: https://codeforces.com/contest/1612/problem/G The solution is here: https://codeforces.com/blog/entry/97164 — Mees de Vries, Jan 03 '22 at 15:36
Wow, what are the odds! @MeesdeVries you should make your comment into an answer, adding some more details perhaps. — Alberto Santini, Jan 03 '22 at 16:01
@AlbertoSantini, I do not have the time today to make more than a link-only answer, which I believe is discouraged. Feel free to do it yourself -- if not, I may do it eventually. Just wanted to leave this comment so not too much unnecessary work went into solving this problem (although time spent puzzling is time well spent!). — Mees de Vries, Jan 03 '22 at 16:15
I really hope my comment about actually looking for a sort with similar items having maximized distance prompted this question (but doubt it) - either way, I'm excited to see how this develops (hopefully beyond the pre-defined answer from Mees de Vries for my curiosities sake. (maybe an alt. method) - and of course the other ;) way you can improve may be removing the users links, they might not want that much attention on their profile. — TCooper, Jan 04 '22 at 00:22
"...whose curiosity was picked" I think you mean piqued ;) — J..., Jan 04 '22 at 13:13
See also https://cs.stackexchange.com/questions/41777/efficient-algorithm-to-generate-two-diffuse-deranged-permutations-of-a-multiset — hftf, Jan 05 '22 at 01:30
This metric just seems to be measuring how far apart the min and max of each group is (times, how many members in the same group minus one). Its just going to result in the largest groups having the progressively greatest min/maxes and all other members clumped in the middle. — RBarryYoung, Jan 05 '22 at 15:10
I wonder whether maximising the minimum distance between two items within the same group would work better... — Alberto Santini, Jan 05 '22 at 20:07

Alberto Santini · Answer 1 · 2022-01-03T00:43:43.970

Another possibility is to consider an integer linear extended formulation. Let $\mathcal{A}_b$ be the set of valid assignments for base $b$. I.e., an element of $\mathcal{A}_b$ determines $\alpha_b$ unique positions among the available ones $\{1, \ldots, n\}$, which are the positions taken by items of base $b$ in the solution. Denote with $\mathcal{A} = \bigcup_{b=1}^m \mathcal{A}_b$ the set of all assignments and with $c_a$ the cost of an assignment $a$. Furthermore, let $\gamma_{ab} \in \{0,1\}$ be a parameter which takes value 1 iff $a \in \mathcal{A}$ is a member of $\mathcal{A}_b$ (i.e., if the assignment refers to base $b$) and let $\delta_{aj} \in \{0,1\}$ be another parameter which takes value 1 iff assignment $a$ covers position $j$.

Define one binary variable $y_a$ for each assignment, which will take value 1 iff the assignment is used in the optimal solution. Then a model for our problem is the following: $$\begin{align} \max \quad & \sum_{a \in \mathcal{A}} c_a y_a \\ \text{s.t.} \quad & \sum_{a \in \mathcal{A}} \gamma_{ab} y_a = 1 & \quad & \forall b \in \{1, \ldots, m\} \\ & \sum_{a \in \mathcal{A}} \delta_{aj} y_a = 1 & \quad & \forall j \in \{1, \ldots, n\} \\ & y_a \in \{0,1\} & \quad & \forall a \in \mathcal{A} \end{align} $$ (Following a similar reasoning as the one for the quadratic integer model, I think we can replace both equalities with $\leq$ inequalities.)

The continuous relaxation of this model can be solved via column generation: let $\pi_b$ be the dual values associated with the first constraint and $\mu_j$ those associated with the second constraint. Then, starting with a reduced set of columns, for each base $b$, one can look for improving assignments $a \in \mathcal{A}_b$ such that $c_a - \sum_{j=1}^n \delta_{aj} \mu_j > \pi_b$.

Computing $c_a$ inside the pricing (separation) problem looks non-trivial to me. For example, if we want to use an integer model for pricing (for a given base $b$), we might again end up with a quadratic model such as the following: $$\begin{align} \max \quad & \sum_{j_1 = 1}^{n-1} \sum_{j_2 = j_1 + 1}^n (j_2 - j_1) z_{j_1} z_{j_2} - \sum_{j=1}^n \mu_j z_j \\ \text{s.t.} \quad & \sum_{j=1}^n z_j = \alpha_b \\ & z_j \in \{0,1\} & \quad & \forall j \in \{1, \ldots, n\} \end{align} $$ in which binary variable $z_j$ takes value 1 iff we place an item in position $j$.

To pursue this line of research one should come up with an efficient algorithm to solve the above pricing problem. Small experiments, however, seem to show that the relaxation is tight.
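For very small instances, the subproblem for a single base can simply be enumerated. The sketch below is a brute-force illustration under stated assumptions, not an efficient method: the names `assignment_cost` and `price_base`, the dictionary `mu` of position duals, and the tolerance `tol` are all hypothetical. It uses the improvement condition stated above, $c_a - \sum_j \delta_{aj} \mu_j > \pi_b$.

```python
from itertools import combinations


def assignment_cost(positions):
    """c_a: sum of pairwise distances between the positions of one assignment."""
    return sum(q - p for p, q in combinations(sorted(positions), 2))


def price_base(n, alpha_b, mu, pi_b, tol=1e-9):
    """Return the best improving assignment for one base, or None.

    n       -- number of positions (1..n)
    alpha_b -- number of items of the base being priced
    mu      -- dict mapping position j to the dual value of its constraint
    pi_b    -- dual value of the constraint of this base
    Enumerates all C(n, alpha_b) assignments, so it is only usable for tiny n.
    """
    best, best_value = None, pi_b + tol
    for a in combinations(range(1, n + 1), alpha_b):
        value = assignment_cost(a) - sum(mu[j] for j in a)
        if value > best_value:  # strictly better than the incumbent
            best, best_value = a, value
    return best


zero_duals = {j: 0.0 for j in range(1, 7)}
print(price_base(6, 3, zero_duals, pi_b=0.0))
```

With all duals at zero, any three positions that include 1 and 6 reach the maximum cost of 10, so several columns tie; the enumeration keeps the first one it meets.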

Alberto Santini · Answer 2 · 2022-01-03T11:20:17.613

A first attempt is to model this problem using a special case of the Quadratic Assignment Problem (QAP), in which we want to assign URLs to positions in the permutation. To make the model smaller, we first note that we don't need to provide the exact position of each URL, because URLs with the same hostname are indistinguishable for the purpose of computing the "unsortedness" of the permutation.

Using the notation introduced in the question, we can model this problem as: $$\begin{align} \max \quad & \sum_{b = 1}^m \sum_{j_1 = 1}^{n-1} \sum_{j_2 = j_1 + 1}^n (j_2 - j_1) x_{b j_1} x_{b j_2} \\ \text{s.t.} \quad & \sum_{b = 1}^m x_{bj} = 1 & \quad & \forall j \in \{1, \ldots, n\} \\ & \sum_{j = 1}^n x_{bj} = \alpha_b & \quad & \forall b \in \{1, \ldots, m\} \\ & x_{bj} \in \{0,1\} & \quad & \forall b \in \{1, \ldots, m\}, \; \forall j \in \{1, \ldots, n\} \end{align}$$

We can rewrite the objective function in the typical QAP form $$\min_{\phi \in \mathcal{S}(n)} \sum_{i=1}^n \sum_{j=1}^n a_{\phi(i)\phi(j)} b_{ij}$$ in which $\mathcal{S}(n)$ is the set of all permutations of $n$ elements. Assigning an item $\phi(i)$ to position $i$ and an item $\phi(j)$ to position $j$ incurs a cost which is multiplicatively decomposable into one part which only depends on the two items ($a_{\phi(i)\phi(j)}$) and one part which only depends on their assigned positions ($b_{ij}$). In our case, we would have: $$ a_{ij} = \begin{cases} 1 & \text{ if } \beta_i = \beta_j \\ 0 & \text{ otherwise } \end{cases}\quad b_{ij} = \begin{cases} i - j & \text{ if } i < j \\ 0 & \text{ otherwise } \end{cases} $$ (We use $i-j$ instead of $j-i$ to transform our maximisation problem into the canonical minimisation form of the QAP.)

Such a rewriting could be useful to identify special properties of the matrices $(a_{ij})$ and $(b_{ij})$, which in turn could tell us whether our special-case QAP is easier to solve than the general QAP.

For example, matrix $(b_{ij})$ is Toeplitz because each entry only depends on the difference $i - j$. If matrix $(a_{ij})$ were anti-Monge, then we would know that our problem has nice properties. However, one can unfortunately build a simple example to see that $(a_{ij})$ is not anti-Monge.

Another observation is that it should be possible to replace the second equality constraint with a $\leq$ inequality: in any solution in which we place fewer than $\alpha_b$ items of a base $b$, because $\sum_{b=1}^m \alpha_b = n$, we must be leaving some empty space. Assuming that $\alpha_b > 1$, filling this empty space with an item of base $b$ strictly increases the objective function. And even if $\alpha_b = 1$, we obtain another optimal solution in which equality is satisfied.

Analogously, it should be possible to replace the first equality constraint with a $\leq$ inequality. This follows from the validity of the second constraint and the pigeonhole principle.

A Gurobi implementation of this model already takes a very long time for small instances ($n = 15, 20$), so clearly some more effort is required along this route.

How many bases ($m$) did you have in your Gurobi experiments? — prubin, Jan 03 '22 at 04:37
Can you generate some "real-sized" instances in QAPLIB format? — fontanf, Jan 03 '22 at 11:53
You might want to add some symmetry-breaking constraints (or coax Gurobi to break the symmetry). If $\alpha_b = \alpha_{b'}$, interchanging all base-$b$ items with their base-$b'$ counterparts in a solution creates a "new" solution with the same objective value as the original solution. — prubin, Jan 03 '22 at 15:31
Also, reversing a solution yields another solution with identical objective value. — prubin, Jan 03 '22 at 16:17
@RBarryYoung You should post that comment directly on the question, which proposes this metric. — Stef, Jan 04 '22 at 14:26

score 5 · Answer 3 · answered Jan 03 '22 at 17:06

So, it turns out that this problem admits a linear-time exact algorithm! The intuition is to add to the extremes (first and last position) two items with the same base $b^*$, which must be the base with the most items ($b^* = \text{argmax} \{\alpha_b\}$). Then one can reduce by 2 the number of items in this base ($\alpha_{b^*} \gets \alpha_{b^*} - 2$), cut off the first and last position from the assignment vector, and re-apply the same reasoning recursively. The algorithm ends when all bases have $\alpha_b \leq 1$; at this point we trivially add all remaining items to the "centre" of the assignment vector.
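The outside-in rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Codeforces reference solution: the function name `spread_out` is hypothetical, and it uses a binary heap for brevity, which costs $O(n \log m)$ rather than linear time (bucketing the bases by count would remove the logarithmic factor). It also assumes the base labels are mutually comparable (e.g. all strings), because ties in the heap are broken by label.

```python
import heapq
from collections import Counter


def spread_out(bases):
    """Arrange `bases` by repeatedly putting the most frequent base at both ends."""
    counts = Counter(bases)
    # heapq is a min-heap, so store negated counts to pop the largest first.
    heap = [(-count, base) for base, count in counts.items()]
    heapq.heapify(heap)

    result = [None] * len(bases)
    left, right = 0, len(bases) - 1
    # While some base still has at least two unplaced items...
    while heap and -heap[0][0] >= 2:
        neg_count, base = heapq.heappop(heap)
        result[left] = result[right] = base  # fill the two extreme free slots
        left, right = left + 1, right - 1
        if neg_count + 2 < 0:  # items of this base remain
            heapq.heappush(heap, (neg_count + 2, base))
    # Every remaining base has exactly one item: fill the centre in any order.
    for _, base in heap:
        result[left] = base
        left += 1
    return result


print(spread_out(["host1.com", "host1.com", "host2.com",
                  "host3.com", "host3.com", "host3.com"]))
```

On the question's example this places host3.com at positions 1 and 6 and host1.com at positions 2 and 5, with the two leftover items in the middle, which gives the same objective value (13) as the arrangement the user preferred.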

The reason why this algorithm works is explained in the editorial for problem G of Codeforces Educational Round 117, as spotted by OR Stack Exchange user Mees de Vries.

As an aside, it's amazing that (1) this problem already existed (at least in recreational form) and was proposed less than 3 months ago; (2) user Mees de Vries managed to spot it both here and on Codeforces and to realise it was the same problem; (3) I really should stop throwing modelling-based solutions at every optimisation problem that comes my way... maybe an algorithm/math-based solution exists, and is even more efficient. I will try to remember this in the future!

answered Jan 03 '22 at 17:06

Alberto Santini


I'm not doubting that this algorithm solves the precise formulation of the OP's problem but it doesn't seem to solve the problem in spirit: if one base has a substantial plurality then it will end up clumped at both ends of the list; after that the two most common bases will alternate at each ends until a third base is as common, etc. A better formulation would scale the objective sub-linearly in distance (as the intuitive benefit from moving items of the same base from 1 to 5 spaces apart is greater than moving them from 101 to 105 spaces apart). – Reinstate Monica Jan 03 '22 at 17:43
I'm skeptical this solves the OP's problem. Consider the following starting vector: ["c", "a", "b", "b", "a", "b", "a", "e", "b", "c", "e", "d"]. Applying the recursive algorithm, I get ["b", "a", "b", "c", "e", "a", "d", "e", "c", "b", "a", "b"], which I believe has "sortness" 28. Compare that to ["b", "a", "e", "c", "d", "b", "b", "a", "c", "b", "e", "a"], which has "sortness" 32. Do I have an error in my calculation of the recursive solution? – prubin Jan 03 '22 at 17:53
@prubin I think you must have an error. The first assignment has obj value 64 and the second 61, according to my calculations. I also implemented the algorithm and, in my tests, it consistently finds the optimal solution. – Alberto Santini Jan 03 '22 at 20:45
@ReinstateMonica I don't disagree, but that's what the original poster was using. If you have an alternative objective function which works better for unbalanced bases, that would be interesting. – Alberto Santini Jan 03 '22 at 20:48
Sorry, bug in my R code. The objective function worked on the author's "a" and "b" but not on the examples I used above. – prubin Jan 03 '22 at 23:18
I now match your 61, although I get 66 where you get 64. – prubin Jan 03 '22 at 23:25
@AlbertoSantini, can you say more about the distinction you make between modeling-based approaches and algorithm/math-based solutions? I don't understand the difference. – dgrogan Feb 12 '22 at 00:54

score 4 · Answer 4 · answered Jan 03 '22 at 16:53

A genetic algorithm with a permutation type chromosome works well on this problem. I have an R notebook demonstrating the approach that can be downloaded here. Since a GA is a metaheuristic, there is no guarantee of optimality, but it seems to perform pretty well (and quickly) on test problems.

score 1 · Answer 5 · answered Jan 04 '22 at 01:44

I don't think I can answer from a proper OR perspective, but I had to solve approximately the same problem (have had to multiple times, actually) and got an algorithmic answer that left me saying "well... duh!" in its simplicity. Hopefully you can see the "well duh" answer to translate it mathematically / (im)prove it as appropriate. It's not optimal per the objective function above - I have no idea if it's good, or only "good enough" for what I needed to do.

The only constraint I'm aware of for this answer is that any one site cannot have more entries than (all other sites combined + 1). That rule alone also means that there must be at least two sites.

1. Create a hash with key = domain and value = empty array. An alternative that may work better is a priority queue with the same key / value, and priority of length(value).
2. Append every URL to the appropriate domain array for its domain.
3. Sort the hash by length of array, longest first. Order of equal length arrays does not matter. As in (1) a priority queue would give you a continuously sorted list.
4. Set last_domain to the second entry in your list. That usually means the second-longest array domain, but may be one of N equal-length domains.
5. While hash is not empty:
   1. x = Take the first item from the list that does not match last_domain
   2. By "take" this means remove it from the hash array, thus shortening that domain's array by one. Remove the domain entry if the array becomes length zero.
   3. Output x
   4. Set last_domain = x.domain
   5. If not empty, re-sort last_domain's entry in the list based on its new length N-1, putting it at the end of equal-length entries.
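The steps above can be sketched in Python roughly as follows. This is one possible reading of the procedure rather than the answerer's own code: the function name `interleave` is hypothetical, a plain list re-sorted by hand stands in for the hash or priority queue, and the fallback of reusing the same domain when no other domain is left is an added assumption for inputs that violate the constraint stated earlier.

```python
from collections import defaultdict


def interleave(pairs):
    """Order (domain, url) pairs so consecutive domains differ where possible."""
    queues = defaultdict(list)
    for domain, url in pairs:
        queues[domain].append(url)

    # Longest queue first; sorted() is stable, so ties keep first-seen order.
    order = sorted(queues, key=lambda d: -len(queues[d]))
    last_domain = order[1] if len(order) > 1 else None

    output = []
    while order:
        # First domain differing from the previous pick; if none is left,
        # fall back to the head of the list (assumption, see lead-in).
        idx = next((i for i, d in enumerate(order) if d != last_domain), 0)
        domain = order.pop(idx)
        output.append((domain, queues[domain].pop(0)))
        last_domain = domain

        remaining = len(queues[domain])
        if remaining:
            # Re-insert behind every domain whose queue is at least as long.
            pos = 0
            while pos < len(order) and len(queues[order[pos]]) >= remaining:
                pos += 1
            order.insert(pos, domain)
    return output


urls = [("host1.com", "https://host1.com/id/4"),
        ("host1.com", "https://host1.com/id/6"),
        ("host2.com", "https://host2.com/abc"),
        ("host3.com", "https://host3.com/x7"),
        ("host3.com", "https://host3.com/f2"),
        ("host3.com", "https://host3.com/l3")]
for domain, url in interleave(urls):
    print(domain, url)
```

On the question's six URLs this visits the domains in the order host3, host1, host3, host2, host1, host3, so no two consecutive requests hit the same host.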

For a list of URLs with nice even distribution (all length N the same) you can see that it will effectively round-robin through them, always picking the longest available list to take its item. As each domain is shortened, it gets pushed to the end of the list.

For a list where the distribution is massively uneven (one domain has 1000, 999 domains each have one) it will start with the longest entry, then one of a different domain, then another of the longest, etc. until it has populated the first, last, and every second item in the output with one of the 1000-length items, and 999 items in-between with one-each of the other possibilities.

The options in-between are where things are iffy. If you have two entries A and B with N=10 and then twenty different domains C through V each with N=1 then this algorithm won't meet Glorfindel's optimization, where iterating A,B,other would. This algorithm will iterate between the first two, and then will eventually follow nine A,B pairs with one each from A through V.
