How to perform clustering of two different sets of entities?

Question

I need to perform clustering of two different entities. Lets say we have transmitters and users. The relation between a transmitters and user is defined by a weight, for example, the weight between transmitter $t$ and user $u$ is denoted as $w_{t,u}, t=1,2,\cdots, T, u=1,2,\cdots, U$.

Maximum cluster size (in terms of number of transmitters). There is no such constraint on the user number in a cluster.
Clusters are mutually exclusive. one user or transmitter can belong to only one clusters.
All the transmitters in a cluster serve only their own users.
The user must belong to a cluster that has the transmitter providing the highest weight for the given user.

The clustering objective is the maximise the signal quality of all the users. The signal quality for user $u$, $q_u$ is defines as

$q_u=\frac{\text{Sum of weights for all the transmitters in user cluster}}{\text{Sum of weights for all the transmitters not in user cluster}}$

Therefore, the objective is $\max \sum_{u=1}^Uq_u$

I am looking for optimal as well as some goo heuristic approach.

Note: We do not have access to coordinates of the users, however have access to the coordinates of the transmitters. As mentioned, the weights are also known.

xposted: https://math.stackexchange.com/questions/4339858/how-to-mathematically-formulate-this-clustering-optimization-problem — Kuifje, Dec 22 '21 at 15:18
Is the number of clusters to be formed fixed (or, if not fixed, limited in any way)? — prubin, Dec 22 '21 at 19:02
This question is remarkably similar to your previous question https://or.stackexchange.com/questions/6023/how-to-mathematically-formulate-the-optimization-problem. I described a GA approach to that problem in a blog post (https://orinanobworld.blogspot.com/2021/04/a-ga-model-for-joint-clustering-problem.html). With a minor modification (assigning each user to the cluster with its best server), that same approach would work here. — prubin, Dec 22 '21 at 19:07

prubin · Accepted Answer · 2021-12-28T19:22:17.543

2

This can be formulated as a MIP model using a gaggle of binary variables. First, we introduce some parameters. $\tau(u)$ is the index of the transmitter with highest weight for user $u$. I will charitably assume this is unambiguous (no ties for closest transmitter). $\overline{Q}_u$ is an upper bound on the possible values of $q_u$ for user $u$. $C$ is the maximum number of transmitters in any cluster and $L$ is an upper bound on the number of clusters.

Let $x_{t,t'}\in \lbrace 0,1 \rbrace$ be 1 if transmitters $t$ and $t'$ are in the same cluster, 0 if not, for $t\neq t'$. $x_{t,t}$ will be 1 if transmitter $t$ "anchors" a cluster and 0 if not. A cluster has exactly one anchor.

Let $y_{t,u}\in\lbrace 0,1 \rbrace$ be 1 if user $u$ is in a cluster with transmitter $t$ and 0 if not.

Here come the constraints. First, we require that the $x$ variables be symmetric (if $t$ is in the same cluster as $t'$, then $t'$ is in the same cluster as $t$):$$x_{t,t'}=x_{t',t}\quad\forall t\neq t'.$$

We also need to enforce transitivity. If $t$ and $t'$ are in the same cluster, and $t$ and $t''$ are in the same cluster, then $t$ and $t''$ are in the same cluster. So we add the following constraints. $$x_{t,t''} \ge x_{t,t'} + x_{t',t''} - 1\quad \forall t,t',t'' \ni t\neq t' \,\&\, t' \neq t'' \,\&\, t \neq t''.$$

We must force every transmitter to belong to a cluster with an anchor, which requires additional variables $v_{t,t'}\in [0,1]$ where $t\neq t'$. The additional constraints are $$v_{t,t'} \le x_{t,t'} \quad \forall t\neq t'$$ $$v_{t,t'} \le x_{t',t'} \quad \forall t\neq t'$$ and $$x_{t,t} + \sum_{t' \neq t} v_{t,t'} = 1.$$ So if $t$ is an anchor ($x_{t,t}=1$), $v_{t,t'}=0$ for all $t' \neq t$; but if $t$ is not an anchor ($x_{t,t}=0$), then $v_{t,t'} =1$ for exactly one $t'\neq t$, and that transmitter $t'$ must be an anchor ($x_{t',t'} = 1$).

We can limit the number of clusters by limiting the number of anchors: $$\sum_t x_{t,t} \le L.$$ (Change that to an equation if you need exactly $L$ clusters.)

The following constraint enforces cluster capacity limits: $$\sum_{t'=1}^T x_{t,t'} - x_{t,t} \le C - 1\quad \forall t.$$This says that the number of transmitters other than $t$ in the same cluster with $t$ is at most $C-1$.

To avoid having multiple anchors in a single cluster, we enforce the following:$$x_{t,t} + x_{t',t'} + x_{t,t'}\le 2 \quad\forall t < t'.$$

Next, we enforce the requirement that each user belong to the cluster containing the best transmitter:$$y_{\tau(u),u} = 1 \quad\forall u.$$To determine what other transmitters serve $u$, we add the following:$$y_{t,u} = x_{\tau(u),t}\quad\forall u,\forall t\neq \tau(u).$$

The objective remains to maximize $\sum_u q_u$. To linearize $q_u$, we introduce variables $z_{t,u}\in [0, \overline{Q}_u]$ for all transmitters $t$ and users $u$, along with the following constraints:$$z_{t,u} \le \overline{Q}_u y_{t,u} \quad\forall t,u$$$$z_{t,u} \le q_u + \overline{Q}_u(1-y_{t,u})\quad \forall t,u$$and$$z_{t,u}\ge q_u - \overline{Q}_u(1-y_{t,u})\quad \forall t,u.$$ The net effect of these is that$$ z_{t,u}=\begin{cases} q_{u} & y_{t,u}=1\\ 0 & y_{t,u}=0 \end{cases}.$$ Finally, we linearize the definition of $q_u$ via the constraints $$\sum_t w_{t,u} (q_u - z_{t,u})=\sum_t w_{t,u}y_{t,u}\quad\forall u,$$which is the result of multiplying both sides of the definition of $q_u$ by the denominator and then simplifying.

Addendum: There's a bit of inherent (undesired) symmetry in the model. If you take any solution and change the "anchor" transmitter in a cluster to a different member of the same cluster, you get the same results but the solver sees it as a different solution. To avoid that, the following optional constraints can be added:$$x_{t',t'} \le 1 - x_{t,t'} \quad \forall t < t'.$$This forces the lowest index transmitter in each cluster to be the anchor. Whether these speed up the solver or not is an empirical question.

Addendum2: Details of both this and an alternate (and apparently less efficient) MIP model, along with construction and improvement heuristics that seem to work really well, are given in a blog post. The blog post also contains a link to my Java code.

edited Dec 28 '21 at 19:22

answered Dec 22 '21 at 17:22

prubin

39,078
3
37
104

thanks for your answer. Yes you answered my previous question on clustering which is very similar to this one. I had some issues with that one. For this problem, I have a big system, i.e., u=1,2,...200, t=1,2,...,20. I think it will be a very big MIP with large number of binary variable! Can we devise some sort of purely heuristic approach. I know that the GA you proposed earlier is heuristic (metaheuristic), I prefer a purely heuristic approach. Is it possible for this particular problem? – KGM Dec 22 '21 at 19:46
is there a typo in the last line? I mean what is $w_t$. – KGM Dec 22 '21 at 19:59
Yes, that was a typo (sorry). I omitted a pair of braces. – prubin Dec 22 '21 at 20:20
400 $x$ variables and 4000 $y$ variables makes for a nontrivial MIP, but I would not rule out the possibility of a good commercial solver handling it. – prubin Dec 22 '21 at 20:23
By "purely heuristic" (as opposed to metaheuristic) do you mean a heuristic designed specifically for this problem? If so, I would assume that you could design multiple problem-specific heuristics. I don't know off-hand how close to optimum they would get. It is worth noting that, given the requirement that each user be grouped with its best transmitter and that absence of a limit on number of users per cluster, this boils down to clustering just the servers. So you could look at existing clustering heuristics for a single type of entity. – prubin Dec 22 '21 at 20:27
then the MIP you just proposed also does not make much sense as we have the absence of a limit on the number of users per cluster? – KGM Dec 22 '21 at 20:46
I don't understand that last comment. The MIP assigns every user to the cluster containing their "best" transmitter, with no limit on customers per cluster. (There is a limit $C$ on transmitters per cluster.) Why does this not make sense? – prubin Dec 22 '21 at 21:58
sorry for the causing confusion. My last comment was on on your proposed solution, but on my problem itself. You said ''given the requirement that each user be grouped with its best transmitter and that absence of a limit on number of users per cluster, this boils down to clustering just the servers''. I am afraid if it still holds for the MIP you proposed. Whether it is MIP, or heuristic, based on the constraints, if it boils down to clustering just the servers, then it doesn't make mush sense to solve this big MIP problem. Instead the focus should be solely on clustering of the transmitters. – KGM Dec 22 '21 at 22:25
In constraint 2, is it the summation of diagonal elements? sum(diag(x))<=L? – KGM Dec 23 '21 at 00:36
what are the good choices for $\overline{Q}$ and $\underline{Q}$ – KGM Dec 23 '21 at 01:11
Again in the last equation, should the left side be $\sum_t w_{t,u} q_u(1 - z_{t,u})$! – KGM Dec 23 '21 at 01:20
Yes on constraint 2. There was a typo (now fixed) in the last constraint. The left side was correct ($z_{t,u}=q_u y_{t,u}$) but the right side should contain $y$, not $z$. As far as choices for bounds on $q_u$, the worst case would be if the only transmitter in the cluster containing $u$ was it's favorite (the one that is mandatory) and the best case might be if $u$ is in a maximum size cluster with the $C$ best transmitters for it. – prubin Dec 23 '21 at 04:04
I don't know why I am getting the sum of diagonal elements of x equal to zero. Are constraint 1 and constraint 2 causing this? Would you please check with a small example. You can randomly generate the weights. – KGM Dec 23 '21 at 20:41
thanks for the correction. In fact, I had already calculated the upper and lower bounds based on your comment, bounds per user. – KGM Dec 23 '21 at 21:11
The model is missing a key constraint. I need to think about how to add it with a minimum number of additional variables and side constraints. – prubin Dec 23 '21 at 22:03
I added two sets of constraints that I missed before, one enforcing transitivity of the $x$ variables and the other (with new continuous variables $v$) forcing every transmitter to belong to a cluster. I also fixed some transcription errors and tested the model on a toy problem, where it produced a plausible solution. I should note that there would be a simpler model if the number of clusters were fixed. – prubin Dec 24 '21 at 04:20
thanks for correction. This seems to be a very complex model. I would prefer a much simpler model even if the number of cluster needs to be fixed. – KGM Dec 24 '21 at 11:22
1

@dipaknarayanan comments are not for extensive back and forth. Please use the chat for that. – EhsanK Dec 24 '21 at 18:28
@prubin the maximum number of clusters is sometimes not satisfied! Would you please share your toy model code! – KGM Dec 24 '21 at 22:27

score 2 · Answer 2 · answered Dec 24 '21 at 23:20

In preliminary tests, a greedy heuristic seems to do rather well for the problem.

The greedy heuristic I tried starts out with each transmitter being a cluster of size 1, and then loops indefinitely. In each pass through the loop, all pairs of clusters are considered for merging. Any merger that would result in a cluster that exceeds the cluster size limit is discarded. The surviving candidate mergers (those meeting the size limit) are evaluated for how they would change the objective value, and the best possible merger is retained. If the best merger results in a net gain, it is implemented and the loop repeats. If the best merger results in a net loss but the current number of clusters exceeds the limit on cluster count, it is implemented anyway (and the loop repeats). You exit the loop when either (a) you are within the limit on number of clusters and there are no beneficial mergers or (b) you have too many clusters but there are no legal mergers (in which case the heuristic has failed to find a feasible solution).

for most of the cases the proposed MIP does not satisfy the maximum cluster number constraints! — KGM, Dec 27 '21 at 14:23

How to perform clustering of two different sets of entities?

2 Answers2