
I want to solve the Group Closeness Centrality problem where the input is a graph $G=(V,E)$ and an integer $k$ and we want to find a vertex set $S$ of size $k$ minimizing the total distance of the vertices in $V\setminus S$ to $S$. Here, the distance of a vertex $v$ to $S$ is defined as the length of a shortest path starting at $v$ and ending at some vertex of $S$.
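For concreteness, the objective for a given $S$ can be computed with a single multi-source BFS. Here is a minimal stdlib-only Python sketch (illustrative names, unweighted graph as an adjacency-list dict); this is the quantity the ILP below tries to minimize over all $S$:

```python
# Illustration of the objective: total distance from V \ S to S,
# computed with one multi-source BFS started from all of S at once.
from collections import deque

def group_closeness_objective(adj, S):
    """Sum of dist(v, S) over all v not in S (unweighted, connected graph)."""
    dist = {v: 0 for v in S}
    queue = deque(S)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return sum(d for v, d in dist.items() if v not in S)
```

On the path $0-1-2-3-4$ with $S=\{2\}$, for example, this returns $2+1+1+2=6$.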

I am using an ILP formulation with binary variables $x_{v,d}$ where $x_{v,d}=1$ if vertex $v$ has distance $d$ to $S$ and $d\in \{0,\ldots, diam(G)\}$. Here, $diam(G)$ is the diameter of $G$, the length of a longest shortest path in $G$.

The objective function is

$$\sum_{v\in V}\;\sum_{d\in \{1,\ldots,diam(G)\}} d\cdot x_{v,d}.$$ The constraints are

  1. $\sum_{v\in V} x_{v,0} = k$,
  2. $\sum_{d\in \{0,\ldots,diam(G)\}} x_{v,d}=1\, \forall {v\in V}$,
  3. $x_{v,d}\le \sum_{u\in N_d(v)} x_{u,0}\, \forall {v\in V}\, \forall {d\in \{1,\ldots,diam(G)-1\}}$.

The first constraint ensures that we select $k$ vertices, the second that the distance to $S$ is well-defined. The third constraint ensures that for $v$ to have distance $d$ to $S$, we must select a vertex in the $d$-neighborhood $N_d(v)$ of $v$, that is, among the vertices at distance at most $d$ from $v$.

I am looking at improved formulations and I want to use an approach where for every vertex we do not consider variables $x_{v,d}$ for all $d\in \{0,\ldots, diam(G)\}$ but only for values $d\in \{0,\ldots, d(v)\}$, for some value $d(v)$.

Initially, $d(v)=2$ for all $v\in V$. This means that a selected vertex contributes 0 to the objective, a neighbor of a selected vertex contributes 1, and every other vertex contributes 2. If we find a solution where every vertex contributes at most 1 (i.e., if $G$ has a dominating set of size $k$), then we are done. Otherwise, we pick a vertex $v$ that contributes $2$ and increase the $d(v)$-value for this vertex. More precisely, we add a further variable $x_{v,d(v)}$ and the corresponding constraint, and afterwards increase $d(v)$ by one.
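Under my reading of this scheme, the truncated model computes, for each candidate $S$, the objective with each vertex's contribution capped at $d(v)$; that capped value is a lower bound on the true objective, so the loop may stop once the incumbent's true distances respect the caps. The following stdlib-only Python sketch illustrates this, with brute-force enumeration over all size-$k$ subsets standing in for the ILP solve (illustrative only, not the actual CPLEX implementation):

```python
# Sketch of the iterative scheme: caps d(v) start at 2, the "solver"
# minimizes the capped objective, and a cap is raised whenever a vertex
# of the incumbent solution lies farther away than its cap allows.
from itertools import combinations
from collections import deque

def distances_from(adj, S):
    """Multi-source BFS distances from the set S."""
    dist = {v: 0 for v in S}
    q = deque(S)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def solve_with_caps(adj, k, cap):
    """Best S of size k when each vertex's contribution is capped at cap[v].
    Brute force stands in here for the truncated ILP."""
    best = None
    for S in combinations(adj, k):
        val = sum(min(d, cap[v])
                  for v, d in distances_from(adj, set(S)).items())
        if best is None or val < best[0]:
            best = (val, set(S))
    return best

def iterative_group_closeness(adj, k):
    cap = {v: 2 for v in adj}            # start with d(v) = 2 for all v
    while True:
        val, S = solve_with_caps(adj, k, cap)
        dist = distances_from(adj, S)
        over = [v for v, d in dist.items() if d > cap[v]]
        if not over:                     # capped objective equals true one:
            return val, S                # lower bound is attained, so optimal
        cap[over[0]] += 1                # raise one cap and re-solve
```

On the path $0-1-2-3-4-5$ with $k=1$ this terminates with the true optimum 9 (center vertex 2 or 3), even though all caps start at 2.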

After this long explanation, here are my questions:

  1. Is it appropriate to call this approach column generation or does it have another name?
  2. When I implement this (hopefully) improved formulation in CPLEX, is it necessary to solve the main problem from scratch after adding a variable, or is there some way of adding the variables in a lazy manner?
  • I'm not sure your formulation is correct. Let $v\in S$ for a solution $S$. Constraint (2) forces $x_{v,\hat{d}}=1$ for some $\hat{d}\ge 1,$ and the objective function charges a penalty $\hat{d}$ for $v$, even though the objective function allegedly is looking only at distances from $V\backslash S$ to $S.$ Also, I think there is a typo in (3): $x_{0,d}$ should be $x_{u,0}.$ – prubin Jul 12 '22 at 15:37
  • @prubin There is a typo in constraint (2): the sum should start at $d=0$, and this correction allows $d=0$ for $v\in S$. – RobPratt Jul 12 '22 at 16:19
  • @RobPratt My thoughts exactly. – prubin Jul 12 '22 at 16:23
  • Yes this is a typo, thanks for pointing this out. – Christian Komusiewicz Jul 12 '22 at 16:33
  • Is there some intuition for why increasing $d(v)$ for a node that uses $d=2$ in the currently optimal solution is expected to yield an improvement? Changing $x_{v,2}$ from $1$ to $0$ and then taking $x_{v,3}=1$ would make the objective value worse by one unit. Traditional column generation would instead use reduced costs to recommend new columns. – RobPratt Jul 12 '22 at 17:31

3 Answers


I would not call the approach column generation. That term is usually applied to methods where columns are constructed using information from the solution of a previous version of the problem. What you are doing is iterating over progressively larger values of a parameter (the maximum value considered for $d$). So I would characterize it as an "iterative" approach.

As far as "hot-starting" CPLEX goes, you can hand CPLEX a feasible starting solution by manually modifying the solution to the previous problem. For instance, if a vertex $v$ contributes value 2 (the maximum allowed) and you increase the maximum to 3, you can construct a solution to the new problem by copying the solution to the previous problem and then setting $x_{v,2}$ and $x_{v,3}$ appropriately. All the APIs (as far as I know) have methods to set a candidate starting solution.
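A minimal Python sketch of this warm-start construction, assuming variables are keyed by (vertex, distance) pairs; the actual handoff to CPLEX (e.g. via docplex's `Model.add_mip_start` or the corresponding MIP-start call in the other APIs) depends on which API you use and is only indicated in a comment:

```python
# Sketch: extend the previous incumbent into a MIP start for the
# enlarged model. In the old model x[v, old_cap] was the unconstrained
# catch-all; after adding x[v, old_cap + 1] (the new catch-all) and the
# constraint on x[v, old_cap], we move v's unit up one level, which is
# always feasible and at worst one unit more expensive.

def extend_mip_start(prev_solution, v, old_cap):
    """prev_solution maps (vertex, d) keys to 0/1 values of the old model."""
    start = dict(prev_solution)
    start[(v, old_cap)] = 0          # no longer the catch-all variable
    start[(v, old_cap + 1)] = 1      # new catch-all absorbs v's unit
    return start
    # then hand `start` to the solver, e.g. as a docplex MIP start
```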

prubin

Here is an alternative model that might be easier to solve (or not -- see the update at the end). Let $N(i)\subseteq V$ be the set of neighbors of $i\in V$ (nodes connected to $i$ by an edge). We introduce one binary variable $x_i$ for each node $i$ (1 if the node goes in $S$, 0 if not) and two nonnegative continuous variables $y_{i,j}$ and $y_{j,i}$ for each edge $(i,j)\in E,$ representing flows. The objective is $$\min \sum_{(i,j)\in E} \left( y_{i,j} + y_{j,i} \right),$$ which charges each unit of flow 1 for each edge it crosses. The constraints are $$\sum_{i\in V} x_i = k$$ (to get the desired size for $S$) and $$\sum_{j \in N(i)} \left( y_{i,j} - y_{j,i} \right) \ge 1 - M\cdot x_i \quad \forall i\in V.$$ The latter constraints say that the flow out of a node $i$ must be at least one greater than the flow in unless $i\in S.$ For a sufficiently large value of $M$ (I think $M = \vert V \vert - k$ will suffice), the constraint allows a node $i\in S$ to soak up any plausible amount of incoming flow without spitting out anything.

Given a solution for $x,$ the flows that minimize total arc "costs" will automatically have nodes in $S$ soak up all incoming flow and nodes not in $S$ contribute one unit of flow. The unit of flow coming out of a node $i \notin S$ will automatically take the shortest path to a node in $S$ since to do otherwise would inflate the objective value.
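The claim in the last sentence can be checked on small instances without a solver: route one unit from each node outside $S$ along a BFS shortest path toward $S$ and compare the total arc flow with the sum of distances. A stdlib-only Python sketch (names are illustrative, not from the answer):

```python
# Check on a small graph: sending one unit from each node outside S
# down a shortest-path (BFS) tree toward S gives a flow whose total
# arc cost equals the sum of distances from V \ S to S.
from collections import deque

def bfs_parents(adj, S):
    """Multi-source BFS from S; returns distances and shortest-path parents."""
    dist, parent = {v: 0 for v in S}, {}
    q = deque(S)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], parent[w] = dist[u] + 1, u
                q.append(w)
    return dist, parent

def flow_cost(adj, S):
    """Total arc flow when each v not in S routes 1 unit along its
    BFS shortest path; nodes in S absorb all incoming flow."""
    dist, parent = bfs_parents(adj, S)
    y = {}                              # y[(u, w)] = flow on arc u -> w
    for v in adj:
        if v in S:
            continue
        u = v
        while u not in S:               # walk v's unit down the BFS tree
            y[(u, parent[u])] = y.get((u, parent[u]), 0) + 1
            u = parent[u]
    return sum(y.values())
```

Each unit crosses exactly $\mathrm{dist}(v,S)$ arcs, so the total flow cost reproduces the group closeness objective, matching the intuition above.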

The model contains $\vert V \vert$ binary variables and $2\cdot \vert E \vert$ continuous variables.

Update: I tried both this model and that of the question's author on a few sparse graphs, using CPLEX as the solver. The author's model beat mine handily every time.

prubin
  • It might also be worth replacing the third set of original constraints with $x_{i,d} \le \sum_{j \in N(i)} x_{j,d-1}$. – RobPratt Jul 13 '22 at 19:40
  • @RobPratt That seems to be faster than the original model in some cases and slower in others. – prubin Jul 13 '22 at 22:00
  • Thanks for checking. – RobPratt Jul 13 '22 at 22:08
  • I should add that I ran only a few graphs, and not to optimality (maybe a two minute time limit on each), so I can't say anything definitive. – prubin Jul 13 '22 at 22:16
  • I also tried the original model with integrality relaxed for all but the zero distance variables (i.e., $x_{v,0}\in \lbrace 0, 1 \rbrace$ but $x_{v,d}\in [0,1]$ for $d > 0$). This might or might not be equivalent to assigning branching priority to the zero distance variables. In any case, results were again mixed. – prubin Jul 13 '22 at 22:18
  • Thanks for this interesting formulation, I will have a look at it. – Christian Komusiewicz Jul 14 '22 at 14:57

This formulation may be helpful.

Let $d_{ij}$ be the length of a shortest path between vertices $i$ and $j$. Define

$x_j\in \{0,1\}$; where $x_j=1$ if the vertex $j$ is selected in $S$.

$y_{ij}\in \{0,1\}$; where $y_{ij}=1$ if the vertex $i$ is assigned to the selected vertex $j\in S$; i.e. $j\in S$ and minimum distance from $i$ to $S$ is $d_{ij}$.

Now, solve an ILP minimizing $\sum_{i,j} d_{ij}y_{ij}$ subject to the constraints $\sum_j x_j=k$, $\sum_j y_{ij}=1\ \ \forall i$, and $y_{ij}\le x_j \ \ \forall i,j$.
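On small instances, the optimal value of this assignment model, $\min_S \sum_i \min_{j\in S} d_{ij}$, can be checked against a brute-force computation from all-pairs BFS distances. A stdlib-only Python sketch (illustrative only):

```python
# Sanity check of the assignment model: with all-pairs shortest-path
# distances d[i][j], its optimal value is min over size-k sets S of
# sum_i min_{j in S} d_ij. Brute force over subsets, small graphs only.
from itertools import combinations
from collections import deque

def all_pairs_bfs(adj):
    """All-pairs distances in an unweighted graph via one BFS per source."""
    d = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        d[s] = dist
    return d

def k_median_value(adj, k):
    """min_S sum_i min_{j in S} d_ij over all vertex sets S of size k."""
    d = all_pairs_bfs(adj)
    return min(sum(min(d[i][j] for j in S) for i in adj)
               for S in combinations(adj, k))
```

Vertices inside $S$ are assigned to themselves at distance 0, so this is exactly the group closeness objective from the question.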

  • Welcome and +1. This is a $k$-median (also known as $p$-median) formulation. – RobPratt Jul 12 '22 at 13:10
  • Thanks for the suggestion. Actually, I don't want to use this formulation as it needs a quadratic number of variables and we already know that it is slower than the simple formulation described above. – Christian Komusiewicz Jul 12 '22 at 13:13
  • I should add that the simple formulation in the question also has a quadratic number of variables in the worst case, but for real-world social networks (which we are interested in), diam(G) is essentially a constant, so we have an essentially linear number of variables. – Christian Komusiewicz Jul 12 '22 at 13:29
  • But I think this formulation is more useful. Please notify me if my idea is wrong. If you replace the last constraints by $\sum_i y_{ij}\le x_j\ \ \forall j$ then the coefficient matrix $A$ is totally unimodular and solving the LP relaxation produces the ILP solution. – Majid Zohrehbandian Jul 12 '22 at 15:31
  • @MajidZohrehbandian Imposing $\sum_i y_{ij} \le x_j$ would be too strong, forcing at most one $i$ per $j$. – RobPratt Jul 12 '22 at 16:23
  • I should add that the problem is NP-hard, so we cannot hope for polynomial-time algorithms via LP relaxations. – Christian Komusiewicz Jul 12 '22 at 16:37
  • Yes. You are right, and the replacement should be $\sum_i y_{ij}\le Mx_j$. – Majid Zohrehbandian Jul 12 '22 at 16:45
  • For what it's worth, I tried this (using the Java API to CPLEX). The coefficient matrix is apparently not TUM, since CPLEX had to do some branching. On a small graph, both this approach and the approach in the question reached optimality in seconds. On a larger graph (2000 nodes, ~13000 edges), CPLEX spent so much time presolving that I abandoned the attempt. – prubin Jul 13 '22 at 16:08