
Problem

I use Gurobi with Python's multiprocessing to solve multiple problems in parallel. Even though multiple problems really are solved concurrently, the solving time for a single problem increases significantly, so there is no benefit to using multiprocessing, which is strange.

Possible reasons that have been excluded

  • The number of problems solved in parallel is much smaller than the number of CPU cores. So it is not a case of too many problems for too few cores.
  • Multiple problems really are solved concurrently: I can see that the log file of each problem is written at the same time, and the number of busy CPU cores equals the number of problems. So it is not that the running code fails to achieve multiprocessing.

Code

To reproduce this problem, I solve the TSP (travelling salesman problem) example from the Gurobi tutorial in parallel. With some modifications, the TSP code (in tsp.py) is as follows (you don't need to look into the details of tsp.py, since they won't help solve the problem).

#!/usr/bin/env python3.7

# Copyright 2022, Gurobi Optimization, LLC

# Solve a traveling salesman problem on a randomly generated set of
# points using lazy constraints. The base MIP model only includes
# 'degree-2' constraints, requiring each node to have exactly
# two incident edges. Solutions to this model may contain subtours -
# tours that don't visit every city. The lazy constraint callback
# adds new constraints to cut them off.

import math
import random
from itertools import combinations
import gurobipy as gp
from gurobipy import GRB
from datetime import datetime

# Callback - use lazy constraints to eliminate sub-tours
def subtourelim(model, where):
    if where == GRB.Callback.MIPSOL:
        vals = model.cbGetSolution(model._vars)
        # find the shortest cycle in the selected edge list
        tour = model._subtour(vals, model._n)
        if len(tour) < model._n:
            # add subtour elimination constr. for every pair of cities in tour
            model.cbLazy(gp.quicksum(model._vars[i, j]
                                     for i, j in combinations(tour, 2))
                         <= len(tour) - 1)

# Given a tuplelist of edges, find the shortest subtour
def subtour(vals, n):
    # make a list of edges selected in the solution
    edges = gp.tuplelist((i, j) for i, j in vals.keys() if vals[i, j] > 0.5)
    unvisited = list(range(n))
    cycle = range(n + 1)  # initial length has 1 more city
    while unvisited:  # true if list is non-empty
        thiscycle = []
        neighbors = unvisited
        while neighbors:
            current = neighbors[0]
            thiscycle.append(current)
            unvisited.remove(current)
            neighbors = [j for i, j in edges.select(current, '*')
                         if j in unvisited]
        if len(cycle) > len(thiscycle):
            cycle = thiscycle
    return cycle

class TSP:
    def __init__(self, n, logFile, taskName, rndSeed, model) -> None:
        self.n = n
        self.logFile = logFile
        self.startTime = datetime.now()
        self.taskName = taskName
        self.rndSeed = rndSeed
        self.m = model

    def solveTSP(self):
        # with open(self.logFile,'a') as file:
        #     file.write('TaskName: {}\nRandom Seed: {}\nStart time: {}\n\n'.format(self.taskName,self.rndSeed,self.startTime))

        # Create n random points

        random.seed(self.rndSeed)
        points = [(random.randint(0, 100), random.randint(0, 100)) for i in range(self.n)]

        # Dictionary of Euclidean distance between each pair of points

        dist = {(i, j):
                math.sqrt(sum((points[i][k]-points[j][k])**2 for k in range(2)))
                for i in range(self.n) for j in range(i)}

        self.m.setParam('Threads', 1)
        self.m.setParam('LogFile', self.logFile)
        self.m.setParam('LogToConsole', 0)
        self.m._subtour = subtour
        self.m._n = self.n

        # Create variables

        vars = self.m.addVars(dist.keys(), obj=dist, vtype=GRB.BINARY, name='e')
        for i, j in vars.keys():
            vars[j, i] = vars[i, j]  # edge in opposite direction

        # You could use Python looping constructs and self.m.addVar() to create
        # these decision variables instead.  The following would be equivalent
        # to the preceding self.m.addVars() call...
        #
        # vars = tupledict()
        # for i,j in dist.keys():
        #   vars[i,j] = self.m.addVar(obj=dist[i,j], vtype=GRB.BINARY,
        #                        name='e[%d,%d]'%(i,j))

        # Add degree-2 constraint

        self.m.addConstrs(vars.sum(i, '*') == 2 for i in range(self.n))

        # Using Python looping constructs, the preceding would be...
        #
        # for i in range(n):
        #   self.m.addConstr(sum(vars[i,j] for j in range(n)) == 2)

        # Optimize model

        self.m._vars = vars
        self.m.Params.LazyConstraints = 1
        self.m.optimize(subtourelim)

        vals = self.m.getAttr('X', vars)
        tour = subtour(vals, self.n)

        with open(self.logFile, 'a') as file:
            file.write('TaskName: {}\nRandom Seed: {}\nOptimal tour: {}\nOptimal cost: {}\nStart time: {}\nSolved time: {}\n\n'.format(
                self.taskName, self.rndSeed, str(tour), self.m.objVal, self.startTime, datetime.now()))
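For reference, a minimal single-process use of this class looks like the following sketch (the instance size and log-file name are arbitrary examples):

import gurobipy as gp
from tsp import TSP  # the class defined above

# a fresh environment and model, disposed of automatically by the context manager
with gp.Env() as env, gp.Model(env=env) as model:
    TSP(50, 'single.log', 'singleRun', 1, model).solveTSP()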

And I run it in parallel with the following code:

from tsp import TSP # import from tsp.py file defined above
import multiprocessing as mp
import gurobipy as gp

# function that every process calls
def tspSolve(params):
    n = params['n']
    logFile = params['logFile']
    taskName = params['taskName']
    rndSeed = params['rndSeed']
    with gp.Env() as env, gp.Model(env=env) as model:  # new env per process
        tspModel = TSP(n, logFile, taskName, rndSeed, model)
        tspModel.solveTSP()  # solve tsp

if __name__ == "__main__":
    caseNum = 8  # 8 problems in parallel
    processPool = mp.Pool(processes=caseNum)
    tspParamsList = []  # collect solving parameters of every problem
    for i in range(caseNum):
        newTspParams = {
            'n': 300,  # number of tsp nodes
            'logFile': '/home/littleqjy/workspace/code/myTSP{}.log'.format(i + 1),
            'taskName': 'myTSP{}'.format(i + 1),
            'rndSeed': i + 1
        }
        tspParamsList.append(newTspParams)
    for tspParams in tspParamsList:
        processPool.apply_async(func=tspSolve, args=(tspParams,))  # submit problems
    processPool.close()
    processPool.join()  # wait for all problems to be solved
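To measure per-problem wall-clock time without parsing the Gurobi logs, the driver above could be wrapped in a hypothetical timing helper like this (a sketch; timedSolve is my own name, not part of the original code):

import time
import multiprocessing as mp

import gurobipy as gp
from tsp import TSP  # the tsp.py module shown above

def timedSolve(params):
    # hypothetical wrapper: solve one instance and return its wall-clock time
    start = time.perf_counter()
    with gp.Env() as env, gp.Model(env=env) as model:
        TSP(params['n'], params['logFile'], params['taskName'],
            params['rndSeed'], model).solveTSP()
    return params['taskName'], time.perf_counter() - start

if __name__ == "__main__":
    caseNum = 8
    paramsList = [{'n': 300,
                   'logFile': 'myTSP{}.log'.format(i + 1),
                   'taskName': 'myTSP{}'.format(i + 1),
                   'rndSeed': i + 1} for i in range(caseNum)]
    with mp.Pool(processes=caseNum) as pool:
        for name, seconds in pool.map(timedSolve, paramsList):
            print('{}: {:.1f} s'.format(name, seconds))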

When I change caseNum from 4 to 8 (4 problems to 8 problems), the log files show that the solving time for a single problem doubles, so there is no benefit to using multiprocessing, which is strange. Does anybody know why this happens?

  • How many physical (not logical) cores does your machine have? – Nikos Kazazakis Aug 11 '22 at 14:52
  • @NikosKazazakis 12, which is still larger than the tested scale. – Qiu Junyan Aug 11 '22 at 15:20
  • Is it possible that your machine starts using virtual memory once you spawn too many processes or are the problems tiny? – Nikos Kazazakis Aug 11 '22 at 16:07
  • "I can see the log file of each problem is written at the same time" Are your files going to rotational storage, or random access storage (e.g. SSD or ramdisk/tmpfs)? Sharing rotational storage between threads is a case where parallel performance is less than sequential, because time spent seeking disk heads is not doing useful work for any job. – Ben Voigt Aug 11 '22 at 20:54
  • @NikosKazazakis There are still several GB of free main memory at all tested scales, so there is no reason for virtual memory to be used. As for whether the problems are tiny: a single problem needs 3 minutes to solve in a single process, so the overhead of creating a process is far smaller than the solving time. – Qiu Junyan Aug 12 '22 at 02:21
  • @BenVoigt When I disable the log option (so that no log file is written), the problem still exists. I suspect the problem is related to Gurobi's implementation, and I will investigate its implementation details next. – Qiu Junyan Aug 12 '22 at 02:28
  • Could you also try model/environment handling like in the link you provided? E.g. use a with context manager for the model and env, or call dispose() at the end of solveTSP (for both the env and the model). – David Torres Aug 12 '22 at 08:47
  • @QiuJunyan So just to double-check: you're doing single-thread runs, and the code scales for up to 4 processes but performance starts declining after that? It's quite important to know whether it actually scales (even with 2 processes) or not at all. – Nikos Kazazakis Aug 12 '22 at 08:59
  • @NikosKazazakis A single problem under 2 processes is solved just as fast as under 4 processes, which matches expectations. But going from 4 processes to 8, the solving time for a single problem doubles, which is strange since I have 12 physical cores. – Qiu Junyan Aug 12 '22 at 10:27
  • @QiuJunyan but is the 2-process run 2 times faster than solving those 2 problems sequentially? – Nikos Kazazakis Aug 12 '22 at 12:23
  • @NikosKazazakis Yes. The total solving time under 2 processes is half that of sequential solving, because the solving time for each single problem is the same as in sequential solving. But under 8 processes, the solving time for each single problem is twice that of sequential solving. As the scale goes up it gets slower and slower, even though the physical cores are still sufficient. – Qiu Junyan Aug 12 '22 at 12:51
  • I am asking the Gurobi support team about this. If the reason is found, I will share it under this post. – Qiu Junyan Aug 12 '22 at 12:55

1 Answer


According to the responses of Jaromił from the Gurobi support team, the most likely reason is that the miss ratio of the CPU's caches increases as the number of processes increases, even when physical cores are still sufficient.

I used perf stat -B -e cache-references,cache-misses,cycles,instructions python <fileName.py> to inspect the cache-miss ratio under different numbers of processes. The relationship between the number of parallel processes, the cache-miss ratio, and the solving time for a single problem is as follows:

  • 1 process: 0.783% cache misses, 130 seconds per problem
  • 2 processes: 1.204% cache misses, 130 seconds per problem
  • 4 processes: 6.583% cache misses, 150 seconds per problem
  • 8 processes: 16.848% cache misses, 300 seconds per problem

Based on the measurements above, the high cache-miss ratio is, as far as I can see, a reasonable explanation.

To make the best use of Gurobi with multiprocessing, try different numbers of processes and select the count at which parallelism still works well; a small calibration sketch is shown below.
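For example, a calibration loop like the following could measure throughput for several pool sizes and pick the best (a sketch; solveOne is a hypothetical stand-in for any CPU-bound per-process solve, such as tspSolve from the question):

import time
import multiprocessing as mp

def solveOne(seed):
    # hypothetical stand-in for a real per-process solve such as tspSolve
    return sum(i * i for i in range(2_000_000 + seed))

def throughput(processes, nProblems=8):
    # problems solved per second with a given pool size
    start = time.perf_counter()
    with mp.Pool(processes=processes) as pool:
        pool.map(solveOne, range(nProblems))
    return nProblems / (time.perf_counter() - start)

if __name__ == "__main__":
    results = {p: throughput(p) for p in (1, 2, 4, 8)}
    best = max(results, key=results.get)
    print('throughput per pool size:', results)
    print('best number of processes:', best)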

  • Because while you may have 12 physical cores, they aren't independent: groups of cores share cache. – Ben Voigt Aug 12 '22 at 17:51