Is it possible to parallelize a matching method?

Question

I am working with matching as described in this paper (*.pdf). The dataset I use is quite large, so I had to extract a subsample from it in order to actually get anywhere. I am using the MatchIt package in R (written in conjunction with the above article).

I use nearest neighbor matching, matching on the propensity score estimated from a logit model.

Now I have been wondering: since the estimation of the logit model is quite fast (2 min for 8,000,000 obs) and the matching search is very slow, would it be possible to parallelize the matching algorithm, using multiple CPUs to speed up the process?

I realize that this is not possible in the package as it stands now, but could it work in theory? Pseudo-code or a quick run-down would be greatly appreciated.

What type of matching are you doing? Greedy matching within a particular caliper should be more amenable to bigger datasets, and other nearest neighbor techniques should be possible after building a k-d tree. Optimal matching would likely be a nightmare, but I haven't seen much evidence that optimal matching is much better than other techniques; see A comparison of 12 algorithms for matching on the propensity score (Austin, 2014) for one example. — Andy W, Jun 04 '15 at 12:36
Nice reference. I will update the question to reflect the method. — Repmat, Jun 04 '15 at 13:37

score 2 · Answer 1 · answered Jan 19 '21 at 05:03

Nearest neighbor matching without replacement cannot be parallelized in a straightforward way. Each match depends on the matches that occurred before it (i.e., the matching must be performed sequentially). This means one could not perform the matching on independent cores that do not communicate with each other. An exception is when combining nearest neighbor matching with exact matching on one or more covariates; in that case, you can split the matching problem into separate matching problems defined by the strata of the covariates to be exactly matched. For example, you could request exact matching on province (i.e., if you had a country-wide dataset), which would split the matching problem into much smaller problems defined by province. In this case, province-specific matching can be performed on separate cores since the matches in one province do not depend on matches in another.
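Since pseudo-code was requested, here is a minimal Python sketch of the stratified idea. It is not how MatchIt implements matching; the function names, the `(unit_id, stratum, is_treated, score)` record layout, and the `map_fn` hook are all assumptions made for illustration. Within a stratum the greedy loop stays sequential, while the strata themselves are independent tasks.

```python
from collections import defaultdict


def greedy_match(treated, controls):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: lists of (unit_id, propensity_score).
    Inherently sequential: each match removes a control from the pool,
    which changes what later treated units can be matched to.
    """
    available = dict(controls)  # control id -> propensity score
    pairs = []
    for t_id, t_ps in treated:
        if not available:
            break  # no controls left in this stratum
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        pairs.append((t_id, c_id))
        del available[c_id]  # without replacement: control is used up
    return pairs


def match_stratum(stratum):
    """Match one stratum given as a (treated, controls) pair."""
    treated, controls = stratum
    return greedy_match(treated, controls)


def match_within_strata(units, map_fn=map):
    """Split units by the exact-matching variable, then match each
    stratum independently.

    units: iterable of (unit_id, stratum, is_treated, score) records
           (a hypothetical layout chosen for this sketch).
    map_fn: any map-like callable; the built-in map runs serially.
    """
    strata = defaultdict(lambda: ([], []))
    for unit_id, stratum, is_treated, ps in units:
        strata[stratum][0 if is_treated else 1].append((unit_id, ps))
    results = map_fn(match_stratum, list(strata.values()))
    return [pair for pairs in results for pair in pairs]
```

With the default `map_fn` this runs serially. Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` instead should distribute the strata across cores, since no stratum needs to see another's matches. The speed-up is limited by the largest stratum.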

Nearest neighbor matching with replacement can be performed in parallel because each match does not depend on the results of other matches. When matching a given treated unit, it doesn't matter whether a control unit has already been matched to another treated unit. The treated units can therefore be partitioned, and the matching for each partition can take place on its own core.
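A minimal Python sketch of that partitioning, under stated assumptions: units are `(unit_id, propensity_score)` tuples, matching is 1:1 on the propensity score alone, and every name here (including the `map_fn` hook and `n_chunks`) is hypothetical rather than anything MatchIt provides. The controls are sorted once so each lookup is a binary search instead of a full scan.

```python
from bisect import bisect_left
from functools import partial


def nearest_control(t_ps, sorted_controls, sorted_scores):
    """Return the id of the control whose score is closest to t_ps.

    sorted_controls: non-empty list of (unit_id, score), sorted by score.
    sorted_scores:   the scores alone, in the same order.
    """
    i = bisect_left(sorted_scores, t_ps)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = sorted_controls[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda c: abs(c[1] - t_ps))[0]


def match_chunk(chunk, sorted_controls, sorted_scores):
    """Match every treated unit in one chunk against the full control pool."""
    return [(t_id, nearest_control(t_ps, sorted_controls, sorted_scores))
            for t_id, t_ps in chunk]


def match_with_replacement(treated, controls, n_chunks=4, map_fn=map):
    """Partition the treated units and match each partition independently.

    Controls are never removed from the pool, so the chunks share no state
    and can be processed in any order or at the same time.
    """
    sorted_controls = sorted(controls, key=lambda c: c[1])
    sorted_scores = [ps for _, ps in sorted_controls]
    chunks = [treated[i::n_chunks] for i in range(n_chunks)]
    worker = partial(match_chunk,
                     sorted_controls=sorted_controls,
                     sorted_scores=sorted_scores)
    results = map_fn(worker, chunks)
    return [pair for pairs in results for pair in pairs]
```

As written this runs serially with the built-in `map`. Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` should spread the chunks over cores, at the cost of copying the control list to each worker.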

MatchIt has been updated since this question was asked and now relies on Rcpp for the matching, which is much faster than the original R-based code implemented when this question was asked. With 8 million observations, though, it will still take an extremely long time, and other methods might be preferred.
