The first thing to work out with this problem is what data is needed where and when. To figure that out, I usually start with the stupid, serial version of the problem.
Find all parcels valued over x $/acre that are within y feet of another parcel that is valued at less than z $/acre.
# Serial O(n^2) scan, written as Python-flavored pseudocode; value() and
# dist() stand in for attribute lookup and distance computation.
def find_parcels(parcels, x, y, z):
    for p in parcels:
        if value(p) > x:
            for q in parcels:
                if q is not p and dist(p, q) <= y and value(q) < z:
                    yield p
                    break  # one qualifying neighbor is enough; avoids duplicate emits
While this O(n^2) algorithm is not optimized, it will solve the problem.
I solved a similar problem for my Master's thesis, which found the nearest parcel for every point in a dataset. I implemented the solution in PostGIS, Hadoop, and MPI. The full version of my thesis is here, but I will summarize the important points as they apply to this problem.
MapReduce is not a good platform for this problem because processing a single parcel requires access to the entire dataset (or a carefully selected subset); MapReduce does not handle secondary datasets well.
MPI, however, can solve this quite handily. The hardest part is deciding how to split the data. The split depends on how much data there is, how many processors you have to run it on, and how much memory you have per processor. For the best scaling (and therefore performance) you will need to hold multiple copies of the parcel dataset in memory (across all your computers) at once.
To explain how this works, assume each of your 50 computers has 8 processors. Each computer is then responsible for checking 1/50 of the parcels. That check is executed by the 8 processes on the computer, each of which holds a copy of the same 1/50 of the parcels (the p set) and 1/8 of the full parcel dataset (the q set). Note that the groups are not limited to a single machine; they can cross machine boundaries.
Each process runs the algorithm, drawing p from its 1/50th set of parcels and q from its 1/8th set. After the inner loop, all the processes on the same computer talk to each other to decide whether each parcel should be emitted.
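Here is a minimal mpi4py sketch of that pattern. It is an illustration, not my thesis code: the 50x8 layout, the load_parcels() stub, the planar dist(), and the thresholds are all assumptions, and real data would come off disk or a database rather than being generated in place.

import math
import numpy as np
from mpi4py import MPI

# Stubs standing in for real data access -- assumptions, not the thesis code.
def value(p):                     # $/acre
    return p["val"]

def dist(p, q):                   # planar distance in feet
    return math.hypot(p["x"] - q["x"], p["y"] - q["y"])

def load_parcels():               # real code would read a file or database
    return [{"x": float(i), "y": 0.0, "val": 1000.0 * (i % 100)}
            for i in range(1000)]

x, y, z = 50_000.0, 500.0, 10_000.0   # illustrative thresholds

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

PROCS_PER_NODE = 8                # assumed 50 nodes x 8 processes; size
node, slot = divmod(rank, PROCS_PER_NODE)  # should be a multiple of 8
n_nodes = max(size // PROCS_PER_NODE, 1)

parcels = load_parcels()
p_set = parcels[node::n_nodes]            # this node's 1/50 of the p's
q_set = parcels[slot::PROCS_PER_NODE]     # this process's 1/8 of the q's

# Inner loop: each process checks its p's against its own q slice only.
partial = np.array(
    [1 if value(p) > x and any(p is not q and dist(p, q) <= y and value(q) < z
                               for q in q_set) else 0
     for p in p_set],
    dtype=np.int32)

# The 8 processes that share a p_set combine their answers with a logical OR.
node_comm = comm.Split(color=node, key=slot)
hits = np.zeros_like(partial)
node_comm.Allreduce(partial, hits, op=MPI.MAX)   # MAX over 0/1 acts as OR

if slot == 0:                     # one process per node emits the results
    for p, hit in zip(p_set, hits):
        if hit:
            print(p)

Run it under mpiexec, e.g. mpiexec -n 400 python parcels.py for the 50x8 layout described above.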
I implemented a similar algorithm for my problem; you can find the source here.
Even with this sort of non-optimized algorithm I was able to obtain impressive results that were highly optimized for programmer time (meaning I could write a stupid simple algorithm and the computation would still be fast enough). The next spot to optimize, if you really need it, is to set up a quadtree index over the second dataset (the one q is drawn from) in each process.
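If you do add that index, you do not have to hand-roll a quadtree: GEOS (via Shapely) ships an STR-packed R-tree that plays the same role. A minimal sketch, assuming Shapely 2.x (where STRtree.query returns candidate indices) and made-up geometries, values, and thresholds:

from shapely.geometry import Point
from shapely.strtree import STRtree

q_points = [Point(0, 0), Point(3, 4), Point(100, 100)]   # stand-in q geometries
q_values = [5_000.0, 8_000.0, 90_000.0]                  # $/acre for each q
y, z = 10.0, 10_000.0                                    # illustrative thresholds

tree = STRtree(q_points)     # build once per process, query once per p

def cheap_neighbor_exists(p):
    # Cheap bounding-box query first, exact distance and value checks second.
    for i in tree.query(p.buffer(y)):
        if q_values[i] < z and p.distance(q_points[i]) <= y:
            return True
    return False

print(cheap_neighbor_exists(Point(1, 1)))   # True: (0, 0) is cheap and close

This turns the inner loop from a full scan per parcel into roughly a logarithmic lookup, which is exactly where the naive version spends its time.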
To answer the original question: there is an architecture, MPI + GEOS. Throw in a little help from my ClusterGIS implementation and quite a lot can be done. All of this software is open source, so there are no licensing fees. I'm not sure how portable it is to Windows (maybe with Cygwin), as I developed it on Linux. The solution can be deployed on EC2, Rackspace, or whatever cloud is available; when I developed it, I was using a dedicated compute cluster at a university.