The first thing to work out with this problem is what data is needed where and when. To figure that out, I usually start with the stupid, serial version of the problem.
Find all parcels valued over x $/acre that are within y feet of another parcel that is valued at less than z $/acre.
# Serial O(n^2) scan, written as Python-flavored pseudocode; value() and
# dist() stand in for attribute lookup and distance computation.
def find_parcels(parcels, x, y, z):
    for p in parcels:
        if value(p) > x:
            for q in parcels:
                if q is not p and dist(p, q) <= y and value(q) < z:
                    yield p
                    break  # one qualifying neighbor is enough; avoids duplicate emits
While this O(n^2) algorithm is not optimized, it will solve the problem.
I solved a similar problem for my Master's thesis, which found the nearest parcel for every point in a dataset. I implemented the solution in PostGIS, Hadoop, and MPI. The full version of my thesis is here, but I will summarize the important points as they apply to this problem.
MapReduce is not a good platform for this problem because processing a single parcel requires access to the entire dataset (or a carefully selected subset); MapReduce does not handle secondary datasets well.
MPI, however, can solve this quite handily. The hardest part is deciding how to split the data. The split depends on how much data there is, how many processors you have to run it on, and how much memory you have per processor. For the best scaling (and therefore performance) you will need to hold multiple copies of the parcel dataset in memory (across all your computers) at once.
To explain how this works, assume each of your 50 computers has 8 processors. Each computer is then responsible for checking 1/50 of the parcels. That check is executed by the 8 processes on the computer, each of which holds a copy of the same 1/50 of the parcels (the p set) and 1/8 of the full parcel dataset (the q set). Note that the groups are not limited to a single machine; they can cross machine boundaries.
Each process runs the algorithm, drawing p from its 1/50th set of parcels and q from its 1/8th set. After the inner loop, all the processes on the same computer talk to each other to decide whether each parcel should be emitted.
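Here is a minimal mpi4py sketch of that pattern. It is an illustration, not my thesis code: the 50x8 layout, the load_parcels() stub, the planar dist(), and the thresholds are all assumptions, and real data would come off disk or a database rather than being generated in place.

import math
import numpy as np
from mpi4py import MPI

# Stubs standing in for real data access -- assumptions, not the thesis code.
def value(p):                     # $/acre
    return p["val"]

def dist(p, q):                   # planar distance in feet
    return math.hypot(p["x"] - q["x"], p["y"] - q["y"])

def load_parcels():               # real code would read a file or database
    return [{"x": float(i), "y": 0.0, "val": 1000.0 * (i % 100)}
            for i in range(1000)]

x, y, z = 50_000.0, 500.0, 10_000.0   # illustrative thresholds

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

PROCS_PER_NODE = 8                # assumed 50 nodes x 8 processes; size
node, slot = divmod(rank, PROCS_PER_NODE)  # should be a multiple of 8
n_nodes = max(size // PROCS_PER_NODE, 1)

parcels = load_parcels()
p_set = parcels[node::n_nodes]            # this node's 1/50 of the p's
q_set = parcels[slot::PROCS_PER_NODE]     # this process's 1/8 of the q's

# Inner loop: each process checks its p's against its own q slice only.
partial = np.array(
    [1 if value(p) > x and any(p is not q and dist(p, q) <= y and value(q) < z
                               for q in q_set) else 0
     for p in p_set],
    dtype=np.int32)

# The 8 processes that share a p_set combine their answers with a logical OR.
node_comm = comm.Split(color=node, key=slot)
hits = np.zeros_like(partial)
node_comm.Allreduce(partial, hits, op=MPI.MAX)   # MAX over 0/1 acts as OR

if slot == 0:                     # one process per node emits the results
    for p, hit in zip(p_set, hits):
        if hit:
            print(p)

Run it under mpiexec, e.g. mpiexec -n 400 python parcels.py for the 50x8 layout described above.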
I implemented a similar algorithm for my problem; you can find the source here.
Even with this sort of non-optimized algorithm I was able to obtain impressive results that were highly optimized for programmer time (meaning I could write a stupid simple algorithm and the computation would still be fast enough). The next spot to optimize, if you really need it, is to set up a quadtree index over the second dataset (the one q is drawn from) in each process.
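If you do add that index, you do not have to hand-roll a quadtree: GEOS (via Shapely) ships an STR-packed R-tree that plays the same role. A minimal sketch, assuming Shapely 2.x (where STRtree.query returns candidate indices) and made-up geometries, values, and thresholds:

from shapely.geometry import Point
from shapely.strtree import STRtree

q_points = [Point(0, 0), Point(3, 4), Point(100, 100)]   # stand-in q geometries
q_values = [5_000.0, 8_000.0, 90_000.0]                  # $/acre for each q
y, z = 10.0, 10_000.0                                    # illustrative thresholds

tree = STRtree(q_points)     # build once per process, query once per p

def cheap_neighbor_exists(p):
    # Cheap bounding-box query first, exact distance and value checks second.
    for i in tree.query(p.buffer(y)):
        if q_values[i] < z and p.distance(q_points[i]) <= y:
            return True
    return False

print(cheap_neighbor_exists(Point(1, 1)))   # True: (0, 0) is cheap and close

This turns the inner loop from a full scan per parcel into roughly a logarithmic lookup, which is exactly where the naive version spends its time.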
To answer the original question: there is an architecture, MPI + GEOS. Throw in a little help from my ClusterGIS implementation and quite a lot can be done. All of this software is open source, so there are no licensing fees. I'm not sure how portable it is to Windows (maybe with Cygwin), as I developed it on Linux. The solution can be deployed on EC2, Rackspace, or whatever cloud is available; when I developed it, I was using a dedicated compute cluster at a university.