9

I am using SQL Server 2012.

I'm implementing a back end for a mobile app which will have to do proximity searches to find nearby POIs (points os interest). I know it's a very common scenario and looks very simple, but there are many different ways I could implement it, so I would love to see how more experienced professionals are implementing these simple spatial searches.

Since a POI is just a POINT we don't need any complex calculations involving intersections or the like. That's why I initially thought that using GEOGRAPHY columns and spatial indexes could be overkill or even slower than other strategies. So I've narrowed it down to 3 approaches:

1) GEOGRAPHY column + Spatial Index

This is perhaps the de facto solution to this problem. Since we have spatial indexes and geography columns we could just use it and search by distance. Something like this.

SELECT * FROM POIs WHERE Loc.STDistance(@radius) <= @distance;

Since we have a spatial index on Loc, it should be very fast.

2) Using a "bounding box" over Latitude and Longitude columns

This is the trivial approach without involving spatial indexes. We find a bounding box for our point and radius then simply search on the Latitude and Longitude columns. If both are indexed this search should be very fast. We'll have to apply the distance function to filter out some values outside the "circle" but withing the bounding box. But that should be pretty fast. This idea is better explained here: http://www.movable-type.co.uk/scripts/latlong-db.html

Something like this:

DECLARE @lat float
DECLARE @lon float
SET @lat = -23.001029
SET @lon = -43.328422
DECLARE @maxLat float, @minLat float, @maxlon float, @minLon float
DECLARE @R float
DECLARE @distance FLOAT = 100 -- A distance in meters   
SET @R = 6378137 -- Earth
SET @maxLat = @lat + DEGREES(@distance/@R)
SET @minLat = @lat - DEGREES(@distance/@R)

SET @maxLon = @lon + DEGREES((@distance/@R/COS(RADIANS(@lat)))) SET @minLon = @lon - DEGREES((@distance/@R/COS(RADIANS(@lat))))

SELECT * from POIs WHERE Lat Between @minLat And @maxLat And Lng Between @minLon And @maxLon

3) Use an integral GEOHASH stored on an indexed column

This approach is very interesting and it is something people are using together with REDIS ordered sets to do proximity searches. The principle can be transposed to SQL Server by using an indexed column that stores the integral GEOHASH.

I've got this idea from Ardb: https://github.com/yinqiwen/ardb/wiki/Spatial-Index

It's also explained in a little friendlier manner here: Using geohash for proximity searches?

In other words one would compute a GEOHASH with a bit-depth corresponding to the radius of the search one desires, then compute 8 neighbors geohashes and finally submit a search using these geohashes as bounding boxes on the indexed column. This will be 9 BETWEEN operators on the WHERE clause of the SQL... The results will have to be filtered out due to some spurious POI being returned.

But it looks that this will be slower than method 2 as the where clause will be more complex although it will only query over a single column instead of two.

Is there a better/correct approach to this?

PolyGeo
  • 65,136
  • 29
  • 109
  • 338
Loudenvier
  • 193
  • 1
  • 1
  • 5
  • Really it's a 'It Depends' answer. Amount of data you are querying against is definitely a factor. Since you are using SQL Server 2012, the database query should be quite quick. However make sure you follow the rules http://msdn.microsoft.com/en-us/library/ff929109.aspx or the spatial index will not be used. – MickyT Oct 01 '14 at 05:56
  • @MickyT Is the Nearest Neighbor query optimized in a different way? I don't have an order by clause, nor a TOP clause, since I'll be getting all points within the radius. I've created a test database with Lat, Long and a Geometry columng, added 4 million records to it, and the spatial-index based search with STDistance is instantaneous, but the Lat and Long columns with bounding box is also very fast. I'll try to add billions of points to see if one performs better than the other. If not I'll stick with the spatial index! – Loudenvier Oct 01 '14 at 13:25
  • It sounds like your query is using the spatial index. I haven't done a lot of testing on that particular one, just remember reading there was conditions. As another option, if you want to do bounding box searches, you may want to try Filter. http://msdn.microsoft.com/en-us/library/cc645883.aspx – MickyT Oct 01 '14 at 18:00
  • The reason databases implement R-tree indexes for spatial is because they are faster than geohashes or searches on separate x and y indexes. Usage will vary, but it is not overkill to use spatial just because you only have points. You lose nothing by using a geometry type and potentially gain a lot (not just in terms of speed), but in future proofing. What if you want to add buffering or polygon intersection at a later date? Ultimately, the only way to know is to test your use case, but my 2c is use approach 1. – John Powell Oct 02 '14 at 09:34
  • @JohnBarça I did some more testings adding 50,000,000 points and after the query plan computation queries using the spatial index are still almost instantaneous, while the other approaches take a few seconds to finish. I'll make some few more tests: since my queries will run in urban areas I'll add a region/neighborhood/district/city filter (the locations will have been previously reverse geocoded). This may or may not improve search speed. But now that I'm sure the spatial index performs this good with 50000000 points, I'll only try to optimize if there's actual need. – Loudenvier Oct 05 '14 at 09:06
  • @JohnBarça you could add your comment as an answer so I can mark it as the answer. I really believe you're spot on! – Loudenvier Oct 05 '14 at 09:07
  • @Loudenvier. Done, with slightly more info added. Yes, in my experience, there is a big diff between R-tree indexing and the other approaches, which will grow as the table size does. – John Powell Oct 05 '14 at 09:20
  • @Loudenvier. I have spatial tables, with spatial indexes, with over a billion rows, with no obvious deterioration in search speed. It scales very well, though the indexes get large. – John Powell Oct 05 '14 at 09:39

2 Answers2

4

The reason databases implement R-tree indexes for spatial is because they are faster than geohashes or searches on separate x and y indexes. The problem with geohashes, is that you have to search 9 quadrants, not just 1, to do proximity type searches -- see geohash limitations. They are useful in databases that lack R-trees, to allow the expression of an object with a 2-D range, in one dimension, which can then be indexed with a B-tree. Having separate (or compound) indexes on x and y will also be slower, as you need to scan more of the index to zero in on your area of interest, while with R-trees, you index search is on the bounding box.

Usage will vary, but it is not overkill to use spatial just because you only have points. You lose nothing by using a geometry type and potentially gain a lot (not just in terms of speed), but in future proofing. What if you want to add buffering or polygon intersection at a later date? Ultimately, the only way to know is to test your use case, but my 2c is use approach 1.

John Powell
  • 13,649
  • 5
  • 46
  • 62
0

If you end up wanting to explore the Geohash route more deeply, here's a more fleshed-out implementation of Geohash related functions for TSQL in which you might be interested.

As to the debate between the performance of R-Tree Indexes versus Integer-based Geohashes, I have different experiences related to big data scenarios. The trade-off is the same as in software engineering between using index Arrays, HashTables, and Trees. Each has use cases in which they are superior to the other two. The same holds for R-Tree Indexes versus Geohash clustering.