Following scenario: I have a table roads, which basically contains the road network of a whole country extracted from the OSM line table and a table points containing millions of GPS tracked points which are part of tracks. Both have a id column, a PostGIS geometry geom column with types LineString (roads) and Point (points) as well a spatial index on those columns. The reference system has SRID 4326. The geometry index is clustered.
Versions: Postgres 9.5, PostGIS 2.2.2
I want to find out which road is the closest for each point. I came up with the following as proposed here: How to find the nearest point by using PostGIS function?
SELECT point.id, road.id, ST_Distance(road.geom, point.geom) AS Distance
FROM roads AS road, points AS point
WHERE road.id = (
SELECT r.id
FROM roads as r, points as p
WHERE point.id = p.id
ORDER BY ST_Distance(r.geom, p.geom) ASC LIMIT 1
);
The result of EXPLAIN ANALYZE where the result is limited to 1 point:
"Limit (cost=2171519.51..4343038.72 rows=1 width=186) (actual time=7944.471..7944.471 rows=1 loops=1)"
" -> Nested Loop (cost=2171519.51..4578064121748.85 rows=2108230 width=186) (actual time=7944.470..7944.470 rows=1 loops=1)"
" -> Seq Scan on points point (cost=0.00..78098.30 rows=2108230 width=36) (actual time=0.006..0.006 rows=1 loops=1)"
" -> Index Scan using roads_id_index on roads road (cost=2171519.51..2171519.95 rows=1 width=150) (actual time=0.010..0.010 rows=1 loops=1)"
" Index Cond: (id = (SubPlan 1))"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on roads r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on de_road_network r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
"Planning time: 0.426 ms"
"Execution time: 7944.560 ms"
This query works fine, but i have an efficiency problem.
The query above takes more than 7 seconds for one point. I think that there is no way to speed this up with the index, since the ST_Distance function cannot use the index. Anyway, since i would like to do this for (~ 3) millions points i would like to speed up the query or to develop a faster one.
Do you have any suggestions on that ?
Some thoughts:
- Using the
&&oeprator, which uses the index. Changing theWHEREclause of the sub-query above to:WHERE point.id = p.id AND p.geom && r.geomresults in a massive increase in speed. See: Nearest Neighbor problem in Postgis 2.0 using GIST Index (<-> function) .
Is this solution accurate ? What happens if a point p has a closest road r1 but p is not in the bounding box of r1 but in the bounding box of r2 ? Is this scenario feasible ?
- Using the
<->operator: I tried to replace theST_Distancewith the<->operator but without success. The spatial index is not involved. I guess it is because
Index only kicks in if one of the geometries is a constant (not in a subquery/cte). e.g. 'SRID=3005;POINT(1011102 450541)'::geometry instead of a.geom
Including a pre-filtering step using other postgis functions ? E.g. with the
ST_DWithinfunction. (See: Find nearest neighbours faster in PostGIS). How would a query look like if i want to specify the distance in meters ?Would it probably make sense to turn this problem into a point in polygon problem by transforming each road into a polygon with a fixed width followed by a point in polygon test ? I guess a lot of points would only match one polygon. And for those which do not, i could make the distance test.
Get an approximative solution (e.g. Using a radial sweep-line as proposed here: Find Nearest Line Segments to Point)
Solution - Final Query: Using CROSS JOIN LATERAL (Credits to Tyler)
SELECT p.id AS point_id, b.id AS road_id, ST_DISTANCE(b.geom, p.geom) AS distance
FROM points p
CROSS JOIN LATERAL (
SELECT r.id AS id, r.geom AS geom
FROM roads r
ORDER BY r.geom <-> p.geom
LIMIT 1
) b;