KNN with "Big Data"

Question

I have a point dataset of over 30mill records and I am trying to select the closest 5 lines to each point. Reading this post there seems to be two methodologies

Using <-> however as I want to select the 5 closest I have used the following

select a.id, a.code, array_agg(b.id) AS line_ids
from points p
CROSS JOIN LATERAL (
    SELECT p.id, l.id
    FROM lines l
    ORDER BY a.geom <-> c.geom LIMIT 5
) AS b
where p.id = b.id

This ran for a few hours and then failed on a memory allocation (was above 24Gb of RAM)

Use ST_DWithin to limit to 500m

select p.id, p.code, array_agg(b.id) AS line_ids
from points p
CROSS JOIN LATERAL (
    SELECT p.id, b.id
    FROM lines l
    WHERE ST_DWithin(p.geom, l.geom, 500)
      ORDER BY p.geom <-> l.geom LIMIT 5
) AS b
where p.id = b.id

All columns have an index and a spatial index. IS there any better methods or things I am missing to successfully run this query?

Using PostgreSQL 12, PostGIS 3, Windows 64Bit, 32Gb of RAM

Explain for first method

Nested Loop  (cost=0.41..433780457.02 rows=5 width=24)
  ->  Seq Scan on points p  (cost=0.00..1643568.42 rows=30634442 width=48)
  ->  Subquery Scan on b  (cost=0.41..14.10 rows=1 width=16)
        Filter: (p.id = b.id)
        ->  Limit  (cost=0.41..14.03 rows=5 width=24)
              ->  Index Scan using geometry_geom_idx on open_usrn c  (cost=0.41..3997260.47 rows=1467289 width=24)
                    Order By: (geometry <-> a.geom)

Explain for second method

GroupAggregate  (cost=5244082420.90..5244082421.01 rows=5 width=48)
  Group Key: p.id, b.id
  ->  Sort  (cost=5244082420.90..5244082420.91 rows=5 width=24)
        Sort Key: p.id, p.code
        ->  Nested Loop  (cost=0.66..5244082420.84 rows=5 width=24)
              ->  Seq Scan on points a  (cost=0.00..1797087.20 rows=33052020 width=48)
              ->  Subquery Scan on b  (cost=0.66..158.60 rows=1 width=16)
                    Filter: (p.id = b.id)
                    ->  Limit  (cost=0.66..158.53 rows=5 width=24)
                          ->  Index Scan using geometry_geom_idx on open_usrn c  (cost=0.66..4642.10 rows=147 width=24)
                                Index Cond: (geometry && st_expand(a.geom, '500'::double precision))
                                Order By: (geometry <-> a.geom)
                                Filter: st_dwithin(a.geom, geometry, '500'::double precision)

ST_DWithin adds an unnecessary function call to the execution, while <-> has a more specialized index lookup already included; use option 1. Don't pull the points.id from the LATERAL query. However, in either way, clustered or not, you are doing +30M proximity checks on at least 5 LineStrings each. Consider creating a new table rather then SELECT (why at all?) or UPDATE (for reasons you mentioned in comments), or sub-selecting points in AOIs only (e.g. within bbox: &&). — geozelot, Aug 10 '20 at 17:55
@geozelot it is part of a CREATE TABLE i just left that bit out to simplify my question. I just wanted to confirm that I am not missing anything obvious. Will see if I can break it down further, thanks — tjmgis, Aug 10 '20 at 17:59

KNN with "Big Data"

0 Answers0