Match rows from different tables by comparing similar lat lon coordinates

Question

I am new to PostgreSQL and PostGIS. The first project I am working on is the first I am struggeling with and the one I would like to discuss.

I have 2 stops tables I call for simplicity tbl_stops_1 and tbl_stops_2

tbl_stops_1

 feed_index            | integer              |           | not null |         | plain    |              | 
 stop_id               | text                 |           | not null |         | extended |              | 
 stop_name             | text                 |           |          |         | extended |              | 
 stop_lat              | double precision     |           |          |         | plain    |              | 
 stop_lon              | double precision     |           |          |         | plain    |              | 
...
 the_geom              | geometry(Point,4326) |           |          |         | main     |              | 
Indexes:
    "stops_pkey" PRIMARY KEY, btree (feed_index, stop_id)
    "stop_geom_idx" gist (the_geom)
Triggers:
    stop_geom_trigger BEFORE INSERT OR UPDATE ON gtfs.stops FOR EACH ROW EXECUTE PROCEDURE gtfs.stop_geom_update()

Row count:

select count(1) from gtfs.stops;
 count  
--------
 523887

tbl_stops_2

feed_index       | integer               |           | not null |                                               | plain    |              | 
 seqno            | bigint                |           | not null | nextval('gtfs.stops_zhv_seqno_seq'::regclass) | plain    |              | 
...
 the_geom         | geometry(Point,4326)  |           |          |                                               | main     |              | 
 SeqNo            | text                  |           |          |                                               | extended |              | 
 Type             | text                  |           |          |                                               | extended |              | 
 DHID             | text                  |           |          |                                               | extended |              | 
 Parent           | text                  |           |          |                                               | extended |              | 
 Name             | text                  |           |          |                                               | extended |              | 
 Latitude         | text                  |           |          |                                               | extended |              | 
 Longitude        | text                  |           |          |                                               | extended |              | 
...
Indexes:
    "stops_zhv_pkey" PRIMARY KEY, btree (feed_index, seqno)
Foreign-key constraints:
    "stops_zhv_feed_index_fkey" FOREIGN KEY (feed_index) REFERENCES gtfs.feed_info(feed_index) ON DELETE CASCADE
Triggers:
    stop_zhv_geom_trigger BEFORE INSERT OR UPDATE ON gtfs.stops_zhv FOR EACH ROW EXECUTE PROCEDURE gtfs.stop_zhv_geom_update()

Row count:

select count(1) from gtfs.stops_zhv;
 count  
--------
 849453

Those 2 stops tables are imported into PostgreSQL from different sources. Though they have similarities, they are not the same. In particular tbl_stops_1 lacks in information in the tbl_stops_1.stop_name column. Those rows missing information I like to combine with information from the column tbl_stops_2.DHID like this.

tbl_stops_1.stop_name = tbl_stops_2.DHID.stringConCat(tbl_stops_1.stop_name);

How to match those two stops tables?

The values of lat and lon are quite similar in those two stops tables. I can match a stop row from table tbl_stops_1 with a stop row in table tbl_stops_2 like this.

SELECT * FROM tbl_stops_1 JOIN tbl_stops_2 ON ST_DWithin(tbl_stops_1.the_geom, tbl_stops_2.the_geom, ???);

What am I aiming for?

At the end, I would like to have a view or table where the above string concatenation is done for stops that match because their geographic coordinates are close to each other.

Edit no. 1

I updated the definition to that the stops tables are more alike and * added row counts.
I like to have a the_geom column in both tables to facilitate this project. Anyhow, the function to import the CSV data into the database does not like to do this yet. I'll touch base.

Edit no. 2

I updated the definition to that the stops tables are more alike and * added row counts.
I updated the matching query using JOIN

Open Issue

How make the JOIN work and combine it with the string concatenation?

Those tables are not at all alike, because the second one doesn't have geometry. A ST_Distance query cannot use an index, and will therefore be very inefficient. You especially don't want to join to the table without the geometry column, and you want a JOIN to do so. Please [Edit] the Question to contain row counts for each table and explain what the query actually does. — Vince, May 22 '23 at 11:13
What do you mean with index and that the ST_Distance query cannot use it? What are row counts? I'll asap edit the question. Cheers anyways! — dancesWithCycles, May 22 '23 at 12:03
A row count is the number of features (rows) in a table. Indexes are what make databases worth using. — Vince, May 22 '23 at 12:12
@Vince: I am on my way to make both tables more similar to facilitate this project. I know, I am not there yet. Anyhow, do you think I can JOIN the stops tables already now? — dancesWithCycles, May 23 '23 at 12:07
If you use a ST_DistanceSphere on ST_MakePoint features, PostgreSQL will need to do 406046477655 geodetic comparisons. How long are you willing to wait? I think you need to spend some more time on geometry table and index construction before you're ready to JOIN. If you're not using ST_DWithin for the JOIN, you're going to be disappointed. See https://gis.stackexchange.com/questions/247113/setting-up-indexes-for-postgis-distance-queries/247131#247131 — Vince, May 23 '23 at 13:35
@Vince The JOIN query I am using now takes ages and returns zero rows. I am wondering what is the unit of the distance value in the ST_DWidth() function: meter, degree, aso.? Can you suggest anything for optimization? — dancesWithCycles, May 25 '23 at 14:06

score 3 · Accepted Answer · answered May 25 '23 at 14:32

A plain proximity search (e.g. via ST_DWithin) comes with the drawback that it finds all geometries within the given distance threshold - this may work out in cases where the threshold is in accordance with the overall distribution, but for the given task in general a proximity search is ambiguous.

I suspect that the problem definition could in fact be re-phrased into

"Find the single, nearest stop in table a for each stop in table_b, and create a new name from its row values.

This is referred to as (spatial) (K) Nearest Neighbor search, and can be executed efficiently using a LATERAL join in conjunction with the PostGIS implementation of the <-> distance operator.

There is a decision to make that may have a significant impact: find the closest stop in table a for each stop in table b - or vice cersa. If there is a 1:1 match between stops in both tables, with only tiny differences in their coordinate values, this decision likely has no effect.

The joined table (right hand table in the join) needs a spatial index for this to be efficient.

For this example I will assume you want to find the closest stop in tbl_stops_2 to each stop in tbl_stops_1 (spatial index on tbl_stops_2 will be used):

SELECT
  --t1.<columns>,
  --t2.<columns>,         -- these need to be selected in the LATERAL sub-query
  CONCAT_WS("_", t2."DHID", t1.stop_name) AS stop_name
FROM
  tbl_stops_1 AS t1
  CROSS JOIN LATERAL (
    SELECT
      -- _ts.<columns>,
      "DHID"              -- wrapped in "" if actually uppercase
    FROM
      tbl_stops_2 AS _t2
    ORDER BY
      t1.the_geom <-> _t2.the_geom
    LIMIT
      1
  ) AS t2
;

Similarly, you could UPDATE your tbl_stops_1 table:

UPDATE
  tbl_stops_1 AS t1
SET
  stop_name = (
    SELECT
      CONCAT_WS("_", t2."DHID", t1.stop_name)
    FROM
      tbl_stops_2 AS t2
    ORDER BY
      t1.the_geom <-> t2.the_geom
    LIMIT
      1
  )
;

In both cases, the concatenated stop_name will have the structure

<t2."DHID">_<t1.stop_name>

with _ being the separator - change this the way you need it.

I updated about 50 thousand columns from table tbl_stops_1 with about 800 thousand columns from table tbl_stops_2 which took about half a day on a single core virtual machine with about 10 gigabytes of main memory with the processor being the limiting factor. Any ideas how to speed up this update? — dancesWithCycles, Jun 02 '23 at 12:50
@dancesWithCycles Do you have a GIST index on tbl_stops_2? Don't forget to run VACUUM ANALYZE on both tables, just in case but necessarily after creating a missing index. — geozelot, Jun 02 '23 at 12:56

score 1 · Answer 2 · edited Jun 21 '23 at 11:38

For future reference and everyone tripping over this issue, I am pasting here my working version based on the above accepted answer.

DROP TABLE IF EXISTS :schema.tbl_gtfs_stops_de_not_dhid_refine;
CREATE TABLE IF NOT EXISTS :schema.tbl_gtfs_stops_de_not_dhid_refine AS SELECT * FROM :schema.tbl_gtfs_stops_de_not_dhid;
CREATE INDEX IF NOT EXISTS refine_id_idx ON :schema.tbl_gtfs_stops_de_not_dhid_refine(stop_id);
CREATE INDEX IF NOT EXISTS refine_geom_idx ON :schema.tbl_gtfs_stops_de_not_dhid_refine USING GIST (the_geom);
CREATE INDEX IF NOT EXISTS zhv_dhid_idx ON gtfs.stops_zhv(dhid);
CREATE INDEX IF NOT EXISTS zhv_geom_idx ON gtfs.stops_zhv USING GIST (the_geom);
VACUUM ANALYZE;
UPDATE
  :schema.tbl_gtfs_stops_de_not_dhid_refine AS t1
SET
  stop_id = (
    SELECT
      CONCAT_WS(':',t2.dhid, t1.stop_id)
    FROM
      gtfs.stops_zhv AS t2
    ORDER BY
      t1.the_geom <-> t2.the_geom
    LIMIT
      1
  );

You can find the code here