I have a table with metadata about sampling locations in the ocean. It includes geographic coordinates, date-time in which sampling was done the number of vertical measurements in each location (nlevels), and the kid of instrument used to take samples (instrument column).
Due to this particular dataset came from historical records, it can include duplicates that I would like to identify using spatiotemporal criteria (allowing rounding errors in spatial coordinates). I would like to detect all stations separated by a distance less than 5km and a temporal difference less than 25 hours
My code for testing was:
CREATE TABLE dups_v2 AS SELECT DISTINCT ON (t1.cast_id) t1.cast_id, t1.date, t1.date_time ,t1.lat, t1.lon, t1.nlevels, t1.instrument
FROM good_dates t1 JOIN good_dates t2
ON ST_DWithin(t1.geom::geography, t2.geom::geography, 5000.0, true)
WHERE AGE(t1.date_time, t2.date_time) < interval '1 days 01:00:00' AND t1.cast_id <> t2.cast_id
The create table looks like:
cast_id | date_time | lat | lon | nlevels | instrument
----------+---------------------+---------+-----------+---------+------------
943046 | 1950-02-03 05:00:00 | 25.9200 | -114.5500 | 25 | mbt
177584 | 1950-02-03 08:00:00 | 25.8200 | -114.8080 | 14 | osd
943053 | 1950-02-03 08:06:00 | 25.8200 | -114.7700 | 24 | mbt
943059 | 1950-02-03 10:00:00 | 25.7000 | -114.9700 | 22 | mbt
4369553 | 1950-03-05 00:00:00 | 25.7720 | -114.8570 | 40 | mbt
4369511 | 1950-03-05 04:00:00 | 25.9500 | -114.5000 | 38 | mbt
4369514 | 1950-03-05 05:00:00 | 25.8700 | -114.6500 | 39 | mbt
178401 | 1950-03-05 07:54:00 | 25.8580 | -114.7500 | 8 | osd
4369529 | 1950-03-05 07:58:12 | 25.8200 | -114.7500 | 42 | mbt
944149 | 1950-03-09 00:10:12 | 25.4300 | -113.4700 | 13 | mbt
944153 | 1950-03-09 13:55:12 | 25.0000 | -112.8300 | 13 | mbt
The time difference looks ok, but spatial separation it is incorrect. For example, consider the cast ids 944149 and 944153 (the last two) from the above results.
944149 | 1950-03-09 00:10:12 | 25.4300 | -113.4700 | 13 | mbt
944153 | 1950-03-09 13:55:12 | 25.0000 | -112.8300 | 13 | mbt
The distance seems to be greater than 5 km. Thus, is not fulfilling my duplicated definition (< 5km; < 25 hours)
Could you help me to identify my error?
t2.id <> t1.id) or previous clusters (t2.id > t1.id). It's not clear what your SRID is so the significance of your search distance is unclear. If you use or cast togeography, you can use a geodetic search in meters (though you need to index appropriately if casting) – Vince Mar 31 '19 at 12:38ageoperator is probably more in the realm of [dba.se] – Vince Mar 31 '19 at 20:11geom | geometry(Point,4326) | | | | main. Also I usedt1.geom::geographybut I got the same results: some records are separated by a distance greater than 5 km. – Amaru Apr 03 '19 at 07:51DISTINCT ON, add thet2.cast_idto the select list and check if944153appears as duplicate (as respectivet2.cast_idvalue) for944149. I have the feeling your query might just be doing what it's supposed to... – geozelot Apr 05 '19 at 12:43