1

I have a table with metadata about sampling locations in the ocean. It includes geographic coordinates, date-time in which sampling was done the number of vertical measurements in each location (nlevels), and the kid of instrument used to take samples (instrument column).

Due to this particular dataset came from historical records, it can include duplicates that I would like to identify using spatiotemporal criteria (allowing rounding errors in spatial coordinates). I would like to detect all stations separated by a distance less than 5km and a temporal difference less than 25 hours

My code for testing was:

CREATE TABLE dups_v2 AS SELECT DISTINCT ON (t1.cast_id) t1.cast_id, t1.date, t1.date_time ,t1.lat, t1.lon, t1.nlevels, t1.instrument 
FROM good_dates t1 JOIN good_dates t2
ON ST_DWithin(t1.geom::geography, t2.geom::geography, 5000.0, true)
WHERE AGE(t1.date_time, t2.date_time) < interval '1 days 01:00:00' AND t1.cast_id <> t2.cast_id

The create table looks like:

cast_id  |      date_time      |   lat   |    lon    | nlevels | instrument 
----------+---------------------+---------+-----------+---------+------------
 943046   | 1950-02-03 05:00:00 | 25.9200 | -114.5500 |      25 | mbt
 177584   | 1950-02-03 08:00:00 | 25.8200 | -114.8080 |      14 | osd
 943053   | 1950-02-03 08:06:00 | 25.8200 | -114.7700 |      24 | mbt
 943059   | 1950-02-03 10:00:00 | 25.7000 | -114.9700 |      22 | mbt
 4369553  | 1950-03-05 00:00:00 | 25.7720 | -114.8570 |      40 | mbt
 4369511  | 1950-03-05 04:00:00 | 25.9500 | -114.5000 |      38 | mbt
 4369514  | 1950-03-05 05:00:00 | 25.8700 | -114.6500 |      39 | mbt
 178401   | 1950-03-05 07:54:00 | 25.8580 | -114.7500 |       8 | osd
 4369529  | 1950-03-05 07:58:12 | 25.8200 | -114.7500 |      42 | mbt
 944149   | 1950-03-09 00:10:12 | 25.4300 | -113.4700 |      13 | mbt
 944153   | 1950-03-09 13:55:12 | 25.0000 | -112.8300 |      13 | mbt

The time difference looks ok, but spatial separation it is incorrect. For example, consider the cast ids 944149 and 944153 (the last two) from the above results.

 944149   | 1950-03-09 00:10:12 | 25.4300 | -113.4700 |      13 | mbt
 944153   | 1950-03-09 13:55:12 | 25.0000 | -112.8300 |      13 | mbt

The distance seems to be greater than 5 km. Thus, is not fulfilling my duplicated definition (< 5km; < 25 hours)

Could you help me to identify my error?

John Powell
  • 13,649
  • 5
  • 46
  • 62
Amaru
  • 21
  • 2
  • You haven't attempted a temporal deconfliction constraint, so there's no indication what your time column might be. When doing a self join, you need a WHERE constraint to eliminate identity (t2.id <> t1.id) or previous clusters (t2.id > t1.id). It's not clear what your SRID is so the significance of your search distance is unclear. If you use or cast to geography, you can use a geodetic search in meters (though you need to index appropriately if casting) – Vince Mar 31 '19 at 12:38
  • I followed your suggestion using WHERE to eliminate identity. Also, I used geography to get the distance in meters. The list of duplicates makes sense to me but I am struggling with the date. I am going to edit my question. – Amaru Mar 31 '19 at 20:03
  • Asking for help with the age operator is probably more in the realm of [dba.se] – Vince Mar 31 '19 at 20:11
  • 2
    Where does nlevels come from? I think you need the rank function, but, I am unclear on the question. Ignore @Vince, there are plenty of us here who can cope with time as well as space :-) – John Powell Apr 01 '19 at 14:17
  • Thanks for your interest. nlevels comes from the original database. I am going to edit my question to be more clear. – Amaru Apr 01 '19 at 18:38
  • 1
    The problem, I think, is that if you use the distance in meters in ST_DWithin, where the input data are in lat/lon, 4326, then you need to cast to geography not geometry. If you use the first form, then the geometries need to be in a projected coordinate system where the units are in meters. It isn't clear from the question what SRID you are using. – John Powell Apr 03 '19 at 05:50
  • Thank you for your comment, @John Powell. My SRID is 4326 and my geom column indicated geom | geometry(Point,4326) | | | | main. Also I used t1.geom::geography but I got the same results: some records are separated by a distance greater than 5 km. – Amaru Apr 03 '19 at 07:51
  • Sorry, I have no idea without seeing the data. Those two points are indeed more than 5k apart. I have edited your question because the cast to geography is essentially for the non-projected form of ST_DWithin to work in meters. – John Powell Apr 03 '19 at 08:13
  • Can I attach the data as part of my question? – Amaru Apr 04 '19 at 03:51
  • that query doesn't guarantee that consecutive rows are related. remove the DISTINCT ON, add the t2.cast_id to the select list and check if 944153 appears as duplicate (as respective t2.cast_id value) for 944149. I have the feeling your query might just be doing what it's supposed to... – geozelot Apr 05 '19 at 12:43

0 Answers0