We have time-series data from several sensors in a table, and I am interested in getting N equally spaced samples from the full range, since the whole data set would be too much to display in a web app.

I am currently using the following query for N = 500:

SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
   SELECT sensor_id, sensor_data_id, val1, val2, datetime
        , ROW_NUMBER() OVER (ORDER BY datetime)
   FROM sensordata
   WHERE sensor_id = 22
  ) x
WHERE mod(ROW_NUMBER, (SELECT COUNT(1) / 500 FROM sensordata WHERE sensor_id = 22)) = 0; 

The table contains val1 and val2 for several timestamps per sensor_id (roughly 2M rows for sensor_id = 22). The table has a primary key on (sensor_data_id) and the following other indexes: (sensor_id) and a composite index on (sensor_id, datetime).

The above query takes roughly 36 s, and the total size of the data for sensor_id = 22 is around 278 MB, found using SELECT pg_size_pretty(sum(pg_column_size(sensordata))) FROM sensordata WHERE sensor_id = 22. While the above query does not consider a time range, I eventually want to add another condition to the WHERE clause to downsample the data for a selected datetime range for a particular sensor_id.
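For illustration, that would mean adding the same datetime condition to both the row-numbering subquery and the count subquery (the date range below is just a placeholder):

SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
   SELECT sensor_id, sensor_data_id, val1, val2, datetime
        , ROW_NUMBER() OVER (ORDER BY datetime)
   FROM sensordata
   WHERE sensor_id = 22
   AND   datetime BETWEEN '2022-01-01' AND '2022-07-01'  -- placeholder range
  ) x
WHERE mod(ROW_NUMBER, (SELECT COUNT(1) / 500
                       FROM sensordata
                       WHERE sensor_id = 22
                       AND   datetime BETWEEN '2022-01-01' AND '2022-07-01')) = 0;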

I have tried EXPLAIN ANALYZE a couple of times, and the only hint that seemed to help a little was increasing work_mem, which is now set to 300 MB. The database does not need to serve many connections simultaneously, so a value this high is not an issue currently.

Are there more efficient ways to achieve the above? The table sensordata is likely to grow over time; it is currently around 7 GB for 121 distinct sensor_id values.

About PG: "PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit"

Running on: AWS RDS db.t3.micro (2 vCPUs, 1 GiB RAM).


  • I don't see EXPLAIN (ANALYZE, BUFFERS) output. – Laurenz Albe Dec 14 '22 at 04:31
  • @LaurenzAlbe: I pasted the output of EXPLAIN ANALYZE in depesz. Hope this makes it easier to read here: https://explain.depesz.com/s/pWE2. I also upgraded the hardware to 2 vCPUs and 8 GB RAM, and Postgres to 14.5, and I see a notable performance improvement even before using Erwin's solution. The latest EXPLAIN ANALYZE results are here: https://explain.depesz.com/s/dqsu – Chintan Pathak Dec 15 '22 at 19:35
  • Your query allows for almost twice the rows. Integer division truncates. If 999 rows qualify, you get all of them in your sample (instead of just 500). – Erwin Brandstetter Dec 16 '22 at 00:39

1 Answer

In any case, you need the multicolumn index on (sensor_id, datetime) that you already have.
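For reference, such an index can be created like this (index name left to the default):

CREATE INDEX ON sensordata (sensor_id, datetime);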

Optimized simple query

You can improve your original query to:

SELECT sensor_data_id, val1, val2, datetime
FROM  (
   SELECT sensor_data_id, val1, val2, datetime
        , row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING) AS rn
   FROM   sensordata
   WHERE  sensor_id = 22
   ) x
WHERE rn % (SELECT COUNT(*) / 500 FROM sensordata WHERE sensor_id = 22) = 0;

The added ROWS UNBOUNDED PRECEDING is a workaround for a sneaky performance issue in window functions that is fixed in Postgres 16.
See my bug report on pgsql-bugs and the discussion on pgsql-hackers.

Either way, upgrade to a current version of Postgres. There has been a steady stream of improvements to performance.

Correct query

Neither query gets you exactly 500 equidistant rows. The function width_bucket() would get us there:

SELECT DISTINCT ON (bucket)
       sensor_data_id, val1, val2, datetime
FROM  (
   SELECT sensor_data_id, val1, val2, datetime
        , width_bucket(row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING), 1
                     , (SELECT COUNT(*) + 1 FROM sensordata WHERE sensor_id = 22), 500) AS bucket
   FROM   sensordata
   WHERE  sensor_id = 22
   ) sub
ORDER BY bucket, datetime;

Adding 1 to the count (which serves as the upper bound) makes the last row land inside the 500th bucket rather than in an overflow 501st bucket, since width_bucket() treats the upper bound as exclusive.
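A quick way to see that boundary behavior, assuming 2,000,000 qualifying rows (so the last row_number is 2000000):

SELECT width_bucket(2000000, 1, 2000000    , 500);  -- 501: upper bound is exclusive
SELECT width_bucket(2000000, 1, 2000000 + 1, 500);  -- 500: last row now falls in the last bucket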

Much faster index-based selection?

If there is any reliable (!) pattern in your datetime column (or any column, for that matter; maybe sensor_data_id?), or if sampled rows don't have to be exactly equidistant, a much faster index-based selection may be possible. Say you have a (mostly) regular time schedule for your sensor (regular intervals in datetime); then you can fetch one row per computed time interval:

WITH RECURSIVE rcte AS (   
   (  -- parentheses required!
   SELECT sensor_data_id, val1, val2, datetime
        , (SELECT (max(s2.datetime) - s.datetime) / 500
           FROM   sensordata s2
           WHERE  s2.sensor_id = 22) AS intvl  -- very fast idx-based
   FROM   sensordata s
   WHERE  sensor_id = 22
   ORDER  BY datetime
   LIMIT  1
   )

UNION ALL
SELECT s.*, r.intvl
FROM   rcte r
CROSS  JOIN LATERAL (
   SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
   FROM   sensordata s
   WHERE  s.sensor_id = 22
   AND    s.datetime > r.datetime + r.intvl
   ORDER  BY s.datetime
   LIMIT  1
   ) s
)
TABLE rcte;

This should be substantially faster for big samples over the full range, as it can skip over many rows and pick the next one via the index, recursively.
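If you later need the same approach restricted to a selected time range, a sketch might look like this (assuming bind parameters $1 and $2 for the chosen lower and upper bound, and a timestamp column; untested):

WITH RECURSIVE rcte AS (
   (  -- parentheses required!
   SELECT sensor_data_id, val1, val2, datetime
        , ($2::timestamp - $1::timestamp) / 500 AS intvl  -- interval per bucket
   FROM   sensordata
   WHERE  sensor_id = 22
   AND    datetime >= $1
   ORDER  BY datetime
   LIMIT  1
   )

   UNION ALL
   SELECT s.*, r.intvl
   FROM   rcte r
   CROSS  JOIN LATERAL (
      SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
      FROM   sensordata s
      WHERE  s.sensor_id = 22
      AND    s.datetime >  r.datetime + r.intvl
      AND    s.datetime <= $2                   -- stop at the upper bound
      ORDER  BY s.datetime
      LIMIT  1
      ) s
   )
TABLE rcte;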

Aside

and the following other indexes: (sensor_id) and a composite index on (sensor_id, datetime).

The index on just (sensor_id) is largely redundant, since the composite index on (sensor_id, datetime) can serve the same queries. It may only be worth keeping if index deduplication in Postgres 13 or later compresses it enough to be useful in addition.
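If you decide to drop it, that would be something like the following (the index name here is hypothetical; look up the actual name first, e.g. with \d sensordata):

DROP INDEX CONCURRENTLY sensordata_sensor_id_idx;  -- hypothetical index name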

Erwin Brandstetter
  • Thank you for your help and for pointing out that the composite index should be sufficient. I think the query you posted does not do what I want. To ensure performance on the UI, I want the query to always return 500 rows (irrespective of the time range selected), not every 500th row as yours does now. As such, it is difficult to compare performance with the existing query. Also, I did get a boost by updating the hardware and Postgres version. Still hoping to get to the better query. – Chintan Pathak Dec 15 '22 at 20:01
  • @Chintan: I see, I got the 500 backwards. But do you have any regular patterns in your data? Else, to fulfill your strict requirement, there is no way around an expensive full count of all qualifying rows before filtering. We can still make it faster than your original, but not by orders of magnitude. – Erwin Brandstetter Dec 15 '22 at 23:10
  • Thanks again for a detailed answer. The index-based selection is much faster and predictable, typically under 500 ms, which is ideal. The non-deterministic nature of the output is not ideal, though. As to your question about patterns: the only pattern is that sensor_data_id and datetime are both ascending. Some datetimes are closer together than others. Also, while the sensor_data_id values are mostly contiguous for a given sensor_id (as they come from importing a file), that is not guaranteed; someone else could decide to enter data into the table. – Chintan Pathak Dec 17 '22 at 02:36
  • The ideal solution would be something similar to this deterministic, fast approach, but one that leverages continuous aggregates in TimescaleDB and uses time-range-sensitive queries, as described here: https://nickb.dev/blog/downsampling-timescale-data-with-continuous-aggregations/ – Chintan Pathak Dec 17 '22 at 03:05
  • @ErwinBrandstetter Last query should be (max(datetime) - min(datetime))/ 500? – jian Apr 17 '23 at 01:02
  • @jian; Indeed, parentheses were missing. Thanks, fixed. (Plus optimized some more.) – Erwin Brandstetter Apr 17 '23 at 01:36