
I have two global PostGIS polyline layers, one of which is a subset of the other. The subset defines where there are issues with the system being mapped.

I am trying to map the likelihood of a polyline being present but no polylines from the subset, i.e. the likelihood that the system is performing as expected. I've tried kriging, interpolation, zonal stats, etc., and have working code using a grid of 1 deg lat/lon:

--create the grid and convert to vector; I've also tried this using a raster grid
create table fishnet (geom geometry, id bigserial primary key);

insert into fishnet (geom)
select rast::geometry
from ST_MakeEmptyCoverage(1, 1, 720, 360, -180, 90, (360)/(720)::float, (-180)/360::float, 0., 0., 4326) rast;

create index on fishnet using gist (geom);

--The query subtracts the length of the subset polylines in each grid cell from the length of the main polylines
--and divides by the sum of their lengths to get an average. This is faster than using st_difference to get the
--length of the main polyline layer that doesn't intersect the subset layer.

explain analyse
select greatest(sum(st_length(st_intersection(geom, g_geom)) - st_length(st_intersection(geom, b_geom)))
         / sum(st_length(st_intersection(geom, g_geom)) + st_length(st_intersection(geom, b_geom))), 0.0) as perc_conn,
       geom
from fishnet a
join (select a.id, st_collect(a.geom) as g_geom, st_collect(b.geom) as b_geom
      from (select a.geom, a.id
            from good_polylines a
            inner join (select tmp.*
                        from (select a.id, t_id, date_added from events a
                              inner join (select min(date_added) as date_added, id from events_processed group by id) dp on a.id = dp.id
                              union
                              select id, t_id, start_event from asp_cgi_err) tmp
                        inner join t_list t on t.t_id = tmp.t_id
                        where sys_type = 'SYS_TYPE') foo on a.id = foo.id
            where foo.date_added >= '20230201'::timestamp without time zone
              and foo.date_added < '20230217'::timestamp without time zone + interval '1 day') a
      left join bad_polylines b on a.id = b.id
      group by a.id) b
on st_intersects(geom, g_geom)
group by geom

My problem is I need to create the map on demand and the query takes 30 minutes to run with a small sample of 4k polylines. All tables have spatial indexes and primary keys.

The output from explain analyse is

"GroupAggregate  (cost=16655846.98..590800409.83 rows=259200 width=128) (actual time=1205194.109..1968037.472 rows=17105 loops=1)"
"  Group Key: a.geom"
"  ->  Sort  (cost=16655846.98..16670126.96 rows=5711990 width=184) (actual time=1205188.303..1214805.832 rows=41579 loops=1)"
"        Sort Key: a.geom"
"        Sort Method: external merge  Disk: 11969064kB"
"        ->  Nested Loop  (cost=15035.89..14491977.35 rows=5711990 width=184) (actual time=964.756..1085082.337 rows=41579 loops=1)"
"              ->  GroupAggregate  (cost=15035.61..119301.07 rows=22037 width=83) (actual time=961.596..7909.971 rows=764 loops=1)"
"                    Group Key: a_1.id"
"                    ->  Nested Loop Left Join  (cost=15035.61..55812.00 rows=32381 width=17854) (actual time=237.184..7322.351 rows=4067 loops=1)"
"                          ->  Merge Join  (cost=15035.19..41816.51 rows=7029 width=17353) (actual time=236.687..7049.449 rows=764 loops=1)"
"                                Merge Cond: (a_1.id = foo.id)"
"                                ->  Index Scan using good_polylines_id_idx on good_polylines a_1  (cost=0.41..26621.04 rows=22104 width=17353) (actual time=0.839..1663.204 rows=22152 loops=1)"
"                                ->  Sort  (cost=15034.78..15052.30 rows=7008 width=32) (actual time=141.302..142.031 rows=782 loops=1)"
"                                      Sort Key: foo.id"
"                                      Sort Method: quicksort  Memory: 86kB"
"                                      ->  Hash Join  (cost=14275.49..14587.15 rows=7008 width=32) (actual time=139.164..139.536 rows=782 loops=1)"
"                                            Hash Cond: (foo.t_id = t.t_id)"
"                                            ->  HashAggregate  (cost=14260.82..14398.23 rows=13741 width=72) (actual time=139.006..139.197 rows=804 loops=1)"
"                                                  Group Key: foo.id, foo.t_id, (min(id_events_processed.date_added))"
"                                                  Batches: 1  Memory Usage: 465kB"
"                                                  ->  Append  (cost=13558.01..14157.76 rows=13741 width=72) (actual time=133.985..138.693 rows=804 loops=1)"
"                                                        ->  Hash Join  (cost=13558.01..13896.68 rows=13720 width=72) (actual time=133.984..138.469 rows=786 loops=1)"
"                                                              Hash Cond: (foo.id = id_events_processed.id)"
"                                                              ->  Sort  (cost=12064.24..12120.01 rows=22309 width=272) (actual time=112.825..113.940 rows=22129 loops=1)"
"                                                                    Sort Key: foo.id"
"                                                                    Sort Method: quicksort  Memory: 2497kB"
"                                                                    ->  Subquery Scan on foo  (cost=6624.53..7628.44 rows=22309 width=272) (actual time=59.568..75.443 rows=22129 loops=1)"
"                                                                          ->  Unique  (cost=6624.53..7405.35 rows=22309 width=272) (actual time=59.564..71.488 rows=22129 loops=1)"
"                                                                                ->  Sort  (cost=6624.53..6680.30 rows=22309 width=272) (actual time=59.563..66.408 rows=22256 loops=1)"
"                                                                                      Sort Key: t_list_new.t_id, t_list_new.start_event, t_list_new.end_event, t_list_new.event_time, t_list_new.perc, t_list_new.lc, t_list_new.orig_geom, t_list_new.dest_geom, t_list_new.orig, t_list_new.dest, t_list_new.id, t_list_new.url, t_list_new.last_timestamp"
"                                                                                      Sort Method: external merge  Disk: 6128kB"
"                                                                                      ->  Append  (cost=0.00..2188.73 rows=22309 width=272) (actual time=0.015..8.644 rows=22256 loops=1)"
"                                                                                            ->  Seq Scan on t_list_new  (cost=0.00..896.65 rows=2365 width=260) (actual time=0.014..1.341 rows=2365 loops=1)"
"                                                                                            ->  Seq Scan on t_list_old  (cost=0.00..957.44 rows=19944 width=277) (actual time=0.010..5.682 rows=19891 loops=1)"
"                                                              ->  Hash  (cost=1492.23..1492.23 rows=123 width=36) (actual time=21.126..21.129 rows=1080 loops=1)"
"                                                                    Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 89kB"
"                                                                    ->  HashAggregate  (cost=1122.96..1491.00 rows=123 width=36) (actual time=17.378..20.860 rows=1080 loops=1)"
"                                                                          Group Key: id_events_processed.id"
"                                                                          Filter: ((min(id_events_processed.date_added) >= '2023-02-01 00:00:00'::timestamp without time zone) AND (min(id_events_processed.date_added) < '2023-02-18 00:00:00'::timestamp without time zone))"
"                                                                          Batches: 1  Memory Usage: 3601kB"
"                                                                          Rows Removed by Filter: 23548"
"                                                                          ->  Seq Scan on id_events_processed  (cost=0.00..740.48 rows=38248 width=36) (actual time=0.021..3.670 rows=38269 loops=1)"
"                                                        ->  Seq Scan on t_list_err  (cost=0.00..54.97 rows=21 width=48) (actual time=0.074..0.164 rows=18 loops=1)"
"                                                              Filter: ((start_event >= '2023-02-01 00:00:00'::timestamp without time zone) AND (start_event < '2023-02-18 00:00:00'::timestamp without time zone))"
"                                                              Rows Removed by Filter: 1113"
"                                            ->  Hash  (cost=13.40..13.40 rows=102 width=6) (actual time=0.144..0.145 rows=102 loops=1)"
"                                                  Buckets: 1024  Batches: 1  Memory Usage: 12kB"
"                                                  ->  Seq Scan on t_ids t  (cost=0.00..13.40 rows=102 width=6) (actual time=0.011..0.124 rows=102 loops=1)"
"                                                        Filter: (sys_type = 'SYS_TYPE'::text)"
"                                                        Rows Removed by Filter: 10"
"                          ->  Index Scan using bad_polylines_id_idx on bad_polylines b  (cost=0.42..1.92 rows=7 width=528) (actual time=0.240..0.348 rows=5 loops=764)"
"                                Index Cond: (id = a_1.id)"
"              ->  Index Scan using fishnet_geom_idx on fishnet a  (cost=0.28..651.94 rows=26 width=120) (actual time=17.677..1409.874 rows=54 loops=764)"
"                    Index Cond: (geom && (st_collect(a_1.geom)))"
"                    Filter: st_intersects(geom, (st_collect(a_1.geom)))"
"                    Rows Removed by Filter: 1023"
"Planning Time: 8.424 ms"
"Execution Time: 1969143.215 ms"

Note: I had to move my Postgres data directory (15 GB) from /root to /home because the query run with explain analyse used all the available space (5-6 GB) on the root partition. I am using PostgreSQL 13 on RHEL 8.

jgm_GIS
    We need to see the results of explain analyse on the query? – Ian Turton Feb 17 '23 at 17:12
  • I'm still looking at explain analyze and will likely post tomorrow. When I create test tables as listed in the query, it runs in 20 seconds. When I use a more complex query that creates the tables internally as subqueries, it takes 30+ minutes, which is perplexing because it takes a second or two to create the test tables – jgm_GIS Feb 19 '23 at 17:29

1 Answer


In your explain analyze, most of the time seems to be spent scanning the fishnet, as expected (tens of thousands of lines tested against roughly 260k grid cells). Maybe you can try to tweak your server to fully use your computation capacity (see an old answer of mine to start), but I think there is a better approach if you need on-demand computation.
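
Something like this is a first thing to try (the values are only illustrative assumptions; the right settings depend on your RAM and core count):

SET work_mem = '256MB';                   -- the ~12 GB external merge sort in your plan suggests work_mem is far too small for this query
SET max_parallel_workers_per_gather = 4;  -- let the planner parallelise the big scans
SET temp_file_limit = '20GB';             -- cap temp-file usage so a runaway sort cannot fill the partition again

Note that the big on-disk sort comes from sorting on the full cell geometry (Sort Key: a.geom), so grouping on the fishnet id instead of its geometry would already shrink that part considerably.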

If your fishnet doesn't dynamically change, you don't need expensive geo computation every time. You can:

  • Create an ID for each cell of your grid
  • Create a table holding all your lines cut by the grid (using ST_Split); see the sketch after this list
    • Each line part gets a new column with the ID of the cell it belongs to
    • Also compute the length of each line part
    • Note: it's probably more efficient to cut with horizontal and vertical lines instead of creating all the cells as geometries, especially if you want a finer grid. In that case you could use the rounded lng/lat as the ID.
  • All subsequent queries would then only use these IDs (group by cell_id, for example)
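
Here is a minimal sketch of that precomputation, assuming the fishnet from your question and clipping with ST_Intersection per cell rather than ST_Split (the names good_parts, cell_id, line_id and len are just illustrative):

CREATE TABLE good_parts AS
SELECT f.id                                       AS cell_id,  -- ID of the grid cell
       g.id                                       AS line_id,  -- ID of the original polyline
       ST_Intersection(g.geom, f.geom)            AS geom,     -- part of the line inside the cell
       ST_Length(ST_Intersection(g.geom, f.geom)) AS len       -- its length, computed once
FROM   fishnet f
JOIN   good_polylines g ON ST_Intersects(g.geom, f.geom);

CREATE INDEX ON good_parts (cell_id);
CREATE INDEX ON good_parts (line_id);

With a similar table for bad_polylines, the on-demand query becomes a plain join on line_id plus a sum(len) ... group by cell_id, with no geometry functions called at request time.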

The preparation of the data may take a while, but it should be worth it. Even if you want a more dynamic grid, I still think it helps to have several big regions to pre-filter, and you can even issue several queries (one per region) to compute the different regions in parallel (sometimes that is more efficient, and you can avoid a memory bottleneck by treating each region sequentially).
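
As a sketch of that idea (the 45° block size and the region_id column are assumptions, pick whatever granularity suits your data):

ALTER TABLE fishnet ADD COLUMN region_id int;

UPDATE fishnet
SET    region_id = floor((ST_XMin(geom) + 180) / 45)      -- 8 blocks of longitude
                 + 8 * floor((ST_YMin(geom) + 90) / 45);   -- 4 blocks of latitude

CREATE INDEX ON fishnet (region_id);

Each request then adds WHERE region_id = $1, and the per-region results can be computed in parallel or one after the other, whichever your memory allows.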

On another note, when a query starts to get big, I personally prefer to switch to CTEs instead of sub-queries. It's less messy, it lets you see the process in the order it will actually be executed, and you can easily try MATERIALIZED at different stages to see if it improves anything.
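
For example, forcing one stage to be computed once and reused looks like this (the MATERIALIZED keyword exists since PostgreSQL 12, so it is available on your PostgreSQL 13):

WITH first_event AS MATERIALIZED (   -- evaluated once instead of being inlined by the planner
    SELECT min(date_added) AS date_added, id
    FROM   events_processed
    GROUP  BY id
)
SELECT * FROM first_event;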

I tried to rewrite your query with CTEs:

WITH first_event AS (
    select
        min(date_added) as date_added,
        id
    from events_processed
    group by id
), all_events AS (
    select
        a.id,
        t_id,
        date_added
    from events a
    inner join first_event as fe 
    on a.id = fe.id
    union
    select
        id,
        t_id,
        start_event
    from asp_cgi_err
), all_events_sys AS (
    select
        ae.*
    from all_events as ae
    inner join t_list as t 
    on t.t_id = ae.t_id
    where sys_type = 'SYS_TYPE'
), good_polylines_with_events as (
    select
        gp.geom,
        gp.id
    from good_polylines as gp
    inner join all_events_sys as aes
    on gp.id = aes.id
    where
        aes.date_added >= '20230201' :: timestamp without time zone
        and aes.date_added < '20230217' :: timestamp without time zone + interval '1 day'
), polyline_by_id as (
    select
        gpe.id,
        st_collect(gpe.geom) as g_geom,
        st_collect(bp.geom) as b_geom
    from good_polylines_with_events as gpe
    left join bad_polylines bp
    on gpe.id = bp.id
    group by gpe.id
)
select
    greatest(
        sum(
            st_length(st_intersection(geom, g_geom)) - st_length(st_intersection(geom, b_geom))
        ) / sum(
            st_length(st_intersection(geom, g_geom)) + st_length(st_intersection(geom, b_geom))
        ),
        0.0
    ) as perc_conn,
    geom
FROM fishnet as fn
join polyline_by_id as pbi
on st_intersects(geom, g_geom)
group by geom

In this case I don't think materializing one of the polyline stages (good_polylines_with_events or polyline_by_id) would help, because you would lose the indexes you have on your original tables. It would only be interesting if they never change, so that you can create an actual table with its own indexes.
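
If the per-id geometries themselves never change (only the date filter does), a persisted version of polyline_by_id could look like this sketch (polyline_by_id_mat is an illustrative name; the on-demand query would then only pick the relevant ids from it):

CREATE TABLE polyline_by_id_mat AS
SELECT g.id,
       ST_Collect(g.geom) AS g_geom,
       ST_Collect(b.geom) AS b_geom
FROM   good_polylines g
LEFT JOIN bad_polylines b ON g.id = b.id
GROUP  BY g.id;

CREATE INDEX ON polyline_by_id_mat USING gist (g_geom);
CREATE INDEX ON polyline_by_id_mat (id);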

Maybe materializing the earlier CTEs will help a little, but I wouldn't expect much.

robin loche