Very slow WHERE IN query on very large table - postgresql 9.6

Question

My table (with 140 000 000 rows) :

CREATE TABLE attendees(
  attendee_id                 VARCHAR NOT NULL,
  name                        VARCHAR NOT NULL,
  status                      VARCHAR NOT NULL,
  event_id                    VARCHAR NOT NULL,
  PRIMARY KEY (attendee_id, event_id)
);

CREATE INDEX attendees_event_id ON attendees(event_id);
CREATE INDEX attendees_attendee_id ON attendees(attendee_id);

My query :

SELECT event_id FROM attendees 
where attendee_id IN (SELECT attendee_id FROM attendees where event_id=$eventId)

This query can take several minutes. I saw here: Optimizing a Postgres query with a large IN I can change the IN by IN (VALUES (..)...) but I don't know how to change this sub query :

SELECT attendee_id FROM attendees where event_id=$eventId

to have (VALUES (..)...)

So, my question is, what is the best way to make this query faster ?

edit (I run the explain (analyze, verbose) on smaller table):

Hash Join  (cost=5296.72..14097.38 rows=209116 width=16) (actual time=177.516..362.851 rows=121564 loops=1)
  Output: attendees.event_id
  Hash Cond: ((attendees.attendee_id)::text = (attendees_1.attendee_id)::text)
  ->  Seq Scan on public.attendees  (cost=0.00..5051.21 rows=230721 width=33) (actual time=0.572..90.302 rows=230721 loops=1)
        Output: attendees.event_id, attendees.attendee_id
  ->  Hash  (cost=4664.54..4664.54 rows=50574 width=17) (actual time=176.611..176.611 rows=49650 loops=1)
        Output: attendees_1.attendee_id
        Buckets: 65536  Batches: 1  Memory Usage: 2917kB
        ->  Bitmap Heap Scan on public.attendees attendees_1  (cost=1288.37..4664.54 rows=50574 width=17) (actual time=14.141..151.031 rows=49650 loops=1)
              Output: attendees_1.attendee_id
              Recheck Cond: ((attendees_1.event_id)::text = '1193751294041282'::text)
              Heap Blocks: exact=595
              ->  Bitmap Index Scan on attendees_event_id  (cost=0.00..1275.73 rows=50574 width=0) (actual time=13.265..13.265 rows=49650 loops=1)
                    Index Cond: ((attendees_1.event_id)::text = '1193751294041282'::text)
Planning time: 1.611 ms
Execution time: 371.431 ms

I'm having trouble understanding what you are trying to achieve with that query. select event_id from attendees where event_id = $eventId should be functionally equivalent to your existing query. — Colin 't Hart, Apr 19 '17 at 09:12
Please [EDIT] your question and add the execution plan generated using explain (analyze, verbose). Formatted text please, no screen shots — , Apr 19 '17 at 09:39
Instead of the two single-column indexes, create one index on (event_id, attendee_id) — , Apr 19 '17 at 10:13
@a_horse_with_no_name I already have a PRIMARY KEY on (event_id, attendee_id) — Loann Delgado, Apr 19 '17 at 12:45
No, you have a primary key on (attendee_id, event_id) The order of the columns in an index matters. The index on only (attendee_id) is useless because Postgres can also use the PK index for that. And the index on (event_id) can be replaced by one on (event_id, attendee_id) — , Apr 19 '17 at 13:01

MarmiK · Answer 1 · 2017-04-24T08:39:52.553

-1

To Speed up the query,

We may use join instead of subquery

Try this query:

SELECT a1.event_id 
FROM attendees a1
JOIN attendees a2
    ON a1.attendee_id = a2.attendee_id
        and a2.event_id = $eventId

To make is faster check that if you can create CLUSTER on that table with index on Primary key

Note: cluster a table according to an index reference

edited Apr 24 '17 at 08:39

answered Apr 19 '17 at 08:47

MarmiK

269
2
15

1

This is effectively faster (3 minutes instead of more than 10) but still slow... – Loann Delgado Apr 19 '17 at 12:58
1

@a_horse_with_no_name When would it not return the same information? event_id is part of the PK – paparazzo Apr 19 '17 at 12:58
@LoannDelgado Your table does have 140 000 000 rows – paparazzo Apr 19 '17 at 13:00
@Paparazzi Indeed – Loann Delgado Apr 19 '17 at 13:01
@downvoter please care to comment why down vote. – MarmiK Apr 21 '17 at 05:55

score -1 · Accepted Answer · answered Apr 19 '17 at 13:07

-1

Drop the extra indexes
Reverse the order of the PK to PRIMARY KEY (event_id, attendee_id)

SELECT a1.event_id 
FROM attendees a1
JOIN attendees a2
  ON a2.event_id    = $eventId
 and a2.attendee_id = a1.attendee_id

answered Apr 19 '17 at 13:07

paparazzo

5,043
1
18
32

Very slow WHERE IN query on very large table - postgresql 9.6

2 Answers2