1

My table (with 140 000 000 rows) :

CREATE TABLE attendees(
  attendee_id                 VARCHAR NOT NULL,
  name                        VARCHAR NOT NULL,
  status                      VARCHAR NOT NULL,
  event_id                    VARCHAR NOT NULL,
  PRIMARY KEY (attendee_id, event_id)
);

CREATE INDEX attendees_event_id ON attendees(event_id);
CREATE INDEX attendees_attendee_id ON attendees(attendee_id);

My query :

SELECT event_id FROM attendees 
where attendee_id IN (SELECT attendee_id FROM attendees where event_id=$eventId)

This query can take several minutes. I saw here: Optimizing a Postgres query with a large IN I can change the IN by IN (VALUES (..)...) but I don't know how to change this sub query :

SELECT attendee_id FROM attendees where event_id=$eventId

to have (VALUES (..)...)

So, my question is, what is the best way to make this query faster ?

edit (I run the explain (analyze, verbose) on smaller table):

Hash Join  (cost=5296.72..14097.38 rows=209116 width=16) (actual time=177.516..362.851 rows=121564 loops=1)
  Output: attendees.event_id
  Hash Cond: ((attendees.attendee_id)::text = (attendees_1.attendee_id)::text)
  ->  Seq Scan on public.attendees  (cost=0.00..5051.21 rows=230721 width=33) (actual time=0.572..90.302 rows=230721 loops=1)
        Output: attendees.event_id, attendees.attendee_id
  ->  Hash  (cost=4664.54..4664.54 rows=50574 width=17) (actual time=176.611..176.611 rows=49650 loops=1)
        Output: attendees_1.attendee_id
        Buckets: 65536  Batches: 1  Memory Usage: 2917kB
        ->  Bitmap Heap Scan on public.attendees attendees_1  (cost=1288.37..4664.54 rows=50574 width=17) (actual time=14.141..151.031 rows=49650 loops=1)
              Output: attendees_1.attendee_id
              Recheck Cond: ((attendees_1.event_id)::text = '1193751294041282'::text)
              Heap Blocks: exact=595
              ->  Bitmap Index Scan on attendees_event_id  (cost=0.00..1275.73 rows=50574 width=0) (actual time=13.265..13.265 rows=49650 loops=1)
                    Index Cond: ((attendees_1.event_id)::text = '1193751294041282'::text)
Planning time: 1.611 ms
Execution time: 371.431 ms
  • 1
    I'm having trouble understanding what you are trying to achieve with that query. select event_id from attendees where event_id = $eventId should be functionally equivalent to your existing query. – Colin 't Hart Apr 19 '17 at 09:12
  • Please [EDIT] your question and add the execution plan generated using explain (analyze, verbose). Formatted text please, no screen shots –  Apr 19 '17 at 09:39
  • 1
    Instead of the two single-column indexes, create one index on (event_id, attendee_id) –  Apr 19 '17 at 10:13
  • @a_horse_with_no_name I already have a PRIMARY KEY on (event_id, attendee_id) – Loann Delgado Apr 19 '17 at 12:45
  • 3
    No, you have a primary key on (attendee_id, event_id) The order of the columns in an index matters. The index on only (attendee_id) is useless because Postgres can also use the PK index for that. And the index on (event_id) can be replaced by one on (event_id, attendee_id) –  Apr 19 '17 at 13:01

2 Answers2

-1

To Speed up the query,

We may use join instead of subquery

Try this query:

SELECT a1.event_id 
FROM attendees a1
JOIN attendees a2
    ON a1.attendee_id = a2.attendee_id
        and a2.event_id = $eventId 

To make is faster check that if you can create CLUSTER on that table with index on Primary key

Note: cluster a table according to an index reference

MarmiK
  • 269
  • 2
  • 15
-1

Drop the extra indexes
Reverse the order of the PK to PRIMARY KEY (event_id, attendee_id)

SELECT a1.event_id 
FROM attendees a1
JOIN attendees a2
  ON a2.event_id    = $eventId
 and a2.attendee_id = a1.attendee_id
paparazzo
  • 5,043
  • 1
  • 18
  • 32