I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, each document having 15 tags on average. For now, the only query I care about is selecting all documents that have none of the tags in a given set. I'm only interested in the document_id column (no other columns in the result).
My schema is essentially:
CREATE TABLE documents (
document_id SERIAL PRIMARY KEY,
title TEXT
);
CREATE TABLE tags (
tag_id SERIAL PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE documents_tags (
document_id INTEGER REFERENCES documents,
tag_id INTEGER REFERENCES tags,
PRIMARY KEY (document_id, tag_id)
);
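For scale, test data roughly matching these numbers can be generated with something like the following (a sketch with uniformly random tag assignments, not my real data):

INSERT INTO documents (title)
SELECT 'doc ' || g FROM generate_series(1, 300000) g;

INSERT INTO tags (name)
SELECT 'tag ' || g FROM generate_series(1, 50000) g;

-- about 15 random tags per document; DISTINCT drops duplicate pairs
INSERT INTO documents_tags (document_id, tag_id)
SELECT DISTINCT doc, 1 + floor(random() * 50000)::int
FROM generate_series(1, 300000) AS doc,
     generate_series(1, 15);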
I can answer this query in Python by pre-computing the set of all document ids (all_docs) and the set of documents for each tag (tags_to_docs), which reduces the problem to a few fast set operations:
In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
100 loops, best of 3: 13.7 ms per loop
A direct translation of the set operations to Postgres is not nearly as fast, however:
stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(# SELECT documents_tags.document_id AS id FROM documents_tags
stuff(# WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
document_id
---------------
...
Time: 201.476 ms
- Replacing NOT IN with EXCEPT makes it even slower.
- I have btree indexes on document_id and tag_id in all three tables and another one on (document_id, tag_id); the DDL is sketched just below this list.
- The default memory limits on the Postgres process have been increased significantly, so I don't think Postgres is misconfigured.
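For reference, the EXCEPT variant I tried was along these lines:

SELECT document_id FROM documents
EXCEPT
SELECT document_id FROM documents_tags
WHERE tag_id IN (12345, 7654);

And the index DDL looks roughly like this (documents_tags__tag_id_index is the name that shows up in the plan below; the other names are approximate):

-- documents.document_id and tags.tag_id are already covered by their primary keys
CREATE INDEX documents_tags__tag_id_index ON documents_tags (tag_id);
CREATE INDEX documents_tags__document_id_index ON documents_tags (document_id);
CREATE INDEX documents_tags__document_id_tag_id_index ON documents_tags (document_id, tag_id);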
How do I speed up this query? Is there any way to pre-compute the mapping between tags and documents like I did in Python, or am I thinking about this the wrong way?
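To illustrate the kind of pre-computation I mean: the SQL analogue of the tags_to_docs dictionary might be a materialized view like the one below (tag_documents is just a name I made up), though I don't know how to make the exclusion query take advantage of it.

CREATE MATERIALIZED VIEW tag_documents AS
SELECT tag_id, array_agg(document_id) AS doc_ids  -- one array of documents per tag
FROM documents_tags
GROUP BY tag_id;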
Here is the result of an EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT document_id AS id FROM documents
WHERE document_id NOT IN (
SELECT documents_tags.document_id AS id FROM documents_tags
WHERE documents_tags.tag_id IN (12345, 7654)
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 146388
SubPlan 1
-> Bitmap Heap Scan on documents_tags (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Heap Blocks: exact=3300
-> Bitmap Index Scan on documents_tags__tag_id_index (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
Planning time: 0.117 ms
Execution time: 303.289 ms
(11 rows)
Time: 303.790 ms
The only settings I changed from the default configuration were:
shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB
Running Postgres 9.4.5 on a server with 64GB RAM.