I have 25 billion <int, int, float> rows that I'm trying to import into Postgres. After importing 77% of the data, the Postgres data folder is taking up 840GB, about 4x the storage I expected for that many rows at 12 bytes each. Importing is also taking 4x longer than the same import on MySQL (as described in MySQL MyISAM index causes query to match no rows; indexes disabled, rows match).
Here are my commands:
mydb=# CREATE TABLE mytable (id1 int, id2 int, score float);
$ psql mydb -c "COPY mytable (id1, id2, score) FROM 'file_000'"
$ psql ...
$ psql mydb -c "COPY mytable (id1, id2, score) FROM 'file_099'" # 100 files total
I'm running Postgres 9.1. There are no other tables in the database. This is not a production environment. The files are TSV text files. The only output from each COPY command is something like "COPY 256448118" -- at least until I ran out of disk space.
Am I doing something wrong here, or is this the expected behavior?
Subquestion 1: Where is this extra storage overhead coming from, and can I avoid it?
- Update: It looks like there is a HeapTupleHeader of 23 bytes on each row, so that probably explains this overhead (source: StackOverflow post); rough math below. Any way to avoid this?
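Here's the rough math behind that update, with my assumptions in the comments (in particular, float in Postgres is double precision, so the raw data is really 16 bytes/row rather than the 12 I budgeted):

-- Approximate on-disk cost per row (assumes 8-byte MAXALIGN, no NULLs):
--   24 bytes  HeapTupleHeader (23 bytes, padded to 24)
-- + 16 bytes  data: id1 int4 (4) + id2 int4 (4) + score float8 (8)
-- +  4 bytes  line pointer in the page header
-- ≈ 44 bytes/row, plus a 24-byte header per 8kB page
-- 25e9 rows * ~44 bytes ≈ 1.1TB, which lines up with 840GB at 77% loaded.

-- What Postgres reports for the table on disk:
SELECT pg_size_pretty(pg_relation_size('mytable'));
SELECT pg_size_pretty(pg_total_relation_size('mytable'));  -- includes indexes and TOAST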
Subquestion 2: If storage requirements are indeed 4x the expected size, can I speed up the import, e.g. with some configuration change? (The settings I'm planning to try are below.)
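These are the bulk-load tweaks I'm planning for subquestion 2; the values are guesses for my hardware, and I haven't measured how much each one actually buys (this is a dev box, so the durability trade-offs are acceptable):

-- postgresql.conf changes for the duration of the load only:
--   checkpoint_segments  = 64     -- fewer, larger checkpoints during COPY
--   wal_buffers          = 16MB
--   synchronous_commit   = off
--   fsync                = off    -- unsafe: can corrupt data on a crash
--   full_page_writes     = off    -- likewise unsafe
--   autovacuum           = off    -- re-enable and ANALYZE when the load is done
--   maintenance_work_mem = 8GB    -- used by the later CREATE INDEX, not by COPY

-- An unlogged table (new in 9.1) skips WAL for the COPY entirely:
CREATE UNLOGGED TABLE mytable (id1 int, id2 int, score float);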
Subquestion 3: I need an index on id1, so what will the storage requirements be for that index during and after creation? (I plan to run CREATE INDEX id1x ON mytable (id1); my rough estimate is below.)
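My own rough size estimate for that index, based on header and alignment numbers I haven't verified:

-- Approximate cost per btree leaf entry for a single int4 key:
--    8 bytes  index tuple header + 4 bytes key, padded to 16
-- +  4 bytes  line pointer
-- ≈ 20 bytes/entry at the default 90% leaf fillfactor
-- 25e9 entries * 20 / 0.9 ≈ 550-600GB, plus large temp sort files under pgsql_tmp during the build

SET maintenance_work_mem = '8GB';  -- CREATE INDEX sorts with this, not work_mem
CREATE INDEX id1x ON mytable (id1);
SELECT pg_size_pretty(pg_relation_size('id1x'));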
Comments:
- I set work_mem to 64gb and started the CREATE INDEX, so we'll see how it goes. – Dolan Antenucci Aug 13 '13 at 17:13
- It's maintenance_work_mem that should be changed. See Resource Consumption in the doc. – Daniel Vérité Aug 13 '13 at 18:07
- I set maintenance_work_mem=8gb and work_mem=64gb. Right now there's a tmp file in the postgres data dir that is slowly growing. (index type = btree) – Dolan Antenucci Aug 13 '13 at 18:50