Help with Postgres 9.1 data import (is 4x expected data size, 4x slower than MySQL, and index still needed)

Question

I have 25 billion <int, int, float> rows that I'm trying to import into Postgres, and after 77% of the data being imported, the Postgres data folder is taking up 840GB, about 4x the storage requirements for that many rows at 12 bytes each. Additionally, importing is taking 4x longer than my same import on MySQL (as described in MySQL MyISAM index causes query to match no rows; indexes disabled, rows match).

Here are my commands:

mydb=# CREATE TABLE mytable (id1 int, id2 int, score float)
$ psql mydb -c "COPY mytable (id1, id2, score) FROM 'file_000'"
$ psql ...
$ psql mydb -c "COPY mytable (id1, id2, score) FROM 'file_099'" # 100 files total

I'm running Postgres 9.1. There are no other tables in the database. This is not a production environment. The files are TSV text files. The only output from each COPY command is something like "COPY 256448118" -- at least until I ran out of disk space.

Am I doing something wrong here, or is this the expected behavior?

Subquestion 1: Where is this extra storage overhead coming from, and can I avoid it?

Update: It looks like there is a HeapTupleHeader of 23 bytes on each row, so that probably explains this overhead (source: StackOverflow post). Any way to avoid this?

Subquestion 2: If storage requirements are indeed 4x that of expected size, can I speed up importing (i.e. with some configuration change, etc)?

Subquestion 3: I need an index on id1, so what will storage requirements be for that during and after creation (I planned to run CREATE INDEX id1x ON mytable (id1))?

When you say "MySQL", do you mean "MySQL with MyISAM"? Or "MySQL with InnoDB"? Very different beasts. MyISAM: Like a rocket car, it's super fast until it explodes and sprays pieces everywhere. — Craig Ringer, Aug 13 '13 at 13:42
@CraigRinger -- I have exploding rocket car apparently (as my other DBA.SE post describes) :P At this point, the data is almost done loading, so I'll live with the slowness and mass of storage costs.. now just need to figure out how best to create the index efficiently — Dolan Antenucci, Aug 13 '13 at 13:54
I cranked up work_mem to 64gb and started the CREATE INDEX, so we'll see how it goes. — Dolan Antenucci, Aug 13 '13 at 17:13
It's maintenance_work_mem that should be changed. See Resource Consumption in the doc. — Daniel Vérité, Aug 13 '13 at 18:07
I have the create index command running, but it is only using 1.8gb ram, even with maintenance_work_mem=8gb, and work_mem=64gb. Right now there's a tmp file in postgres data dir that is slowly growing. (index type = btree) — Dolan Antenucci, Aug 13 '13 at 18:50

score 1 · Answer 1 · answered Aug 13 '13 at 20:33

SubQuestion 1: I don't know of any way to get rid of the 23 byte header. There is also a 24 byte header on the page. http://www.postgresql.org/docs/8.3/static/storage-page-layout.html

SubQuestion 2: This guide should help. As suggested, increasing maintenance_work_mem is advised, but also increasing checkpoint_segments should help (to reduce the frequency of checkpoints). You've set your work_mem at 64GB, which seems really high. I would suggest keeping it under 2GB even on a system like yours. work_mem is for doing sorts and operations of that nature. http://www.postgresql.org/docs/current/interactive/populate.html

SubQuestion 3: I am not able to find any information on the physical layout of indexes within the pages. I would create a temporary table, fill it with test data (1 million rows?), create a temporary index, get the size of the index from the system, then divide by the number of rows to get the average. I can't show you an SQLFiddle version because it doesn't allow access to system functions. Try these techniques: http://postgresql.cc/postgres-index-size

PS - 25 billion rows? That's more than I've ever dealt with. I'm in the 4 billion range for one particular table ... For comparison, other database servers, like SQL Server 2008+, have a minimum 9 byte header for each row, so you can't entirely get rid of it but I've always thought that PostgreSQL's 23 byte header was a bit excessive.

Help with Postgres 9.1 data import (is 4x expected data size, 4x slower than MySQL, and index still needed)

1 Answers1