
I am using the following code to connect to a PostgreSQL 12 database:

con <- DBI::dbConnect(odbc::odbc(), driver = driver, server = server, database = database, uid = uid, pwd = pwd, port = port)

This connects me to a PostgreSQL 12 database on Google Cloud SQL. The following code is then used to upload data:

DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)

where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.

Uploading by the above method took approximately 9 hours, at a rate of about 40 write operations per second. Is there a faster way to upload this data into my PostgreSQL database, preferably from R?

Dylan Russell

2 Answers


I've always found bulk copy, external to R, to be the best approach. The insert itself is significantly faster; the cost is (1) the extra step of writing the data to a file, which is more than offset by (2) the much shorter run-time.

Setup for this test:

  • Win10 (2004)
  • Docker
  • postgres:11 container, using port 35432 on localhost, simple authentication (a rough sketch of this setup follows the list)
  • a psql binary in the host OS (where R is running); this should be easy on Linux, while on Windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
  • I'm using data.table::fwrite to save the file because it's fast; in this case write.table and write.csv are still much faster than DBI::dbWriteTable, but with data of your size you might prefer something quick

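For completeness, here is a rough sketch of how that environment might be stood up; the container name, password, and choice of the RPostgres driver are my assumptions, since the answer only shows the resulting con2 connection:

# Assumed setup (not from the answer itself): a postgres:11 container
# published on local port 35432, e.g.
#   docker run -d --name pg11 -e POSTGRES_PASSWORD=mysecretpassword -p 35432:5432 postgres:11
con2 <- DBI::dbConnect(
  RPostgres::Postgres(),          # any DBI-compliant PostgreSQL driver should work here
  host     = "127.0.0.1",
  port     = 35432,
  user     = "postgres",
  password = "mysecretpassword",
  dbname   = "postgres"
)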
# con2 is a DBI connection to the dockerized PostgreSQL instance described above
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#   n
# 1 0

z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
  DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    1.56    1.09   30.90 

system.time({
  data.table::fwrite(z1000, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z1000), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})    
# COPY 32000
#    user  system elapsed 
#    0.05    0.00    0.19 
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#       n
# 1 64000

While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
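If you want to reuse that pattern on your own data frame, it can be wrapped in a small helper. This is only a sketch, assuming psql is on your PATH and the target table already exists; the function name and its defaults are mine, not part of DBI or data.table:

# Sketch: bulk-load a data frame into an existing PostgreSQL table via psql \copy.
# Assumes psql is on the PATH and the table was already created (e.g. with dbCreateTable).
pg_bulk_copy <- function(df, tablename,
                         user = "postgres", password = "mysecretpassword",
                         host = "127.0.0.1", port = 35432) {
  csv <- tempfile(fileext = ".csv")
  on.exit(unlink(csv), add = TRUE)
  data.table::fwrite(df, csv)                      # fast CSV writer
  uri <- sprintf("postgresql://%s:%s@%s:%s", user, password, host, port)
  cmd <- sprintf('psql -U %s -c "\\copy %s (%s) from %s (FORMAT CSV, HEADER)" %s',
                 user, tablename, paste(colnames(df), collapse = ","),
                 shQuote(csv), uri)
  system(cmd)
}
# pg_bulk_copy(df, "mytable")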


Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:

z100 <- data.table::rbindlist(replicate(100, mtcars, simplify = FALSE))

system.time({
  data.table::fwrite(z100, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z100), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 3200
#    user  system elapsed 
#     0.0     0.0     0.1 

system.time({
  DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    0.17    0.04    2.95 

system.time({
  DBI::dbAppendTable(con2, "mt", z100)
})
#    user  system elapsed 
#    0.74    0.33   23.59 

(I don't want to time dbAppendTable with z1000 above ...)

(For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)

r2evans
    This is awesome, thanks for the hard work! Will reply and let you know how it goes. – Dylan Russell Jun 06 '20 at 03:12
  • I can't seem to get this to work. I know I have psql installed. It is in my Windows path. The `system()` command should be able to get to it. I can get into my database from the command line. I can even run the command you are writing with `sprintf()` directly in the terminal and it works. But when I try your script, nothing happens. – Dylan Russell Jun 08 '20 at 08:26
  • Ah, well I think I've found at least one thing wrong - when I change the command from `system` to `shell`, I can see the terminal output and it says 'psql.exe' is not recognized as an internal or external command... – Dylan Russell Jun 08 '20 at 08:56
  • Nevermind - I had to restart R since I had updated my path to include the psql binaries. The problem I have now is the console requires interaction to input my password. But I can't do that with the `system` or `shell` command. – Dylan Russell Jun 08 '20 at 09:04
  • Okay, that's not hard to get around. If you use the `processx` package instead of `system` or `shell`, you can use `processx::run(cmd, args, env = c(PGPASSWORD="mypass"))`. The start of that is knowing that `psql` can accept its password in an env-var named `PGPASSWORD`. It's not perfectly secure, in that multi-user systems allow other users to see the env-vars of each command, but perhaps you're on Windows or macOS and have some expectations there. See https://stackoverflow.com/q/6405127/3358272 (other options are listed, too; a rough sketch is below these comments). – r2evans Jun 08 '20 at 13:37
  • This is a great answer, but it can't process a csv >2GB on Windows. – Kyouma Jul 01 '21 at 21:38
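Following up on the PGPASSWORD comment above, here is a minimal sketch of calling psql non-interactively through processx; the argument vector and credentials are my own reconstruction, not code from the thread:

# Sketch: run psql's \copy without a password prompt by supplying the
# password through the PGPASSWORD environment variable (assumed credentials).
args <- c("-U", "postgres", "-h", "127.0.0.1", "-p", "35432",
          "-c", sprintf("\\copy mt (%s) from 'mt.csv' (FORMAT CSV, HEADER)",
                        paste(colnames(z1000), collapse = ",")))
processx::run("psql", args,
              env = c(Sys.getenv(), PGPASSWORD = "mysecretpassword"),
              echo = TRUE)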

I suspect that dbAppendTable issues an INSERT statement per row, which can take a long time for a large number of rows.

However, you can generate a single INSERT statement for the entire data frame using the sqlAppendTable function and run it by using dbSendQuery explicitly:

res <- DBI::dbSendQuery(con, DBI::sqlAppendTable(con, tablename, df, row.names=FALSE))
DBI::dbClearResult(res)

For me, this was much faster: a 30 second ingest reduced to a 0.5 second ingest.
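To see what that call actually builds, you can print the generated SQL for a couple of rows; this is just an illustrative check, and it should show a single multi-row INSERT ... VALUES statement:

# Inspect the single INSERT statement that sqlAppendTable generates
# (only two rows here, to keep the output readable).
sql <- DBI::sqlAppendTable(con, tablename, head(df, 2), row.names = FALSE)
cat(as.character(sql), "\n")

Keep in mind that for a data frame of your size this builds one very large SQL string in memory, so the bulk-copy approach in the other answer may still scale better.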

Michal Charemza