
I'm trying to analyze data stored in an SQL database (MS SQL Server) from R on a Mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far I've been using the R package odbc, and it seems to work pretty well.

However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL Server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly: https://stackoverflow.com/a/59220710/8400969. I'm wondering if there is another way to pipe the data to my Mac more quickly, and I like the line of thinking here: Quickly reading very large tables as dataframes
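
For reference, my chunked fetch looks roughly like this (the connection settings and the query are placeholders):

```r
library(DBI)
library(odbc)

# Placeholder connection details
con <- dbConnect(odbc::odbc(), dsn = "my_mssql_dsn")

res <- dbSendQuery(con, "SELECT * FROM some_big_table")  # placeholder query
chunks <- list()
repeat {
  chunk <- dbFetch(res, n = 100000)              # pull 100k rows per round trip
  if (nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
}
dbClearResult(res)

result <- do.call(rbind, chunks)                 # data.table::rbindlist() is faster here
```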

What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?

Thanks for your ideas and suggestions!

  • I was just reading in large tables with `data.table::fread` and `read.csv`. `fread(my.csv, data.table=F)` beats out `fread(my.csv)`, and both take 1/10 as long as `read.csv`. I don't yet know much about how `fread` works but is there an equivalent method for fetching query results? – Michael Mar 18 '20 at 18:26
  • As I learn more about `fread`, wondering if there is a way to `dbFetch` by `mmap`ing the query results for fast read in. – Michael Mar 19 '20 at 12:40
  • Did you also try out `dbGetQuery`? Do the queries have to be on the fly? Otherwise you could preprocess some data, store it in an `.rda`, and read that. Or maybe you could write the query result to a .csv file first and then use `fread` – SeGa Mar 23 '20 at 08:47
  • @SeGa, `dbGetQuery` is just as bad... but could you spell out your two recommendations a bit more for me? Those both look promising. I think for my purposes (data exploration, visualization, some cleaning, of data collected by others in a database built by others) being able to do things on the fly is pretty important for now. – Michael Mar 23 '20 at 13:10
  • 1
    You might be better off running the query in SSMS with "Results to Text". That will create a csv flat file - try fread on that csv? – Russell Fox Mar 24 '20 at 21:22
  • Following @RussellFox's direction, maybe try a dedicated DB client to see the timing; if it still takes a few hours there, there is little hope it can be faster in R. I use a C++ ODBC implementation from time to time to extract data (also some GB), and it is incredibly slow... – Chelmy88 Mar 25 '20 at 09:50
  • @RussellFox, is there a good SSMS alternative for Mac? FWIW, the query runs locally on the server using SSMS in a couple of minutes; I think the slow part is getting the results of the query into R. Thanks for the ideas! I'm experimenting today! – Michael Mar 25 '20 at 14:14
  • It's not a good solution and it does seem like there's a deeper problem at its root, but you might try using the `future` package to read the data in chunks in parallel to speed things up in the short term – Roger-123 Mar 25 '20 at 14:43
  • Azure Data Studio is available for mac: https://github.com/Microsoft/azuredatastudio. Or you might be able to use the BCP command line tools on the server: "bcp "SELECT TOP 5 [BusinessEntityID],[NationalIDNumber],[OrganizationNode],[OrganizationLevel] FROM AdventureWorks2016CTP3.[HumanResources].[Employee] WITH (NOLOCK)" queryout c:\sql\bcp.txt -c -T" – Russell Fox Mar 25 '20 at 15:50
  • https://www.sqlservercentral.com/articles/8-ways-to-export-sql-results-to-a-text-file – Russell Fox Mar 25 '20 at 15:50
  • Truly an interesting problem. As for a UI alternative, [dbeaver](https://dbeaver.io/download/) can be used ([export example](https://stackoverflow.com/a/47515548/10782538)). As for how to change the format for `dbFetch`, one would have to rewrite the underlying `C++` source code, as this is where most of the package code resides. Similar for `DBI`, which seems to simply call the methods from `odbc`. – Oliver Mar 26 '20 at 08:13
  • I believe @SeGa's comment re: "on the fly" was suggesting something like this workflow. In SQL, `CREATE TABLE TEMP as SELECT LONG_COMPLEX_QUERY FROM TABLE` and then `dbFetch()` the pre-computed table? That's a go-to workflow for me as reading in the pre-computed table executed much faster. – Chase May 09 '20 at 14:33
  • 1
    I found this helps in some situations: https://db.rstudio.com/pool/ – BSCowboy Jan 08 '21 at 23:07
  • I have been running into similar problems on my projects, and I found the solution usually to be to *reduce the amount of data transferred*. There's a reason large data tables reside on databases. The most user-friendly package I've encountered is [dbplyr](https://db.rstudio.com/r-packages/dplyr), which allows `dplyr`-style query construction that only gets evaluated when needed. Can you provide some details on the use cases of your GB+ datasets? – All Downhill From Here Aug 22 '21 at 09:29
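
A rough sketch of the export-then-`fread` workflow that @SeGa and @RussellFox describe in the comments above; the server name, query, and output path are placeholders, and the `bcp` call needs the SQL Server command-line tools:

```r
library(data.table)

# Hypothetical bcp invocation; -c writes tab-delimited character data and
# -T uses a trusted connection (swap in -U/-P for SQL authentication).
system(paste(
  'bcp "SELECT col1, col2 FROM MyDb.dbo.MyTable WITH (NOLOCK)"',
  'queryout /tmp/query_result.tsv -c -S my_server -T'
))

# Reading the flat file back is usually far faster than dbFetch on the same rows.
dt <- fread("/tmp/query_result.tsv", data.table = FALSE)
```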
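
A minimal sketch of the pre-computed-table workflow @Chase mentions, assuming an existing DBI connection `con`; T-SQL uses `SELECT ... INTO` rather than `CREATE TABLE ... AS`, and all table and column names are placeholders:

```r
library(DBI)

# Materialise the expensive query once on the server.
dbExecute(con, "
  SELECT col1, col2, col3
  INTO dbo.my_precomputed_results   -- or a #temp table, which lives only for this session
  FROM dbo.big_table
  WHERE some_expensive_condition = 1
")

# Later reads hit the pre-computed table instead of re-running the heavy query.
df <- dbGetQuery(con, "SELECT * FROM dbo.my_precomputed_results")
```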
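
One possible shape for @Roger-123's parallel-chunks idea, assuming the table has a column you can order and page over and that each worker can open its own connection; the DSN, table, key column, row count, and chunk size are all placeholders:

```r
library(DBI)
library(odbc)
library(future.apply)

plan(multisession, workers = 4)

n_rows   <- 4000000L                   # placeholder total row count
chunk_sz <- 500000L
offsets  <- seq(0L, n_rows - 1L, by = chunk_sz)

chunks <- future_lapply(offsets, function(off) {
  con <- dbConnect(odbc::odbc(), dsn = "my_mssql_dsn")  # one connection per worker
  on.exit(dbDisconnect(con))
  dbGetQuery(con, sprintf(
    "SELECT * FROM dbo.big_table ORDER BY id
     OFFSET %d ROWS FETCH NEXT %d ROWS ONLY", off, chunk_sz))
})

df <- do.call(rbind, chunks)
```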
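
Finally, a sketch of the dbplyr approach from the last comment: build the query lazily so the filtering and aggregation run on the server, and only `collect()` the (ideally much smaller) result. Table and column names here are placeholders.

```r
library(dplyr)
library(dbplyr)

# `con` is an existing DBI/odbc connection; table and column names are placeholders.
big_tbl <- tbl(con, in_schema("dbo", "big_table"))

summary_df <- big_tbl %>%
  filter(year == 2019) %>%                        # translated to SQL, runs on the server
  group_by(site) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  collect()                                       # only the aggregated rows come back to R
```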

0 Answers