I was looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. This approach blew me away in terms of speed: plain `to_sql` took almost 2 hours, while this script pushed a 100k row x 100 column dataframe in 12.54 seconds.

So after testing the code below with a sample df, I attempted to use a df that had many different datatypes (ints, strings, floats, Booleans). However, I was sad to see a MemoryError. So I started reducing the size of my df to see what the limitations were. I noticed that if my df had any strings, then I wasn't able to load it to SQL Server. I am having trouble isolating the issue further. The script below is taken from the question in the link, except that I added a tiny df with strings. Any suggestions on how to rectify this issue would be great!

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn = "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


# Enable pyodbc's fast_executemany for the bulk INSERTs that to_sql issues
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


# Tiny DataFrame containing only strings; this is what triggers the MemoryError
table_name = 'fast_executemany_test'
df1 = pd.DataFrame({'col1': ['tyrefdg', 'ertyreg', 'efdgfdg'],
                    'col2': ['tydfggfdgrefdg', 'erdfgfdgfdgfdgtyreg', 'edfgfdgdfgdffdgfdg']
                    })

s = time.time()
df1.to_sql(table_name, engine, if_exists='replace', chunksize=None)
print(time.time() - s)
Martin Bobak

3 Answers

I was able to reproduce your issue using pyodbc 4.0.23. The MemoryError was related to your use of the ancient

DRIVER={SQL Server}

Further testing using

DRIVER=ODBC Driver 11 for SQL Server

also failed, with

Function sequence error (0) (SQLParamData)

which was related to an existing pyodbc issue on GitHub. I posted my findings here.

That issue is still under investigation. In the meantime you might be able to proceed by

  • using a newer ODBC driver like DRIVER=ODBC Driver 13 for SQL Server, and
  • running `pip install pyodbc==4.0.22` to use an earlier version of pyodbc (see the sketch below).
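
For reference, a minimal sketch combining both suggestions, assuming the same placeholder connection details as in the question and that ODBC Driver 13 is actually installed on the machine:

# After pip install pyodbc==4.0.22, confirm which version is active
import pyodbc
print(pyodbc.version)

# Same engine setup as in the question, but with a newer ODBC driver
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quote_plus(conn)))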
Gord Thompson
  • Thanks! pyodbc 4.0.22 didn't work for me but pyodbc 4.0.19 worked fine. Also, I noticed that the script throws an error when adding datatypes into the `to_sql` statement, so I just load the data to SQL Server and change the datatypes there (sketched below), in case anyone is getting that same error. – Martin Bobak Apr 23 '18 at 18:35
  • Thanks @MartinBobak! It's such a strange error. I had to fiddle around with the metadata and it worked, but it's very annoying. I also found the performance boost amazing. – Umar.H Jun 17 '19 at 13:45
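
A rough sketch of the workaround described in the first comment above (load first, then fix the column types in SQL Server), reusing df1, engine, and table_name from the question; the NVARCHAR(50) lengths are only illustrative:

# Load the data without passing dtype= to to_sql ...
df1.to_sql(table_name, engine, if_exists='replace', index=False)

# ... then change the column types directly in SQL Server
from sqlalchemy import text

with engine.begin() as connection:
    connection.execute(text("ALTER TABLE fast_executemany_test ALTER COLUMN col1 NVARCHAR(50)"))
    connection.execute(text("ALTER TABLE fast_executemany_test ALTER COLUMN col2 NVARCHAR(50)"))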

I was facing this issue too.

Solved by changing:

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"

To this:

conn =  "DRIVER={ODBC Driver 17 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
Hugo Vares

I experienced this issue with a 32-bit Python interpreter; switching to a 64-bit interpreter resolved my memory error. Beyond that solution, I would recommend chunking the amount of data you process: establish a threshold, write each chunk of data once the threshold is met, and iterate until you have processed all of the data (see the sketch below).
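
A minimal sketch of that chunking idea, reusing df1, engine, and table_name from the question; the 10,000-row threshold is an arbitrary example, not a tuned value:

# Option 1: let to_sql batch the INSERT statements itself
df1.to_sql(table_name, engine, if_exists='replace', index=False, chunksize=10000)

# Option 2: slice the DataFrame manually and write one chunk at a time
chunk_rows = 10000
for start in range(0, len(df1), chunk_rows):
    chunk = df1.iloc[start:start + chunk_rows]
    chunk.to_sql(table_name, engine,
                 if_exists='replace' if start == 0 else 'append',
                 index=False)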