I have a 7 GB zip file (~80 GB unzipped) on a network drive. I am operating in a VDI where I can use Python, but I have limited storage. The file contains 65+ million rows of data that I need to load into an MSSQL database. Using legacy processes, this takes around 5 days, which seems absurd to me.

So I'm wondering if there is a way to stream the file and unzip it on the fly, while using sqlalchemy and fast_executemany to upload the data into the database.
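
Roughly what I have in mind is something like the sketch below (untested; the path is a placeholder, it assumes the archive holds a single CSV, and it reuses the engine/schema/table variables from my test code further down):

#sketch: stream the zipped csv in chunks straight into the table (untested)
import io, zipfile
import pandas as pd

my_zip_path = r'\\network\share\bigfile.zip'        #placeholder path on the network drive

with zipfile.ZipFile(my_zip_path, 'r') as zipObj:
    member = zipObj.namelist()[0]                    #assume the csv is the only member
    with zipObj.open(member, 'r') as raw:            #file-like object, decompressed on the fly
        text = io.TextIOWrapper(raw, encoding='utf-8')
        #read ~100k rows at a time so the 80 GB never has to fit in memory or on disk
        for chunk in pd.read_csv(text, chunksize=100_000):
            chunk.to_sql(table, engine, schema=schema, if_exists='append', index=False)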

Another possible option would be to download the file to my local machine and unzip it there, then give the VDI access to it. But I'm not sure how slow streaming the file that way would be.

I have tested creating a connection and uploading to a table using sqlalchemy, and I'm able to upload 100k rows in around 7 seconds (~14,285 rows per second). Napkin math says 65 million rows could go as quickly as a little over an hour: 65,000,000 / 14,285 ≈ 4,550 seconds ≈ 76 minutes of upload time. (It obviously won't be that fast, because the real data has 10x the number of columns of my test set, but I think I can get orders of magnitude quicker than the legacy process.)
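
The same napkin math spelled out in code (just the arithmetic, nothing clever):

#upload time estimate from the 100k-row test
rows_tested = 100_000
seconds_tested = 7
rows_total = 65_000_000

rows_per_second = rows_tested / seconds_tested       #~14,285 rows per second
estimated_minutes = rows_total / rows_per_second / 60
print(f'{rows_per_second:,.0f} rows/s -> about {estimated_minutes:.0f} minutes')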

#database code
import sqlalchemy as sa, pyodbc, time, pandas as pd
from urllib.parse import quote_plus

#db credentials
db_UID =''
db_PWD =''

#MSSQL connection settings
trust = 'yes'
driver ='{ODBC Driver 17 for SQL Server}'
server = ''
database = ''
schema = ''
table = ''
con_string = f'DRIVER={driver};SERVER={server};DATABASE={database};UID={db_UID};PWD={db_PWD};Trusted_Connection={trust};'

#initialize conn for sqlalchemy
quoted = quote_plus(con_string)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = sa.create_engine(new_con, fast_executemany=True)
s = time.time()
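#output_df is the 100k-row test DataFrame from the timing test described above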
output_df.to_sql(table, engine, schema = schema, if_exists = 'replace', chunksize = None)
print(time.time() - s)
engine.dispose()

EDIT:

Some things that I am thinking about trying, but have no idea if they will work:


#stream a zip
from zipfile import ZipFile

with ZipFile(myFile, 'r') as zipObj:
    #note: read() decompresses each whole member into memory; zipObj.open(item) would give a streamable file object instead
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]

#using naya (a streaming json parser)
import naya as ny   #assumption: depending on the version, stream_array/tokenize may need to come from naya.json

messages = ny.stream_array(ny.tokenize(open('file', encoding='utf-8')))
for i in messages:
    pass   #load to sql? presumably batch these up and insert


#possibly some promising solution pieces here, where I could read row by row from the zipped csv and then do an insert (rough sketch after the link):

https://stackoverflow.com/questions/26942476/reading-csv-zipped-files-in-python
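
Something along these lines is what I'm picturing (untested sketch; the INSERT statement and the number of ? placeholders are made up, and it reuses myFile and con_string from above):

#sketch: read the zipped csv row by row and insert in batches with fast_executemany (untested)
import csv, io, zipfile
import pyodbc

batch_size = 100_000
insert_sql = 'INSERT INTO my_schema.my_table VALUES (?, ?, ?)'   #placeholder - one ? per column

conn = pyodbc.connect(con_string)
cursor = conn.cursor()
cursor.fast_executemany = True

with zipfile.ZipFile(myFile, 'r') as zipObj:
    with zipObj.open(zipObj.namelist()[0], 'r') as raw:
        reader = csv.reader(io.TextIOWrapper(raw, encoding='utf-8'))
        next(reader)                                  #skip the header row
        batch = []
        for row in reader:
            batch.append(row)                         #values arrive as strings; may need type conversion
            if len(batch) >= batch_size:
                cursor.executemany(insert_sql, batch)
                conn.commit()
                batch = []
        if batch:                                     #flush the final partial batch
            cursor.executemany(insert_sql, batch)
            conn.commit()

cursor.close()
conn.close()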
