52

I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole file? If not, how can I unzip the file and read it efficiently?

Burhan Ali
Elyza Agosta
  • See my answer here [without downloading zip files] https://stackoverflow.com/a/45771620/348168 – Vinod Aug 19 '17 at 12:51

10 Answers

82

I used the zipfile module to read a CSV from the ZIP directly into a pandas DataFrame. Let's say the file is named "intfile.csv" and it's inside a .zip named "THEZIPFILE.zip":

import pandas as pd
import zipfile

zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip') 
df = pd.read_csv(zf.open('intfile.csv'))
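If the name of the CSV inside the archive isn't known up front, `ZipFile.namelist()` can list the members first. This is only a minimal sketch: the archive is built in memory so the example is self-contained, and the member names are made up for illustration.

```python
import io
import zipfile

# Build a small in-memory archive so the sketch runs on its own
# (hypothetical member names; a real archive would normally come from disk).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("intfile.csv", "a,b\n1,2\n")
    zf.writestr("readme.txt", "not a csv")

with zipfile.ZipFile(buffer) as zf:
    # namelist() returns every member name in the archive
    csv_names = [name for name in zf.namelist() if name.lower().endswith(".csv")]

print(csv_names)
```

Any of the resulting names can then be passed to `zf.open(...)` as in the snippet above.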
ZygD
Yaron
45

If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:

import csv
from io import TextIOWrapper
from zipfile import ZipFile

with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)
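If the CSV has a header row, the same approach should work with `csv.DictReader`. The sketch below wraps it in a helper; the name `rows_from_zip` and the sample data are hypothetical, and the archive is created in memory only so the example is runnable as-is. Passing `newline=""` follows the csv module's recommendation for file objects.

```python
import csv
import io
from zipfile import ZipFile


def rows_from_zip(zip_source, member, encoding="utf-8"):
    """Yield each CSV row of `member` as a dict, without extracting to disk."""
    with ZipFile(zip_source) as zf:
        with zf.open(member) as raw:
            # zf.open() yields bytes; TextIOWrapper decodes them to str.
            # newline="" lets the csv module handle line endings itself.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.DictReader(text)


# Illustrative in-memory archive (made-up contents).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("your_csv_inside_zip.csv", "name,qty\napple,3\npear,5\n")

rows = list(rows_from_zip(buffer, "your_csv_inside_zip.csv"))
print(rows)
```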
Community
volker238
27

A quick solution is to pass the zip file straight to pandas:

import pandas as pd

# pandas can read a zipped CSV directly
df = pd.read_csv("/path/to/file.csv.zip")
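To try this end to end, the sketch below first writes a small zip to a temporary directory and then reads it back. The file names and contents are made up for illustration. Note that, per the pandas documentation, the zip must contain only one data file for this to work.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive holding exactly one CSV
    zip_path = os.path.join(tmp, "file.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("file.csv", "a,b\n1,2\n3,4\n")

    # With the default compression='infer', pandas picks zip from the ".zip" suffix
    df = pd.read_csv(zip_path)

print(df.shape)
```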
Hari Prasad
  • Outstanding answer! I checked that the same solution also works without the ".csv" extension: `df = pd.read_csv("/path/to/file.zip")` – Gian Arauz Mar 04 '21 at 14:46
10

zipfile also supports the with statement, which closes the archive for you.

So, adding onto Yaron's answer of using pandas:

import pandas as pd
import zipfile

with zipfile.ZipFile('file.zip') as zf:
    with zf.open('file.csv') as csv_file:
        df = pd.read_csv(csv_file)
gaius_baltar
9

I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zipped folder and then concatenates the results:

import os
import pandas as pd
import zipfile

curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []

print("Uncompressing and reading data... ")

for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)

df = pd.concat(list_)
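One caveat with the loop above: `infolist()` also returns directory entries and any non-CSV members, which `read_csv` would choke on. A possible variant that filters them out is sketched below. The archive is built in memory with made-up member names, and `ignore_index=True` is an optional choice to renumber the combined rows.

```python
import io
import zipfile

import pandas as pd

# Small in-memory archive for illustration (hypothetical member names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("targetfolder/part1.csv", "a,b\n1,2\n")
    zf.writestr("targetfolder/part2.csv", "a,b\n3,4\n")
    zf.writestr("targetfolder/notes.txt", "skip me")

frames = []
with zipfile.ZipFile(buffer) as zf:
    for info in zf.infolist():
        # Skip directory entries and anything that is not a CSV
        if info.is_dir() or not info.filename.lower().endswith(".csv"):
            continue
        with zf.open(info) as member:
            frames.append(pd.read_csv(member))

# ignore_index=True renumbers rows instead of repeating each file's own index
df = pd.concat(frames, ignore_index=True)
print(df)
```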
Xukrao
Arthur D. Howland
5

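Yes. You want the 'zipfile' module.

You open the zip file itself with zipfile.ZipFile(file[, mode]). (zipfile.ZipInfo, by contrast, describes a single member of an archive; it does not open anything.)

You can then use ZipFile.infolist() to enumerate each file within the zip, and read it with ZipFile.open(name[, mode[, pwd]]), which returns a file-like object without extracting anything to disk.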
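Put together, those calls might look like the following sketch. The archive and its member names are invented and created in memory so the example runs on its own; with a real file you would pass its path to `ZipFile` instead.

```python
import io
import zipfile

# Illustrative in-memory archive (made-up members).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/first.csv", "x,y\n1,2\n")
    zf.writestr("data/second.csv", "u,v\n3,4\n")

first_lines = {}
with zipfile.ZipFile(buffer) as zf:  # ZipFile opens the archive itself
    for info in zf.infolist():  # each entry is a ZipInfo describing one member
        print(info.filename, info.file_size)
        # open() returns a binary file-like object; nothing is written to disk
        with zf.open(info.filename) as member:
            first_lines[info.filename] = member.readline().decode("utf-8").strip()

print(first_lines)
```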

brycem
4

This is the simplest approach, and the one I always use:

import pandas as pd
df = pd.read_csv("Train.zip", compression='zip')
SHR
3

Pandas has natively supported compressed CSV files since version 0.18.1: its read_csv method takes a compression parameter with the options {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, defaulting to 'infer'.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
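The default 'infer' works from the file name's extension, so when the data arrives as a file-like object there is no name to infer from and the compression likely has to be stated explicitly. A minimal sketch, assuming a zip held in memory with a single made-up CSV inside:

```python
import io
import zipfile

import pandas as pd

# Illustrative in-memory zip containing exactly one CSV (invented data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Train.csv", "id,label\n1,cat\n2,dog\n")
buffer.seek(0)

# A buffer has no filename, so name the compression explicitly
df = pd.read_csv(buffer, compression="zip")
print(df)
```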

Anatoly Alekseev
2

Suppose you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is a sample implementation:

#!/usr/bin/env python3

from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile

import requests

def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)

This takes care of all the Python 3 bytes/str issues, because TextIOWrapper decodes the bytes coming out of the zip into text for the csv module.
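The parsing half can be exercised without any network access by separating it from the download. In the sketch below, `rows_by_member` is a hypothetical helper, and the payload is a made-up archive standing in for `r.content`.

```python
import io
from csv import DictReader
from zipfile import ZipFile


def rows_by_member(payload, delimiter=";", encoding="utf-8"):
    """Map each member name in a zipped payload (bytes) to its parsed rows."""
    result = {}
    with ZipFile(io.BytesIO(payload)) as zip_ref:
        for name in zip_ref.namelist():
            with zip_ref.open(name) as file_contents:
                text = io.TextIOWrapper(file_contents, encoding=encoding, newline="")
                result[name] = list(DictReader(text, delimiter=delimiter))
    return result


# Stand-in for r.content: an invented semicolon-delimited CSV inside a zip.
fake_download = io.BytesIO()
with ZipFile(fake_download, "w") as zf:
    zf.writestr("sample.csv", "id;name\n1;alpha\n")
payload = fake_download.getvalue()

parsed = rows_by_member(payload)
print(parsed)
```

With a real download, `payload` would simply be `requests.get(url).content`.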

hughdbrown
-1

If you have a file named my_big_file.csv and you zip it as my_big_file.zip, you can simply do this:

df = pd.read_csv("my_big_file.zip")

Note: check your pandas version first; older versions do not support this.


adhg