Import multiple csv files into pandas and concatenate into one DataFrame

Question

I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop???

your code does nothing because you are not appending to your `dfs` list, don't you want to replace the line `data = pd.read_csv(filename)` with `dfs.append(pd.read_csv(filename)`. You would then need to loop over the list and `concat`, I don't think `concat` will work on a list of `df`s. — EdChum, Jan 03 '14 at 15:05
also you are mixing an alias for the module with the module name in your last line, shouldn't it be `big_frame = pd.concat(dfs, ignore_index=True)`?, anyway once you have a list of dataframes you will need to iterate over the list and concat to `big_frame` — EdChum, Jan 03 '14 at 15:11
Yes, I edited the code, but i'm still not able to build a concatenated dataframe from the csv-files, I'm new to python so I need some more help on this — jonas, Jan 03 '14 at 15:14
you need to loop over `dfs` now, so something like `for df in dfs: big_frame.concat(df, ignore_index=True)` should work, you could also try `append` instead of `concat` also. — EdChum, Jan 03 '14 at 15:16
Can you tell more exactly what is not working? Because `concat` should handle a list of DataFrames just fine like you did. I think this is a very good approach. — joris, Jan 03 '14 at 15:18
I stand corrected on the use of `concat` after reading the [docs](http://pandas.pydata.org/pandas-docs/dev/merging.html), it should work like @joris says, can you edit your question and post the error message — EdChum, Jan 03 '14 at 15:21
I guess I'm totally lost now. Are you able to try to improve my code on some simple csv files and build a dataframe? — jonas, Jan 03 '14 at 15:23
Jonas, from what we can see, your code is completely fine. Can you specify what goes wrong? Does reading only 1 csv file work? What is the output of that? Does the list `dfs` contain several dataframes? What do you get if you print that? — joris, Jan 03 '14 at 15:30
Hm, if I try to read only one file from the file list I get an error message (df1 =pd.read_csv(filenames[0])). error: CParserError: Error tokenizing data. C error: Expected 1 fields in line 1040, saw 2 — jonas, Jan 03 '14 at 15:41
Then there goes something wrong with the reading of your files itself, not with the concatenation. Is it a regulat csv file? Because apparantly there is something wrong on line 1040 as your error message says. — joris, Jan 03 '14 at 15:45
Something wrong with the csv files I guess!! I have the excel files, any suggestions on how to perform the same operations on excel files?? — jonas, Jan 03 '14 at 15:57
There is a `read_excel` function (http://pandas.pydata.org/pandas-docs/dev/io.html#io-excel), but maybe you can just figure out what is wrong with the csv file: just look at line 1040 of the file. — joris, Jan 03 '14 at 16:01
tried to clean the files but still there are errors, posted a new question for reading excelfiles instead.. — jonas, Jan 03 '14 at 16:17
Joris, if you add an answer I could rate it and accept your answer. Thank you so much for helping — jonas, Jan 03 '14 at 16:43
The easiest answer, I found in here https://stackoverflow.com/a/53017476/7060530 — Sayed Mohsin Reza, Oct 26 '18 at 23:42

score 746 · Accepted Answer · edited Apr 14 '22 at 02:48

746

If you have same columns in all your csv files then you can try the code below. I have added header=0 so that after reading csv first row can be assigned as the column names.

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path , "/*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

edited Apr 14 '22 at 02:48

paradocslover

2,600
3
15
38

answered Jan 20 '14 at 11:29

Gaurav Singh

11,121
5
21
24

4

This seems like an old fashioned aka manual way of doing things, esp. as the Hapood ecosystem has growing list of tools where you can perform sql queries directly on many different directories containing different file types (csv, json, txt, databases) as if it was one data source. There must be something similar in python, since it has had a 20 year jump start on doing "big data". – Hexatonic Dec 28 '15 at 04:22
433

The same thing more concise, and perhaps faster as it doesn't use a list: `df = pd.concat((pd.read_csv(f) for f in all_files))` Also, one should perhaps use `os.path.join(path, "*.csv")` instead of `path + "/*.csv"`, which makes it OS independent. – Sid Jan 23 '16 at 00:41
6

Using this answer allowed me to add new column with the file name eg with `df['filename'] = os.path.basename(file_)` in the for file_ loop .. not sure if Sid's answer allows this? – curtisp Oct 20 '16 at 19:49
13

@curtisp you can still do that with Sid's answer, just use `pandas.read_csv(f).assign(filename = foo)` inside the generator. `assign` will return the entire dataframe including the new column `filename` – C8H10N4O2 Apr 04 '17 at 20:50
If you have __many__ files, I'd use a generator instead of importing + appending to a list before concatenating them all. – gustafbstrom Feb 19 '20 at 11:35
Note that glob.glob() won't preserve the order of your files, so you need to throw a quick `sorted(all_files)` in there for that. – sigma1510 Jun 26 '20 at 17:15
1

This was first clear answer I was able to find hat described combining multiple csv into list, then convert combined to dataframe without having to define dataframe columns first. I modified this answer for my use case combining multiple `requests.get(url)` csv responses by replacing `filename` with ` io.StringIO(response.content.decode('utf-8'))` – curtisp Jul 12 '20 at 16:09
I use this solution to combine multiple excel files. The files have latitude and logitude and when I use pd.read_excel to check each file both of the values are read correct as float. When I use your solution to convert the files into one datafrage latitude is always an object and only longitude is correct as float. Any ideas why this is so? – Gobrel Dec 22 '21 at 13:35

score 359 · Answer 2 · edited May 23 '17 at 10:31

359

An alternative to darindaCoder's answer:

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one

edited May 23 '17 at 10:31

Community

1
1

answered Apr 05 '16 at 02:47

Sid

5,442
2
13
18

4

@Mike @Sid the final two lines can be replaced by: `pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)`. The inner brackets are required by Pandas version 0.18.1 – Dr Fabio Gori Oct 31 '16 at 15:27
15

I recommend using `glob.iglob` instead of `glob.glob`; The first one returns and [iterator (instead of a list)](https://docs.python.org/3/library/glob.html#glob.iglob). – toto_tico Aug 02 '17 at 12:52

score 107 · Answer 3 · edited Jan 27 '21 at 11:57

107

import glob
import os
import pandas as pd   
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

edited Jan 27 '21 at 11:57

Asocia

5,441
2
19
42

answered Feb 21 '17 at 16:25

Jose Antonio Martin H

1,327
1
9
10

5

Excellent one liner, specially useful if no read_csv arguments are needed! – rafaelvalle Nov 09 '17 at 19:38
23

If, on the other hand, arguments are needed, this can be done with lambdas: `df = pd.concat(map(lambda file: pd.read_csv(file, delim_whitespace=True), data_files))` – fiedl Apr 11 '18 at 14:46
2

^ or with `functools.partial`, to avoid lambdas – cs95 May 27 '19 at 05:10

robmsmt · Answer 4 · 2021-11-05T03:10:55.083

79

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.

For a few files - 1 liner

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))

For many files

import os

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

For No Headers

If you have specific things you want to change with pd.read_csv (i.e. no headers) you can make a separate function and call that with your map:

def f(i):
    return pd.read_csv(i, header=None)

df = pd.concat(map(f, filepaths))

This pandas line which sets the df utilizes 3 things:

Python's map (function, iterable) sends to the function (the pd.read_csv()) the iterable (our list) which is every csv element in filepaths).
Panda's read_csv() function reads in each CSV file as normal.
Panda's concat() brings all these under one df variable.

edited Nov 05 '21 at 03:10

answered Jun 30 '18 at 21:23

robmsmt

1,252
11
19

4

or just `df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv))` – muon Mar 01 '19 at 18:05
I tried the method prescribed by @muon. But, i have multiple files with headers(headers are common). I don't want them to be concatenated in the dataframe. Do you know how can i do that ? I tried `df = pd.concat(map(pd.read_csv(header=0), glob.glob('data/*.csv))` but it gave an error "parser_f() missing 1 required positional argument: 'filepath_or_buffer'" – cadip92 Mar 03 '20 at 13:14
It's a little while since you asked... but I updated my answer to include answers without headers (or if you want to pass any change to read_csv). – robmsmt Nov 05 '21 at 03:01

score 63 · Answer 5 · edited Jul 31 '21 at 18:45

63

Easy and Fast

Import two or more csv's without having to make a list of names.

import glob
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))

edited Jul 31 '21 at 18:45

marcelovca90

2,603
3
24
33

answered Mar 30 '19 at 22:09

MrFun

1,723
13
16

3

How do we pass arguments to this syntax? – delimiter Sep 21 '21 at 20:37
My answer: https://stackoverflow.com/a/69994928/10358768, inspired from this particular answer! – Milan Nov 16 '21 at 19:21
@delimiter, to insert the file path to your docs, replace the word 'data' with your file path, and keep the / at the end. – Brad G Grounds Feb 17 '22 at 20:55

Jouni K. Seppänen · Answer 6 · 2020-11-30T04:05:00.750

59

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)

The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.

edited Nov 30 '20 at 04:05

answered Feb 13 '18 at 08:13

Jouni K. Seppänen

40,105
5
71
99

Similar to this, there should be a function in pandas API for reading multiple files in a dir. Apparently it does not have it, as now. – Shiv Krishna Jaiswal Aug 08 '21 at 13:42
Does this work with AWS S3 paths too? – xmar Jun 02 '22 at 11:45

SKG · Answer 7 · 2017-07-17T18:54:08.750

17

Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However of late I am finding it faster to do any manipulation using numpy and then assigning it once to dataframe rather than manipulating the dataframe itself on an iterative basis and it seems to work in this solution too.

I do sincerely want anyone hitting this page to consider this approach, but don't want to attach this huge piece of code as a comment and making it less readable.

You can leverage numpy to really speed up the dataframe concatenation.

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1","col2"....]

Timing stats:

total files :192
avg lines per file :8492
--approach 1 without numpy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with numpy -- 2.289292573928833 seconds ---

edited Jul 17 '17 at 18:54

answered Mar 17 '16 at 22:44

SKG

1,342
2
12
23

Any numbers to back the "speed up"? Specifically, is it faster than http://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe/36073025#comment57648172_21232849 ? – ivan_pozdeev Mar 17 '16 at 22:46
I don't see the OP asking for a way to speed up his concatenation, this just looks like a rework of a pre-existing accepted answer. – pydsigner Mar 17 '16 at 22:49
3

That won't work if the data has mixed columns types. – Pimin Konstantin Kefaloukos Nov 16 '17 at 13:46
1

@SKG perfect.. this is the only working solution for me. 500 files 400k rows total in 2 secs. Thanks for posting it. – FrankC Sep 15 '18 at 01:45
1

1500 files and 750k rows in 5 secs. Excellent @SKG – J. Velazquez-Muriel Aug 12 '20 at 22:57
' np_array_list.append(df.as_matrix())' showing an error AttributeError: 'DataFrame' object has no attribute 'as_matrix' – Krishnaap Apr 28 '22 at 10:14

muon · Answer 8 · 2020-04-23T20:38:30.140

14

one liner using map, but if you'd like to specify additional args, you could do:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None), 
                    glob.glob("data/*.csv")))

Note: map by itself does not let you supply additional args.

edited Apr 23 '20 at 20:38

answered Mar 01 '19 at 18:15

muon

10,786
7
60
76

toto_tico · Answer 9 · 2018-09-09T19:36:50.407

If you want to search recursively (Python 3.5 or above), you can do the following:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

Note that the three last lines can be expressed in one single line:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

You can find the documentation of ** here. Also, I used iglobinstead of glob, as it returns an iterator instead of a list.

EDIT: Multiplatform recursive function:

You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

df = read_df_rec('C:\user\your\path', *.csv)

Here is the function:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

score 7 · Answer 10 · answered Nov 16 '21 at 19:20

Inspired from MrFun's answer:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

Notes:

By default, the list of files generated through glob.glob is not sorted. On the other hand, in many scenarios, it's required to be sorted e.g. one may want to analyze number of sensor-frame-drops v/s timestamp.
In pd.concat command, if ignore_index=True is not specified then it reserves the original indices from each dataframes (i.e. each individual CSV file in the list) and the main dataframe looks like
```
    timestamp    id    valid_frame
0
1
2
.
.
.
0
1
2
.
.
.
```
With ignore_index=True, it looks like:
```
    timestamp    id    valid_frame
0
1
2
.
.
.
108
109
.
.
.
```
IMO, this is helpful when one may want to manually create a histogram of number of frame drops v/s one minutes (or any other duration) bins and want to base the calculation on very first timestamp e.g. begin_timestamp = df['timestamp'][0]

Without, ignore_index=True, df['timestamp'][0] generates the series containing very first timestamp from all the individual dataframes, it does not give just a value.

score 6 · Answer 11 · edited Feb 25 '21 at 15:06

6

If the multiple csv files are zipped, you may use zipfile to read all and concatenate as below:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = []

train = [ pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() ]

df = pd.concat(train)

edited Feb 25 '21 at 15:06

Skippy le Grand Gourou

5,996
4
52
70

answered Aug 12 '16 at 05:48

Nim J

883
2
9
15

score 6 · Answer 12 · answered Mar 11 '19 at 12:35

6

Another on-liner with list comprehension which allows to use arguments with read_csv.

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

answered Mar 11 '19 at 12:35

mjspier

5,992
5
30
42

score 6 · Answer 13 · answered Sep 20 '19 at 13:08

Alternative using the pathlib library (often preferred over os.path).

This method avoids iterative use of pandas concat()/apped().

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

Paul Rougieux · Answer 14 · 2022-03-14T09:41:48.630

Based on @Sid's good answer.

To identify issues of missing or unaligned columns

Before concatenating, you can load csv files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.

Import modules and locate file paths:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.

Load csv files into a dictionary. Then concatenate:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

Keys are file names f and values are the data frame content of csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.

Gonçalo Peres · Answer 15 · 2021-05-20T21:13:30.857

2

import os

os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")

Where NR and FNR represent the number of the line being processed.

FNR is the current line within each file.

NR == 1 includes the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.

edited May 20 '21 at 21:13

answered Feb 08 '21 at 14:00

Gonçalo Peres

6,010
3
36
66

1

I bet this is a lot faster than using pandas concat! – kristianp Dec 30 '21 at 23:37

score 2 · Answer 16 · answered Jun 22 '21 at 17:27

In case anyone is facing Unnamed column issue, can use this code for merging multiple csv files along x-axis.

import glob
import os
import pandas as pd

merged_df = pd.concat([pd.read_csv(csv_file, index_col=0, header=0) for csv_file in glob.glob(
        os.path.join("data/", "*.csv"))], axis=0, ignore_index=True)
    
merged_df.to_csv("merged.csv")

score 0 · Answer 17 · answered Oct 21 '20 at 07:05

You can do it this way also:

import pandas as pd
import os

new_df = pd.DataFrame()
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        complete_file_path = csv_folder_path+file
        read_file = pd.read_csv(complete_file_path)
        new_df = new_df.append(read_file, ignore_index=True)


new_df.shape

score 0 · Answer 18 · answered Dec 02 '21 at 08:23

Consider using convtools library, which provides lots of data processing primitives and generates simple ad hoc code under the hood. It is not supposed to be faster than pandas/polars, but sometimes it can be.

e.g. you could concat csv files into one for further reuse - here's the code:

import glob

from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd


def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
# took 20.9 s


def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)

    table.into_csv("out_convtools.csv", include_header=False)
# took 15.8 s

Of course if you just want to obtain a dataframe without writing a concatenated file, it will take 4.63 s and 10.9 s correspondingly (pandas is faster here because it doesn't need to zip columns for writing it back).

score -2 · Answer 19 · answered Dec 21 '19 at 07:20

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    lsit_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(lsit_df_csv, ignore_index=True)

score -2 · Answer 20 · answered Apr 19 '20 at 11:57

This is how you can do using Colab on Google Drive

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True,sort=True)
frame.to_csv('/content/drive/onefile.csv')