Concatenate files into one Dataframe while adding identifier for each file

Question

The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.

What I want to do is add another column to each dataframe holding the participant number, so that when the files are all concatenated I still have a participant identifier on every row.

The files are named like this:

So perhaps I could just add a column with the ucsd1, etc. to identify each participant?

Here's code that I've gotten to work for Excel files:

import glob
import pandas as pd

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []

for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Should the participant number match the number in the filename? — richardec, Nov 17 '21 at 20:31

score 1 · Accepted Answer · answered Nov 17 '21 at 20:35

If I understand you correctly, it's simple:

import glob
import re  # <-------------- Add this line
import pandas as pd

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []

for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    participant_number = int(re.search(r'(\d+)', filename).group(1)) # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in every row of that dataframe will be the number found in the name of the file it was loaded from.
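To see what the concatenated result looks like without touching any files, here is a minimal sketch that uses two small in-memory dataframes in place of the Excel files. The heart_rate column and its values are made up purely for illustration; only the participant_number column comes from the answer above.

```python
import pandas as pd

# Hypothetical stand-ins for two participants' spreadsheets
# (the "heart_rate" column and its values are invented for this example).
frames_by_participant = {
    1: pd.DataFrame({"heart_rate": [61, 64]}),
    2: pd.DataFrame({"heart_rate": [70, 72, 75]}),
}

li = []
for participant_number, df in frames_by_participant.items():
    df = df.copy()
    # Assigning a scalar fills every row of the new column with that value.
    df["participant_number"] = participant_number
    li.append(df)

# Stack the frames vertically and renumber the index from 0.
frame = pd.concat(li, axis=0, ignore_index=True)
print(frame)
```

Every row keeps the identifier of the frame it came from, so the combined frame can later be grouped or filtered by participant_number.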

richardec


it looks like all participants are just being named "1" – James Nov 17 '21 at 20:40
Hmm. Well, I can't tell. Can you `print(participant_number)` before it's assigned to the df? – richardec Nov 17 '21 at 21:56
Yeah, I figured out what's going on, I think. The regex takes the first digits it finds in the full path, which is the 1 from "Watch_data_1". If I use the relative file path, everything seems to work well. Thanks! – James Nov 17 '21 at 22:41
Aha! Funny. Okay, I'm glad you got it working. – richardec Nov 17 '21 at 23:35
Thanks for your help! – James Nov 18 '21 at 23:45
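
The problem worked out in the comments above can also be avoided without changing the path, by searching only the file's base name instead of the whole path. This is an alternative to the relative-path fix James used, not what he actually ran. A minimal sketch, assuming a file called ucsd2.xlsx (a hypothetical name modelled on the "ucsd1" mentioned in the question) and a helper participant_number_from that is introduced here for illustration:

```python
import os
import re

def participant_number_from(filename):
    """Return the first run of digits in the file's base name as an int."""
    # Drop the directories so digits such as the 1 in "Watch_data_1" are ignored.
    base = os.path.basename(filename)
    match = re.search(r"(\d+)", base)
    if match is None:
        raise ValueError(f"no digits found in file name: {base!r}")
    return int(match.group(1))

# Hypothetical file name; only the directory part comes from the question.
full_path = "/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call/ucsd2.xlsx"

# Searching the whole path finds the directory's digit first.
print(int(re.search(r"(\d+)", full_path).group(1)))  # prints 1

# Searching only the base name finds the participant's digit.
print(participant_number_from(full_path))  # prints 2
```

Inside the loop from the answer, this would replace the re.search line with participant_number = participant_number_from(filename).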
