How to handle 2500 .csv files in R

Question

I was wondering how I can handle 2500 .csv files with the same number of columns (i.e., 4 columns)? I want to import these files, remove the first and second columns, change the title of the remaining columns to x and y, remove duplicates in each data frame, and finally save these data frames as separate .csv files (2500 files). I have used the following script:

library(dplyr)
# Get all filenames
list.files(path = "D:/R_project", full.names = `TRUE) %>%`
  # Import all files
  purrr::map(readr::read_csv) %>%
  purrr::map(
    ~ .x %>%
      # Select columns and rename
      select(
        x = Col3,
        y = Col4
      ) %>% 
      # Remove duplicates
      distinct()
  ) %>% 
  # Save all files (same filename, but in a different folder)
  purrr::walk2(
    list.files("D:/R_project/Treated"),
    ~ readr::write_csv(x = .x, file = paste0("output/folder/", .y))
  )

However, I received this error for all of the data frames in the end (below is an example for one of the data frames):

Rows: 1579 Columns: 4
Column specification ---------------------------------------
Delimiter: ","

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:                                                                                                          
* `` -> ...1

How can I rectify the problem? any helps?

Read https://stackoverflow.com/a/24376207/3358227. Load them in using `alldat — r2evans, Dec 28 '21 at 20:57
Welcome to Stack Overflow! Can you say a little bit more about what you've tried so far to find answers to your own questions? If you can figure out how to do the steps you want for a single file, then a `for` loop iterating over the names of the files should do the trick ... — Ben Bolker, Dec 28 '21 at 20:59
Thank you @BenBolker! Actually, I know the steps for a single file but don't know how to use them for more than 2000 .csv files. For example, I know that there are several ways for removing certain columns from a single data frame. I searched the Stack Overflow for discovering more on the topic but there were no similar issues. — Hooman Norouzi, Dec 28 '21 at 21:49
If you post example code for handling a single file, someone can help show you how to make it work for 2500 separate files. — Ben Bolker, Dec 28 '21 at 22:30
Does this answer your question? [How do I make a list of data frames?](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) — Avraham, Dec 28 '21 at 22:37
That's not an error, it's just a message to let you know how the files were read. It even says in the message how to turn it off. Is that the only issue happening here? Do the files write the way you expect them to? — camille, Dec 29 '21 at 23:19

score 0 · Accepted Answer · answered Dec 28 '21 at 22:36

There are many ways to do this in R. Under is an example using dplyr for data manipulation, readr for import/export of CSVs and purrr to handle all your files at the same time.

library(dplyr)

# Get all filenames
list.files("path/to/your/csv/files", full.names = TRUE) %>%
  # Import all files
  purrr::map(readr::read_csv) %>%
  purrr::map(
    ~ .x %>% 
      # Select columns and rename
      select(
        x = <your x column>,
        y = <your y column>
      ) %>% 
      # Remove duplicates
      distinct()
  ) %>% 
  # Save all files (same filename, but in a different folder)
  purrr::walk2(
    list.files("path/to/your/csv/files"),
    ~ readr::write_csv(x = .x, file = paste0("output/folder/", .y))
  )

Since you didn't provided us with any code you probably need to do some adjustments to this example, but I hope it's enough to get you started.

score -1 · Answer 2 · answered Dec 29 '21 at 01:19

If you don't need to use R, there's a great command-line utility for processing CSVs that will do everything you need, GoCSV.

I have two small sample CSVs, file1.csv and file2.csv:

c1,c2,c3,c4
1,2,a,b
1,2,a,b
2,3,b,c

c1,c2,c3,c4
5,6,e,f
5,6,e,f
6,7,f,g

This little script:

cuts out the first 2 columns
dedupes based on the remaining 2 columns
renames the those 2 columns

ls file1.csv file2.csv| while read CSV; do
    gocsv select -c c3,c4 $CSV |                     # cut out the first 2 columns
    gocsv uniq -c c3,c4 |                            # dedupe based on the remaining 2 columns combined
    gocsv rename -c c3,c4 -names c1,c2 > $CSV.fixed  # rename columns and save to tmp file
    mv $CSV.fixed $CSV                               # mv tmp back to old filename
done

The "fixed" CSVs now look like this:

c1,c2
a,b
b,c

and

c1,c2
e,f
f,g

How to handle 2500 .csv files in R

2 Answers2