Batch Importing Data Files into R

[Image: a database with two arrows going to and from the R logo]

As an avid programmer and data enthusiast, I am often approached with questions about streamlining data preparation processes. Anyone who knows me knows that I am an advocate for getting things done with the least amount of effort. (In some corners of the internet, folks refer to this as being “effectively lazy,” or, as I refer to it, an ethic of working smarter, not harder.) It should not come as a surprise, then, that I apply this ethic to my data work.

Not too long ago, I was working on a data analytics project where the data were stored in multiple files (comma-separated values (CSV) files to be exact) in a simple tabular format:

[Image: a spreadsheet showing a table with 5 columns and 4 rows]

I recently talked to someone about the project, and they wanted to learn more about my process for importing (or reading) data files into R (my programming environment of choice). When I asked why, they replied, “Because I import data files one at a time.”

Don’t get me wrong: there is ABSOLUTELY nothing wrong with reading files this way. But one of the major benefits of programming is that you can automate repetitive and tedious tasks. Naturally, I put on my educator hat and dedicated an entire blog post to it.

Disclaimer: This post is intended for intermediate to advanced R users already familiar with Base R, the Tidyverse, loops, and functions. Some of the concepts may be unfamiliar if you are new to R. (I may follow up with additional posts introducing R and its functionalities…Or I will keep posting these one-offs. Who knows?)

READING (IMPORTING) DATA INTO R

Ok, so why do we care about reading data into R? Well, most people store their data on their computer (or a web address or server) in a CSV, XLSX, or text file (e.g., .TXT, .DAT). Therefore, there is a need to import that data into a program before any analyses can be performed. Super fancy folks with a lot of data tend to store their information in databases. Situations where you have to query a database to extract data are beyond this article’s scope. However, I encourage all of you to learn how to pull data from databases.
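For reference, importing from a couple of those other formats looks much the same as importing a CSV; the file names below are placeholders, and readxl is a separate package you would need to install before using it:

library(readxl)                                                     # not part of Base R
my_xlsx <- read_excel("file.location.here.xlsx", sheet = 1)         # first sheet of an Excel workbook
my_txt <- read.table("file.location.here.txt", header = TRUE, sep = "\t")  # tab-delimited text, Base R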

READING A SINGLE DATA SET INTO R

Using R, you can quickly import (i.e., read) data from a wide variety of file formats. In Base R (that is, R’s built-in functionality), you can import the contents of a single CSV file (with variable names in the first row) using the following function (be sure to replace file.location.here.csv with the file path to a CSV data file):

Mydata <- read.csv("file.location.here.csv", header = TRUE, sep = ",")

I, myself, am partial to the fread function from the data.table package when dealing with large (or, honestly, small) data sets:

library(data.table)  # provides fread
Mydata <- fread("file.location.here.csv")
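One small caveat: fread returns a data.table (which also behaves like a data frame). If you would rather work with a plain data frame, you can ask fread for one directly:

Mydata <- fread("file.location.here.csv", data.table = FALSE)  # return a data.frame instead of a data.table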

READING MULTIPLE DATA SETS INTO R – THE LONG WAY

Let’s say you had four CSV files you wanted to import into your R environment. You could import these files one-by-one using a function like fread:

dat_1 <- fread(file = "sample_dat_1.csv")
dat_2 <- fread(file = "sample_dat_2.csv")
dat_3 <- fread(file = "sample_dat_3.csv")
dat_4 <- fread(file = "sample_dat_4.csv")

OR you could import the data sets all at the same time. Below I outline my three preferred methods.

READING MULTIPLE DATA SETS INTO R – THE FUN WAY!

Setting Up Your Environment
Before importing a batch of files from a directory, several pieces of information are helpful to know:

  1. Where the data are located;

  2. The extension of the data files; and

  3. What variable(s) the data frame(s) will be assigned to.

Let’s say you have a folder called importing files on your computer that contains four CSV files titled sample_dat_1.csv, sample_dat_2.csv, sample_dat_3.csv, and sample_dat_4.csv. (Note: These are generated (fake) data files.) Create a new R script file in a text editor (I called my file importing files.R), save it to the importing files folder (with the .R extension), and open the (script) file in R.
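Before running any of the code below, it helps to load the packages used throughout this post and to confirm that R’s working directory is the importing files folder. A minimal setup, assuming the packages are already installed (the setwd() path is a placeholder you would replace with your own), might look like this:

library(data.table)   # fread
library(tidyverse)    # loads purrr (map, set_names), stringr, and the %>% pipe
library(mgsub)        # mgsub

getwd()                              # confirm the current working directory
# setwd("path/to/importing files")   # uncomment and edit if needed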

First, check to make sure you know where the CSV files are. Since all of the files have been saved to the same folder, the working directory’s file listing will include both the R script file and the CSV files. We only care about the (CSV) data files. To quickly produce a list of the CSV files, use the list.files() function, setting the pattern argument to \\.csv$ (a regular expression that matches only file names ending in the .csv extension) and the full.names argument to TRUE. (I know, I know, you technically do not need the last argument since everything is in the same directory. But in my experience, it is always a good habit to work with relative file paths.)

CSVfiles <- list.files(pattern="\\.csv$", full.names=TRUE)
print(CSVfiles)

[1] "./sample_dat_1.csv" "./sample_dat_2.csv" "./sample_dat_3.csv"
[4] "./sample_dat_4.csv"
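As an aside, if the data files lived in a different folder than the script, you could point list.files at that folder instead (the path below is a placeholder); this is where full.names = TRUE really earns its keep, since the returned paths remain usable from your working directory:

CSVfiles <- list.files(path = "path/to/data/folder", pattern = "\\.csv$", full.names = TRUE)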

Okay, so we know:

  • Where the data are located AND

  • The file extension of the data files.

But we still need to define the variable(s) the data frame(s) will be assigned to. Create these names by removing the leading ./ and the .csv extension from the file paths we extracted in the previous step. This can easily be accomplished using the mgsub function from the mgsub (i.e., Multiple, Simultaneous String Substitution) package (you could also use the str_replace_all function from the stringr package here):

dataframe_names <- mgsub(CSVfiles, c("./",".csv"), c("",""))
# or using str_replace_all
# CSVfiles %>% str_replace_all(c(".csv" = "", "./" = ""))
print(dataframe_names)

[1] "sample_dat_1" "sample_dat_2" "sample_dat_3" "sample_dat_4"

(Note, I also saved these names to a character vector (with multiple elements) called dataframe_names.)

Okay, now we are ready to rock n roll!

Option #1: List + For Loop (Packages used: purrr (can be loaded via the tidyverse), data.table)

First, create an empty list:

dat <- list()

Next, write a for loop that iterates over the unique data files in the CSVfiles variable we created and imports each file using the fread function. Note how each data file is assigned to a list element named after its file path (the loop index x).

for (x in unique(CSVfiles)) {
  dat[[x]] <- fread(x)
}

Then assign the data frame names we created to the elements of the list using the set_names function from the purrr package.

dat <- dat %>% set_names(dataframe_names)
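At this point, the list should contain one data frame per file, keyed by the clean names we created. A quick sanity check (the exact output will depend on your files) might look like this:

names(dat)    # should match dataframe_names
length(dat)   # should be 4 for this example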

Option #2: lapply (Packages used: purrr (can be loaded via the tidyverse), data.table)

The apply family of functions is a backbone of Base R, and lapply is probably my favorite of the bunch. It can easily condense a for loop into a single line of code. Using lapply, the four data files can be imported like so:

dat <- lapply(CSVfiles, FUN = function(x) fread(x)) %>%
  set_names(dataframe_names)

Note how the results of the lapply function are fed to the set_names function using the pipe (%>%) operator.

Option #3: map (Packages used: purrr (can be loaded via the tidyverse), data.table)

The map functions (from the purrr package) are very similar to the family of apply functions. (If you understand how to use the apply functions, the map functions are their less transparent cousins.)

To import the data files using this method, you need just three lines of code separated by the pipe (%>%) operator. The first line is the variable CSVfiles (which, if you remember, is where the file paths to the data files are stored). These file paths are then passed to the map function in the second line. Finally, in line three, we assign each element in the list a name using the set_names function.

dat <- CSVfiles %>%
  map(fread) %>%
  set_names(dataframe_names)

Putting it All Together

Once you have imported the data files, the last step is to unload the elements (in this case, data frames) from the list into the environment. (You could also work with the data frames nested in the list; this is a particularly appealing option for data practitioners who intend to perform the same operation(s) on all data frames. See the short sketch after the output below.)

invisible(list2env(dat, .GlobalEnv))
ls()

 [1] "%notin%"         "CSVfiles"        "dat"
 [4] "dataframe_names" "pckgs"           "sample_dat_1"
 [7] "sample_dat_2"    "sample_dat_3"    "sample_dat_4"
[10] "x"
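If you would rather skip list2env and keep the data frames nested in the list (as mentioned above), you can apply the same operation to every element at once. A minimal sketch, where the operations shown are only examples:

map(dat, dim)                        # peek at the dimensions of every imported data set
dat_clean <- map(dat, ~ na.omit(.x)) # apply the same cleaning step to all of them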

And VOILA! Now you can go forth and process your data!

So, the next time you work on a project with data stored in multiple files, use one of my three favorite methods to simplify the tedious task of data import.

Do you have a favorite function or method for simplifying or automating your data import process in R? Let me know in the comments below.

Need support streamlining your data preparation process? Get in Touch!
