Lesson 8. Use lapply in R Instead of For Loops to Process .csv files - Efficient Coding in R
Learning Objectives
After completing this tutorial, you will be able to:
- Use the lapply() function in R to automate your code.
What You Need
You will need a computer with internet access to complete this lesson.
In the previous lessons, you learned how to use for loops to perform tasks that you want to implement over and over, for example on a set of files. For loops are a good start to automating your code. However, if you want to scale this automation to process more and / or larger files, the R apply family of functions is useful to know about. The apply functions perform a task over and over on a list, vector, etc. So, for example, you can use the lapply function (list apply) on the list of file names that you generate when using list.files().
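As a minimal sketch of the idea, the snippet below applies a function to every element of a character vector. The file names are made up for illustration and stand in for whatever list.files() would return on your machine.

```r
# hypothetical file names, standing in for the output of list.files()
file_names <- c("data/precip_2003.csv", "data/precip_2013.csv")

# apply basename() to every element; lapply() always returns a list
short_names <- lapply(file_names, basename)
str(short_names)
```

The result has one list element per input element, in the same order as the input.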
Why Use Apply vs For Loops
There are several good reasons to use the apply family of functions.
1. They make your code more expressive and in turn easier to read:
Here’s what the master, Hadley Wickham, has to say about expressive code and the apply family:
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the book keeping code needed with loops.
Lately, answers on stackoverflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1. Source: Hadley Wickham, stackoverflow comment
And another quote:
A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. Source: Patrick Burns, The R Inferno.
2. They make it easier to parallelize your code: Most computers these days have more than one core that can be used to process your data. However, by default, most functions in R only take advantage of one core on your machine. This means your computer could process things faster than it does. Parallelized code refers to code that is written to use the cores available to it on a machine. While this topic is outside the scope of this class, it’s important to know about if you ever need to process large amounts of data, particularly in a cloud or high performance computing environment.
3. They make your code just a bit faster: At the bottom of this lesson you’ll see a quick benchmark test that checks whether the apply version of a for loop is faster or not. The apply functions do run a for loop in the background. However, they often do it in the C programming language (which is used to build R). This can make the apply functions slightly faster than regular for loops. However, this is not the main reason to use apply functions!
# view code for the lapply function
lapply
## function (X, FUN, ...)
## {
## FUN <- match.fun(FUN)
## if (!is.vector(X) || is.object(X))
## X <- as.list(X)
## .Internal(lapply(X, FUN))
## }
## <bytecode: 0x55612e339e20>
## <environment: namespace:base>
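The distinction between vectorization and loop-hiding drawn in the quotes above can be sketched with a small example. The input vector x is an assumption made up for this illustration; the vectorized expression follows the pattern quoted from the stackoverflow comment.

```r
# hypothetical input: a numeric vector with positive and negative values
x <- c(-3, 5, -1, 2)

# vectorized: one expression operates on the whole vector at once
signs_vectorized <- (x > 0) * 2 - 1

# "loop-hiding": sapply() calls the function once for each element
signs_sapply <- sapply(x, function(v) if (v > 0) 1 else -1)

# both approaches produce the same values
identical(signs_vectorized, signs_sapply)
```

When a vectorized expression like this exists, it is usually the simpler choice; apply functions are most useful when the task for each element is more involved, such as reading and writing a file.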
library(parallel)
# how many cores are on this machine
detectCores()
## [1] 36
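To give a sense of why apply-style code is easier to parallelize, here is a minimal sketch using parLapply() from the parallel package, which takes the same kind of input as lapply(). The slow_square function and the choice of 2 worker processes are assumptions for this example, not part of the lesson's workflow.

```r
library(parallel)

# hypothetical stand-in for a slow task that runs once per element
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# serial version: one element at a time on a single core
serial_result <- lapply(1:4, slow_square)

# parallel version: start 2 worker processes (an arbitrary choice here)
cl <- makeCluster(2)
parallel_result <- parLapply(cl, 1:4, slow_square)
stopCluster(cl)

# the results match; only the way the work was scheduled differs
identical(serial_result, parallel_result)
```

Because the work is already expressed as "apply this function to each element", switching to the parallel version only changes the call, not the function that does the work.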
Use lapply to Process Lists of Files
Next, let’s look at an example of using lapply to perform the same task that you performed in the previous lesson. To do this you will need to:
1. Write a function that performs all of the tasks that you executed in your for loop.
2. Call the lapply function and tell it to use the function that you created in step 1.
To get started, call the lubridate and dplyr libraries like you did in the previous lessons.
library(lubridate)
library(dplyr)
How lapply Works
lapply takes a vector (or list) as its first argument (in this case a vector of the continent names), then a function as its second argument. This function is then executed on every element in the first argument. This is very similar to a for loop: first, cc stores the first continent name, “Asia”, then runs the code in the function body, then cc stores the second continent name, and runs the function body, and so on. The code in the function body can be thought of in exactly the same way as the body of the for loop. The result of the last line is then returned to lapply, which combines the results into a list. –Software Carpentry
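The comparison with a for loop described above can be sketched as follows. The rain_inches values are made up for this example; the point is that the function passed to lapply() plays the role of the loop body.

```r
# hypothetical input values, assumed to be rainfall totals in inches
rain_inches <- c(0.5, 1.2, 2.0)

# for loop: you create the output container and track the index yourself
loop_result <- vector("list", length(rain_inches))
for (i in seq_along(rain_inches)) {
  loop_result[[i]] <- rain_inches[i] * 25.4
}

# lapply: the function body does the same work as the loop body,
# and lapply() collects the results into a list for you
lapply_result <- lapply(rain_inches, function(inches) inches * 25.4)

identical(loop_result, lapply_result)
```

The lapply() version has no index variable and no pre-allocated output, which is the "book keeping code" mentioned in the quote earlier in this lesson.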
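# create the directory (and any parent directories) if it does not exist yet
check_create_dir <- function(the_dir) {
  if (!dir.exists(the_dir)) {
    dir.create(the_dir, recursive = TRUE)
  }
}
# convert a value in inches to millimeters
in_to_mm <- function(data_in_inches) {
  value_mm <- data_in_inches * 25.4
  return(value_mm)
}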
In the previous lessons, you created a list of files in a directory.
all_precip_files <- list.files("data/week_06", pattern = "\\.csv$",
                               full.names = TRUE)
# create an object with the directory name
the_dir <- "data/week-06/outputs/precip_mm"
# check to see if the directory exists - make it if it doesn't
check_create_dir(the_dir)
# process each file: read it, add precip in mm, and write a new csv
for (file in all_precip_files) {
  # read in the csv
  the_data <- read.csv(file, header = TRUE, na.strings = 999.99) %>%
    mutate(DATE = as.POSIXct(DATE, tz = "America/Denver", format = "%Y-%m-%d %H:%M:%S"),
           # add a column with precip in mm
           precip_mm = in_to_mm(HPCP))
  # write the csv to a new file
  write.csv(the_data, file = paste0(the_dir, "/", basename(file)),
            na = "999.99")
}
Create a function that performs all of the tasks performed in the for loop above.
# create a function that performs all of the tasks performed in the for loop above
summarize_data <- function(a_csv, the_dir) {
  # open the data, fix the date and add a new column
  the_data <- read.csv(a_csv, header = TRUE, na.strings = 999.99) %>%
    mutate(DATE = as.POSIXct(DATE, tz = "America/Denver", format = "%Y-%m-%d %H:%M:%S"),
           # add a column with precip in mm - you did this using a function previously
           precip_mm = (HPCP * 25.4))
  # write the csv to a new file
  write.csv(the_data, file = paste0(the_dir, "/", basename(a_csv)),
            na = "999.99")
}
As you did above, make sure your output directory is created. Then use list.files() to get a list of all of the files that you’d like to process.
the_dir_ex <- "data/week-06/outputs/example"
check_create_dir(the_dir_ex)
# get a list of all files that you want to process
# list.files() returns a character vector, which lapply() accepts just like a list
all_precip_files <- list.files("data/week_06", pattern = "\\.csv$",
                               full.names = TRUE)
Now you can perform the same task that you performed above in a loop with one line of code (ok two if you break them up for readability).
lapply(all_precip_files,
FUN = summarize_data,
the_dir = the_dir_ex)
## list()
# wrap the call in invisible() to hide the returned list
invisible(lapply(all_precip_files, FUN = summarize_data,
                 the_dir = the_dir_ex))
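If you want to try this pattern without the course data, here is a self-contained sketch that creates two small files in a temporary directory and processes them. The file names, the values, and the convert_file helper are all made up for this example, and it uses base R only rather than dplyr, so it is a simplified stand-in for the function above rather than a copy of it.

```r
# build two small example files in a temporary directory
# (file names and values are made up for this sketch)
in_dir <- file.path(tempdir(), "precip_in")
out_dir <- file.path(tempdir(), "precip_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(HPCP = c(0.1, 0.2)), file.path(in_dir, "a.csv"),
          row.names = FALSE)
write.csv(data.frame(HPCP = c(1.0, 0.0)), file.path(in_dir, "b.csv"),
          row.names = FALSE)

# hypothetical helper: convert one file and return the path it wrote
convert_file <- function(a_csv, the_dir) {
  the_data <- read.csv(a_csv, header = TRUE)
  the_data$precip_mm <- the_data$HPCP * 25.4
  out_path <- file.path(the_dir, basename(a_csv))
  write.csv(the_data, file = out_path, row.names = FALSE)
  out_path
}

example_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# the_dir is not an argument of lapply() itself;
# lapply() passes it along to convert_file() for every file
written <- lapply(example_files, convert_file, the_dir = out_dir)

# check one of the converted files
read.csv(written[[1]])
```

Returning the output path from the function is a design choice for this sketch: it gives lapply() something useful to collect, so the returned list doubles as a record of the files that were written.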
Are Apply Functions Faster Than For Loops?
As promised, let’s test your code to see whether the lapply() function is in fact faster than the for loop.
# let's see what approach is faster
library(microbenchmark)
microbenchmark(invisible(lapply(all_precip_files, (FUN = summarize_data),
the_dir = the_dir_ex)))
## Unit: microseconds
## expr
## invisible(lapply(all_precip_files, (FUN = summarize_data), the_dir = the_dir_ex))
## min lq mean median uq max neval
## 1.17 1.207 1.70154 1.3335 1.4865 29.29 100
# benchmark the equivalent for loop
microbenchmark(for (file in all_precip_files) {
  # read in the csv
  the_data <- read.csv(file, header = TRUE, na.strings = 999.99) %>%
    mutate(DATE = as.POSIXct(DATE, tz = "America/Denver", format = "%Y-%m-%d %H:%M:%S"),
           # add a column with precip in mm
           precip_mm = in_to_mm(HPCP))
  # write the csv to a new file
  write.csv(the_data, file = paste0(the_dir, "/", basename(file)),
            na = "999.99")
})
## Unit: nanoseconds
## expr
## for (file in all_precip_files) { the_data <- read.csv(file, header = TRUE, na.strings = 999.99) %>% mutate(DATE = as.POSIXct(DATE, tz = "America/Denver", format = "%Y-%m-%d %H:%M:%S"), precip_mm = in_to_mm(HPCP)) write.csv(the_data, file = paste0(the_dir, "/", basename(file)), na = "999.99") }
## min lq mean median uq max neval
## 248 290 497.76 313 453.5 10812 100
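Is it faster on average? Before you compare the two results, check the units: in the output above, the lapply() timings are reported in microseconds and the for loop timings in nanoseconds. Both sets of timings are also far shorter than reading and writing .csv files normally takes, and the list() returned by lapply() earlier suggests that all_precip_files may have been empty when this code ran. Rerun the benchmark with your own files in place before drawing conclusions; any difference between the two approaches is likely to be small.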
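If you want to compare the two approaches without depending on files on disk, one option is a rough timing with base R's system.time(). This is only a sketch: the values vector and the multiplication by 25.4 are assumptions chosen for illustration, and a single system.time() run is much cruder than microbenchmark, so treat the numbers as a rough guide.

```r
# hypothetical in-memory task, so the comparison does not depend on files
values <- runif(1e5)

# time the for loop version
time_loop <- system.time({
  loop_out <- vector("list", length(values))
  for (i in seq_along(values)) {
    loop_out[[i]] <- values[i] * 25.4
  }
})

# time the lapply version
time_lapply <- system.time({
  lapply_out <- lapply(values, function(v) v * 25.4)
})

# both approaches give the same result; only the elapsed time may differ
identical(loop_out, lapply_out)
time_loop["elapsed"]
time_lapply["elapsed"]
```

The timings will vary from run to run and from machine to machine, which is why packages like microbenchmark repeat the expression many times and report a distribution.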