Lesson 5. Automate Workflows Using Loops in R
Clean coding tidyverse intro
Learning objectives
At the end of this activity, you will be able to:
- Use for-loops to handle repetitive tasks
- Bind multiple data frames together by row
What you need
Follow the setup instructions here:
Don’t Repeat Yourself (DRY)
The DRY (Don’t Repeat Yourself) principle says that you should avoid repeating the same code over and over in a script. When you notice yourself doing this, it’s a good time to consider whether there is another approach that may be more efficient.
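A snippet of the code that we examined at the beginning of this workshop is below. Notice that our colleague is building a data.frame of elements manually, line by line. Copy-and-paste slips have crept in along the way: the june entry filters on month "05", a second may entry filters on "06", and the oct entry filters on "09".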
finalSUMMARYmean <- data.frame(jan_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "01"], na.rm = TRUE),
                               feb_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "02"], na.rm = TRUE),
                               march_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "03"], na.rm = TRUE),
                               apr_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "04"], na.rm = TRUE),
                               may_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "05"], na.rm = TRUE),
                               june_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "05"], na.rm = TRUE),
                               may_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "06"], na.rm = TRUE),
                               july_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "07"], na.rm = TRUE),
                               aug_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "08"], na.rm = TRUE),
                               sept_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "09"], na.rm = TRUE),
                               oct_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "09"], na.rm = TRUE),
                               nov_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "11"], na.rm = TRUE),
                               dec_mean_2003 = mean(myFinalData$HPCP[myFinalData$month == "12"], na.rm = TRUE))
finalSUMMARYmean
Similarly, our colleague may opt to open a set of csv files line by line.
myDATA1 <- read.csv("https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2004.csv",
                    na.strings = c("999.99"))
myDATA2 <- read.csv("https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2005.csv",
                    na.strings = c("999.99"))
myDATA3 <- read.csv("https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2006.csv",
                    na.strings = c("999.99"))
We refer to this as copy pasta: repeating the same code over and over. The DRY principle supports automating these types of tasks using for-loops, functions and other approaches. In this lesson we will review using for-loops to automate opening and aggregating a set of .csv files.
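As a small illustration of the DRY idea, the three file addresses above differ only in the year, so they can be generated rather than typed out. The sketch below only builds the strings and does not download anything. The names base_url, years and file_urls are made up for this example.

```r
# The part of the address shared by all three files in the snippet above.
base_url <- "https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/"

# The only piece that changes from file to file.
years <- 2004:2006

# paste0() is vectorized, so one call builds all three addresses.
file_urls <- paste0(base_url, "My_Data", years, ".csv")
file_urls
```

If a fourth year were added, only the years vector would need to change.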
For-loops in R
For-loops provide a way to iterate over objects in R. For example, pretend you want to print each number in a sequence of numbers: 1:10. You can do that with a for-loop as follows:
numbers <- 1:10
for (i in numbers) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
In the above for-loop, the object i will take on the values in numbers sequentially: first i will be set to the first element in numbers (1), then it will be set to the second element (2), and so on, until finally i = 10. Everything contained within the curly braces {...} is considered the body of the for-loop, and this will be executed for every iteration of the loop.
IMPORTANT: The variable i, and any other variable created in the for-loop, will persist as an object in your R environment after the for-loop is done executing. This is the opposite of what you may have learned working with functions, where variables created inside the function body are local and are gone once the function returns.
i
## [1] 10
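To make the contrast with functions concrete, here is a minimal sketch. The names k, loop_total, double_it and fn_total are invented for this illustration.

```r
# Variables created in a for-loop live on in the calling environment.
for (k in 1:3) {
  loop_total <- k * 2
}
k           # 3: the last value the loop variable took
loop_total  # 6: assigned during the final iteration

# Variables created inside a function are local to that function call.
double_it <- function(x) {
  fn_total <- x * 2
  fn_total
}
double_it(3)        # 6
exists("fn_total")  # FALSE in a fresh session: fn_total never left the function
```

One practical consequence is that a loop can silently overwrite an existing object that shares a name with its loop variable, so it is worth choosing loop variable names with care.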
You can use a for-loop in the same way with a character vector:
charvec <- c('first element', 'second element', 'third element')
for (i in charvec) {
  print(i)
}
## [1] "first element"
## [1] "second element"
## [1] "third element"
Notice here that i takes on the values of the elements of charvec.
When iterating over objects with loops, it is often useful to use the seq_along function to create a numeric sequence of element indices. For example, here’s what seq_along returns when given our charvec as an input:
seq_along(charvec)
## [1] 1 2 3
Protip: Using seq_along in a for-loop allows you to get numeric indices for the object that you want to iterate over:
for (i in seq_along(charvec)) {
  print(paste('i =', i))
}
## [1] "i = 1"
## [1] "i = 2"
## [1] "i = 3"
Of course, if you wanted to iterate over charvec and still get the character elements, you can use these i values as indices:
for (i in seq_along(charvec)) {
  print(paste('charvec[i] =', charvec[i]))
}
## [1] "charvec[i] = first element"
## [1] "charvec[i] = second element"
## [1] "charvec[i] = third element"
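One reason seq_along is often preferred over writing 1:length(x) by hand is how the two behave when the object is empty. The sketch below uses made-up names (empty_vec, n_safe, n_unsafe) to count how many times each loop body runs.

```r
empty_vec <- character(0)   # a vector with no elements

seq_along(empty_vec)        # integer(0): nothing to iterate over
1:length(empty_vec)         # 1 0: the sequence counts down from 1 to 0

# Count the iterations of each style of loop.
n_safe <- 0
for (i in seq_along(empty_vec)) {
  n_safe <- n_safe + 1
}

n_unsafe <- 0
for (i in 1:length(empty_vec)) {
  n_unsafe <- n_unsafe + 1
}

n_safe    # 0: the loop body never runs
n_unsafe  # 2: the loop body runs for i = 1 and i = 0
```

With seq_along, a loop over an empty object simply does nothing, which is usually the behavior you want.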
Populate Objects with For-loops
Suppose you wanted to create a list, and have each element in that list be some value, such as one of the strings in charvec. You could create an empty list, then populate each element in that list using a for-loop.
my_list <- list()
for (i in seq_along(charvec)) {
  my_list[i] <- charvec[i]
}
Let’s dissect this a bit.
- First, you create an empty list using my_list <- list(). This is the object that you will populate in the loop.
- Then, you use seq_along(charvec) as the thing to iterate over with the for-loop, so that first i = 1, then i = 2, then i = 3, because there are 3 elements in charvec.
- Finally, within the body of the for-loop (between the curly braces), you assign charvec[i] to be the ith element in my_list.
Answer the following:
- What will my_list be after this for-loop?
  - What is its class?
  - What is its length?
  - What is the first element?
Let’s have a look:
class(my_list)
## [1] "list"
length(my_list)
## [1] 3
my_list
## [[1]]
## [1] "first element"
##
## [[2]]
## [1] "second element"
##
## [[3]]
## [1] "third element"
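The same populate-a-list pattern can replace the hand-written monthly means from the start of this lesson. What follows is a sketch under stated assumptions, not the workshop's own solution: toy_data is an invented stand-in for myFinalData, with a two-digit character month column and an HPCP column, and the list is stacked with base R's rbind to avoid any package dependency.

```r
# Invented stand-in for myFinalData: two HPCP readings per month, one missing.
toy_data <- data.frame(
  month = rep(sprintf("%02d", 1:12), each = 2),
  HPCP  = c(1:23, NA)
)

months <- sprintf("%02d", 1:12)                  # "01", "02", ..., "12"
monthly_means <- vector("list", length(months))  # list with 12 empty slots

for (i in seq_along(months)) {
  in_month <- toy_data$month == months[i]        # rows belonging to this month
  monthly_means[[i]] <- data.frame(
    month     = months[i],
    mean_hpcp = mean(toy_data$HPCP[in_month], na.rm = TRUE)
  )
}

# Stack the twelve one-row data.frames into a single summary table.
summary_means <- do.call(rbind, monthly_means)
summary_means
```

Because the month code is written in one place only, a single loop body handles every month and there is no opportunity to mistype "05" for "06".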
Challenge
There are multiple URLs in the data/data_urls.csv file that you were provided in this workshop. Your challenge is to combine all of the .csv files into one data.frame in R.
Your list of URLs looks something like the code below:
urls <- c(
  'https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2003-boulder.csv',
  'https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2003-denver.csv',
  'https://s3-us-west-2.amazonaws.com/earthlab-teaching/vchm/My_Data2003-lyons.csv'
)
Lost? Here are a few functions that may help you out.
- You might find it useful to populate a list, so that each element in your list is a data.frame.
- You can create one data.frame from a list of data.frames using the function bind_rows(list_object_here). bind_rows is a dplyr function that combines data.frames contained in a list row-wise (it stacks them on top of each other).
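To see the two hints working together without downloading anything, here is a minimal sketch on invented data. It assumes the dplyr package is installed. The csv_text strings, the DATE column and the station names are hypothetical stand-ins for the real files, so treat this as a pattern to adapt rather than the challenge solution.

```r
library(dplyr)  # assumption: dplyr is installed

# Hypothetical in-memory CSV text standing in for the contents of two files.
csv_text <- c(
  boulder = "DATE,HPCP\n2003-01-01,0.1\n2003-01-02,0.0",
  denver  = "DATE,HPCP\n2003-01-01,0.3\n2003-01-02,0.2"
)

station_list <- vector("list", length(csv_text))  # one slot per "file"

for (i in seq_along(csv_text)) {
  # read.csv() can parse a string directly through its text argument.
  station_df <- read.csv(text = csv_text[i], stringsAsFactors = FALSE)
  station_df$station <- names(csv_text)[i]        # remember where rows came from
  station_list[[i]] <- station_df
}

# Stack the data.frames on top of each other.
combined <- bind_rows(station_list)
combined
```

Adding a column that records the source before binding keeps the rows distinguishable once they are all in one data.frame.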
Additional resources
You may find the materials below useful as an overview of what we cover during this workshop: