Lesson 5. Subset & aggregate time series precipitation data in R using mutate(), group_by() and summarise()

Bonus / Graduate activity. In this lesson, you will plot precipitation data in R. However, these data were collected over several decades and sometimes there are multiple data points per day. The data are also not cleaned. You will find heading names that may not be meaningful, and other issues with the data.

This lesson shows you what the plots should look like but does not provide each and every step that you need to process the data. You have the skills that you need from the other lessons covered this week!

Learning objectives

After completing this tutorial, you will be able to:

  • Aggregate data by a day in R
  • View names and rename columns in a data.frame

Things you’ll need To complete this lesson

Please be sure you have the most current version of R and, preferably, RStudio to write your code.

R skill level: Intermediate - To succeed in this tutorial, you will need to have basic knowledge for use of the R software program.

R libraries to install:

  • ggplot2: install.packages("ggplot2")
  • plotly: install.packages("dplyr")

Data download

If you haven’t already downloaded this data (from the previous lesson), do so now.

Download week 02 data

Work with precipitation data

R libraries

To get started, load the ggplot2 and dplyr libraries, set up your working directory and set stringsAsFactors to FALSE using options().

Import precipitation data

We will use the 805333-precip-daily-1948-2013.csv dataset for this assignment. in this analysis. This dataset contains the precipitation values collected daily from the COOP station 050843 in Boulder, CO for 1 January 2003 through 31 December 2013.

Import the data into R and then view the data structure.


# import precip data into R data.frame
# be sure to handle NA values!
precip_boulder <- read.csv("data/week_02/precipitation/805333-precip-daily-1948-2013.csv",
                           header = TRUE)

# view first 6 lines of the data
head(precip_boulder)
##       STATION    STATION_NAME ELEVATION LATITUDE LONGITUDE           DATE
## 1 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480801 01:00
## 2 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480802 15:00
## 3 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480803 09:00
## 4 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480803 14:00
## 5 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480803 15:00
## 6 COOP:050843 BOULDER 2 CO US   unknown  unknown   unknown 19480804 01:00
##   HPCP Measurement.Flag Quality.Flag
## 1 0.00                g             
## 2 0.05                              
## 3 0.01                              
## 4 0.03                              
## 5 0.03                              
## 6 0.05

# view structure of data
str(precip_boulder)
## 'data.frame':	14476 obs. of  9 variables:
##  $ STATION         : chr  "COOP:050843" "COOP:050843" "COOP:050843" "COOP:050843" ...
##  $ STATION_NAME    : chr  "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" ...
##  $ ELEVATION       : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ LATITUDE        : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ LONGITUDE       : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ DATE            : chr  "19480801 01:00" "19480802 15:00" "19480803 09:00" "19480803 14:00" ...
##  $ HPCP            : num  0 0.05 0.01 0.03 0.03 0.05 0.02 0.01 0.01 0.01 ...
##  $ Measurement.Flag: chr  "g" " " " " " " ...
##  $ Quality.Flag    : chr  " " " " " " " " ...

About the data

The structure of the data are similar to what you saw in previous lessons. HPCP is the total precipitation given in inches, recorded for the hour ending at the time specified by DATE. There is a designated missing data value of 999.99. Note that hours with no precipitation are not recorded.

The metadata for this file is located in your week_02 directory: PRECIP_HLY_documentation.pdf file that can be downloaded along with the data. (Note: as of Sept. 2016, there is a mismatch in the data downloaded and the documentation. The differences are in the units and missing data value: inches/999.99 (standard) or millimeters/25399.75 (metric)).

NoData values

Next, check out the data. Are there no data values? If so, make sure to adjust your data import code above to account for no data values. Then determine how many no data values you have in your dataset.

# plot histogram
# you wanted to explore the data
hist(precip_boulder$HPCP, main ="Are there NA values?")

histogram of data


precip_boulder <- read.csv("data/week_02/precipitation/805333-precip-daily-1948-2013.csv",
                           header = TRUE, na.strings = 999.99)

View histogram without the 999 NA values!

hist(precip_boulder$HPCP,
     main = "This looks better after the reimporting with\n no data values specified",
     xlab = "Precip (inches)", ylab = "Frequency",
     col = "darkorchid4")

histogram without NA values

Next, let’s see how many NA values are in our data.

print("how many NA values are there?")
## [1] "how many NA values are there?"
sum(is.na(precip_boulder))
## [1] 401

Convert date and time

Compared to the previous lessons, notice that we now have date & time in our date field. To deal with both date and time, we use the as.POSIXct() method rather than as.date which we used previously. The syntax to convert to POSIXct is similar to what we used previously, but now, we will add the hour (H) and minute (M) to the format argument as follows:

as.POSIXct(column-you-want-to-convert-here, format="%Y%m%d %H:%M")


# convert to date/time and retain as a new field
precip_boulder$DATE <- as.POSIXct(precip_boulder$DATE,
                                  format = "%Y%m%d %H:%M")
                                  # date in the format: YearMonthDay Hour:Minute

# double check structure
str(precip_boulder$DATE)
##  POSIXct[1:14476], format: "1948-08-01 01:00:00" "1948-08-02 15:00:00" ...

Plot precipitation data

Next, let’s have a look at the data. Plot it using ggplot(). Format the plot using the colors, labels, etc. that are most clear and look the best. Your plot colors and labels doesn’t need to be exactly like the one below!

Note the code is hidden as you should know how to create a ggplot plot now!

hourly precipitation

Differences in the data

Any ideas what might be causing the notable difference in the plotted data through time? If you can answer this you can get a bonus point for the week!

hourly precipitation

It is difficult to interpret this plot which spans so many years at such a fine temporal scale. For our research project, we only need to explore 30 years of data. Let’s do the following:

  1. Aggregate the precipitation totals (sum) by day.
  2. Subset the data for 30 years (we learned how to do this in a previous lesson).

Calculate daily total precipitation with summarize()

To summarize data by a particular variable or time period, you first create a new column in your dataset called day. Next, take all of the values (in this case precipitation measured each hour) for each day and add them using the sum() function. We can do all of this efficiently using dplyr mutate() function.

We use the mutate() function to add a new column called day to a new data.frame c alled daily_sum_precip. Note that we used as.Date() to just grab the dates rather than dates and times which are stored in the POSIX format.

# use dplyr and mutate to add a day column to your data
daily_sum_precip <- precip_boulder %>%
  mutate(day = as.Date(DATE, format="%Y-%m-%d"))

# let's look at the new column
head(daily_sum_precip$day)
## [1] "1948-08-01" "1948-08-02" "1948-08-03" "1948-08-03" "1948-08-03"
## [6] "1948-08-04"

Daily precip plot

Next we summarize() the precipitation column (total_precip) - grouped by day. What this means is that we ADD UP all of the values for each day to get a grand total amount of precipitation each day.

# use dplyr
daily_sum_precip <- precip_boulder %>%
  mutate(day = as.Date(DATE, format="%Y-%m-%d")) %>%
  group_by(day) %>% # group by the day column
  summarise(total_precip=sum(HPCP)) %>%  # calculate the SUM of all precipitation that occurred on each day
  na.omit()

# how large is the resulting data frame?
nrow(daily_sum_precip)
## [1] 4576

# view the results
head(daily_sum_precip)
## # A tibble: 6 x 2
##          day total_precip
##       <date>        <dbl>
## 1 1948-08-01         0.00
## 2 1948-08-02         0.05
## 3 1948-08-03         0.07
## 4 1948-08-04         0.14
## 5 1948-08-06         0.02
## 6 1948-08-14         0.03

# view column names
names(daily_sum_precip)
## [1] "day"          "total_precip"

Now plot the daily data.

Daily precipitation for boulder

Finally, plot a temporal subset of the data from Jan-October 2013. We learned how to do this in the previous lessons.

Now we can easily see the dramatic rainfall event in mid-September!

R tip: If you are using a date-time class, instead of just a date class, you need to use scale_x_datetime().

Subset the data

If we wanted to, we could subset this data set using the same code that we used previously to subset! An example of the subsetted plot is below.

final precip plot daily sum

Additional resources

Leave a Comment