Lesson 3. Summarize time series data by month or year using tidyverse pipes in R

In this lesson, we will summarize time series data by various time units including month, day and year.

Learning Objectives

After completing this tutorial, you will be able to:

  • Summarize time series data by a particular time unit (e.g. month to year, day to month, using pipes etc.)
  • Use dplyr pipes to manipulate data in R

What you need

You need R and RStudio to complete this tutorial. Also you should have an earth-analytics directory setup on your computer with a /data directory within it.

R Libraries to Install:

  • ggplot2: install.packages("ggplot2")
  • dplyr: install.packages("dplyr")
  • lubridate: install.packages("lubridate")

Download Week 2 Data

Get started with time series data

To begin, load the ggplot2 and dplyr libraries. Also, set your working directory. Finally, set stringsAsFactors to FALSE globally using options(stringsAsFactors = FALSE).

# set your working directory to the earth-analytics directory
# setwd("working-dir-path-here")

# load packages
library(ggplot2)
library(dplyr)
library(lubridate)

# set strings as factors to false
options(stringsAsFactors = FALSE)

Import precipitation time series data

We will use the same precipitation data that we used in the last lesson. The data cover the time span between 1 January 2003 through 31 December 2013. We have a single data point for each day in this dataset. However we are interested in summary values per MONTH instead of per day.

To begin, use read.csv() to import the .csv file as we did in the last lesson.

# download the data
# download.file(url = "https://ndownloader.figshare.com/files/7283285",
#              destfile = "data/week_02/805325-precip-dailysum_2003-2013.csv",
#              method = "libcurl")

# import data
boulder_daily_precip <- read.csv("data/week_02/precipitation/805325-precip-dailysum-2003-2013.csv",
         header = TRUE,
         na.strings = 999.99)

# view structure of data
str(boulder_daily_precip)
## 'data.frame':	792 obs. of  9 variables:
##  $ DATE        : chr  "1/1/03" "1/5/03" "2/1/03" "2/2/03" ...
##  $ DAILY_PRECIP: num  0 NA 0 NA 0.4 0.2 0.1 0.1 0 0 ...
##  $ STATION     : chr  "COOP:050843" "COOP:050843" "COOP:050843" "COOP:050843" ...
##  $ STATION_NAME: chr  "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" ...
##  $ ELEVATION   : num  1650 1650 1650 1650 1650 ...
##  $ LATITUDE    : num  40 40 40 40 40 ...
##  $ LONGITUDE   : num  -105 -105 -105 -105 -105 ...
##  $ YEAR        : int  2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
##  $ JULIAN      : int  1 5 32 33 34 36 37 38 41 49 ...

# are there any unusual / No data values?
summary(boulder_daily_precip$DAILY_PRECIP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1000  0.1000  0.2478  0.3000  9.8000       4

As we did in the previous lesson, let’s convert the DATE column to a date class. In the previous lesson, we performed this step by directly assigning the column

boulder_daily_precip$DATE

We can use pipes ( %>% ) to achieve the same thing. The mutate() function in dplyr is used to

  1. reassign a column value / format OR
  2. create a new column

The syntax for the mutate function is as follows:

mutate(column_name = what_you_want_to_store_in_this_column)

So if we want to create a new date column contain the information from the existing DATE column we’d write

mutate(new_date = DATE)

In this case we will reassign the date column to the values populated by the as.Date() function with converts the class of the column to a date class. Like this:

mutate(DATE = as.Date(DATE, format = "%m/%d/%y"))

Because we are using a pipe we need to reassign our data.frame output to the boulder_daily_precip object.

# Create a new data.frame with the newly formatted date field
boulder_daily_precip <- boulder_daily_precip %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%y"))

Finally, we plot the data using ggplot(). In the example below, you send the data.frame directly to ggplot using a pipe too.


# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
ggplot(aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "The data frame is sent to the plot using pipes",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

precip plot w fixed dates

The code below created the same plot as above.


# plot the data using ggplot2
ggplot(data=boulder_daily_precip, aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "Note using pipes",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

precip plot w fixed dates

Create facets

Notice in our data we have a “year” column. We can quickly plot data by year using facet_wrap(). When we use facet wrap, we select a column in our data that we wish to “group by”. In this case, we have a “YEAR” column that we can use to plot. To plot by year we add the following line to our ggplot code:

facet_wrap( ~ YEAR )


# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
  na.omit() %>%
ggplot(aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      facet_wrap( ~ YEAR ) +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "Use facets to plot by a variable - year in this case",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15) +
     # adjust the x axis breaks
     scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y")

create facets using facet_wrap

Our plot looks ok but there is a problem with the x axis. Each date is unique to that particular YEAR. We need to plot a variable on the x axis that is the same across all years for the plots to be comparable. One option is the day # of the year sometimes referred to as Julian Day or day of year. Lucky for us, we have a column called JULIAN in our data. This column contains the day of the year. Create the same plot however this time, use the JULIAN column for the x axis instead of DATE.

# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
ggplot(aes(x = JULIAN, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      facet_wrap( ~ YEAR, ncol = 3) +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily Precipitation (inches)",
           x = "Day of Year") + theme_bw(base_size = 15)

create facets using facet_wrap

Challenge

Create a subset plot that only shows data for Julian day range: 230-290. This date range is approximately the end of August - Oct (2013).

# subset 2 months around flood
boulder_daily_precip %>%
  filter(JULIAN > 230 & JULIAN < 290) %>%
  ggplot(aes(x = JULIAN, y = DAILY_PRECIP)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
      facet_wrap( ~ YEAR, ncol = 3) +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

Summarize data by month

In the example above, we plotted our data plot by day of the year. This column, however already existed in our data. However, what if we don’t have these columns in our data?

Next, you will create a month column in the data which will allow us to summarize the data by month. Note that you could do this for any particular time subset that you want. We are just using month as an example.

We use the lubridate package to quickly extract the month from an existing date formatted field.

IMPORTANT: this will only work on data where you’ve already converted the date into a date class that R can read as a date.

Below, you extract just the date from the date field using the month() function.

# if we send the month function a particular date and specify the format, it returns the month
month(as.POSIXlt("01-01-2003", format = "%m-%d-%y"))
## [1] 1
month(as.POSIXlt("06-01-2003", format = "%m-%d-%y"))
## [1] 6

# extract just the date from the date field in our data.frame
head(month(boulder_daily_precip$DATE))
## [1] 1 1 2 2 2 2

Add month column to the data

Now we have everything that we need to add a “month” column to our data. As we did earlier, we can use the mutate() function to add the date. However in this case, instead of modifying an existing column, we will create a new month column.

# add a month column to our boulder_daily_precip data.frame
boulder_daily_precip <- boulder_daily_precip %>%
  mutate(month = month(DATE))

#boulder_daily_precip$test <- as.Date(paste0(boulder_daily_precip$month_year,"-01"),"%Y-%m-%d")

Now that we have a new column we can create a summary precipitation value for each month. To do this, we need to do the following:

  1. group_by(): group the data by month AND year (so we have unique values for each month)
  2. summarise(): add up all precipitation values for each month to get our summary statistic
  3. ggplot(): plot the newly summarized data!
# calculate the sum precipitation for each month
boulder_daily_precip_month <- boulder_daily_precip %>%
  group_by(month) %>%
  summarise(sum_precip = sum(DAILY_PRECIP))

Next, plot the data.

# subset 2 months around flood
boulder_daily_precip_month %>%
  ggplot(aes(x = month, y = sum_precip)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

monthly summary of precipitation plot

The plot above is not quite what we want. We want to be able to group by both month and year. To do this we send the group_by() function both columns. Like this:

group_by(month, YEAR)

Then we summarize using the same syntax we used above.

# calculate the sum precipitation for each month
boulder_daily_precip_month <- boulder_daily_precip %>%
  group_by(month, YEAR) %>%
  summarise(max_precip = sum(DAILY_PRECIP))

Once again, plot. Now that we have the YEAR column in our data we can use facet_wrap() to create a unique plot in a grid for each year.

# subset 2 months around flood
boulder_daily_precip_month %>%
  ggplot(aes(x = month, y = max_precip)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
  facet_wrap(~ YEAR, ncol = 3) +
      labs(title = "Total Monthly Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Month") + theme_bw(base_size = 15)

monthly summary of precipitation plot grouped by month and year

Clean up x and y axes

You can clean up the x axes by formatting the month column as a date. Note that below we are being a bit tricky with our dates. We are assigning each month to the same year and day. Note that this is one of many different ways that we can go about cleaning up our axis.

# assign each month to the same year and day for plotting
as.Date(paste0("2015-", boulder_daily_precip_month$month,"-01"),"%Y-%m-%d")
##   [1] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
##   [6] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
##  [11] "2015-01-01" "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01"
##  [16] "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01"
##  [21] "2015-02-01" "2015-02-01" "2015-03-01" "2015-03-01" "2015-03-01"
##  [26] "2015-03-01" "2015-03-01" "2015-03-01" "2015-03-01" "2015-03-01"
##  [31] "2015-03-01" "2015-03-01" "2015-03-01" "2015-04-01" "2015-04-01"
##  [36] "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01"
##  [41] "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01" "2015-05-01"
##  [46] "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01"
##  [51] "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01"
##  [56] "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01"
##  [61] "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01"
##  [66] "2015-06-01" "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01"
##  [71] "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01"
##  [76] "2015-07-01" "2015-07-01" "2015-08-01" "2015-08-01" "2015-08-01"
##  [81] "2015-08-01" "2015-08-01" "2015-08-01" "2015-08-01" "2015-08-01"
##  [86] "2015-08-01" "2015-08-01" "2015-08-01" "2015-09-01" "2015-09-01"
##  [91] "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01"
##  [96] "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01" "2015-10-01"
## [101] "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01"
## [106] "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01"
## [111] "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01"
## [116] "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01"
## [121] "2015-11-01" "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01"
## [126] "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01"
## [131] "2015-12-01" "2015-12-01"

We can use the code above with the mutate() function to create a new month column that contains the month of the year as a class of type date. If R reads the column as a date, we can then use the:

scale_x_date(date_labels = "%b")

function with ggplot(), to format the x axis as a date.

Note that %b represents the abbreviated month which will be plotted as labels on the x-axis.


boulder_daily_precip_month %>%
  mutate(month2 = as.Date(paste0("2015-", month,"-01"),"%Y-%m-%d")) %>%
  ggplot(aes(x = month2, y = max_precip)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
  facet_wrap(~ YEAR, ncol = 3) +
      labs(title = "Montly Total Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Month") + theme_bw(base_size = 15) +
  scale_x_date(date_labels = "%b")

monthly summary of precipitation plot grouped by month and year

Leave a Comment