Lesson 3. Summarize Time Series Data by Month or Year Using Tidyverse Pipes in R


In this lesson, you will learn about time series data by various time units including month, day and year.

Learning Objectives

After completing this tutorial, you will be able to:

  • Summarize time series data by a particular time unit (e.g. month to year, day to month, using pipes etc.).
  • Use dplyr pipes to manipulate data in R.

What You Need

You need R and RStudio to complete this tutorial. Also you should have an earth-analytics directory set up on your computer with a /data directory within it.

R Libraries to Install:

  • ggplot2: install.packages("ggplot2")
  • dplyr: install.packages("dplyr")
  • lubridate: install.packages("lubridate")

Download Week 2 Data

Get Started with Time Series Data

To begin, load the ggplot2 and dplyr libraries. Also, set your working directory. Finally, set stringsAsFactors to FALSE globally using options(stringsAsFactors = FALSE).

# set your working directory to the earth-analytics directory
# setwd("working-dir-path-here")

# load packages
library(ggplot2)
library(dplyr)
library(lubridate)

# set strings as factors to false
options(stringsAsFactors = FALSE)

Import Precipitation Time Series Data

You will use the same precipitation data that you used in the last lesson. The data cover the time span between 1 January 2003 through 31 December 2013. You have a single data point for each day in this dataset. However you are interested in summary values per MONTH instead of per day.

To begin, use read.csv() to import the .csv file as you did in the last lesson.

# download the data
# download.file(url = "https://ndownloader.figshare.com/files/7283285",
#              destfile = "data/week-02/805325-precip-dailysum_2003-2013.csv",
#              method = "libcurl")

# import data
boulder_daily_precip <- read.csv("data/week-02/precipitation/805325-precip-dailysum-2003-2013.csv",
         header = TRUE,
         na.strings = 999.99)

# view structure of data
str(boulder_daily_precip)
## 'data.frame':	792 obs. of  9 variables:
##  $ DATE        : chr  "1/1/03" "1/5/03" "2/1/03" "2/2/03" ...
##  $ DAILY_PRECIP: num  0 NA 0 NA 0.4 0.2 0.1 0.1 0 0 ...
##  $ STATION     : chr  "COOP:050843" "COOP:050843" "COOP:050843" "COOP:050843" ...
##  $ STATION_NAME: chr  "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" ...
##  $ ELEVATION   : num  1650 1650 1650 1650 1650 ...
##  $ LATITUDE    : num  40 40 40 40 40 ...
##  $ LONGITUDE   : num  -105 -105 -105 -105 -105 ...
##  $ YEAR        : int  2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
##  $ JULIAN      : int  1 5 32 33 34 36 37 38 41 49 ...

# are there any unusual / No data values?
summary(boulder_daily_precip$DAILY_PRECIP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1000  0.1000  0.2478  0.3000  9.8000       4

As you did in the previous lesson, let’s convert the DATE column to a date class. In the previous lesson, you performed this step by directly assigning the column

boulder_daily_precip$DATE

You can use pipes ( %>% ) to achieve the same thing. The mutate() function in dplyr is used to

  1. reassign a column value / format OR
  2. create a new column

The syntax for the mutate function is as follows:

mutate(column_name = what_you_want_to_store_in_this_column)

So if you want to create a new date column contain the information from the existing DATE column you’d write

mutate(new_date = DATE)

In this case you will reassign the date column to the values populated by the as.Date() function with converts the class of the column to a date class. Like this:

mutate(DATE = as.Date(DATE, format = "%m/%d/%y"))

Because you are using a pipe you need to reassign your data.frame output to the boulder_daily_precip object.

# Create a new data.frame with the newly formatted date field
boulder_daily_precip <- boulder_daily_precip %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%y"))

Finally, you plot the data using ggplot(). In the example below, you send the data.frame directly to ggplot using a pipe too.


# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
ggplot(aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "The data frame is sent to the plot using pipes",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

precip plot w fixed dates

The code below created the same plot as above.


# plot the data using ggplot2
ggplot(data=boulder_daily_precip, aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "Note using pipes",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

precip plot w fixed dates

Create Facets

Notice in your data you have a “year” column. You can quickly plot data by year using facet_wrap(). When you use facet wrap, you select a column in your data that you wish to “group by”. In this case, you have a “YEAR” column that you can use to plot. To plot by year you add the following line to your ggplot code:

facet_wrap( ~ YEAR )


# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
  na.omit() %>%
ggplot(aes(x = DATE, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      facet_wrap( ~ YEAR ) +
      labs(title = "Precipitation - Boulder, Colorado",
           subtitle = "Use facets to plot by a variable - year in this case",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15) +
     # adjust the x axis breaks
     scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y")

create facets using facet_wrap

Your plot looks ok but there is a problem with the x axis. Each date is unique to that particular YEAR. You need to plot a variable on the x axis that is the same across all years for the plots to be comparable. One option is the day # of the year sometimes referred to as Julian Day or day of year. Lucky for us, you have a column called JULIAN in your data. This column contains the day of the year. Create the same plot however this time, use the JULIAN column for the x axis instead of DATE.

# plot the data using ggplot2 and pipes
boulder_daily_precip %>%
ggplot(aes(x = JULIAN, y = DAILY_PRECIP)) +
      geom_point(color = "darkorchid4") +
      facet_wrap( ~ YEAR, ncol = 3) +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily Precipitation (inches)",
           x = "Day of Year") + theme_bw(base_size = 15)

create facets using facet_wrap

Challenge

Create a subset plot that only shows data for Julian day range: 230-290. This date range is approximately the end of August - Oct (2013).

# subset 2 months around flood
boulder_daily_precip %>%
  filter(JULIAN > 230 & JULIAN < 290) %>%
  ggplot(aes(x = JULIAN, y = DAILY_PRECIP)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
      facet_wrap( ~ YEAR, ncol = 3) +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

plot of chunk subset-data

Summarize Data by Month

In the example above, you plotted your data plot by day of the year. This column, however already existed in your data. However, what if you don’t have these columns in your data?

Next, you will create a month column in the data which will allow us to summarize the data by month. Note that you could do this for any particular time subset that you want. You are just using month as an example.

You use the lubridate package to quickly extract the month from an existing date formatted field.

IMPORTANT: this will only work on data where you’ve already converted the date into a date class that R can read as a date.

Below, you extract just the date from the date field using the month() function.

# if you send the month function a particular date and specify the format, it returns the month
month(as.POSIXlt("01-01-2003", format = "%m-%d-%y"))
## [1] 1
month(as.POSIXlt("06-01-2003", format = "%m-%d-%y"))
## [1] 6

# extract just the date from the date field in your data.frame
head(month(boulder_daily_precip$DATE))
## [1] 1 1 2 2 2 2

Add Month Column to the Data

Now you have everything that you need to add a “month” column to your data. As you did earlier, you can use the mutate() function to add the date. However in this case, instead of modifying an existing column, you will create a new month column.

# add a month column to your boulder_daily_precip data.frame
boulder_daily_precip <- boulder_daily_precip %>%
  mutate(month = month(DATE))

#boulder_daily_precip$test <- as.Date(paste0(boulder_daily_precip$month_year,"-01"),"%Y-%m-%d")

Now that you have a new column you can create a summary precipitation value for each month. To do this, you need to do the following:

  1. group_by(): group the data by month AND year (so you have unique values for each month)
  2. summarise(): add up all precipitation values for each month to get your summary statistic
  3. ggplot(): plot the newly summarized data!
# calculate the sum precipitation for each month
boulder_daily_precip_month <- boulder_daily_precip %>%
  group_by(month) %>%
  summarise(sum_precip = sum(DAILY_PRECIP))

Next, plot the data.

# subset 2 months around flood
boulder_daily_precip_month %>%
  ggplot(aes(x = month, y = sum_precip)) +
      geom_point(color = "darkorchid4") +
      labs(title = "Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Date") + theme_bw(base_size = 15)

monthly summary of precipitation plot

The plot above is not quite what you want. You want to be able to group by both month and year. To do this you send the group_by() function both columns. Like this:

group_by(month, YEAR)

Then you summarize using the same syntax you used above.

# calculate the sum precipitation for each month
boulder_daily_precip_month <- boulder_daily_precip %>%
  group_by(month, YEAR) %>%
  summarise(max_precip = sum(DAILY_PRECIP))

Once again, plot. Now that you have the YEAR column in your data you can use facet_wrap() to create a unique plot in a grid for each year.

# subset 2 months around flood
boulder_daily_precip_month %>%
  ggplot(aes(x = month, y = max_precip)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
  facet_wrap(~ YEAR, ncol = 3) +
      labs(title = "Total Monthly Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Month") + theme_bw(base_size = 15)

monthly summary of precipitation plot grouped by month and year

Clean Up x and y Axes

You can clean up the x axes by formatting the month column as a date. Note that below you are being a bit tricky with your dates. You are assigning each month to the same year and day. Note that this is one of many different ways that you can go about cleaning up your axis.

# assign each month to the same year and day for plotting
as.Date(paste0("2015-", boulder_daily_precip_month$month,"-01"),"%Y-%m-%d")
##   [1] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
##   [6] "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01" "2015-01-01"
##  [11] "2015-01-01" "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01"
##  [16] "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01" "2015-02-01"
##  [21] "2015-02-01" "2015-02-01" "2015-03-01" "2015-03-01" "2015-03-01"
##  [26] "2015-03-01" "2015-03-01" "2015-03-01" "2015-03-01" "2015-03-01"
##  [31] "2015-03-01" "2015-03-01" "2015-03-01" "2015-04-01" "2015-04-01"
##  [36] "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01"
##  [41] "2015-04-01" "2015-04-01" "2015-04-01" "2015-04-01" "2015-05-01"
##  [46] "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01"
##  [51] "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01" "2015-05-01"
##  [56] "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01"
##  [61] "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01"
##  [66] "2015-06-01" "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01"
##  [71] "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01" "2015-07-01"
##  [76] "2015-07-01" "2015-07-01" "2015-08-01" "2015-08-01" "2015-08-01"
##  [81] "2015-08-01" "2015-08-01" "2015-08-01" "2015-08-01" "2015-08-01"
##  [86] "2015-08-01" "2015-08-01" "2015-08-01" "2015-09-01" "2015-09-01"
##  [91] "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01"
##  [96] "2015-09-01" "2015-09-01" "2015-09-01" "2015-09-01" "2015-10-01"
## [101] "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01"
## [106] "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01" "2015-10-01"
## [111] "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01"
## [116] "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-01"
## [121] "2015-11-01" "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01"
## [126] "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01" "2015-12-01"
## [131] "2015-12-01" "2015-12-01"

You can use the code above with the mutate() function to create a new month column that contains the month of the year as a class of type date. If R reads the column as a date, you can then use the:

scale_x_date(date_labels = "%b")

function with ggplot(), to format the x axis as a date.

Note that %b represents the abbreviated month which will be plotted as labels on the x-axis.


boulder_daily_precip_month %>%
  mutate(month2 = as.Date(paste0("2015-", month,"-01"),"%Y-%m-%d")) %>%
  ggplot(aes(x = month2, y = max_precip)) +
      geom_bar(stat = "identity", fill = "darkorchid4") +
  facet_wrap(~ YEAR, ncol = 3) +
      labs(title = "Montly Total Daily Precipitation - Boulder, Colorado",
           subtitle = "Data plotted by year",
           y = "Daily precipitation (inches)",
           x = "Month") + theme_bw(base_size = 15) +
  scale_x_date(date_labels = "%b")

monthly summary of precipitation plot grouped by month and year

Leave a Comment