Lesson 4. Handle Missing Data in R Clean coding tidyverse intro

Max Joseph, Leah Wasser

Learning objectives

At the end of this activity, you will be able to:

Understand why it is important to make note of missing data values.
Define what an NA value is in R and how it is used in a vector.

What you need

Follow the setup instructions here:

Setup instructions

In the previous lesson you attempted to plot the first file’s worth of data by time. However, the plot did not turn out as planned. There were at least two values that likely represent missing data values:

missing and
999.99

In this lesson, you will learn how to handle missing data values in R using readr and some basic data exploration approaches.

Missing Data Values

Sometimes, your data are missing values. Imagine a spreadsheet in Microsoft Excel with cells that are blank. If the cells are blank, you don’t know for sure whether those data weren’t collected or whether someone forgot to fill them in. To indicate that data are missing on purpose (not by mistake), you can put a value in those cells that represents no data.

The R programming language uses NA to represent missing data values.
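As a minimal sketch of how NA behaves inside a vector, the example below uses a made-up precipitation vector (the name precip and its values are illustrative, not taken from the lesson data). It shows how is.na() flags missing elements and how na.rm = TRUE tells a summary function to skip them.

```r
# A numeric vector with one missing observation (made-up values)
precip <- c(0.2, NA, 0.5)

# is.na() returns TRUE for each element that is missing
missing_flags <- is.na(precip)

# Most summary functions return NA if any element is missing ...
mean_with_na <- mean(precip)

# ... unless you ask them to remove NA values first
mean_without_na <- mean(precip, na.rm = TRUE)
```

Because NA propagates through calculations by default, a single unflagged missing value can silently turn a whole summary into NA.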

Lucky for us, readr makes it easy to deal with missing data values too. To account for these when you import a file with read_csv(), use the argument:

na = "value_to_change_to_na_here"

You can also send na a vector of missing data values, like this: na = c("value1", "value2")
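To see the effect of the na argument in isolation, here is a small sketch that writes a tiny, made-up csv to a temporary file and reads it twice. The file contents and the object names (tmp_csv, raw, clean) are assumptions for illustration only.

```r
library(readr)

# Write a tiny, made-up csv to a temporary file (illustrative data only)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("DATE,HPCP",
             "2003-01-01,0.1",
             "2003-01-02,missing",
             "2003-01-03,999.99"),
           tmp_csv)

# Without na, the text "missing" causes HPCP to be read as character
raw <- read_csv(tmp_csv)
class(raw$HPCP)

# With na, both placeholder values become NA and HPCP is read as numeric
clean <- read_csv(tmp_csv, na = c("missing", "999.99"))
class(clean$HPCP)
```

Note that the default for na is c("", "NA"). If your file also relies on those, you may want to include them in the vector you pass.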

# load libraries
library(readr)
library(ggplot2)
library(dplyr)

Let’s go through our workflow again, but this time account for missing values. First, let’s have a look at the unique values contained in our HPCP column:

# import data using readr
all_paths <- read_csv("data/data_urls.csv")
# grab first url from the file
first_csv <- all_paths$url[1]

# open data
year_one <- read_csv(first_csv)
# view unique values in HPCP field
unique(year_one$HPCP)
## [1] "0"       "0.2"     "0.1"     "999.99"  "missing" "0.3"     "0.9"    
## [8] "0.5"

Next, we can create a vector of missing data values. We can see that we have 999.99 and missing as possible NA values.

# define all missing data values in a vector
na_values <- c("missing", "999.99")

# use the na argument to read in the csv
year_one <- read_csv(first_csv,
                     na = na_values)
unique(year_one$HPCP)
## [1] 0.0 0.2 0.1  NA 0.3 0.9 0.5
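After re-importing, it can be useful to check how many values were actually converted to NA. The sketch below uses a small made-up data.frame (precip_df is a hypothetical stand-in for year_one) and counts missing values per column with base R.

```r
# Toy stand-in for the imported data (made-up values)
precip_df <- data.frame(
  DATE = as.Date(c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04")),
  HPCP = c(0.0, NA, 0.2, NA)
)

# is.na() on a data.frame gives a TRUE/FALSE matrix;
# colSums() then counts the TRUE values in each column
na_per_column <- colSums(is.na(precip_df))
na_per_column
```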

Once you have specified possible missing data values, try to plot again.

year_one %>%
  ggplot(aes(x = DATE, y = HPCP)) +
  geom_point() +
  theme_bw() +
  labs(x = "Date",
       y = "Precipitation",
       title = "Precipitation Over Time")

[Figure: scatter plot of precipitation over time, with Date on the x axis and Precipitation on the y axis]

Note that when ggplot encounters missing data values, it tells you with a warning message:

Warning message:
Removed 3 rows containing missing values (geom_point).
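If you would rather drop the missing rows yourself than rely on the warning, one option is to filter them out before plotting. This is a sketch on made-up data (precip_df and precip_complete are hypothetical names), not necessarily the approach the lesson intends.

```r
library(dplyr)
library(ggplot2)

# Toy data with two missing precipitation values (made-up)
precip_df <- data.frame(
  DATE = as.Date("2003-01-01") + 0:4,
  HPCP = c(0.1, NA, 0.3, NA, 0.2)
)

# Keep only the rows where HPCP is not missing
precip_complete <- precip_df %>%
  filter(!is.na(HPCP))

# Plotting the filtered data no longer triggers the missing-value warning
p <- ggplot(precip_complete, aes(x = DATE, y = HPCP)) +
  geom_point() +
  theme_bw()
```

Alternatively, geom_point(na.rm = TRUE) removes the missing values silently while leaving the original data untouched.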

On Your Own (OYO)

The mutate() function allows you to add a new column to a data.frame. The month() function in the lubridate package converts a date or datetime object to a month value (1-12), as follows:

mutate(the_month = month(date_field_here))
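As a small illustration of the mutate() and month() pattern on its own, the sketch below adds a month column to a made-up data.frame. The data and the names precip_df and precip_with_month are assumptions for illustration; the summarizing and plotting steps of the exercise are left to you.

```r
library(dplyr)
library(lubridate)

# Toy data with a Date column (made-up values)
precip_df <- data.frame(
  DATE = as.Date(c("2003-01-15", "2003-02-10", "2003-02-20")),
  HPCP = c(0.1, 0.2, 0.3)
)

# Add a the_month column holding the month number of each date
precip_with_month <- precip_df %>%
  mutate(the_month = month(DATE))

precip_with_month$the_month
```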

Create a plot that summarizes total precipitation by month for the first csv file that we have worked with through this lesson. Use everything that you have learned so far to do this.

Your final plot should look like the one below:

[Figure: bar plot of total precipitation by month]

HINTS:

The bar plot was created using the following ggplot elements:

geom_bar(stat = "identity", fill = "darkorchid4") + theme_bw()

Additional resources

You may find the materials below useful as an overview of what we cover during this workshop:

Handling Missing data in R - Earth Analytics Course
