Lesson 7. Programmatically accessing geospatial data using API’s - Working with and mapping JSON data from the Colorado Information Warehouse in R


Learning Objectives

After completing this tutorial, you will be able to:

  • Extract geospatial (x,y) coordinate information embedded within a JSON hierarchical data structure.
  • Use the flatten() function to remove nested data.frames from data imported in JSON format.
  • Create a map of geospatial data using ggmap().

What you need

You will need a computer with internet access to complete this lesson.

In the previous lesson, we learned how to work with JSON data accessed via the Colorado information warehouse. In this lesson, we will explore another dataset however this time, the data will contain geospatial information nested within it that will allow us to create a map of the data.

Working with geospatial data

Check out the map Colorado DWR Current Surface Water Conditions map. If you remember from the previous lesson, APIs can be used for many different things. Web developers (people who program and create web sites and cool applications) can use API’s to create user friendly interfaces - like the map in this link that allow us to look at and interact with data. These API’s are similar to - if not the same as the ones that we often use to access data in R.

In this lesson, we will access the data used to create the map at the link above - in R.

# load packages
library(dplyr)
library(ggplot2)
library(rjson)
library(jsonlite)
# get URL
water_base_url <- "https://data.colorado.gov/resource/j5pc-4t32.json?"
water_full_url <- paste0(water_base_url, "station_status=Active",
            "&county=BOULDER")
water_data_url <- URLencode(water_full_url)

water_data_df <- fromJSON(water_data_url)

ATTENTION WINDOWS USERS: We have noticed a bug where on windows machines, sometimes the https URL doesn’t work. Instead try the same url as above but without the s - like this: water_base_url <- "http://data.colorado.gov/resource/j5pc-4t32.json?" This change has resolved many issues on windows machines so give it a try if you are having problems with the API.

Remember that the JSON structure supports hierarchical data and can be NESTED.

[ {
  "station_name" : "SOUTH PLATTE RIVER AT COOPER BRIDGE NEAR BALZAC",
  "amount" : "262.00",
  "flag" : "na",
  "station_status" : "Gage temporarily anavailable",
  "county" : "MORGAN",
  "wd" : "1",
  "dwr_abbrev" : "PLABALCO",
  "data_source" : "Co. Division of Water Resources",
  "http_linkage" : {
    "url" : "http://www.dwr.state.co.us/SurfaceWater/data/detail_graph.aspx?ID=PLABALCO&MTYPE=DISCHRG"
  },
  "div" : "1",
  "date_time" : "2017-02-15T09:00:00",
  "stage" : "1.86",
  "usgs_station_id" : "06759910",
  "variable" : "DISCHRG",
  "location" : {
    "latitude" : "40.357498",
    "needs_recoding" : false,
    "longitude" : "-103.528053"
  },
  "station_type" : "Stream"
}
...
]

If you look at the structure of the .json file below, you can see that the location object, is nested with 3 sub objects:

  • latitude
  • needs_recoding
  • longitude
[ {
"div" : "1",
"date_time" : "2017-02-15T09:00:00",
"stage" : "1.86",
"usgs_station_id" : "06759910",
"variable" : "DISCHRG",
"location" : {
  "latitude" : "40.357498",
  "needs_recoding" : false,
  "longitude" : "-103.528053"
},
"station_type" : "Stream"
}
...
]

Next, let’s look at the structure of the data.frame that R creates from the .json data.

# view data structure
typeof(water_data_df)
## [1] "list"

In this case, we have a data.frame nested within a data.frame.

# view first 6 lines of the location nested data.frame
head(water_data_df$location)
##    latitude needs_recoding   longitude
## 1 40.256031          FALSE -105.209549
## 2 40.255581          FALSE -105.209595
## 3  40.21139          FALSE  -105.25095
## 4 40.053036          FALSE -105.193048
## 5 40.849982          FALSE -105.218036
## 6 40.042028          FALSE -105.364917
# view for 6 lines of the location.latitude column
head(water_data_df$location$latitude)
## [1] "40.256031" "40.255581" "40.21139"  "40.053036" "40.849982" "40.042028"

We can remove the nesting using the flatten() function in R. When we flatten our json data, R creates new columns for each nested data.frame column. In this case it creates a unique column for latitude and longitude. Notice that the name of each new column contains the name of the previously nested data.frame followed by a period, and then the column name. For example

location.latitude

# remove the nested data frame
water_data_df <- flatten(water_data_df, recursive = TRUE)
water_data_df$location.latitude
##  [1] "40.256031" "40.255581" "40.21139"  "40.053036" "40.849982"
##  [6] "40.042028" "40.018667" "40.172925" "39.938598" "39.931099"
## [11] NA          "40.006374" "40.258038" "40.215772" NA         
## [16] "40.153341" NA          "39.931096" "40.214984" NA         
## [21] "40.733879" "40.15336"  "40.193757" "40.18188"  "40.187577"
## [26] "40.19932"  "40.174844" "40.18858"  "40.134278" "40.20419" 
## [31] "40.173949" "40.19642"  "40.2125"   "40.21266"  "40.187524"
## [36] NA          "40.21804"  "40.1946"   "40.170997" "40.160347"
## [41] "40.21905"  "40.21108"  "40.193018" "40.172677" "40.172677"
## [46] "40.19328"  "40.18503"  "40.051652" "39.98617"  "40.09404" 
## [51] "40.05366"  "39.961655" NA

Now we can clean up the data. Notice that our longitude and latitude values are in quotes. What does this mean about the structure of the data?

str(water_data_df$location.latitude)
##  chr [1:53] "40.256031" "40.255581" "40.21139" "40.053036" "40.849982" ...

In order to map or work with latitude and longitude data, we need numeric values. We can use the same dplyr function - mutate_at() - that we used in the previous lessons, to convert columns that are numbers to numeric rather than char data types:

Notice in the code below that there is an addition pipe that removes NA values (missing latitude values) from the dataset. We want to create a map of these data. We will not be able to map points with missing X,Y coordinate locations so it is best to remove them.

# where are the cells with NA values in our data?
is.na(water_data_df$location.latitude)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Note, in the code above, we can identify each location where there is a NA value in our data. If we add an ! to our code, R returns the INVERSE of the above.

# where are calls with values in our data?
!is.na(water_data_df$location.latitude)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Thus in our dplyr pipe, the code below removes all ROWS cells with a NA value in the latitude column.

Remove NA Values: filter(!is.na(location.latitude))

# turn columns to numeric and remove NA values
water_data_df <- water_data_df %>%
  mutate_at(c( "amount", "location.longitude", "location.latitude"),
            as.numeric) %>%
  filter(!is.na(location.latitude))

Now we can plot the data

ggplot(water_data_df, aes(location.longitude, location.latitude, size=amount,
  color=station_type)) +
  geom_point() + coord_equal() +
      labs(x = "Longitude",
           y = "Latitude",
          title = "Surface Water Site Locations by Type",
          subtitle = "Boulder, Colorado") +
  labs(size="Amount", colour="Station Type")

ggplot of water surface data.

Plotting the data using ggplot() creates a scatterplot of longitude and latitude, with some minor aesthetic tweaks. We really want to create a web map like the Colorado DWR Current Surface Water Conditions map.

We can create a non-interactive version of this map using the ggmap() package in R. We used ggmap() earlier in the semester. It provides an efficient way to quickly draw a basemap on the fly. We can overlay our data on top of that basemap to create a nice static map.

boulder <- get_map(location="Boulder, CO, USA",
                  source="google", crop=FALSE, zoom=10)
## Error in data.frame(ll.lat = ll[1], ll.lon = ll[2], ur.lat = ur[1], ur.lon = ur[2]): arguments imply differing number of rows: 0, 1
ggmap(boulder) +
  geom_point(data=water_data_df, aes(location.longitude, location.latitude, size=amount,
  color=factor(station_type)))
## Error in ggmap(boulder): object 'boulder' not found

In the next lesson, we will learn how to create interactive maps using the leaflet package for R.

Leave a Comment