After completing this tutorial, you will be able to:
- Access data from the Colorado information warehouse RESTful API.
- Describe and recognize query parameters in a RESTful call.
- Define: response and request relative to data API data access.
- Define API endpoint in the context of the SODA API.
- Be able to list the 2 potential responses that you may get when querying a RESTful API.
- Use the
mutate_at()function with dplyr pipes to adjust the format / data type of multiple columns.
What you need
You will need a computer with internet access to complete this lesson.
In the previous lessons, we learned how to access human readable text files data programmatically using:
download.file()to download a file to your computer and work with it (ideal if you want to save a copy of the data to your computer)
read.csv()ideal for reading in a tabular file stored on the web but may sometimes fail when there are secure connections involved (e.g. https).
fromJSON()ideal for data accessed in JSON format.
In this lesson, we will learn about API interfaces. An API allows us to access data stored on a computer or server using a specific query. API’s are powerful ways to access data and more specifically the specific type and subset of data that we need for our analysis, programmatically.
We will also explore the machine readable JSON data structure. Machine readable data structures are more efficient - particularly for larger data that contain hierarchical structures. In this lesson, we will use the
getJSON() function from the
rjson package to import data from an API, provided in
.json format into a data.frame.
#NOTE: if you have problems with ggmap, try to install both ggplot and ggmap from github #devtools::install_github("dkahle/ggmap") #devtools::install_github("hadley/ggplot2") library(ggmap) library(ggplot2) library(dplyr) library(rjson) library(jsonlite) library(RCurl)
REST API review
Remember that in the first lesson in this module, we discussed RESTful APIs. We explored the concept of a request and then a subsequent response. The request to an RESTful API is composed of a URL and the associated parameters required to access a particular subset of the data that we wish to access.
When you send the request, the web API returns one of the following:
- The data that we requested or
- A failed to return message which tells us that something was wrong with our request.
In this lesson we will access data stored in JSON format from a RESTful API.
Colorado Population Projection data
The Colorado Information Marketplace is a comprehensive data warehouse that contains a wide range of Colorado-specific open datasets available via a RESTful API called the Socrata Open Data API (SODA).
There are lots of API endpoints or data sets available via this API. An endpoint refers to a dataset that you can access and query against.
The “endpoint” of a (SODA) API is simply a unique URL that represents an object or collection of objects. Every Socrata dataset, and even every individual data record, has its own endpoint. The endpoint is what you’ll point your HTTP client at to interact with data resources. Read more about endpoints
One endpoint on the CO information warehouse website contains Colorado Population Projections. If you click on the Colorado Population Projections data link (JSON format) you will see data returned in a
JSON format. These data include population estimates for males and females for every county in Colorado for every year from 1990 to 2040 for multiple age groups.
URL parameters, we can define a more specific request to limit what data we get back in response to our API request. For example, if we only want data for Boulder, Colorado, we can query just that subset of the data using the RESTful call. In the link below, note that the ?&county=Boulder part of the url makes the request to the API to only return data that are for Boulder County, Colorado.
Parameters associated with accessing data using this API are documented here.
Using the Colorado SODA API
SODA API allows us to write ‘queries’ that filter out the exact subset of the data that we want. Here’s the API
URL for population projections for females who live in Boulder that are age 20–40 for the years 2016–2025:
https://data.colorado.gov/resource/tv8u-hswn.json?$where=age between 20 and 40 and year between 2016 and 2025&county=Boulder&$select=year,age,femalepopulation
The data that are returned from an API request are called the response. The format of the returned data or the response is most often in the form of plain text ‘file’ such as
Data Tip: Many API’s allow us to specify the format of the data that we want returned in the response. The Colorado SODA API is no exception - check out the documentation.
Accessing API Data
The first thing that we need to do is create our API request string. Remember that this is a
URL with parameters parameters that specify which subset of the data that we want to access.
Note that we are using a new function -
paste0() - to paste together a complex URL string. This is useful because we may want to iterate over different subsets of the same data (ie reuse the base url or the endpoint but request different subsets using different URL parameters).
# Base URL path base_url = "https://data.colorado.gov/resource/tv8u-hswn.json?" full_url = paste0(base_url, "county=Boulder", "&$where=age between 20 and 40", "&$select=year,age,femalepopulation") # view full url full_url ##  "https://data.colorado.gov/resource/tv8u-hswn.json?county=Boulder&$where=age between 20 and 40&$select=year,age,femalepopulation"
After we’ve created the URL, we can get the data. There are a few ways to access the data however the most direct way is to
encodeURL()to replace spaces in our url with the asii value for space
- Use the
fromJSON()function in the rjson package to import that data into a data.frame object.
Let’s give it a try. First, we encode the URL to replace all spaces with the ascii value for a space which is
# encode the URL with characters for each space. full_url <- URLencode(full_url) full_url ##  "https://data.colorado.gov/resource/tv8u-hswn.json?county=Boulder&$where=age%20between%2020%20and%2040&$select=year,age,femalepopulation"
Then, we import the data directly into a data.frame using the
fromJSON() function that is in the rjson package.
library(rjson) # Convert JSON to data frame pop_proj_data_df <- fromJSON(getURL(full_url)) head(pop_proj_data_df, n = 2) ## age femalepopulation year ## 1 20 2751 1990 ## 2 21 2615 1990 typeof(pop_proj_data_df) ##  "list"
Data Tip: The
getForm() is another way to access API driven data. We are not going to cover this in this class however it is a good option that results in code that is a bit cleaner given the various parameters are passed to the function via argument like syntax.
base_url_example <- "https://data.colorado.gov/resource/tv8u-hswn.json?" getForm(base_url, county = "Boulder", age="BOULDER")
Also note that if we wanted to use
getURL(), we could do so as follows:
# get the data from the specified url using RCurl pop_proj_data_example <- getURL(URLencode(full_url))
Now that our data are in a data.frame format, we can clean them up. Let’s have a close look at the data structure. Are the values in the correct format to work with them quantitatively?
# view data structure str(pop_proj_data_df) ## 'data.frame': 1000 obs. of 3 variables: ## $ age : chr "20" "21" "22" "23" ... ## $ femalepopulation: chr "2751" "2615" "2167" "1798" ... ## $ year : chr "1990" "1990" "1990" "1990" ...
When we import the data from JSON, by default they import in string format. However, if we want to plot the data and manipulate the data quantitatively, we need our data to be in a numeric format. Let’s fix that next.
mutate_at from dplyr
We can uset the
mutate_at() function in a dplyr pipe to change the format of (or apply any function on) any columns within our data.frame. In this case we want to convert all of the columns to a numeric format.
mutate_at() we specify the column names that we want to convert in a vector followed by the function that we wish to apply to each column. THe function in this case is
Because we are using this function in a pipe, our code looks like this:
# turn columns to numeric and remove NA values pop_proj_data_df <- pop_proj_data_df %>% mutate_at(c( "age", "year", "femalepopulation"), as.numeric) str(pop_proj_data_df) ## 'data.frame': 1000 obs. of 3 variables: ## $ age : num 20 21 22 23 24 25 26 27 28 29 ... ## $ femalepopulation: num 2751 2615 2167 1798 1692 ... ## $ year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
Data Tip: Note that the code below, is much more VERBOSE version of what we did above, in a clean way using
mutate_at(). dplyr is a much more efficient way to convert the format of several columns of information!
# convert EACH row to a numeric format # note this is the clunky way to do what we did above with dplyr! pop_proj_data_df$age <- as.numeric(pop_proj_data_df$age) pop_proj_data_df$year <- as.numeric(pop_proj_data_df$year) pop_proj_data_df$femalepopulation <- as.numeric(pop_proj_data_df$femalepopulation) # OR use the apply function to convert all rows in the data.frame to numbers #pops <- as.data.frame(lapply(pop_proj_data_df, as.numeric))
Once we have converted our data to a numeric format, we can plot it using
# plot the data ggplot(pop_proj_data_df, aes(x = year, y = femalepopulation, group = factor(age), color = age)) + geom_line() + labs(x = "Year", y = "Female Population - Age 20-40", title = "Projected Female Population", subtitle = "Boulder, CO: 1990 - 2040")
Using the population projection data that we just used, create a plot of projected MALE population numbers as follows:
- Time span: 1990-2040
- Column category: malepopulation
- Age range: 60-80 years old
ggplot() to create your plot and be sure to label x and y axes and give the plot a descriptive title.