Introduction to programmatic data access in R - Earth analytics course module

Welcome to the first lesson in the Introduction to programmatic data access in R module. In this module, we introduce various ways to access, download and work with data programmatically. These methods include downloading text files directly from a website onto your computer and into R, reading data stored in text format on a website into a data.frame in R and, finally, accessing subsets of particular data using REST API calls in R.

Lesson 1. Introduction to APIs

Learning Objectives

After completing this tutorial, you will be able to:

  • Describe the difference between human-readable and machine-readable data structures.
  • Describe the difference between data returned using an API compared to downloading a text file directly.
  • Describe 2-3 components of a RESTful API call.
  • List 3 different ways to access data programmatically from R: using download.file(), read.csv() and an API call.

What you need

You will need a computer with internet access to complete this lesson.

Access data programmatically

This week we will discuss programmatic access of data using:

  1. Direct downloads / import of data and
  2. Application Programming Interfaces (APIs).

Up until this point, we have been downloading data from a website (in the case of this course, Figshare) independently and then working with the data in R. The data that we have downloaded are prepared specifically for this course. However, independently downloading and unzipping data each week is not efficient and does not explicitly tie our data to our analysis.

We can automate the data download process using R. Automation is particularly useful when:

  • We want to download lots of data or particular subsets of data to support an analysis and
  • There are programmatic ways to access and query the data online.

When we automate data access, download or retrieval and embed it in our code, we directly link our analysis to our data. Combined with R Markdown reports, code comments and expressive coding techniques, this better documents our workflow. In short, by linking data access and download to our analysis, we remind our future selves not only of our process but also of where (and how) we got the data in the first place!

3 Ways to access data

We can break up programmatic data access into 3 general categories:

  1. Data that we download directly by calling a specific URL and/or by using the download.file() function.
  2. Data that we directly import into R using a call to read.csv() or read.table().
  3. Data that we download using an API, which makes a request to a data repository and returns the requested data.

2 key formats

The data that we access programmatically may be returned in one of two main formats:

  1. Tabular human-readable files: Tabular files include CSVs (comma-separated values) and even spreadsheets (Microsoft Excel, etc.). These files are organized into columns and rows and are “flat” in structure rather than hierarchical.
  2. Structured machine-readable files: Files that are sometimes stored in a text format but are hierarchical and structured in a way that optimizes machine readability.

Data Tip: There are hierarchical data structures that are not stored in a text format, which we will not cover in this module. One example is the HDF5 data model (structure).
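To make the difference between the two formats concrete, here is a minimal sketch in R. The flat version uses two precipitation values from this lesson; the nested version adds station and quality information that is purely hypothetical and only there to show what a hierarchy looks like.

```r
# Flat, tabular structure: one row per observation, one column per variable
flat <- data.frame(
  DATE   = c("2013-09-11", "2013-09-12"),
  PRECIP = c(2.3, 9.8)
)
print(flat)

# Hierarchical structure: a nested list in which each observation can carry
# its own extra information. The station name and quality flags below are
# hypothetical values used for illustration only.
nested <- list(
  station = "hypothetical-station",
  observations = list(
    list(date = "2013-09-11", precip = 2.3, quality = list(flag = "ok")),
    list(date = "2013-09-12", precip = 9.8, quality = list(flag = "ok"))
  )
)

# Reaching a value in the hierarchy means walking down the levels
nested$observations[[2]]$precip
str(nested, max.level = 2)
```

The table is easy to scan by eye, while the nested list can hold additional detail per observation without adding more columns.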

Download files programmatically

In week one of this course, we downloaded some data using the download.file() function. In this case, we accessed data programmatically, but we first downloaded it as a .csv to our computer, and then proceeded to open it and work with it in R.

# download text file to a specified location on our computer
download.file(url = "https://ndownloader.figshare.com/files/7010681",
             destfile = "data/week10/boulder-precip-aug-oct-2013.csv")
# read data into R
boulder_precip <- read.csv("data/week10/boulder-precip-aug-oct-2013.csv")

X    DATE        PRECIP
756  2013-08-21  0.1
757  2013-08-26  0.1
758  2013-08-27  0.1
759  2013-09-01  0.0
760  2013-09-09  0.1
761  2013-09-10  1.0
762  2013-09-11  2.3
763  2013-09-12  9.8
764  2013-09-13  1.9
765  2013-09-15  1.4
766  2013-09-16  0.4
767  2013-09-22  0.1
768  2013-09-23  0.3
769  2013-09-27  0.3
770  2013-09-28  0.1
771  2013-10-01  0.0
772  2013-10-04  0.9
773  2013-10-11  0.1

download.file()

When we use the download.file() function, we are telling R to save a local copy of the file that we access online. Because the data are then stored locally in our working directory, we only need to run this function once, or again whenever the online data change.
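One way to avoid downloading the same file over and over is to check for a local copy first. The sketch below assumes a helper called download_if_missing(), which is a hypothetical name and not part of base R. It only calls download.file() when the destination file does not exist yet.

```r
# Hypothetical helper: download a file only when no local copy exists yet.
# Returns TRUE (invisibly) if a download happened, FALSE otherwise.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Using existing local copy: ", destfile)
    return(invisible(FALSE))
  }
  # create the destination folder if it does not exist yet
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url = url, destfile = destfile)
  invisible(TRUE)
}
```

With the file used in this lesson, the call would be download_if_missing("https://ndownloader.figshare.com/files/7010681", "data/week10/boulder-precip-aug-oct-2013.csv"). If the online data change, delete the local file and run the call again.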

read.csv()

We can also use the read.csv() function to import data directly into R. We will learn more about this in the next lesson. Note that when we read data into R using read.csv(), we are not saving a copy of our data locally on our computer; we are importing the data directly into R. If you want a local copy of the data to use for future analysis without importing it from the web again, you will need to export the data to your working directory using write.csv().
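The import-then-export pattern described above could look roughly like the sketch below. The function name import_and_save() is hypothetical. The source argument can be a local path or, assuming internet access, a URL that points at a CSV file.

```r
# Hypothetical helper: import a CSV straight into R, then keep a local copy.
import_and_save <- function(source, out_path) {
  # read.csv() accepts either a local file path or a URL
  dat <- read.csv(source)
  # nothing is stored on disk until we explicitly export the data
  write.csv(dat, out_path, row.names = FALSE)
  dat
}
```

Setting row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the exported file.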

Human readable data

Notice that the data that we downloaded above using download.file() are tabular and thus human-readable. The data are in a tabular structure, with rows and columns that we can quickly understand. R can import these data into a data.frame and we can work with it programmatically in R.

However, what happens if our data structure is more complex? For example, what if we wanted to store more information about each measured precipitation data point? Our table could get very wide very quickly, making it less readable but also more computationally intensive to process.

We will talk about structured, machine-readable data structures later in this module. These may be hard for humans to quickly digest when we look at them, but they are much more efficient to process, particularly as our data get large.

What is an API?

An API (Application Programming Interface) is an interface that sits on top of a computer-based system and simplifies certain tasks, such as extracting subsets of data from a large repository or database.

Using Web-APIs

Web APIs allow us to access data available via an internet web interface.

Often we can access data from web APIs using a URL that contains sets of parameters that specify the type and particular subset of data that we are interested in.

If you have worked with a SQL database such as PostgreSQL, or if you’ve ever queried data from a GIS system like ArcGIS, then you can compare the set of parameters associated with a URL string to a SQL query.

Web APIs are a way to strip away all the extraneous visual interface that you don’t care about and get the data that you want.

Why we use web APIs

Among other things, APIs allow us to:

  • Get information that would be time-consuming to get otherwise
  • Get information that you can’t get otherwise
  • Automate analytical workflows that require continuously updated data
  • Access data using a more direct interface

3 parts of an API request

When we talk about APIs, it’s important to understand two key components: the request and the response. A third part, listed as step 2 below, is the intermediate step where the request is processed by the remote server.

  1. Data REQUEST: You try to access a URL in your browser that specifies a particular subset of data.
  2. Data processing: A web server somewhere uses that URL to query a specified dataset.
  3. Data RESPONSE: That web server then sends you back some content.

The response may give you one of two things:

  1. Some data or
  2. An explanation of why your request failed

Figure: RESTful request vs. response. First we create a request that queries the data repository for some data using the API. The repository gets the request and in turn processes it. Finally, it responds. If our request was a good one, the response will include the data that we are interested in.
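The request and response cycle, including the two possible outcomes, can be sketched in base R as follows. The function name request_data() is hypothetical, and this simplified version treats any warning or error raised while reading as a failed request. Dedicated HTTP packages expose more detail, such as status codes.

```r
# Hypothetical helper: send a request and return either the data or an
# explanation of why the request failed.
request_data <- function(url) {
  tryCatch(
    # REQUEST + RESPONSE: readLines() opens the URL (or file) and reads
    # whatever text comes back
    list(ok = TRUE, content = readLines(url, warn = FALSE)),
    # if the source cannot be reached or read, keep the message instead
    warning = function(w) list(ok = FALSE, content = conditionMessage(w)),
    error   = function(e) list(ok = FALSE, content = conditionMessage(e))
  )
}
```

A caller can then check the ok element before working with content, for example res <- request_data("https://data.colorado.gov/resource/tv8u-hswn.json") followed by if (res$ok) { ... }.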

API endpoints

When we talk about an endpoint, we are referring to a data source available through an API. These data may be census data, geospatial base maps, water quality data or any other type of data that has been made available through the API.

Restful web APIs

There are many different types of web APIs. One of the most common types is a REST, or RESTful, API. A RESTful API is a web API that uses URL arguments to specify what information you want returned through the API.

To put this all into perspective, next, we will explore a RESTful API interface example.

Colorado population projection data

The Colorado Information Marketplace is a data warehouse that provides access to a wide range of Colorado-specific open datasets available via a RESTful API called the Socrata Open Data API (SODA).

There are lots of API endpoints or data sets available through this API.

One endpoint contains Colorado Population Projection data. If you click on the link to the CO Population projection data, you will see data returned in a JSON format.

JSON is a structured, machine readable format. We will learn more about it in the next lesson.

Population data request and response

The CO population projection data contain projected population estimates for males and females in every county in Colorado, for every year from 1990 to 2040 and for multiple age groups.

Phew! In the previous sentence, we just specified all of the variables stored within these data. These variables can be used to query the data! This is our data request.

Below, we see a small subset of the response that we get from a basic request with no URL parameters specified - https://data.colorado.gov/resource/tv8u-hswn.json. Notice that the response in this case is returned in JSON format.

[{"age":"0","county":"Adams","femalepopulation":"2404.00","fipscode":"1","malepopulation":"2354.00","totalpopulation":"4758","year":"1990"}
,{"age":"1","county":"Adams","femalepopulation":"2375.00","fipscode":"1","malepopulation":"2345.00","totalpopulation":"4720","year":"1990"}
,{"age":"2","county":"Adams","femalepopulation":"2219.00","fipscode":"1","malepopulation":"2413.00","totalpopulation":"4632","year":"1990"}
,{"age":"3","county":"Adams","femalepopulation":"2261.00","fipscode":"1","malepopulation":"2321.00","totalpopulation":"4582","year":"1990"}
,{"age":"4","county":"Adams","femalepopulation":"2302.00","fipscode":"1","malepopulation":"2433.00","totalpopulation":"4735","year":"1990"},
...
]
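To give a sense of how a response like this maps onto an R object, the sketch below parses the first two records shown above. It assumes the jsonlite package is installed, which is an assumption of this example and not a requirement stated in this lesson. If the package is missing, the code is simply skipped.

```r
# The first two records of the response above, stored as text
json_txt <- '[
{"age":"0","county":"Adams","femalepopulation":"2404.00","fipscode":"1","malepopulation":"2354.00","totalpopulation":"4758","year":"1990"},
{"age":"1","county":"Adams","femalepopulation":"2375.00","fipscode":"1","malepopulation":"2345.00","totalpopulation":"4720","year":"1990"}
]'

# jsonlite is assumed to be installed; skip the example otherwise
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # an array of records with the same keys becomes a data.frame
  pop <- jsonlite::fromJSON(json_txt)
  # every value arrives as text because it is quoted in the JSON
  pop$totalpopulation <- as.numeric(pop$totalpopulation)
  str(pop)
}
```

Note that every value in the response is a quoted string, so numeric columns need to be converted before they can be used in calculations.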

URL Parameters

Using URL parameters, we can define a more specific request to limit what data we get back in response to our API request. For example, we can query the data to only return data for Boulder County, Colorado using the following RESTful call:

https://data.colorado.gov/resource/tv8u-hswn.json?&county=Boulder

Data Tip: Note the ?&county=Boulder part of the URL above. It is an important part of the API request that tells the API to only return a subset of the data, where county = Boulder.

Notice that when we visit the URL above and in turn request the data for Boulder County, we see that now the response is filtered to only include Boulder County data.

[{"age":"66","county":"Boulder","femalepopulation":"649","fipscode":"13","malepopulation":"596","totalpopulation":"1245","year":"1997"}
,{"age":"78","county":"Boulder","femalepopulation":"427","fipscode":"13","malepopulation":"258","totalpopulation":"685","year":"1992"}
,{"age":"85","county":"Boulder","femalepopulation":"265","fipscode":"13","malepopulation":"110","totalpopulation":"375","year":"1991"}
,{"age":"74","county":"Boulder","femalepopulation":"516","fipscode":"13","malepopulation":"373","totalpopulation":"889","year":"1996"},
...
]
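Because the filter is just text at the end of the URL, we can assemble request URLs in R rather than typing them by hand. The sketch below uses a hypothetical helper, county_url(), and writes the parameter as ?county=..., leaving out the & that follows the ? in the URL above. That this form is treated the same way by the API is an assumption of this sketch.

```r
# Endpoint for the Colorado population projection data
base_url <- "https://data.colorado.gov/resource/tv8u-hswn.json"

# Hypothetical helper: build the request URL for a single county
county_url <- function(county) {
  # URLencode() protects characters, such as spaces, that are not
  # allowed in a URL
  paste0(base_url, "?county=", utils::URLencode(county, reserved = TRUE))
}

# Build one request URL per county of interest
urls <- vapply(c("Adams", "Boulder"), county_url, character(1))
print(urls)
```

Building the URL from variables makes it easy to request the same subset for several counties in a loop.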

Parameters associated with accessing data using this API are documented here.

Using the SODA RESTful API

The SODA RESTful API also allows us to specify more complex ‘queries’. Here’s the API URL for population projections for females in Boulder County between the ages of 20 and 40 for the years 2016–2025:

https://data.colorado.gov/resource/tv8u-hswn.json?$where=age between 20 and 40 and year between 2016 and 2025&county=Boulder&$select=year,age,femalepopulation

[{"age":"32","femalepopulation":"2007","year":"2024"}
,{"age":"35","femalepopulation":"1950","year":"2016"}
,{"age":"37","femalepopulation":"2039","year":"2019"}
,{"age":"30","femalepopulation":"2087","year":"2025"}
,{"age":"26","femalepopulation":"1985","year":"2019"}
,{"age":"22","femalepopulation":"3207","year":"2016"}
...
]
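The URL above contains spaces, which a browser silently encodes for us. When we send the same request from code, it is safer to encode the parameter values ourselves. The following is a minimal sketch that rebuilds the URL above from its parts. The variable names are chosen for illustration only.

```r
base_url <- "https://data.colorado.gov/resource/tv8u-hswn.json"

# The query parameters from the URL above, stored as name = value pairs
params <- c(
  "$where"  = "age between 20 and 40 and year between 2016 and 2025",
  "county"  = "Boulder",
  "$select" = "year,age,femalepopulation"
)

# Percent-encode each value (for example, a space becomes %20)
encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)

# Join name=value pairs with "&" and attach them to the endpoint after "?"
query <- paste0(names(params), "=", encoded, collapse = "&")
request_url <- paste0(base_url, "?", query)
print(request_url)
```

Decoding request_url with utils::URLdecode() gives back the human-readable URL shown above.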

Click here to view the full API response.

Breaking down an API string

Notice that the API URL above starts with data.colorado.gov but then has various parameters attached to the end of the URL that specify the particular type of information that we are looking for.

A few of the parameters that we can see in the URL above are listed below:

  • The dataset itself: /tv8u-hswn.json
  • AGE: $where=age between 20 and 40
  • YEAR: year between 2016 and 2025 (part of the same $where clause)
  • COUNTY: county=Boulder
  • Columns to get: $select=year,age,femalepopulation
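The same breakdown can be done in code. As a rough sketch, the base R string functions below split the API URL into the endpoint and its parameters. The variable names are illustrative.

```r
request_url <- "https://data.colorado.gov/resource/tv8u-hswn.json?$where=age between 20 and 40 and year between 2016 and 2025&county=Boulder&$select=year,age,femalepopulation"

# Everything before "?" is the endpoint; everything after is the query
parts    <- strsplit(request_url, "?", fixed = TRUE)[[1]]
endpoint <- parts[1]

# Parameters are separated by "&"
params <- strsplit(parts[2], "&", fixed = TRUE)[[1]]

# Each parameter is a name=value pair; split at the first "="
param_names  <- sub("=.*$", "", params)
param_values <- sub("^[^=]*=", "", params)

print(endpoint)
print(stats::setNames(param_values, param_names))
```

This shows that the age and year conditions travel together in a single $where parameter, while county and $select are separate parameters.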

JSON structured text format API Response

The response data that are returned from this API are in a text format, structured using JSON.

Data Tip: Many APIs allow you to specify the file format that you want to be returned. Learn more about how this works with the CO data warehouse here.

Notice that the first few rows of data returned via the query above with a .csv suffix look like this:

"age","femalepopulation","year"
"32","2007","2024"
"35","1950","2016"
"37","2039","2019"
"30","2087","2025"
"26","1985","2019"
...
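A response in .csv format drops straight into the tools we already know. The sketch below swaps the suffix on the endpoint URL and parses the rows shown above from text, so it runs without an internet connection. The variable names are illustrative.

```r
# Switching the response format only requires changing the suffix
json_url <- "https://data.colorado.gov/resource/tv8u-hswn.json"
csv_url  <- sub(".json", ".csv", json_url, fixed = TRUE)
print(csv_url)

# The first rows of the .csv response shown above, stored as text
csv_txt <- '"age","femalepopulation","year"
"32","2007","2024"
"35","1950","2016"
"37","2039","2019"
"30","2087","2025"
"26","1985","2019"'

# read.csv() can parse text directly via the text argument
pop_csv <- read.csv(text = csv_txt)
str(pop_csv)
```

The result is an ordinary data.frame with one row per record and the columns age, femalepopulation and year.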

Here is a different application of the same type of API. Here, the website developers have built a tabular viewer that we can use to look at and interact with the population data. These data are the same data that we can download using the REST API url string above. However, the developers have wrapped the API in a cool interface that allows us to view the data directly, online.

We will work with these data in R directly in the following lessons, but for now just notice how the API access works in this case.

  1. Data REQUEST: You try to access a URL in your browser.
  2. Data processing: A web server somewhere uses that URL to query a specified dataset.
  3. Data RESPONSE: That web server then sends you back some content.

Optional challenge

Explore creating SODA API calls to the Colorado data warehouse. Go to the bottom of the page and check out each variable that you can query on.

Additional Resources

More about JSON

Using APIs

So, how do we learn more about APIs? Below are some resources …

The documentation in the URLs above describes the different types of requests that we can make to the data provider. For each request URL we need to specify the parameters and consider the response.
