Acquiring U.S. census data with Python and cenpy

Several online sources provide access to census data, including the U.S. Census Bureau’s American FactFinder and a number of outside sites. These sources, however, are not conducive to large-scale data acquisition and analysis. The Cenpy Python package allows programmatic access to this data through the Census Bureau’s API.

This tutorial outlines how to use the Cenpy package to search for and acquire specific census data. Cenpy returns this data as a pandas DataFrame, which allows for easy access and analysis within Python. To visualize the data, look into the GeoPandas package, which builds on pandas to add tools for geospatial data analysis.

Objectives

  • Install Cenpy package
  • Search for desired census data
  • Download and store data

Dependencies

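The Cenpy package depends on pandas and requests. Ensure that Python and pip are properly installed, then use the following commands to install cenpy. The leading ! runs a shell command from inside a Jupyter notebook; omit it when typing the commands in a terminal. PySAL is installed as well because the geometries returned later in this tutorial are PySAL shape objects.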

!pip install pandas
!pip install requests
!pip install cenpy
!pip install pysal
import pandas as pd
import cenpy as cen
import pysal

Finding Data

The cenpy explorer module allows you to view all of the available United States Census Bureau APIs.

datasets = list(cen.explorer.available(verbose=True).items())

# print first rows of the dataframe containing datasets
pd.DataFrame(datasets).head()
   0                    1
0  2012acs3             2012 American Community Survey: 3-Year Estimates
1  NONEMP2013           2013 Nonemployer Statistics: Non Employer Stat...
2  BDSFirms             Time Series Business Dynamics Statistics: Firm...
3  POPESTprmagesex2013  Vintage 2013 Population Estimates: Puerto Rico...
4  POPESTcty2013        Vintage 2013 Population Estimates: County Tota...
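
Because the full listing is long, it can help to filter it by keyword. The following is a minimal sketch, assuming the listing behaves like a plain dictionary that maps API identifiers to titles (as the .items() call above suggests). The sample dictionary holds only the five rows printed above, and find_datasets is a hypothetical helper, not part of cenpy.

```python
def find_datasets(listing, keyword):
    """Return sorted (identifier, title) pairs whose title contains keyword."""
    needle = keyword.lower()
    return sorted(
        (name, title)
        for name, title in listing.items()
        if needle in title.lower()
    )


# Sample copied from the five rows printed above; a real listing is much longer.
sample = {
    '2012acs3': '2012 American Community Survey: 3-Year Estimates',
    'NONEMP2013': '2013 Nonemployer Statistics: Non Employer Stat...',
    'BDSFirms': 'Time Series Business Dynamics Statistics: Firm...',
    'POPESTprmagesex2013': 'Vintage 2013 Population Estimates: Puerto Rico...',
    'POPESTcty2013': 'Vintage 2013 Population Estimates: County Tota...',
}

matches = find_datasets(sample, 'population estimates')
for name, title in matches:
    print(name, '->', title)
```

With a live connection, the same helper could be given the dictionary returned by cen.explorer.available(verbose=True) instead of the sample.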

Passing the name of a specific API to explorer.explain() returns a description of the data available. For this example, we will use the 2012 American Community Survey 1-year data (2012acs1).

dataset = '2012acs1'
cen.explorer.explain(dataset)
{'2012 American Community Survey: 1-Year Estimates': "The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years.  Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself.  The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups.  Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties.  For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}

The base module allows you to establish a connection with the desired API that will be used later to acquire data.

con = cen.base.Connection(dataset)
con
Connection to 2012 American Community Survey: 1-Year Estimates (ID: http://api.census.gov/data/id/2012acs1)

Acquiring Data

Geographical specification

Cenpy uses FIPS codes to specify the geographical extent of the data to be downloaded. The object con is our connection to the API, and its geographies attribute is a dictionary.

print(type(con))
print(type(con.geographies))
print(con.geographies.keys())
<class 'cenpy.remote.APIConnection'>
<class 'dict'>
dict_keys(['fips'])
# print head of data frame in the geographies dictionary
con.geographies['fips'].head()
   geoLevelId  name                                               optionalWithWCFor  requires
0  500         congressional district                             state              [state]
1  060         county subdivision                                 NaN                [state, county]
2  795         public use microdata area                          NaN                [state]
3  320         metropolitan statistical area/micropolitan sta...  NaN                [state]
4  310         metropolitan statistical area/micropolitan sta...  NaN                NaN

geo_unit and geo_filter are both required arguments for the query() function. geo_unit specifies the geographic level at which data should be returned. geo_filter then restricts the query to a parent geography so that more data than needed is not downloaded. The following example will download data for all counties in Colorado, whose state FIPS code is 08 and is written here as '8' (the Census Bureau publishes the full list of state FIPS codes).

g_unit = 'county:*'
g_filter = {'state':'8'}
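
To make the roles of the two arguments concrete, the sketch below shows roughly how a unit and a filter translate into the `for` and `in` parameters of a Census API request. It is an illustration only and not cenpy's own code: build_query_url is a hypothetical helper, and the base URL is an assumption about where the 2012 ACS 1-year endpoint lives.

```python
from urllib.parse import urlencode


def build_query_url(base_url, cols, geo_unit, geo_filter=None):
    """Assemble a Census-style query URL from columns and geography arguments."""
    params = {
        'get': ','.join(cols),  # requested columns, comma-separated
        'for': geo_unit,        # the level to return, e.g. 'county:*'
    }
    if geo_filter:
        # Each filter entry becomes 'level:code'; several entries are joined by spaces.
        params['in'] = ' '.join(
            f'{level}:{code}' for level, code in geo_filter.items()
        )
    # Leave ':', ',' and '*' readable instead of percent-encoding them.
    return base_url + '?' + urlencode(params, safe=':,*')


# Assumed endpoint; check the Census Bureau's API listing for the current address.
BASE_URL = 'https://api.census.gov/data/2012/acs1'

url = build_query_url(BASE_URL, ['NAME', 'B01001A_001E'], 'county:*', {'state': '8'})
print(url)
```

The unit names the rows you want back, while the filter names the parent geography those rows must belong to.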

Specifying variables to extract

The other argument taken by query() is cols, a list of column names taken from the variables of the API. These variables can be displayed using the variables attribute; however, because there are so many of them, it is easier to use the Social Explorer site to find the data you are interested in.

var = con.variables
print('Number of variables in', dataset, ':', len(var))
con.variables.head()
Number of variables in 2012acs1 : 68401
          concept  label                                              predicateOnly  predicateType
AIANHH    NaN      American Indian Area/Alaska Native Area/Hawaii...  NaN            NaN
AIANHHFP  NaN      American Indian Area/Alaska Native Area/Hawaii...  NaN            NaN
AIHHTLI   NaN      American Indian Trust Land/Hawaiian Home Land ...  NaN            NaN
AITS      NaN      American Indian Tribal Subdivision (FIPS)          NaN            NaN
AITSCE    NaN      American Indian Tribal Subdivision (Census)        NaN            NaN

Related columns of data always start with the same base prefix, so cenpy includes a function, varslike, that creates a list of column names matching an input pattern. It is also useful to add the NAME and GEOID columns, as these provide the name and geographic ID of each row. In this example, we will use table B01001A, which gives data on sex by age within the desired geography (the A suffix on the table name denotes the White alone population). The numeric identifier at the end of each column name corresponds to males or females of different age groups, and the final letter marks the value as an estimate (E) or a margin of error (M).

cols = con.varslike('B01001A_')
cols.extend(['NAME', 'GEOID'])
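
The idea behind this kind of prefix matching can be sketched with the standard library alone. The snippet below is not cenpy's implementation: names_like is a hypothetical helper, and the short list of variable names is illustrative. It also shows one way to separate estimate columns from margin-of-error columns by their final letter.

```python
import re


def names_like(names, pattern):
    """Return the names that match a regular expression anchored at the start."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


# Illustrative handful of variable names; the real API exposes tens of thousands.
names = [
    'AIANHH',
    'B01001A_001E',
    'B01001A_001M',
    'B01001B_001E',
    'B01001A_031E',
    'NAME',
]

matched = names_like(names, 'B01001A_')
estimates = [name for name in matched if name.endswith('E')]
margins = [name for name in matched if name.endswith('M')]

print(matched)
print(estimates)
print(margins)
```

Requesting only the estimate columns roughly halves the size of a download when the margins of error are not needed.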

With the three necessary arguments, data can be downloaded and saved as a pandas dataframe.

data = con.query(cols, geo_unit=g_unit, geo_filter=g_filter)
# prints a deprecation warning because of how cenpy calls pandas
/home/max/anaconda3/lib/python3.5/site-packages/cenpy/remote.py:167: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  df[cols] = df[cols].convert_objects(convert_numeric=convert_numeric)
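
The warning points to pd.to_numeric as the replacement for convert_objects. As a rough sketch of what that conversion looks like in your own code, assuming values that arrive as strings, the example below converts one column of a small stand-in frame. The frame and its values are illustrative and are not real query output.

```python
import pandas as pd

# Small stand-in for an API response in which every value arrives as a string.
raw = pd.DataFrame({
    'B01001A_031E': ['2483', '5125', 'n/a'],
    'NAME': ['County A', 'County B', 'County C'],
})

numeric_cols = ['B01001A_031E']
converted = raw.copy()
# errors='coerce' turns anything unparseable into NaN instead of raising.
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(converted.dtypes)
```

Coercing rather than raising keeps the download usable when a few cells hold placeholder text, at the cost of having to check for missing values afterwards.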

It is useful to replace the default index with the data from the NAME or GEOID column, as these will give a more useful description of the data.

data.index = data.NAME

# print first five rows and last five columns
# (.ix has been removed from pandas; .iloc selects by position)
data.iloc[:5, -5:]
                           B01001A_030M  B01001A_031E  B01001A_031M  NAME                       GEOID
NAME
Adams County, Colorado     671           2483          670           Adams County, Colorado     05000US08001
Arapahoe County, Colorado  701           5125          688           Arapahoe County, Colorado  05000US08005
Boulder County, Colorado   636           2985          645           Boulder County, Colorado   05000US08013
Denver County, Colorado    654           5408          650           Denver County, Colorado    05000US08031
Douglas County, Colorado   384           1177          366           Douglas County, Colorado   05000US08035
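
The GEOID column packs several codes into one string, which is handy when you need the state and county FIPS codes separately. The sketch below splits a county-level GEOID into its parts; parse_county_geoid is a hypothetical helper written for this tutorial, and it assumes the county layout seen above (a five-character prefix beginning with the summary level 050, the letters US, then a two-digit state code and a three-digit county code).

```python
def parse_county_geoid(geoid):
    """Split a county-level GEOID such as '05000US08001' into its parts."""
    prefix, sep, fips = geoid.partition('US')
    if not sep or len(prefix) != 5 or not prefix.startswith('050') or len(fips) != 5:
        raise ValueError(f'not a county-level GEOID: {geoid!r}')
    return {
        'summary_level': prefix[:3],  # '050' is the county summary level
        'state': fips[:2],            # two-digit state FIPS code
        'county': fips[2:],           # three-digit county FIPS code
    }


parts = parse_county_geoid('05000US08001')
print(parts)
```

Keeping the codes as strings preserves their leading zeros, which matters when they are later used as merge keys.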

Topologically Integrated Geographic Encoding and Referencing (TIGER) data

The Census TIGER API provides geometries for desired geographic regions. For instance, we may want additional information on each county, such as its area.

cen.tiger.available()
[{'name': 'AIANNHA', 'type': 'MapServer'},
 {'name': 'CBSA', 'type': 'MapServer'},
 {'name': 'Hydro_LargeScale', 'type': 'MapServer'},
 {'name': 'Hydro', 'type': 'MapServer'},
 {'name': 'Labels', 'type': 'MapServer'},
 {'name': 'Legislative', 'type': 'MapServer'},
 {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'},
 {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'},
 {'name': 'Region_Division', 'type': 'MapServer'},
 {'name': 'School', 'type': 'MapServer'},
 {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'},
 {'name': 'State_County', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'},
 {'name': 'tigerWMS_Census2010', 'type': 'MapServer'},
 {'name': 'tigerWMS_Current', 'type': 'MapServer'},
 {'name': 'tigerWMS_Econ2012', 'type': 'MapServer'},
 {'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'},
 {'name': 'Tracts_Blocks', 'type': 'MapServer'},
 {'name': 'Transportation_LargeScale', 'type': 'MapServer'},
 {'name': 'Transportation', 'type': 'MapServer'},
 {'name': 'TribalTracts', 'type': 'MapServer'},
 {'name': 'Urban', 'type': 'MapServer'},
 {'name': 'USLandmass', 'type': 'MapServer'}]

First, you must establish a connection to the TIGER API; then you can display the available layers. No TIGER data is available for ACS 2012, so we will use ACS 2013 for the sake of example, but ideally you will be able to find TIGER data that corresponds to your dataset.

con.set_mapservice('tigerWMS_ACS2013')

# print layers
con.mapservice.layers
{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) Subbarrios,
 23: (ESRILayer) Subbarrios Labels,
 24: (ESRILayer) Consolidated Cities,
 25: (ESRILayer) Consolidated Cities Labels,
 26: (ESRILayer) Incorporated Places,
 27: (ESRILayer) Incorporated Places Labels,
 28: (ESRILayer) Census Designated Places,
 29: (ESRILayer) Census Designated Places Labels,
 30: (ESRILayer) Alaska Native Regional Corporations,
 31: (ESRILayer) Alaska Native Regional Corporations Labels,
 32: (ESRILayer) Tribal Subdivisions,
 33: (ESRILayer) Tribal Subdivisions Labels,
 34: (ESRILayer) Federal American Indian Reservations,
 35: (ESRILayer) Federal American Indian Reservations Labels,
 36: (ESRILayer) Off-Reservation Trust Lands,
 37: (ESRILayer) Off-Reservation Trust Lands Labels,
 38: (ESRILayer) State American Indian Reservations,
 39: (ESRILayer) State American Indian Reservations Labels,
 40: (ESRILayer) Hawaiian Home Lands,
 41: (ESRILayer) Hawaiian Home Lands Labels,
 42: (ESRILayer) Alaska Native Village Statistical Areas,
 43: (ESRILayer) Alaska Native Village Statistical Areas Labels,
 44: (ESRILayer) Oklahoma Tribal Statistical Areas,
 45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels,
 46: (ESRILayer) State Designated Tribal Statistical Areas,
 47: (ESRILayer) State Designated Tribal Statistical Areas Labels,
 48: (ESRILayer) Tribal Designated Statistical Areas,
 49: (ESRILayer) Tribal Designated Statistical Areas Labels,
 50: (ESRILayer) American Indian Joint-Use Areas,
 51: (ESRILayer) American Indian Joint-Use Areas Labels,
 52: (ESRILayer) 113th Congressional Districts,
 53: (ESRILayer) 113th Congressional Districts Labels,
 54: (ESRILayer) 2013 State Legislative Districts - Upper,
 55: (ESRILayer) 2013 State Legislative Districts - Upper Labels,
 56: (ESRILayer) 2013 State Legislative Districts - Lower,
 57: (ESRILayer) 2013 State Legislative Districts - Lower Labels,
 58: (ESRILayer) Census Divisions,
 59: (ESRILayer) Census Divisions Labels,
 60: (ESRILayer) Census Regions,
 61: (ESRILayer) Census Regions Labels,
 62: (ESRILayer) 2010 Census Urbanized Areas,
 63: (ESRILayer) 2010 Census Urbanized Areas Labels,
 64: (ESRILayer) 2010 Census Urban Clusters,
 65: (ESRILayer) 2010 Census Urban Clusters Labels,
 66: (ESRILayer) Combined New England City and Town Areas,
 67: (ESRILayer) Combined New England City and Town Areas Labels,
 68: (ESRILayer) New England City and Town Area Divisions,
 69: (ESRILayer) New England City and Town Area  Divisions Labels,
 70: (ESRILayer) Metropolitan New England City and Town Areas,
 71: (ESRILayer) Metropolitan New England City and Town Areas Labels,
 72: (ESRILayer) Micropolitan New England City and Town Areas,
 73: (ESRILayer) Micropolitan New England City and Town Areas Labels,
 74: (ESRILayer) Combined Statistical Areas,
 75: (ESRILayer) Combined Statistical Areas Labels,
 76: (ESRILayer) Metropolitan Divisions,
 77: (ESRILayer) Metropolitan Divisions Labels,
 78: (ESRILayer) Metropolitan Statistical Areas,
 79: (ESRILayer) Metropolitan Statistical Areas Labels,
 80: (ESRILayer) Micropolitan Statistical Areas,
 81: (ESRILayer) Micropolitan Statistical Areas Labels,
 82: (ESRILayer) States,
 83: (ESRILayer) States Labels,
 84: (ESRILayer) Counties,
 85: (ESRILayer) Counties Labels}

The data retrieved earlier was at the county level, so we will use layer 84. Using the TIGER connection, query() retrieves the geometries, taking the layer and a where clause that selects the geographic location as arguments.

geodata = con.mapservice.query(layer=84, where='STATE=8')
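# preview the first six rows and first five columns of geodata
# (.ix has been removed from pandas; .iloc selects by position)
geodata.iloc[:6, :5]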
   AREALAND    AREAWATER  BASENAME    CENTLAT      CENTLON
0  1881237983  36592000   Boulder     +40.0924502  -105.3577112
1  396290895   4208401    Denver      +39.7620189  -104.8765880
2  6179976050  30284242   Pueblo      +38.1732359  -104.5127778
3  85478497    1411781    Broomfield  +39.9541268  -105.0527108
4  2958007403  16886462   Delta       +38.8613998  -107.8631974
5  4605714129  8166134    Cheyenne    +38.8281780  -102.6034141
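
TIGER reports AREALAND and AREAWATER in square meters, which is rarely the unit you want to present. The following is a small sketch of the conversion, using the land area reported for Boulder County in the preview; land_area is a hypothetical helper, not part of cenpy.

```python
SQ_M_PER_SQ_KM = 1_000_000
SQ_M_PER_SQ_MI = 1609.344 ** 2  # one international mile is 1609.344 m


def land_area(arealand_sq_m):
    """Convert an AREALAND value in square meters to km^2 and mi^2."""
    return {
        'sq_km': arealand_sq_m / SQ_M_PER_SQ_KM,
        'sq_mi': arealand_sq_m / SQ_M_PER_SQ_MI,
    }


# AREALAND for Boulder County, taken from the preview above.
boulder = land_area(1881237983)
print(round(boulder['sq_km'], 2), 'square kilometers')
print(round(boulder['sq_mi'], 2), 'square miles')
```

The same division can be applied to a whole DataFrame column at once, for example to derive a population density once the census columns have been merged in.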

This data can now be merged with the original data to create one pandas dataframe containing all of the relevant data.

newdata = pd.merge(data, geodata, left_on='county', right_on='COUNTY')
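# preview the first six rows and last five columns
# (.ix has been removed from pandas; .iloc selects by position)
newdata.iloc[:6, -5:]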
   NAME_y           OBJECTID  OID             STATE  geometry
0  Adams County     1226      27553700234319  08     <pysal.cg.shapes.Polygon object at 0x7f6173163...
1  Arapahoe County  2980      27553703789414  08     <pysal.cg.shapes.Polygon object at 0x7f617096c...
2  Boulder County   512       27553701435070  08     <pysal.cg.shapes.Polygon object at 0x7f617448c...
3  Denver County    529       27553700234321  08     <pysal.cg.shapes.Polygon object at 0x7f617448c...
4  Douglas County   2762      27553711656416  08     <pysal.cg.shapes.Polygon object at 0x7f6173058...
5  El Paso County   2878      27553704502958  08     <pysal.cg.shapes.Polygon object at 0x7f6171448...
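
By default pd.merge performs an inner join, so any county present in one table but missing from the other is dropped without notice. The sketch below uses two hypothetical miniature tables (only the key columns mirror the real data) to show how the how, validate, and indicator arguments of pd.merge can make such mismatches visible.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; the values are illustrative.
census = pd.DataFrame({
    'county': ['001', '005', '013'],
    'B01001A_031E': [2483, 5125, 2985],
})
shapes = pd.DataFrame({
    'COUNTY': ['013', '001', '031'],
    'BASENAME': ['Boulder', 'Adams', 'Denver'],
})

merged = pd.merge(
    census, shapes,
    left_on='county', right_on='COUNTY',
    how='left',             # keep every census row, even without a matching shape
    validate='one_to_one',  # raise if a county code is duplicated on either side
    indicator=True,         # adds a _merge column recording where each row came from
)

# Census rows that found no geometry.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['county'].tolist())
```

Both key columns are strings here. If one side held integers and the other zero-padded strings, the keys would not match, so it is worth checking the dtypes before merging.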
