Lesson 3. Subset Time Series By Dates Python Using Pandas

Leah Wasser, Jenny Palomino, Chris Holdgraf, Martha Morrissey

Learning Objectives

Subset time series data using different options for time frames, including by year, month, and with a specified begin and end date.

Temporally Subset Data Using Pandas Dataframes

Sometimes a dataset contains a much larger timeframe than you need for your analysis or plot, and it can helpful to select, or subset, the data to the needed timeframe.

There are many ways to subset the data temporally in Python; one easy way to do this is to use pandas.

Pandas natively understands time operations if:

you tell it what column contains your time stamps (using the parameter parse_dates) and
you set the date column to be the index of the dataframe (using the parameter index_col).

On the previous page of this chapter, you already learned how to complete these steps during the read_csv() import into the pandas dataframe. On this page, you will learn how to use the datetime index to subset data from a pandas dataframe.

Import Packages and Get Data

You will use a slightly modified version of precipitation data (inches) downloaded from the National Centers for Environmental Information (formerly National Climate Data Center) Cooperative Observer Network (COOP) station 050843 in Boulder, CO. The data were collected from January 1, 2003 through December 31, 2013.

Your instructor has modified these data as follows:

aggregated the data to represent daily sum values.
added some no data values to allow you to practice handling missing data.
added new columns to this data that would not usually be there if you downloaded it directly:
- Year
- Julian day (i.e. the calendar day number)

To begin, import the necessary packages to work with pandas dataframe and download data.

You will continue to work with modules from pandas and matplotlib to plot dates more efficiently and with seaborn to make more attractive plots.

# Import necessary packages
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import earthpy as et

# Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")

Download Curated Dataset From EarthPy

You will also download data from earthpy by specifiying a data key for the dataset called colorado-flood. This dataset has been curated by the earthpy team to include the precipitation data collected for 2003-2013 for Boulder, CO.

# Download the data
data = et.data.get_data('colorado-flood')

Note that when you download data using a data key in earthpy, the data are automatically downloaded to a new directory in the data directory under earth-analytics. The name of the directory will be the same as the name of the dataset, for example, colorado-flood.

For this dataset, there is a precipitation subdirectory within colorado-flood for the precipitation data.

# Set working directory
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# Define relative path to file with daily precip total
file_path = os.path.join("data", "colorado-flood",
                         "precipitation",
                         "805325-precip-dailysum-2003-2013.csv")

Now that you have downloaded the dataset, you can import the file for the measurement station for Boulder, CO, and specify the:

no data values using the parameter na_values
date column using the parameter parse_dates
datetime index using the parameter index_col

# Import data using datetime and no data value
boulder_precip_2003_2013 = pd.read_csv(file_path,
                                       parse_dates=['DATE'],
                                       index_col= ['DATE'],
                                       na_values=['999.99'])

# View first few rows
boulder_precip_2003_2013.head()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2003-01-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	1
2003-01-05	NaN	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	5
2003-02-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	32
2003-02-02	NaN	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	33
2003-02-03	0.4	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	34

About the Precipitation Data

Viewing the structure of these data, you can see that different types of data are included in this file.

STATION and STATION_NAME: Identification of the COOP station.
ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
DAILY_PRECIP: The total precipitation in inches. The metadata for this dataset notes that the value 999.99 indicates missing data. Also important, days with no precipitation are not included in the data.
YEAR: the year the data were collected
JULIAN: the JULIAN DAY the data were collected.

DATE is the date when the data were collected in the format: YYYY-MM-DD.

Notice that DATE is now the index value because you used the parse_date and index_col parameters when you imported the CSV file into a pandas dataframe.

Additional information about the data, known as metadata, is available in the PRECIP_HLY_documentation.pdf.

The metadata tell us that the no data value for these data is 999.99. IMPORTANT: your instructor has modified these data a bit for ease of teaching and learning. Specifically, data have been aggregated to represent daily sum values and some no data values have been added.

Data Tip You can download the original complete data subset with additional documentation here.

Even after reading documentation, it is always a good idea to explore data before working with them such as:

checking out the data types
calculating the summary statistics to get a sense of the data values (and make sure that “no data” values have been identified)
checking out the values in the datetime index.

# View dataframe info
boulder_precip_2003_2013.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 792 entries, 2003-01-01 to 2013-12-31
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   DAILY_PRECIP  788 non-null    float64
 1   STATION       792 non-null    object 
 2   STATION_NAME  792 non-null    object 
 3   ELEVATION     792 non-null    float64
 4   LATITUDE      792 non-null    float64
 5   LONGITUDE     792 non-null    float64
 6   YEAR          792 non-null    int64  
 7   JULIAN        792 non-null    int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 55.7+ KB

# View summary statistics
boulder_precip_2003_2013.describe()

	DAILY_PRECIP	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
count	788.000000	792.0	792.000000	792.000000	792.000000	792.000000
mean	0.247843	1650.5	40.033850	-105.281106	2007.967172	175.541667
std	0.462558	0.0	0.000045	0.000005	3.149287	98.536373
min	0.000000	1650.5	40.033800	-105.281110	2003.000000	1.000000
25%	0.100000	1650.5	40.033800	-105.281110	2005.000000	96.000000
50%	0.100000	1650.5	40.033890	-105.281110	2008.000000	167.000000
75%	0.300000	1650.5	40.033890	-105.281100	2011.000000	255.250000
max	9.800000	1650.5	40.033890	-105.281100	2013.000000	365.000000

# View index values of dataframe
boulder_precip_2003_2013.index

DatetimeIndex(['2003-01-01', '2003-01-05', '2003-02-01', '2003-02-02',
               '2003-02-03', '2003-02-05', '2003-02-06', '2003-02-07',
               '2003-02-10', '2003-02-18',
               ...
               '2013-11-01', '2013-11-09', '2013-11-21', '2013-11-27',
               '2013-12-01', '2013-12-04', '2013-12-22', '2013-12-23',
               '2013-12-29', '2013-12-31'],
              dtype='datetime64[ns]', name='DATE', length=792, freq=None)

Subset Pandas Dataframe By Year

Because you have a dataframe set up with an index, you can start to easily subset your data using the syntax:

df["index_date"]

Note that the datetime index value is accessed using quotation marks "" similar to how you query for text strings.

Using this syntax, you can select all of the data for the year 2013 by specifying the value that you want to select from the datetime index:

# Select 2013 data - view first few records
boulder_precip_2003_2013['2013'].head()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2013-01-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	1
2013-01-28	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	28
2013-01-29	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	29
2013-02-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	32
2013-02-14	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	45

# Select 2013 data - view last few records
boulder_precip_2003_2013['2013'].tail()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2013-12-04	0.4	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	338
2013-12-22	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	356
2013-12-23	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	357
2013-12-29	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	363
2013-12-31	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	365

Note that in the previous example, you are querying the datetime index directly, not querying the values from the Year column.

Subset Pandas Dataframe By Month

Using a datetime index with pandas makes it really easy to continue to select data using additional attributes of the index such as month.

This attribute of the datetime index can be accessed as:

df.index.month == value

where the month values are numeric values ranging from 1 to 12, representing January through December.

With this attribute, you can now employ the pandas syntax to filter values in a pandas dataframe using the syntax:

df[df.index.month == value]

# Select all December data - view first few rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.month == 12].head()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2003-12-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	335
2004-12-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2004	336
2004-12-22	0.2	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2004	357
2004-12-24	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2004	359
2005-12-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	335

# Select all December data - view last few rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.month == 12].tail()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2013-12-04	0.4	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	338
2013-12-22	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	356
2013-12-23	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	357
2013-12-29	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	363
2013-12-31	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.0338	-105.2811	2013	365

Notice that head() displays December records in 2003, while tail() displays December records in 2013.

Subset Pandas Dataframe By Day of Month

Similarly, you can the attribute day of the index to select all records for a specific day of the month as follows:

df.index.month == value

where the month values are numeric values ranging from 1 to 31, representing possible days of the month.

# Select data for 1st of month - view first rows
boulder_precip_2003_2013[boulder_precip_2003_2013.index.day == 1]

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2003-01-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	1
2003-02-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	32
2003-03-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	60
2003-04-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	91
2003-05-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2003	121
...	...	...	...	...	...	...	...	...
2013-08-01	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.03380	-105.28110	2013	213
2013-09-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03380	-105.28110	2013	244
2013-10-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03380	-105.28110	2013	274
2013-11-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03380	-105.28110	2013	305
2013-12-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03380	-105.28110	2013	335

132 rows × 8 columns

Subset Pandas Dataframe Using Range of Dates

You can also subset the data using a specific date range using the syntax:

df["begin_index_date" : "end_index_date]

For example, you can subset the data to a desired time period such as May 1, 2005 - August 31 2005, and then save it to a new dataframe.

# Subset data to May-Aug 2005
precip_may_aug_2005 = boulder_precip_2003_2013['2005-05-01':'2005-08-31']

precip_may_aug_2005.head()

	DAILY_PRECIP	STATION	STATION_NAME	ELEVATION	LATITUDE	LONGITUDE	YEAR	JULIAN
DATE
2005-05-01	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	121
2005-05-11	1.2	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	131
2005-05-30	0.5	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	150
2005-05-31	0.1	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	151
2005-06-01	0.0	COOP:050843	BOULDER 2 CO US	1650.5	40.03389	-105.28111	2005	152

Check Minimum and Maximum Values of Index

Rather than just checking the results of head() and tail(), you can actually query the min and max values of the index as follows:

# Check min value of index 
print(precip_may_aug_2005.index.min())

# Check max value of index 
print(precip_may_aug_2005.index.max())

2005-05-01 00:00:00
2005-08-23 00:00:00

Plot Temporal Subsets From Pandas Dataframe

Once you have subsetted the data and saved it, you can plot the data from the new dataframe to focus in on the desired time period.

Once again, you will use .index.values to access the datetime index values for the plot.

# Create figure and plot space
fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.bar(precip_may_aug_2005.index.values,
       precip_may_aug_2005['DAILY_PRECIP'],
       color='purple')

# Set title and labels for axes
ax.set(xlabel="Date",
       ylabel="Precipitation (inches)",
       title="Daily Total Precipitation\nMay - Aug 2005 for Boulder Creek")

# Rotate tick marks on x-axis
plt.setp(ax.get_xticklabels(), rotation=45)

plt.show()

Bar plot showing daily total precipitation for Boulder Creek between May and Aug 2005.

Think of New Applications and Uses of Subsetting

Given what you have learned about using df.index.month and df.index.day to select data by the month or day of the month value:

What would you replace month or day with, in order to select data by year or even a specific week of the year?

Resample Time Series Data Dates in Python

Share on

Twitter Facebook Google+ LinkedIn

Earth Data Analytics Online Certificate

Intermediate earth data science textbook

intermediate-earth-data-science-textbook Home