Lesson 3. Create Data Workflows with Loops

Leah Wasser, Nathan Korinek

Learning Objectives

After completing this chapter, you will be able to:

Iterate through multiple directories to access data within.
Nest loops in each other to run more complicated loops.
Use for loops to combine multiple datasets into one Pandas DataFrame.

Build a Data Workflow

Often, the data that you need for your workflow is messy and sometimes difficult to access. It can be spread out over multiple files in many directories. Manually processing data in multiple files and directories manually is time consuming. It can be much easier to automate the process in Python with for loops! This lesson will cover using loops to automate data workflows where you have many different files that you need to work with.

import os
from glob import glob
import matplotlib.pyplot as plt
import pandas as pd
import earthpy as et

# Download data on average monthly temp for two California sites
file_url = "https://ndownloader.figshare.com/files/21894528"
data_path = et.data.get_data(url=file_url)
data_path

# Set working directory to earth-analytics
os.chdir(os.path.join(et.io.HOME,
                      "earth-analytics",
                      "data",
                      "earthpy-downloads"))

Downloading from https://ndownloader.figshare.com/files/21894528
Extracted output to /root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr

# List directories for the data
os.listdir(data_path)

['Sonoma', 'San-Diego']

The data downloaded above contain temperature data for two cities in California, USA: Sonoma and San Diego. Within each directory you will find temperature data for the year stored in individual .csv files. Like this:

Sonoma-1999-temp.csv
Sonoma-2000-temp.csv
Sonoma-2001-temp.csv

Your analysis goal is to create a plot of temperature across all years that you have data for in both locations. To create this plot, you will need to:

open each file
combine the data for each year into a single dataframe
add an attribute that specifies the location and year to the data.frame

The above steps will allow you to summarize the plot the data.

Explore Your Data Using Loops

It’s normal to open and look at your directory structure after you download new data. You can, however, also explore the data using glob to list directories and files within each directory. To begin, create a list of all the directories in your new download directory. If you remember from Chapter 12 of this textbook, you can use glob to create a list of files that match certain criteria. You can then iterate through those lists with for loops to perform an operation on each file.

Here, you create a list using glob of all of the folders in the data you downloaded, and then iterate through that list with a for loop to print out the name of each folder within the main directory.

# Print all dirs found by glob
data_dirs = os.path.join(data_path, "*")
all_dirs = glob(data_dirs)

for a_dir in all_dirs:
    print(a_dir)

/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego

Use `glob` Inside a `for` Loop

The folders printed out above are interesting, but they are not the files you want to work on. The files are another directory down, stored within the folders printed above. To access these files, you can use glob once again. Inside the for loop, you can use glob to create a list of all of the files within each folder in the original list.

Below you can see two different lists printed, one with all of the files stored in the San-Diego folder, and one with all of the files stored in the Sonoma folder.

# Create lists inside a for loop
for a_dir in all_dirs:
    all_files = os.path.join(a_dir, "*")
    print(glob(all_files))

['/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2001-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2002-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2003-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2000-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-1999-temp.csv']
['/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2002-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2000-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-1999-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2003-temp.csv', '/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2001-temp.csv']

Nesting `for` Loops

Now that you’ve created another list inside of a for loop, you can loop through that list too! Below, you create the list of files inside each directory with glob, and then you loop through that list to print out each individual file.

Nesting for loops is a powerful tool, but can also be computationally expensive. Each time the outer loop executes one cycle, the for loop that’s nested inside must complete it’s entire loop before the outer loop executes it’s next cycle. You can nest as many for loops as you want, but it’s good to keep this in mind. The more for loops that are nested, the longer each cycle of the outer for loop will take.

# Nesting for loops
for a_dir in all_dirs:
    dir_path = os.path.join(a_dir, "*")
    all_file_paths = (glob(dir_path))
    # Create a nested loop which loops through each directory
    for a_file_path in all_file_paths:
        print(a_file_path)

/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2001-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2002-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2003-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2000-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-1999-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2002-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2000-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-1999-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2003-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2001-temp.csv

Data Tip: Assigning Variables in for Loops

When assigning variables in Python, overwriting a variable with new data can be useful to update values and make your code more legible. This can be useful in for loops too, but can easily be done by accident as well. Below, you see that for every file in the two folders that contain your data, the line pd.read_csv() is run, and assigned to the variable data. However, with each new cycle of the for loop, that variable is getting reassigned to a new file’s data.

You can see that even though you opened up and read every single file into pandas, the only data stored in the data variable is from the last file that was read, which happens to be the 2003 data from the Sonoma folder.

To avoid this data being overwritten every cycle, a common work around is to create a list outside of the for loop, and append all of the values into that list. That way the data variable from each cycle will be stored in the list with the data from that cycle stored in it, preventing it from being overwritten.

for a_dir in all_dirs:
    dir_path = os.path.join(a_dir, "*")
    all_file_paths = (glob(dir_path))
    # Create a nested loop which loops through each directory
    for a_file_path in all_file_paths:
        print(a_file_path)
        # Read the file into a pandas dataframe and assign it to a variable
        temp_data_df = pd.read_csv(a_file_path)
temp_data_df

/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2001-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2002-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2003-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-2000-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma/Sonoma-1999-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2002-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2000-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-1999-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2003-temp.csv
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-2001-temp.csv

	Year	January	February	March	April	May	June	July	August	September	October	November	December
0	2001	62.2	61.7	63.7	63.6	67.5	72	73	73.5	73.2	71.1	66.9	63.3

Combining Pandas Datasets in a List

pandas has a lot of functionality that makes modifying large amounts of data much easier. One such function is the concat function, which can be used to combine multiple pandas DataFrames stored in a list into one DataFrame. You can combine this functionality with a list of DataFrames created and added to a list in a for loop, and create a final DataFrame with all of the data stored in the directory structure.

# Create a list outside of the for loop to store data
all_df = []

for a_dir in all_dirs:
    dir_path = os.path.join(a_dir, "*")
    all_file_paths = (glob(dir_path))
    for a_file_path in all_file_paths:
        temp_data_df = pd.read_csv(a_file_path)
        # Append the data to the list you made to prevent it from being overwritten
        all_df.append(temp_data_df)

# Combine all the dataframes stored in the all_df list into one pandas dataframe
pd.concat(all_df)

Year	January	February	March	April	May	June	July	August	September	October	November	December
2001	56.5	58.4	68.0	66.7	83.1	86.1	81.2	85.3	81.0	78.4	64.0	55.6
2002	55.8	62.9	63.5	66.9	74.5	84.2	82.8	82.1	84.5	75.7	67.6	57.4
2003	58.9	61.8	66.4	61.5	74.2	81.1	87.0	83.5	85.0	82.7	61.0	56.4
2000	56.8	58.0	64.8	69.5	75.0	79.5	81.4	83.7	83.4	71.8	60.7	58.5
1999	56.1	56.6	58.7	68.3	72.7	80.6	80.9	82.9	81.3	79.1	64.2	61.5
2002	63.0	66.2	63.6	63.7	66.1	68.7	71.8	73.4	75.5	68.4	71.4	63.7
2000	65.8	65.3	63.7	67.7	69.1	72.4	73.5	76.3	75.5	69.5	66.0	66.9
1999	65.2	64.9	63.6	64.6	64.4	66.9	72.7	72.7	71.4	77.7	67.4	67.6
2003	71.4	64.6	66.5	65.2	66.9	67.4	74.3	77.5	74.0	73.1	67.4	64.6
2001	62.2	61.7	63.7	63.6	67.5	72.0	73.0	73.5	73.2	71.1	66.9	63.3

Extracting Data from Pathnames

Often times when data is stored in many directories, the name of the folder that the data is stored in can have useful information. Some examples of this include:

The date the data was collected
The location where the data was collected
Physical attributes of the data
The resolution of the data

There are countless other examples of data that can be stored in a file pathname. Because of this, it can be useful to add data from a pathname to your data.

In this example, the two folders that store all of the files contain useful information in the folder names. Specifically, the names, San-Diego and Sonoma, are indicative of where the data was collected. You can add this information to a new column in the pandas DataFrames you are creating.

Data Tip: Extracting Values from Pathnames

You can extract information from a path name in many ways. In this exercise, we use os.path.basename() to get our information, but that is not always the best way to collect information from a path. This wouldn’t work if, for example, the infomation you wanted from a path wasn’t last folder name in your pathname. For more ways to extract information from a path, see this lesson modifying paths with glob and os.

You can extract the last folder name in your pathname by using os.path.basename(). This function will grab the last folder name in a pathname and return it as a string.

# Get the basename from a pathname

basename = os.path.basename(all_dirs[0])

print("The pathname without running os.path.basename():\n" + all_dirs[0])
print("\nThe infromation extracted from the pathname by os.path.basename():\n" + basename)

The pathname without running os.path.basename():
/root/earth-analytics/data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma

The infromation extracted from the pathname by os.path.basename():
Sonoma

# Read in filenames without modifying the pathnames

all_df = []
for a_dir in all_dirs:
    dir_path = os.path.join(a_dir, "*")
    all_file_paths = (glob(dir_path))
    for a_file_path in all_file_paths:
        temp_data_df = pd.read_csv(a_file_path)
        # Reading the pathname from each directory into the dataframe.
        # Notice how it reads in the entire pathname.
        temp_data_df["location"] = a_dir
        all_df.append(temp_data_df)

pd.concat(all_df)

Year	January	February	March	April	May	June	July	August	September	October	November	December	location
2001	56.5	58.4	68.0	66.7	83.1	86.1	81.2	85.3	81.0	78.4	64.0	55.6	/root/earth-analytics/data/earthpy-downloads/a...
2002	55.8	62.9	63.5	66.9	74.5	84.2	82.8	82.1	84.5	75.7	67.6	57.4	/root/earth-analytics/data/earthpy-downloads/a...
2003	58.9	61.8	66.4	61.5	74.2	81.1	87.0	83.5	85.0	82.7	61.0	56.4	/root/earth-analytics/data/earthpy-downloads/a...
2000	56.8	58.0	64.8	69.5	75.0	79.5	81.4	83.7	83.4	71.8	60.7	58.5	/root/earth-analytics/data/earthpy-downloads/a...
1999	56.1	56.6	58.7	68.3	72.7	80.6	80.9	82.9	81.3	79.1	64.2	61.5	/root/earth-analytics/data/earthpy-downloads/a...
2002	63.0	66.2	63.6	63.7	66.1	68.7	71.8	73.4	75.5	68.4	71.4	63.7	/root/earth-analytics/data/earthpy-downloads/a...
2000	65.8	65.3	63.7	67.7	69.1	72.4	73.5	76.3	75.5	69.5	66.0	66.9	/root/earth-analytics/data/earthpy-downloads/a...
1999	65.2	64.9	63.6	64.6	64.4	66.9	72.7	72.7	71.4	77.7	67.4	67.6	/root/earth-analytics/data/earthpy-downloads/a...
2003	71.4	64.6	66.5	65.2	66.9	67.4	74.3	77.5	74.0	73.1	67.4	64.6	/root/earth-analytics/data/earthpy-downloads/a...
2001	62.2	61.7	63.7	63.6	67.5	72.0	73.0	73.5	73.2	71.1	66.9	63.3	/root/earth-analytics/data/earthpy-downloads/a...

# Read in filenames using os to modify the pathnames

all_df = []
for a_dir in all_dirs:
    dir_path = os.path.join(a_dir, "*")
    all_file_paths = (glob(dir_path))
    for a_file_path in all_file_paths:
        temp_data_df = pd.read_csv(a_file_path)
        # Read the pathname from each directory into the dataframe. 
        # Notice how it reads just the last folder name into the column now.
        temp_data_df["location"] = os.path.basename(a_dir)
        all_df.append(temp_data_df)

all_data = pd.concat(all_df)
all_data

Year	January	February	March	April	May	June	July	August	September	October	November	December	location
2001	56.5	58.4	68.0	66.7	83.1	86.1	81.2	85.3	81.0	78.4	64.0	55.6	Sonoma
2002	55.8	62.9	63.5	66.9	74.5	84.2	82.8	82.1	84.5	75.7	67.6	57.4	Sonoma
2003	58.9	61.8	66.4	61.5	74.2	81.1	87.0	83.5	85.0	82.7	61.0	56.4	Sonoma
2000	56.8	58.0	64.8	69.5	75.0	79.5	81.4	83.7	83.4	71.8	60.7	58.5	Sonoma
1999	56.1	56.6	58.7	68.3	72.7	80.6	80.9	82.9	81.3	79.1	64.2	61.5	Sonoma
2002	63.0	66.2	63.6	63.7	66.1	68.7	71.8	73.4	75.5	68.4	71.4	63.7	San-Diego
2000	65.8	65.3	63.7	67.7	69.1	72.4	73.5	76.3	75.5	69.5	66.0	66.9	San-Diego
1999	65.2	64.9	63.6	64.6	64.4	66.9	72.7	72.7	71.4	77.7	67.4	67.6	San-Diego
2003	71.4	64.6	66.5	65.2	66.9	67.4	74.3	77.5	74.0	73.1	67.4	64.6	San-Diego
2001	62.2	61.7	63.7	63.6	67.5	72.0	73.0	73.5	73.2	71.1	66.9	63.3	San-Diego

Plot Your Temperature Data

With the new column that specifies where the data was collected, you can use the pandas groupby function to plot a chart for each location! The groupby function will create a new plot for each unique entry in a column. Since there are two locations in the location column, you can produce two graphs, one for each location.

Below, you can create a new column that is the mean of all of the monthly readings for the average temperature in each location. With that column, you can plot the average annual temperature of each location by year using the groupby function.

# Creating a dataframe that is just the columns for temperature from each month.
just_monthly_data_df = all_data.loc[:, "January":"December"]

# Get the average of each row in that dataframe and assigning the output to a new column in the original dataframe
all_data['annual_average_temp'] = just_monthly_data_df.mean(axis=1)
all_data

Year	January	February	March	April	May	June	July	August	September	October	November	December	location	annual_average_temp
2001	56.5	58.4	68.0	66.7	83.1	86.1	81.2	85.3	81.0	78.4	64.0	55.6	Sonoma	72.025000
2002	55.8	62.9	63.5	66.9	74.5	84.2	82.8	82.1	84.5	75.7	67.6	57.4	Sonoma	71.491667
2003	58.9	61.8	66.4	61.5	74.2	81.1	87.0	83.5	85.0	82.7	61.0	56.4	Sonoma	71.625000
2000	56.8	58.0	64.8	69.5	75.0	79.5	81.4	83.7	83.4	71.8	60.7	58.5	Sonoma	70.258333
1999	56.1	56.6	58.7	68.3	72.7	80.6	80.9	82.9	81.3	79.1	64.2	61.5	Sonoma	70.241667
2002	63.0	66.2	63.6	63.7	66.1	68.7	71.8	73.4	75.5	68.4	71.4	63.7	San-Diego	67.958333
2000	65.8	65.3	63.7	67.7	69.1	72.4	73.5	76.3	75.5	69.5	66.0	66.9	San-Diego	69.308333
1999	65.2	64.9	63.6	64.6	64.4	66.9	72.7	72.7	71.4	77.7	67.4	67.6	San-Diego	68.258333
2003	71.4	64.6	66.5	65.2	66.9	67.4	74.3	77.5	74.0	73.1	67.4	64.6	San-Diego	69.408333
2001	62.2	61.7	63.7	63.6	67.5	72.0	73.0	73.5	73.2	71.1	66.9	63.3	San-Diego	67.641667

# Sort the data by year for nicer plotting
all_data.sort_values("Year", inplace=True)

f, ax = plt.subplots()
for title, group in all_data.groupby('location'):
    group.groupby('location').plot(x='Year', 
                                   y='annual_average_temp',
                                   title="Annual Average Temperature ºF",
                                   label=title,
                                   style='.--',
                                   ax=ax, 
                                   alpha=.8)

    
ax.set(xlabel="Year", ylabel="Temperature (F)", xticks=[1999, 2000, 2001, 2002, 2003])

plt.show()

Plot showing average annual temperature for San Diego and Sonoma from 1999 to 2003.

Challenge 1: Plot Temperature by Maximum Value

Above, you summarized the year by the average temperature for each year, plotted the two locations to compare their average annual temperatures. In this challenge, plot the maximum annual temperature for each location annually. Your graph should look similar to the one above, but it should display the maximum annual temperature instead of the average. You can use a similar method to getting the average to get the maximum temperature. Using max instead of mean of the just_monthly_data_df should return a series of data that’s the maximum values, which you can then assign to a column in the main dataframe and plot like above.

# Create data and plot answer here

Challenge 1 Example Plot showing maximum annual temperature for San Diego and Sonoma from 1999 to 2003.

List Comprehensions Automate Data Tasks With Loops

Share on

Twitter Facebook Google+ LinkedIn

Earth Data Analytics Online Certificate

Intro to earth data science textbook

intro-to-earth-data-science-textbook Home

Lesson 3. Create Data Workflows with Loops

Learning Objectives

Build a Data Workflow

Explore Your Data Using Loops

Use `glob` Inside a `for` Loop

Nesting `for` Loops

Combining Pandas Datasets in a List

Extracting Data from Pathnames

Plot Your Temperature Data

Challenge 1: Plot Temperature by Maximum Value

Share on

Leave a Comment

You May Also Enjoy

Plot Data With Matplotlib

Calculate Seasonal Summary Values from Climate Data Variables Stored in NetCDF 4 Format: Work With MACA v2 Climate Data in Python

Calculate Summary Values Using Spatial Areas of Interest (AOIs) including Shapefiles for Climate Data Variables Stored in NetCDF 4 Format: Work With MACA v2 Climate Data in Python

How to Open and Process NetCDF 4 Data Format in Open Source Python

Earth Data Analytics Online Certificate

Intro to earth data science textbook

intro-to-earth-data-science-textbook Home

Learning Objectives

Build a Data Workflow

Explore Your Data Using Loops

Use glob Inside a for Loop

Nesting for Loops

Combining Pandas Datasets in a List

Extracting Data from Pathnames

Plot Your Temperature Data

Challenge 1: Plot Temperature by Maximum Value

Share on

Leave a Comment

You May Also Enjoy

Plot Data With Matplotlib

Calculate Seasonal Summary Values from Climate Data Variables Stored in NetCDF 4 Format: Work With MACA v2 Climate Data in Python

Calculate Summary Values Using Spatial Areas of Interest (AOIs) including Shapefiles for Climate Data Variables Stored in NetCDF 4 Format: Work With MACA v2 Climate Data in Python

How to Open and Process NetCDF 4 Data Format in Open Source Python

Use `glob` Inside a `for` Loop

Nesting `for` Loops