Land cover classification at the Mississippi Delta

In this notebook, you will use a k-means unsupervised clustering algorithm to group pixels by similar spectral signatures. k-means is an exploratory method for finding patterns in data. Because it is unsupervised, you don’t need any training data for the model. You also can’t measure how well it “performs” because the clusters will not correspond to any particular land cover class. However, we expect at least some of the clusters to be identifiable as different types of land cover.

You will use the harmonized Sentinel/Landsat multispectral dataset. You can access the data with an Earthdata account and the earthaccess library from NSIDC.

STEP 1: SET UP

Task
  1. Import all libraries you will need for this analysis
  2. Configure GDAL parameters to help avoid connection errors:

     os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
     os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
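
A minimal sketch of what the setup cell might look like; the import list is an assumption based on the steps in this notebook, so adapt it to your own workflow:

import os
import pickle
import re

import earthaccess
import earthpy as et
import geopandas as gpd
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans

# Configure GDAL to retry failed network connections
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"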

Below you can find code for a caching decorator which you can use in your code. To use the decorator:

@cached('do_something', override=False)
def do_something(*args, **kwargs):
    ...
    return item_to_cache

This decorator will pickle the results of running the do_something() function, and only run the code if the results don’t already exist. To override the caching, for example temporarily after making changes to your code, set override=True. Note that to use the caching decorator, you must write your own function to perform each task!

def cached(func_key, override=False):
    """
    A decorator to cache function results
    
    Parameters
    ==========
    func_key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function):
        """
        Wrap the caching function
        
        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs):
            """
            Perform a computation and cache, or load cached result.
            
            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            # Add an identifier from the particular function call
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key

            if 'project' in kwargs:
                project = kwargs['project'] 
            else:
                project = et.Project()
            path = project.project_dir / 'jars' / f'{key}.pickle'
            
            # Check if the cache exists already or override caching
            if not os.path.exists(path) or override:
                # Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)
                
                # Run the compute function as the user did
                result = compute_function(*args, **kwargs)
                
                # Pickle the object
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Unpickle the object
                with open(path, 'rb') as file:
                    result = pickle.load(file)
                    
            return result
        
        return compute_and_cache
    
    return compute_and_cache_decorator
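
For example, a toy wrapped function might look like this (the function name and cache key are purely illustrative, and et.Project() is assumed to be configured as in the decorator above):

# A toy computation to demonstrate the caching decorator
@cached('square')
def compute_square(x, cache_key=None):
    """Square a number (slowly, in your imagination)."""
    return x ** 2

# Saves jars/square_x4.pickle on the first call; loads the pickle afterwards
result = compute_square(4, cache_key='x4')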

STEP 2: STUDY SITE

For this analysis, you will use a watershed from the Watershed Boundary Dataset, HU12 watersheds (WBDHU12.shp).

Task
  1. Download the Watershed Boundary Dataset for region 8 (Lower Mississippi)
  2. Select watershed 080902030506
  3. Generate a site map of the watershed

Try to use the caching decorator
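
A hypothetical sketch of this step, using earthpy to download the data and the decorator above for caching; the function name, cache key, path within the archive, and the huc12 column name are all assumptions to check against your download:

@cached('wbd_08')
def read_wbd(url, cache_key=None):
    """
    Download the Watershed Boundary Dataset and load the HU12 layer.

    Parameters
    ==========
    url: str
      URL of the staged WBD shapefile archive
    """
    wbd_dir = et.data.get_data(url=url)
    # Path within the extracted archive is an assumption
    return gpd.read_file(os.path.join(wbd_dir, 'Shape', 'WBDHU12.shp'))

wbd_gdf = read_wbd(
    'https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography'
    '/WBD/HU2/Shape/WBD_08_HU2_Shape.zip')

# Select watershed 080902030506 (the column may be capitalized as HUC12)
site_gdf = wbd_gdf[wbd_gdf['huc12'] == '080902030506']

# Generate a site map of the watershed
site_gdf.plot(edgecolor='blue', facecolor='none')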

Downloading from https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/WBD/HU2/Shape/WBD_08_HU2_Shape.zip
Extracted output to /home/runner/earth-analytics/data/WBD_08_HU2_Shape

We chose this watershed because it covers parts of New Orleans and is near the Mississippi Delta. Deltas are boundary areas between the land and the ocean, and as a result tend to contain a rich variety of different land cover and land use types.

Write a 2-3 sentence site description (with citations) of this area that helps to put your analysis in context.

YOUR SITE DESCRIPTION HERE

STEP 3: MULTISPECTRAL DATA

Search for data

Task
  1. Log in to the earthaccess service using your Earthdata credentials:

     earthaccess.login(persist=True)

  2. Modify the following sample code to search for granules of the HLSL30 product overlapping the watershed boundary from May to October 2023 (there should be 76 granules):

     results = earthaccess.search_data(
         short_name="...",
         cloud_hosted=True,
         bounding_box=tuple(gdf.total_bounds),
         temporal=("...", "..."),
     )
# Log in to earthaccess

# Search for HLS tiles
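
A sketch of the completed cell, filling in the product short name and date range given above; the temporal endpoints are one way to express May to October 2023, and site_gdf is assumed to hold the watershed boundary from Step 2:

# Log in to earthaccess (prompts for Earthdata credentials the first time)
earthaccess.login(persist=True)

# Search for HLS tiles overlapping the watershed, May-October 2023
results = earthaccess.search_data(
    short_name="HLSL30",
    cloud_hosted=True,
    bounding_box=tuple(site_gdf.total_bounds),
    temporal=("2023-05-01", "2023-10-31"),
)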

Compile information about each granule

I recommend building a GeoDataFrame, as this will allow you to plot the granules you are downloading and make sure they line up with your shapefile. You could also use a DataFrame, dictionary, or a custom object to store this information.

Task
  1. For each search result:
    1. Get the following information (HINT: look at the ['umm'] values for each search result):
      • granule id (UR)
      • datetime
      • geometry (HINT: check out the shapely.geometry.Polygon class to convert points to a Polygon)
    2. Open the granule files. I recommend opening one granule at a time, e.g. with earthaccess.open([result]).
    3. For each file (band), get the following information:
      • file handler returned from earthaccess.open()
      • tile id
      • band number
  2. Compile all the information you collected into a GeoDataFrame (one possible approach is sketched below)
# Loop through each granule

    # Get granule information

    # Get URL

    # Build metadata DataFrame rows

# Concatenate metadata DataFrame
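
One possible sketch of this loop, assuming the imports from the Step 1 sketch; the regular expression, the column names, and the use of the file handler's full_name attribute are assumptions you may need to adapt:

# Regular expression to parse the tile id and band from HLS file names,
# e.g. HLS.L30.T15RYN.2023125T163155.v2.0.B05.tif (pattern is an assumption)
url_re = re.compile(
    r'\.(?P<tile_id>T\d{2}[A-Z]{3})\.\d{7}T\d{6}\.v2\.0\.(?P<band>.+)\.tif')

# Loop through each granule
granule_rows = []
for result in results:
    # Get granule information from the UMM metadata
    umm = result['umm']
    granule_id = umm['GranuleUR']
    datetime = pd.to_datetime(
        umm['TemporalExtent']['RangeDateTime']['BeginningDateTime'])
    points = (
        umm['SpatialExtent']['HorizontalSpatialDomain']
           ['Geometry']['GPolygons'][0]['Boundary']['Points'])
    geometry = Polygon([(p['Longitude'], p['Latitude']) for p in points])

    # Open the granule files and build one metadata row per band
    for file_handler in earthaccess.open([result]):
        # Get URL and parse out the tile id and band
        match = url_re.search(file_handler.full_name)
        if match is None:
            continue
        granule_rows.append(dict(
            granule_id=granule_id, datetime=datetime, geometry=geometry,
            tile_id=match.group('tile_id'), band=match.group('band'),
            file_handler=file_handler))

# Concatenate metadata into a GeoDataFrame
file_gdf = gpd.GeoDataFrame(granule_rows, geometry='geometry', crs='EPSG:4326')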

Open, crop, and mask data

This will be the most resource-intensive step. I recommend caching your results using the cached decorator or by writing your own caching code. I also recommend testing this step with one or two dates before running the full computation.

This code should include at least one function with a numpy-style docstring. A good place to start is a function that opens a single masked raster, applies the appropriate scale factor, and crops it.
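
A minimal sketch of such a function; the name, the default scale factor, and the use of clip_box are assumptions:

def open_and_crop(file_handler, boundary_gdf, scale=0.0001, masked=True):
    """
    Open a raster band, crop it to a boundary, and apply a scale factor.

    Parameters
    ==========
    file_handler: file-like
      File handler returned from earthaccess.open()
    boundary_gdf: gpd.GeoDataFrame
      Boundary to crop to; reprojected to the raster CRS internally
    scale: float
      Scale factor to convert stored integers to reflectance
    masked: bool
      When True, load the nodata value as NaN
    """
    da = rxr.open_rasterio(file_handler, masked=masked).squeeze()
    bounds = boundary_gdf.to_crs(da.rio.crs).total_bounds
    return da.rio.clip_box(*bounds) * scale

When opening the Fmask band, you could pass scale=1 and masked=False so the integer bit flags are preserved, as in the sketch further below.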

Task
  1. For each granule:
    1. Open the Fmask band, crop, and compute a quality mask for the granule. You can use the following code as a starting point, making sure that mask_bits contains the quality bits you want to consider:

       # Expand into a new dimension of binary bits
       bits = (
           np.unpackbits(da.astype(np.uint8), bitorder='little')
           .reshape(da.shape + (-1,))
       )

       # Select the required bits and check if any are flagged
       mask = np.prod(bits[..., mask_bits] == 0, axis=-1)

    2. For each band that starts with 'B':
      1. Open the band, crop, and apply the scale factor
      2. Name the DataArray after the band using the .name attribute
      3. Apply the cloud mask using the .where() method
      4. Store the DataArray in your data structure (e.g. by adding a GeoDataFrame column with the DataArray in it; note that you will need to remove the rows for unused bands)
# Loop through each image

    # Open granule cloud cover

    # Compute cloud mask

    # Loop through each spectral band

        # Open, crop, and mask the band

        # Add the DataArray to the metadata DataFrame row

    # Reassemble the metadata DataFrame
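
A hypothetical sketch of this loop, assuming the open_and_crop() function above and the file_gdf and site_gdf tables from the earlier sketches; the mask_bits values are an assumption to check against the HLS v2.0 user guide:

# Quality bits to mask out (assumed values -- check the HLS user guide)
mask_bits = [1, 2, 3]

# Loop through each image (granule)
masked_rows = []
for (granule_id, dt), granule_df in file_gdf.groupby(['granule_id', 'datetime']):
    print('Processing granule', granule_df.tile_id.iloc[0], dt)

    # Open granule cloud cover (Fmask), keeping the raw integer flags
    fmask_row = granule_df[granule_df.band == 'Fmask'].iloc[0]
    fmask_da = open_and_crop(
        fmask_row.file_handler, site_gdf, scale=1, masked=False)

    # Compute cloud mask by unpacking the quality bits
    bits = (
        np.unpackbits(fmask_da.astype(np.uint8), bitorder='little')
        .reshape(fmask_da.shape + (-1,)))
    cloud_mask = np.prod(bits[..., mask_bits] == 0, axis=-1).astype(bool)

    # Loop through each spectral band
    for _, row in granule_df[granule_df.band.str.startswith('B')].iterrows():
        # Open, crop, and mask the band
        band_da = open_and_crop(row.file_handler, site_gdf)
        band_da.name = row.band
        row['da'] = band_da.where(cloud_mask)
        masked_rows.append(row)

# Reassemble the metadata DataFrame, one row per masked band image
granule_gdf = gpd.GeoDataFrame(masked_rows, geometry='geometry', crs=file_gdf.crs)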
Processing granule T15RYN 2024-06-07 16:31:11.509000+00:00
Processing granule T15RYP 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBT 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBU 2024-06-07 16:31:11.509000+00:00
... (similar log lines for the same 4 tiles on each date through 2024-08-26)

Merge and Composite Data

You will notice for this watershed that:
  1. The raster data for each date are spread across 4 granules
  2. Any given image is incomplete because of clouds

Task
  1. For each band:
    1. For each date:
      1. Merge all 4 granules
      2. Mask any negative values created by interpolating from the nodata value of -9999 (rioxarray should account for this, but doesn't appear to when merging; if you leave these values in, they will create problems down the line)
    2. Concatenate the merged DataArrays along a new date dimension
    3. Take the mean in the date dimension to create a composite image that fills cloud gaps
    4. Add the band as a dimension, and give the DataArray a name
  2. Concatenate along the band dimension

# Merge and composite an image for each band

        # Merge granules for each date

        # Mask negative values

    # Composite images across dates
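
A hypothetical sketch of this step, assuming the granule_gdf table from the previous sketch with a da column of masked DataArrays; the names are illustrative:

from rioxarray.merge import merge_arrays

# Convert band labels like 'B05' to integers, matching the output below
granule_gdf['band'] = granule_gdf.band.str[1:].astype(int)

# Merge and composite an image for each band
band_das = []
for band, band_df in granule_gdf.groupby('band'):
    date_das = []
    for date, date_df in band_df.groupby(band_df.datetime.dt.date):
        # Merge granules for each date
        merged_da = merge_arrays(list(date_df.da))
        # Mask negative values left over from the -9999 nodata fill
        date_das.append(merged_da.where(merged_da >= 0))

    # Composite images across dates, filling cloud gaps
    composite_da = xr.concat(date_das, dim='date').mean('date')
    band_das.append(composite_da.assign_coords(band=band).expand_dims('band'))

# Concatenate along the band dimension and name the result
reflectance_da = xr.concat(band_das, dim='band')
reflectance_da.name = 'reflectance'

Taking the mean works here because xarray skips NaN values by default (skipna=True), so masked cloud pixels are simply ignored in the composite.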
<xarray.DataArray 'reflectance' (band: 10, y: 556, x: 624)> Size: 14MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
...
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]],
      shape=(10, 556, 624), dtype=float32)
Coordinates:
  * band         (band) int64 80B 1 2 3 4 5 6 7 9 10 11
  * y            (y) float64 4kB 3.304e+06 3.304e+06 ... 3.287e+06 3.287e+06
  * x            (x) float64 5kB 7.926e+05 7.926e+05 ... 8.112e+05 8.113e+05
    spatial_ref  int64 8B 0
Attributes: (12/35)
    ACCODE:                    Lasrc; Lasrc
    add_offset:                0.0
    arop_ave_xshift(meters):   0, 0
    arop_ave_yshift(meters):   0, 0
    arop_ncp:                  0, 0
    arop_rmse(meters):         0, 0
    ...                        ...
    TIRS_SSM_MODEL:            PRELIMINARY; PRELIMINARY
    TIRS_SSM_POSITION_STATUS:  ESTIMATED; ESTIMATED
    ULX:                       699960
    ULY:                       3300000
    USGS_SOFTWARE:             LPGS_16.4.0
    AREA_OR_POINT:             Area

STEP 4: K-MEANS

Cluster your data by spectral signature using the k-means algorithm.

Task
  1. Convert your DataArray into a tidy DataFrame of reflectance values (hint: check out the .to_dataframe() and .unstack() methods)
  2. Filter out all rows with no data (all 0s or any N/A values)
  3. Fit a k-means model. You can experiment with the number of groups to find what works best.
# Convert spectral DataArray to a tidy DataFrame

# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.

# Add the predicted values back to the model DataFrame
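
A sketch of this step, assuming reflectance_da is the composite DataArray from Step 3; the number of clusters is a parameter to experiment with, and dropping the thermal bands (10 and 11) is inferred from the sample output below:

from sklearn.cluster import KMeans

# Convert spectral DataArray to a tidy DataFrame, one column per band
model_df = (
    reflectance_da
    .to_dataframe()['reflectance']
    .unstack('band')
    .dropna())

# Filter out rows with no data (all 0s) and drop the thermal bands
# so that only reflectance values are clustered
model_df = model_df[(model_df != 0).any(axis=1)]
model_df = model_df.drop(columns=[10, 11], errors='ignore')

# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.
kmeans = KMeans(n_clusters=6)

# Add the predicted values back to the model DataFrame
model_df['clusters'] = kmeans.fit_predict(model_df.values)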
band                            1       2       3       4       5       6       7       9  clusters
y            x
3.303783e+06 810148.062907  0.01560  0.0225  0.0409  0.0366  0.0478  0.0281  0.0181  0.0006         5
             810178.062907  0.01895  0.0256  0.0396  0.0413  0.0426  0.0284  0.0220  0.0006         5
             810208.062907  0.01915  0.0246  0.0387  0.0377  0.0384  0.0273  0.0236  0.0007         5
             810238.062907  0.02040  0.0247  0.0440  0.0445  0.0629  0.0418  0.0266  0.0007         0
             810268.062907  0.01815  0.0245  0.0437  0.0444  0.0618  0.0397  0.0259  0.0008         0
...                             ...     ...     ...     ...     ...     ...     ...     ...       ...
3.287163e+06 793768.062907  0.02650  0.0345  0.0548  0.0427  0.0218  0.0098  0.0074  0.0007         5
             793798.062907  0.02790  0.0351  0.0549  0.0439  0.0221  0.0104  0.0076  0.0008         5
             793828.062907  0.02580  0.0331  0.0534  0.0419  0.0194  0.0080  0.0059  0.0009         5
             793858.062907  0.02570  0.0326  0.0521  0.0402  0.0182  0.0064  0.0046  0.0007         5
             793888.062907  0.02550  0.0340  0.0541  0.0423  0.0199  0.0083  0.0060  0.0007         5

317917 rows × 9 columns

STEP 5: PLOT

Task

Create a plot that shows the k-means clusters next to an RGB image of the area. You may need to brighten your RGB image by multiplying it by 10. The code for reshaping and plotting the clusters is provided for you below, but you will have to create the RGB plot yourself!

So, what is .sortby(['x', 'y']) doing for us? Try the code without it and find out.
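
One possible way to build rgb_plot, assuming reflectance_da from Step 3 (HLSL30 bands 4, 3, and 2 are red, green, and blue; the brightening factor follows the hint above):

# Select red, green, and blue bands and brighten for display
rgb_da = reflectance_da.sel(band=[4, 3, 2]) * 10

# Build the RGB image plot to display next to the clusters
rgb_plot = rgb_da.clip(0, 1).hvplot.rgb(
    x='x', y='y', bands='band', aspect='equal')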

# Plot the k-means clusters
(
    rgb_plot
    + 
    model_df.clusters.to_xarray().sortby(['x', 'y']).hvplot(
        cmap="Colorblind", aspect='equal') 
)
Reflect and Respond

Don’t forget to interpret your plot!

YOUR PLOT HEADLINE AND DESCRIPTION HERE