Introduction to spatial regression in Python

Regression analysis allows you to model and predict a process based on its relationship to one or more explanatory variables. Often, however, a standard regression model is insufficient for modeling data with a spatial dependency. This happens when the data are spatially autocorrelated. In these cases, you may want to use a spatial regression model instead.

Spatial regression analysis allows you to incorporate spatial dependencies into your model. This tutorial outlines how to apply these spatial regression methods using the PySAL Python package.

Objectives

  • Perform ordinary linear regression analysis
  • Evaluate spatial autocorrelation
  • Perform spatial regression analysis

Dependencies

  • PySAL
  • numpy
!pip install pysal
import pysal
import numpy as np

# spreg provides the regression routines used in this tutorial
from pysal.spreg import ols       # ordinary least squares
from pysal.spreg import ml_error  # maximum likelihood spatial error models
from pysal.spreg import ml_lag    # maximum likelihood spatial lag models

The pysal package contains many sample data files that can be used to demonstrate the package’s capabilities. For this example, we will analyze Columbus home values in relation to neighborhood income and crime. First we will run an ordinary least squares linear regression model to analyze the relationship between these variables.

This first section of code reads the home values (dependent variable) into an array y, and the income and crime values (independent variables) into a two-dimensional array X.

# read the example attribute table
f = pysal.open(pysal.examples.get_path("columbus.dbf"), 'r')

# dependent variable: home values, as an (n, 1) column vector
y = np.array(f.by_col['HOVAL'])
y.shape = (len(y), 1)

# independent variables: income and crime, as an (n, 2) array
X = []
X.append(f.by_col['INC'])
X.append(f.by_col['CRIME'])
X = np.array(X).T

Now that we have stored the values we are analyzing, we can perform ordinary least squares (OLS) regression. This is done with the pysal.spreg module. Our OLS instance, named ls, has many useful tools for reviewing the results of our model. In this case, we will use ls.summary to obtain a full summary of the results, but for more specific results you can look at the other attributes described in the pysal.spreg documentation.

ls = ols.OLS(y, X, name_y='home val', name_x=['Income', 'Crime'], name_ds='Columbus')
print(ls.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :    Columbus
Weights matrix      :        None
Dependent Variable  :    home val                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
R-squared           :      0.3495
Adjusted R-squared  :      0.3212
Sum squared residual:   10647.015                F-statistic           :     12.3582
Sigma-square        :     231.457                Prob(F-statistic)     :   5.064e-05
S.E. of regression  :      15.214                Log likelihood        :    -201.368
Sigma-square ML     :     217.286                Akaike info criterion :     408.735
S.E of regression ML:     14.7406                Schwarz criterion     :     414.411

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      46.4281827      13.1917570       3.5194844       0.0009867
              Income       0.6289840       0.5359104       1.1736736       0.2465669
               Crime      -0.4848885       0.1826729      -2.6544086       0.0108745
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           12.538

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2          39.706           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                2           5.767           0.0559
Koenker-Bassett test              2           2.270           0.3214
================================ END OF REPORT =====================================
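
Beyond the printed summary, the fitted OLS object exposes individual results as attributes. A few examples (attribute names as documented for pysal.spreg’s OLS):

print(ls.betas)    # estimated coefficients (CONSTANT, Income, Crime)
print(ls.r2)       # R-squared
print(ls.std_err)  # standard errors of the coefficient estimates
print(ls.u[:5])    # first five residuals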

OLS regression assumes that each observation is independent of all others. However, in practice, observations at locations near each other are unlikely to be independent.

We can evaluate spatial autocorrelation in the residuals with Moran’s I test. Our first step in that process is to create a spatial weights matrix. PySAL’s example data has a GAL file that we can read in directly to create this matrix.

w = pysal.open(pysal.examples.get_path("columbus.gal")).read()
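
If you do not have a pre-built GAL file, PySAL can also construct contiguity weights directly from a shapefile. A minimal sketch, assuming the columbus.shp file that ships with PySAL’s example data:

# queen contiguity: polygons sharing any boundary point are neighbors
w_queen = pysal.queen_from_shapefile(pysal.examples.get_path("columbus.shp"))
w_queen.transform = 'r'  # row-standardize so each row of weights sums to 1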

We can pass this weights matrix to pysal.Moran, along with our model residuals (ls.u). In the output below, the observed value of I is much higher than the value we would expect if there were no spatial dependence; the reported p-value is the probability of observing an I at least this large under that null hypothesis.

mi = pysal.Moran(ls.u, w, two_tailed=False)
print('Observed I:', mi.I, '\nExpected I:', mi.EI, '\n   p-value:', mi.p_norm)
Observed I: 0.171310158169 
Expected I: -0.020833333333333332 
   p-value: 0.0189304275212
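
The p-value above comes from a normal approximation. As a check, you can also base inference on random permutations of the values across locations; a minimal sketch:

# pseudo p-value from 999 random permutations
mi_perm = pysal.Moran(ls.u, w, permutations=999)
print('pseudo p-value:', mi_perm.p_sim)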

We can use a spatial regression model to account for spatial non-independence. The spreg module has several different functions for fitting a spatial regression model. In this example we will use a spatial error model; the implementation of a spatial lag model is similar (a sketch follows the error-model results below).

spat_err = ml_error.ML_Error(y, X, w, 
                             name_y='home value', name_x=['income','crime'], 
                             name_w='columbus.gal', name_ds='columbus')
print(spat_err.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :    columbus
Weights matrix      :columbus.gal
Dependent Variable  :  home value                Number of Observations:          49
Mean dependent var  :     38.4362                Number of Variables   :           3
S.D. dependent var  :     18.4661                Degrees of Freedom    :          46
Pseudo R-squared    :      0.3495
Sigma-square ML     :     197.314                Log likelihood        :    -199.769
S.E of regression   :      14.047                Akaike info criterion :     405.537
                                                 Schwarz criterion     :     411.213

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      47.8090727      12.3906790       3.8584708       0.0001141
              income       0.7213778       0.5029284       1.4343549       0.1514710
               crime      -0.5576102       0.1788127      -3.1184037       0.0018183
              lambda       0.3578233       0.1705868       2.0976020       0.0359403
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
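
The spatial error coefficient lambda is positive and significant (p ≈ 0.036), consistent with the spatial dependence that Moran’s I detected in the OLS residuals. For comparison, here is a minimal sketch of the corresponding spatial lag model, using the ml_lag module imported earlier:

# spatial lag model: adds a spatially lagged dependent variable (coefficient rho)
spat_lag = ml_lag.ML_Lag(y, X, w,
                         name_y='home value', name_x=['income', 'crime'],
                         name_w='columbus.gal', name_ds='columbus')
print(spat_lag.summary)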


