Geographical-XGBoost documentation

An implementation of XGBoost for Geographical Analysis.

Geoxgboost is a Python library that implements the Geographical-XGBoost (G-XGBoost) algorithm for spatially local regression. G-XGBoost belongs to the family of Spatial Machine Learning algorithms and modifies the standard XGBoost algorithm (extreme gradient boosting trees) to handle spatial data and spatial heterogeneity.

G-XGBoost:

  • Applies the concept of geographically varying models in XGBoost: This means it creates local models that analyze data within a specified neighborhood using spatial weights.

  • Creates an ensemble of the global and local models: It utilizes both global and local models for training, validation, and prediction, leading to improved model accuracy.

  • Calculates local feature importance using spatial weights through the gain function.
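The neighborhood idea in the first bullet can be sketched in plain Python. The snippet is illustrative only and does not use geoxgboost: the helper names adaptive_neighbors and fixed_neighbors are hypothetical, Euclidean distance is assumed, and the sketch counts the focal unit as part of its own neighborhood, which may differ from what the library does.

```python
import math


def _distances(coords, i):
    """Euclidean distance from unit i to every unit (assumed metric)."""
    xi, yi = coords[i]
    return [math.hypot(x - xi, y - yi) for x, y in coords]


def adaptive_neighbors(coords, i, bw):
    """Adaptive kernel sketch: indices of the bw nearest units to unit i."""
    d = _distances(coords, i)
    # Sort by distance, breaking ties by index so the result is deterministic.
    order = sorted(range(len(coords)), key=lambda j: (d[j], j))
    return order[:bw]


def fixed_neighbors(coords, i, bw):
    """Fixed kernel sketch: indices of all units within distance bw of unit i."""
    d = _distances(coords, i)
    return [j for j in range(len(coords)) if d[j] <= bw]


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(adaptive_neighbors(coords, 0, bw=3))   # [0, 1, 2]
print(fixed_neighbors(coords, 0, bw=1.5))    # [0, 1]
```

With an adaptive kernel the bandwidth is a count of neighbors, so every local model sees the same number of units; with a fixed kernel it is a distance, so the number of units varies with local density.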

Beyond being a predictive tool, G-XGBoost is also a valuable exploratory tool for identifying spatial heterogeneity. It evaluates how spatially weighted feature importance varies across different locations, enhancing the model’s interpretability. The theoretical presentation, mathematical formulation, and experimental results of G-XGBoost, across six regression models and six benchmark datasets, can be found in the paper listed under How to cite (Grekousis, 2025).

_images/documentation.png

Figure: G-XGBoost ensemble for spatially local regression. A different sub-model is built for every spatial unit (i), including only its neighboring units. The optimal bandwidth value (either distance or number of nearest neighbors) is defined by minimizing the cross validation criterion. Hyperparameters are selected using grid search through nested cross validation of the global model. G-XGBoost results from the ensemble (y_ens) of global (y_gl) and local (y_loc) models using the alpha weight (α) regularization hyperparameter. Feature importance is produced for the local models.
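As a rough illustration of the ensemble described in the figure, the sketch below blends a global and a local prediction with an alpha weight. The exact formula used by G-XGBoost is given in the paper, not on this page: the convex combination and the choice to apply α to the local prediction are assumptions, and ensemble_prediction is a hypothetical helper, not part of geoxgboost.

```python
def ensemble_prediction(y_gl, y_loc, alpha):
    """Blend global and local predictions (assumed convex combination).

    Assumption: alpha weights the local prediction and (1 - alpha) the
    global one. The library's actual formula may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return [alpha * loc + (1.0 - alpha) * gl for gl, loc in zip(y_gl, y_loc)]


y_gl = [20.0, 30.0]    # predictions of the global model
y_loc = [24.0, 26.0]   # predictions of the local models
y_ens = ensemble_prediction(y_gl, y_loc, alpha=0.5)
print(y_ens)  # [22.0, 28.0]
```

Under this assumption, alpha=1 reproduces the local predictions and alpha=0 the global ones.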

Installation

pip install geoxgboost

Tutorial

A comprehensive tutorial is available on GitHub, guiding users through the entire process, from project setup in PyCharm to running the demo. No prior Python knowledge is required.

The tutorial provides step-by-step instructions on how to:

  • Download and install PyCharm to create the Demo project.

  • Download the necessary data and install the geoxgboost library.

  • Run the geoxgboost algorithm.

  • Extract outputs in Excel format.

  • Understand the content of each output file.

Like any machine learning algorithm, G-XGBoost requires hyperparameter tuning, and guidelines for tuning are provided in the accompanying paper. The demo uses predefined hyperparameter values for convenience, but users are encouraged to experiment with different hyperparameter combinations to optimize model performance. Geoxgboost supports grid search through the built-in function create_param_grid. The functions, parameters, and examples of the geoxgboost package are documented at https://geoxgboost.readthedocs.io/en/latest/

Demo data

Boston housing dataset

Download data from: https://github.com/geogreko/DemoGXGBoost/tree/main

The following files are included in the GitHub repository:

  1. Coords.csv: Coordinates of the spatial units.

  2. Data.csv: Dependent and independent variables.

  3. DataDescription.xlsx: Data description.

  4. GXGB_call_demo.py: Python script to analyze the Boston housing dataset.

  5. PredictCoords.csv: Coordinates of the spatial units for prediction.

  6. PredictData.csv: Values of the independent variables for the spatial units where predictions will be made.

  7. Tutorial_geoxgboost.pdf: A guide for using the demo.

How to cite

Grekousis, G. (2025). Geographical-XGBoost: A new ensemble model for spatially local regression based on gradient-boosted trees. Journal of Geographical Systems. https://doi.org/10.1007/s10109-025-00465-4

Geoxgboost is freely available provided the above paper is cited.

geoxgboost package

geoxgboost module

This module implements Geographical-XGBoost for spatially local regression.

The module contains the following functions:

  • create_param_grid - Returns the grid of up to three hyperparameters.

  • nestedCV - Returns the optimized hyperparameters’ values and generalization error of XGBoost.

  • global_xgb - Returns the global XGBoost model.

  • optimize_bw - Returns the optimized bandwidth value.

  • gxgb - Returns geographical XGBoost, local prediction and related statistics.

  • predict_gxgb - Returns prediction in unseen data.

geoxgboost.geoxgboost.create_param_grid(Param1, Param1_Values, Param2=None, Param2_Values=None, Param3=None, Param3_Values=None)

Creates a grid of up to three hyperparameters for tuning.

Examples

>>> from geoxgboost.geoxgboost import create_param_grid
>>> Param1 = 'n_estimators'
>>> Param1_Values = [100, 200, 300, 500]
>>> Param2 = 'learning_rate'
>>> Param2_Values = [0.1, 0.05, 0.01]
>>> Param3 = 'max_depth'
>>> Param3_Values = [2, 3, 4, 6]
>>> param_grid = create_param_grid(Param1, Param1_Values, Param2, Param2_Values, Param3, Param3_Values)
Parameters:
  • Param1 – 1st hyperparameter name e.g., ‘n_estimators’

  • Param1_Values – values for search e.g., [100, 200, 500]

  • Param2 – 2nd hyperparameter name e.g., ‘learning_rate’. Default=None.

  • Param2_Values – values for search e.g., [0.1, 0.05,0.01]. Default=None.

  • Param3 – 3rd hyperparameter name e.g., ‘max_depth’. Default=None.

  • Param3_Values – values for search e.g., [2,3,4,6]. Default=None.

Returns:

param_grid. Can be used in the nestedCV function to fine-tune hyperparameters.

Change the argument of Param1, Param2, or Param3 to any other hyperparameter available for tuning XGBoost, such as subsample, colsample_bytree, lambda, or alpha.

For example:

Param1 = 'subsample'

Param1_Values = [0.5, 0.7, 0.9]

A complete list of hyperparameters can be found here: https://xgboost.readthedocs.io/en/stable/parameter.html

Tip: This function can be iteratively repeated with different sets of hyperparameters. See an example in GXGB_call_demo.py at the DemoGXGBoost in GitHub.
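To make the size of a three-hyperparameter search concrete, the following standalone sketch enumerates every combination in plain Python. It does not call geoxgboost, and the dictionary layout is an assumption chosen for illustration; the actual structure returned by create_param_grid is not described on this page.

```python
from itertools import product

# Illustrative grid using the same values as the example above.
grid = {
    "n_estimators": [100, 200, 300, 500],
    "learning_rate": [0.1, 0.05, 0.01],
    "max_depth": [2, 3, 4, 6],
}

# Cartesian product: one dictionary per candidate hyperparameter setting.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 48 candidates (4 * 3 * 4)
print(combinations[0])    # {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 2}
```

Because the number of candidates is the product of the list lengths, tuning a few hyperparameters at a time over several rounds, as the tip above suggests, keeps each search small.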

geoxgboost.geoxgboost.global_xgb(X, y, params, feat_importance='gain', test_size=0.33, seed=7, path_save=False)

Calculates global XGBoost

Parameters:
  • X – dataframe with the independent variables values

  • y – dataframe with the dependent variable values

  • params – hyperparameter values. Type: dictionary (nestedCV can be used to produce params)

  • feat_importance – type of feature importance: ‘gain’, ‘weight’, ‘cover’, ‘total gain’, ‘total cover’. Default=‘gain’.

  • test_size – share of the data held out for testing, given as a fraction (e.g., 0.33 for 33%). Default=0.33.

  • seed – seed value. Default=7.

  • path_save – output folder. Default=False.

Returns:

global xgboost performance
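The roles of test_size and seed can be illustrated with a seeded hold-out split. This is a sketch of the general technique, not the code global_xgb runs: holdout_split is a hypothetical helper, and the rounding of the test set size is an assumption.

```python
import random


def holdout_split(n_samples, test_size=0.33, seed=7):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))  # 67 33
```

Calling the helper again with the same seed returns the same split, which is what makes performance figures comparable between runs.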

geoxgboost.geoxgboost.gxgb(X, y, Coords, params, bw, Kernel='Adaptive', spatial_weights=False, feat_importance='gain', alpha_wt_type='varying', alpha_wt=1, test_size=0.3, seed=7, n_splits=5, path_save=False)

Implements GeoXGBoost

Parameters:
  • X – dataframe with the independent variables values

  • y – dataframe with the dependent variable values

  • Coords – dataframe with the coordinates of spatial units

  • params – hyperparameter values

  • bw – bandwidth value

  • Kernel – ‘Adaptive’ or ‘Fixed’ kernel type to be used. Default= ‘Adaptive’.

  • spatial_weights – spatial weights matrix. Default=False.

  • feat_importance – type of feature importance. Available methods: ‘gain’, ‘weight’, ‘cover’, ‘total gain’, ‘total cover’. Default=‘gain’.

  • alpha_wt_type – type of alpha_wt. Available methods: ‘varying’, ‘fixed’. Default=‘varying’.

  • alpha_wt – alpha weight value. It takes values between 0 and 1. Default=1.

  • test_size – share of the data held out for testing, given as a fraction. Default=0.3.

  • seed – seed value. Default=7.

  • n_splits – number of splits for k-fold cross validation. Default=5.

  • path_save – output folder. Default=False.

Returns:

local prediction and related statistics

geoxgboost.geoxgboost.nestedCV(X, y, param_grid, Param1, Param2=None, Param3=None, params=None, path_save=False, n_OuterSplits=5, n_InnerSplits=3)

Applies nested cross validation for tuning up to three hyperparameters and calculating model generalization error.

Parameters:
  • X – dataframe with the independent variables values

  • y – dataframe with the dependent variable values

  • params – initial hyperparameter values. Type:dictionary

  • param_grid – grid values (output of the create_param_grid function)

  • Param1 – name of the 1st hyperparameter used in the create_param_grid function

  • Param2 – name of the 2nd hyperparameter used in the create_param_grid function. Default=None.

  • Param3 – name of the 3rd hyperparameter used in the create_param_grid function. Default=None.

  • path_save – output folder. Default=False.

  • n_OuterSplits – number of outer splits. Default=5.

  • n_InnerSplits – number of inner splits. Default=3.

Returns:

optimized hyperparameters’ values and generalization error of model through nestedCV
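The nesting can be pictured with index splits alone. The sketch below is not the nestedCV implementation: it uses a hypothetical k_fold_indices helper with contiguous, unshuffled folds and only shows how each outer training set is split again for tuning, so that the outer test fold never influences hyperparameter selection.

```python
def k_fold_indices(indices, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    indices = list(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [len(indices) // n_splits] * n_splits
    for k in range(len(indices) % n_splits):
        fold_sizes[k] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


n_samples, n_outer, n_inner = 30, 5, 3
inner_fits = 0
for outer_train, outer_test in k_fold_indices(range(n_samples), n_outer):
    # Inner loop: tune hyperparameters using only the outer training data.
    for inner_train, inner_val in k_fold_indices(outer_train, n_inner):
        assert not set(inner_val) & set(outer_test)  # no leakage into tuning
        inner_fits += 1
    # The outer test fold is then used once to estimate generalization error.

print(inner_fits)  # 15 inner fits per grid candidate (5 outer * 3 inner)
```

With the default n_OuterSplits=5 and n_InnerSplits=3, each candidate in the grid is therefore fitted 15 times during tuning, which is worth keeping in mind when sizing the grid.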

geoxgboost.geoxgboost.optimize_bw(X, y, Coords, params, bw_min, bw_max, step=1, Kernel='Adaptive', spatial_weights=True, n_splits=3, path_save=False)

Finds optimal bandwidth value for defining spatial kernels

Examples

>>> optimize_bw(X, y, Coords, params, bw_min=30, bw_max=100, step=10)
Parameters:
  • X – dataframe with the independent variables values

  • y – dataframe with the dependent variable values

  • Coords – dataframe with the coordinates of spatial units

  • params – hyperparameter values

  • bw_min – min bandwidth value

  • bw_max – max bandwidth value

  • step – incremental step. Default=1.

  • Kernel – ‘Adaptive’ or ‘Fixed’ kernel type to be used. Default= ‘Adaptive’.

  • spatial_weights – spatial weights matrix. Default= True.

  • n_splits – number of splits for k-fold cross validation. Default=3.

  • path_save – output folder. Default=False.

Returns:

optimal bandwidth value
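The search can be sketched as a loop over candidate bandwidths that keeps the one with the lowest cross-validation score. The scoring function below is a hypothetical stand-in (a real run would fit and validate local models at each bandwidth), candidate_bandwidths is not a geoxgboost function, and treating bw_max as inclusive is an assumption.

```python
def candidate_bandwidths(bw_min, bw_max, step=1):
    """Bandwidths from bw_min to bw_max (assumed inclusive) in steps of step."""
    return list(range(bw_min, bw_max + 1, step))


def toy_cv_score(bw):
    """Hypothetical CV error curve with its minimum at bw = 60."""
    return (bw - 60) ** 2 / 100.0 + 1.0


candidates = candidate_bandwidths(30, 100, step=10)
best_bw = min(candidates, key=toy_cv_score)

print(candidates)  # [30, 40, 50, 60, 70, 80, 90, 100]
print(best_bw)     # 60
```

A coarse step keeps the number of fitted models low; a second pass with a smaller step around the best coarse value is one way to refine the result.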

geoxgboost.geoxgboost.predict_gxgb(DataPredict, CoordsPredict, Coords, Output_GXGB_LocalModel, alpha_wt=0.5, alpha_wt_type='varying', path_save=False)

Prediction in unseen data

Parameters:
  • DataPredict – dataframe containing the values of the independent variables referring to the spatial units in which the prediction will take place.

  • CoordsPredict – dataframe containing the coordinates of the spatial units in which the prediction will take place.

  • Coords – dataframe with the coordinates of all spatial units on which the original GXGB model was trained

  • Output_GXGB_LocalModel – the trained model created by the gxgb function

  • alpha_wt – the value of the alpha weight. It ranges from 0 to 1. Default=0.5.

  • alpha_wt_type – type of alpha_wt. Available methods: ‘varying’, ‘fixed’. Default=‘varying’.

  • path_save – output folder. Default=False.

Returns:

prediction in unseen data.
