# Entropice

**Geospatial Machine Learning for Arctic Permafrost Degradation.**

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.

## Scientific Background

### Retrogressive Thaw Slumps

Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:

- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release

Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.

### The Challenge

Current remote sensing approaches first map a specific landscape feature and then derive spatio-temporal statistics from the resulting dataset.

Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:

- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales

Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:

- Large data transfers between satellite imagery providers and the HPC systems where the models run
- High energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the target landscape feature clearly does not exist

### Our Approach

Instead of global mapping followed by calculation of spatio-temporal statistics, Entropice tries to learn spatio-temporal patterns from a small subset, using a large variety of data features, to obtain an educated guess about the spatio-temporal statistics of a landscape feature.

Entropice addresses this by:

1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at several low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies

We hope this leads to the following advances in permafrost research:

- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Reduced compute waste in image segmentation pipelines
- Better modelling through better starting conditions

### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)

eSPA is a probabilistic classification framework that:

- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis

This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.

## Key Features

- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources

## Quick Start

### Installation

Requires Python 3.13 and a CUDA 12 compatible GPU.

```bash
pixi install
```

This sets up the complete environment including RAPIDS, PyTorch, and geospatial libraries.

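
To confirm that the installation step above produced a usable environment, a quick check along the following lines may help. This is only a sketch and not part of Entropice: the function name `check_environment` and the package import names (`cupy`, `cuml`, `torch`) are assumptions based on the libraries named in this README, and "Python 3.13" is read here as "3.13 or newer".

```python
import importlib.util
import sys

# Python requirement as stated in this README (interpreted as a minimum; assumption).
REQUIRED_PYTHON = (3, 13)
# Import names assumed from the libraries mentioned above (RAPIDS and PyTorch).
ASSUMED_PACKAGES = ("cupy", "cuml", "torch")


def check_environment(packages=ASSUMED_PACKAGES):
    """Return a dict describing whether the interpreter and packages are present."""
    report = {
        "python_ok": sys.version_info[:2] >= REQUIRED_PYTHON,
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
    }
    for name in packages:
        # find_spec only locates the package; it does not import it or touch the GPU.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Saved under a hypothetical name such as `check_env.py`, the script could be run inside the project environment with `pixi run python check_env.py`. It only reports whether packages can be located, so a passing check does not guarantee that a CUDA 12 device is actually visible.
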
### Running the Pipeline

Execute the numbered scripts to process data and train models:

```bash
scripts/00grids.sh       # Generate spatial grids
scripts/01darts.sh       # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh  # Extract satellite embeddings
scripts/03era5.sh        # Process climate data
scripts/04arcticdem.sh   # Compute terrain features
scripts/05train.sh       # Train models
```

### Visualizing Results

Launch the interactive dashboard:

```bash
pixi run dashboard
```

Explore training data distributions, cross-validation results, feature importance, and spatial predictions.

## Data Sources

- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32 m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings

## Project Structure

- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)

## Documentation

- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards

## Research Goals

1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making

## License

TODO

## Citation

If you use Entropice in your research, please cite:

```txt
TODO
```