# Entropice

**Geospatial Machine Learning for Arctic Permafrost Degradation.**

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.

The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.

## Scientific Background

### Retrogressive Thaw Slumps

Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:

- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release

Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.

### The Challenge

Current remote sensing approaches first map a specific landscape feature and then extract spatio-temporal statistics from the resulting dataset.

Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:

- Labor-intensive and limited in spatial and temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales

Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:

- Large data transfers between satellite imagery providers and the HPC systems where the models run
- High energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the target landscape feature clearly does not exist

### Our Approach

Instead of mapping globally and then calculating spatio-temporal statistics, Entropice learns spatio-temporal patterns from a small subset, using a large variety of data features, to obtain an educated guess about the spatio-temporal statistics of a landscape feature.

Entropice does this by:

1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at several low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies

We hope this leads to the following advances in permafrost research:

- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Reduced compute waste in image segmentation pipelines
- Improved modelling through better starting conditions

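
The discretization step and the density-class target described above can be illustrated in miniature. This is a minimal sketch, not Entropice's implementation: it swaps the H3 and HEALPix grids for a plain latitude/longitude binning so that it runs with the standard library only. The cell size, the class thresholds, and all function names are assumptions made for illustration, and the coordinates are synthetic.

```python
import math
from collections import Counter

# Hypothetical cell size in degrees; a stand-in for an H3 or HEALPix level.
CELL_SIZE_DEG = 1.0

# Hypothetical thresholds (minimum count, label), checked from highest to lowest.
DENSITY_CLASSES = [(10, "high"), (1, "low"), (0, "none")]


def cell_id(lat: float, lon: float, size: float = CELL_SIZE_DEG) -> tuple[int, int]:
    """Map a coordinate to the index of the regular lat/lon cell containing it."""
    return (math.floor(lat / size), math.floor(lon / size))


def count_per_cell(points: list[tuple[float, float]]) -> Counter:
    """Count RTS observations, given as (lat, lon), falling into each cell."""
    return Counter(cell_id(lat, lon) for lat, lon in points)


def density_class(count: int) -> str:
    """Convert a per-cell count into a discrete density class label."""
    for minimum, label in DENSITY_CLASSES:
        if count >= minimum:
            return label
    raise ValueError("count must be non-negative")


# Synthetic coordinates for illustration only, not real RTS observations.
points = [(70.2, 125.1), (70.8, 125.9), (71.5, 126.3)]
counts = count_per_cell(points)
labels = {cell: density_class(n) for cell, n in counts.items()}
```

Each cell ends up with one class label, which is the kind of target a classifier could then be trained to predict from per-cell environmental features. A real discrete global grid avoids the strong area distortion that a lat/lon binning has at high latitudes, which is one reason to prefer H3 or HEALPix in the Arctic.
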
### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)

eSPA is a probabilistic classification framework that:

- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis

This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.

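
One common way to turn class probabilities into a per-cell uncertainty score is the normalized Shannon entropy of the predicted distribution. The sketch below illustrates that idea only; it says nothing about how eSPA computes its probabilities internally, and the function name and the example probability vectors are made up for illustration.

```python
import math


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of a class distribution, scaled to the range [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(K) maps the uniform distribution over K classes to 1.
    return entropy / math.log(len(probs))


# Made-up predictions for three density classes (none, low, high).
confident = [0.98, 0.01, 0.01]
uncertain = [1 / 3, 1 / 3, 1 / 3]

confident_score = normalized_entropy(confident)
uncertain_score = normalized_entropy(uncertain)
```

A score near 0 means the prediction concentrates on one class, and a score near 1 means the classes are almost equally likely. Mapping this score over the grid is one way to highlight poorly-sampled regions.
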
## Key Features

- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources

## Quick Start

### Installation

Requires Python 3.13 and a CUDA 12-compatible GPU.

```bash
pixi install
```

This sets up the complete environment, including RAPIDS, PyTorch, and geospatial libraries.

### Running the Pipeline

Execute the numbered scripts in order to process data and train models:

```bash
scripts/00grids.sh       # Generate spatial grids
scripts/01darts.sh       # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh  # Extract satellite embeddings
scripts/03era5.sh        # Process climate data
scripts/04arcticdem.sh   # Compute terrain features
scripts/05train.sh       # Train models
```

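
If you prefer to run all stages in one go, a small driver can execute them in order and stop at the first failure. The sketch below is a hypothetical helper, not part of the repository: `run_pipeline` and its parameters are assumptions, and only the script names come from the list above. It also assumes the scripts are executable and are launched from an environment where their dependencies are available.

```python
import subprocess
from pathlib import Path

# Script names as listed above; the runner itself is a hypothetical helper.
STAGES = (
    "00grids.sh",
    "01darts.sh",
    "02alphaearth.sh",
    "03era5.sh",
    "04arcticdem.sh",
    "05train.sh",
)


def run_pipeline(script_dir, stages=STAGES, run=subprocess.run):
    """Run each stage in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    The `run` parameter exists so the function can be tested without
    executing real scripts.
    """
    completed = []
    for name in sorted(stages):  # the numeric prefixes define the order
        # check=True makes subprocess.run raise CalledProcessError
        # when a script exits with a non-zero status.
        run([str(Path(script_dir) / name)], check=True)
        completed.append(name)
    return completed
```

Calling `run_pipeline("scripts")` from the repository root would run the six stages sequentially. Because a failing stage raises an exception, later stages never run on incomplete inputs.
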
### Visualizing Results

Launch the interactive dashboard:

```bash
pixi run dashboard
```

Explore training data distributions, cross-validation results, feature importance, and spatial predictions.

## Data Sources

- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: Terrain elevation at 32 m resolution
- **AlphaEarth**: 64-dimensional satellite image embeddings

## Project Structure

- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)

## Documentation

- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards

## Research Goals

1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making

## License

TODO

## Citation

If you use Entropice in your research, please cite:

```txt
TODO
```