# Entropice
**Geospatial Machine Learning for Arctic Permafrost Degradation.**
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.
## Scientific Background
### Retrogressive Thaw Slumps
Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:
- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release
Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.
### The Challenge
Current remote sensing approaches first map a specific landscape feature and then derive spatio-temporal statistics from the resulting dataset.
Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:
- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales
Modern mapping approaches use machine learning to create segmentation labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:
- Large data transfers between satellite imagery providers and the HPC systems where the models run
- High energy consumption for both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the target landscape feature clearly does not exist
### Our Approach
Instead of mapping globally and then calculating spatio-temporal statistics, Entropice aims to learn spatio-temporal patterns from a small labeled subset, using a large variety of data features, to produce an educated guess about the spatio-temporal statistics of a landscape feature.
Entropice addresses this by:
1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at several low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies
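As a minimal sketch of the prediction target in step 3, the snippet below bins per-cell RTS counts into ordered density classes. The class names, thresholds, and cell IDs are hypothetical and chosen only for illustration; they are not the binning Entropice actually uses.

```python
from bisect import bisect_right

# Hypothetical class boundaries (RTS count per grid cell); not taken from Entropice.
THRESHOLDS = [1, 5, 20]
CLASS_NAMES = ["none", "sparse", "moderate", "dense"]


def density_class(rts_count: int) -> str:
    """Map a non-negative RTS count to an ordered density class."""
    if rts_count < 0:
        raise ValueError("RTS count must be non-negative")
    # bisect_right returns how many thresholds are <= rts_count,
    # which is the index of the class the count falls into.
    return CLASS_NAMES[bisect_right(THRESHOLDS, rts_count)]


# Hypothetical cell IDs with the number of RTS polygons found in each cell.
counts_per_cell = {"cell_a": 0, "cell_b": 3, "cell_c": 42}
labels = {cell: density_class(n) for cell, n in counts_per_cell.items()}
print(labels)
```

With these assumed thresholds, a cell with no RTS falls into `none`, 1 to 4 into `sparse`, 5 to 19 into `moderate`, and 20 or more into `dense`.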
We hope this approach leads to the following advances in permafrost research:
- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Less compute waste in image segmentation pipelines
- Better modelling through better starting conditions
### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)
eSPA is a probabilistic classification framework that:
- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis
This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.
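To make "calibrated probabilities" and "uncertainty" more concrete, here is a minimal sketch of two standard diagnostics that apply to any probabilistic classifier: the multi-class Brier score and the Shannon entropy of a predicted distribution. This is not the eSPA algorithm itself, and the probabilities and labels below are made up for illustration.

```python
import math


def brier_score(probs: list[list[float]], labels: list[int]) -> float:
    """Mean squared error between predicted class probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


def predictive_entropy(p: list[float]) -> float:
    """Shannon entropy (in nats) of one predicted distribution; higher means less certain."""
    return -sum(p_k * math.log(p_k) for p_k in p if p_k > 0.0)


# Made-up predictions for three grid cells over three density classes.
probs = [[0.8, 0.15, 0.05], [0.4, 0.35, 0.25], [1.0, 0.0, 0.0]]
labels = [0, 1, 0]

print(f"Brier score: {brier_score(probs, labels):.3f}")
for p in probs:
    print(f"entropy: {predictive_entropy(p):.3f}")
```

A lower Brier score indicates probabilities that are closer to the observed outcomes, while high entropy flags cells where the model spreads its probability mass across classes, as one would expect in poorly sampled regions.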
## Key Features
- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources
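To give a feel for the grid resolutions listed above, the following sketch computes the global number of cells and the mean cell area for each level. It assumes that the HEALPix "resolution" refers to the HEALPix order (so `nside = 2**order`) and uses an approximate Earth surface area; the counts are for the whole globe, not only the Arctic subset that Entropice works on.

```python
EARTH_SURFACE_KM2 = 510.1e6  # approximate total surface area of the Earth


def h3_cell_count(resolution: int) -> int:
    """Number of H3 cells at a resolution: 122 base cells, aperture-7 refinement."""
    return 2 + 120 * 7**resolution


def healpix_cell_count(order: int) -> int:
    """Number of HEALPix pixels at a given order, where nside = 2**order."""
    nside = 2**order
    return 12 * nside**2


for name, count_fn, levels in [
    ("H3", h3_cell_count, range(3, 7)),
    ("HEALPix", healpix_cell_count, range(6, 11)),
]:
    for level in levels:
        n = count_fn(level)
        print(f"{name} level {level}: {n:>10,} cells, ~{EARTH_SURFACE_KM2 / n:,.0f} km2 each")
```

Each H3 step divides the mean cell area by roughly 7 and each HEALPix step divides it by 4, which is one reason comparing model performance across levels is informative.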
## Quick Start
### Installation
Requires Python 3.13 and a CUDA 12 compatible GPU.
```bash
pixi install
```
This sets up the complete environment including RAPIDS, PyTorch, and geospatial libraries.
### Running the Pipeline
Execute the numbered scripts to process data and train models:
```bash
scripts/00grids.sh # Generate spatial grids
scripts/01darts.sh # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh # Extract satellite embeddings
scripts/03era5.sh # Process climate data
scripts/04arcticdem.sh # Compute terrain features
scripts/05train.sh # Train models
```
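If you prefer to run all stages in one go, a small driver can discover the scripts by their numeric prefix and execute them in order. This is only a sketch: `ordered_stages` and `run_pipeline` are hypothetical helpers that are not part of Entropice, and the sketch assumes each script can be invoked with `bash` from the repository root without arguments.

```python
import re
import subprocess
from pathlib import Path


def ordered_stages(script_dir: Path) -> list[Path]:
    """Return scripts named like '03era5.sh', sorted by their numeric prefix."""
    stages = [p for p in script_dir.glob("*.sh") if re.match(r"\d+", p.name)]
    return sorted(stages, key=lambda p: int(re.match(r"\d+", p.name).group()))


def run_pipeline(script_dir: Path = Path("scripts")) -> None:
    """Run every stage in order; check=True aborts on the first failing stage."""
    for script in ordered_stages(script_dir):
        print(f"Running {script}")
        subprocess.run(["bash", str(script)], check=True)
```

Calling `run_pipeline()` from the repository root would then run `00grids.sh` through `05train.sh` in sequence and stop as soon as one stage exits with a non-zero status, so later stages never run on incomplete inputs.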
### Visualizing Results
Launch the interactive dashboard:
```bash
pixi run dashboard
```
Explore training data distributions, cross-validation results, feature importance, and spatial predictions.
## Data Sources
- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32 m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings
## Project Structure
- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)
## Documentation
- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards
## Research Goals
1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making
## License
TODO
## Citation
If you use Entropice in your research, please cite:
```txt
TODO
```