Add some docs for copilot

Tobias Hölzer 2025-12-28 20:11:11 +01:00
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions

172
README.md

@@ -1,45 +1,145 @@
# eSPA for RTS
# Entropice
The goal of this project is to use the entropy-optimal Scalable Probabilistic Approximations algorithm (eSPA) to build a model that estimates the density of Retrogressive Thaw Slumps (RTS) across the globe at different levels of detail.
A successfully trained model could also yield new knowledge about RTS proxies.
**Geospatial Machine Learning for Arctic Permafrost Degradation.**
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.
## Setup
## Scientific Background
```sh
uv sync
### Retrogressive Thaw Slumps
Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:
- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release
Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.
### The Challenge
Current remote sensing approaches first map a specific landscape feature and then extract spatio-temporal statistics from the resulting dataset.
Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:
- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales
Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:
- Large data transfers between satellite imagery providers and the HPC systems where the models run
- High energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the landscape feature clearly does not exist
### Our Approach
Instead of mapping globally and then calculating spatio-temporal statistics, Entropice learns spatio-temporal patterns from a small labeled subset combined with a large variety of data features. The result is an educated guess about the spatio-temporal statistics of a landscape feature.
Entropice addresses this by:
1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at several low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies
We hope this leads to the following advances in permafrost research:
- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Less compute waste in image segmentation pipelines
- Better modelling through better starting conditions
### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)
eSPA is a probabilistic classification framework that:
- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis
This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.
## Key Features
- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources
## Quick Start
### Installation
Requires Python 3.13 and a CUDA 12 compatible GPU.
```bash
pixi install
```
## Project Plan
This sets up the complete environment including RAPIDS, PyTorch, and geospatial libraries.
1. Create global hexagon grids with h3
2. Enrich the grids with data from various sources and with labels from DARTS v2
3. Use eSPA for simple classification: hex has [many slumps / some slumps / few slumps / no slumps]
4. Use SPARTAn for regression: one model for slump density (area) and one for the total number of slumps
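Step 3 above turns a per-cell slump measure into four ordinal classes. The following is a minimal sketch of such a binning, not the project's implementation. The README does not state the real cut-offs, so the count thresholds in `CLASS_EDGES` and the helper name `slump_class` are purely illustrative assumptions.

```python
from bisect import bisect_left

# Hypothetical upper bounds (inclusive) on the RTS count per grid cell.
# These thresholds are assumptions for illustration only.
CLASS_EDGES = [0, 5, 20]
CLASS_NAMES = ["no slumps", "few slumps", "some slumps", "many slumps"]


def slump_class(rts_count: int) -> str:
    """Map an RTS count to one of the four classes named in step 3."""
    if rts_count < 0:
        raise ValueError("rts_count must be non-negative")
    # bisect_left returns the index of the first edge >= rts_count,
    # which is exactly the index of the matching class.
    return CLASS_NAMES[bisect_left(CLASS_EDGES, rts_count)]


if __name__ == "__main__":
    for count in (0, 3, 12, 57):
        print(count, "->", slump_class(count))
```

The same pattern works for binning `"rts_density"` instead of the count: only the edge values change.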
### Running the Pipeline
### Data Sources and Engineering
Execute the numbered scripts to process data and train models:
- Labels
- `"year"`: Year of observation
- `"area"`: Total land-area of the hexagon
- `"rts_density"`: Area of RTS divided by total land-area
- `"rts_count"`: Number of single RTS instances
- ERA5 (starting 40 years before `"year"`)
- `"temp_yearXXXX_qY"`: Y-th quantile temperature of year XXXX. Used to enter the temperature distribution into the model.
- `"thawing_days_yearXXXX"`: Number of thawing-days of year XXXX.
- `"precip_yearXXXX_qY"`: Y-th quantile precipitation of year XXXX. Similar to temperature.
  - `"temp_5year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th quantile temperature between the two named years, which are always 5 years apart.
  - `"temp_10year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th quantile temperature between the two named years, which are always 10 years apart.
  - `"temp_diff_qY"`: Difference of the Y-th quantile temperature between two years that are always 10 years apart.
- ArcticDEM
- `"dissection_index"`: Dissection Index, (max - min) / max
- `"max_elevation"`: Maximum elevation
- `"elevationX_density"`: Area where the elevation is larger than X divided by the total land-area
- TCVIS
- ???
- Wildfire???
- Permafrost???
- GroundIceContent???
- Biome
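The label and ArcticDEM definitions in the list above are simple ratios, which can be sketched as follows. This is a hedged illustration: the function signatures are hypothetical, and `elevation_density` assumes equal-area elevation pixels, which the README does not specify.

```python
from collections.abc import Sequence


def rts_density(rts_area: float, land_area: float) -> float:
    """`rts_density`: area of RTS divided by the total land area of the cell."""
    if land_area <= 0:
        raise ValueError("land_area must be positive")
    return rts_area / land_area


def dissection_index(elevations: Sequence[float]) -> float:
    """`dissection_index`: (max - min) / max over the cell's elevations."""
    highest, lowest = max(elevations), min(elevations)
    if highest <= 0:
        # The ratio is undefined for a non-positive maximum elevation.
        raise ValueError("maximum elevation must be positive")
    return (highest - lowest) / highest


def elevation_density(
    elevations: Sequence[float],
    threshold: float,
    pixel_area: float,
    land_area: float,
) -> float:
    """`elevationX_density`: area above `threshold` divided by the land area.

    Assumes every elevation sample covers the same `pixel_area`.
    """
    if land_area <= 0:
        raise ValueError("land_area must be positive")
    area_above = pixel_area * sum(1 for z in elevations if z > threshold)
    return area_above / land_area


if __name__ == "__main__":
    cell = [100.0, 200.0, 300.0, 400.0]
    print(dissection_index(cell))
    print(elevation_density(cell, threshold=250.0, pixel_area=1.0, land_area=4.0))
```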
```bash
scripts/00grids.sh # Generate spatial grids
scripts/01darts.sh # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh # Extract satellite embeddings
scripts/03era5.sh # Process climate data
scripts/04arcticdem.sh # Compute terrain features
scripts/05train.sh # Train models
```
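The numbered prefixes imply a strict execution order. If the stages are driven from Python rather than run by hand, that ordering could be enforced roughly as below. This is only a sketch: `stage_order` and `run_pipeline` are hypothetical helpers, not part of Entropice, and running each script through `bash` is an assumption.

```python
import re
import subprocess
from pathlib import Path

# Script paths as listed in the README.
STAGES = [
    "scripts/00grids.sh",
    "scripts/01darts.sh",
    "scripts/02alphaearth.sh",
    "scripts/03era5.sh",
    "scripts/04arcticdem.sh",
    "scripts/05train.sh",
]


def stage_order(path: str) -> int:
    """Return the numeric prefix of a script name, e.g. 3 for '03era5.sh'."""
    match = re.match(r"(\d+)", Path(path).name)
    if match is None:
        raise ValueError(f"script has no numeric prefix: {path}")
    return int(match.group(1))


def run_pipeline(scripts, runner=subprocess.run):
    """Run the scripts in prefix order, stopping at the first failure.

    `runner` is injectable so the ordering can be tested without
    executing anything.
    """
    for script in sorted(scripts, key=stage_order):
        # check=True makes subprocess.run raise on a non-zero exit code.
        runner(["bash", script], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_pipeline(STAGES, runner=lambda cmd, check: print(" ".join(cmd)))
```

Stopping at the first failure matters here because later stages consume the outputs of earlier ones.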
**About temporals** Every label has its own year. All time-dependent data features, e.g. `"temp_5year_diff_XXXXtoXXXX_qY"`, are calculated relative to that year.
The number of years taken from a dataset is always the same. For ERA5, for example, the data for an observation from 2024 would start in 1984, and the data for an observation from 2023 would start in 1983.
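The fixed-length window described above can be written down as a small helper. This is a minimal sketch under one explicit assumption: the window spans exactly 40 calendar years and ends the year before the label year. The README only fixes the start year, and the names `era5_start_year` and `era5_years` are illustrative.

```python
ERA5_HISTORY_YEARS = 40  # from the README: ERA5 starts 40 years before the label year


def era5_start_year(label_year: int, history_years: int = ERA5_HISTORY_YEARS) -> int:
    """First ERA5 year used for a label observed in `label_year`."""
    return label_year - history_years


def era5_years(label_year: int, history_years: int = ERA5_HISTORY_YEARS) -> list[int]:
    """Calendar years in the ERA5 window for one label.

    Assumption: the label year itself is excluded, so every label
    gets exactly `history_years` years of climate history.
    """
    start = era5_start_year(label_year, history_years)
    return list(range(start, label_year))


if __name__ == "__main__":
    print(era5_start_year(2024))  # 1984
    print(era5_start_year(2023))  # 1983
    print(len(era5_years(2024)))  # 40
```

Because the window length is constant, every label ends up with the same number of temporal features regardless of its observation year.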
### Visualizing Results
Launch the interactive dashboard:
```bash
pixi run dashboard
```
Explore training data distributions, cross-validation results, feature importance, and spatial predictions.
## Data Sources
- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32 m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings
## Project Structure
- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)
## Documentation
- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards
## Research Goals
1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making
## License
TODO
## Citation
If you use Entropice in your research, please cite:
```txt
TODO
```