Add some docs for copilot
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions
172
README.md
@@ -1,45 +1,145 @@
# eSPA for RTS
# Entropice

The goal of this project is to use the entropy-optimal Scalable Probabilistic Approximations algorithm (eSPA) to create a model that estimates the density of Retrogressive Thaw Slumps (RTS) across the globe at different levels of detail.
The hope is that successful training yields new knowledge about RTS proxies.
**Geospatial Machine Learning for Arctic Permafrost Degradation.**
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.

## Setup
## Scientific Background

```sh
uv sync
### Retrogressive Thaw Slumps

Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:

- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release

Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.

### The Challenge

Current remote sensing approaches first map a specific landscape feature and then extract spatio-temporal statistical information from the resulting dataset.

Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:

- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales

Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:

- Large data transfers between satellite imagery providers and the HPC systems where the models run
- High energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potentially wasted compute when running inference on regions where the target landscape feature clearly does not exist

### Our Approach

Instead of mapping globally and then computing spatio-temporal statistics, Entropice learns spatio-temporal patterns from a small subset, using a large variety of data features, to produce an educated guess about the spatio-temporal statistics of a landscape feature.

Entropice addresses this by:

1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at several low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies

This will hopefully lead to the following advances in permafrost research:

- Better understanding of RTS occurrence
- Potential proxy for ice-rich permafrost
- Less wasted compute in image segmentation pipelines
- Better modelling by providing better starting conditions

### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)

eSPA is a probabilistic classification framework that:

- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis

This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.

## Key Features

- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources

## Quick Start

### Installation

Requires Python 3.13 and a CUDA 12 compatible GPU.

```bash
pixi install
```

## Project Plan
This sets up the complete environment including RAPIDS, PyTorch, and geospatial libraries.

1. Create global hexagon grids with h3
2. Enrich the grids with data from various sources and with labels from DARTS v2
3. Use eSPA for simple classification: hex has [many slumps / some slumps / few slumps / no slumps]
4. Use SPARTAn for regression: one for slump density (area) and one for total number of slumps
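
The classification target in step 3 can be sketched as a simple binning of a hexagon's RTS density. This is a minimal illustration only: the thresholds in `CLASS_EDGES` and the helper `density_class` are hypothetical and are not taken from the project.

```python
from bisect import bisect_right

# Hypothetical class boundaries on RTS density (fraction of land area).
# These values are illustrative assumptions, not the project's thresholds.
CLASS_EDGES = [0.001, 0.01]
CLASS_NAMES = ["few slumps", "some slumps", "many slumps"]


def density_class(rts_density: float) -> str:
    """Map an RTS density to one of the four classes named in the plan."""
    if rts_density < 0:
        raise ValueError("RTS density cannot be negative")
    if rts_density == 0:
        return "no slumps"
    # bisect_right counts how many edges are <= the density,
    # which is exactly the index of the matching class name.
    return CLASS_NAMES[bisect_right(CLASS_EDGES, rts_density)]


if __name__ == "__main__":
    for density in (0.0, 0.0005, 0.005, 0.5):
        print(density, "->", density_class(density))
```

Keeping "no slumps" as its own class, separate from the binned positive densities, mirrors the four categories listed in the plan.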
### Running the Pipeline

### Data Sources and Engineering
Execute the numbered scripts to process data and train models:

- Labels
  - `"year"`: Year of observation
  - `"area"`: Total land area of the hexagon
  - `"rts_density"`: Area of RTS divided by total land area
  - `"rts_count"`: Number of single RTS instances
- ERA5 (starting 40 years before `"year"`)
  - `"temp_yearXXXX_qY"`: Y-th quantile temperature of year XXXX. Used to feed the temperature distribution into the model.
  - `"thawing_days_yearXXXX"`: Number of thawing days in year XXXX.
  - `"precip_yearXXXX_qY"`: Y-th quantile precipitation of year XXXX. Similar to temperature.
  - `"temp_5year_diff_XXXXtoXXXX_qY"`: Difference in the Y-th quantile temperature between year XXXX and year XXXX, which are always 5 years apart.
  - `"temp_10year_diff_XXXXtoXXXX_qY"`: Difference in the Y-th quantile temperature between year XXXX and year XXXX, which are always 10 years apart.
  - `"temp_diff_qY"`: Difference in the Y-th quantile temperature between year XXXX and year XXXX, which are always 10 years apart.
- ArcticDEM
  - `"dissection_index"`: Dissection index, (max - min) / max
  - `"max_elevation"`: Maximum elevation
  - `"elevationX_density"`: Area where the elevation is larger than X divided by the total land area
- TCVIS
  - ???
- Wildfire???
- Permafrost???
- GroundIceContent???
- Biome
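
The terrain features in the list above reduce to a few lines of arithmetic. The sketch below assumes a hexagon is represented as a list of per-pixel elevations with matching per-pixel land areas; that representation and the function names are assumptions for illustration, not the project's implementation.

```python
import math


def dissection_index(elevations: list[float]) -> float:
    """Dissection index as defined in the list above: (max - min) / max."""
    highest = max(elevations)
    if highest == 0:
        # The ratio is undefined for a maximum elevation of zero.
        return math.nan
    return (highest - min(elevations)) / highest


def elevation_density(
    elevations: list[float], pixel_areas: list[float], threshold: float
) -> float:
    """Share of the land area whose elevation is larger than `threshold`."""
    total_area = sum(pixel_areas)
    area_above = sum(
        area for elev, area in zip(elevations, pixel_areas) if elev > threshold
    )
    return area_above / total_area


if __name__ == "__main__":
    elevations = [100.0, 150.0, 200.0]
    pixel_areas = [1.0, 1.0, 2.0]
    print(dissection_index(elevations))
    print(elevation_density(elevations, pixel_areas, threshold=120.0))
```

With elevations of 100, 150 and 200, the dissection index is (200 - 100) / 200 = 0.5, and with pixel areas of 1, 1 and 2 the share of land above 120 is 3 / 4.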
```bash
scripts/00grids.sh          # Generate spatial grids
scripts/01darts.sh          # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh     # Extract satellite embeddings
scripts/03era5.sh           # Process climate data
scripts/04arcticdem.sh      # Compute terrain features
scripts/05train.sh          # Train models
```

**About temporals** Every label has its own year; all time-dependent data features, e.g. `"temp_5year_diff_XXXXtoXXXX_qY"`, are calculated relative to that year.
The number of years added from a dataset is always the same: for ERA5, an observation from 2024 uses data starting in 1984, and an observation from 2023 uses data starting in 1983.
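
As a small sketch of this year alignment: the 40-year offset comes from the text above, while the helper names and the way the quantile is written into the feature name (e.g. `q50`) are assumptions made for illustration.

```python
ERA5_HISTORY_YEARS = 40  # from the text: a 2024 observation starts in 1984


def era5_start_year(label_year: int) -> int:
    """First ERA5 year used for a label observed in `label_year`."""
    return label_year - ERA5_HISTORY_YEARS


def temp_diff_feature_name(start_year: int, span: int, quantile: int) -> str:
    """Build a name following the `temp_5year_diff_XXXXtoXXXX_qY` pattern.

    The quantile formatting is hypothetical; the source only shows `qY`.
    """
    end_year = start_year + span
    return f"temp_{span}year_diff_{start_year}to{end_year}_q{quantile}"


if __name__ == "__main__":
    start = era5_start_year(2024)
    print(start)
    print(temp_diff_feature_name(start, span=5, quantile=50))
```

Because the offset is fixed, two labels from different years get feature columns that are aligned relative to their own observation year rather than to a shared calendar year.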
### Visualizing Results

Launch the interactive dashboard:

```bash
pixi run dashboard
```

Explore training data distributions, cross-validation results, feature importance, and spatial predictions.

## Data Sources

- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32 m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings

## Project Structure

- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)

## Documentation

- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards

## Research Goals

1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making

## License

TODO

## Citation

If you use Entropice in your research, please cite:

```txt
TODO
```