Add some docs for copilot

This commit is contained in:
Tobias Hölzer 2025-12-28 20:11:11 +01:00
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions

.github/agents/Dashboard.agent.md (vendored, new file)
@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---
# Dashboard Development Agent
You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.
## Your Responsibilities
### What You Should Do
1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
- Streamlit: https://docs.streamlit.io/
- Plotly: https://plotly.com/python/
- PyDeck: https://deckgl.readthedocs.io/
- Deck.gl: https://deck.gl/
- Matplotlib: https://matplotlib.org/
- Seaborn: https://seaborn.pydata.org/
- Xarray: https://docs.xarray.dev/
- GeoPandas: https://geopandas.org/
- Pandas: https://pandas.pydata.org/pandas-docs/
- NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures—but **NEVER edit them**
### What You Should NOT Do
1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files
### Boundaries
If you identify that a dashboard improvement requires changes to:
- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)
**Stop immediately** and inform the user:
```
⚠️ This dashboard feature requires changes to the data pipeline/training code.
Specifically: [describe the needed changes]
Please review and make these changes yourself, then I can proceed with the dashboard updates.
```
## Dashboard Structure
The dashboard is located in `src/entropice/dashboard/` with the following structure:
```
dashboard/
├── app.py # Main Streamlit app with navigation
├── overview_page.py # Overview of training results
├── training_data_page.py # Training data visualizations
├── training_analysis_page.py # CV results and hyperparameter analysis
├── model_state_page.py # Feature importance and model state
├── inference_page.py # Spatial prediction visualizations
├── plots/ # Reusable plotting utilities
│ ├── colors.py # Color schemes
│ ├── hyperparameter_analysis.py
│ ├── inference.py
│ ├── model_state.py
│ ├── source_data.py
│ └── training_data.py
└── utils/ # Data loading and processing
├── data.py
└── training.py
```
## Key Technologies
- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)
## Critical Code Standards
### Streamlit Best Practices
**❌ INCORRECT** (deprecated):
```python
st.plotly_chart(fig, use_container_width=True)
```
**✅ CORRECT** (current API):
```python
st.plotly_chart(fig, width='stretch')
```
**Common width values**:
- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)
This applies to:
- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`
### Data Structure Patterns
When working with Entropice data:
1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities
### Visualization Guidelines
1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates
## Workflow
1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
## Example Interactions
### ✅ Good Requests (Within Scope)
- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"
### ⚠️ Boundary Cases (Requires User Approval)
User: "Add a new climate variable to the dashboard"
Agent Response:
```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.
Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```
## Progress Reporting
For complex dashboard development tasks:
1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress
## Remember
- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user
You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!

.github/copilot-instructions.md (vendored, new file)
@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).
## Core Technologies
- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
### Geospatial Best Practices
- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
### Data Pipeline Conventions
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
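The path convention above can be sketched as follows; the directory constants are illustrative names, not necessarily those actually defined in `paths.py`:

```python
import os
from pathlib import Path

# FAST_DATA_DIR overrides the default ./data location (see paths.py).
# The constant names below are hypothetical, for illustration only.
DATA_DIR = Path(os.environ.get("FAST_DATA_DIR", "./data"))

GRIDS_DIR = DATA_DIR / "grids"
ERA5_DIR = DATA_DIR / "era5"
TRAINING_RESULTS_DIR = DATA_DIR / "training-results"
```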
### Storage Hierarchy
All data follows this structure:
```
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions
└── watermask/ # Ocean mask (GeoParquet)
```
## Module Organization
### Core Modules (`src/entropice/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management
### Dashboard (`src/entropice/dashboard/`)
- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)
- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`
## Development Workflow
### Setup
```bash
pixi install # NOT pip install or conda install
```
### Running Tests
```bash
pixi run pytest
```
### Common Tasks
- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns
### 1. XDGGS Indexing
All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.
### 2. Lazy Evaluation
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer
## Model Support
Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
## Extension Points
To extend Entropice:
- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
## Important Notes
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
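The batch-processing note above can be illustrated with a CPU-only sketch; `predict_in_batches` and the toy model are hypothetical stand-ins for the pipeline's actual GPU inference code:

```python
import numpy as np

def predict_in_batches(model_fn, X: np.ndarray, batch_size: int = 4096) -> np.ndarray:
    """Apply model_fn to X in fixed-size batches to bound peak memory."""
    parts = [model_fn(X[i : i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

X = np.random.rand(10_000, 8)
# Toy stand-in for a trained classifier's predict function
preds = predict_in_batches(lambda batch: batch.sum(axis=1), X, batch_size=1024)
```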
## Common Issues
- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions

.github/python.instructions.md (vendored, new file)
@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---
# Python Coding Conventions
## Python Instructions
- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Use the `typing` module for type annotations (e.g., `List[str]`, `Dict[str, int]`).
- Break down complex functions into smaller, more manageable functions.
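For example, a small function following these conventions (a minimal sketch, not project code):

```python
from typing import Dict, List

def count_labels(labels: List[str]) -> Dict[str, int]:
    """Count how often each label appears in the input list."""
    counts: Dict[str, int] = {}
    for label in labels:
        # The empty-input edge case falls out naturally: an empty dict
        counts[label] = counts.get(label, 0) + 1
    return counts
```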
## General Instructions
- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.
## Code Style and Formatting
- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces for each level of indentation).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` keyword.
- Use blank lines to separate functions, classes, and code blocks where appropriate.
## Edge Cases and Testing
- Always include test cases for critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.
## Example of Proper Documentation
```python
def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius^2.
    """
    import math

    return math.pi * radius ** 2
```

ARCHITECTURE.md (new file)
@@ -0,0 +1,217 @@
# Entropice Architecture
## Overview
Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).
The system processes multi-source geospatial data, aggregates it into discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.
It allows for rigorous experimentation with the data and the models, modularizing and abstracting most parts of the data flow pipeline.
Thus, alternative models, datasets and target labels can be used to compare, but also to explore new methodologies other than RTS prediction or eSPA modelling.
## Core Architecture
### Data Flow Pipeline
```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```
The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:
1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations
### System Components
#### 1. Spatial Grid System (`grids.py`)
- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
- Hex: 3, 4, 5, 6
- Healpix: 6, 7, 8, 9, 10
- **Functionality**:
- Generates cell IDs and geometries at configurable resolutions
- Applies watermask to exclude ocean areas
- Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
#### 2. Label Management (`darts.py`)
- **Purpose**: Extracts RTS labels from DARTS v2 dataset
- **Processing**:
- Spatial overlay of DARTS polygons with grid cells
- Temporal aggregation across multiple observation years
- Computes RTS count, area, density, and coverage per cell
- Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched parquet files with RTS metrics
#### 3. Feature Extractors
**ERA5 Climate Data (`era5.py`)**
- Downloads hourly climate variables from Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
**ArcticDEM Terrain (`arcticdem.py`)**
- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
**AlphaEarth Embeddings (`alphaearth.py`)**
- Extracts 256-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years
#### 4. Spatial Aggregation Framework (`aggregators.py`)
- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
- Exact geometry-based aggregation using polygon overlay
- Grid-aligned batch processing for memory efficiency
- Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
- Antimeridian handling for polar regions
- Parallel processing with worker pools
- GPU acceleration via CuPy/CuML where applicable
#### 5. Dataset Assembly (`dataset.py`)
- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
- Dynamic dataset configuration via dataclasses
- Multi-task support: binary classification, count, density estimation
- Target binning: quantile-based categorical bins
- Train/test splitting with spatial awareness
- GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for scikit-learn API
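The quantile-based target binning mentioned above can be sketched with pandas; this is a hypothetical stand-in for the dataset module's actual implementation:

```python
import numpy as np
import pandas as pd

# Synthetic, skewed stand-in for per-cell RTS density values
rng = np.random.default_rng(0)
density = pd.Series(rng.exponential(scale=1.0, size=1000), name="rts_density")

# Bucket the continuous target into 4 categorical classes by quantile;
# duplicates="drop" guards against degenerate bin edges.
classes = pd.qcut(density, q=4, labels=False, duplicates="drop")
```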
#### 6. Model Training (`training.py`)
- **Supported Models**:
- **eSPA**: Entropy-optimal probabilistic classifier (primary)
- **XGBoost**: Gradient boosting (GPU-accelerated)
- **Random Forest**: cuML implementation
- **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
- Randomized hyperparameter search (Scipy distributions)
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
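The training strategy above maps onto scikit-learn's randomized-search machinery; a small CPU sketch using a RandomForest stand-in (the real pipeline trains eSPA and GPU-accelerated models):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic classification data standing in for the tabular grid features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),  # sampled from scipy distributions
        "max_depth": randint(2, 10),
    },
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring=["accuracy", "f1", "precision", "recall"],  # multi-metric evaluation
    refit="f1",
    random_state=0,
)
search.fit(X, y)
```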
#### 7. Inference (`inference.py`)
- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via unified interface
#### 8. Interactive Dashboard (`dashboard/`)
- **Technology**: Streamlit-based web application
- **Pages**:
- **Overview**: Result directory summary and metrics
- **Training Data**: Feature distribution visualizations
- **Training Analysis**: CV results, hyperparameter analysis
- **Model State**: Feature importance, decision boundaries
- **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots
## Key Design Patterns
### 1. XDGGS Integration
All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:
- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities
### 2. Lazy Evaluation & Chunking
- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory
### 3. GPU Acceleration
- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations
### 4. Modular CLI Design
Each module exposes a Cyclopts-based CLI for standalone execution. Super-scripts utilize them to run parts of the data flow for all grid-level combinations:
```bash
scripts/00grids.sh # Grid generation
scripts/01darts.sh # Label extraction
scripts/02alphaearth.sh # Satellite embeddings
scripts/03era5.sh # Climate features
scripts/04arcticdem.sh # Terrain features
scripts/05train.sh # Model training
```
### 5. Configuration as Code
- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)
## Data Storage Hierarchy
```sh
DATA_DIR/
├── grids/ # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/ # Ocean mask (GeoParquet)
```
## Technology Stack
**Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
**Spatial**: H3, xdggs, xvec, Shapely, Rasterio
**ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
**GPU**: CuPy, PyTorch, CUDA
**Storage**: Zarr, Icechunk, Parquet, NetCDF
**Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
**CLI**: Cyclopts, Rich
**External APIs**: Google Earth Engine, Copernicus Climate Data Store
## Performance Considerations
- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations
## Extension Points
The architecture supports extension through:
- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
- **Alternative Targets**: Implement label extractor following DARTS pattern
- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration

CONTRIBUTING.md (new file)
@@ -0,0 +1,116 @@
# Contributing to Entropice
Thank you for your interest in contributing to Entropice! This document provides guidelines for contributing to the project.
## Getting Started
### Prerequisites
- Python 3.13
- CUDA 12 compatible GPU (for full functionality)
- [Pixi package manager](https://pixi.sh/)
### Setup
```bash
pixi install
```
This will set up the complete environment including RAPIDS, PyTorch, and all geospatial dependencies.
## Development Workflow
### Code Organization
- **`src/entropice/`**: Core modules (grids, data sources, training, inference)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
- **`tests/`**: Unit tests
### Key Modules
- `grids.py`: H3/HEALPix spatial grid systems
- `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`: Data source processors
- `dataset.py`: Dataset assembly and feature engineering
- `training.py`: Model training with eSPA/SPARTAn
- `inference.py`: Prediction generation
## Coding Standards
### Python Style
- Follow PEP 8 conventions
- Use type hints for function signatures
- Prefer numpy-style docstrings for public functions
- Keep functions focused and modular
### Geospatial Best Practices
- Use **xarray** with XDGGS for gridded data storage
- Store intermediate results as **Parquet** (tabular) or **Zarr** (arrays)
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Use the **EPSG:3413** (Arctic Stereographic) coordinate reference system (CRS) for any computation on the data, and **EPSG:4326** (WGS84) for data visualization and compatibility with some libraries
### Data Pipeline
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `paths.py` for consistent path management
## Testing
Run tests for specific modules:
```bash
pixi run pytest
```
When adding features, include tests that verify:
- Correct handling of geospatial coordinates and projections
- Proper aggregation to grid cells
- Data integrity through pipeline stages
## Submitting Changes
### Pull Request Process
1. **Branch**: Create a feature branch from `main`
2. **Commit**: Write clear, descriptive commit messages
3. **Test**: Verify your changes don't break existing functionality
4. **Document**: Update relevant docstrings and documentation
5. **PR**: Submit a pull request with a clear description of changes
### Commit Messages
- Use present tense: "Add feature" not "Added feature"
- Reference issues when applicable: "Fix #123: Correct grid aggregation"
- Keep first line under 72 characters
## Working with Data
### Local Development
- Set `FAST_DATA_DIR` environment variable for data directory (default: `./data`)
### Notebooks
- Notebooks in `notebooks/` are for exploration and validation; they are not committed to git
- Keep production code in `src/entropice/`
## Dashboard Development
Run the dashboard locally:
```bash
pixi run dashboard
```
Dashboard code is in `src/entropice/dashboard/` with modular pages and plotting utilities.
## Questions?
For questions about the architecture, see `ARCHITECTURE.md`. For scientific background, see `README.md`.

README.md (modified)
@@ -1,45 +1,145 @@
# Entropice
The goal of this project is to use the entropy-optimal Scalable Probabilistic Approximations (eSPA) algorithm to build a model that can estimate the density of Retrogressive Thaw Slumps (RTS) across the globe at different levels of detail, with the hope that successful training could yield new knowledge about RTS proxies.
**Geospatial Machine Learning for Arctic Permafrost Degradation.**
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.
## Setup
```sh
uv sync
```
## Scientific Background
### Retrogressive Thaw Slumps
Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:
- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release
Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.
### The Challenge
Current remote sensing approaches map a specific landscape feature and then extract spatio-temporal statistics from the resulting dataset.
Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:
- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales
Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:
- Huge data transfers between satellite imagery providers and the HPC systems where the models run
- Large energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the target landscape feature clearly does not exist
### Our Approach
Instead of global mapping followed by calculation of spatio-temporal statistics, Entropice tries to learn spatio-temporal patterns from a small subset, based on a large variety of data features, to make an educated guess about the spatio-temporal statistics of a landscape feature.
Entropice addresses this by:
1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at various low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies
This will hopefully lead to the following advances in permafrost research:
- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Less compute waste in image segmentation pipelines
- Better modelling through better starting conditions
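Step 2 of the approach above (multi-source integration) boils down to joining per-cell features from several sources into one table keyed by grid-cell id. A minimal pure-Python sketch, with invented cell ids, feature values, and column names (the real pipeline works on full Xarray/GeoPandas datasets):

```python
# Toy per-cell feature tables from three sources; ids and values are made up.
climate = {"hex_a": {"temp_2020_q50": -8.1}, "hex_b": {"temp_2020_q50": -6.4}}
terrain = {"hex_a": {"max_elevation": 312.0}, "hex_b": {"max_elevation": 95.0}}
labels = {"hex_a": {"rts_density": 0.004}, "hex_b": {"rts_density": 0.0}}

# Inner-join on cell id: keep only cells covered by every source.
rows = []
for cell in sorted(set(climate) & set(terrain) & set(labels)):
    row = {"cell": cell}
    for source in (climate, terrain, labels):
        row.update(source[cell])
    rows.append(row)

print(rows[0])  # → {'cell': 'hex_a', 'temp_2020_q50': -8.1, 'max_elevation': 312.0, 'rts_density': 0.004}
```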
### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)
eSPA is a probabilistic classification framework that:
- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis
This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.
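One concrete payoff of calibrated probabilities is a per-cell uncertainty readout. The sketch below is not eSPA itself, only the Shannon entropy of a predicted class distribution, which is one way calibrated outputs can flag poorly-sampled regions:

```python
import math

def prediction_entropy(probs) -> float:
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(prediction_entropy([0.97, 0.01, 0.01, 0.01]))  # confident cell: low entropy
print(prediction_entropy([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain: 2.0 bits
```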
## Key Features
- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources
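For intuition on the grid scales: the HEALPix cell count follows directly from the resolution, assuming "resolution" here means the HEALPix order with nside = 2^order (an assumption on our part):

```python
import math

EARTH_SURFACE_KM2 = 4 * math.pi * 6371.0**2  # spherical-Earth approximation

def healpix_cells(order: int) -> int:
    nside = 2**order
    return 12 * nside**2  # HEALPix pixel count: 12 * nside^2

for order in (6, 10):
    n = healpix_cells(order)
    print(f"order {order}: {n:,} cells, ~{EARTH_SURFACE_KM2 / n:,.0f} km^2 each")
```

All cells of a HEALPix grid have equal area, which is one reason the scheme is attractive for pan-Arctic statistics.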
## Quick Start
### Installation
Requires Python 3.13 and a CUDA 12-compatible GPU.
```bash
pixi install
```
This sets up the complete environment, including RAPIDS, PyTorch, and the geospatial libraries.

## Project Plan
1. Create global hexagon grids with `h3`
2. Enrich the grids with data from various sources and with labels from DARTS v2
3. Use eSPA for simple classification: each hexagon has [many slumps / some slumps / few slumps / no slumps]
4. Use SPARTAn for regression: one model for slump density (area) and one for the total number of slumps
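Step 3 above can be pictured as binning the continuous `"rts_density"` label into ordinal classes. The thresholds below are invented for illustration; the actual class boundaries are defined by the training configuration:

```python
def density_class(rts_density: float) -> str:
    """Map a continuous RTS density to an ordinal class (illustrative thresholds)."""
    if rts_density == 0.0:
        return "no slumps"
    if rts_density < 1e-4:
        return "few slumps"
    if rts_density < 1e-3:
        return "some slumps"
    return "many slumps"

print([density_class(d) for d in (0.0, 5e-5, 5e-4, 0.02)])
# → ['no slumps', 'few slumps', 'some slumps', 'many slumps']
```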
### Data Sources and Engineering

- Labels
  - `"year"`: Year of observation
  - `"area"`: Total land area of the hexagon
  - `"rts_density"`: RTS area divided by the total land area
  - `"rts_count"`: Number of individual RTS instances
- ERA5 (the 40 years leading up to `"year"`)
  - `"temp_yearXXXX_qY"`: Y-th quantile of the temperature in year XXXX; feeds the temperature distribution into the model.
  - `"thawing_days_yearXXXX"`: Number of thawing days in year XXXX.
  - `"precip_yearXXXX_qY"`: Y-th quantile of the precipitation in year XXXX, analogous to temperature.
  - `"temp_5year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 5 years apart.
  - `"temp_10year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 10 years apart.
  - `"temp_diff_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 10 years apart.
- ArcticDEM
  - `"dissection_index"`: Dissection index, (max - min) / max
  - `"max_elevation"`: Maximum elevation
  - `"elevationX_density"`: Area where the elevation exceeds X, divided by the total land area
- TCVIS
  - ???
- Wildfire???
- Permafrost???
- GroundIceContent???
- Biome

**About temporals:** Every label has its own year; all temporally dependent data features, e.g. `"temp_5year_diff_XXXXtoXXXX_qY"`, are calculated relative to that year.
The number of years taken from a dataset is always the same: for ERA5, the data for an observation from 2024 starts in 1984, and for an observation from 2023 in 1983.

### Running the Pipeline

Execute the numbered scripts to process data and train models:

```bash
scripts/00grids.sh  # Generate spatial grids
scripts/01darts.sh  # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh  # Extract satellite embeddings
scripts/03era5.sh  # Process climate data
scripts/04arcticdem.sh  # Compute terrain features
scripts/05train.sh  # Train models
```
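Since the temporal window per dataset is fixed, the set of time-dependent column names is fully determined by the observation year. A sketch for the ERA5 temperature columns, with an assumed quantile set of (10, 50, 90):

```python
def era5_temp_columns(obs_year: int, n_years: int = 40, quantiles=(10, 50, 90)):
    """Column names for the n_years of ERA5 temperature quantiles before obs_year."""
    return [f"temp_year{y}_q{q}" for y in range(obs_year - n_years, obs_year) for q in quantiles]

cols = era5_temp_columns(2024)
print(len(cols), cols[0], cols[-1])  # → 120 temp_year1984_q10 temp_year2023_q90
```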
### Visualizing Results
Launch the interactive dashboard:
```bash
pixi run dashboard
```
Explore training data distributions, cross-validation results, feature importance, and spatial predictions.
## Data Sources
- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings
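How the 64-dimensional AlphaEarth embeddings become a single per-cell feature vector is not spelled out here; mean pooling over a cell's pixels is one plausible reading. A stdlib sketch with random stand-in values:

```python
import random

random.seed(0)
# 1,000 fake 64-dimensional pixel embeddings inside one grid cell.
pixels = [[random.random() for _ in range(64)] for _ in range(1000)]

# Mean-pool each dimension across all pixels in the cell.
cell_embedding = [sum(dim) / len(pixels) for dim in zip(*pixels)]
print(len(cell_embedding))  # → 64
```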
## Project Structure
- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)
## Documentation
- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards
## Research Goals
1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making
## License
TODO
## Citation
If you use Entropice in your research, please cite:
```txt
TODO
```
