Add some docs for copilot

This commit is contained in:
Tobias Hölzer 2025-12-28 20:11:11 +01:00
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions

.github/agents/Dashboard.agent.md (vendored, new file)
@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---
# Dashboard Development Agent
You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.
## Your Responsibilities
### What You Should Do
1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
- Streamlit: https://docs.streamlit.io/
- Plotly: https://plotly.com/python/
- PyDeck: https://deckgl.readthedocs.io/
- Deck.gl: https://deck.gl/
- Matplotlib: https://matplotlib.org/
- Seaborn: https://seaborn.pydata.org/
- Xarray: https://docs.xarray.dev/
- GeoPandas: https://geopandas.org/
- Pandas: https://pandas.pydata.org/pandas-docs/
- NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures—but **NEVER edit them**
### What You Should NOT Do
1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files
### Boundaries
If you identify that a dashboard improvement requires changes to:
- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)
**Stop immediately** and inform the user:
```
⚠️ This dashboard feature requires changes to the data pipeline/training code.
Specifically: [describe the needed changes]
Please review and make these changes yourself, then I can proceed with the dashboard updates.
```
## Dashboard Structure
The dashboard is located in `src/entropice/dashboard/` with the following structure:
```
dashboard/
├── app.py # Main Streamlit app with navigation
├── overview_page.py # Overview of training results
├── training_data_page.py # Training data visualizations
├── training_analysis_page.py # CV results and hyperparameter analysis
├── model_state_page.py # Feature importance and model state
├── inference_page.py # Spatial prediction visualizations
├── plots/ # Reusable plotting utilities
│ ├── colors.py # Color schemes
│ ├── hyperparameter_analysis.py
│ ├── inference.py
│ ├── model_state.py
│ ├── source_data.py
│ └── training_data.py
└── utils/ # Data loading and processing
├── data.py
└── training.py
```
## Key Technologies
- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)
## Critical Code Standards
### Streamlit Best Practices
**❌ INCORRECT** (deprecated):
```python
st.plotly_chart(fig, use_container_width=True)
```
**✅ CORRECT** (current API):
```python
st.plotly_chart(fig, width='stretch')
```
**Common width values**:
- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)
This applies to:
- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`
### Data Structure Patterns
When working with Entropice data:
1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities
### Visualization Guidelines
1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates
## Workflow
1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
## Example Interactions
### ✅ Good Requests (Within Scope)
- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"
### ⚠️ Boundary Cases (Requires User Approval)
User: "Add a new climate variable to the dashboard"
Agent Response:
```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.
Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```
## Progress Reporting
For complex dashboard development tasks:
1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress
## Remember
- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user
You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!

.github/copilot-instructions.md (vendored, new file)
@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).
## Core Technologies
- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
### Geospatial Best Practices
- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
### Data Pipeline Conventions
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
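The path convention above can be sketched as follows; the directory constants are illustrative names, not necessarily those actually defined in `paths.py`:

```python
import os
from pathlib import Path

# FAST_DATA_DIR overrides the default ./data location (see paths.py).
# The constant names below are hypothetical, for illustration only.
DATA_DIR = Path(os.environ.get("FAST_DATA_DIR", "./data"))

GRIDS_DIR = DATA_DIR / "grids"
ERA5_DIR = DATA_DIR / "era5"
TRAINING_RESULTS_DIR = DATA_DIR / "training-results"
```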
### Storage Hierarchy
All data follows this structure:
```
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions
└── watermask/ # Ocean mask (GeoParquet)
```
## Module Organization
### Core Modules (`src/entropice/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management
### Dashboard (`src/entropice/dashboard/`)
- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)
- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`
## Development Workflow
### Setup
```bash
pixi install # NOT pip install or conda install
```
### Running Tests
```bash
pixi run pytest
```
### Common Tasks
- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns
### 1. XDGGS Indexing
All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.
### 2. Lazy Evaluation
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer
## Model Support
Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
## Extension Points
To extend Entropice:
- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
## Important Notes
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
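The batch-processing note above can be illustrated with a CPU-only sketch; `predict_in_batches` and the toy model are hypothetical stand-ins for the pipeline's actual GPU inference code:

```python
import numpy as np

def predict_in_batches(model_fn, X: np.ndarray, batch_size: int = 4096) -> np.ndarray:
    """Apply model_fn to X in fixed-size batches to bound peak memory."""
    parts = [model_fn(X[i : i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

X = np.random.rand(10_000, 8)
# Toy stand-in for a trained classifier's predict function
preds = predict_in_batches(lambda batch: batch.sum(axis=1), X, batch_size=1024)
```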
## Common Issues
- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions

.github/python.instructions.md (vendored, new file)
@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---
# Python Coding Conventions
## Python Instructions
- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Use the `typing` module for type annotations (e.g., `List[str]`, `Dict[str, int]`).
- Break down complex functions into smaller, more manageable functions.
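For example, a small function following these conventions (a minimal sketch, not project code):

```python
from typing import Dict, List

def count_labels(labels: List[str]) -> Dict[str, int]:
    """Count how often each label appears in the input list."""
    counts: Dict[str, int] = {}
    for label in labels:
        # The empty-input edge case falls out naturally: an empty dict
        counts[label] = counts.get(label, 0) + 1
    return counts
```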
## General Instructions
- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.
## Code Style and Formatting
- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces for each level of indentation).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` keyword.
- Use blank lines to separate functions, classes, and code blocks where appropriate.
## Edge Cases and Testing
- Always include test cases for critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.
## Example of Proper Documentation
```python
def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius^2.
    """
    import math

    return math.pi * radius ** 2
```

ARCHITECTURE.md (new file)
@@ -0,0 +1,217 @@
# Entropice Architecture
## Overview
Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).
The system processes multi-source geospatial data, aggregates it into discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.
It allows for rigorous experimentation with the data and the models, modularizing and abstracting most parts of the data flow pipeline.
Thus, alternative models, datasets and target labels can be used to compare, but also to explore new methodologies other than RTS prediction or eSPA modelling.
## Core Architecture
### Data Flow Pipeline
```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```
The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:
1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations
### System Components
#### 1. Spatial Grid System (`grids.py`)
- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
- Hex: 3, 4, 5, 6
- Healpix: 6, 7, 8, 9, 10
- **Functionality**:
- Generates cell IDs and geometries at configurable resolutions
- Applies watermask to exclude ocean areas
- Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
#### 2. Label Management (`darts.py`)
- **Purpose**: Extracts RTS labels from DARTS v2 dataset
- **Processing**:
- Spatial overlay of DARTS polygons with grid cells
- Temporal aggregation across multiple observation years
- Computes RTS count, area, density, and coverage per cell
- Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched parquet files with RTS metrics
#### 3. Feature Extractors
**ERA5 Climate Data (`era5.py`)**
- Downloads hourly climate variables from Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
**ArcticDEM Terrain (`arcticdem.py`)**
- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
**AlphaEarth Embeddings (`alphaearth.py`)**
- Extracts 256-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years
#### 4. Spatial Aggregation Framework (`aggregators.py`)
- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
- Exact geometry-based aggregation using polygon overlay
- Grid-aligned batch processing for memory efficiency
- Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
- Antimeridian handling for polar regions
- Parallel processing with worker pools
- GPU acceleration via CuPy/CuML where applicable
#### 5. Dataset Assembly (`dataset.py`)
- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
- Dynamic dataset configuration via dataclasses
- Multi-task support: binary classification, count, density estimation
- Target binning: quantile-based categorical bins
- Train/test splitting with spatial awareness
- GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for scikit-learn API
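The quantile-based target binning mentioned above can be sketched with pandas; this is a hypothetical stand-in for the dataset module's actual implementation:

```python
import numpy as np
import pandas as pd

# Synthetic, skewed stand-in for per-cell RTS density values
rng = np.random.default_rng(0)
density = pd.Series(rng.exponential(scale=1.0, size=1000), name="rts_density")

# Bucket the continuous target into 4 categorical classes by quantile;
# duplicates="drop" guards against degenerate bin edges.
classes = pd.qcut(density, q=4, labels=False, duplicates="drop")
```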
#### 6. Model Training (`training.py`)
- **Supported Models**:
- **eSPA**: Entropy-optimal probabilistic classifier (primary)
- **XGBoost**: Gradient boosting (GPU-accelerated)
- **Random Forest**: cuML implementation
- **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
- Randomized hyperparameter search (Scipy distributions)
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
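The training strategy above maps onto scikit-learn's randomized-search machinery; a small CPU sketch using a RandomForest stand-in (the real pipeline trains eSPA and GPU-accelerated models):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic classification data standing in for the tabular grid features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),  # sampled from scipy distributions
        "max_depth": randint(2, 10),
    },
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring=["accuracy", "f1", "precision", "recall"],  # multi-metric evaluation
    refit="f1",
    random_state=0,
)
search.fit(X, y)
```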
#### 7. Inference (`inference.py`)
- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via unified interface
#### 8. Interactive Dashboard (`dashboard/`)
- **Technology**: Streamlit-based web application
- **Pages**:
- **Overview**: Result directory summary and metrics
- **Training Data**: Feature distribution visualizations
- **Training Analysis**: CV results, hyperparameter analysis
- **Model State**: Feature importance, decision boundaries
- **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots
## Key Design Patterns
### 1. XDGGS Integration
All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:
- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities
### 2. Lazy Evaluation & Chunking
- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory
### 3. GPU Acceleration
- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations
### 4. Modular CLI Design
Each module exposes a Cyclopts-based CLI for standalone execution. Super-scripts utilize them to run parts of the data flow for all grid-level combinations:
```bash
scripts/00grids.sh # Grid generation
scripts/01darts.sh # Label extraction
scripts/02alphaearth.sh # Satellite embeddings
scripts/03era5.sh # Climate features
scripts/04arcticdem.sh # Terrain features
scripts/05train.sh # Model training
```
### 5. Configuration as Code
- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)
## Data Storage Hierarchy
```sh
DATA_DIR/
├── grids/ # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/ # Ocean mask (GeoParquet)
```
## Technology Stack
**Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
**Spatial**: H3, xdggs, xvec, Shapely, Rasterio
**ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
**GPU**: CuPy, PyTorch, CUDA
**Storage**: Zarr, Icechunk, Parquet, NetCDF
**Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
**CLI**: Cyclopts, Rich
**External APIs**: Google Earth Engine, Copernicus Climate Data Store
## Performance Considerations
- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations
## Extension Points
The architecture supports extension through:
- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
- **Alternative Targets**: Implement label extractor following DARTS pattern
- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration

CONTRIBUTING.md (new file)
@@ -0,0 +1,116 @@
# Contributing to Entropice
Thank you for your interest in contributing to Entropice! This document provides guidelines for contributing to the project.
## Getting Started
### Prerequisites
- Python 3.13
- CUDA 12 compatible GPU (for full functionality)
- [Pixi package manager](https://pixi.sh/)
### Setup
```bash
pixi install
```
This will set up the complete environment including RAPIDS, PyTorch, and all geospatial dependencies.
## Development Workflow
### Code Organization
- **`src/entropice/`**: Core modules (grids, data sources, training, inference)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
- **`tests/`**: Unit tests
### Key Modules
- `grids.py`: H3/HEALPix spatial grid systems
- `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`: Data source processors
- `dataset.py`: Dataset assembly and feature engineering
- `training.py`: Model training with eSPA/SPARTAn
- `inference.py`: Prediction generation
## Coding Standards
### Python Style
- Follow PEP 8 conventions
- Use type hints for function signatures
- Prefer numpy-style docstrings for public functions
- Keep functions focused and modular
### Geospatial Best Practices
- Use **xarray** with XDGGS for gridded data storage
- Store intermediate results as **Parquet** (tabular) or **Zarr** (arrays)
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Use the **EPSG:3413** (Arctic Stereographic) coordinate reference system (CRS) for any computation on the data, and **EPSG:4326** (WGS84) for data visualization and compatibility with some libraries
### Data Pipeline
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `paths.py` for consistent path management
## Testing
Run tests for specific modules:
```bash
pixi run pytest
```
When adding features, include tests that verify:
- Correct handling of geospatial coordinates and projections
- Proper aggregation to grid cells
- Data integrity through pipeline stages
## Submitting Changes
### Pull Request Process
1. **Branch**: Create a feature branch from `main`
2. **Commit**: Write clear, descriptive commit messages
3. **Test**: Verify your changes don't break existing functionality
4. **Document**: Update relevant docstrings and documentation
5. **PR**: Submit a pull request with a clear description of changes
### Commit Messages
- Use present tense: "Add feature" not "Added feature"
- Reference issues when applicable: "Fix #123: Correct grid aggregation"
- Keep first line under 72 characters
## Working with Data
### Local Development
- Set `FAST_DATA_DIR` environment variable for data directory (default: `./data`)
### Notebooks
- Notebooks in `notebooks/` are for exploration and validation; they are not committed to git
- Keep production code in `src/entropice/`
## Dashboard Development
Run the dashboard locally:
```bash
pixi run dashboard
```
Dashboard code is in `src/entropice/dashboard/` with modular pages and plotting utilities.
## Questions?
For questions about the architecture, see `ARCHITECTURE.md`. For scientific background, see `README.md`.

README.md (modified)
@@ -1,45 +1,145 @@
# Entropice
The goal of this project is to use the entropy-optimal Scalable Probabilistic Approximations (eSPA) algorithm to build a model that can estimate the density of Retrogressive Thaw Slumps (RTS) across the globe at different levels of detail, with the hope that successful training could yield new knowledge about RTS proxies.
**Geospatial Machine Learning for Arctic Permafrost Degradation.**
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.
## Setup
```sh
uv sync
```
## Scientific Background
### Retrogressive Thaw Slumps
Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:
- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release
Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.
### The Challenge
Current remote sensing approaches map a specific landscape feature and then extract spatio-temporal statistics from the resulting dataset.
Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:
- Labor-intensive and limited in spatial/temporal coverage
- Challenging due to cloud cover and seasonal visibility
- Insufficient for pan-Arctic prediction at decision-relevant scales
Modern mapping approaches use machine learning to create segmented labels from satellite imagery (e.g., the DARTS dataset), which comes with its own problems:
- Huge data transfers between satellite imagery providers and the HPC systems where the models run
- Large energy consumption in both data transfer and inference
- Uncertainty about the quality of the results
- Potential compute waste when running inference on regions where the target landscape feature clearly does not exist
### Our Approach
Instead of global mapping followed by calculation of spatio-temporal statistics, Entropice tries to learn spatio-temporal patterns from a small subset, based on a large variety of data features, to make an educated guess about the spatio-temporal statistics of a landscape feature.
Entropice addresses this by:
1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at various low to mid resolutions (levels)
2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies
This will hopefully lead to the following advances in permafrost research:
- Better understanding of RTS occurrence
- A potential proxy for ice-rich permafrost
- Less compute waste in image segmentation pipelines
- Better modelling through better starting conditions
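Step 2 of the approach above (multi-source integration) boils down to joining per-cell features from several sources into one table keyed by grid-cell id. A minimal pure-Python sketch, with invented cell ids, feature values, and column names (the real pipeline works on full Xarray/GeoPandas datasets):

```python
# Toy per-cell feature tables from three sources; ids and values are made up.
climate = {"hex_a": {"temp_2020_q50": -8.1}, "hex_b": {"temp_2020_q50": -6.4}}
terrain = {"hex_a": {"max_elevation": 312.0}, "hex_b": {"max_elevation": 95.0}}
labels = {"hex_a": {"rts_density": 0.004}, "hex_b": {"rts_density": 0.0}}

# Inner-join on cell id: keep only cells covered by every source.
rows = []
for cell in sorted(set(climate) & set(terrain) & set(labels)):
    row = {"cell": cell}
    for source in (climate, terrain, labels):
        row.update(source[cell])
    rows.append(row)

print(rows[0])  # → {'cell': 'hex_a', 'temp_2020_q50': -8.1, 'max_elevation': 312.0, 'rts_density': 0.004}
```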
### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)
eSPA is a probabilistic classification framework that:
- Provides calibrated probability estimates (not just point predictions)
- Handles imbalanced datasets common in geospatial phenomena
- Captures uncertainty in predictions across poorly-sampled regions
- Enables interpretable feature importance analysis
This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.
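One concrete payoff of calibrated probabilities is a per-cell uncertainty readout. The sketch below is not eSPA itself, only the Shannon entropy of a predicted class distribution, which is one way calibrated outputs can flag poorly-sampled regions:

```python
import math

def prediction_entropy(probs) -> float:
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(prediction_entropy([0.97, 0.01, 0.01, 0.01]))  # confident cell: low entropy
print(prediction_entropy([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain: 2.0 bits
```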
## Key Features
- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources
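For intuition on the grid scales: the HEALPix cell count follows directly from the resolution, assuming "resolution" here means the HEALPix order with nside = 2^order (an assumption on our part):

```python
import math

EARTH_SURFACE_KM2 = 4 * math.pi * 6371.0**2  # spherical-Earth approximation

def healpix_cells(order: int) -> int:
    nside = 2**order
    return 12 * nside**2  # HEALPix pixel count: 12 * nside^2

for order in (6, 10):
    n = healpix_cells(order)
    print(f"order {order}: {n:,} cells, ~{EARTH_SURFACE_KM2 / n:,.0f} km^2 each")
```

All cells of a HEALPix grid have equal area, which is one reason the scheme is attractive for pan-Arctic statistics.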
## Quick Start
### Installation
Requires Python 3.13 and a CUDA 12-compatible GPU.
```bash
pixi install
```
This sets up the complete environment, including RAPIDS, PyTorch, and the geospatial libraries.

## Project Plan
1. Create global hexagon grids with `h3`
2. Enrich the grids with data from various sources and with labels from DARTS v2
3. Use eSPA for simple classification: each hexagon has [many slumps / some slumps / few slumps / no slumps]
4. Use SPARTAn for regression: one model for slump density (area) and one for the total number of slumps
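Step 3 above can be pictured as binning the continuous `"rts_density"` label into ordinal classes. The thresholds below are invented for illustration; the actual class boundaries are defined by the training configuration:

```python
def density_class(rts_density: float) -> str:
    """Map a continuous RTS density to an ordinal class (illustrative thresholds)."""
    if rts_density == 0.0:
        return "no slumps"
    if rts_density < 1e-4:
        return "few slumps"
    if rts_density < 1e-3:
        return "some slumps"
    return "many slumps"

print([density_class(d) for d in (0.0, 5e-5, 5e-4, 0.02)])
# → ['no slumps', 'few slumps', 'some slumps', 'many slumps']
```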
### Data Sources and Engineering

- Labels
  - `"year"`: Year of observation
  - `"area"`: Total land area of the hexagon
  - `"rts_density"`: RTS area divided by the total land area
  - `"rts_count"`: Number of individual RTS instances
- ERA5 (the 40 years leading up to `"year"`)
  - `"temp_yearXXXX_qY"`: Y-th quantile of the temperature in year XXXX; feeds the temperature distribution into the model.
  - `"thawing_days_yearXXXX"`: Number of thawing days in year XXXX.
  - `"precip_yearXXXX_qY"`: Y-th quantile of the precipitation in year XXXX, analogous to temperature.
  - `"temp_5year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 5 years apart.
  - `"temp_10year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 10 years apart.
  - `"temp_diff_qY"`: Difference of the Y-th temperature quantile between year XXXX and year XXXX, always 10 years apart.
- ArcticDEM
  - `"dissection_index"`: Dissection index, (max - min) / max
  - `"max_elevation"`: Maximum elevation
  - `"elevationX_density"`: Area where the elevation exceeds X, divided by the total land area
- TCVIS
  - ???
- Wildfire???
- Permafrost???
- GroundIceContent???
- Biome

**About temporals:** Every label has its own year; all temporally dependent data features, e.g. `"temp_5year_diff_XXXXtoXXXX_qY"`, are calculated relative to that year.
The number of years taken from a dataset is always the same: for ERA5, the data for an observation from 2024 starts in 1984, and for an observation from 2023 in 1983.

### Running the Pipeline

Execute the numbered scripts to process data and train models:

```bash
scripts/00grids.sh  # Generate spatial grids
scripts/01darts.sh  # Extract RTS labels from DARTS v2
scripts/02alphaearth.sh  # Extract satellite embeddings
scripts/03era5.sh  # Process climate data
scripts/04arcticdem.sh  # Compute terrain features
scripts/05train.sh  # Train models
```
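Since the temporal window per dataset is fixed, the set of time-dependent column names is fully determined by the observation year. A sketch for the ERA5 temperature columns, with an assumed quantile set of (10, 50, 90):

```python
def era5_temp_columns(obs_year: int, n_years: int = 40, quantiles=(10, 50, 90)):
    """Column names for the n_years of ERA5 temperature quantiles before obs_year."""
    return [f"temp_year{y}_q{q}" for y in range(obs_year - n_years, obs_year) for q in quantiles]

cols = era5_temp_columns(2024)
print(len(cols), cols[0], cols[-1])  # → 120 temp_year1984_q10 temp_year2023_q90
```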
### Visualizing Results
Launch the interactive dashboard:
```bash
pixi run dashboard
```
Explore training data distributions, cross-validation results, feature importance, and spatial predictions.
## Data Sources
- **DARTS v2**: RTS labels (polygons with year, area, count)
- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain elevation
- **AlphaEarth**: 64-dimensional satellite image embeddings
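How the 64-dimensional AlphaEarth embeddings become a single per-cell feature vector is not spelled out here; mean pooling over a cell's pixels is one plausible reading. A stdlib sketch with random stand-in values:

```python
import random

random.seed(0)
# 1,000 fake 64-dimensional pixel embeddings inside one grid cell.
pixels = [[random.random() for _ in range(64)] for _ in range(1000)]

# Mean-pool each dimension across all pixels in the cell.
cell_embedding = [sum(dim) / len(pixels) for dim in zip(*pixels)]
print(len(cell_embedding))  # → 64
```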
## Project Structure
- `src/entropice/`: Core modules (grids, data processors, training, inference)
- `src/entropice/dashboard/`: Streamlit visualization application
- `scripts/`: Data processing pipeline automation
- `notebooks/`: Exploratory analysis (not version-controlled)
## Documentation
- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards
## Research Goals
1. **Predictive Modeling**: Estimate RTS density at unobserved locations
2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making
## License
TODO
## Citation
If you use Entropice in your research, please cite:
```txt
TODO
```
