# Entropice - GitHub Copilot Instructions

## Project Overview

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.

For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).

## Core Technologies

- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy

## Code Style Guidelines

### Python Standards

- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations

### Geospatial Best Practices

- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions

### Data Pipeline Conventions

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)

### Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
├── training-results/   # Models, CV results, predictions
└── watermask/          # Ocean mask (GeoParquet)
```

## Module Organization

### Core Modules (`src/entropice/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management

### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`

### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules

### Notebooks (`notebooks/`)

- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`

## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```

### Running Tests

```bash
pixi run pytest
```

### Common Tasks

- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.

### 2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

### 3. GPU Acceleration

Prefer GPU-accelerated operations:

- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable

### 4. Configuration as Code

- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing

## Data Sources

- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 64-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer

## Model Support

Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)

Alternatives: XGBoost, Random Forest, K-Nearest Neighbors

Training features:

- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)

## Extension Points

To extend Entropice:

- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module

## Important Notes

- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`

## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326

## References

For more details, consult:

- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions
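
## Example: Arctic-Aligned Years

The Arctic-aligned year convention (October 1 to September 30) is easy to get wrong by one year, so a minimal sketch may help. The function names `arctic_year` and `arctic_year_bounds` are hypothetical and are not taken from `era5.py`. The sketch also assumes that a year is labeled by the calendar year in which it ends; check `era5.py` for the labeling the project actually uses.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Return the Arctic-aligned year that contains a date.

    Assumption (not confirmed by the project): a year is labeled by the
    calendar year in which it ends, so 2020-10-01 through 2021-09-30
    is Arctic year 2021.

    Parameters
    ----------
    day : date
        Calendar date to classify.

    Returns
    -------
    int
        Label of the Arctic-aligned year containing ``day``.
    """
    # October through December already belong to the next labeled year.
    return day.year + 1 if day.month >= 10 else day.year


def arctic_year_bounds(year: int) -> tuple[date, date]:
    """Return the inclusive start and end dates of an Arctic-aligned year.

    Parameters
    ----------
    year : int
        Arctic-aligned year label (ending calendar year, by assumption).

    Returns
    -------
    tuple[date, date]
        October 1 of the previous calendar year and September 30 of ``year``.
    """
    return date(year - 1, 10, 1), date(year, 9, 30)
```

With this labeling, `arctic_year_bounds(y)` and `arctic_year` round-trip: both boundary dates of year `y` map back to `y`. If the project labels by the starting year instead, only the `+ 1` and `- 1` offsets change.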