Entropice - GitHub Copilot Instructions

Project Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.

For detailed architecture information, see ARCHITECTURE.md. For contributing guidelines, see CONTRIBUTING.md. For project goals and setup, see README.md.

Core Technologies

  • Python: 3.13 (strict version requirement)
  • Package Manager: Pixi (not pip/conda directly)
  • GPU: CUDA 12 with RAPIDS (CuPy, cuML)
  • Geospatial: xarray, xdggs, GeoPandas, H3, Rasterio
  • ML: scikit-learn, XGBoost, entropy (eSPA), cuML
  • Storage: Zarr, Icechunk, Parquet, NetCDF
  • Visualization: Streamlit, Bokeh, Matplotlib, Cartopy

Code Style Guidelines

Python Standards

  • Follow PEP 8 conventions
  • Use type hints for all function signatures
  • Write numpy-style docstrings for public functions
  • Keep functions focused and modular
  • Prefer descriptive variable names over abbreviations
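
The standards above can be illustrated with a short function that combines type hints, a numpy-style docstring, and descriptive names. This is a sketch only: `cell_density` and its parameters are hypothetical and do not refer to an existing function in the codebase.

```python
import math


def cell_density(rts_count: int, cell_area_km2: float) -> float:
    """Compute RTS density for a single grid cell.

    Parameters
    ----------
    rts_count : int
        Number of RTS features found in the cell.
    cell_area_km2 : float
        Cell area in square kilometres. Must be positive and finite.

    Returns
    -------
    float
        RTS features per square kilometre.

    Raises
    ------
    ValueError
        If ``cell_area_km2`` is not a positive finite number.
    """
    # Reject zero, negative, NaN, and infinite areas before dividing.
    if not math.isfinite(cell_area_km2) or cell_area_km2 <= 0:
        raise ValueError("cell_area_km2 must be a positive finite number")
    return rts_count / cell_area_km2
```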

Geospatial Best Practices

  • Use EPSG:3413 (Arctic Stereographic) for computations
  • Use EPSG:4326 (WGS84) for visualization and library compatibility
  • Store gridded data using xarray with XDGGS indexing
  • Store tabular data as Parquet, array data as Zarr
  • Leverage Dask for lazy evaluation of large datasets
  • Use GeoPandas for vector operations
  • Handle antimeridian correctly for polar regions
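
As an illustration of the antimeridian point, the sketch below detects a polygon ring whose longitudes jump across ±180° and shifts it into a continuous 0 to 360 range. It is a simplified heuristic, not the project's implementation: the function names are hypothetical, and it assumes every edge spans less than 180° of longitude, which does not hold for cells that contain the pole.

```python
def crosses_antimeridian(lons: list[float]) -> bool:
    """Return True if any two consecutive vertices differ by more than 180 degrees."""
    return any(abs(b - a) > 180.0 for a, b in zip(lons, lons[1:]))


def unwrap_longitudes(lons: list[float]) -> list[float]:
    """Shift negative longitudes by +360 when the ring crosses the antimeridian.

    Rings that do not cross are returned unchanged (as a copy).
    """
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0 else lon for lon in lons]


# A closed ring straddling the antimeridian (longitudes only, in EPSG:4326).
ring = [179.0, -179.0, -179.0, 179.0, 179.0]
unwrapped = unwrap_longitudes(ring)
```

After unwrapping, the ring occupies a continuous longitude interval, so bounding boxes and areas computed in longitude/latitude no longer wrap around the globe.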

Data Pipeline Conventions

  • Follow the numbered script sequence: 00grids.sh → 01darts.sh → 02alphaearth.sh → 03era5.sh → 04arcticdem.sh → 05train.sh
  • Each pipeline stage should produce reproducible intermediate outputs
  • Use src/entropice/utils/paths.py for consistent path management
  • Environment variable FAST_DATA_DIR controls data directory location (default: ./data)
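
Centralized path handling of this kind might look like the sketch below. It is not the actual content of src/entropice/utils/paths.py: `get_data_dir`, `grid_file`, and the `{grid}_{level}.parquet` naming scheme are assumptions for illustration. Only the FAST_DATA_DIR variable, its ./data default, and the grids/ subdirectory come from this document.

```python
import os
from pathlib import Path


def get_data_dir() -> Path:
    """Resolve the data root from FAST_DATA_DIR, falling back to ./data.

    The result is made absolute so downstream code never depends on the
    current working directory.
    """
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()


def grid_file(grid: str, level: int) -> Path:
    """Build the path to a grid file (hypothetical naming scheme)."""
    return get_data_dir() / "grids" / f"{grid}_{level}.parquet"
```

Reading the environment variable inside the function, rather than once at import time, lets a test or a pipeline script redirect the data directory without reloading the module.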

Storage Hierarchy

All data follows this structure, where DATA_DIR stands for the data directory set by FAST_DATA_DIR:

DATA_DIR/
├── grids/                    # H3/HEALPix tessellations (GeoParquet)
├── darts/                    # RTS labels (GeoParquet)
├── era5/                     # Climate data (Zarr)
├── arcticdem/                # Terrain data (Icechunk Zarr)
├── alphaearth/               # Satellite embeddings (Zarr)
├── datasets/                 # L2 XDGGS datasets (Zarr)
├── training-results/         # Models, CV results, predictions
└── watermask/                # Ocean mask (GeoParquet)

Module Organization

Core Modules (src/entropice/)

The codebase is organized into four main packages:

  • entropice.ingest: Data ingestion from external sources
  • entropice.spatial: Spatial operations and grid management
  • entropice.ml: Machine learning workflows
  • entropice.utils: Common utilities

Data Ingestion (src/entropice/ingest/)

  • darts.py: RTS label extraction from DARTS v2 dataset
  • era5.py: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
  • arcticdem.py: Terrain analysis from 32m Arctic elevation data
  • alphaearth.py: Satellite image embeddings via Google Earth Engine

Spatial Operations (src/entropice/spatial/)

  • grids.py: H3/HEALPix spatial grid generation with watermask
  • aggregators.py: Raster-to-vector spatial aggregation engine
  • watermask.py: Ocean masking utilities
  • xvec.py: Extended vector operations for xarray

Machine Learning (src/entropice/ml/)

  • dataset.py: Multi-source data integration and feature engineering
  • training.py: Model training with eSPA, XGBoost, Random Forest, KNN
  • inference.py: Batch prediction pipeline for trained models

Utilities (src/entropice/utils/)

  • paths.py: Centralized path management
  • codecs.py: Custom codecs for data serialization

Dashboard (src/entropice/dashboard/)

  • Streamlit-based interactive visualization
  • Modular pages: overview, training data, analysis, model state, inference
  • Bokeh-based geospatial plotting utilities
  • Run with: pixi run dashboard

Scripts (scripts/)

  • Numbered pipeline scripts (00grids.sh through 05train.sh)
  • Run entire pipeline for multiple grid configurations
  • Each script uses CLIs from core modules

Notebooks (notebooks/)

  • Exploratory analysis and validation
  • NOT committed to git
  • Keep production code in src/entropice/

Development Workflow

Setup

pixi install  # NOT pip install or conda install

Running Tests

pixi run pytest

Common Tasks

  • Generate grids: Use spatial/grids.py CLI
  • Process labels: Use ingest/darts.py CLI
  • Train models: Use ml/training.py CLI with TOML config
  • Run inference: Use ml/inference.py CLI
  • View results: pixi run dashboard

Key Design Patterns

1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the xdggs library, which gives consistent spatial indexing across sources.

2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

3. GPU Acceleration

Prefer GPU-accelerated operations:

  • CuPy for array operations
  • cuML for Random Forest and KNN
  • XGBoost GPU training
  • PyTorch tensors when applicable

4. Configuration as Code

  • Use dataclasses for typed configuration
  • TOML files for training hyperparameters
  • Cyclopts for CLI argument parsing

Data Sources

  • DARTS v2: RTS labels (year, area, count, density)
  • ERA5: Climate data (40-year history, Arctic-aligned years)
  • ArcticDEM: 32m resolution terrain (slope, aspect, indices)
  • AlphaEarth: 64-dimensional satellite embeddings
  • Watermask: Ocean exclusion layer

Model Support

Primary: eSPA (entropy-optimal Scalable Probabilistic Approximations)

Alternatives: XGBoost, Random Forest, K-Nearest Neighbors

Training features:

  • Randomized hyperparameter search
  • K-Fold cross-validation
  • Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
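
For reference, the five metrics listed above can be written out for the binary case as in the sketch below. This shows the definitions only and assumes labels encoded as 0 and 1. It is not the project's evaluation code, which would more likely rely on scikit-learn's metric functions, and `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, F1, and Jaccard for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 for an empty denominator instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
    }


scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

F1 and Jaccard both ignore true negatives, which is why they are often more informative than accuracy when positive cells are rare.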

Extension Points

To extend Entropice:

  • New data source: Follow patterns in ingest/era5.py or ingest/arcticdem.py
  • Custom aggregations: Add to _Aggregations dataclass in spatial/aggregators.py
  • Alternative labels: Implement extractor following ingest/darts.py pattern
  • New models: Add scikit-learn compatible estimators to ml/training.py
  • Dashboard pages: Add Streamlit pages to dashboard/ module

Important Notes

  • Grid resolutions: H3 (3-6), HEALPix (6-10)
  • Arctic years run October 1 to September 30 (not calendar years)
  • Handle antimeridian crossing in polar regions
  • Use batch processing for GPU memory management
  • Notebooks are for exploration only - keep production code in src/
  • Always use absolute paths or paths from utils/paths.py
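
The Arctic-year convention in the notes above can be made concrete with a small helper. The October 1 to September 30 span comes from this document; labelling each Arctic year by the calendar year in which it ends is an assumption made here for illustration, and both function names are hypothetical.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Map a date to its Arctic year.

    Assumed convention: the Arctic year is named after the calendar year
    in which it ends, so October to December belong to the following year.
    """
    return day.year + 1 if day.month >= 10 else day.year


def arctic_year_bounds(year: int) -> tuple[date, date]:
    """Return the first and last day (inclusive) of an Arctic year."""
    return date(year - 1, 10, 1), date(year, 9, 30)
```

Under this convention, 2020-10-01 and 2021-09-30 both fall in Arctic year 2021, while 2020-09-30 still belongs to Arctic year 2020.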

Common Issues

  • Memory: Use batch processing and Dask chunking for large datasets
  • GPU OOM: Reduce batch size in inference or training
  • Antimeridian: Use proper handling in spatial/aggregators.py for polar grids
  • Temporal alignment: ERA5 uses Arctic-aligned years (Oct-Sep)
  • CRS: Compute in EPSG:3413, visualize in EPSG:4326
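
For the memory and GPU OOM items, the usual remedy is to process inputs in fixed-size slices so that only one slice is resident at a time. The generator below is a generic sketch of that idea with a hypothetical name, `iter_batches`. It is not taken from the training or inference code.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most ``batch_size`` items.

    Because this is a generator, the ValueError for an invalid batch size
    is raised when iteration starts, not when the function is called.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(iter_batches(list(range(5)), 2))
```

Halving `batch_size` after an out-of-memory error roughly halves the peak memory needed per step, at the cost of more iterations.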

References

For more details, consult: