# Entropice - GitHub Copilot Instructions

## Project Overview

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.

For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).

## Core Technologies

- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy

## Code Style Guidelines

### Python Standards

- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
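
As an illustration of these standards, the sketch below shows a small, fully type-hinted function with a numpy-style docstring. The function name `rts_density` and its definition of density (count per unit area) are assumptions made for this example, not code from the repository.

```python
def rts_density(rts_count: int, cell_area_km2: float) -> float:
    """Compute the RTS density of a single grid cell.

    Hypothetical helper used only to illustrate docstring and typing style.

    Parameters
    ----------
    rts_count : int
        Number of RTS features found in the cell. Must be non-negative.
    cell_area_km2 : float
        Area of the grid cell in square kilometres. Must be positive.

    Returns
    -------
    float
        RTS features per square kilometre.

    Raises
    ------
    ValueError
        If ``rts_count`` is negative or ``cell_area_km2`` is not positive.
    """
    if rts_count < 0:
        raise ValueError("rts_count must be non-negative")
    if cell_area_km2 <= 0:
        raise ValueError("cell_area_km2 must be positive")
    return rts_count / cell_area_km2
```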

### Geospatial Best Practices

- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
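
To make the antimeridian point concrete, here is a minimal pure-Python sketch of one common heuristic: wrap longitudes into [-180, 180) and flag a cell whose consecutive vertices jump by more than 180 degrees. The function names are hypothetical, and this is not necessarily how `aggregators.py` handles the problem. The heuristic also does not cover cells that contain a pole.

```python
def wrap_longitude(lon: float) -> float:
    """Wrap a longitude in degrees into the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def crosses_antimeridian(ring_lons: list[float]) -> bool:
    """Return True if a closed ring of longitudes appears to cross +/-180 degrees.

    Heuristic (an assumption of this sketch): any edge whose endpoints differ
    by more than 180 degrees of longitude is treated as an antimeridian crossing.
    """
    wrapped = [wrap_longitude(lon) for lon in ring_lons]
    # Pair every vertex with its successor, closing the ring at the end.
    edges = zip(wrapped, wrapped[1:] + wrapped[:1])
    return any(abs(end - start) > 180.0 for start, end in edges)
```

A cell flagged this way usually needs to be split or shifted before area or intersection computations in EPSG:4326.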

### Data Pipeline Conventions

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
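
The path convention above could look roughly like the following sketch. Only the `FAST_DATA_DIR` variable, its `./data` default, and the `grids` subdirectory come from this document; the function names `get_data_dir` and `get_grid_dir` are hypothetical, and the real `paths.py` may be organized differently.

```python
import os
from pathlib import Path


def get_data_dir() -> Path:
    """Return the absolute data root, honouring FAST_DATA_DIR (default: ./data)."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()


def get_grid_dir() -> Path:
    """Return the directory holding grid tessellations (hypothetical helper)."""
    return get_data_dir() / "grids"
```

Resolving the path once in a central helper keeps every pipeline stage pointed at the same absolute location, regardless of the working directory.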

### Storage Hierarchy

All data is stored under the data directory (set by `FAST_DATA_DIR`, shown here as `DATA_DIR`) in this structure:
```
DATA_DIR/
├── grids/                    # H3/HEALPix tessellations (GeoParquet)
├── darts/                    # RTS labels (GeoParquet)
├── era5/                     # Climate data (Zarr)
├── arcticdem/                # Terrain data (Icechunk Zarr)
├── alphaearth/               # Satellite embeddings (Zarr)
├── datasets/                 # L2 XDGGS datasets (Zarr)
├── training-results/         # Models, CV results, predictions
└── watermask/                # Ocean mask (GeoParquet)
```

## Module Organization

### Core Modules (`src/entropice/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management

### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`

### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run the entire pipeline for multiple grid configurations
- Each script calls the CLIs exposed by the core modules

### Notebooks (`notebooks/`)

- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`

## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```

### Running Tests

```bash
pixi run pytest
```

### Common Tasks

- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the `xdggs` library for consistent spatial indexing across sources.

### 2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

### 3. GPU Acceleration

Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable

### 4. Configuration as Code

- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing

## Data Sources

- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 64-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer

## Model Support

- **Primary**: eSPA (entropy-optimal Scalable Probabilistic Approximations)
- **Alternatives**: XGBoost, Random Forest, K-Nearest Neighbors

Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
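
For reference, the listed metrics can be written out for the binary case as below. This is a pure-Python sketch for clarity only; the function name `binary_metrics` is hypothetical, and the project most likely relies on library implementations (for example from scikit-learn) rather than code like this. Returning 0.0 for an empty denominator is an assumption of the sketch.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, F1, and Jaccard for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def safe_div(numerator: float, denominator: float) -> float:
        # Assumption: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "jaccard": safe_div(tp, tp + fp + fn),
    }
```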

## Extension Points

To extend Entropice:

- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module

## Important Notes

- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
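
The Arctic-year convention can be sketched as a small date helper. The October 1 to September 30 window comes from this document; the labelling (naming an Arctic year after the calendar year in which it ends) is an assumption of this example, and `era5.py` may label years differently.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Return the Arctic year that contains ``day``.

    Assumed labelling: Arctic year N runs from October 1 of year N-1
    through September 30 of year N.
    """
    return day.year + 1 if day.month >= 10 else day.year
```

Under this assumption, `date(2020, 10, 1)` and `date(2021, 9, 30)` both fall in Arctic year 2021.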

## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
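
The batch-processing advice for memory and GPU OOM issues amounts to iterating over fixed-size slices and lowering the slice size when memory runs out. A minimal sketch, with the hypothetical helper name `iter_batches` (not taken from the repository):

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of ``items`` with at most ``batch_size`` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch can then be moved to the GPU, processed, and released before the next one is loaded, so peak memory scales with `batch_size` rather than with the full dataset.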

## References

For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions