# Entropice - Copilot Instructions

## Project Context

This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.

## Code Style

- Follow PEP 8 conventions with a 120-character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use `ruff` for linting and formatting, `ty` for type checking
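
As a rough illustration of these conventions, the sketch below shows a fully type-hinted function with a Google-style docstring that stays within the 120-character limit. The function `haversine_km` is a hypothetical example and not part of the Entropice codebase.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float, radius_km: float = 6371.0088) -> float:
    """Compute the great-circle distance between two points on a sphere.

    Args:
        lat1: Latitude of the first point in degrees.
        lon1: Longitude of the first point in degrees.
        lat2: Latitude of the second point in degrees.
        lon2: Longitude of the second point in degrees.
        radius_km: Sphere radius in kilometres (defaults to the mean Earth radius).

    Returns:
        The distance between the two points in kilometres.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to guard against floating-point overshoot before taking asin.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))
```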

## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
- **CLI**: Cyclopts with dataclass-based configurations

## Execution Guidelines

- Always use `pixi run` to execute Python commands and scripts
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`

## Geospatial Best Practices

- Use EPSG:3413 (NSIDC Sea Ice Polar Stereographic North) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing
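
One way to approach the antimeridian point above is sketched below, using only the standard library. Both helper names are hypothetical and not part of Entropice. The heuristic assumes that no single edge legitimately spans more than 180° of longitude, and it does not handle rings that enclose the pole, so treat it as a starting point only.

```python
from collections.abc import Sequence


def crosses_antimeridian(lons: Sequence[float]) -> bool:
    """Heuristically detect whether consecutive vertices jump across the antimeridian.

    Args:
        lons: Vertex longitudes in degrees, in the range [-180, 180].

    Returns:
        True if any two consecutive longitudes differ by more than 180 degrees.
    """
    return any(abs(b - a) > 180.0 for a, b in zip(lons, lons[1:]))


def unwrap_longitudes(lons: Sequence[float]) -> list[float]:
    """Shift negative longitudes into [180, 360) when the sequence crosses the antimeridian.

    Args:
        lons: Vertex longitudes in degrees, in the range [-180, 180].

    Returns:
        A new list of longitudes that is continuous across the antimeridian.
    """
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0 else lon for lon in lons]
```

For example, a cell edge running from 175° to -179° is reported as crossing, and unwrapping maps -179° to 181° so the coordinates become continuous.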

## Architecture Patterns

- Modular CLI design: each module exposes a standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization

## Data Storage Hierarchy

```
DATA_DIR/
├── grids/           # H3/HEALPix tessellations (GeoParquet)
├── darts/           # RTS labels (GeoParquet)
├── era5/            # Climate data (Zarr)
├── arcticdem/       # Terrain data (Icechunk Zarr)
├── alphaearth/      # Satellite embeddings (Zarr)
├── datasets/        # L2 XDGGS datasets (Zarr)
└── training-results/ # Models, CV results, predictions
```
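
A minimal sketch of path helpers for this layout follows. It assumes that `DATA_DIR` is resolved from the `FAST_DATA_DIR` environment variable and that L2 datasets are stored as `<name>.zarr`. Both are assumptions made for illustration, and the function names are hypothetical rather than the actual `entropice.utils` API.

```python
import os
from pathlib import Path


def get_data_dir() -> Path:
    """Resolve the data root from FAST_DATA_DIR, falling back to ./data.

    Returns:
        The root directory of the data storage hierarchy.
    """
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def get_l2_dataset_path(name: str) -> Path:
    """Build the path of an L2 dataset store under the data root.

    Args:
        name: Dataset name without a file extension.

    Returns:
        Path to the (assumed) Zarr store inside ``datasets/``.
    """
    return get_data_dir() / "datasets" / f"{name}.zarr"
```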

## Key Modules

- `entropice.spatial`: Grid generation and raster-to-vector aggregation
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- `entropice.ml`: Dataset assembly, training, inference
- `entropice.dashboard`: Streamlit visualization app
- `entropice.utils`: Paths, codecs, types

## Organisation of the Dashboard

- `entropice.dashboard.app`: Main Streamlit app, entry point, imports the pages
- `entropice.dashboard.pages`: Individual dashboard pages - functions which handle data loading and building each page, always called `XXX_page()`
- `entropice.dashboard.sections`: Reusable Streamlit sections - functions which build parts of pages, always called `render_XXX()`
- `entropice.dashboard.plots`: Plotting functions for the dashboard - functions which create plots, always called `create_XXX()` and return plotly figures
- `entropice.dashboard.utils`: Dashboard-specific utilities, also contains data loading functions

### Note on the deprecated `use_container_width` parameter of Streamlit functions

**❌ INCORRECT** (deprecated):
```python
st.plotly_chart(fig, use_container_width=True)
```

**✅ CORRECT** (current API):
```python
st.plotly_chart(fig, width='stretch')
```

**Common width values**:
- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)

This applies to:
- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`

## Testing & Notebooks

- Production code belongs in `src/entropice/`, not notebooks
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
- Use `pytest` for testing geospatial correctness and data integrity
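
A data-integrity test in this spirit could look like the sketch below. The helper `out_of_range_points` and both test functions are hypothetical. The tests use plain `assert` statements in `test_*` functions, which `pytest` discovers without any extra imports.

```python
def out_of_range_points(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the (lon, lat) pairs that fall outside valid WGS84 bounds.

    Args:
        points: Coordinates as (longitude, latitude) pairs in degrees.

    Returns:
        The pairs whose longitude or latitude is out of range.
    """
    return [(lon, lat) for lon, lat in points if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0)]


def test_valid_arctic_points_pass() -> None:
    """Points inside the valid range should not be flagged."""
    points = [(-150.0, 70.5), (179.9, 68.0), (25.0, 89.9)]
    assert out_of_range_points(points) == []


def test_swapped_axes_are_caught() -> None:
    """A (lat, lon) pair passed in the wrong order should be flagged when lon exceeds 90."""
    swapped = (70.0, 150.0)  # intended as lon=150, lat=70
    assert out_of_range_points([swapped]) == [swapped]
```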