# Entropice - Copilot Instructions
## Project Context
This is a geospatial machine learning system for predicting Arctic permafrost degradation, specifically Retrogressive Thaw Slumps (RTS), using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system aggregates multi-source geospatial data onto discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers on the result.
## Code Style
- Follow PEP 8 conventions with a 120-character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use `ruff` for linting and formatting, `ty` for type checking
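
As a minimal sketch of the conventions above (full type hints, a Google-style docstring, one focused task per function), a public function might look like the following. The function name `normalized_entropy` and its behavior are hypothetical and chosen only for illustration; they are not part of the Entropice code base.

```python
import math


def normalized_entropy(probabilities: list[float]) -> float:
    """Compute the Shannon entropy of a distribution, scaled to [0, 1].

    Hypothetical example function, used only to illustrate the style rules.

    Args:
        probabilities: Class probabilities that are non-negative and sum to 1.

    Returns:
        The entropy divided by its maximum ``log(n)``, or 0.0 for a single class.

    Raises:
        ValueError: If the input is empty, contains negatives, or does not sum to 1.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    if any(p < 0.0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    if len(probabilities) == 1:
        return 0.0
    # Zero-probability classes contribute nothing to the entropy.
    entropy = sum(-p * math.log(p) for p in probabilities if p > 0.0)
    return entropy / math.log(len(probabilities))
```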
## Technology Stack
- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA 12 (prefer GPU-accelerated operations)
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
- **CLI**: Cyclopts with dataclass-based configurations
## Execution Guidelines
- Always use `pixi run` to execute Python commands and scripts
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`
## Geospatial Best Practices
- Use EPSG:3413 (NSIDC Sea Ice Polar Stereographic North) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing
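
The antimeridian point above can be illustrated with a small, dependency-free sketch. The helper names `wrap_longitude` and `crosses_antimeridian` are hypothetical, and the check assumes that two consecutive vertices are connected by the shorter of the two possible arcs, which is a common but not universal convention.

```python
def wrap_longitude(lon: float) -> float:
    """Wrap a longitude in degrees into the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def crosses_antimeridian(lon_a: float, lon_b: float) -> bool:
    """Return True if the shorter arc between two longitudes crosses +/-180.

    Assumption: the segment follows the shorter arc, so a wrapped longitude
    difference larger than 180 degrees means the segment wraps around.
    """
    a = wrap_longitude(lon_a)
    b = wrap_longitude(lon_b)
    return abs(a - b) > 180.0


def segments_crossing_antimeridian(lons: list[float]) -> list[int]:
    """Return the indices i where the segment lons[i] -> lons[i + 1] crosses."""
    return [i for i in range(len(lons) - 1) if crosses_antimeridian(lons[i], lons[i + 1])]
```

A geometry flagged this way would typically be split at the antimeridian or reprojected to EPSG:3413 before areas or distances are computed.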
## Architecture Patterns
- Modular CLI design: each module exposes a standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization
## Data Storage Hierarchy
```
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
└── training-results/ # Models, CV results, predictions
```
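
One way to keep this layout in a single place is a small path helper. The sketch below assumes that `DATA_DIR` is taken from the `FAST_DATA_DIR` environment variable mentioned under Execution Guidelines, which this document does not state explicitly; the function names are hypothetical and do not describe the actual contents of `entropice.utils`.

```python
import os
from pathlib import Path

# Subdirectory names copied from the hierarchy above.
SUBDIRS = ("grids", "darts", "era5", "arcticdem", "alphaearth", "datasets", "training-results")


def data_dir() -> Path:
    """Return the data root, assumed to come from FAST_DATA_DIR (default ./data)."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def data_subdir(name: str) -> Path:
    """Return the path of a known subdirectory below the data root.

    Raises:
        ValueError: If ``name`` is not part of the documented hierarchy.
    """
    if name not in SUBDIRS:
        raise ValueError(f"unknown data subdirectory: {name!r}")
    return data_dir() / name
```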
## Key Modules
- `entropice.spatial`: Grid generation and raster-to-vector aggregation
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- `entropice.ml`: Dataset assembly, training, inference
- `entropice.dashboard`: Streamlit visualization app
- `entropice.utils`: Paths, codecs, types
## Testing & Notebooks
- Production code belongs in `src/entropice/`, not notebooks
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
- Use `pytest` for testing geospatial correctness and data integrity
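
A data-integrity test in this spirit might look like the sketch below. The `CellRow` record, the `find_invalid_rows` helper, and the cell IDs are hypothetical stand-ins for real grid tables; the `test_*` functions use plain `assert` statements, which `pytest` collects and runs without further setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellRow:
    """Hypothetical grid-cell record: an ID plus a WGS84 centroid in degrees."""

    cell_id: str
    lat: float
    lon: float


def find_invalid_rows(rows: list[CellRow]) -> list[int]:
    """Return indices of rows with a repeated cell ID or out-of-range coordinates."""
    seen: set[str] = set()
    invalid: list[int] = []
    for index, row in enumerate(rows):
        in_range = -90.0 <= row.lat <= 90.0 and -180.0 <= row.lon <= 180.0
        if row.cell_id in seen or not in_range:
            invalid.append(index)
        seen.add(row.cell_id)
    return invalid


def test_valid_rows_pass() -> None:
    rows = [CellRow("cell-a", 70.0, -150.0), CellRow("cell-b", 71.5, 179.9)]
    assert find_invalid_rows(rows) == []


def test_duplicates_and_bad_coordinates_are_flagged() -> None:
    rows = [
        CellRow("cell-a", 70.0, -150.0),
        CellRow("cell-a", 70.0, -150.0),  # duplicate ID
        CellRow("cell-c", 95.0, 10.0),  # latitude out of range
    ]
    assert find_invalid_rows(rows) == [1, 2]
```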