entropice/.github/copilot-instructions.md

Entropice - Copilot Instructions

Project Context

This is a geospatial machine learning system for predicting Arctic permafrost degradation, specifically retrogressive thaw slumps (RTS), using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system aggregates multi-source geospatial data onto discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers on the result.

Code Style

  • Follow PEP 8 conventions with 120 character line length
  • Use type hints for all function signatures
  • Use Google-style docstrings for public functions
  • Keep functions focused and modular
  • Use ruff for linting and formatting, ty for type checking
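As an illustration of the conventions above, here is a minimal sketch of a fully typed function with a Google-style docstring that stays within the 120 character limit. The function `haversine_km` and its default radius are hypothetical examples chosen for illustration, not part of the Entropice codebase.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float, radius_km: float = 6371.0088) -> float:
    """Compute the great-circle distance between two points on a sphere.

    Args:
        lat1: Latitude of the first point in degrees.
        lon1: Longitude of the first point in degrees.
        lat2: Latitude of the second point in degrees.
        lon2: Longitude of the second point in degrees.
        radius_km: Sphere radius in kilometres (defaults to the mean Earth radius).

    Returns:
        The great-circle distance in kilometres.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```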

Technology Stack

  • Core: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
  • Spatial: H3, xdggs, xvec for discrete global grid systems
  • ML: scikit-learn, XGBoost, cuML, entropy (eSPA)
  • GPU: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
  • Storage: Zarr, Icechunk, Parquet for intermediate data
  • CLI: Cyclopts with dataclass-based configurations

Execution Guidelines

  • Always use pixi run to execute Python commands and scripts
  • Environment variables: SCIPY_ARRAY_API=1, FAST_DATA_DIR=./data

Geospatial Best Practices

  • Use EPSG:3413 (WGS 84 / NSIDC Sea Ice Polar Stereographic North) for computations
  • Use EPSG:4326 (WGS84) for visualization and compatibility
  • Store gridded data as XDGGS Xarray datasets (Zarr format)
  • Store tabular data as GeoParquet
  • Handle antimeridian issues in polar regions
  • Leverage Xarray/Dask for lazy evaluation and chunked processing
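The antimeridian point above can be made concrete with a small standard-library sketch. The helper names (`wrap_longitude`, `crosses_antimeridian`, `unwrap_ring`) are hypothetical, and the crossing test is a heuristic that assumes no polygon edge legitimately spans more than 180 degrees of longitude; polygons that contain the pole need separate handling.

```python
def wrap_longitude(lon: float) -> float:
    """Wrap a longitude in degrees into the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def crosses_antimeridian(lons: list[float]) -> bool:
    """Return True if any pair of consecutive vertices jumps by more than 180 degrees.

    Heuristic: assumes real edges are shorter than 180 degrees of longitude.
    """
    return any(abs(b - a) > 180.0 for a, b in zip(lons, lons[1:]))


def unwrap_ring(lons: list[float]) -> list[float]:
    """Shift negative longitudes by +360 so a crossing ring becomes contiguous in [0, 360)."""
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0.0 else lon for lon in lons]


ring = [170.0, 179.0, -179.0, -170.0]
print(crosses_antimeridian(ring))
print(unwrap_ring(ring))
```

Reprojecting to EPSG:3413 before computing geometry sidesteps most of these cases; a check like this is mainly useful while data is still in EPSG:4326.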

Architecture Patterns

  • Modular CLI design: each module exposes a standalone Cyclopts CLI
  • Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
  • GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
  • Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization

Data Storage Hierarchy

DATA_DIR/
├── grids/           # H3/HEALPix tessellations (GeoParquet)
├── darts/           # RTS labels (GeoParquet)
├── era5/            # Climate data (Zarr)
├── arcticdem/       # Terrain data (Icechunk Zarr)
├── alphaearth/      # Satellite embeddings (Zarr)
├── datasets/        # L2 XDGGS datasets (Zarr)
└── training-results/ # Models, CV results, predictions
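A path helper for the layout above might be sketched as follows. It assumes that the `FAST_DATA_DIR` environment variable points at the root shown as `DATA_DIR/`, which the source does not state explicitly; the names `data_root` and `data_path` and the example file name are hypothetical.

```python
import os
from pathlib import Path

# Subdirectory names copied from the tree above.
SUBDIRS = ("grids", "darts", "era5", "arcticdem", "alphaearth", "datasets", "training-results")


def data_root() -> Path:
    """Return the data root. Assumption: FAST_DATA_DIR is the root, with ./data as fallback."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def data_path(subdir: str, *parts: str) -> Path:
    """Build a path below one of the known top-level data directories.

    Raises:
        ValueError: If subdir is not part of the documented hierarchy.
    """
    if subdir not in SUBDIRS:
        raise ValueError(f"Unknown data subdirectory: {subdir!r}")
    return data_root().joinpath(subdir, *parts)


# "example.zarr" is a placeholder file name, not an actual dataset.
print(data_path("era5", "example.zarr"))
```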

Key Modules

  • entropice.spatial: Grid generation and raster-to-vector aggregation
  • entropice.ingest: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
  • entropice.ml: Dataset assembly, training, inference
  • entropice.dashboard: Streamlit visualization app
  • entropice.utils: Paths, codecs, types

Testing & Notebooks

  • Production code belongs in src/entropice/, not notebooks
  • Notebooks in notebooks/ are for exploration only (not version-controlled)
  • Use pytest for testing geospatial correctness and data integrity
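A data-integrity test in this spirit could be sketched as below. The helper `find_invalid_coordinates` is hypothetical and would normally live in `src/entropice/`; the test functions use plain `assert` statements, which pytest collects by the `test_` prefix.

```python
import math


def find_invalid_coordinates(points: list[tuple[float, float]]) -> list[int]:
    """Return the indices of (lon, lat) pairs that are non-finite or outside WGS84 bounds."""
    bad: list[int] = []
    for i, (lon, lat) in enumerate(points):
        finite = math.isfinite(lon) and math.isfinite(lat)
        if not finite or not (-180.0 <= lon <= 180.0) or not (-90.0 <= lat <= 90.0):
            bad.append(i)
    return bad


def test_valid_points_pass() -> None:
    assert find_invalid_coordinates([(-150.0, 70.5), (179.9, 89.0)]) == []


def test_invalid_points_are_reported() -> None:
    points = [(0.0, 95.0), (float("nan"), 60.0), (10.0, 65.0), (181.0, 70.0)]
    assert find_invalid_coordinates(points) == [0, 1, 3]
```

Following the execution guideline, such tests would presumably be run with `pixi run pytest`, assuming pytest is installed in the pixi environment.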