Entropice - Copilot Instructions
Project Context
This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.
Code Style
- Follow PEP 8 conventions with 120 character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use ruff for linting and formatting, ty for type checking
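As a minimal sketch of these conventions, the helper below combines a fully typed signature with a Google-style docstring. The function name and its behavior are hypothetical and chosen only for illustration; it is not part of the entropice codebase.

```python
from collections.abc import Sequence


def positive_fraction(labels: Sequence[int], positive: int = 1) -> float:
    """Compute the fraction of labels equal to the positive class.

    Hypothetical helper for illustration only; not part of entropice.

    Args:
        labels: Integer class labels, one per grid cell.
        positive: Value that marks the positive class.

    Returns:
        Fraction of entries equal to ``positive``, between 0 and 1.

    Raises:
        ValueError: If ``labels`` is empty.
    """
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == positive) / len(labels)
```

The function stays focused on one task, which keeps it easy to test in isolation.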
Technology Stack
- Core: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- Spatial: H3, xdggs, xvec for discrete global grid systems
- ML: scikit-learn, XGBoost, cuML, entropy (eSPA)
- GPU: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- Storage: Zarr, Icechunk, Parquet for intermediate data
- CLI: Cyclopts with dataclass-based configurations
Execution Guidelines
- Always use pixi run to execute Python commands and scripts
- Environment variables: SCIPY_ARRAY_API=1, FAST_DATA_DIR=./data
Geospatial Best Practices
- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing
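The antimeridian point can be illustrated with a small standard-library sketch. It works on longitudes in degrees (geographic coordinates such as EPSG:4326) and uses a simple heuristic: a jump of more than 180 degrees between consecutive vertices suggests a crossing. Both function names are hypothetical, and the heuristic assumes every edge spans less than 180 degrees; it does not cover geometries that enclose a pole.

```python
def wrap_longitude(lon: float) -> float:
    """Normalize a longitude in degrees to the half-open range [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def crosses_antimeridian(lons: list[float]) -> bool:
    """Return True if consecutive vertices jump by more than 180 degrees.

    Heuristic only: assumes each edge is shorter than 180 degrees of longitude.
    """
    wrapped = [wrap_longitude(lon) for lon in lons]
    return any(abs(b - a) > 180.0 for a, b in zip(wrapped, wrapped[1:]))
```

Geometries flagged this way can then be split or shifted before any planar operation in geographic coordinates.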
Architecture Patterns
- Modular CLI design: each module exposes a standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization
Data Storage Hierarchy
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
└── training-results/ # Models, CV results, predictions
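A path helper for this layout might be sketched as follows. It assumes that the FAST_DATA_DIR environment variable points at DATA_DIR and falls back to ./data; the function names are hypothetical and do not necessarily match what entropice.utils provides.

```python
import os
from pathlib import Path

# Subdirectory names taken from the hierarchy above.
SUBDIRS = (
    "grids",
    "darts",
    "era5",
    "arcticdem",
    "alphaearth",
    "datasets",
    "training-results",
)


def data_dir() -> Path:
    """Return the data root (assumption: FAST_DATA_DIR points at DATA_DIR)."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def subdir(name: str) -> Path:
    """Return the path of a known subdirectory below the data root."""
    if name not in SUBDIRS:
        raise KeyError(f"unknown data subdirectory: {name}")
    return data_dir() / name
```

Resolving the root at call time rather than import time lets tests point the helper at a temporary directory.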
Key Modules
- entropice.spatial: Grid generation and raster-to-vector aggregation
- entropice.ingest: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- entropice.ml: Dataset assembly, training, inference
- entropice.dashboard: Streamlit visualization app
- entropice.utils: Paths, codecs, types
Testing & Notebooks
- Production code belongs in src/entropice/, not notebooks
- Notebooks in notebooks/ are for exploration only (not version-controlled)
- Use pytest for testing geospatial correctness and data integrity
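A data-integrity test in the pytest style could be sketched as below. The record layout (cell_id, lat) and the validate_cells helper are assumptions made for illustration. The test functions use plain assert statements, which pytest collects from any function whose name starts with test_.

```python
def validate_cells(cells: list[dict]) -> list[str]:
    """Return human-readable problems found in hypothetical grid-cell records."""
    problems: list[str] = []
    seen: set[str] = set()
    for cell in cells:
        cell_id = cell["cell_id"]
        if cell_id in seen:
            problems.append(f"duplicate cell_id: {cell_id}")
        seen.add(cell_id)
        if not -90.0 <= cell["lat"] <= 90.0:
            problems.append(f"latitude out of range for {cell_id}: {cell['lat']}")
    return problems


def test_valid_cells_pass() -> None:
    cells = [{"cell_id": "a", "lat": 70.5}, {"cell_id": "b", "lat": 81.0}]
    assert validate_cells(cells) == []


def test_duplicates_and_bad_latitudes_are_reported() -> None:
    cells = [
        {"cell_id": "a", "lat": 70.5},
        {"cell_id": "a", "lat": 95.0},
    ]
    assert validate_cells(cells) == [
        "duplicate cell_id: a",
        "latitude out of range for a: 95.0",
    ]
```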