# Entropice - Copilot Instructions

## Project Context

This is a geospatial machine learning system for predicting Arctic permafrost degradation, specifically Retrogressive Thaw Slumps (RTS), using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.
## Code Style

- Follow PEP 8 conventions with a 120-character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use `ruff` for linting and formatting, `ty` for type checking
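
As a minimal sketch of these conventions, the helper below carries full type hints and a Google-style docstring. The function name and parameters are hypothetical and chosen only for illustration; they are not part of the project.

```python
import math


def shannon_entropy(probabilities: list[float], base: float = 2.0) -> float:
    """Compute the Shannon entropy of a discrete probability distribution.

    Args:
        probabilities: Non-negative weights that sum to 1.
        base: Logarithm base; 2.0 yields bits.

    Returns:
        The entropy of the distribution in units set by ``base``.

    Raises:
        ValueError: If the probabilities do not sum to 1.
    """
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0.0)
```

The docstring sections (`Args`, `Returns`, `Raises`) follow the Google layout, and every line stays well under the 120-character limit.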
## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
- **CLI**: Cyclopts with dataclass-based configurations
## Execution Guidelines

- Always use `pixi run` to execute Python commands and scripts
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`
## Geospatial Best Practices

- Use EPSG:3413 (NSIDC Sea Ice Polar Stereographic North) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle geometries that cross the antimeridian (±180° longitude) explicitly; such crossings are common in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing
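
One way to handle an antimeridian crossing, sketched in plain Python: detect a coordinate ring whose consecutive longitudes jump by more than 180° and shift its negative longitudes into a 0° to 360° range so the ring stays contiguous. Both helper names are hypothetical, and this is an illustration of the idea rather than the project's implementation.

```python
def crosses_antimeridian(lons: list[float]) -> bool:
    """Return True if consecutive longitudes jump by more than 180 degrees."""
    return any(abs(b - a) > 180.0 for a, b in zip(lons, lons[1:]))


def unwrap_longitudes(lons: list[float]) -> list[float]:
    """Shift negative longitudes by 360 degrees when the ring crosses the antimeridian.

    Args:
        lons: Longitudes of a ring in degrees, each within [-180, 180].

    Returns:
        A new list; unchanged values if no crossing is detected.
    """
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0.0 else lon for lon in lons]


# A small ring straddling the antimeridian (WGS84 longitudes).
ring = [179.0, -179.0, -178.0, 179.0]
unwrapped = unwrap_longitudes(ring)
```

A ring that encloses the pole itself is a separate case that this jump-based check does not resolve.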
## Architecture Patterns

- Modular CLI design: each module exposes a standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization
## Data Storage Hierarchy

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
└── training-results/   # Models, CV results, predictions
```
## Key Modules

- `entropice.spatial`: Grid generation and raster-to-vector aggregation
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- `entropice.ml`: Dataset assembly, training, inference
- `entropice.dashboard`: Streamlit visualization app
- `entropice.utils`: Paths, codecs, types
## Testing & Notebooks

- Production code belongs in `src/entropice/`, not notebooks
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
- Use `pytest` for testing geospatial correctness and data integrity
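
A data-integrity test in the pytest style might look like the sketch below. The `validate_cells` helper and the record keys (`cell_id`, `lat`, `lon`) are assumptions for illustration only. pytest collects functions named `test_*` and reports failing plain `assert` statements, so the file needs no pytest import.

```python
def validate_cells(cells: list[dict]) -> list[str]:
    """Return human-readable problems found in a list of grid-cell records.

    Args:
        cells: Records assumed (for this sketch) to have "cell_id", "lat", and "lon" keys.

    Returns:
        One message per problem, in the order the problems were found.
    """
    problems: list[str] = []
    seen: set[str] = set()
    for cell in cells:
        cell_id = cell["cell_id"]
        if cell_id in seen:
            problems.append(f"duplicate cell_id: {cell_id}")
        seen.add(cell_id)
        if not -90.0 <= cell["lat"] <= 90.0:
            problems.append(f"latitude out of range: {cell_id}")
        if not -180.0 <= cell["lon"] <= 180.0:
            problems.append(f"longitude out of range: {cell_id}")
    return problems


def test_valid_cells_pass() -> None:
    """Well-formed Arctic cells produce no problems."""
    cells = [
        {"cell_id": "a", "lat": 70.5, "lon": -150.0},
        {"cell_id": "b", "lat": 71.0, "lon": 179.9},
    ]
    assert validate_cells(cells) == []


def test_bad_cells_are_reported() -> None:
    """Out-of-range coordinates and duplicate IDs are both flagged."""
    cells = [
        {"cell_id": "a", "lat": 95.0, "lon": 0.0},
        {"cell_id": "a", "lat": 70.0, "lon": 0.0},
    ]
    assert validate_cells(cells) == ["latitude out of range: a", "duplicate cell_id: a"]
```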