Entropice - Copilot Instructions
Project Context
This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.
Code Style
- Follow PEP 8 conventions with 120 character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use ruff for linting and formatting, ty for type checking
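As an illustration of the conventions above, the following sketch shows a small public function with full type hints and a Google-style docstring. The function name shannon_entropy and its signature are chosen for illustration only and are not taken from the Entropice codebase.

```python
import math
from collections.abc import Sequence


def shannon_entropy(probabilities: Sequence[float], base: float = 2.0) -> float:
    """Compute the Shannon entropy of a discrete probability distribution.

    Args:
        probabilities: Non-negative class probabilities that sum to 1.
        base: Logarithm base; 2.0 gives the result in bits.

    Returns:
        The entropy of the distribution in units determined by ``base``.

    Raises:
        ValueError: If a probability is negative or the values do not sum to 1.
    """
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability terms are skipped: the limit of p * log(p) as p -> 0 is 0.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)
```

The function does one thing, validates its inputs up front, and documents arguments, return value, and raised exceptions in separate docstring sections.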
Technology Stack
- Core: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- Spatial: H3, xdggs, xvec for discrete global grid systems
- ML: scikit-learn, XGBoost, cuML, entropy (eSPA)
- GPU: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- Storage: Zarr, Icechunk, Parquet for intermediate data
- CLI: Cyclopts with dataclass-based configurations
Execution Guidelines
- Always use pixi run to execute Python commands and scripts
- Environment variables: SCIPY_ARRAY_API=1, FAST_DATA_DIR=./data
Geospatial Best Practices
- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing
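One way to approach the antimeridian point above is sketched here using only the standard library. The helper names (normalize_longitude, crosses_antimeridian, unwrap_longitudes) are hypothetical and do not come from the project. The crossing test is a simple heuristic: it treats a jump of more than 180 degrees between neighbouring vertices as a crossing, and it does not handle rings that contain a pole.

```python
from collections.abc import Sequence


def normalize_longitude(lon: float) -> float:
    """Wrap a longitude in degrees into the half-open interval [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def crosses_antimeridian(longitudes: Sequence[float]) -> bool:
    """Heuristically detect a ring whose consecutive vertices jump across +/-180 degrees."""
    return any(abs(b - a) > 180.0 for a, b in zip(longitudes, longitudes[1:]))


def unwrap_longitudes(longitudes: Sequence[float]) -> list[float]:
    """Shift negative longitudes by 360 degrees so a crossing ring becomes contiguous.

    Rings that do not cross the antimeridian are returned unchanged.
    """
    if not crosses_antimeridian(longitudes):
        return list(longitudes)
    return [lon + 360.0 if lon < 0 else lon for lon in longitudes]


# A closed ring (WGS84 longitudes) straddling the antimeridian.
ring = [179.0, -179.0, -178.0, 179.0]
print(crosses_antimeridian(ring), unwrap_longitudes(ring))
```

Working in EPSG:3413 for computations avoids most of these cases, since the projection has no seam at 180 degrees. A check like this is mainly useful at the boundary where data arrives or leaves in EPSG:4326.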
Architecture Patterns
- Modular CLI design: each module exposes standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization
Data Storage Hierarchy
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
└── training-results/ # Models, CV results, predictions
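A minimal sketch of path helpers for this layout follows. It assumes that DATA_DIR is resolved from the FAST_DATA_DIR environment variable with ./data as the fallback; that link is an assumption, as are the helper names get_data_dir and get_subdir. The subdirectory names are copied from the tree above.

```python
import os
from pathlib import Path

# Subdirectory names taken from the storage hierarchy above.
SUBDIRS = ("grids", "darts", "era5", "arcticdem", "alphaearth", "datasets", "training-results")


def get_data_dir() -> Path:
    """Return the data root, assumed here to come from FAST_DATA_DIR (default ./data)."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def get_subdir(name: str, create: bool = False) -> Path:
    """Return the path of a known subdirectory under the data root.

    Args:
        name: One of the names in ``SUBDIRS``.
        create: If True, create the directory (and parents) when it is missing.

    Returns:
        The subdirectory path.

    Raises:
        ValueError: If ``name`` is not part of the documented hierarchy.
    """
    if name not in SUBDIRS:
        raise ValueError(f"unknown data subdirectory: {name!r}")
    path = get_data_dir() / name
    if create:
        path.mkdir(parents=True, exist_ok=True)
    return path


print(get_subdir("era5"))
```

Routing every path through one helper keeps the hierarchy in a single place and rejects misspelled directory names before anything is written.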
Key Modules
- entropice.spatial: Grid generation and raster-to-vector aggregation
- entropice.ingest: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- entropice.ml: Dataset assembly, training, inference
- entropice.dashboard: Streamlit visualization app
- entropice.utils: Paths, codecs, types
Organisation of the Dashboard
- entropice.dashboard.app: Main Streamlit app and entry point; imports the pages
- entropice.dashboard.pages: Individual dashboard pages. These are functions that handle data loading and build each page, always named XXX_page()
- entropice.dashboard.sections: Reusable Streamlit sections. These are functions that build parts of pages, always named render_XXX()
- entropice.dashboard.plots: Plotting functions for the dashboard. These are functions that create plots, always named create_XXX(), and return Plotly figures
- entropice.dashboard.utils: Dashboard-specific utilities; also contains data loading functions
Note on the deprecated use_container_width parameter of Streamlit functions
❌ INCORRECT (deprecated):
st.plotly_chart(fig, use_container_width=True)
✅ CORRECT (current API):
st.plotly_chart(fig, width='stretch')
Common width values:
- width='stretch' - Use full container width (replaces use_container_width=True)
- width='content' - Use content width (replaces use_container_width=False)
This applies to:
- st.plotly_chart()
- st.altair_chart()
- st.vega_lite_chart()
- st.dataframe()
- st.image()
Testing & Notebooks
- Production code belongs in src/entropice/, not notebooks
- Notebooks in notebooks/ are for exploration only (not version-controlled)
- Use pytest for testing geospatial correctness and data integrity
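A data-integrity test in this spirit might look like the sketch below. The validate_cells helper and the record fields (cell_id, lat, lon) are hypothetical and stand in for whatever the real grid tables contain. The test functions use plain assert statements, so pytest can collect them without any pytest-specific imports.

```python
def validate_cells(cells: list[dict]) -> list[str]:
    """Return a list of integrity problems found in grid-cell records.

    Args:
        cells: Records with ``cell_id``, ``lat`` and ``lon`` keys (WGS84 degrees).

    Returns:
        Human-readable problem descriptions; an empty list means the records passed.
    """
    problems: list[str] = []
    seen: set[str] = set()
    for cell in cells:
        cell_id = cell["cell_id"]
        if cell_id in seen:
            problems.append(f"duplicate cell_id: {cell_id}")
        seen.add(cell_id)
        if not -90.0 <= cell["lat"] <= 90.0:
            problems.append(f"latitude out of range for {cell_id}: {cell['lat']}")
        if not -180.0 <= cell["lon"] <= 180.0:
            problems.append(f"longitude out of range for {cell_id}: {cell['lon']}")
    return problems


def test_valid_cells_have_no_problems() -> None:
    cells = [
        {"cell_id": "a", "lat": 70.5, "lon": -150.0},
        {"cell_id": "b", "lat": 71.0, "lon": 179.9},
    ]
    assert validate_cells(cells) == []


def test_duplicates_and_bad_coordinates_are_reported() -> None:
    cells = [
        {"cell_id": "a", "lat": 70.5, "lon": -150.0},
        {"cell_id": "a", "lat": 95.0, "lon": 200.0},
    ]
    assert len(validate_cells(cells)) == 3
```

Following the execution guideline above, such tests would be run through the project environment, for example with pixi run pytest, assuming pytest is installed in it.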