# Entropice - Copilot Instructions

## Project Context

This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.

## Code Style

- Follow PEP 8 conventions with 120 character line length
- Use type hints for all function signatures
- Use Google-style docstrings for public functions
- Keep functions focused and modular
- Use `ruff` for linting and formatting, `ty` for type checking

## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
- **CLI**: Cyclopts with dataclass-based configurations

## Execution Guidelines

- Always use `pixi run` to execute Python commands and scripts
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`

## Geospatial Best Practices

- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing

## Architecture Patterns

- Modular CLI design: each module exposes a standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization

## Data Storage Hierarchy

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
└── training-results/   # Models, CV results, predictions
```

## Key Modules

- `entropice.spatial`: Grid generation and raster-to-vector aggregation
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- `entropice.ml`: Dataset assembly, training, inference
- `entropice.dashboard`: Streamlit visualization app
- `entropice.utils`: Paths, codecs, types

## Organisation of the Dashboard

- `entropice.dashboard.app`: Main Streamlit app, entry point, imports the pages
- `entropice.dashboard.pages`: Individual dashboard pages - functions which handle data loading and building each page, always called `XXX_page()`
- `entropice.dashboard.sections`: Reusable Streamlit sections - functions which build parts of pages, always called `render_XXX()`
- `entropice.dashboard.plots`: Plotting functions for the dashboard - functions which create plots, always called `create_XXX()` and return Plotly figures
- `entropice.dashboard.utils`: Dashboard-specific utilities, also contains data loading functions

### Note on deprecated use_container_width parameter of Streamlit functions

**❌ INCORRECT** (deprecated):

```python
st.plotly_chart(fig, use_container_width=True)
```

**✅ CORRECT** (current API):

```python
st.plotly_chart(fig, width='stretch')
```

**Common width values**:

- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)

This applies to:

- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`

## Testing & Notebooks

- Production code belongs in `src/entropice/`, not notebooks
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
- Use `pytest` for testing geospatial correctness and data integrity
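
As a minimal sketch of what a geospatial-correctness test might look like, the following pytest-style module checks longitude wrapping and the splitting of a longitude span at the antimeridian. The helper names (`wrap_longitude`, `split_at_antimeridian`) are hypothetical and are not part of `entropice`; real tests would target the project's own functions.

```python
"""Sketch: antimeridian helpers with pytest-style tests (all names are hypothetical)."""


def wrap_longitude(lon: float) -> float:
    """Wrap a longitude in degrees into the half-open interval [-180, 180).

    Args:
        lon: Longitude in degrees, possibly outside [-180, 180).

    Returns:
        The equivalent longitude in [-180, 180).
    """
    return (lon + 180.0) % 360.0 - 180.0


def split_at_antimeridian(west: float, east: float) -> list[tuple[float, float]]:
    """Split a west-to-east longitude span into parts that do not cross +/-180 degrees.

    Args:
        west: Western edge of the span in degrees.
        east: Eastern edge of the span in degrees.

    Returns:
        One (west, east) tuple if the span stays on one side of the antimeridian, otherwise two.
    """
    west, east = wrap_longitude(west), wrap_longitude(east)
    if west <= east:
        # The span does not cross the antimeridian.
        return [(west, east)]
    # The span crosses +/-180 degrees: cut it into an eastern and a western part.
    return [(west, 180.0), (-180.0, east)]


def test_wrap_longitude() -> None:
    assert wrap_longitude(190.0) == -170.0
    assert wrap_longitude(-190.0) == 170.0
    # 180 degrees maps to -180 because the interval is half-open.
    assert wrap_longitude(180.0) == -180.0


def test_split_at_antimeridian() -> None:
    assert split_at_antimeridian(170.0, -170.0) == [(170.0, 180.0), (-180.0, -170.0)]
    assert split_at_antimeridian(-20.0, 30.0) == [(-20.0, 30.0)]
```

If saved as, say, `tests/test_antimeridian.py` (an assumed path), this could be run with `pixi run pytest tests/test_antimeridian.py`, assuming `pytest` is installed in the pixi environment. Note that in this sketch a span ending exactly at 180 degrees produces a degenerate second part at -180, which a production helper would likely want to filter out.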