Entropice - GitHub Copilot Instructions
Project Overview
Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see ARCHITECTURE.md. For contributing guidelines, see CONTRIBUTING.md. For project goals and setup, see README.md.
Core Technologies
- Python: 3.13 (strict version requirement)
- Package Manager: Pixi (not pip/conda directly)
- GPU: CUDA 12 with RAPIDS (CuPy, cuML)
- Geospatial: xarray, xdggs, GeoPandas, H3, Rasterio
- ML: scikit-learn, XGBoost, entropy (eSPA), cuML
- Storage: Zarr, Icechunk, Parquet, NetCDF
- Visualization: Streamlit, Bokeh, Matplotlib, Cartopy
Code Style Guidelines
Python Standards
- Follow PEP 8 conventions
- Use type hints for all function signatures
- Write numpy-style docstrings for public functions
- Keep functions focused and modular
- Prefer descriptive variable names over abbreviations
Geospatial Best Practices
- Use EPSG:3413 (NSIDC Sea Ice Polar Stereographic North) for computations
- Use EPSG:4326 (WGS84) for visualization and library compatibility
- Store gridded data using xarray with XDGGS indexing
- Store tabular data as Parquet, array data as Zarr
- Leverage Dask for lazy evaluation of large datasets
- Use GeoPandas for vector operations
- Handle antimeridian correctly for polar regions
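To illustrate the antimeridian point, here is a minimal, dependency-free sketch. It is not the handling used in spatial/aggregators.py; the function names and the sample ring are hypothetical. The heuristic assumes every cell edge spans less than 180 degrees of longitude and does not cover cells that contain a pole.

```python
def crosses_antimeridian(lons: list[float]) -> bool:
    """Return True if two consecutive longitudes differ by more than 180 degrees."""
    return any(abs(b - a) > 180.0 for a, b in zip(lons, lons[1:]))


def unwrap_longitudes(lons: list[float]) -> list[float]:
    """Shift negative longitudes by +360 so a crossing ring becomes contiguous."""
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0 else lon for lon in lons]


# A closed cell ring straddling 180 degrees (hypothetical coordinates, in degrees).
ring = [179.5, -179.5, -179.0, 179.0, 179.5]
print(crosses_antimeridian(ring))
print(unwrap_longitudes(ring))
```

Unwrapped longitudes fall in the 0 to 360 range, so they may need to be wrapped back to -180 to 180 before being passed to libraries that expect standard WGS84 coordinates.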
Data Pipeline Conventions
- Follow the numbered script sequence: 00grids.sh → 01darts.sh → 02alphaearth.sh → 03era5.sh → 04arcticdem.sh → 05train.sh
- Each pipeline stage should produce reproducible intermediate outputs
- Use src/entropice/utils/paths.py for consistent path management
- Environment variable FAST_DATA_DIR controls data directory location (default: ./data)
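As a sketch of how environment-driven path management can look, the snippet below resolves the data root from FAST_DATA_DIR with the documented ./data default. The helper names and the file naming scheme are assumptions for illustration and are not taken from utils/paths.py.

```python
import os
from pathlib import Path


def get_data_dir() -> Path:
    """Resolve the data root from FAST_DATA_DIR, defaulting to ./data."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()


def grid_file(grid: str, level: int) -> Path:
    """Build a hypothetical GeoParquet path for one grid configuration."""
    return get_data_dir() / "grids" / f"{grid}-{level}.parquet"


print(grid_file("h3", 4))
```

Resolving to an absolute path once, in a single helper, keeps every pipeline stage pointing at the same directory regardless of the working directory it is launched from.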
Storage Hierarchy
All data follows this structure:
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions
└── watermask/ # Ocean mask (GeoParquet)
Module Organization
Core Modules (src/entropice/)
The codebase is organized into four main packages:
- entropice.ingest: Data ingestion from external sources
- entropice.spatial: Spatial operations and grid management
- entropice.ml: Machine learning workflows
- entropice.utils: Common utilities
Data Ingestion (src/entropice/ingest/)
- darts.py: RTS label extraction from DARTS v2 dataset
- era5.py: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- arcticdem.py: Terrain analysis from 32m Arctic elevation data
- alphaearth.py: Satellite image embeddings via Google Earth Engine
Spatial Operations (src/entropice/spatial/)
- grids.py: H3/HEALPix spatial grid generation with watermask
- aggregators.py: Raster-to-vector spatial aggregation engine
- watermask.py: Ocean masking utilities
- xvec.py: Extended vector operations for xarray
Machine Learning (src/entropice/ml/)
- dataset.py: Multi-source data integration and feature engineering
- training.py: Model training with eSPA, XGBoost, Random Forest, KNN
- inference.py: Batch prediction pipeline for trained models
Utilities (src/entropice/utils/)
- paths.py: Centralized path management
- codecs.py: Custom codecs for data serialization
Dashboard (src/entropice/dashboard/)
- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: pixi run dashboard
Scripts (scripts/)
- Numbered pipeline scripts (00grids.sh through 05train.sh)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules
Notebooks (notebooks/)
- Exploratory analysis and validation
- NOT committed to git
- Keep production code in src/entropice/
Development Workflow
Setup
pixi install # NOT pip install or conda install
Running Tests
pixi run pytest
Running Python Commands
Always execute Python commands through pixi run so they use the correct environment:
pixi run python script.py
pixi run python -c "import entropice"
Common Tasks
Important: Always prefix Python commands with pixi run to ensure the correct environment.
- Generate grids: Use pixi run create-grid or spatial/grids.py CLI
- Process labels: Use pixi run darts or ingest/darts.py CLI
- Train models: Use pixi run train with TOML config or ml/training.py CLI
- Run inference: Use ml/inference.py CLI
- View results: pixi run dashboard
Key Design Patterns
1. XDGGS Indexing
All geospatial data uses discrete global grid systems (H3 or HEALPix) via xdggs library for consistent spatial indexing across sources.
2. Lazy Evaluation
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
Data Sources
- DARTS v2: RTS labels (year, area, count, density)
- ERA5: Climate data (40-year history, Arctic-aligned years)
- ArcticDEM: 32m resolution terrain (slope, aspect, indices)
- AlphaEarth: 64-dimensional satellite embeddings
- Watermask: Ocean exclusion layer
Model Support
Primary: eSPA (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
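To make the relationship between the listed metrics concrete, here is a dependency-free sketch that derives all five from confusion counts. It assumes a binary labelling (1 = RTS present, 0 = absent), which is an assumption about the task setup; the function name and sample labels are hypothetical. In practice the equivalent functions in sklearn.metrics would normally be used.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, F1 and Jaccard for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
    }


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(scores)
```

Jaccard ignores true negatives, so on sparse targets it can be much lower than accuracy for the same predictions.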
Extension Points
To extend Entropice:
- New data source: Follow patterns in ingest/era5.py or ingest/arcticdem.py
- Custom aggregations: Add to _Aggregations dataclass in spatial/aggregators.py
- Alternative labels: Implement extractor following ingest/darts.py pattern
- New models: Add scikit-learn compatible estimators to ml/training.py
- Dashboard pages: Add Streamlit pages to dashboard/ module
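For the "New models" point, the sketch below shows the rough shape of the scikit-learn estimator interface (constructor parameters stored unchanged, get_params/set_params, fit returning self, predict) using a trivial majority-class baseline. The class is hypothetical and duck-typed to stay dependency-free; a real addition would more likely subclass sklearn.base.BaseEstimator and ClassifierMixin, and how ml/training.py registers models is not shown here.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline that always predicts the most frequent training label."""

    def __init__(self, fallback: int = 0) -> None:
        # Constructor arguments are stored unchanged, as the estimator convention expects.
        self.fallback = fallback

    def get_params(self, deep: bool = True) -> dict:
        return {"fallback": self.fallback}

    def set_params(self, **params) -> "MajorityClassifier":
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y) -> "MajorityClassifier":
        counts = Counter(y)
        # Learned attributes carry a trailing underscore by convention.
        self.majority_ = counts.most_common(1)[0][0] if counts else self.fallback
        return self

    def predict(self, X) -> list:
        return [self.majority_ for _ in X]


model = MajorityClassifier().fit([[0.1], [0.4], [0.9]], [1, 1, 0])
print(model.predict([[0.2], [0.7]]))
```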
Important Notes
- Always use pixi run prefix for Python commands (not plain python)
- Grid resolutions: H3 (3-6), HEALPix (6-10)
- Arctic years run October 1 to September 30 (not calendar years)
- Handle antimeridian crossing in polar regions
- Use batch processing for GPU memory management
- Notebooks are for exploration only - keep production code in src/
- Always use absolute paths or paths from utils/paths.py
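The Arctic-year convention above can be expressed as a small date mapping. This sketch assumes each Arctic year is labelled by the calendar year in which it ends (so October 2020 belongs to Arctic year 2021); the document does not state which label ingest/era5.py uses, and the function name is hypothetical.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Map a date to its Oct 1 - Sep 30 year, labelled by the ending calendar year.

    The labelling convention is an assumption; flip the offset if the
    project labels Arctic years by their starting calendar year instead.
    """
    return day.year + 1 if day.month >= 10 else day.year


print(arctic_year(date(2020, 9, 30)))
print(arctic_year(date(2020, 10, 1)))
```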
Common Issues
- Memory: Use batch processing and Dask chunking for large datasets
- GPU OOM: Reduce batch size in inference or training
- Antimeridian: Use proper handling in spatial/aggregators.py for polar grids
- Temporal alignment: ERA5 uses Arctic-aligned years (Oct-Sep)
- CRS: Compute in EPSG:3413, visualize in EPSG:4326
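The batch-processing advice for memory and GPU OOM issues can be sketched as a generic batching helper. The function names and the stand-in predict callable are hypothetical; this is not the batching code in ml/inference.py.

```python
from collections.abc import Callable, Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def predict_in_batches(
    predict: Callable[[Sequence], Sequence], rows: Sequence, batch_size: int
) -> list:
    """Apply predict to one batch at a time and concatenate the results."""
    results: list = []
    for batch in iter_batches(rows, batch_size):
        results.extend(predict(batch))
    return results


# Stand-in for a model's predict method (hypothetical).
doubled = predict_in_batches(lambda batch: [x * 2 for x in batch], list(range(5)), 2)
print(doubled)
```

Slicing rather than iterating element by element keeps each batch in the container type of the input, so the same helper works for lists and for array types that support slicing. Lowering batch_size is then the single knob to turn when a run hits GPU OOM.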
References
For more details, consult:
- ARCHITECTURE.md - System architecture and design patterns
- CONTRIBUTING.md - Development workflow and standards
- README.md - Project goals and setup instructions