# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see ARCHITECTURE.md. For contributing guidelines, see CONTRIBUTING.md. For project goals and setup, see README.md.
## Core Technologies
- Python: 3.13 (strict version requirement)
- Package Manager: Pixi (not pip/conda directly)
- GPU: CUDA 12 with RAPIDS (CuPy, cuML)
- Geospatial: xarray, xdggs, GeoPandas, H3, Rasterio
- ML: scikit-learn, XGBoost, entropy (eSPA), cuML
- Storage: Zarr, Icechunk, Parquet, NetCDF
- Visualization: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow PEP 8 conventions
- Use type hints for all function signatures
- Write numpy-style docstrings for public functions
- Keep functions focused and modular
- Prefer descriptive variable names over abbreviations
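A minimal sketch of these conventions in one function (the function name and units are hypothetical, not taken from the codebase):

```python
import numpy as np


def slump_density(count: np.ndarray, cell_area_km2: np.ndarray) -> np.ndarray:
    """Compute RTS density per grid cell.

    Parameters
    ----------
    count : np.ndarray
        Number of RTS occurrences in each cell.
    cell_area_km2 : np.ndarray
        Cell areas in square kilometres.

    Returns
    -------
    np.ndarray
        Density in occurrences per square kilometre.
    """
    return count / cell_area_km2
```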
### Geospatial Best Practices
- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and library compatibility
- Store gridded data using xarray with XDGGS indexing
- Store tabular data as Parquet, array data as Zarr
- Leverage Dask for lazy evaluation of large datasets
- Use GeoPandas for vector operations
- Handle antimeridian correctly for polar regions
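One common way to keep polygon rings contiguous across the antimeridian is to unwrap longitudes into a continuous range. A hedged pure-Python sketch of that strategy (the helper name is hypothetical; the project's real handling lives in `aggregators.py`):

```python
def unwrap_lons(lons: list[float]) -> list[float]:
    """Shift longitudes into a continuous range when a ring crosses
    the antimeridian, so the geometry stays contiguous.

    A jump larger than 180 degrees between consecutive vertices is
    taken as evidence of an antimeridian crossing.
    """
    crosses = any(abs(lons[i] - lons[i - 1]) > 180 for i in range(1, len(lons)))
    if not crosses:
        return lons
    # Move negative longitudes into the 180..360 range past the crossing.
    return [lon + 360 if lon < 0 else lon for lon in lons]
```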
Data Pipeline Conventions
- Follow the numbered script sequence:
00grids.sh→01darts.sh→02alphaearth.sh→03era5.sh→04arcticdem.sh→05train.sh - Each pipeline stage should produce reproducible intermediate outputs
- Use
src/entropice/paths.pyfor consistent path management - Environment variable
FAST_DATA_DIRcontrols data directory location (default:./data)
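A stdlib sketch of the environment-variable convention described above (the real API of `paths.py` may differ; `data_dir` and `grids_dir` are hypothetical names):

```python
import os
from pathlib import Path


def data_dir() -> Path:
    """Data root, overridable via FAST_DATA_DIR; defaults to ./data."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def grids_dir() -> Path:
    """Subdirectory for grid tessellations under the data root."""
    return data_dir() / "grids"
```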
## Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
├── training-results/   # Models, CV results, predictions
└── watermask/          # Ocean mask (GeoParquet)
```
## Module Organization

### Core Modules (`src/entropice/`)

- `grids.py`: H3/HEALPix spatial grid generation with watermask
- `darts.py`: RTS label extraction from the DARTS v2 dataset
- `era5.py`: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- `arcticdem.py`: Terrain analysis from 32 m Arctic elevation data
- `alphaearth.py`: Satellite image embeddings via Google Earth Engine
- `aggregators.py`: Raster-to-vector spatial aggregation engine
- `dataset.py`: Multi-source data integration and feature engineering
- `training.py`: Model training with eSPA, XGBoost, Random Forest, KNN
- `inference.py`: Batch prediction pipeline for trained models
- `paths.py`: Centralized path management
### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run the entire pipeline for multiple grid configurations
- Each script uses CLIs from the core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- NOT committed to git
- Keep production code in `src/entropice/`
## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```
### Running Tests

```bash
pixi run pytest
```
### Common Tasks

- Generate grids: Use the `grids.py` CLI
- Process labels: Use the `darts.py` CLI
- Train models: Use the `training.py` CLI with a TOML config
- Run inference: Use the `inference.py` CLI
- View results: `pixi run dashboard`
## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the `xdggs` library for consistent spatial indexing across sources.
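The payoff of shared cell IDs is that multi-source joins reduce to aligning on one key. A toy sketch with plain dicts (the H3-style cell ID string is made up; real code indexes xarray datasets via `xdggs`):

```python
# Hypothetical per-cell records from two sources, keyed by the same cell ID.
era5_by_cell = {"8a2a1072b59ffff": {"t2m": -9.1}}
dem_by_cell = {"8a2a1072b59ffff": {"slope": 4.2}}


def join_on_cell(*sources: dict) -> dict:
    """Merge per-cell records from several sources on their shared cell IDs."""
    cells = set.intersection(*(set(s) for s in sources))
    return {c: {k: v for s in sources for k, v in s[c].items()} for c in cells}
```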
### 2. Lazy Evaluation

Use xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
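The chunked-reduction idea behind this can be illustrated without Dask: process one block at a time and combine partial results, so no full array is ever resident. A NumPy sketch of the pattern (in real code, opening a Zarr store through xarray with a `chunks` argument yields this behavior per chunk):

```python
import numpy as np


def chunked_mean(arr: np.ndarray, chunk: int) -> float:
    """Streaming mean over fixed-size blocks, mimicking an out-of-core reduction."""
    total, n = 0.0, 0
    for start in range(0, arr.size, chunk):
        block = arr[start:start + chunk]  # in Dask, each block would be one task
        total += float(block.sum())
        n += block.size
    return total / n
```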
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
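A common pattern that keeps GPU-first code runnable on CPU-only machines is an array-module alias, since CuPy mirrors much of the NumPy API (the `normalize` helper is illustrative, not from the codebase):

```python
try:
    import cupy as xp  # GPU path when CUDA and CuPy are available
except ImportError:
    import numpy as xp  # CPU fallback keeps the code runnable without CUDA


def normalize(features):
    """Z-score features column-wise on whichever array backend is available."""
    mean = xp.mean(features, axis=0)
    std = xp.std(features, axis=0)
    return (features - mean) / std
```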
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- DARTS v2: RTS labels (year, area, count, density)
- ERA5: Climate data (40-year history, Arctic-aligned years)
- ArcticDEM: 32m resolution terrain (slope, aspect, indices)
- AlphaEarth: 64-dimensional satellite embeddings
- Watermask: Ocean exclusion layer
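Because ERA5 features use Arctic-aligned years, mapping a timestamp to its year label is a recurring step. A sketch assuming the year is labeled by its September end date (this labeling convention is an assumption, not confirmed by the source):

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Arctic-aligned year label, assuming Oct 1 opens the *next* year's record."""
    return d.year + 1 if d.month >= 10 else d.year
```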
## Model Support

- Primary: eSPA (entropy-optimal Scalable Probabilistic Approximations)
- Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
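Randomized hyperparameter search is simple to sketch without scikit-learn: sample configurations and keep the best score. A pure-Python analogue of the pattern (the real training code may use different tooling):

```python
import random


def random_search(param_space: dict, score_fn, n_iter: int = 20, seed: int = 0):
    """Sample random configurations and return the best (params, score) pair.

    param_space maps each hyperparameter name to a list of candidate values;
    score_fn scores one sampled configuration (higher is better).
    """
    rng = random.Random(seed)  # seeded for reproducible searches
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```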
## Extension Points

To extend Entropice:

- New data source: Follow the patterns in `era5.py` or `arcticdem.py`
- Custom aggregations: Add to the `_Aggregations` dataclass in `aggregators.py`
- Alternative labels: Implement an extractor following the `darts.py` pattern
- New models: Add scikit-learn compatible estimators to `training.py`
- Dashboard pages: Add Streamlit pages to the `dashboard/` module
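For new models, the key requirement is the scikit-learn estimator contract (`fit`/`predict`/`get_params`/`set_params`). A minimal illustrative estimator, written without importing scikit-learn (not from the codebase):

```python
import numpy as np


class MajorityClassifier:
    """Smallest possible estimator honoring the fit/predict contract.

    Illustrative only; a real addition to training.py would follow
    scikit-learn's BaseEstimator conventions.
    """

    def get_params(self, deep: bool = True) -> dict:
        return {}  # no hyperparameters in this toy model

    def set_params(self, **params) -> "MajorityClassifier":
        return self

    def fit(self, X, y) -> "MajorityClassifier":
        # Remember the most frequent label seen in training.
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X) -> np.ndarray:
        return np.full(len(X), self.majority_)
```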
## Important Notes

- Grid resolutions: H3 (3-6), HEALPix (6-10)
- Arctic years run October 1 to September 30 (not calendar years)
- Handle antimeridian crossing in polar regions
- Use batch processing for GPU memory management
- Notebooks are for exploration only - keep production code in `src/`
- Always use absolute paths or paths from `paths.py`
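Batch processing for GPU memory can be as simple as slicing inputs before prediction and concatenating the outputs; an illustrative NumPy sketch (`predict_in_batches` is a hypothetical helper):

```python
import numpy as np


def predict_in_batches(predict_fn, X: np.ndarray, batch_size: int = 4096) -> np.ndarray:
    """Run inference in fixed-size batches so device memory stays bounded."""
    outputs = [
        predict_fn(X[i:i + batch_size]) for i in range(0, len(X), batch_size)
    ]
    return np.concatenate(outputs)
```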
## Common Issues

- Memory: Use batch processing and Dask chunking for large datasets
- GPU OOM: Reduce the batch size in inference or training
- Antimeridian: Use the proper handling in `aggregators.py` for polar grids
- Temporal alignment: ERA5 uses Arctic-aligned years (Oct-Sep)
- CRS: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- ARCHITECTURE.md - System architecture and design patterns
- CONTRIBUTING.md - Development workflow and standards
- README.md - Project goals and setup instructions