# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see ARCHITECTURE.md. For contributing guidelines, see CONTRIBUTING.md. For project goals and setup, see README.md.
## Core Technologies
- Python: 3.13 (strict version requirement)
- Package Manager: Pixi (not pip/conda directly)
- GPU: CUDA 12 with RAPIDS (CuPy, cuML)
- Geospatial: xarray, xdggs, GeoPandas, H3, Rasterio
- ML: scikit-learn, XGBoost, entropy (eSPA), cuML
- Storage: Zarr, Icechunk, Parquet, NetCDF
- Visualization: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow PEP 8 conventions
- Use type hints for all function signatures
- Write numpy-style docstrings for public functions
- Keep functions focused and modular
- Prefer descriptive variable names over abbreviations
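A minimal sketch of these conventions in one function (the function name and units are hypothetical, not taken from the codebase):

```python
import numpy as np


def slump_density(count: np.ndarray, cell_area_km2: np.ndarray) -> np.ndarray:
    """Compute RTS density per grid cell.

    Parameters
    ----------
    count : np.ndarray
        Number of RTS occurrences in each cell.
    cell_area_km2 : np.ndarray
        Cell areas in square kilometres.

    Returns
    -------
    np.ndarray
        Density in occurrences per square kilometre.
    """
    return count / cell_area_km2
```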
### Geospatial Best Practices
- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and library compatibility
- Store gridded data using xarray with XDGGS indexing
- Store tabular data as Parquet, array data as Zarr
- Leverage Dask for lazy evaluation of large datasets
- Use GeoPandas for vector operations
- Handle antimeridian correctly for polar regions
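One common way to keep polygon rings contiguous across the antimeridian is to unwrap longitudes into a continuous range. A hedged pure-Python sketch of that strategy (the helper name is hypothetical; the project's real handling lives in `aggregators.py`):

```python
def unwrap_lons(lons: list[float]) -> list[float]:
    """Shift longitudes into a continuous range when a ring crosses
    the antimeridian, so the geometry stays contiguous.

    A jump larger than 180 degrees between consecutive vertices is
    taken as evidence of an antimeridian crossing.
    """
    crosses = any(abs(lons[i] - lons[i - 1]) > 180 for i in range(1, len(lons)))
    if not crosses:
        return lons
    # Move negative longitudes into the 180..360 range past the crossing.
    return [lon + 360 if lon < 0 else lon for lon in lons]
```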
Data Pipeline Conventions
- Follow the numbered script sequence:
00grids.sh→01darts.sh→02alphaearth.sh→03era5.sh→04arcticdem.sh→05train.sh - Each pipeline stage should produce reproducible intermediate outputs
- Use
src/entropice/paths.pyfor consistent path management - Environment variable
FAST_DATA_DIRcontrols data directory location (default:./data)
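A stdlib sketch of the environment-variable convention described above (the real API of `paths.py` may differ; `data_dir` and `grids_dir` are hypothetical names):

```python
import os
from pathlib import Path


def data_dir() -> Path:
    """Data root, overridable via FAST_DATA_DIR; defaults to ./data."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def grids_dir() -> Path:
    """Subdirectory for grid tessellations under the data root."""
    return data_dir() / "grids"
```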
## Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
├── training-results/   # Models, CV results, predictions
└── watermask/          # Ocean mask (GeoParquet)
```
## Module Organization

### Core Modules (`src/entropice/`)

- `grids.py`: H3/HEALPix spatial grid generation with watermask
- `darts.py`: RTS label extraction from the DARTS v2 dataset
- `era5.py`: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- `arcticdem.py`: Terrain analysis from 32 m Arctic elevation data
- `alphaearth.py`: Satellite image embeddings via Google Earth Engine
- `aggregators.py`: Raster-to-vector spatial aggregation engine
- `dataset.py`: Multi-source data integration and feature engineering
- `training.py`: Model training with eSPA, XGBoost, Random Forest, KNN
- `inference.py`: Batch prediction pipeline for trained models
- `paths.py`: Centralized path management
### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run the entire pipeline for multiple grid configurations
- Each script uses CLIs from the core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- NOT committed to git
- Keep production code in `src/entropice/`
## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```
### Running Tests

```bash
pixi run pytest
```
### Common Tasks

- Generate grids: Use the `grids.py` CLI
- Process labels: Use the `darts.py` CLI
- Train models: Use the `training.py` CLI with a TOML config
- Run inference: Use the `inference.py` CLI
- View results: `pixi run dashboard`
## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the `xdggs` library for consistent spatial indexing across sources.
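The payoff of shared cell IDs is that multi-source joins reduce to aligning on one key. A toy sketch with plain dicts (the H3-style cell ID string is made up; real code indexes xarray datasets via `xdggs`):

```python
# Hypothetical per-cell records from two sources, keyed by the same cell ID.
era5_by_cell = {"8a2a1072b59ffff": {"t2m": -9.1}}
dem_by_cell = {"8a2a1072b59ffff": {"slope": 4.2}}


def join_on_cell(*sources: dict) -> dict:
    """Merge per-cell records from several sources on their shared cell IDs."""
    cells = set.intersection(*(set(s) for s in sources))
    return {c: {k: v for s in sources for k, v in s[c].items()} for c in cells}
```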
### 2. Lazy Evaluation

Use xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
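The chunked-reduction idea behind this can be illustrated without Dask: process one block at a time and combine partial results, so no full array is ever resident. A NumPy sketch of the pattern (in real code, opening a Zarr store through xarray with a `chunks` argument yields this behavior per chunk):

```python
import numpy as np


def chunked_mean(arr: np.ndarray, chunk: int) -> float:
    """Streaming mean over fixed-size blocks, mimicking an out-of-core reduction."""
    total, n = 0.0, 0
    for start in range(0, arr.size, chunk):
        block = arr[start:start + chunk]  # in Dask, each block would be one task
        total += float(block.sum())
        n += block.size
    return total / n
```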
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
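A common pattern that keeps GPU-first code runnable on CPU-only machines is an array-module alias, since CuPy mirrors much of the NumPy API (the `normalize` helper is illustrative, not from the codebase):

```python
try:
    import cupy as xp  # GPU path when CUDA and CuPy are available
except ImportError:
    import numpy as xp  # CPU fallback keeps the code runnable without CUDA


def normalize(features):
    """Z-score features column-wise on whichever array backend is available."""
    mean = xp.mean(features, axis=0)
    std = xp.std(features, axis=0)
    return (features - mean) / std
```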
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- DARTS v2: RTS labels (year, area, count, density)
- ERA5: Climate data (40-year history, Arctic-aligned years)
- ArcticDEM: 32m resolution terrain (slope, aspect, indices)
- AlphaEarth: 64-dimensional satellite embeddings
- Watermask: Ocean exclusion layer
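Because ERA5 features use Arctic-aligned years, mapping a timestamp to its year label is a recurring step. A sketch assuming the year is labeled by its September end date (this labeling convention is an assumption, not confirmed by the source):

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Arctic-aligned year label, assuming Oct 1 opens the *next* year's record."""
    return d.year + 1 if d.month >= 10 else d.year
```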
## Model Support

- Primary: eSPA (entropy-optimal Scalable Probabilistic Approximations)
- Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
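Randomized hyperparameter search is simple to sketch without scikit-learn: sample configurations and keep the best score. A pure-Python analogue of the pattern (the real training code may use different tooling):

```python
import random


def random_search(param_space: dict, score_fn, n_iter: int = 20, seed: int = 0):
    """Sample random configurations and return the best (params, score) pair.

    param_space maps each hyperparameter name to a list of candidate values;
    score_fn scores one sampled configuration (higher is better).
    """
    rng = random.Random(seed)  # seeded for reproducible searches
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```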
## Extension Points

To extend Entropice:

- New data source: Follow the patterns in `era5.py` or `arcticdem.py`
- Custom aggregations: Add to the `_Aggregations` dataclass in `aggregators.py`
- Alternative labels: Implement an extractor following the `darts.py` pattern
- New models: Add scikit-learn compatible estimators to `training.py`
- Dashboard pages: Add Streamlit pages to the `dashboard/` module
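For new models, the key requirement is the scikit-learn estimator contract (`fit`/`predict`/`get_params`/`set_params`). A minimal illustrative estimator, written without importing scikit-learn (not from the codebase):

```python
import numpy as np


class MajorityClassifier:
    """Smallest possible estimator honoring the fit/predict contract.

    Illustrative only; a real addition to training.py would follow
    scikit-learn's BaseEstimator conventions.
    """

    def get_params(self, deep: bool = True) -> dict:
        return {}  # no hyperparameters in this toy model

    def set_params(self, **params) -> "MajorityClassifier":
        return self

    def fit(self, X, y) -> "MajorityClassifier":
        # Remember the most frequent label seen in training.
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X) -> np.ndarray:
        return np.full(len(X), self.majority_)
```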
## Important Notes

- Grid resolutions: H3 (3-6), HEALPix (6-10)
- Arctic years run October 1 to September 30 (not calendar years)
- Handle antimeridian crossing in polar regions
- Use batch processing for GPU memory management
- Notebooks are for exploration only - keep production code in `src/`
- Always use absolute paths or paths from `paths.py`
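Batch processing for GPU memory can be as simple as slicing inputs before prediction and concatenating the outputs; an illustrative NumPy sketch (`predict_in_batches` is a hypothetical helper):

```python
import numpy as np


def predict_in_batches(predict_fn, X: np.ndarray, batch_size: int = 4096) -> np.ndarray:
    """Run inference in fixed-size batches so device memory stays bounded."""
    outputs = [
        predict_fn(X[i:i + batch_size]) for i in range(0, len(X), batch_size)
    ]
    return np.concatenate(outputs)
```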
## Common Issues

- Memory: Use batch processing and Dask chunking for large datasets
- GPU OOM: Reduce the batch size in inference or training
- Antimeridian: Use the proper handling in `aggregators.py` for polar grids
- Temporal alignment: ERA5 uses Arctic-aligned years (Oct-Sep)
- CRS: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- ARCHITECTURE.md - System architecture and design patterns
- CONTRIBUTING.md - Development workflow and standards
- README.md - Project goals and setup instructions