# Entropice Architecture

## Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).

The system processes multi-source geospatial data, aggregates it onto discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.

Most stages of the data-flow pipeline are modularized and abstracted, allowing rigorous experimentation with both the data and the models: alternative models, datasets, and target labels can be swapped in for comparison, and the pipeline can be repurposed for methodologies beyond RTS prediction and eSPA modelling.

## Core Architecture

### Data Flow Pipeline

```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```

The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:

1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations

### System Components

#### 1. Spatial Grid System (`grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
  - H3: 3, 4, 5, 6
  - HEALPix: 6, 7, 8, 9, 10
- **Functionality**:
  - Generates cell IDs and geometries at configurable resolutions
  - Applies a watermask to exclude ocean areas
  - Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
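
The core idea of a discrete global grid, assigning every point a stable cell ID at a chosen resolution, can be illustrated with a toy equal-angle grid. This is only a sketch of the indexing concept, not the actual H3 or HEALPix algorithms (which use hexagonal and equal-area tessellations, respectively):

```python
# Toy illustration of discrete-grid spatial indexing (NOT the real
# H3/HEALPix math): quantize lon/lat into an equal-angle cell and derive a
# stable integer cell ID from the row/column indices.
import math

def cell_id(lon: float, lat: float, resolution: int) -> int:
    """Map a point to a cell ID; higher resolution means smaller cells."""
    cells_per_degree = 2 ** resolution        # cell edge = 1/cells_per_degree degrees
    col = math.floor((lon + 180.0) * cells_per_degree)
    row = math.floor((lat + 90.0) * cells_per_degree)
    n_cols = 360 * cells_per_degree           # columns around the globe
    return row * n_cols + col                 # unique (row, col) -> single integer

# Nearby points share a cell at a coarse resolution.
a = cell_id(123.40, 69.50, resolution=3)
b = cell_id(123.45, 69.52, resolution=3)
```

Data keyed by such IDs can be joined and aggregated without geometry operations, which is what makes grid-based spatial indexing efficient.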

#### 2. Label Management (`darts.py`)

- **Purpose**: Extracts RTS labels from the DARTS v2 dataset
- **Processing**:
  - Spatial overlay of DARTS polygons with grid cells
  - Temporal aggregation across multiple observation years
  - Computes RTS count, area, density, and coverage per cell
  - Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched Parquet files with RTS metrics
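
After the overlay, the per-cell metrics reduce to a groupby aggregation. A minimal pandas sketch (column names are illustrative, not the actual `darts.py` schema):

```python
# Sketch of per-cell label aggregation: each row is one RTS polygon
# fragment after overlay with the grid. Column names are illustrative.
import pandas as pd

overlay = pd.DataFrame({
    "cell_id":  [1, 1, 2, 2, 2],
    "rts_area": [0.4, 0.6, 0.2, 0.3, 0.5],   # km² of slump inside the cell
})
cell_area = pd.Series({1: 10.0, 2: 20.0}, name="land_area")  # km² per cell

labels = overlay.groupby("cell_id")["rts_area"].agg(
    rts_count="count", rts_area="sum"
)
labels["rts_coverage"] = labels["rts_area"] / cell_area  # fraction of cell covered
```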

#### 3. Feature Extractors

**ERA5 Climate Data (`era5.py`)**

- Downloads hourly climate variables from the Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
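
Degree-day metrics reduce a year of daily temperatures to a single scalar. A minimal list-based sketch of the idea (the actual pipeline computes this on Xarray/Dask arrays):

```python
# Thawing degree days (TDD): sum of daily mean temperatures above 0 °C over
# an Arctic-aligned year (Oct 1 - Sep 30). Minimal sketch only.
def thawing_degree_days(daily_mean_c: list[float]) -> float:
    return sum(t for t in daily_mean_c if t > 0.0)

def thawing_period_length(daily_mean_c: list[float]) -> int:
    """Number of days with mean temperature above freezing."""
    return sum(1 for t in daily_mean_c if t > 0.0)

temps = [-12.0, -3.5, 0.0, 2.0, 8.5, 4.0, -1.0]
tdd = thawing_degree_days(temps)       # 2.0 + 8.5 + 4.0 = 14.5
days = thawing_period_length(temps)    # 3
```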

**ArcticDEM Terrain (`arcticdem.py`)**

- Processes 32 m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
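
The simplest terrain derivative, slope, comes from finite differences of the elevation surface. A CPU NumPy sketch (the real module uses GPU-accelerated convolution kernels):

```python
# Slope from a DEM via central differences (CPU NumPy sketch; the actual
# module uses GPU convolutions). Pixel spacing matches the 32 m mosaic.
import numpy as np

def slope_degrees(dem: np.ndarray, pixel_size: float = 32.0) -> np.ndarray:
    dz_dy, dz_dx = np.gradient(dem, pixel_size)      # elevation change per metre
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# A plane rising 1 m per pixel eastward has a uniform, known slope.
x = np.arange(5, dtype=float)
dem = np.tile(x, (5, 1))                             # elevation = column index (m)
slope = slope_degrees(dem)
```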

**AlphaEarth Embeddings (`alphaearth.py`)**

- Extracts 64-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Samples temporally across multiple years
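
Partitioning a large grid into spatially compact batches amounts to clustering cell centroids. A plain-NumPy Lloyd iteration sketches the idea (the module's actual KMeans configuration may differ):

```python
# Partition grid-cell centroids into k spatially compact batches with a few
# Lloyd (KMeans) iterations. Plain-NumPy sketch only.
import numpy as np

def kmeans_partition(points: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each centroid to its nearest cluster center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(42)
# Two well-separated blobs of cell centroids -> two clean batches.
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, _ = kmeans_partition(pts, k=2)
```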

#### 4. Spatial Aggregation Framework (`aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
  - Exact geometry-based aggregation using polygon overlay
  - Grid-aligned batch processing for memory efficiency
  - Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
  - Antimeridian handling for polar regions
  - Parallel processing with worker pools
  - GPU acceleration via CuPy/cuML where applicable
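
At its core, raster-to-vector aggregation groups raster pixels by the grid cell covering them and reduces each group. A simplified NumPy version (the real engine performs exact polygon overlay with area weighting):

```python
# Simplified raster-to-vector aggregation: group raster pixels by the grid
# cell covering them, then reduce. Rasterized approximation of the real
# exact-overlay engine.
import numpy as np

raster = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
cells  = np.array([[0, 0, 1],       # cell ID covering each pixel
                   [0, 1, 1]])

def zonal_stats(values: np.ndarray, zones: np.ndarray) -> dict[int, dict[str, float]]:
    out = {}
    for zone in np.unique(zones):
        v = values[zones == zone]
        out[int(zone)] = {"mean": float(v.mean()), "sum": float(v.sum()),
                          "std": float(v.std())}
    return out

stats = zonal_stats(raster, cells)
```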

#### 5. Dataset Assembly (`dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
  - Dynamic dataset configuration via dataclasses
  - Multi-task support: binary classification, count, and density estimation
  - Target binning: quantile-based categorical bins
  - Train/test splitting with spatial awareness
  - GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for the scikit-learn API
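
Quantile-based target binning turns a continuous density target into roughly balanced categorical classes; with pandas this is `qcut`. The bin count and labels below are illustrative, not the project's actual configuration:

```python
# Quantile-based target binning: convert a continuous RTS density target
# into (roughly) equally populated categorical classes. Illustrative bins.
import pandas as pd

density = pd.Series([0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 3.5, 9.0])
bins = pd.qcut(density, q=4, labels=["low", "mid-low", "mid-high", "high"])

counts = bins.value_counts()   # each quartile bin holds ~the same number of cells
```

Balanced bins help the downstream classifiers, since raw RTS density is heavily skewed toward zero.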

#### 6. Model Training (`training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
  - **XGBoost**: Gradient boosting (GPU-accelerated)
  - **Random Forest**: cuML implementation
  - **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
  - Randomized hyperparameter search (SciPy distributions)
  - K-fold cross-validation
  - Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
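
The randomized-search loop amounts to sampling configurations from parameter distributions, scoring each, and keeping the best. A dependency-free skeleton (the real CLI draws from SciPy distributions and scores with K-fold cross-validation; the objective below is a stand-in):

```python
# Skeleton of randomized hyperparameter search: draw configurations from
# parameter distributions, score each, keep the best. The toy objective
# stands in for a cross-validation score.
import random

param_space = {
    "n_clusters": lambda rng: rng.randint(2, 64),
    "reg":        lambda rng: 10 ** rng.uniform(-4, 0),   # log-uniform draw
}

def toy_score(params: dict) -> float:
    # Stand-in for a CV score; peaks near n_clusters=32, reg=0.01.
    return -abs(params["n_clusters"] - 32) - abs(params["reg"] - 0.01)

rng = random.Random(7)
best_score, best_params = float("-inf"), None
for _ in range(100):
    params = {name: draw(rng) for name, draw in param_space.items()}
    score = toy_score(params)
    if score > best_score:
        best_score, best_params = score, params
```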

#### 7. Inference (`inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via a unified interface
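
Batched prediction is a chunked loop that bounds peak (GPU) memory. A CPU sketch with a stand-in model:

```python
# Chunked batch prediction: bound peak memory by scoring fixed-size slices
# and concatenating the results. The "model" here is a stand-in; the real
# pipeline feeds batches to the trained classifier the same way.
import numpy as np

def predict_batched(model, X: np.ndarray, batch_size: int = 4) -> np.ndarray:
    parts = [model(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

threshold_model = lambda batch: (batch[:, 0] > 0.5).astype(int)  # stand-in classifier
X = np.linspace(0, 1, 10).reshape(-1, 1)
preds = predict_batched(threshold_model, X, batch_size=4)
```

Because predictions are concatenated in input order, the result is identical to a single full-array call, only with a bounded working set.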

#### 8. Interactive Dashboard (`dashboard/`)

- **Technology**: Streamlit-based web application
- **Pages**:
  - **Overview**: Result directory summary and metrics
  - **Training Data**: Feature distribution visualizations
  - **Training Analysis**: CV results, hyperparameter analysis
  - **Model State**: Feature importance, decision boundaries
  - **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots

## Key Design Patterns

### 1. XDGGS Integration

All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:

- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities

### 2. Lazy Evaluation & Chunking

- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory
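
The essence of chunked, out-of-core computation is never materializing the full array: stream fixed-size pieces and combine partial results. A plain-Python stand-in for what Xarray/Dask provide over Zarr stores:

```python
# Stream fixed-size chunks and combine partial results instead of loading
# the whole array. Plain-Python stand-in for Xarray/Dask over Zarr.
def chunked_mean(chunks) -> float:
    total, count = 0.0, 0
    for chunk in chunks:                 # each chunk fits in memory
        total += sum(chunk)
        count += len(chunk)
    return total / count

# A generator stands in for lazily loaded Zarr chunks.
lazy_chunks = (list(range(i, i + 100)) for i in range(0, 1000, 100))
mean = chunked_mean(lazy_chunks)         # mean of 0..999
```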

### 3. GPU Acceleration

- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations

### 4. Modular CLI Design

Each module exposes a Cyclopts-based CLI for standalone execution. Wrapper scripts invoke these CLIs to run parts of the data flow across all grid-level combinations:

```bash
scripts/00grids.sh        # Grid generation
scripts/01darts.sh        # Label extraction
scripts/02alphaearth.sh   # Satellite embeddings
scripts/03era5.sh         # Climate features
scripts/04arcticdem.sh    # Terrain features
scripts/05train.sh        # Model training
```

### 5. Configuration as Code

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)

## Data Storage Hierarchy

```sh
DATA_DIR/
├── grids/             # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/             # RTS labels (GeoParquet)
├── era5/              # Climate data (Zarr)
├── arcticdem/         # Terrain data (Icechunk Zarr)
├── alphaearth/        # Satellite embeddings (Zarr)
├── datasets/          # L2 XDGGS datasets (Zarr)
├── training-results/  # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/         # Ocean mask (GeoParquet)
```

## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec, Shapely, Rasterio
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
- **CLI**: Cyclopts, Rich
- **External APIs**: Google Earth Engine, Copernicus Climate Data Store

## Performance Considerations

- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations

## Extension Points

The architecture supports extension through:

- **New Data Sources**: Implement a feature extractor following the ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to the `_Aggregations` dataclass
- **Alternative Targets**: Implement a label extractor following the DARTS pattern
- **Alternative Models**: Extend the training CLI with new scikit-learn-compatible estimators
- **Dashboard Pages**: Add Streamlit pages to the `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration