Tobias Hölzer 45bc61e49e Refactor non-dashboard modules

2025-12-28 20:38:51 +01:00

Entropice Architecture

Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data, aggregates it into discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns. Most parts of the data flow pipeline are modular and abstracted, which supports rigorous experimentation with both the data and the models. Alternative models, datasets, and target labels can be swapped in for comparison, or to explore methodologies beyond RTS prediction and eSPA modelling.

Core Architecture

Data Flow Pipeline

Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization

The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:

Grid Generation (H3/HEALPix) → Parquet files
Label Extraction (DARTS) → Grid-enriched labels
Feature Extraction (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
Dataset Ensemble → Training-ready tabular data
Model Training → Pickled classifiers + CV results
Inference → GeoDataFrames with predictions
Dashboard → Interactive visualizations

System Components

The codebase is organized into four main packages, alongside the dashboard/ module described in component 8:

entropice.ingest: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
entropice.spatial: Spatial operations and grid management
entropice.ml: Machine learning workflows (dataset, training, inference)
entropice.utils: Common utilities (paths, codecs)

1. Spatial Grid System (`spatial/grids.py`)

Purpose: Creates global tessellations (discrete global grid systems) for spatial aggregation
Grid Types & Levels: H3 hexagonal grids and HEALPix grids
- H3 (Hex): levels 3, 4, 5, 6
- HEALPix: levels 6, 7, 8, 9, 10
Functionality:
- Generates cell IDs and geometries at configurable resolutions
- Applies watermask to exclude ocean areas
- Provides spatial indexing for efficient data aggregation
Output: GeoDataFrames with cell IDs, geometries, and land areas

2. Label Management (`ingest/darts.py`)

Purpose: Extracts RTS labels from DARTS v2 dataset
Processing:
- Spatial overlay of DARTS polygons with grid cells
- Temporal aggregation across multiple observation years
- Computes RTS count, area, density, and coverage per cell
- Supports both standard DARTS and ML training labels
Output: Grid-enriched parquet files with RTS metrics
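
The per-cell metrics listed above could be computed roughly as follows once the spatial overlay has assigned each clipped RTS polygon piece to a cell. This is a minimal sketch and not the code in ingest/darts.py: the function name, the record layout, and the definitions of density (slumps per km² of land) and coverage (RTS area as a fraction of land area) are assumptions made for illustration.

```python
from collections import defaultdict


def rts_metrics_per_cell(overlaps, land_area_km2):
    """Aggregate clipped RTS polygon pieces into per-cell metrics.

    overlaps: iterable of (cell_id, rts_area_km2) pairs, one per polygon
        piece that falls inside a cell (hypothetical record layout).
    land_area_km2: dict mapping cell_id -> land area of the cell in km².
    """
    count = defaultdict(int)
    area = defaultdict(float)
    for cell_id, rts_area in overlaps:
        count[cell_id] += 1
        area[cell_id] += rts_area

    metrics = {}
    for cell_id, land in land_area_km2.items():
        n, a = count[cell_id], area[cell_id]
        metrics[cell_id] = {
            "count": n,
            "area": a,
            # Assumed definitions: slumps per km² of land, covered fraction.
            "density": n / land if land > 0 else 0.0,
            "coverage": a / land if land > 0 else 0.0,
        }
    return metrics


overlaps = [("a", 0.5), ("a", 1.5), ("b", 1.0)]
land = {"a": 100.0, "b": 50.0, "c": 80.0}
metrics = rts_metrics_per_cell(overlaps, land)
```

Cells without any RTS polygon (cell "c" here) still receive a row with zero counts, which keeps negative examples available for training.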

3. Feature Extractors

ERA5 Climate Data (ingest/era5.py)

Downloads hourly climate variables from Copernicus Climate Data Store
Computes daily aggregates: temperature extrema, precipitation, snow metrics
Derives thaw/freeze metrics: degree days, thawing period length
Temporal aggregations: yearly, seasonal, shoulder seasons
Uses Arctic-aligned years (October 1 - September 30)
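
To make the Arctic-aligned year and the thaw metrics concrete, here is a small sketch. It is not taken from ingest/era5.py. The labelling convention (a year is named after the calendar year in which it ends), the 0 °C base temperature, and all function names are assumptions.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Label of the October 1 - September 30 year containing `day`.

    Assumption: a year is labelled by the calendar year in which it ends,
    so 2020-10-01 through 2021-09-30 all map to 2021.
    """
    return day.year + 1 if day.month >= 10 else day.year


def thawing_degree_days(daily_mean_c, base_c=0.0):
    """Sum of daily mean temperature excess above `base_c` (°C·days)."""
    return sum(max(t - base_c, 0.0) for t in daily_mean_c)


def thawing_period_length(daily_mean_c, base_c=0.0):
    """Number of days with a daily mean above `base_c` (one possible definition)."""
    return sum(1 for t in daily_mean_c if t > base_c)


temps = [-5.0, -1.0, 0.5, 3.0, 2.5, -0.5]
tdd = thawing_degree_days(temps)
thaw_days = thawing_period_length(temps)
```

Grouping daily values by `arctic_year` before summing keeps a full winter and the following summer in the same yearly aggregate instead of splitting the winter across two calendar years.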

ArcticDEM Terrain (ingest/arcticdem.py)

Processes 32m resolution Arctic elevation data
Computes terrain derivatives: slope, aspect, curvature
Calculates terrain indices: TPI, TRI, ruggedness, VRM
Applies watermask clipping and GPU-accelerated convolutions
Aggregates terrain statistics per grid cell
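
Two of the terrain derivatives can be illustrated with plain NumPy. The sketch below runs on the CPU and is a stand-in for the GPU-accelerated convolutions mentioned above, not the project's implementation. The 3x3 window for the Topographic Position Index (TPI) and the function names are assumptions.

```python
import numpy as np


def slope_degrees(dem, cell_size):
    """Slope in degrees from finite differences of the elevation grid."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))


def tpi_3x3(dem):
    """TPI: elevation minus the mean of the 8 surrounding pixels.

    Edges are padded by repeating the border values.
    """
    padded = np.pad(dem, 1, mode="edge")
    rows, cols = dem.shape
    window_sum = sum(
        padded[i:i + rows, j:j + cols] for i in range(3) for j in range(3)
    )
    neighbour_mean = (window_sum - dem) / 8.0
    return dem - neighbour_mean


# A ramp rising 32 m per 32 m pixel has a 45 degree slope everywhere.
ramp = np.tile(np.arange(5) * 32.0, (5, 1))
ramp_slope = slope_degrees(ramp, cell_size=32.0)

# A single 8 m spike on flat ground.
peak = np.zeros((5, 5))
peak[2, 2] = 8.0
peak_tpi = tpi_3x3(peak)
```

Positive TPI marks pixels higher than their surroundings (ridges, the spike itself), negative TPI marks pixels lower than their surroundings.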

AlphaEarth Embeddings (ingest/alphaearth.py)

Extracts 64-dimensional satellite image embeddings via Google Earth Engine
Uses foundation models to capture visual patterns
Partitions large grids using KMeans clustering
Temporal sampling across multiple years

4. Spatial Aggregation Framework (`spatial/aggregators.py`)

Core Capability: Raster-to-vector aggregation engine
Methods:
- Exact geometry-based aggregation using polygon overlay
- Grid-aligned batch processing for memory efficiency
- Statistical aggregations: mean, sum, std, min, max, median, quantiles
Optimization:
- Antimeridian handling for polar regions
- Parallel processing with worker pools
- GPU acceleration via CuPy/cuML where applicable
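
The idea behind exact, geometry-based aggregation can be sketched for a single grid cell: every raster pixel contributes in proportion to the fraction of its area that lies inside the cell polygon. The example below is a simplified illustration using only the standard library. The function name and the choice to weight mean and sum but not min, max, and median are assumptions, not a description of spatial/aggregators.py.

```python
import statistics


def aggregate_cell(values, coverage):
    """Aggregate raster pixel values for one grid cell.

    values: pixel values that intersect the cell's bounding window.
    coverage: fraction of each pixel's area inside the cell polygon (0..1).
    """
    pairs = [(v, w) for v, w in zip(values, coverage) if w > 0]
    if not pairs:
        return None  # the cell does not overlap any pixel
    total_weight = sum(w for _, w in pairs)
    weighted_sum = sum(v * w for v, w in pairs)
    touched = [v for v, _ in pairs]
    return {
        "mean": weighted_sum / total_weight,  # coverage-weighted mean
        "sum": weighted_sum,                  # coverage-weighted sum
        "min": min(touched),
        "max": max(touched),
        "median": statistics.median(touched),
    }


stats = aggregate_cell([10.0, 20.0, 30.0, 40.0], [1.0, 0.5, 0.5, 0.0])
```

The last pixel has zero coverage and is ignored, so it affects neither the weighted mean nor the extrema.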

5. Dataset Assembly (`ml/dataset.py`)

DatasetEnsemble Class: Orchestrates multi-source data integration
L2 Datasets: Standardized XDGGS Xarray datasets per data source
Features:
- Dynamic dataset configuration via dataclasses
- Multi-task support: binary classification, count, density estimation
- Target binning: quantile-based categorical bins
- Train/test splitting with spatial awareness
- GPU-accelerated data loading (PyTorch/CuPy)
Output: Tabular feature matrices ready for scikit-learn API
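
Quantile-based target binning turns a continuous target such as RTS density into ordered classes of roughly equal size. The following sketch uses only the standard library. The function names are hypothetical and the actual binning in ml/dataset.py may differ, for example in how zero-valued cells or ties are handled.

```python
import bisect
import statistics


def quantile_bin_edges(values, n_bins):
    """Interior cut points that split `values` into `n_bins` quantile bins."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges)."""
    return [bisect.bisect_right(edges, v) for v in values]


targets = [0, 1, 2, 3, 4, 5, 6, 7, 8]
edges = quantile_bin_edges(targets, n_bins=4)
labels = assign_bins(targets, edges)
```

In practice the edges would be computed on the training split only and then reused for the test split, so that the test data does not influence the class boundaries.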

6. Model Training (`ml/training.py`)

Supported Models:
- eSPA: Entropy-optimal probabilistic classifier (primary)
- XGBoost: Gradient boosting (GPU-accelerated)
- Random Forest: cuML implementation
- K-Nearest Neighbors: cuML implementation
Training Strategy:
- Randomized hyperparameter search (SciPy distributions)
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
Configuration: TOML-based configuration with Cyclopts CLI
Output: Pickled models, CV results, feature importance
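
For reference, the metrics used in the multi-metric evaluation can all be derived from the confusion counts of a binary task. The sketch below is a plain-Python illustration with a hypothetical function name. The training code itself presumably relies on library implementations, and the multi-class case additionally requires an averaging strategy that is not shown here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and Jaccard for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Convention chosen here: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
    }


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

Jaccard ignores true negatives, so on imbalanced data where most cells contain no RTS it is noticeably stricter than accuracy.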

7. Inference (`ml/inference.py`)

Batch prediction pipeline for trained classifiers
GPU memory management with configurable batch sizes
Outputs GeoDataFrames with predicted classes and probabilities
Supports all trained model types via unified interface
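
Batched prediction with a configurable batch size can be sketched as below. The helper names and the toy predictor are hypothetical. A real classifier's predict method would take the place of `toy_predict`, and each batch would be moved to the GPU before prediction.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(predict: Callable, rows: Sequence, batch_size: int) -> List:
    """Run `predict` batch by batch and concatenate the results in order."""
    predictions = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict(batch))
    return predictions


def toy_predict(batch):
    # Stand-in for a trained classifier: class 1 if the feature sum exceeds 1.
    return [int(sum(row) > 1) for row in batch]


rows = [[0.2, 0.9], [0.1, 0.3], [0.8, 0.7], [0.5, 0.6], [0.0, 0.1]]
predictions = predict_in_batches(toy_predict, rows, batch_size=2)
```

Only one batch is held in memory at a time, so the batch size bounds peak memory use regardless of how many grid cells are predicted.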

8. Interactive Dashboard (`dashboard/`)

Technology: Streamlit-based web application
Pages:
- Overview: Result directory summary and metrics
- Training Data: Feature distribution visualizations
- Training Analysis: CV results, hyperparameter analysis
- Model State: Feature importance, decision boundaries
- Inference: Spatial prediction maps
Plotting Modules: Bokeh-based interactive geospatial plots

Key Design Patterns

1. XDGGS Integration

All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the xdggs library, enabling:

Consistent spatial indexing across data sources
Efficient spatial joins and aggregations
Multi-resolution analysis capabilities

2. Lazy Evaluation & Chunking

Xarray/Dask for out-of-core computation
Zarr/Icechunk for chunked storage
Batch processing to manage GPU memory

3. GPU Acceleration

CuPy for array operations
cuML for machine learning (RF, KNN)
XGBoost GPU training
PyTorch tensor operations

4. Modular CLI Design

Each module exposes a Cyclopts-based CLI for standalone execution. Wrapper scripts call these CLIs to run each part of the data flow for every grid and level combination:

scripts/00grids.sh      # Grid generation
scripts/01darts.sh      # Label extraction
scripts/02alphaearth.sh # Satellite embeddings
scripts/03era5.sh       # Climate features
scripts/04arcticdem.sh  # Terrain features
scripts/05train.sh      # Model training
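
The grid and level combinations these scripts iterate over follow from the levels listed in the Spatial Grid System section. A Python sketch of that enumeration is shown below. The scripts themselves are shell scripts, and the identifiers "hex" and "healpix" are assumed spellings, not confirmed CLI values.

```python
# Levels as listed in the Spatial Grid System section.
# The grid identifiers are assumed spellings.
GRID_LEVELS = {
    "hex": [3, 4, 5, 6],
    "healpix": [6, 7, 8, 9, 10],
}


def grid_level_combinations(grid_levels):
    """Flatten {grid: [levels]} into a list of (grid, level) pairs."""
    return [
        (grid, level)
        for grid, levels in grid_levels.items()
        for level in levels
    ]


combinations = grid_level_combinations(GRID_LEVELS)
```

Each pipeline stage is then run once per pair, giving nine runs per stage for the levels listed.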

5. Configuration as Code

Dataclasses for typed configuration
TOML files for training hyperparameters
Environment-based path management (utils/paths.py)

Data Storage Hierarchy

DATA_DIR/
├── grids/                    # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/                    # RTS labels (GeoParquet)
├── era5/                     # Climate data (Zarr)
├── arcticdem/                # Terrain data (Icechunk Zarr)
├── alphaearth/               # Satellite embeddings (Zarr)
├── datasets/                 # L2 XDGGS datasets (Zarr)
├── training-results/         # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/                # Ocean mask (GeoParquet)

Technology Stack

Core: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
Spatial: H3, xdggs, xvec, Shapely, Rasterio
ML: scikit-learn, XGBoost, cuML, entropy (eSPA)
GPU: CuPy, PyTorch, CUDA
Storage: Zarr, Icechunk, Parquet, NetCDF
Visualization: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
CLI: Cyclopts, Rich
External APIs: Google Earth Engine, Copernicus Climate Data Store

Performance Considerations

Memory Management: Batch processing with configurable chunk sizes
Parallelization: Multi-process aggregation, Dask distributed computing
GPU Utilization: CuPy/cuML for array operations and ML training
Storage Optimization: Blosc compression, Zarr chunking strategies
Spatial Indexing: Grid-based partitioning for large-scale operations

Extension Points

The architecture supports extension through:

New Data Sources: Implement feature extractor in ingest/ following ERA5/ArcticDEM patterns
Custom Aggregations: Add methods to _Aggregations dataclass in spatial/aggregators.py
Alternative Targets: Implement label extractor in ingest/ following DARTS pattern
Alternative Models: Extend training CLI in ml/training.py with new scikit-learn compatible estimators
Dashboard Pages: Add Streamlit pages to dashboard/ module
Grid Systems: Support additional DGGS in spatial/grids.py via xdggs integration
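
As an illustration of the Alternative Models extension point, the toy class below follows the basic scikit-learn estimator conventions: hyperparameters are stored unchanged in the constructor, `fit` returns `self`, and learned attributes end in an underscore. It is a hypothetical example that does not import scikit-learn. A real estimator would usually inherit from `sklearn.base.BaseEstimator` and `ClassifierMixin`, and how it gets registered in ml/training.py is not shown here.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator that always predicts the most frequent training class."""

    def __init__(self, tie_break="smallest"):
        # Hyperparameters are stored as given, without validation.
        self.tie_break = tie_break

    def get_params(self, deep=True):
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        counts = Counter(y)
        self.classes_ = sorted(counts)
        top = max(counts.values())
        tied = [c for c in self.classes_ if counts[c] == top]
        self.majority_ = tied[0] if self.tie_break == "smallest" else tied[-1]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


clf = MajorityClassifier().fit([[0.1], [0.4], [0.9]], [1, 0, 1])
predicted = clf.predict([[0.5], [0.7]])
```

A baseline like this is also useful in its own right as a lower bound when comparing eSPA against the other supported models.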
