# Entropice Architecture

## Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).

The system processes multi-source geospatial data, aggregates it onto discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.

Most stages of the data-flow pipeline are modularized and abstracted, allowing rigorous experimentation with both the data and the models: alternative models, datasets, and target labels can be swapped in for comparison, and the pipeline can be repurposed for methodologies beyond RTS prediction and eSPA modelling.

## Core Architecture

### Data Flow Pipeline

```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```

The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:

1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations

### System Components

#### 1. Spatial Grid System (`grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
  - H3: 3, 4, 5, 6
  - HEALPix: 6, 7, 8, 9, 10
- **Functionality**:
  - Generates cell IDs and geometries at configurable resolutions
  - Applies a watermask to exclude ocean areas
  - Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
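
The core idea of a discrete global grid, assigning every point a stable cell ID at a chosen resolution, can be illustrated with a toy equal-angle grid. This is only a sketch of the indexing concept, not the actual H3 or HEALPix algorithms (which use hexagonal and equal-area tessellations, respectively):

```python
# Toy illustration of discrete-grid spatial indexing (NOT the real
# H3/HEALPix math): quantize lon/lat into an equal-angle cell and derive a
# stable integer cell ID from the row/column indices.
import math

def cell_id(lon: float, lat: float, resolution: int) -> int:
    """Map a point to a cell ID; higher resolution means smaller cells."""
    cells_per_degree = 2 ** resolution        # cell edge = 1/cells_per_degree degrees
    col = math.floor((lon + 180.0) * cells_per_degree)
    row = math.floor((lat + 90.0) * cells_per_degree)
    n_cols = 360 * cells_per_degree           # columns around the globe
    return row * n_cols + col                 # unique (row, col) -> single integer

# Nearby points share a cell at a coarse resolution.
a = cell_id(123.40, 69.50, resolution=3)
b = cell_id(123.45, 69.52, resolution=3)
```

Data keyed by such IDs can be joined and aggregated without geometry operations, which is what makes grid-based spatial indexing efficient.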

#### 2. Label Management (`darts.py`)

- **Purpose**: Extracts RTS labels from the DARTS v2 dataset
- **Processing**:
  - Spatial overlay of DARTS polygons with grid cells
  - Temporal aggregation across multiple observation years
  - Computes RTS count, area, density, and coverage per cell
  - Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched Parquet files with RTS metrics
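
After the overlay, the per-cell metrics reduce to a groupby aggregation. A minimal pandas sketch (column names are illustrative, not the actual `darts.py` schema):

```python
# Sketch of per-cell label aggregation: each row is one RTS polygon
# fragment after overlay with the grid. Column names are illustrative.
import pandas as pd

overlay = pd.DataFrame({
    "cell_id":  [1, 1, 2, 2, 2],
    "rts_area": [0.4, 0.6, 0.2, 0.3, 0.5],   # km² of slump inside the cell
})
cell_area = pd.Series({1: 10.0, 2: 20.0}, name="land_area")  # km² per cell

labels = overlay.groupby("cell_id")["rts_area"].agg(
    rts_count="count", rts_area="sum"
)
labels["rts_coverage"] = labels["rts_area"] / cell_area  # fraction of cell covered
```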

#### 3. Feature Extractors

**ERA5 Climate Data (`era5.py`)**

- Downloads hourly climate variables from the Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
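
Degree-day metrics reduce a year of daily temperatures to a single scalar. A minimal list-based sketch of the idea (the actual pipeline computes this on Xarray/Dask arrays):

```python
# Thawing degree days (TDD): sum of daily mean temperatures above 0 °C over
# an Arctic-aligned year (Oct 1 - Sep 30). Minimal sketch only.
def thawing_degree_days(daily_mean_c: list[float]) -> float:
    return sum(t for t in daily_mean_c if t > 0.0)

def thawing_period_length(daily_mean_c: list[float]) -> int:
    """Number of days with mean temperature above freezing."""
    return sum(1 for t in daily_mean_c if t > 0.0)

temps = [-12.0, -3.5, 0.0, 2.0, 8.5, 4.0, -1.0]
tdd = thawing_degree_days(temps)       # 2.0 + 8.5 + 4.0 = 14.5
days = thawing_period_length(temps)    # 3
```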

**ArcticDEM Terrain (`arcticdem.py`)**

- Processes 32 m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
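
The simplest terrain derivative, slope, comes from finite differences of the elevation surface. A CPU NumPy sketch (the real module uses GPU-accelerated convolution kernels):

```python
# Slope from a DEM via central differences (CPU NumPy sketch; the actual
# module uses GPU convolutions). Pixel spacing matches the 32 m mosaic.
import numpy as np

def slope_degrees(dem: np.ndarray, pixel_size: float = 32.0) -> np.ndarray:
    dz_dy, dz_dx = np.gradient(dem, pixel_size)      # elevation change per metre
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# A plane rising 1 m per pixel eastward has a uniform, known slope.
x = np.arange(5, dtype=float)
dem = np.tile(x, (5, 1))                             # elevation = column index (m)
slope = slope_degrees(dem)
```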

**AlphaEarth Embeddings (`alphaearth.py`)**

- Extracts 64-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Samples temporally across multiple years
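
Partitioning a large grid into spatially compact batches amounts to clustering cell centroids. A plain-NumPy Lloyd iteration sketches the idea (the module's actual KMeans configuration may differ):

```python
# Partition grid-cell centroids into k spatially compact batches with a few
# Lloyd (KMeans) iterations. Plain-NumPy sketch only.
import numpy as np

def kmeans_partition(points: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each centroid to its nearest cluster center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(42)
# Two well-separated blobs of cell centroids -> two clean batches.
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, _ = kmeans_partition(pts, k=2)
```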

#### 4. Spatial Aggregation Framework (`aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
  - Exact geometry-based aggregation using polygon overlay
  - Grid-aligned batch processing for memory efficiency
  - Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
  - Antimeridian handling for polar regions
  - Parallel processing with worker pools
  - GPU acceleration via CuPy/cuML where applicable
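
At its core, raster-to-vector aggregation groups raster pixels by the grid cell covering them and reduces each group. A simplified NumPy version (the real engine performs exact polygon overlay with area weighting):

```python
# Simplified raster-to-vector aggregation: group raster pixels by the grid
# cell covering them, then reduce. Rasterized approximation of the real
# exact-overlay engine.
import numpy as np

raster = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
cells  = np.array([[0, 0, 1],       # cell ID covering each pixel
                   [0, 1, 1]])

def zonal_stats(values: np.ndarray, zones: np.ndarray) -> dict[int, dict[str, float]]:
    out = {}
    for zone in np.unique(zones):
        v = values[zones == zone]
        out[int(zone)] = {"mean": float(v.mean()), "sum": float(v.sum()),
                          "std": float(v.std())}
    return out

stats = zonal_stats(raster, cells)
```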

#### 5. Dataset Assembly (`dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
  - Dynamic dataset configuration via dataclasses
  - Multi-task support: binary classification, count, and density estimation
  - Target binning: quantile-based categorical bins
  - Train/test splitting with spatial awareness
  - GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for the scikit-learn API
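
Quantile-based target binning turns a continuous density target into roughly balanced categorical classes; with pandas this is `qcut`. The bin count and labels below are illustrative, not the project's actual configuration:

```python
# Quantile-based target binning: convert a continuous RTS density target
# into (roughly) equally populated categorical classes. Illustrative bins.
import pandas as pd

density = pd.Series([0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 3.5, 9.0])
bins = pd.qcut(density, q=4, labels=["low", "mid-low", "mid-high", "high"])

counts = bins.value_counts()   # each quartile bin holds ~the same number of cells
```

Balanced bins help the downstream classifiers, since raw RTS density is heavily skewed toward zero.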

#### 6. Model Training (`training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
  - **XGBoost**: Gradient boosting (GPU-accelerated)
  - **Random Forest**: cuML implementation
  - **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
  - Randomized hyperparameter search (SciPy distributions)
  - K-fold cross-validation
  - Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
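
The randomized-search loop amounts to sampling configurations from parameter distributions, scoring each, and keeping the best. A dependency-free skeleton (the real CLI draws from SciPy distributions and scores with K-fold cross-validation; the objective below is a stand-in):

```python
# Skeleton of randomized hyperparameter search: draw configurations from
# parameter distributions, score each, keep the best. The toy objective
# stands in for a cross-validation score.
import random

param_space = {
    "n_clusters": lambda rng: rng.randint(2, 64),
    "reg":        lambda rng: 10 ** rng.uniform(-4, 0),   # log-uniform draw
}

def toy_score(params: dict) -> float:
    # Stand-in for a CV score; peaks near n_clusters=32, reg=0.01.
    return -abs(params["n_clusters"] - 32) - abs(params["reg"] - 0.01)

rng = random.Random(7)
best_score, best_params = float("-inf"), None
for _ in range(100):
    params = {name: draw(rng) for name, draw in param_space.items()}
    score = toy_score(params)
    if score > best_score:
        best_score, best_params = score, params
```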

#### 7. Inference (`inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via a unified interface
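
Batched prediction is a chunked loop that bounds peak (GPU) memory. A CPU sketch with a stand-in model:

```python
# Chunked batch prediction: bound peak memory by scoring fixed-size slices
# and concatenating the results. The "model" here is a stand-in; the real
# pipeline feeds batches to the trained classifier the same way.
import numpy as np

def predict_batched(model, X: np.ndarray, batch_size: int = 4) -> np.ndarray:
    parts = [model(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

threshold_model = lambda batch: (batch[:, 0] > 0.5).astype(int)  # stand-in classifier
X = np.linspace(0, 1, 10).reshape(-1, 1)
preds = predict_batched(threshold_model, X, batch_size=4)
```

Because predictions are concatenated in input order, the result is identical to a single full-array call, only with a bounded working set.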

#### 8. Interactive Dashboard (`dashboard/`)

- **Technology**: Streamlit-based web application
- **Pages**:
  - **Overview**: Result directory summary and metrics
  - **Training Data**: Feature distribution visualizations
  - **Training Analysis**: CV results, hyperparameter analysis
  - **Model State**: Feature importance, decision boundaries
  - **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots

## Key Design Patterns

### 1. XDGGS Integration

All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:

- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities

### 2. Lazy Evaluation & Chunking

- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory
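
The essence of chunked, out-of-core computation is never materializing the full array: stream fixed-size pieces and combine partial results. A plain-Python stand-in for what Xarray/Dask provide over Zarr stores:

```python
# Stream fixed-size chunks and combine partial results instead of loading
# the whole array. Plain-Python stand-in for Xarray/Dask over Zarr.
def chunked_mean(chunks) -> float:
    total, count = 0.0, 0
    for chunk in chunks:                 # each chunk fits in memory
        total += sum(chunk)
        count += len(chunk)
    return total / count

# A generator stands in for lazily loaded Zarr chunks.
lazy_chunks = (list(range(i, i + 100)) for i in range(0, 1000, 100))
mean = chunked_mean(lazy_chunks)         # mean of 0..999
```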

### 3. GPU Acceleration

- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations

### 4. Modular CLI Design

Each module exposes a Cyclopts-based CLI for standalone execution. Wrapper scripts invoke these CLIs to run parts of the data flow across all grid-level combinations:

```bash
scripts/00grids.sh        # Grid generation
scripts/01darts.sh        # Label extraction
scripts/02alphaearth.sh   # Satellite embeddings
scripts/03era5.sh         # Climate features
scripts/04arcticdem.sh    # Terrain features
scripts/05train.sh        # Model training
```

### 5. Configuration as Code

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)

## Data Storage Hierarchy

```sh
DATA_DIR/
├── grids/             # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/             # RTS labels (GeoParquet)
├── era5/              # Climate data (Zarr)
├── arcticdem/         # Terrain data (Icechunk Zarr)
├── alphaearth/        # Satellite embeddings (Zarr)
├── datasets/          # L2 XDGGS datasets (Zarr)
├── training-results/  # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/         # Ocean mask (GeoParquet)
```

## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec, Shapely, Rasterio
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
- **CLI**: Cyclopts, Rich
- **External APIs**: Google Earth Engine, Copernicus Climate Data Store

## Performance Considerations

- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations

## Extension Points

The architecture supports extension through:

- **New Data Sources**: Implement a feature extractor following the ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to the `_Aggregations` dataclass
- **Alternative Targets**: Implement a label extractor following the DARTS pattern
- **Alternative Models**: Extend the training CLI with new scikit-learn-compatible estimators
- **Dashboard Pages**: Add Streamlit pages to the `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration