Add some docs for copilot
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions
217
ARCHITECTURE.md
Normal file
@ -0,0 +1,217 @@
# Entropice Architecture

## Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).
The system processes multi-source geospatial data, aggregates it into discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.
Most parts of the data flow pipeline are modular and abstracted, which allows rigorous experimentation with both the data and the models.
Alternative models, datasets, and target labels can therefore be swapped in, both to compare approaches and to explore methodologies beyond RTS prediction or eSPA modelling.

## Core Architecture

### Data Flow Pipeline

```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```

The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:

1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations

### System Components

#### 1. Spatial Grid System (`grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
  - Hex: 3, 4, 5, 6
  - HEALPix: 6, 7, 8, 9, 10
- **Functionality**:
  - Generates cell IDs and geometries at configurable resolutions
  - Applies a watermask to exclude ocean areas
  - Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
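
To give a feel for how the listed levels scale, the sketch below computes the number of global cells per level before any watermask is applied. It uses the standard closed-form cell counts for H3 and HEALPix; treating the HEALPix "level" as the HEALPix order (with `nside = 2**level`) is an assumption, and the function names are made up for this example.

```python
def h3_cell_count(resolution: int) -> int:
    """Global number of H3 cells: 12 pentagons plus hexagons at the given resolution."""
    return 2 + 120 * 7**resolution


def healpix_cell_count(level: int) -> int:
    """Global number of HEALPix pixels, assuming level == order (nside = 2**level)."""
    nside = 2**level
    return 12 * nside**2


# Levels taken from the list above.
h3_counts = {res: h3_cell_count(res) for res in (3, 4, 5, 6)}
healpix_counts = {lvl: healpix_cell_count(lvl) for lvl in (6, 7, 8, 9, 10)}
```

Each H3 step multiplies the cell count by roughly 7 and each HEALPix step by exactly 4, which is worth keeping in mind when choosing batch sizes for the later aggregation stages.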

#### 2. Label Management (`darts.py`)

- **Purpose**: Extracts RTS labels from the DARTS v2 dataset
- **Processing**:
  - Spatial overlay of DARTS polygons with grid cells
  - Temporal aggregation across multiple observation years
  - Computes RTS count, area, density, and coverage per cell
  - Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched Parquet files with RTS metrics
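
As a rough illustration of the per-cell metrics, the sketch below aggregates already-overlaid RTS footprints into count, area, and density for each grid cell. The input layout, the function name `summarize_cell_labels`, and the definition of density as features per km² of land are assumptions made for this example, not the actual `darts.py` logic.

```python
from collections import defaultdict


def summarize_cell_labels(overlaps, land_area_km2):
    """Aggregate (cell_id, rts_area_km2) overlay records into per-cell metrics."""
    count = defaultdict(int)
    area = defaultdict(float)
    for cell_id, rts_area in overlaps:
        count[cell_id] += 1
        area[cell_id] += rts_area

    summary = {}
    for cell_id, land in land_area_km2.items():
        summary[cell_id] = {
            "rts_count": count[cell_id],
            "rts_area_km2": area[cell_id],
            # Assumed definition: RTS features per km² of land in the cell.
            "rts_density": count[cell_id] / land if land > 0 else 0.0,
        }
    return summary


# Hypothetical overlay result: two slumps in cell "a", one in "b", none in "c".
overlaps = [("a", 0.02), ("a", 0.01), ("b", 0.05)]
land = {"a": 10.0, "b": 5.0, "c": 8.0}
label_summary = summarize_cell_labels(overlaps, land)
```

Cells without any RTS still appear in the summary with zero counts, which keeps negative examples available for training.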

#### 3. Feature Extractors

**ERA5 Climate Data (`era5.py`)**

- Downloads hourly climate variables from the Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 to September 30)
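
The Arctic-aligned year and the degree-day metrics can be sketched in a few lines. This is a minimal illustration, assuming daily mean temperatures in °C and assuming that an October-to-September year is labelled by the calendar year in which it ends; the function names are hypothetical and the real `era5.py` may use different conventions.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Label of the Oct 1 to Sep 30 year containing `day` (assumed: its ending calendar year)."""
    return day.year + 1 if day.month >= 10 else day.year


def thawing_degree_days(daily_mean_c):
    """Sum of daily mean temperatures above 0 °C."""
    return sum(t for t in daily_mean_c if t > 0)


def freezing_degree_days(daily_mean_c):
    """Sum of the magnitudes of daily mean temperatures below 0 °C."""
    return sum(-t for t in daily_mean_c if t < 0)


def thawing_period_length(daily_mean_c):
    """Number of days with a daily mean above 0 °C."""
    return sum(1 for t in daily_mean_c if t > 0)
```

Grouping daily values by `arctic_year` keeps a full winter and the following thaw season in the same aggregation window instead of splitting the winter across two calendar years.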

**ArcticDEM Terrain (`arcticdem.py`)**

- Processes 32 m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
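
To make one of the terrain indices concrete, here is a plain-Python sketch of the Topographic Position Index (TPI): the elevation of a pixel minus the mean elevation of its neighbours. The 3×3 window and the function name are assumptions for illustration; the module itself is described as using GPU-accelerated convolutions, which compute the same kind of neighbourhood statistic far more efficiently.

```python
def topographic_position_index(dem):
    """TPI per interior pixel: elevation minus the mean of its 8 neighbours.

    Border pixels have no full 3x3 neighbourhood and are left as None.
    """
    rows, cols = len(dem), len(dem[0])
    tpi = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neighbours = [
                dem[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            tpi[r][c] = dem[r][c] - sum(neighbours) / len(neighbours)
    return tpi


# A single 8 m bump on a flat 10 m surface.
dem = [
    [10.0, 10.0, 10.0],
    [10.0, 18.0, 10.0],
    [10.0, 10.0, 10.0],
]
tpi = topographic_position_index(dem)
```

Positive values mark pixels above their surroundings (ridges, bumps), negative values mark depressions.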

**AlphaEarth Embeddings (`alphaearth.py`)**

- Extracts 256-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years

#### 4. Spatial Aggregation Framework (`aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
  - Exact geometry-based aggregation using polygon overlay
  - Grid-aligned batch processing for memory efficiency
  - Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
  - Antimeridian handling for polar regions
  - Parallel processing with worker pools
  - GPU acceleration via CuPy/cuML where applicable
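
The statistical reductions can be illustrated with the standard library alone. The sketch below reduces the raster values that fall into one grid cell to the listed statistics; the function name, the choice of the population standard deviation, and the 25 %/75 % quantiles are assumptions for this example rather than the behaviour of `aggregators.py`.

```python
import math
import statistics


def aggregate_cell(values):
    """Reduce the raster values of a single grid cell to summary statistics."""
    # Quartiles with linear interpolation between the observed values.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "sum": math.fsum(values),
        "std": statistics.pstdev(values),  # population std, an assumption
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "q25": q25,
        "q75": q75,
    }


cell_stats = aggregate_cell([1.0, 2.0, 3.0, 4.0, 5.0])
```

Applying such a reduction per cell and per variable turns each raster layer into a handful of tabular feature columns.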

#### 5. Dataset Assembly (`dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
  - Dynamic dataset configuration via dataclasses
  - Multi-task support: binary classification, count, density estimation
  - Target binning: quantile-based categorical bins
  - Train/test splitting with spatial awareness
  - GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for the scikit-learn API
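
Quantile-based target binning turns a continuous target such as RTS density into class labels with roughly equal support. The following is a minimal stdlib sketch of that idea; the function names, the number of bins, and the sample values are hypothetical and not taken from `dataset.py`.

```python
import statistics
from bisect import bisect_right


def quantile_bin_edges(values, n_bins):
    """Interior edges that split `values` into `n_bins` roughly equal-sized groups."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, edges):
    """Index of the bin (0 .. len(edges)) that `value` falls into."""
    return bisect_right(edges, value)


# Hypothetical, strongly skewed density values.
densities = [0.0, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]
edges = quantile_bin_edges(densities, n_bins=3)
labels = [assign_bin(d, edges) for d in densities]
```

Because the edges follow the data distribution, skewed targets still yield balanced classes, which fixed-width bins would not.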

#### 6. Model Training (`training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
  - **XGBoost**: Gradient boosting (GPU-accelerated)
  - **Random Forest**: cuML implementation
  - **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
  - Randomized hyperparameter search (SciPy distributions)
  - K-Fold cross-validation
  - Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
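
The training strategy combines two ideas: draw hyperparameters at random, then score each draw by K-fold cross-validation. The stdlib-only sketch below shows that control flow; it is a stand-in for illustration, not the project's training code. All names are hypothetical, and the toy scorer skips the fitting step that a real estimator would perform on each training fold.

```python
import random


def kfold_indices(n_samples, n_splits, rng):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def randomized_search(score_fn, sample_params, n_samples, n_iter=10, n_splits=5, seed=0):
    """Return (best_params, best_mean_score) over `n_iter` random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        scores = [
            score_fn(params, train, test)
            for train, test in kfold_indices(n_samples, n_splits, rng)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy problem: find a threshold that separates values below and above 0.5.
X = [i / 20 for i in range(20)]
y = [int(x >= 0.5) for x in X]


def sample_params(rng):
    return {"threshold": rng.uniform(0.0, 1.0)}


def score_fn(params, train_idx, test_idx):
    # Accuracy of a fixed threshold rule on the held-out fold (no fitting step).
    hits = sum(int(X[i] >= params["threshold"]) == y[i] for i in test_idx)
    return hits / len(test_idx)


best_params, best_score = randomized_search(score_fn, sample_params, n_samples=len(X))
```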

#### 7. Inference (`inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via unified interface
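
Batched prediction bounds peak memory by only ever handing a fixed number of rows to the model at once. A minimal sketch of that pattern, assuming a generic `predict_fn` callable and a made-up default batch size (neither is taken from `inference.py`):

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(predict_fn, rows, batch_size=1024):
    """Run `predict_fn` batch by batch and concatenate the results in order."""
    predictions = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


# Toy model that records the batch sizes it was called with.
seen_batch_sizes = []


def toy_predict(batch):
    seen_batch_sizes.append(len(batch))
    return [int(x > 0.5) for x in batch]


preds = predict_in_batches(toy_predict, [0.1, 0.9, 0.7, 0.2, 0.6], batch_size=2)
```

Any model exposing a `predict`-style callable fits this loop, which is one way a unified interface across model types can be kept simple.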

#### 8. Interactive Dashboard (`dashboard/`)

- **Technology**: Streamlit-based web application
- **Pages**:
  - **Overview**: Result directory summary and metrics
  - **Training Data**: Feature distribution visualizations
  - **Training Analysis**: CV results, hyperparameter analysis
  - **Model State**: Feature importance, decision boundaries
  - **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots

## Key Design Patterns

### 1. XDGGS Integration

All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:

- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities

### 2. Lazy Evaluation & Chunking

- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory

### 3. GPU Acceleration

- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations

### 4. Modular CLI Design

Each module exposes a Cyclopts-based CLI for standalone execution. Wrapper scripts call these CLIs to run each stage of the data flow for all grid and level combinations:

```bash
scripts/00grids.sh       # Grid generation
scripts/01darts.sh       # Label extraction
scripts/02alphaearth.sh  # Satellite embeddings
scripts/03era5.sh        # Climate features
scripts/04arcticdem.sh   # Terrain features
scripts/05train.sh       # Model training
```

### 5. Configuration as Code

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)

## Data Storage Hierarchy

```txt
DATA_DIR/
├── grids/             # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/             # RTS labels (GeoParquet)
├── era5/              # Climate data (Zarr)
├── arcticdem/         # Terrain data (Icechunk Zarr)
├── alphaearth/        # Satellite embeddings (Zarr)
├── datasets/          # L2 XDGGS datasets (Zarr)
├── training-results/  # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/         # Ocean mask (GeoParquet)
```

## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec, Shapely, Rasterio
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
- **CLI**: Cyclopts, Rich
- **External APIs**: Google Earth Engine, Copernicus Climate Data Store

## Performance Considerations

- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations

## Extension Points

The architecture supports extension through:

- **New Data Sources**: Implement a feature extractor following the ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to the `_Aggregations` dataclass
- **Alternative Targets**: Implement a label extractor following the DARTS pattern
- **Alternative Models**: Extend the training CLI with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to the `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration