Refactor non-dashboard modules

Tobias Hölzer 2025-12-28 20:38:51 +01:00
parent 907e907856
commit 45bc61e49e
22 changed files with 122 additions and 186 deletions

@@ -27,7 +27,14 @@ The pipeline follows a sequential processing approach where each stage produces
### System Components
#### 1. Spatial Grid System (`grids.py`)
The codebase is organized into four main packages:
- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
- **`entropice.utils`**: Common utilities (paths, codecs)
#### 1. Spatial Grid System (`spatial/grids.py`)
- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
@@ -39,7 +46,7 @@ The pipeline follows a sequential processing approach where each stage produces
- Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas
#### 2. Label Management (`darts.py`)
#### 2. Label Management (`ingest/darts.py`)
- **Purpose**: Extracts RTS labels from DARTS v2 dataset
- **Processing**:
@@ -51,7 +58,7 @@ The pipeline follows a sequential processing approach where each stage produces
#### 3. Feature Extractors
**ERA5 Climate Data (`era5.py`)**
**ERA5 Climate Data (`ingest/era5.py`)**
- Downloads hourly climate variables from Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
@@ -59,7 +66,7 @@ The pipeline follows a sequential processing approach where each stage produces
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
**ArcticDEM Terrain (`arcticdem.py`)**
**ArcticDEM Terrain (`ingest/arcticdem.py`)**
- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
@@ -67,14 +74,14 @@ The pipeline follows a sequential processing approach where each stage produces
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell
**AlphaEarth Embeddings (`alphaearth.py`)**
**AlphaEarth Embeddings (`ingest/alphaearth.py`)**
- Extracts 64-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years
#### 4. Spatial Aggregation Framework (`aggregators.py`)
#### 4. Spatial Aggregation Framework (`spatial/aggregators.py`)
- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
@@ -86,7 +93,7 @@ The pipeline follows a sequential processing approach where each stage produces
- Parallel processing with worker pools
- GPU acceleration via CuPy/CuML where applicable
#### 5. Dataset Assembly (`dataset.py`)
#### 5. Dataset Assembly (`ml/dataset.py`)
- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
@@ -98,7 +105,7 @@ The pipeline follows a sequential processing approach where each stage produces
- GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for scikit-learn API
#### 6. Model Training (`training.py`)
#### 6. Model Training (`ml/training.py`)
- **Supported Models**:
- **eSPA**: Entropy-optimal probabilistic classifier (primary)
@@ -112,7 +119,7 @@ The pipeline follows a sequential processing approach where each stage produces
- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
#### 7. Inference (`inference.py`)
#### 7. Inference (`ml/inference.py`)
- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
@@ -170,7 +177,7 @@ scripts/05train.sh # Model training
- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)
- Environment-based path management (`utils/paths.py`)
## Data Storage Hierarchy
@@ -209,9 +216,9 @@ DATA_DIR/
The architecture supports extension through:
- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
- **Alternative Targets**: Implement label extractor following DARTS pattern
- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
- **New Data Sources**: Implement feature extractor in `ingest/` following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative Targets**: Implement label extractor in `ingest/` following DARTS pattern
- **Alternative Models**: Extend training CLI in `ml/training.py` with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration
- **Grid Systems**: Support additional DGGS in `spatial/grids.py` via xdggs integration
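
The path changes in this diff amount to a fixed mapping from the old module names to the four new packages. As a rough sketch of how downstream import statements might be updated to match, the helper below rewrites dotted references. The mapping itself is read off the file paths shown above; the assumption that the old layout was importable as flat `entropice.<module>` names, and the names `MODULE_MOVES` and `rewrite_imports`, are hypothetical and not part of the commit.

```python
import re

# Old module name -> new dotted location inside the `entropice` package.
# Taken from the paths in the diff (e.g. `grids.py` -> `spatial/grids.py`).
# ASSUMPTION: the pre-refactor modules were importable as `entropice.<name>`.
MODULE_MOVES = {
    "grids": "spatial.grids",
    "aggregators": "spatial.aggregators",
    "darts": "ingest.darts",
    "era5": "ingest.era5",
    "arcticdem": "ingest.arcticdem",
    "alphaearth": "ingest.alphaearth",
    "dataset": "ml.dataset",
    "training": "ml.training",
    "inference": "ml.inference",
    "paths": "utils.paths",
}

# Match `entropice.<old_module>` only when the module name ends at a word
# boundary, so `entropice.dataset_tools` would be left untouched.
_OLD_REFERENCE = re.compile(
    r"\bentropice\.(" + "|".join(MODULE_MOVES) + r")\b"
)


def rewrite_imports(source: str) -> str:
    """Return `source` with old flat module references moved to the new packages."""
    return _OLD_REFERENCE.sub(
        lambda match: "entropice." + MODULE_MOVES[match.group(1)],
        source,
    )


if __name__ == "__main__":
    print(rewrite_imports("from entropice.grids import create_grid"))
    print(rewrite_imports("import entropice.paths as paths"))
```

Because none of the new package names (`ingest`, `spatial`, `ml`, `utils`) appear as keys, running the helper twice leaves already-migrated references unchanged. It does not handle the `from entropice import grids` form, which would need separate treatment.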