Refactor non-dashboard modules

Tobias Hölzer 2025-12-28 20:38:51 +01:00
parent 907e907856
commit 45bc61e49e
22 changed files with 122 additions and 186 deletions

View file

@@ -42,7 +42,7 @@ For project goals and setup, see [README.md](../README.md).

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/utils/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
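The env-var-plus-default convention above can be sketched in a few lines. This is a minimal stand-in, not the project's actual `utils/paths.py` API; only the `FAST_DATA_DIR` name and `./data` default come from the docs, and the function name is illustrative.

```python
import os
from pathlib import Path

# Hypothetical resolver in the spirit of utils/paths.py: read the data
# root from FAST_DATA_DIR, falling back to ./data relative to the cwd.
def data_dir() -> Path:
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()
```

Centralizing this in one module keeps scripts and notebooks from hard-coding paths that break across machines.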
### Storage Hierarchy
@@ -64,16 +64,37 @@ DATA_DIR/

### Core Modules (`src/entropice/`)

The codebase is organized into four main packages:

- **`entropice.ingest`**: Data ingestion from external sources
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows
- **`entropice.utils`**: Common utilities

#### Data Ingestion (`src/entropice/ingest/`)

- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine

#### Spatial Operations (`src/entropice/spatial/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`watermask.py`**: Ocean masking utilities
- **`xvec.py`**: Extended vector operations for xarray

#### Machine Learning (`src/entropice/ml/`)

- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models

#### Utilities (`src/entropice/utils/`)

- **`paths.py`**: Centralized path management
- **`codecs.py`**: Custom codecs for data serialization

### Dashboard (`src/entropice/dashboard/`)
@@ -110,10 +131,10 @@ pixi run pytest

### Common Tasks

- **Generate grids**: Use `spatial/grids.py` CLI
- **Process labels**: Use `ingest/darts.py` CLI
- **Train models**: Use `ml/training.py` CLI with TOML config
- **Run inference**: Use `ml/inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns
@@ -162,10 +183,10 @@ Training features:

To extend Entropice:

- **New data source**: Follow patterns in `ingest/era5.py` or `ingest/arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative labels**: Implement extractor following `ingest/darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `ml/training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module

## Important Notes
@@ -175,13 +196,13 @@ To extend Entropice:

- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `utils/paths.py`

## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `spatial/aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
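The antimeridian issue above comes from cells near ±180° whose longitudes wrap sign. One common fix is to shift longitudes into a continuous frame before building geometries; this is a toy sketch of that idea, not the project's actual handling in `spatial/aggregators.py`.

```python
def unwrap_lons(lons):
    # If the longitude span suggests an antimeridian crossing
    # (a gap wider than 180 degrees), shift negative longitudes by
    # +360 so the ring becomes continuous (e.g. 179, -179 -> 179, 181).
    if max(lons) - min(lons) > 180:
        return [lon + 360 if lon < 0 else lon for lon in lons]
    return list(lons)
```

Without such handling, a cell straddling the antimeridian appears to span nearly the whole globe and produces invalid polygons.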

View file

@@ -27,7 +27,14 @@ The pipeline follows a sequential processing approach where each stage produces

### System Components

The codebase is organized into four main packages:

- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
- **`entropice.utils`**: Common utilities (paths, codecs)

#### 1. Spatial Grid System (`spatial/grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
@@ -39,7 +46,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas

#### 2. Label Management (`ingest/darts.py`)

- **Purpose**: Extracts RTS labels from DARTS v2 dataset
- **Processing**:
@@ -51,7 +58,7 @@ The pipeline follows a sequential processing approach where each stage produces

#### 3. Feature Extractors

**ERA5 Climate Data (`ingest/era5.py`)**

- Downloads hourly climate variables from Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
@@ -59,7 +66,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
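The Arctic-aligned year mapping can be sketched with the stdlib. The Oct 1 - Sep 30 window comes from the docs; whether the project labels the window by its start or end year is an assumption here (this sketch labels it by the September it ends in), and the helper name is illustrative.

```python
from datetime import date

# Hypothetical helper: Oct 1 of year N-1 through Sep 30 of year N
# is treated as Arctic-aligned year N.
def arctic_year(d: date) -> int:
    return d.year + 1 if d.month >= 10 else d.year
```

Aligning on October keeps a full freeze-thaw cycle inside one "year", instead of splitting a winter across two calendar years.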
**ArcticDEM Terrain (`ingest/arcticdem.py`)**

- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
@@ -67,14 +74,14 @@ The pipeline follows a sequential processing approach where each stage produces

- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell

**AlphaEarth Embeddings (`ingest/alphaearth.py`)**

- Extracts 64-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years

#### 4. Spatial Aggregation Framework (`spatial/aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
@@ -86,7 +93,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Parallel processing with worker pools
- GPU acceleration via CuPy/CuML where applicable

#### 5. Dataset Assembly (`ml/dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
@@ -98,7 +105,7 @@ The pipeline follows a sequential processing approach where each stage produces

- GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for scikit-learn API

#### 6. Model Training (`ml/training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
@@ -112,7 +119,7 @@ The pipeline follows a sequential processing approach where each stage produces

- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
#### 7. Inference (`ml/inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
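The configurable-batch-size pattern behind this GPU memory management is generic; a minimal stdlib sketch of the chunking loop (not the project's actual inference code):

```python
from itertools import islice

def batched(items, batch_size):
    # Yield successive chunks of at most batch_size items; the last
    # chunk may be smaller. Processing one chunk at a time bounds
    # peak (GPU) memory regardless of total dataset size.
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk
```

On GPU OOM, shrinking `batch_size` trades throughput for a lower memory ceiling, which is exactly the knob the Common Issues section recommends turning.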
@@ -170,7 +177,7 @@ scripts/05train.sh # Model training

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`utils/paths.py`)

## Data Storage Hierarchy
@@ -209,9 +216,9 @@ DATA_DIR/

The architecture supports extension through:

- **New Data Sources**: Implement feature extractor in `ingest/` following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative Targets**: Implement label extractor in `ingest/` following DARTS pattern
- **Alternative Models**: Extend training CLI in `ml/training.py` with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS in `spatial/grids.py` via xdggs integration

View file

@@ -22,7 +22,10 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

### Code Organization

- **`src/entropice/ingest/`**: Data ingestion modules (darts, era5, arcticdem, alphaearth)
- **`src/entropice/spatial/`**: Spatial operations (grids, aggregators, watermask, xvec)
- **`src/entropice/ml/`**: Machine learning components (dataset, training, inference)
- **`src/entropice/utils/`**: Utilities (paths, codecs)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
@@ -30,11 +33,12 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

### Key Modules

- `spatial/grids.py`: H3/HEALPix spatial grid systems
- `ingest/darts.py`, `ingest/era5.py`, `ingest/arcticdem.py`, `ingest/alphaearth.py`: Data source processors
- `ml/dataset.py`: Dataset assembly and feature engineering
- `ml/training.py`: Model training with eSPA, XGBoost, Random Forest, KNN
- `ml/inference.py`: Prediction generation
- `utils/paths.py`: Centralized path management

## Coding Standards
@@ -58,7 +62,7 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `utils/paths.py` for consistent path management

## Testing

View file

@@ -70,12 +70,12 @@ dependencies = [
]

[project.scripts]
create-grid = "entropice.spatial.grids:main"
darts = "entropice.ingest.darts:cli"
alpha-earth = "entropice.ingest.alphaearth:main"
era5 = "entropice.ingest.era5:cli"
arcticdem = "entropice.ingest.arcticdem:cli"
train = "entropice.ml.training:cli"

[build-system]
requires = ["hatchling"]

View file

@@ -0,0 +1,14 @@
"""Data ingestion modules for external data sources.

This package contains modules for processing and ingesting data from various
external sources into the Entropice system:

- darts: Retrogressive Thaw Slump (RTS) labels from DARTS v2 dataset
- era5: Climate data from ERA5 reanalysis
- arcticdem: Terrain data from ArcticDEM
- alphaearth: Satellite image embeddings from AlphaEarth/Google Earth Engine
"""

from . import alphaearth, arcticdem, darts, era5

__all__ = ["alphaearth", "arcticdem", "darts", "era5"]

View file

@@ -0,0 +1,12 @@
"""Machine learning components for model training and inference.

This package contains modules for machine learning workflows:

- dataset: Multi-source dataset assembly and feature engineering
- training: Model training with eSPA, XGBoost, Random Forest, KNN
- inference: Batch prediction pipeline for trained classifiers
"""

from . import dataset, inference, training

__all__ = ["dataset", "inference", "training"]

View file

@@ -0,0 +1,13 @@
"""Spatial operations and grid management modules.

This package contains modules for spatial data processing and grid-based operations:

- grids: Discrete global grid system (H3/HEALPix) generation and management
- aggregators: Raster-to-vector spatial aggregation framework
- watermask: Ocean masking utilities
- xvec: Extended vector operations for xarray
"""

from . import aggregators, grids, watermask, xvec

__all__ = ["aggregators", "grids", "watermask", "xvec"]

View file

@@ -0,0 +1,11 @@
"""Utility modules for common functionality.

This package contains utility modules used across the Entropice system:

- paths: Centralized path management and configuration
- codecs: Custom codecs for data serialization
"""

from . import codecs, paths

__all__ = ["codecs", "paths"]

View file

@@ -1,146 +0,0 @@
"""Test script to verify feature extraction works correctly."""
import numpy as np
import xarray as xr
# Create a mock model state with various feature types
features = [
# Embedding features: embedding_{agg}_{band}_{year}
"embedding_mean_B02_2020",
"embedding_std_B03_2021",
"embedding_max_B04_2022",
# ERA5 features without aggregations: era5_{variable}_{time}
"era5_temperature_2020_summer",
"era5_precipitation_2021_winter",
# ERA5 features with aggregations: era5_{variable}_{time}_{agg}
"era5_temperature_2020_summer_mean",
"era5_precipitation_2021_winter_std",
# ArcticDEM features: arcticdem_{variable}_{agg}
"arcticdem_elevation_mean",
"arcticdem_slope_std",
"arcticdem_aspect_max",
# Common features
"cell_area",
"water_area",
"land_area",
"land_ratio",
"lon",
"lat",
]
# Create mock importance values
importance_values = np.random.rand(len(features))
# Create a mock model state for ESPA
model_state_espa = xr.Dataset(
{
"feature_weights": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
)
}
)
# Create a mock model state for XGBoost
model_state_xgb = xr.Dataset(
{
"feature_importance_gain": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
),
"feature_importance_weight": xr.DataArray(
importance_values * 0.8,
dims=["feature"],
coords={"feature": features},
),
}
)
# Create a mock model state for Random Forest
model_state_rf = xr.Dataset(
{
"feature_importance": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
)
}
)
# Test extraction functions
from entropice.dashboard.utils.data import (
extract_arcticdem_features,
extract_common_features,
extract_embedding_features,
extract_era5_features,
)
print("=" * 80)
print("Testing ESPA model state")
print("=" * 80)
embedding_array = extract_embedding_features(model_state_espa)
print(f"\nEmbedding features extracted: {embedding_array is not None}")
if embedding_array is not None:
print(f" Dimensions: {embedding_array.dims}")
print(f" Shape: {embedding_array.shape}")
print(f" Coordinates: {list(embedding_array.coords)}")
era5_array = extract_era5_features(model_state_espa)
print(f"\nERA5 features extracted: {era5_array is not None}")
if era5_array is not None:
print(f" Dimensions: {era5_array.dims}")
print(f" Shape: {era5_array.shape}")
print(f" Coordinates: {list(era5_array.coords)}")
arcticdem_array = extract_arcticdem_features(model_state_espa)
print(f"\nArcticDEM features extracted: {arcticdem_array is not None}")
if arcticdem_array is not None:
print(f" Dimensions: {arcticdem_array.dims}")
print(f" Shape: {arcticdem_array.shape}")
print(f" Coordinates: {list(arcticdem_array.coords)}")
common_array = extract_common_features(model_state_espa)
print(f"\nCommon features extracted: {common_array is not None}")
if common_array is not None:
print(f" Dimensions: {common_array.dims}")
print(f" Shape: {common_array.shape}")
print(f" Size: {common_array.size}")
print("\n" + "=" * 80)
print("Testing XGBoost model state")
print("=" * 80)
embedding_array_xgb = extract_embedding_features(model_state_xgb, importance_type="feature_importance_gain")
print(f"\nEmbedding features (gain) extracted: {embedding_array_xgb is not None}")
if embedding_array_xgb is not None:
print(f" Dimensions: {embedding_array_xgb.dims}")
print(f" Shape: {embedding_array_xgb.shape}")
era5_array_xgb = extract_era5_features(model_state_xgb, importance_type="feature_importance_weight")
print(f"\nERA5 features (weight) extracted: {era5_array_xgb is not None}")
if era5_array_xgb is not None:
print(f" Dimensions: {era5_array_xgb.dims}")
print(f" Shape: {era5_array_xgb.shape}")
print("\n" + "=" * 80)
print("Testing Random Forest model state")
print("=" * 80)
embedding_array_rf = extract_embedding_features(model_state_rf, importance_type="feature_importance")
print(f"\nEmbedding features extracted: {embedding_array_rf is not None}")
if embedding_array_rf is not None:
print(f" Dimensions: {embedding_array_rf.dims}")
print(f" Shape: {embedding_array_rf.shape}")
arcticdem_array_rf = extract_arcticdem_features(model_state_rf, importance_type="feature_importance")
print(f"\nArcticDEM features extracted: {arcticdem_array_rf is not None}")
if arcticdem_array_rf is not None:
print(f" Dimensions: {arcticdem_array_rf.dims}")
print(f" Shape: {arcticdem_array_rf.shape}")
print("\n" + "=" * 80)
print("All tests completed successfully!")
print("=" * 80)