Refactor non-dashboard modules

Tobias Hölzer 2025-12-28 20:38:51 +01:00
parent 907e907856
commit 45bc61e49e
22 changed files with 122 additions and 186 deletions

View file

@@ -42,7 +42,7 @@ For project goals and setup, see [README.md](../README.md).

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/utils/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
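The env-var-plus-default convention above can be sketched in a few lines. This is a minimal stand-in, not the project's actual `utils/paths.py` API; only the `FAST_DATA_DIR` name and `./data` default come from the docs, and the function name is illustrative.

```python
import os
from pathlib import Path

# Hypothetical resolver in the spirit of utils/paths.py: read the data
# root from FAST_DATA_DIR, falling back to ./data relative to the cwd.
def data_dir() -> Path:
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()
```

Centralizing this in one module keeps scripts and notebooks from hard-coding paths that break across machines.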
### Storage Hierarchy
@@ -64,16 +64,37 @@ DATA_DIR/

### Core Modules (`src/entropice/`)

The codebase is organized into four main packages:

- **`entropice.ingest`**: Data ingestion from external sources
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows
- **`entropice.utils`**: Common utilities

#### Data Ingestion (`src/entropice/ingest/`)

- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine

#### Spatial Operations (`src/entropice/spatial/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`watermask.py`**: Ocean masking utilities
- **`xvec.py`**: Extended vector operations for xarray

#### Machine Learning (`src/entropice/ml/`)

- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models

#### Utilities (`src/entropice/utils/`)

- **`paths.py`**: Centralized path management
- **`codecs.py`**: Custom codecs for data serialization

### Dashboard (`src/entropice/dashboard/`)
@@ -110,10 +131,10 @@ pixi run pytest

### Common Tasks

- **Generate grids**: Use `spatial/grids.py` CLI
- **Process labels**: Use `ingest/darts.py` CLI
- **Train models**: Use `ml/training.py` CLI with TOML config
- **Run inference**: Use `ml/inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns
@@ -162,10 +183,10 @@ Training features:

To extend Entropice:

- **New data source**: Follow patterns in `ingest/era5.py` or `ingest/arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative labels**: Implement extractor following `ingest/darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `ml/training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module

## Important Notes
@@ -175,13 +196,13 @@ To extend Entropice:

- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `utils/paths.py`

## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `spatial/aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
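The antimeridian issue above comes from cells near ±180° whose longitudes wrap sign. One common fix is to shift longitudes into a continuous frame before building geometries; this is a toy sketch of that idea, not the project's actual handling in `spatial/aggregators.py`.

```python
def unwrap_lons(lons):
    # If the longitude span suggests an antimeridian crossing
    # (a gap wider than 180 degrees), shift negative longitudes by
    # +360 so the ring becomes continuous (e.g. 179, -179 -> 179, 181).
    if max(lons) - min(lons) > 180:
        return [lon + 360 if lon < 0 else lon for lon in lons]
    return list(lons)
```

Without such handling, a cell straddling the antimeridian appears to span nearly the whole globe and produces invalid polygons.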

View file

@@ -27,7 +27,14 @@ The pipeline follows a sequential processing approach where each stage produces

### System Components

The codebase is organized into four main packages:

- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
- **`entropice.utils`**: Common utilities (paths, codecs)

#### 1. Spatial Grid System (`spatial/grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
@@ -39,7 +46,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas

#### 2. Label Management (`ingest/darts.py`)

- **Purpose**: Extracts RTS labels from DARTS v2 dataset
- **Processing**:
@@ -51,7 +58,7 @@ The pipeline follows a sequential processing approach where each stage produces

#### 3. Feature Extractors

**ERA5 Climate Data (`ingest/era5.py`)**

- Downloads hourly climate variables from Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
@@ -59,7 +66,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
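The Arctic-aligned year mapping can be sketched with the stdlib. The Oct 1 - Sep 30 window comes from the docs; whether the project labels the window by its start or end year is an assumption here (this sketch labels it by the September it ends in), and the helper name is illustrative.

```python
from datetime import date

# Hypothetical helper: Oct 1 of year N-1 through Sep 30 of year N
# is treated as Arctic-aligned year N.
def arctic_year(d: date) -> int:
    return d.year + 1 if d.month >= 10 else d.year
```

Aligning on October keeps a full freeze-thaw cycle inside one "year", instead of splitting a winter across two calendar years.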
**ArcticDEM Terrain (`ingest/arcticdem.py`)**

- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
@@ -67,14 +74,14 @@ The pipeline follows a sequential processing approach where each stage produces

- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell

**AlphaEarth Embeddings (`ingest/alphaearth.py`)**

- Extracts 64-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years

#### 4. Spatial Aggregation Framework (`spatial/aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
@@ -86,7 +93,7 @@ The pipeline follows a sequential processing approach where each stage produces

- Parallel processing with worker pools
- GPU acceleration via CuPy/CuML where applicable

#### 5. Dataset Assembly (`ml/dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
@@ -98,7 +105,7 @@ The pipeline follows a sequential processing approach where each stage produces

- GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for scikit-learn API

#### 6. Model Training (`ml/training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
@@ -112,7 +119,7 @@ The pipeline follows a sequential processing approach where each stage produces

- **Configuration**: TOML-based configuration with Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance
#### 7. Inference (`ml/inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
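The configurable-batch-size pattern behind this GPU memory management is generic; a minimal stdlib sketch of the chunking loop (not the project's actual inference code):

```python
from itertools import islice

def batched(items, batch_size):
    # Yield successive chunks of at most batch_size items; the last
    # chunk may be smaller. Processing one chunk at a time bounds
    # peak (GPU) memory regardless of total dataset size.
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk
```

On GPU OOM, shrinking `batch_size` trades throughput for a lower memory ceiling, which is exactly the knob the Common Issues section recommends turning.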
@@ -170,7 +177,7 @@ scripts/05train.sh # Model training

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`utils/paths.py`)

## Data Storage Hierarchy
@@ -209,9 +216,9 @@ DATA_DIR/

The architecture supports extension through:

- **New Data Sources**: Implement feature extractor in `ingest/` following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative Targets**: Implement label extractor in `ingest/` following DARTS pattern
- **Alternative Models**: Extend training CLI in `ml/training.py` with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS in `spatial/grids.py` via xdggs integration

View file

@@ -22,7 +22,10 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

### Code Organization

- **`src/entropice/ingest/`**: Data ingestion modules (darts, era5, arcticdem, alphaearth)
- **`src/entropice/spatial/`**: Spatial operations (grids, aggregators, watermask, xvec)
- **`src/entropice/ml/`**: Machine learning components (dataset, training, inference)
- **`src/entropice/utils/`**: Utilities (paths, codecs)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
@@ -30,11 +33,12 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

### Key Modules

- `spatial/grids.py`: H3/HEALPix spatial grid systems
- `ingest/darts.py`, `ingest/era5.py`, `ingest/arcticdem.py`, `ingest/alphaearth.py`: Data source processors
- `ml/dataset.py`: Dataset assembly and feature engineering
- `ml/training.py`: Model training with eSPA, XGBoost, Random Forest, KNN
- `ml/inference.py`: Prediction generation
- `utils/paths.py`: Centralized path management

## Coding Standards
@@ -58,7 +62,7 @@ This will set up the complete environment including RAPIDS, PyTorch, and all geo

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `utils/paths.py` for consistent path management

## Testing

View file

@@ -70,12 +70,12 @@ dependencies = [
]

[project.scripts]
create-grid = "entropice.spatial.grids:main"
darts = "entropice.ingest.darts:cli"
alpha-earth = "entropice.ingest.alphaearth:main"
era5 = "entropice.ingest.era5:cli"
arcticdem = "entropice.ingest.arcticdem:cli"
train = "entropice.ml.training:cli"

[build-system]
requires = ["hatchling"]

View file

@@ -0,0 +1,14 @@
"""Data ingestion modules for external data sources.

This package contains modules for processing and ingesting data from various
external sources into the Entropice system:

- darts: Retrogressive Thaw Slump (RTS) labels from DARTS v2 dataset
- era5: Climate data from ERA5 reanalysis
- arcticdem: Terrain data from ArcticDEM
- alphaearth: Satellite image embeddings from AlphaEarth/Google Earth Engine
"""

from . import alphaearth, arcticdem, darts, era5

__all__ = ["alphaearth", "arcticdem", "darts", "era5"]

View file

@@ -0,0 +1,12 @@
"""Machine learning components for model training and inference.

This package contains modules for machine learning workflows:

- dataset: Multi-source dataset assembly and feature engineering
- training: Model training with eSPA, XGBoost, Random Forest, KNN
- inference: Batch prediction pipeline for trained classifiers
"""

from . import dataset, inference, training

__all__ = ["dataset", "inference", "training"]

View file

@@ -0,0 +1,13 @@
"""Spatial operations and grid management modules.

This package contains modules for spatial data processing and grid-based operations:

- grids: Discrete global grid system (H3/HEALPix) generation and management
- aggregators: Raster-to-vector spatial aggregation framework
- watermask: Ocean masking utilities
- xvec: Extended vector operations for xarray
"""

from . import aggregators, grids, watermask, xvec

__all__ = ["aggregators", "grids", "watermask", "xvec"]

View file

@@ -0,0 +1,11 @@
"""Utility modules for common functionality.

This package contains utility modules used across the Entropice system:

- paths: Centralized path management and configuration
- codecs: Custom codecs for data serialization
"""

from . import codecs, paths

__all__ = ["codecs", "paths"]

View file

@@ -1,146 +0,0 @@
"""Test script to verify feature extraction works correctly."""
import numpy as np
import xarray as xr
# Create a mock model state with various feature types
features = [
# Embedding features: embedding_{agg}_{band}_{year}
"embedding_mean_B02_2020",
"embedding_std_B03_2021",
"embedding_max_B04_2022",
# ERA5 features without aggregations: era5_{variable}_{time}
"era5_temperature_2020_summer",
"era5_precipitation_2021_winter",
# ERA5 features with aggregations: era5_{variable}_{time}_{agg}
"era5_temperature_2020_summer_mean",
"era5_precipitation_2021_winter_std",
# ArcticDEM features: arcticdem_{variable}_{agg}
"arcticdem_elevation_mean",
"arcticdem_slope_std",
"arcticdem_aspect_max",
# Common features
"cell_area",
"water_area",
"land_area",
"land_ratio",
"lon",
"lat",
]
# Create mock importance values
importance_values = np.random.rand(len(features))
# Create a mock model state for ESPA
model_state_espa = xr.Dataset(
{
"feature_weights": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
)
}
)
# Create a mock model state for XGBoost
model_state_xgb = xr.Dataset(
{
"feature_importance_gain": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
),
"feature_importance_weight": xr.DataArray(
importance_values * 0.8,
dims=["feature"],
coords={"feature": features},
),
}
)
# Create a mock model state for Random Forest
model_state_rf = xr.Dataset(
{
"feature_importance": xr.DataArray(
importance_values,
dims=["feature"],
coords={"feature": features},
)
}
)
# Test extraction functions
from entropice.dashboard.utils.data import (
extract_arcticdem_features,
extract_common_features,
extract_embedding_features,
extract_era5_features,
)
print("=" * 80)
print("Testing ESPA model state")
print("=" * 80)
embedding_array = extract_embedding_features(model_state_espa)
print(f"\nEmbedding features extracted: {embedding_array is not None}")
if embedding_array is not None:
print(f" Dimensions: {embedding_array.dims}")
print(f" Shape: {embedding_array.shape}")
print(f" Coordinates: {list(embedding_array.coords)}")
era5_array = extract_era5_features(model_state_espa)
print(f"\nERA5 features extracted: {era5_array is not None}")
if era5_array is not None:
print(f" Dimensions: {era5_array.dims}")
print(f" Shape: {era5_array.shape}")
print(f" Coordinates: {list(era5_array.coords)}")
arcticdem_array = extract_arcticdem_features(model_state_espa)
print(f"\nArcticDEM features extracted: {arcticdem_array is not None}")
if arcticdem_array is not None:
print(f" Dimensions: {arcticdem_array.dims}")
print(f" Shape: {arcticdem_array.shape}")
print(f" Coordinates: {list(arcticdem_array.coords)}")
common_array = extract_common_features(model_state_espa)
print(f"\nCommon features extracted: {common_array is not None}")
if common_array is not None:
print(f" Dimensions: {common_array.dims}")
print(f" Shape: {common_array.shape}")
print(f" Size: {common_array.size}")
print("\n" + "=" * 80)
print("Testing XGBoost model state")
print("=" * 80)
embedding_array_xgb = extract_embedding_features(model_state_xgb, importance_type="feature_importance_gain")
print(f"\nEmbedding features (gain) extracted: {embedding_array_xgb is not None}")
if embedding_array_xgb is not None:
print(f" Dimensions: {embedding_array_xgb.dims}")
print(f" Shape: {embedding_array_xgb.shape}")
era5_array_xgb = extract_era5_features(model_state_xgb, importance_type="feature_importance_weight")
print(f"\nERA5 features (weight) extracted: {era5_array_xgb is not None}")
if era5_array_xgb is not None:
print(f" Dimensions: {era5_array_xgb.dims}")
print(f" Shape: {era5_array_xgb.shape}")
print("\n" + "=" * 80)
print("Testing Random Forest model state")
print("=" * 80)
embedding_array_rf = extract_embedding_features(model_state_rf, importance_type="feature_importance")
print(f"\nEmbedding features extracted: {embedding_array_rf is not None}")
if embedding_array_rf is not None:
print(f" Dimensions: {embedding_array_rf.dims}")
print(f" Shape: {embedding_array_rf.shape}")
arcticdem_array_rf = extract_arcticdem_features(model_state_rf, importance_type="feature_importance")
print(f"\nArcticDEM features extracted: {arcticdem_array_rf is not None}")
if arcticdem_array_rf is not None:
print(f" Dimensions: {arcticdem_array_rf.dims}")
print(f" Shape: {arcticdem_array_rf.shape}")
print("\n" + "=" * 80)
print("All tests completed successfully!")
print("=" * 80)