Refactor non-dashboard modules

2025-12-28 20:38:51 +01:00
commit 45bc61e49e
parent 907e907856
22 changed files with 122 additions and 186 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -27,7 +27,14 @@ The pipeline follows a sequential processing approach where each stage produces

 ### System Components

-#### 1. Spatial Grid System (`grids.py`)
+The codebase is organized into four main packages:
+
+- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
+- **`entropice.spatial`**: Spatial operations and grid management
+- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
+- **`entropice.utils`**: Common utilities (paths, codecs)
+
+#### 1. Spatial Grid System (`spatial/grids.py`)

 - **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
 - **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
@@ -39,7 +46,7 @@ The pipeline follows a sequential processing approach where each stage produces
  - Provides spatial indexing for efficient data aggregation
 - **Output**: GeoDataFrames with cell IDs, geometries, and land areas

-#### 2. Label Management (`darts.py`)
+#### 2. Label Management (`ingest/darts.py`)

 - **Purpose**: Extracts RTS labels from DARTS v2 dataset
 - **Processing**:
@@ -51,7 +58,7 @@ The pipeline follows a sequential processing approach where each stage produces

 #### 3. Feature Extractors

-**ERA5 Climate Data (`era5.py`)**
+**ERA5 Climate Data (`ingest/era5.py`)**

 - Downloads hourly climate variables from Copernicus Climate Data Store
 - Computes daily aggregates: temperature extrema, precipitation, snow metrics
@@ -59,7 +66,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - Temporal aggregations: yearly, seasonal, shoulder seasons
 - Uses Arctic-aligned years (October 1 - September 30)

-**ArcticDEM Terrain (`arcticdem.py`)**
+**ArcticDEM Terrain (`ingest/arcticdem.py`)**

 - Processes 32m resolution Arctic elevation data
 - Computes terrain derivatives: slope, aspect, curvature
@@ -67,14 +74,14 @@ The pipeline follows a sequential processing approach where each stage produces
 - Applies watermask clipping and GPU-accelerated convolutions
 - Aggregates terrain statistics per grid cell

-**AlphaEarth Embeddings (`alphaearth.py`)**
+**AlphaEarth Embeddings (`ingest/alphaearth.py`)**

 - Extracts 64-dimensional satellite image embeddings via Google Earth Engine
 - Uses foundation models to capture visual patterns
 - Partitions large grids using KMeans clustering
 - Temporal sampling across multiple years

-#### 4. Spatial Aggregation Framework (`aggregators.py`)
+#### 4. Spatial Aggregation Framework (`spatial/aggregators.py`)

 - **Core Capability**: Raster-to-vector aggregation engine
 - **Methods**:
@@ -86,7 +93,7 @@ The pipeline follows a sequential processing approach where each stage produces
  - Parallel processing with worker pools
  - GPU acceleration via CuPy/CuML where applicable

-#### 5. Dataset Assembly (`dataset.py`)
+#### 5. Dataset Assembly (`ml/dataset.py`)

 - **DatasetEnsemble Class**: Orchestrates multi-source data integration
 - **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
@@ -98,7 +105,7 @@ The pipeline follows a sequential processing approach where each stage produces
  - GPU-accelerated data loading (PyTorch/CuPy)
 - **Output**: Tabular feature matrices ready for scikit-learn API

-#### 6. Model Training (`training.py`)
+#### 6. Model Training (`ml/training.py`)

 - **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
@@ -112,7 +119,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - **Configuration**: TOML-based configuration with Cyclopts CLI
 - **Output**: Pickled models, CV results, feature importance

-#### 7. Inference (`inference.py`)
+#### 7. Inference (`ml/inference.py`)

 - Batch prediction pipeline for trained classifiers
 - GPU memory management with configurable batch sizes
@@ -170,7 +177,7 @@ scripts/05train.sh      # Model training

 - Dataclasses for typed configuration
 - TOML files for training hyperparameters
-- Environment-based path management (`paths.py`)
+- Environment-based path management (`utils/paths.py`)

 ## Data Storage Hierarchy

@@ -209,9 +216,9 @@ DATA_DIR/

 The architecture supports extension through:

-- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
-- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
-- **Alternative Targets**: Implement label extractor following DARTS pattern
-- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
+- **New Data Sources**: Implement feature extractor in `ingest/` following ERA5/ArcticDEM patterns
+- **Custom Aggregations**: Add methods to `_Aggregations` dataclass in `spatial/aggregators.py`
+- **Alternative Targets**: Implement label extractor in `ingest/` following DARTS pattern
+- **Alternative Models**: Extend training CLI in `ml/training.py` with new scikit-learn compatible estimators
 - **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
-- **Grid Systems**: Support additional DGGS via xdggs integration
+- **Grid Systems**: Support additional DGGS in `spatial/grids.py` via xdggs integration
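
The module moves documented in the hunks above can be summarized as a lookup table. The following is a minimal sketch, not code from this commit: `MODULE_MOVES` and `new_module_path` are hypothetical names, and the sketch assumes that the old modules sat directly under the top-level `entropice` package, which the diff does not confirm. Only modules named in the ARCHITECTURE.md changes are listed.

```python
# Hypothetical helper: maps each old flat module to the subpackage it now
# lives in, as read from the ARCHITECTURE.md diff (not from the source tree).
MODULE_MOVES = {
    "grids": "spatial",
    "aggregators": "spatial",
    "darts": "ingest",
    "era5": "ingest",
    "arcticdem": "ingest",
    "alphaearth": "ingest",
    "dataset": "ml",
    "training": "ml",
    "inference": "ml",
    "paths": "utils",
}

PACKAGE = "entropice"  # package name as it appears in the diff


def new_module_path(old_path: str) -> str:
    """Return the post-refactor dotted path for an old dotted module path.

    Assumption: old modules were importable as ``entropice.<module>``.
    Paths that do not match a moved module are returned unchanged.
    """
    parts = old_path.split(".")
    if len(parts) >= 2 and parts[0] == PACKAGE and parts[1] in MODULE_MOVES:
        return ".".join([PACKAGE, MODULE_MOVES[parts[1]], *parts[1:]])
    return old_path


if __name__ == "__main__":
    for old in ("entropice.grids", "entropice.era5", "entropice.dashboard"):
        print(f"{old} -> {new_module_path(old)}")
```

A table like this could drive a one-off search-and-replace over import statements in downstream scripts; the dashboard module is left untouched here because the commit title excludes it from the refactor.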