Refactor non-dashboard modules
parent 907e907856
commit 45bc61e49e
22 changed files with 122 additions and 186 deletions
@@ -27,7 +27,14 @@ The pipeline follows a sequential processing approach where each stage produces
 
 ### System Components
 
-#### 1. Spatial Grid System (`grids.py`)
+The codebase is organized into four main packages:
+
+- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
+- **`entropice.spatial`**: Spatial operations and grid management
+- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
+- **`entropice.utils`**: Common utilities (paths, codecs)
+
+#### 1. Spatial Grid System (`spatial/grids.py`)
 
 - **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
 - **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
@@ -39,7 +46,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - Provides spatial indexing for efficient data aggregation
 - **Output**: GeoDataFrames with cell IDs, geometries, and land areas
 
-#### 2. Label Management (`darts.py`)
+#### 2. Label Management (`ingest/darts.py`)
 
 - **Purpose**: Extracts RTS labels from DARTS v2 dataset
 - **Processing**:
@@ -51,7 +58,7 @@ The pipeline follows a sequential processing approach where each stage produces
 
 #### 3. Feature Extractors
 
-**ERA5 Climate Data (`era5.py`)**
+**ERA5 Climate Data (`ingest/era5.py`)**
 
 - Downloads hourly climate variables from Copernicus Climate Data Store
 - Computes daily aggregates: temperature extrema, precipitation, snow metrics
@@ -59,7 +66,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - Temporal aggregations: yearly, seasonal, shoulder seasons
 - Uses Arctic-aligned years (October 1 - September 30)
 
-**ArcticDEM Terrain (`arcticdem.py`)**
+**ArcticDEM Terrain (`ingest/arcticdem.py`)**
 
 - Processes 32m resolution Arctic elevation data
 - Computes terrain derivatives: slope, aspect, curvature
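Note on the hunk above: the ERA5 context lines mention Arctic-aligned years running October 1 to September 30. A minimal sketch of how a date could be assigned to such a year is given here. The function names and the convention of labeling a year by the calendar year in which it ends are assumptions for illustration; the diff does not show how `ingest/era5.py` labels them.

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Return the label of the Arctic-aligned year that contains `d`.

    Assumption: a year runs from October 1 to September 30 and is labeled
    by the calendar year in which it ends (hypothetical convention).
    """
    # October, November and December belong to the year ending next September.
    return d.year + 1 if d.month >= 10 else d.year


def arctic_year_bounds(label: int) -> tuple[date, date]:
    """Return the inclusive first and last day of the year with this label."""
    return date(label - 1, 10, 1), date(label, 9, 30)
```

Under this convention, `arctic_year(date(2020, 10, 1))` and `arctic_year(date(2021, 9, 30))` fall into the same year, which keeps a full winter season inside one aggregation window.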
@@ -67,14 +74,14 @@ The pipeline follows a sequential processing approach where each stage produces
 - Applies watermask clipping and GPU-accelerated convolutions
 - Aggregates terrain statistics per grid cell
 
-**AlphaEarth Embeddings (`alphaearth.py`)**
+**AlphaEarth Embeddings (`ingest/alphaearth.py`)**
 
 - Extracts 64-dimensional satellite image embeddings via Google Earth Engine
 - Uses foundation models to capture visual patterns
 - Partitions large grids using KMeans clustering
 - Temporal sampling across multiple years
 
-#### 4. Spatial Aggregation Framework (`aggregators.py`)
+#### 4. Spatial Aggregation Framework (`spatial/aggregators.py`)
 
 - **Core Capability**: Raster-to-vector aggregation engine
 - **Methods**:
@@ -86,7 +93,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - Parallel processing with worker pools
 - GPU acceleration via CuPy/CuML where applicable
 
-#### 5. Dataset Assembly (`dataset.py`)
+#### 5. Dataset Assembly (`ml/dataset.py`)
 
 - **DatasetEnsemble Class**: Orchestrates multi-source data integration
 - **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
@@ -98,7 +105,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - GPU-accelerated data loading (PyTorch/CuPy)
 - **Output**: Tabular feature matrices ready for scikit-learn API
 
-#### 6. Model Training (`training.py`)
+#### 6. Model Training (`ml/training.py`)
 
 - **Supported Models**:
 - **eSPA**: Entropy-optimal probabilistic classifier (primary)
@@ -112,7 +119,7 @@ The pipeline follows a sequential processing approach where each stage produces
 - **Configuration**: TOML-based configuration with Cyclopts CLI
 - **Output**: Pickled models, CV results, feature importance
 
-#### 7. Inference (`inference.py`)
+#### 7. Inference (`ml/inference.py`)
 
 - Batch prediction pipeline for trained classifiers
 - GPU memory management with configurable batch sizes
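Note on the hunk above: the inference section describes batch prediction with configurable batch sizes. The general idea can be sketched as follows. This is an illustration only; `iter_batches` and `predict_in_batches` are hypothetical names, and the `predict` callable stands in for whatever trained classifier `ml/inference.py` actually uses.

```python
from collections.abc import Callable, Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(
    predict: Callable[[Sequence[T]], Sequence[R]],
    rows: Sequence[T],
    batch_size: int,
) -> list[R]:
    """Apply `predict` batch by batch and concatenate the results in order.

    Only one batch is handed to `predict` at a time, which bounds the peak
    memory the model needs for its inputs.
    """
    results: list[R] = []
    for batch in iter_batches(rows, batch_size):
        results.extend(predict(batch))
    return results
```

A smaller `batch_size` lowers peak memory at the cost of more calls into the model, which is the usual trade-off behind making the value configurable.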
@@ -170,7 +177,7 @@ scripts/05train.sh # Model training
 
 - Dataclasses for typed configuration
 - TOML files for training hyperparameters
-- Environment-based path management (`paths.py`)
+- Environment-based path management (`utils/paths.py`)
 
 ## Data Storage Hierarchy
 
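Note on the hunk above: "environment-based path management" usually means resolving a data root from an environment variable and deriving all other paths from it. A minimal sketch of that pattern follows. The variable name `DATA_DIR` is borrowed from the storage tree named in the next hunk header, but treating it as an environment variable, the fallback value, and the `grids/` sub-directory layout are all assumptions, not the contents of `utils/paths.py`.

```python
import os
from pathlib import Path


def get_data_dir(env_var: str = "DATA_DIR", default: str = "data") -> Path:
    """Resolve the data root from the environment, falling back to `default`.

    Assumption: the variable name and the fallback are hypothetical.
    """
    return Path(os.environ.get(env_var, default)).expanduser()


def grid_dir(grid: str, level: int) -> Path:
    """Build a per-grid directory below the data root (hypothetical layout)."""
    return get_data_dir() / "grids" / f"{grid}_{level}"
```

Resolving the root at call time rather than at import time means a changed environment variable takes effect without reloading the module.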
@@ -209,9 +216,9 @@ DATA_DIR/
 
 The architecture supports extension through:
 
-- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
-- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
-- **Alternative Targets**: Implement label extractor following DARTS pattern
-- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
+- **New Data Sources**: Implement feature extractor in `ingest/` following ERA5/ArcticDEM patterns
+- **Custom Aggregations**: Add methods to `_Aggregations` dataclass in `spatial/aggregators.py`
+- **Alternative Targets**: Implement label extractor in `ingest/` following DARTS pattern
+- **Alternative Models**: Extend training CLI in `ml/training.py` with new scikit-learn compatible estimators
 - **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
-- **Grid Systems**: Support additional DGGS via xdggs integration
+- **Grid Systems**: Support additional DGGS in `spatial/grids.py` via xdggs integration
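Note on the diff as a whole: the documented file moves imply that dotted import paths change from a single level to a package-qualified form. The mapping below is taken from the renamed paths shown in this diff. The rewrite helper is a hypothetical sketch, and it assumes the old modules sat directly under the `entropice` package, which the diff suggests but does not state. It handles dotted references such as `entropice.grids` only, not `from entropice import grids`.

```python
import re

# Old module name -> new package-qualified name, as documented in the diff.
MODULE_MOVES = {
    "grids": "spatial.grids",
    "aggregators": "spatial.aggregators",
    "darts": "ingest.darts",
    "era5": "ingest.era5",
    "arcticdem": "ingest.arcticdem",
    "alphaearth": "ingest.alphaearth",
    "dataset": "ml.dataset",
    "training": "ml.training",
    "inference": "ml.inference",
    "paths": "utils.paths",
}

# Match `entropice.<old_module>` as a whole dotted component.
_OLD_IMPORT = re.compile(
    r"\bentropice\.(" + "|".join(map(re.escape, MODULE_MOVES)) + r")\b"
)


def rewrite_imports(source: str) -> str:
    """Rewrite old flat module references to the new package layout."""
    return _OLD_IMPORT.sub(
        lambda match: "entropice." + MODULE_MOVES[match.group(1)], source
    )
```

Because the new paths insert a package name directly after `entropice.`, already-migrated references no longer match the pattern, so running the helper twice leaves the text unchanged.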