Refactor non-dashboard modules

Tobias Hölzer 2025-12-28 20:38:51 +01:00
parent 907e907856
commit 45bc61e49e
22 changed files with 122 additions and 186 deletions

@ -42,7 +42,7 @@ For project goals and setup, see [README.md](../README.md).
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Use `src/entropice/utils/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
### Storage Hierarchy
@ -64,16 +64,37 @@ DATA_DIR/
### Core Modules (`src/entropice/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
The codebase is organized into four main packages:
- **`entropice.ingest`**: Data ingestion from external sources
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows
- **`entropice.utils`**: Common utilities
#### Data Ingestion (`src/entropice/ingest/`)
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
#### Spatial Operations (`src/entropice/spatial/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`watermask.py`**: Ocean masking utilities
- **`xvec.py`**: Extended vector operations for xarray
#### Machine Learning (`src/entropice/ml/`)
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
#### Utilities (`src/entropice/utils/`)
- **`paths.py`**: Centralized path management
- **`codecs.py`**: Custom codecs for data serialization
### Dashboard (`src/entropice/dashboard/`)
@ -110,10 +131,10 @@ pixi run pytest
### Common Tasks
- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **Generate grids**: Use `spatial/grids.py` CLI
- **Process labels**: Use `ingest/darts.py` CLI
- **Train models**: Use `ml/training.py` CLI with TOML config
- **Run inference**: Use `ml/inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns
@ -162,10 +183,10 @@ Training features:
To extend Entropice:
- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **New data source**: Follow patterns in `ingest/era5.py` or `ingest/arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative labels**: Implement extractor following `ingest/darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `ml/training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
## Important Notes
@ -175,13 +196,13 @@ To extend Entropice:
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
- Always use **absolute paths** or paths from `utils/paths.py`
## Common Issues
- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Antimeridian**: Use proper handling in `spatial/aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
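
The temporal-alignment note can be illustrated with a small sketch. The helper name `arctic_year` is hypothetical and is not taken from `ingest/era5.py`. Labeling each Oct 1 - Sep 30 window by the calendar year in which it ends is also an assumption; the actual module may label by the starting year instead.

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Return the label of the Oct 1 - Sep 30 window that contains `d`.

    Assumption: a window is labeled by the calendar year in which it ends,
    so Oct 2023 through Sep 2024 is Arctic year 2024.
    """
    # October, November and December belong to the window ending next year.
    return d.year + 1 if d.month >= 10 else d.year


if __name__ == "__main__":
    for d in (date(2023, 9, 30), date(2023, 10, 1), date(2024, 9, 30)):
        print(d.isoformat(), "->", arctic_year(d))
```

If the pipeline labels windows by their starting year, the same logic applies with `d.year` and `d.year - 1` swapped in.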