Update docs, instructions and format code

Tobias Hölzer 2026-01-04 17:19:02 +01:00
parent fca232da91
commit 4260b492ab
29 changed files with 987 additions and 467 deletions


@ -27,12 +27,20 @@ The pipeline follows a sequential processing approach where each stage produces
### System Components
The codebase is organized into four main packages:
The project as a whole is organized into four directories:
- **`src/entropice/`**: The main processing codebase
- **`scripts/`**: Bash scripts and wrappers for the data processing pipeline scripts
- **`notebooks/`**: Exploratory analysis and validation notebooks (not committed to git)
- **`tests/`**: Unit tests and manual validation scripts
The processing codebase is organized into five main packages:
- **`entropice.ingest`**: Data ingestion from external sources (DARTS, ERA5, ArcticDEM, AlphaEarth)
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows (dataset, training, inference)
- **`entropice.utils`**: Common utilities (paths, codecs)
- **`entropice.dashboard`**: Streamlit dashboard for interactive visualization
#### 1. Spatial Grid System (`spatial/grids.py`)
@ -179,6 +187,14 @@ scripts/05train.sh # Model training
- TOML files for training hyperparameters
- Environment-based path management (`utils/paths.py`)
### 6. Geospatial Best Practices
- Use **xarray** with XDGGS for gridded data storage
- Store intermediate results as **Parquet** (tabular) or **Zarr** (arrays)
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Use EPSG:3413 (Arctic Stereographic) as the coordinate reference system (CRS) for all computations on the data, and EPSG:4326 (WGS84) for data visualization and for compatibility with libraries that require it
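
The CRS convention above can be sketched as a pair of small helpers. This is a minimal illustration, not code from the repository: it assumes `pyproj` is installed (it ships as a GeoPandas dependency), and the names `CRS_COMPUTE`, `CRS_DISPLAY`, `to_compute_crs`, and `to_display_crs` are hypothetical.

```python
from pyproj import Transformer

# Hypothetical constants for illustration; the project may define these elsewhere.
CRS_COMPUTE = "EPSG:3413"  # polar stereographic, coordinates in metres
CRS_DISPLAY = "EPSG:4326"  # WGS84, coordinates in degrees

# always_xy=True pins the axis order to (lon, lat) and (x, y),
# regardless of the axis order declared by each CRS.
_to_compute = Transformer.from_crs(CRS_DISPLAY, CRS_COMPUTE, always_xy=True)
_to_display = Transformer.from_crs(CRS_COMPUTE, CRS_DISPLAY, always_xy=True)


def to_compute_crs(lon, lat):
    """Project lon/lat in degrees to x/y in metres for computation."""
    return _to_compute.transform(lon, lat)


def to_display_crs(x, y):
    """Convert projected x/y in metres back to lon/lat in degrees."""
    return _to_display.transform(x, y)


# Round trip for a point on the projection's central meridian (45° W).
x, y = to_compute_crs(-45.0, 70.0)
lon, lat = to_display_crs(x, y)
print(f"projected: x={x:.1f} m, y={y:.1f} m")
print(f"round trip: lon={lon:.6f}, lat={lat:.6f}")
```

Distances and areas should be computed on the projected coordinates, since degrees of longitude shrink toward the pole; converting back to EPSG:4326 is then only needed at the plotting or export boundary.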
## Data Storage Hierarchy
```sh