Update docs, instructions and format code

Tobias Hölzer 2026-01-04 17:19:02 +01:00
parent fca232da91
commit 4260b492ab
29 changed files with 987 additions and 467 deletions

@ -6,7 +6,6 @@ Thank you for your interest in contributing to Entropice! This document provides
### Prerequisites
- Python 3.13
- CUDA 12 compatible GPU (for full functionality)
- [Pixi package manager](https://pixi.sh/)
@ -16,55 +15,36 @@ Thank you for your interest in contributing to Entropice! This document provides
pixi install
```
This will set up the complete environment including RAPIDS, PyTorch, and all geospatial dependencies.
This will set up the complete environment including Python, RAPIDS, PyTorch, and all geospatial dependencies.
## Development Workflow
**Important**: Always use `pixi run` to execute Python commands and scripts to ensure you're using the correct environment with all dependencies.
> Read the [Architecture Guide](ARCHITECTURE.md) for details on the code organisation and key modules.
### Code Organization
**Important**: Always use `pixi run` to execute Python commands and scripts to ensure you're using the correct environment with all dependencies:
- **`src/entropice/ingest/`**: Data ingestion modules (darts, era5, arcticdem, alphaearth)
- **`src/entropice/spatial/`**: Spatial operations (grids, aggregators, watermask, xvec)
- **`src/entropice/ml/`**: Machine learning components (dataset, training, inference)
- **`src/entropice/utils/`**: Utilities (paths, codecs)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
- **`tests/`**: Unit tests
```bash
pixi run python script.py
pixi run python -c "import entropice"
```
### Key Modules
- `spatial/grids.py`: H3/HEALPix spatial grid systems
- `ingest/darts.py`, `ingest/era5.py`, `ingest/arcticdem.py`, `ingest/alphaearth.py`: Data source processors
- `ml/dataset.py`: Dataset assembly and feature engineering
- `ml/training.py`: Model training with eSPA, XGBoost, Random Forest, KNN
- `ml/inference.py`: Prediction generation
- `utils/paths.py`: Centralized path management
## Coding Standards
### Python Style
### Python Style and Formatting
- Follow PEP 8 conventions
- Use type hints for function signatures
- Prefer numpy-style docstrings for public functions
- Prefer google-style docstrings for public functions
- Keep functions focused and modular
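
As an illustration of the style points above (type hints, a Google-style docstring, one focused task per function), here is a minimal sketch. The function `cell_area_km2` is hypothetical and not part of Entropice; it only shows the layout of the `Args`, `Returns`, and `Raises` sections.

```python
import math


def cell_area_km2(edge_length_km: float, n_sides: int = 6) -> float:
    """Compute the area of a regular polygon with the given edge length.

    Hypothetical helper used only to illustrate the docstring layout.

    Args:
        edge_length_km: Length of one polygon edge in kilometres.
        n_sides: Number of polygon sides (6 for a hexagon).

    Returns:
        The polygon area in square kilometres.

    Raises:
        ValueError: If ``edge_length_km`` is negative or ``n_sides`` is below 3.
    """
    if edge_length_km < 0 or n_sides < 3:
        raise ValueError("edge_length_km must be >= 0 and n_sides must be >= 3")
    # Area of a regular n-gon with side s: n * s^2 / (4 * tan(pi / n))
    return n_sides * edge_length_km**2 / (4 * math.tan(math.pi / n_sides))
```
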
### Geospatial Best Practices
`ty` and `ruff` are used for type checking, linting, and formatting.
Make sure to check for warnings from both of them:
- Use **xarray** with XDGGS for gridded data storage
- Store intermediate results as **Parquet** (tabular) or **Zarr** (arrays)
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Use EPSG:3413 (Arctic Stereographic) as the coordinate reference system (CRS) for any computation on the data, and EPSG:4326 (WGS84) for data visualization and compatibility with some libraries
```sh
pixi run ty check # For type checks
pixi run ruff check # For linting
pixi run ruff format # For formatting
```
### Data Pipeline
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `utils/paths.py` for consistent path management
To check a single file, append its path to the command, e.g. `pixi run ty check src/entropice/dashboard/app.py`
## Testing
@ -74,13 +54,6 @@ Run tests for specific modules:
pixi run pytest
```
When running Python scripts or commands, always use `pixi run`:
```bash
pixi run python script.py
pixi run python -c "import entropice"
```
When adding features, include tests that verify:
- Correct handling of geospatial coordinates and projections
@ -100,15 +73,8 @@ When adding features, include tests that verify:
### Commit Messages
- Use present tense: "Add feature" not "Added feature"
- Reference issues when applicable: "Fix #123: Correct grid aggregation"
- Keep first line under 72 characters
## Working with Data
### Local Development
- Set the `FAST_DATA_DIR` environment variable to choose the data directory (default: `./data`)
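
A minimal sketch of how such an environment-variable lookup with a `./data` fallback could work. The diff does not show how Entropice's `utils/paths.py` actually reads `FAST_DATA_DIR`, so the helper name `resolve_data_dir` and its behaviour are assumptions made for illustration.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the data directory from FAST_DATA_DIR, defaulting to ./data.

    Hypothetical helper; the real lookup in Entropice may differ.

    Args:
        env: Mapping to read the variable from. Defaults to ``os.environ``.

    Returns:
        The configured data directory as a ``Path``.
    """
    source = os.environ if env is None else env
    return Path(source.get("FAST_DATA_DIR", "./data"))
```

Passing the mapping explicitly keeps the function easy to test without modifying the process environment.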
### Notebooks
- Notebooks in `notebooks/` are for exploration and validation; they are not committed to git