Update docs, instructions and format code
parent fca232da91
commit 4260b492ab
29 changed files with 987 additions and 467 deletions
254
.github/copilot-instructions.md
vendored
|
|
@@ -1,226 +1,70 @@
|
|||
# Entropice - GitHub Copilot Instructions
|
||||
# Entropice - Copilot Instructions
|
||||
|
||||
## Project Overview
|
||||
## Project Context
|
||||
|
||||
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
|
||||
This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.
|
||||
|
||||
For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
|
||||
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
|
||||
For project goals and setup, see [README.md](../README.md).
|
||||
## Code Style
|
||||
|
||||
## Core Technologies
|
||||
- Follow PEP 8 conventions with 120 character line length
|
||||
- Use type hints for all function signatures
|
||||
- Use google-style docstrings for public functions
|
||||
- Keep functions focused and modular
|
||||
- Use `ruff` for linting and formatting, `ty` for type checking
|
||||
|
||||
- **Python**: 3.13 (strict version requirement)
|
||||
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
|
||||
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
|
||||
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
|
||||
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
|
||||
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
|
||||
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
|
||||
## Technology Stack
|
||||
|
||||
## Code Style Guidelines
|
||||
- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
|
||||
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
|
||||
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
|
||||
- **GPU**: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
|
||||
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
|
||||
- **CLI**: Cyclopts with dataclass-based configurations
|
||||
|
||||
### Python Standards
|
||||
## Execution Guidelines
|
||||
|
||||
- Follow **PEP 8** conventions
|
||||
- Use **type hints** for all function signatures
|
||||
- Write **numpy-style docstrings** for public functions
|
||||
- Keep functions **focused and modular**
|
||||
- Prefer descriptive variable names over abbreviations
|
||||
- Always use `pixi run` to execute Python commands and scripts
|
||||
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`
|
||||
|
||||
### Geospatial Best Practices
|
||||
## Geospatial Best Practices
|
||||
|
||||
- Use **EPSG:3413** (Arctic Stereographic) for computations
|
||||
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
|
||||
- Store gridded data using **xarray with XDGGS** indexing
|
||||
- Store tabular data as **Parquet**, array data as **Zarr**
|
||||
- Leverage **Dask** for lazy evaluation of large datasets
|
||||
- Use **GeoPandas** for vector operations
|
||||
- Handle **antimeridian** correctly for polar regions
|
||||
- Use EPSG:3413 (Arctic Stereographic) for computations
|
||||
- Use EPSG:4326 (WGS84) for visualization and compatibility
|
||||
- Store gridded data as XDGGS Xarray datasets (Zarr format)
|
||||
- Store tabular data as GeoParquet
|
||||
- Handle antimeridian issues in polar regions
|
||||
- Leverage Xarray/Dask for lazy evaluation and chunked processing
|
||||
|
||||
### Data Pipeline Conventions
|
||||
## Architecture Patterns
|
||||
|
||||
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
|
||||
- Each pipeline stage should produce **reproducible intermediate outputs**
|
||||
- Use `src/entropice/utils/paths.py` for consistent path management
|
||||
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
|
||||
- Modular CLI design: each module exposes standalone Cyclopts CLI
|
||||
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
|
||||
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
|
||||
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization
|
||||
|
||||
### Storage Hierarchy
|
||||
## Data Storage Hierarchy
|
||||
|
||||
All data follows this structure:
|
||||
```
|
||||
DATA_DIR/
|
||||
├── grids/ # H3/HEALPix tessellations (GeoParquet)
|
||||
├── darts/ # RTS labels (GeoParquet)
|
||||
├── era5/ # Climate data (Zarr)
|
||||
├── arcticdem/ # Terrain data (Icechunk Zarr)
|
||||
├── alphaearth/ # Satellite embeddings (Zarr)
|
||||
├── datasets/ # L2 XDGGS datasets (Zarr)
|
||||
├── training-results/ # Models, CV results, predictions
|
||||
└── watermask/ # Ocean mask (GeoParquet)
|
||||
├── grids/ # H3/HEALPix tessellations (GeoParquet)
|
||||
├── darts/ # RTS labels (GeoParquet)
|
||||
├── era5/ # Climate data (Zarr)
|
||||
├── arcticdem/ # Terrain data (Icechunk Zarr)
|
||||
├── alphaearth/ # Satellite embeddings (Zarr)
|
||||
├── datasets/ # L2 XDGGS datasets (Zarr)
|
||||
└── training-results/ # Models, CV results, predictions
|
||||
```
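
The layout above could be resolved in code roughly as follows. This is a minimal sketch, not the contents of `src/entropice/utils/paths.py`: the helper names `data_dir` and `subdir` are hypothetical, and only the `FAST_DATA_DIR` variable, its `./data` default, and the directory names come from the text.

```python
import os
from pathlib import Path

# Directory names copied from the tree above; the helpers below are hypothetical.
SUBDIRS = (
    "grids",
    "darts",
    "era5",
    "arcticdem",
    "alphaearth",
    "datasets",
    "training-results",
)


def data_dir() -> Path:
    """Return the data root, honoring FAST_DATA_DIR and defaulting to ./data."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data")).resolve()


def subdir(name: str) -> Path:
    """Return the path of one known subdirectory under the data root."""
    if name not in SUBDIRS:
        raise ValueError(f"unknown data subdirectory: {name!r}")
    return data_dir() / name
```

Reading the environment variable on every call, rather than once at import time, keeps the helpers easy to redirect in tests.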
|
||||
|
||||
## Module Organization
|
||||
## Key Modules
|
||||
|
||||
### Core Modules (`src/entropice/`)
|
||||
- `entropice.spatial`: Grid generation and raster-to-vector aggregation
|
||||
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
|
||||
- `entropice.ml`: Dataset assembly, training, inference
|
||||
- `entropice.dashboard`: Streamlit visualization app
|
||||
- `entropice.utils`: Paths, codecs, types
|
||||
|
||||
The codebase is organized into four main packages:
|
||||
## Testing & Notebooks
|
||||
|
||||
- **`entropice.ingest`**: Data ingestion from external sources
|
||||
- **`entropice.spatial`**: Spatial operations and grid management
|
||||
- **`entropice.ml`**: Machine learning workflows
|
||||
- **`entropice.utils`**: Common utilities
|
||||
|
||||
#### Data Ingestion (`src/entropice/ingest/`)
|
||||
|
||||
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
|
||||
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
|
||||
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
|
||||
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
|
||||
|
||||
#### Spatial Operations (`src/entropice/spatial/`)
|
||||
|
||||
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
|
||||
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
|
||||
- **`watermask.py`**: Ocean masking utilities
|
||||
- **`xvec.py`**: Extended vector operations for xarray
|
||||
|
||||
#### Machine Learning (`src/entropice/ml/`)
|
||||
|
||||
- **`dataset.py`**: Multi-source data integration and feature engineering
|
||||
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
|
||||
- **`inference.py`**: Batch prediction pipeline for trained models
|
||||
|
||||
#### Utilities (`src/entropice/utils/`)
|
||||
|
||||
- **`paths.py`**: Centralized path management
|
||||
- **`codecs.py`**: Custom codecs for data serialization
|
||||
|
||||
### Dashboard (`src/entropice/dashboard/`)
|
||||
|
||||
- Streamlit-based interactive visualization
|
||||
- Modular pages: overview, training data, analysis, model state, inference
|
||||
- Bokeh-based geospatial plotting utilities
|
||||
- Run with: `pixi run dashboard`
|
||||
|
||||
### Scripts (`scripts/`)
|
||||
|
||||
- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
|
||||
- Run entire pipeline for multiple grid configurations
|
||||
- Each script uses CLIs from core modules
|
||||
|
||||
### Notebooks (`notebooks/`)
|
||||
|
||||
- Exploratory analysis and validation
|
||||
- **NOT committed to git**
|
||||
- Keep production code in `src/entropice/`
|
||||
|
||||
## Development Workflow
|
||||
|
||||
### Setup
|
||||
|
||||
```bash
|
||||
pixi install # NOT pip install or conda install
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
|
||||
pixi run pytest
|
||||
```
|
||||
|
||||
### Running Python Commands
|
||||
|
||||
Always use `pixi run` to execute Python commands to use the correct environment:
|
||||
|
||||
```bash
|
||||
pixi run python script.py
|
||||
pixi run python -c "import entropice"
|
||||
```
|
||||
|
||||
### Common Tasks
|
||||
|
||||
**Important**: Always use `pixi run` prefix for Python commands to ensure correct environment.
|
||||
|
||||
- **Generate grids**: Use `pixi run create-grid` or `spatial/grids.py` CLI
|
||||
- **Process labels**: Use `pixi run darts` or `ingest/darts.py` CLI
|
||||
- **Train models**: Use `pixi run train` with TOML config or `ml/training.py` CLI
|
||||
- **Run inference**: Use `ml/inference.py` CLI
|
||||
- **View results**: `pixi run dashboard`
|
||||
|
||||
## Key Design Patterns
|
||||
|
||||
### 1. XDGGS Indexing
|
||||
|
||||
All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.
|
||||
|
||||
### 2. Lazy Evaluation
|
||||
|
||||
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
|
||||
|
||||
### 3. GPU Acceleration
|
||||
|
||||
Prefer GPU-accelerated operations:
|
||||
- CuPy for array operations
|
||||
- cuML for Random Forest and KNN
|
||||
- XGBoost GPU training
|
||||
- PyTorch tensors when applicable
|
||||
|
||||
### 4. Configuration as Code
|
||||
|
||||
- Use dataclasses for typed configuration
|
||||
- TOML files for training hyperparameters
|
||||
- Cyclopts for CLI argument parsing
|
||||
|
||||
## Data Sources
|
||||
|
||||
- **DARTS v2**: RTS labels (year, area, count, density)
|
||||
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
|
||||
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
|
||||
- **AlphaEarth**: 64-dimensional satellite embeddings
|
||||
- **Watermask**: Ocean exclusion layer
|
||||
|
||||
## Model Support
|
||||
|
||||
Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
|
||||
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
|
||||
|
||||
Training features:
|
||||
- Randomized hyperparameter search
|
||||
- K-Fold cross-validation
|
||||
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
|
||||
|
||||
## Extension Points
|
||||
|
||||
To extend Entropice:
|
||||
|
||||
- **New data source**: Follow patterns in `ingest/era5.py` or `ingest/arcticdem.py`
|
||||
- **Custom aggregations**: Add to `_Aggregations` dataclass in `spatial/aggregators.py`
|
||||
- **Alternative labels**: Implement extractor following `ingest/darts.py` pattern
|
||||
- **New models**: Add scikit-learn compatible estimators to `ml/training.py`
|
||||
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
|
||||
|
||||
## Important Notes
|
||||
|
||||
- **Always use `pixi run` prefix** for Python commands (not plain `python`)
|
||||
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
|
||||
- Arctic years run **October 1 to September 30** (not calendar years)
|
||||
- Handle **antimeridian crossing** in polar regions
|
||||
- Use **batch processing** for GPU memory management
|
||||
- Notebooks are for exploration only - **keep production code in `src/`**
|
||||
- Always use **absolute paths** or paths from `utils/paths.py`
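
The October-to-September convention noted above can be sketched as a small date mapping. The function name `arctic_year` is hypothetical, and labeling each Arctic year by the calendar year in which it ends is an assumption; the source states only the start and end dates.

```python
from datetime import date


def arctic_year(day: date) -> int:
    """Map a date to its Arctic year (October 1 to September 30).

    Assumption: the year is labeled by the calendar year in which it ends,
    so 2020-10-01 through 2021-09-30 all belong to Arctic year 2021.
    """
    return day.year + 1 if day.month >= 10 else day.year
```

If the project labels Arctic years by their starting calendar year instead, the `+ 1` would move to the other branch as `- 1`.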
|
||||
|
||||
## Common Issues
|
||||
|
||||
- **Memory**: Use batch processing and Dask chunking for large datasets
|
||||
- **GPU OOM**: Reduce batch size in inference or training
|
||||
- **Antimeridian**: Use proper handling in `spatial/aggregators.py` for polar grids
|
||||
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
|
||||
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
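
To make the antimeridian item concrete, the sketch below shows one common heuristic for polygon rings given in EPSG:4326 longitudes. It is not the handling implemented in `spatial/aggregators.py`; both function names are hypothetical, and the heuristic assumes cells much narrower than 180 degrees that do not contain a pole.

```python
def crosses_antimeridian(lons: list[float]) -> bool:
    """Heuristic: some edge of the ring jumps by more than 180 degrees of longitude."""
    closed = lons + lons[:1]  # close the ring so the last edge is also checked
    return any(abs(b - a) > 180.0 for a, b in zip(closed, closed[1:]))


def unwrap_longitudes(lons: list[float]) -> list[float]:
    """Shift negative longitudes by +360 so a crossing ring becomes contiguous."""
    if not crosses_antimeridian(lons):
        return list(lons)
    return [lon + 360.0 if lon < 0 else lon for lon in lons]
```

After unwrapping, a cell straddling 180° has longitudes in a continuous range such as 179 to 181, which avoids a polygon that appears to wrap around the globe.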
|
||||
|
||||
## References
|
||||
|
||||
For more details, consult:
|
||||
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
|
||||
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
|
||||
- [README.md](../README.md) - Project goals and setup instructions
|
||||
- Production code belongs in `src/entropice/`, not notebooks
|
||||
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
|
||||
- Use `pytest` for testing geospatial correctness and data integrity
|
||||
|
|
|
|||