Add some docs for copilot

Tobias Hölzer 2025-12-28 20:11:11 +01:00
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions

.github/agents/Dashboard.agent.md (new file)

@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---
# Dashboard Development Agent
You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.
## Your Responsibilities
### What You Should Do
1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
- Streamlit: https://docs.streamlit.io/
- Plotly: https://plotly.com/python/
- PyDeck: https://deckgl.readthedocs.io/
- Deck.gl: https://deck.gl/
- Matplotlib: https://matplotlib.org/
- Seaborn: https://seaborn.pydata.org/
- Xarray: https://docs.xarray.dev/
- GeoPandas: https://geopandas.org/
- Pandas: https://pandas.pydata.org/pandas-docs/
- NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures—but **NEVER edit them**
### What You Should NOT Do
1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files
### Boundaries
If you identify that a dashboard improvement requires changes to:
- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)
**Stop immediately** and inform the user:
```
⚠️ This dashboard feature requires changes to the data pipeline/training code.
Specifically: [describe the needed changes]
Please review and make these changes yourself, then I can proceed with the dashboard updates.
```
## Dashboard Structure
The dashboard is located in `src/entropice/dashboard/` with the following structure:
```
dashboard/
├── app.py # Main Streamlit app with navigation
├── overview_page.py # Overview of training results
├── training_data_page.py # Training data visualizations
├── training_analysis_page.py # CV results and hyperparameter analysis
├── model_state_page.py # Feature importance and model state
├── inference_page.py # Spatial prediction visualizations
├── plots/ # Reusable plotting utilities
│ ├── colors.py # Color schemes
│ ├── hyperparameter_analysis.py
│ ├── inference.py
│ ├── model_state.py
│ ├── source_data.py
│ └── training_data.py
└── utils/ # Data loading and processing
├── data.py
└── training.py
```
## Key Technologies
- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)
## Critical Code Standards
### Streamlit Best Practices
**❌ INCORRECT** (deprecated):
```python
st.plotly_chart(fig, use_container_width=True)
```
**✅ CORRECT** (current API):
```python
st.plotly_chart(fig, width='stretch')
```
**Common width values**:
- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)
This applies to:
- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`
### Data Structure Patterns
When working with Entropice data:
1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities
### Visualization Guidelines
1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates
## Workflow
1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
## Example Interactions
### ✅ Good Requests (Within Scope)
- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"
### ⚠️ Boundary Cases (Requires User Approval)
User: "Add a new climate variable to the dashboard"
Agent Response:
```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.
Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```
## Progress Reporting
For complex dashboard development tasks:
1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress
## Remember
- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user
You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!

.github/copilot-instructions.md (new file)

@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).
## Core Technologies
- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
### Geospatial Best Practices
- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
### Data Pipeline Conventions
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
### Storage Hierarchy
All data follows this structure:
```
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions
└── watermask/ # Ocean mask (GeoParquet)
```
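The hierarchy above pairs naturally with centralized path management. A minimal sketch of what `paths.py`-style helpers could look like, honoring the `FAST_DATA_DIR` environment variable — the helper names and the `.parquet` suffix are illustrative, not the module's actual API:

```python
import os
from pathlib import Path

# FAST_DATA_DIR controls the data directory location, defaulting to ./data.
DATA_DIR = Path(os.environ.get("FAST_DATA_DIR", "./data"))

# Hypothetical constants mirroring the storage hierarchy; the real
# paths.py may expose different names.
GRIDS_DIR = DATA_DIR / "grids"
ERA5_DIR = DATA_DIR / "era5"
TRAINING_RESULTS_DIR = DATA_DIR / "training-results"


def grid_path(name: str) -> Path:
    """Path to a grid tessellation stored as GeoParquet."""
    return GRIDS_DIR / f"{name}.parquet"
```

Routing every file access through helpers like these keeps pipeline stages reproducible when the data directory moves.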
## Module Organization
### Core Modules (`src/entropice/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management
### Dashboard (`src/entropice/dashboard/`)
- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)
- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`
## Development Workflow
### Setup
```bash
pixi install # NOT pip install or conda install
```
### Running Tests
```bash
pixi run pytest
```
### Common Tasks
- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns
### 1. XDGGS Indexing
All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.
### 2. Lazy Evaluation
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer
## Model Support
Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
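The evaluation setup can be sketched with scikit-learn's `cross_validate`, which accepts several scorers at once. The dataset and estimator below are stand-ins (a tiny synthetic classification problem and a decision tree), not the project's actual models or data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for the real gridded feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Multi-metric evaluation across K folds.
scoring = ["accuracy", "f1_macro", "jaccard_macro", "precision_macro", "recall_macro"]
results = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring=scoring
)

# cross_validate returns one array of per-fold scores per metric.
mean_scores = {m: results[f"test_{m}"].mean() for m in scoring}
```

Any scikit-learn-compatible estimator (including the eSPA wrapper) can slot into the same harness.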
## Extension Points
To extend Entropice:
- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
## Important Notes
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
## Common Issues
- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions

.github/python.instructions.md (new file)

@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---
# Python Coding Conventions
## Python Instructions
- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Prefer builtin generics for annotations (e.g., `list[str]`, `dict[str, int]`); on modern Python, reach for the `typing` module only for constructs without builtin equivalents (e.g., `Protocol`, `TypedDict`).
- Break down complex functions into smaller, more manageable functions.
## General Instructions
- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.
## Code Style and Formatting
- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces for each level of indentation).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` keyword.
- Use blank lines to separate functions, classes, and code blocks where appropriate.
## Edge Cases and Testing
- Always include test cases for critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.
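As an illustration of these points, a small (invented) function with explicit edge-case handling, alongside unit tests whose docstrings document the expected behavior in each case:

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values linearly to [0, 1]; empty or constant input yields zeros."""
    if not values:  # edge case: empty input
        return []
    lo, hi = min(values), max(values)
    if hi == lo:  # edge case: constant input would otherwise divide by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_empty() -> None:
    """An empty input round-trips to an empty list."""
    assert normalize([]) == []


def test_normalize_constant() -> None:
    """Constant input must not raise ZeroDivisionError."""
    assert normalize([3.0, 3.0]) == [0.0, 0.0]


def test_normalize_range() -> None:
    """The minimum maps to 0 and the maximum to 1."""
    assert normalize([1.0, 2.0, 3.0]) == [0.0, 0.5, 1.0]
```

A test runner such as pytest would collect the `test_*` functions automatically.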
## Example of Proper Documentation
```python
import math


def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius².
    """
    return math.pi * radius ** 2
```