Add some docs for copilot

Tobias Hölzer 2025-12-28 20:11:11 +01:00
parent 1ee3d532fc
commit f8df10f687
9 changed files with 908 additions and 2027 deletions

.github/agents/Dashboard.agent.md (new file)

@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---
# Dashboard Development Agent
You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.
## Your Responsibilities
### What You Should Do
1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
- Streamlit: https://docs.streamlit.io/
- Plotly: https://plotly.com/python/
- PyDeck: https://deckgl.readthedocs.io/
- Deck.gl: https://deck.gl/
- Matplotlib: https://matplotlib.org/
- Seaborn: https://seaborn.pydata.org/
- Xarray: https://docs.xarray.dev/
- GeoPandas: https://geopandas.org/
- Pandas: https://pandas.pydata.org/pandas-docs/
- NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures—but **NEVER edit them**
### What You Should NOT Do
1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files
### Boundaries
If you identify that a dashboard improvement requires changes to:
- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)
**Stop immediately** and inform the user:
```
⚠️ This dashboard feature requires changes to the data pipeline/training code.
Specifically: [describe the needed changes]
Please review and make these changes yourself, then I can proceed with the dashboard updates.
```
## Dashboard Structure
The dashboard is located in `src/entropice/dashboard/` with the following structure:
```
dashboard/
├── app.py # Main Streamlit app with navigation
├── overview_page.py # Overview of training results
├── training_data_page.py # Training data visualizations
├── training_analysis_page.py # CV results and hyperparameter analysis
├── model_state_page.py # Feature importance and model state
├── inference_page.py # Spatial prediction visualizations
├── plots/ # Reusable plotting utilities
│ ├── colors.py # Color schemes
│ ├── hyperparameter_analysis.py
│ ├── inference.py
│ ├── model_state.py
│ ├── source_data.py
│ └── training_data.py
└── utils/ # Data loading and processing
├── data.py
└── training.py
```
## Key Technologies
- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)
## Critical Code Standards
### Streamlit Best Practices
**❌ INCORRECT** (deprecated):
```python
st.plotly_chart(fig, use_container_width=True)
```
**✅ CORRECT** (current API):
```python
st.plotly_chart(fig, width='stretch')
```
**Common width values**:
- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)
This applies to:
- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`
### Data Structure Patterns
When working with Entropice data:
1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities
### Visualization Guidelines
1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates
## Workflow
1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
## Example Interactions
### ✅ Good Requests (Within Scope)
- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"
### ⚠️ Boundary Cases (Requires User Approval)
User: "Add a new climate variable to the dashboard"
Agent Response:
```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.
Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```
## Progress Reporting
For complex dashboard development tasks:
1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress
## Remember
- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user
You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!

.github/copilot-instructions.md (new file)

@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions
## Project Overview
Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).
## Core Technologies
- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines
### Python Standards
- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
### Geospatial Best Practices
- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
### Data Pipeline Conventions
- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
### Storage Hierarchy
All data follows this structure:
```
DATA_DIR/
├── grids/ # H3/HEALPix tessellations (GeoParquet)
├── darts/ # RTS labels (GeoParquet)
├── era5/ # Climate data (Zarr)
├── arcticdem/ # Terrain data (Icechunk Zarr)
├── alphaearth/ # Satellite embeddings (Zarr)
├── datasets/ # L2 XDGGS datasets (Zarr)
├── training-results/ # Models, CV results, predictions
└── watermask/ # Ocean mask (GeoParquet)
```
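The hierarchy above pairs naturally with centralized path management. A minimal sketch of what `paths.py`-style helpers could look like, honoring the `FAST_DATA_DIR` environment variable — the helper names and the `.parquet` suffix are illustrative, not the module's actual API:

```python
import os
from pathlib import Path

# FAST_DATA_DIR controls the data directory location, defaulting to ./data.
DATA_DIR = Path(os.environ.get("FAST_DATA_DIR", "./data"))

# Hypothetical constants mirroring the storage hierarchy; the real
# paths.py may expose different names.
GRIDS_DIR = DATA_DIR / "grids"
ERA5_DIR = DATA_DIR / "era5"
TRAINING_RESULTS_DIR = DATA_DIR / "training-results"


def grid_path(name: str) -> Path:
    """Path to a grid tessellation stored as GeoParquet."""
    return GRIDS_DIR / f"{name}.parquet"
```

Routing every file access through helpers like these keeps pipeline stages reproducible when the data directory moves.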
## Module Organization
### Core Modules (`src/entropice/`)
- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management
### Dashboard (`src/entropice/dashboard/`)
- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`
### Scripts (`scripts/`)
- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules
### Notebooks (`notebooks/`)
- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`
## Development Workflow
### Setup
```bash
pixi install # NOT pip install or conda install
```
### Running Tests
```bash
pixi run pytest
```
### Common Tasks
- **Generate grids**: Use `grids.py` CLI
- **Process labels**: Use `darts.py` CLI
- **Train models**: Use `training.py` CLI with TOML config
- **Run inference**: Use `inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns
### 1. XDGGS Indexing
All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.
### 2. Lazy Evaluation
Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.
### 3. GPU Acceleration
Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable
### 4. Configuration as Code
- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources
- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer
## Model Support
Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors
Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
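The evaluation setup can be sketched with scikit-learn's `cross_validate`, which accepts several scorers at once. The dataset and estimator below are stand-ins (a tiny synthetic classification problem and a decision tree), not the project's actual models or data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for the real gridded feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Multi-metric evaluation across K folds.
scoring = ["accuracy", "f1_macro", "jaccard_macro", "precision_macro", "recall_macro"]
results = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring=scoring
)

# cross_validate returns one array of per-fold scores per metric.
mean_scores = {m: results[f"test_{m}"].mean() for m in scoring}
```

Any scikit-learn-compatible estimator (including the eSPA wrapper) can slot into the same harness.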
## Extension Points
To extend Entropice:
- **New data source**: Follow patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement extractor following `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module
## Important Notes
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
## Common Issues
- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
## References
For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions

.github/python.instructions.md (new file)

@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---
# Python Coding Conventions
## Python Instructions
- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Prefer builtin generics for annotations (e.g., `list[str]`, `dict[str, int]`); on modern Python, reach for the `typing` module only for constructs without builtin equivalents (e.g., `Protocol`, `TypedDict`).
- Break down complex functions into smaller, more manageable functions.
## General Instructions
- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.
## Code Style and Formatting
- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces for each level of indentation).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` keyword.
- Use blank lines to separate functions, classes, and code blocks where appropriate.
## Edge Cases and Testing
- Always include test cases for critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.
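As an illustration of these points, a small (invented) function with explicit edge-case handling, alongside unit tests whose docstrings document the expected behavior in each case:

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values linearly to [0, 1]; empty or constant input yields zeros."""
    if not values:  # edge case: empty input
        return []
    lo, hi = min(values), max(values)
    if hi == lo:  # edge case: constant input would otherwise divide by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_empty() -> None:
    """An empty input round-trips to an empty list."""
    assert normalize([]) == []


def test_normalize_constant() -> None:
    """Constant input must not raise ZeroDivisionError."""
    assert normalize([3.0, 3.0]) == [0.0, 0.0]


def test_normalize_range() -> None:
    """The minimum maps to 0 and the maximum to 1."""
    assert normalize([1.0, 2.0, 3.0]) == [0.0, 0.5, 1.0]
```

A test runner such as pytest would collect the `test_*` functions automatically.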
## Example of Proper Documentation
```python
import math


def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius².
    """
    return math.pi * radius ** 2
```