Add some docs for copilot
This commit is contained in:
parent
1ee3d532fc
commit
f8df10f687
9 changed files with 908 additions and 2027 deletions
176
.github/agents/Dashboard.agent.md
vendored
Normal file
@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---

# Dashboard Development Agent

You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.

## Your Responsibilities

### What You Should Do

1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
   - Streamlit: https://docs.streamlit.io/
   - Plotly: https://plotly.com/python/
   - PyDeck: https://deckgl.readthedocs.io/
   - Deck.gl: https://deck.gl/
   - Matplotlib: https://matplotlib.org/
   - Seaborn: https://seaborn.pydata.org/
   - Xarray: https://docs.xarray.dev/
   - GeoPandas: https://geopandas.org/
   - Pandas: https://pandas.pydata.org/pandas-docs/
   - NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures, but **NEVER edit them**

### What You Should NOT Do

1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files

### Boundaries

If you identify that a dashboard improvement requires changes to:

- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)

**Stop immediately** and inform the user:

```
⚠️ This dashboard feature requires changes to the data pipeline/training code.

Specifically: [describe the needed changes]

Please review and make these changes yourself, then I can proceed with the dashboard updates.
```

## Dashboard Structure

The dashboard is located in `src/entropice/dashboard/` with the following structure:

```
dashboard/
├── app.py                     # Main Streamlit app with navigation
├── overview_page.py           # Overview of training results
├── training_data_page.py      # Training data visualizations
├── training_analysis_page.py  # CV results and hyperparameter analysis
├── model_state_page.py        # Feature importance and model state
├── inference_page.py          # Spatial prediction visualizations
├── plots/                     # Reusable plotting utilities
│   ├── colors.py              # Color schemes
│   ├── hyperparameter_analysis.py
│   ├── inference.py
│   ├── model_state.py
│   ├── source_data.py
│   └── training_data.py
└── utils/                     # Data loading and processing
    ├── data.py
    └── training.py
```

## Key Technologies

- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)

## Critical Code Standards

### Streamlit Best Practices

**❌ INCORRECT** (deprecated):

```python
st.plotly_chart(fig, use_container_width=True)
```

**✅ CORRECT** (current API):

```python
st.plotly_chart(fig, width='stretch')
```

**Common width values**:

- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)

This applies to:

- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`

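The flag-to-width mapping above can be captured in a tiny helper when migrating older dashboard pages (`migrate_width` is a hypothetical name, not part of Streamlit):

```python
def migrate_width(use_container_width: bool) -> str:
    """Map the deprecated boolean flag to the current width value."""
    return "stretch" if use_container_width else "content"

# st.plotly_chart(fig, use_container_width=True) becomes:
# st.plotly_chart(fig, width=migrate_width(True))  # width='stretch'
```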
### Data Structure Patterns

When working with Entropice data:

1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities

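As a rough sketch of pattern 4, predictions are tabular rows keyed by a grid cell ID. The example uses plain pandas as a stand-in for the GeoDataFrame, with hypothetical column names and H3-style IDs:

```python
import pandas as pd

# Plain-pandas stand-in for the prediction GeoDataFrame pattern: rows keyed
# by a discrete-grid cell id, with predicted class and probability columns.
# Column names and cell ids are illustrative, not the project schema.
predictions = pd.DataFrame(
    {
        "cell_id": ["8a2834c0007ffff", "8a2834c0017ffff"],  # H3-style ids
        "pred_class": [1, 0],
        "prob_rts": [0.83, 0.12],
    }
).set_index("cell_id")
```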
### Visualization Guidelines

1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates

## Workflow

1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development

## Example Interactions

### ✅ Good Requests (Within Scope)

- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"

### ⚠️ Boundary Cases (Requires User Approval)

User: "Add a new climate variable to the dashboard"

Agent Response:

```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.

Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```

## Progress Reporting

For complex dashboard development tasks:

1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress

## Remember

- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user

You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!

193
.github/copilot-instructions.md
vendored
Normal file
@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions
# Entropice - GitHub Copilot Instructions

## Project Overview

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.

For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).

## Core Technologies

- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy

## Code Style Guidelines

### Python Standards

- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations

### Geospatial Best Practices

- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle the **antimeridian** correctly for polar regions

### Data Pipeline Conventions

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls the data directory location (default: `./data`)

### Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/             # H3/HEALPix tessellations (GeoParquet)
├── darts/             # RTS labels (GeoParquet)
├── era5/              # Climate data (Zarr)
├── arcticdem/         # Terrain data (Icechunk Zarr)
├── alphaearth/        # Satellite embeddings (Zarr)
├── datasets/          # L2 XDGGS datasets (Zarr)
├── training-results/  # Models, CV results, predictions
└── watermask/         # Ocean mask (GeoParquet)
```

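In code, this hierarchy is typically reached through path helpers. A minimal sketch of the `FAST_DATA_DIR` convention follows; the real resolution logic lives in `src/entropice/paths.py` and may differ in detail:

```python
import os
from pathlib import Path

# Illustrative sketch only: resolve the data root from FAST_DATA_DIR,
# falling back to ./data as described above.
DATA_DIR = Path(os.environ.get("FAST_DATA_DIR", "./data"))
GRIDS_DIR = DATA_DIR / "grids"        # H3/HEALPix tessellations (GeoParquet)
DATASETS_DIR = DATA_DIR / "datasets"  # L2 XDGGS datasets (Zarr)
```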
## Module Organization

### Core Modules (`src/entropice/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management

### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`

### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run the entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules

### Notebooks (`notebooks/`)

- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`

## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```

### Running Tests

```bash
pixi run pytest
```

### Common Tasks

- **Generate grids**: Use the `grids.py` CLI
- **Process labels**: Use the `darts.py` CLI
- **Train models**: Use the `training.py` CLI with a TOML config
- **Run inference**: Use the `inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the `xdggs` library for consistent spatial indexing across sources.

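For a sense of scale, HEALPix cell counts follow the standard relation npix = 12 · 4^order (with nside = 2^order), so the orders this project uses tile the globe at sharply increasing resolutions. A quick helper (standard HEALPix arithmetic, not project code):

```python
def healpix_npix(order: int) -> int:
    """Total number of HEALPix cells on the sphere at a given order."""
    # nside = 2**order, npix = 12 * nside**2 = 12 * 4**order
    return 12 * 4**order

assert healpix_npix(0) == 12  # base resolution: 12 cells
```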

### 2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

### 3. GPU Acceleration

Prefer GPU-accelerated operations:

- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable

### 4. Configuration as Code

- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing

## Data Sources

- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer

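A sketch of how a density label can be derived from count and cell area, in the spirit of the DARTS fields listed above (column names are assumptions, not the DARTS schema):

```python
import pandas as pd

# Illustrative only: rts_density as count normalized by cell area.
labels = pd.DataFrame({"rts_count": [3, 0], "cell_area_km2": [252.9, 252.9]})
labels["rts_density"] = labels["rts_count"] / labels["cell_area_km2"]
```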
## Model Support

- Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
- Alternatives: XGBoost, Random Forest, K-Nearest Neighbors

Training features:

- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)

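For reference, two of these metrics computed by hand from a toy confusion count (in practice they come from standard metric implementations during cross-validation):

```python
# Toy confusion counts: true positives, false positives, false negatives.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # 2/3
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```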
## Extension Points

To extend Entropice:

- **New data source**: Follow the patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to the `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement an extractor following the `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to the `dashboard/` module

## Important Notes

- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only; **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`

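The Arctic-aligned year convention can be made concrete with a small helper. The labeling choice here (the year containing the period's September 30 end) is an assumption for illustration, not necessarily the project's convention:

```python
from datetime import date

def arctic_year(d: date) -> int:
    """Label a date with its Arctic-aligned year (Oct 1 - Sep 30 periods).

    Assumes the label is the calendar year of the period's Sep 30 end.
    """
    return d.year + 1 if d.month >= 10 else d.year

assert arctic_year(date(2020, 10, 1)) == 2021  # first day of the 2021 period
assert arctic_year(date(2021, 9, 30)) == 2021  # last day of the same period
```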
## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326

## References

For more details, consult:

- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions

56
.github/python.instructions.md
vendored
Normal file
@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---

# Python Coding Conventions

## Python Instructions

- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Use the `typing` module for type annotations (e.g., `List[str]`, `Dict[str, int]`).
- Break down complex functions into smaller, more manageable functions.

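A short example following these conventions; the function is illustrative, not project code:

```python
from typing import Dict, List

def count_words(lines: List[str]) -> Dict[str, int]:
    """Count word occurrences across lines of text."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```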
## General Instructions

- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.

## Code Style and Formatting

- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces per indentation level).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` line.
- Use blank lines to separate functions, classes, and code blocks where appropriate.

## Edge Cases and Testing

- Always include test cases for critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.

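A hedged example of this guidance: small unit tests for a circle-area helper, including a zero-radius edge case. The negative-radius guard is an illustrative addition, not part of the documentation example below.

```python
import math

def calculate_area(radius: float) -> float:
    """Circle area; rejects negative radii as an edge-case guard."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return math.pi * radius**2

def test_calculate_area_basic() -> None:
    """Critical path: unit radius yields pi."""
    assert math.isclose(calculate_area(1.0), math.pi)

def test_calculate_area_zero_radius() -> None:
    """Edge case: zero radius yields zero area."""
    assert calculate_area(0.0) == 0.0

test_calculate_area_basic()
test_calculate_area_zero_radius()
```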
## Example of Proper Documentation

```python
import math


def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius^2.
    """
    return math.pi * radius**2
```