Update docs, instructions and format code
This commit is contained in:
parent fca232da91
commit 4260b492ab
29 changed files with 987 additions and 467 deletions

271 .github/agents/Dashboard.agent.md vendored
@@ -1,56 +1,54 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
description: Develop and refactor Streamlit dashboard pages and visualizations
name: Dashboard
argument-hint: Describe dashboard features, pages, or visualizations to add or modify
tools: ['vscode', 'execute', 'read', 'edit', 'search', 'web', 'agent', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'todo']
model: Claude Sonnet 4.5
infer: true
---

# Dashboard Development Agent

You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.
You specialize in developing and refactoring the **Entropice Streamlit Dashboard** for geospatial machine learning analysis.

## Your Responsibilities
## Scope

### What You Should Do
**You can edit:** Files in `src/entropice/dashboard/` only
**You cannot edit:** Data pipeline scripts, training code, or configuration files

1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
   - Streamlit: https://docs.streamlit.io/
   - Plotly: https://plotly.com/python/
   - PyDeck: https://deckgl.readthedocs.io/
   - Deck.gl: https://deck.gl/
   - Matplotlib: https://matplotlib.org/
   - Seaborn: https://seaborn.pydata.org/
   - Xarray: https://docs.xarray.dev/
   - GeoPandas: https://geopandas.org/
   - Pandas: https://pandas.pydata.org/pandas-docs/
   - NumPy: https://numpy.org/doc/stable/
**Primary reference:** Always consult `views/overview_page.py` for current code patterns

6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures—but **NEVER edit them**
## Responsibilities

### What You Should NOT Do
### ✅ What You Do

1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files
- Create/refactor dashboard pages in `views/`
- Build visualizations using Plotly, Matplotlib, Seaborn, PyDeck, Altair
- Fix dashboard bugs and improve UI/UX
- Create utility functions in `utils/` and `plots/`
- Read (but never edit) data pipeline code to understand data structures
- Use #tool:web to fetch library documentation:
  - Streamlit: https://docs.streamlit.io/
  - Plotly: https://plotly.com/python/
  - PyDeck: https://deckgl.readthedocs.io/
  - Xarray: https://docs.xarray.dev/
  - GeoPandas: https://geopandas.org/

### Boundaries
### ❌ What You Don't Do

If you identify that a dashboard improvement requires changes to:
- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)
- Edit files outside `src/entropice/dashboard/`
- Modify data pipeline (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Change training code (`training.py`, `dataset.py`, `inference.py`)
- Edit configuration (`pyproject.toml`, `scripts/*.sh`)

### When to Stop

If a dashboard feature requires changes outside `dashboard/`, stop and inform:

**Stop immediately** and inform the user:

```
⚠️ This dashboard feature requires changes to the data pipeline/training code.
Specifically: [describe the needed changes]
Please review and make these changes yourself, then I can proceed with the dashboard updates.
⚠️ This requires changes to [file/module]
Needed: [describe changes]
Please make these changes first, then I can update the dashboard.
```

## Dashboard Structure
@@ -60,23 +58,28 @@ The dashboard is located in `src/entropice/dashboard/` with the following struct

```
dashboard/
├── app.py                          # Main Streamlit app with navigation
├── overview_page.py                # Overview of training results
├── training_data_page.py           # Training data visualizations
├── training_analysis_page.py       # CV results and hyperparameter analysis
├── model_state_page.py             # Feature importance and model state
├── inference_page.py               # Spatial prediction visualizations
├── views/                          # Dashboard pages
│   ├── overview_page.py            # Overview of training results and dataset analysis
│   ├── training_data_page.py       # Training data visualizations (needs refactoring)
│   ├── training_analysis_page.py   # CV results and hyperparameter analysis (needs refactoring)
│   ├── model_state_page.py         # Feature importance and model state (needs refactoring)
│   └── inference_page.py           # Spatial prediction visualizations (needs refactoring)
├── plots/                          # Reusable plotting utilities
│   ├── colors.py                   # Color schemes
│   ├── hyperparameter_analysis.py
│   ├── inference.py
│   ├── model_state.py
│   ├── source_data.py
│   └── training_data.py
└── utils/                          # Data loading and processing
    ├── data.py
    └── training.py
└── utils/                          # Data loading and processing utilities
    ├── loaders.py                  # Data loaders (training results, grid data, predictions)
    ├── stats.py                    # Dataset statistics computation and caching
    ├── colors.py                   # Color palette management
    ├── formatters.py               # Display formatting utilities
    └── unsembler.py                # Dataset ensemble utilities
```

**Note:** Currently only `overview_page.py` has been refactored to follow the new patterns. Other pages need updating to match this structure.

## Key Technologies

- **Streamlit**: Web app framework
@@ -120,6 +123,79 @@ When working with Entropice data:

3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities

### Dashboard Code Patterns

**Follow these patterns when developing or refactoring dashboard pages:**

1. **Modular Render Functions**: Break pages into focused render functions
   ```python
   def render_sample_count_overview():
       """Render overview of sample counts per task+target+grid+level combination."""
       ...  # Implementation

   def render_feature_count_section():
       """Render the feature count section with comparison and explorer."""
       ...  # Implementation
   ```

2. **Use `@st.fragment` for Interactive Components**: Isolate reactive UI elements
   ```python
   @st.fragment
   def render_feature_count_explorer():
       """Render interactive detailed configuration explorer using fragments."""
       ...  # Interactive selectboxes and checkboxes that re-run independently
   ```

3. **Cached Data Loading via Utilities**: Use centralized loaders from `utils/loaders.py`
   ```python
   from entropice.dashboard.utils.loaders import load_all_training_results
   from entropice.dashboard.utils.stats import load_all_default_dataset_statistics

   training_results = load_all_training_results()  # Cached via @st.cache_data
   all_stats = load_all_default_dataset_statistics()  # Cached via @st.cache_data
   ```

4. **Consistent Color Palettes**: Use `get_palette()` from `utils/colors.py`
   ```python
   from entropice.dashboard.utils.colors import get_palette

   task_colors = get_palette("task_types", n_colors=n_tasks)
   source_colors = get_palette("data_sources", n_colors=n_sources)
   ```
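For illustration, a deterministic palette lookup along these lines could back such a helper. This is a sketch only: the real `get_palette()` in `utils/colors.py` draws on pypalettes material design palettes, and the hex values and palette table here are placeholders.

```python
import hashlib
from itertools import cycle, islice

# Placeholder palettes; the real implementation uses pypalettes
# material design palettes with a deterministic mapping.
_PALETTES = [
    ["#1976d2", "#388e3c", "#f57c00", "#7b1fa2"],
    ["#0097a7", "#c2185b", "#5d4037", "#455a64"],
]


def get_palette(variable: str, n_colors: int) -> list[str]:
    """Return a deterministic list of hex colors for a semantic variable name."""
    # Hash the variable name so the same variable always maps to the same palette
    idx = int(hashlib.md5(variable.encode()).hexdigest(), 16) % len(_PALETTES)
    base = _PALETTES[idx]
    # Cycle the base palette when more colors are requested than it holds
    return list(islice(cycle(base), n_colors))
```

The key property the sketch preserves is determinism: calling it twice with the same variable name yields identical colors, so plots across pages stay visually consistent.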
5. **Type Hints and Type Casting**: Use types from `entropice.utils.types`
   ```python
   from entropice.utils.types import GridConfig, L2SourceDataset, TargetDataset, grid_configs

   selected_grid_config: GridConfig = next(gc for gc in grid_configs if gc.display_name == grid_level_combined)
   selected_members: list[L2SourceDataset] = []
   ```

6. **Tab-Based Organization**: Use tabs to organize complex visualizations
   ```python
   tab1, tab2, tab3 = st.tabs(["📈 Heatmap", "📊 Bar Chart", "📋 Data Table"])
   with tab1:
       ...  # Heatmap visualization
   with tab2:
       ...  # Bar chart visualization
   ```

7. **Layout with Columns**: Use columns for metrics and side-by-side content
   ```python
   col1, col2, col3 = st.columns(3)
   with col1:
       st.metric("Total Features", f"{total_features:,}")
   with col2:
       st.metric("Data Sources", len(selected_members))
   ```

8. **Comprehensive Docstrings**: Document render functions clearly
   ```python
   def render_training_results_summary(training_results):
       """Render summary metrics for training results."""
       ...  # Implementation
   ```

### Visualization Guidelines

1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
@@ -127,50 +203,79 @@

3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates
6. **Heatmaps**: Use `px.imshow()` with color palettes from `get_palette()`
7. **Interactive Tables**: Use `st.dataframe()` with `width='stretch'` and formatting

### Key Utility Modules

**`utils/loaders.py`**: Data loading with Streamlit caching
- `load_all_training_results()`: Load all training result directories
- `load_training_result(path)`: Load specific training result
- `TrainingResult` dataclass: Structured training result data

**`utils/stats.py`**: Dataset statistics computation
- `load_all_default_dataset_statistics()`: Load/compute stats for all grid configs
- `DatasetStatistics` class: Statistics per grid configuration
- `MemberStatistics` class: Statistics per L2 source dataset
- `TargetStatistics` class: Statistics per target dataset
- Helper methods: `get_sample_count_df()`, `get_feature_count_df()`, `get_feature_breakdown_df()`

**`utils/colors.py`**: Consistent color palette management
- `get_palette(variable, n_colors)`: Get color palette by semantic variable name
- `get_cmap(variable)`: Get matplotlib colormap
- "Refactor training_data_page.py to match the patterns in overview_page.py"
- "Add a new tab to the overview page showing temporal statistics"
- "Create a reusable plotting function in plots/ for feature importance"
- Uses pypalettes material design palettes with deterministic mapping

**`utils/formatters.py`**: Display formatting utilities
- `ModelDisplayInfo`: Model name formatting
- `TaskDisplayInfo`: Task name formatting
- `TrainingResultDisplayInfo`: Training result display names
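As a sketch of what such a display-info helper might look like — the field name, acronym list, and `display_name` property are hypothetical; the actual classes in `utils/formatters.py` may differ:

```python
from dataclasses import dataclass

# Hypothetical list of acronyms to upper-case in display names
_ACRONYMS = {"rts", "cv", "knn"}


@dataclass(frozen=True)
class TaskDisplayInfo:
    task_key: str  # e.g. "rts_density"

    @property
    def display_name(self) -> str:
        """Turn a snake_case task key into a human-readable label."""
        return " ".join(
            word.upper() if word in _ACRONYMS else word.capitalize()
            for word in self.task_key.split("_")
        )
```

With this shape, `TaskDisplayInfo("rts_density").display_name` yields "RTS Density", keeping labels consistent wherever a task is shown.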
## Workflow

1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
1. Check `views/overview_page.py` for current patterns
2. Use #tool:search to find relevant code and data structures
3. Read data pipeline code if needed (read-only)
4. Leverage existing utilities from `utils/`
5. Use #tool:web to fetch documentation when needed
6. Implement changes following overview_page.py patterns
7. Use #tool:todo for multi-step tasks

## Example Interactions
## Refactoring Checklist

### ✅ Good Requests (Within Scope)
When updating pages to match new patterns:

- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"
1. Move to `views/` subdirectory
2. Use cached loaders from `utils/loaders.py` and `utils/stats.py`
3. Split into focused `render_*()` functions
4. Wrap interactive UI with `@st.fragment`
5. Replace hardcoded colors with `get_palette()`
6. Add type hints from `entropice.utils.types`
7. Organize with tabs for complex views
8. Use `width='stretch'` for charts/tables
9. Add comprehensive docstrings
10. Reference `overview_page.py` patterns

### ⚠️ Boundary Cases (Requires User Approval)
## Example Tasks

User: "Add a new climate variable to the dashboard"
Agent Response:

```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.
Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```

**✅ In Scope:**
- "Add feature correlation heatmap to overview page"
- "Create PyDeck map for RTS predictions"
- "Refactor training_data_page.py to match overview_page.py patterns"
- "Fix use_container_width deprecation warnings"
- "Add temporal statistics tab"

## Progress Reporting
**⚠️ Out of Scope:**
- "Add new climate variable" → Requires changes to `era5.py`
- "Change training metrics" → Requires changes to `training.py`
- "Modify grid generation" → Requires changes to `grids.py`

For complex dashboard development tasks:
## Key Reminders

1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark completed immediately after finishing
4. Keep the user informed of progress

## Remember

- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user

You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!
- Only edit files in `dashboard/`
- Use `width='stretch'` not `use_container_width=True`
- Always reference `overview_page.py` for patterns
- Use #tool:web for documentation
- Use #tool:todo for complex multi-step work
254 .github/copilot-instructions.md vendored
@@ -1,226 +1,70 @@

# Entropice - GitHub Copilot Instructions
# Entropice - Copilot Instructions

## Project Overview
## Project Context

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.
This is a geospatial machine learning system for predicting Arctic permafrost degradation (Retrogressive Thaw Slumps) using entropy-optimal Scalable Probabilistic Approximations (eSPA). The system processes multi-source geospatial data through discrete global grid systems (H3, HEALPix) and trains probabilistic classifiers.

For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).
## Code Style

## Core Technologies
- Follow PEP 8 conventions with 120 character line length
- Use type hints for all function signatures
- Use google-style docstrings for public functions
- Keep functions focused and modular
- Use `ruff` for linting and formatting, `ty` for type checking

- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Technology Stack

## Code Style Guidelines
- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec for discrete global grid systems
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA 12 - prefer GPU-accelerated operations
- **Storage**: Zarr, Icechunk, Parquet for intermediate data
- **CLI**: Cyclopts with dataclass-based configurations

### Python Standards
## Execution Guidelines

- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations
- Always use `pixi run` to execute Python commands and scripts
- Environment variables: `SCIPY_ARRAY_API=1`, `FAST_DATA_DIR=./data`
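The `FAST_DATA_DIR` convention can be sketched with a small resolver — illustrative only; the project's actual path helpers live in `utils/paths.py` and the `grids` subpath below is just an example of the documented layout:

```python
import os
from pathlib import Path


def data_dir() -> Path:
    """Resolve the data directory, falling back to ./data when FAST_DATA_DIR is unset."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


# Example derived path following the documented storage layout
grids_dir = data_dir() / "grids"
```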
### Geospatial Best Practices
## Geospatial Best Practices

- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle **antimeridian** correctly for polar regions
- Use EPSG:3413 (Arctic Stereographic) for computations
- Use EPSG:4326 (WGS84) for visualization and compatibility
- Store gridded data as XDGGS Xarray datasets (Zarr format)
- Store tabular data as GeoParquet
- Handle antimeridian issues in polar regions
- Leverage Xarray/Dask for lazy evaluation and chunked processing

### Data Pipeline Conventions
## Architecture Patterns

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/utils/paths.py` for consistent path management
- Environment variable `FAST_DATA_DIR` controls data directory location (default: `./data`)
- Modular CLI design: each module exposes standalone Cyclopts CLI
- Configuration as code: use dataclasses for typed configs, TOML for hyperparameters
- GPU acceleration: use CuPy for arrays, cuML for ML, batch processing for memory management
- Data flow: Raw sources → Grid aggregation → L2 datasets → Training → Inference → Visualization

### Storage Hierarchy
## Data Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
├── training-results/   # Models, CV results, predictions
└── watermask/          # Ocean mask (GeoParquet)
├── grids/              # H3/HEALPix tessellations (GeoParquet)
├── darts/              # RTS labels (GeoParquet)
├── era5/               # Climate data (Zarr)
├── arcticdem/          # Terrain data (Icechunk Zarr)
├── alphaearth/         # Satellite embeddings (Zarr)
├── datasets/           # L2 XDGGS datasets (Zarr)
└── training-results/   # Models, CV results, predictions
```

## Module Organization
## Key Modules

### Core Modules (`src/entropice/`)
- `entropice.spatial`: Grid generation and raster-to-vector aggregation
- `entropice.ingest`: Data extractors (DARTS, ERA5, ArcticDEM, AlphaEarth)
- `entropice.ml`: Dataset assembly, training, inference
- `entropice.dashboard`: Streamlit visualization app
- `entropice.utils`: Paths, codecs, types

The codebase is organized into four main packages:
## Testing & Notebooks

- **`entropice.ingest`**: Data ingestion from external sources
- **`entropice.spatial`**: Spatial operations and grid management
- **`entropice.ml`**: Machine learning workflows
- **`entropice.utils`**: Common utilities

#### Data Ingestion (`src/entropice/ingest/`)

- **`darts.py`**: RTS label extraction from DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine

#### Spatial Operations (`src/entropice/spatial/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`watermask.py`**: Ocean masking utilities
- **`xvec.py`**: Extended vector operations for xarray

#### Machine Learning (`src/entropice/ml/`)

- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models

#### Utilities (`src/entropice/utils/`)

- **`paths.py`**: Centralized path management
- **`codecs.py`**: Custom codecs for data serialization

### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`

### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run entire pipeline for multiple grid configurations
- Each script uses CLIs from core modules

### Notebooks (`notebooks/`)

- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`

## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```

### Running Tests

```bash
pixi run pytest
```

### Running Python Commands

Always use `pixi run` to execute Python commands so they run in the correct environment:

```bash
pixi run python script.py
pixi run python -c "import entropice"
```

### Common Tasks

**Important**: Always use `pixi run` prefix for Python commands to ensure correct environment.

- **Generate grids**: Use `pixi run create-grid` or `spatial/grids.py` CLI
- **Process labels**: Use `pixi run darts` or `ingest/darts.py` CLI
- **Train models**: Use `pixi run train` with TOML config or `ml/training.py` CLI
- **Run inference**: Use `ml/inference.py` CLI
- **View results**: `pixi run dashboard`

## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via `xdggs` library for consistent spatial indexing across sources.

### 2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

### 3. GPU Acceleration

Prefer GPU-accelerated operations:
- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable

### 4. Configuration as Code

- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources

- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 64-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer

## Model Support

Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors

Training features:
- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
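A stripped-down sketch of what randomized hyperparameter sampling means here, using only the standard library — the parameter names and values are hypothetical, and the real search in `ml/training.py` is model-specific:

```python
import random

# Hypothetical parameter space; actual spaces depend on the model
param_space = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 16],
}


def sample_params(rng: random.Random, n_trials: int) -> list[dict]:
    """Draw n_trials random parameter combinations from the space."""
    return [
        {name: rng.choice(values) for name, values in param_space.items()}
        for _ in range(n_trials)
    ]


# Seeding makes the search reproducible across runs
trials = sample_params(random.Random(42), n_trials=10)
```

Each sampled combination would then be scored with K-Fold cross-validation across the evaluation metrics listed above.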
## Extension Points

To extend Entropice:

- **New data source**: Follow patterns in `ingest/era5.py` or `ingest/arcticdem.py`
- **Custom aggregations**: Add to `_Aggregations` dataclass in `spatial/aggregators.py`
- **Alternative labels**: Implement extractor following `ingest/darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `ml/training.py`
- **Dashboard pages**: Add Streamlit pages to `dashboard/` module

## Important Notes

- **Always use `pixi run` prefix** for Python commands (not plain `python`)
- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
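A small helper makes the Arctic-aligned year convention concrete. This is a sketch: labeling the Arctic year by the calendar year in which it ends is an assumption here, and the project may use a different labeling.

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Return the Arctic-aligned year for a date (years run Oct 1 - Sep 30).

    Assumption: the Arctic year is labeled by the calendar year in which
    it ends, so Oct 2020 - Sep 2021 is Arctic year 2021.
    """
    return d.year + 1 if d.month >= 10 else d.year
```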
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `utils/paths.py`

## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce batch size in inference or training
- **Antimeridian**: Use proper handling in `spatial/aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326

## References

For more details, consult:
- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions
- Production code belongs in `src/entropice/`, not notebooks
- Notebooks in `notebooks/` are for exploration only (not version-controlled)
- Use `pytest` for testing geospatial correctness and data integrity
7 .github/python.instructions.md vendored
@@ -10,7 +10,7 @@ applyTo: '**/*.py,**/*.ipynb'

- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Use the `typing` module for type annotations (e.g., `List[str]`, `Dict[str, int]`).
- Use the `typing` module for advanced type annotations (e.g., `TypedDict`, `Literal["a", "b", ...]`).
- Break down complex functions into smaller, more manageable functions.
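For instance, a short self-contained illustration of the suggested `TypedDict`/`Literal` usage — the `GridSpec` name and fields are invented for this example:

```python
from typing import Literal, TypedDict


class GridSpec(TypedDict):
    """Hypothetical grid specification constrained by advanced annotations."""

    system: Literal["h3", "healpix"]
    resolution: int


def describe_grid(spec: GridSpec) -> str:
    """Return a short human-readable description of a grid specification."""
    return f"{spec['system']} grid at resolution {spec['resolution']}"


spec: GridSpec = {"system": "h3", "resolution": 4}
```

A type checker will flag `{"system": "s2", ...}` because `"s2"` is not a member of the `Literal`, catching invalid configurations before runtime.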
## General Instructions

@@ -27,7 +27,7 @@ applyTo: '**/*.py,**/*.ipynb'

- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces for each level of indentation).
- Ensure lines do not exceed 79 characters.
- Ensure lines do not exceed 120 characters.
- Place function and class docstrings immediately after the `def` or `class` keyword.
- Use blank lines to separate functions, classes, and code blocks where appropriate.

@@ -41,6 +41,8 @@ applyTo: '**/*.py,**/*.ipynb'

## Example of Proper Documentation

```python
import math

def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.
@@ -51,6 +53,5 @@ def calculate_area(radius: float) -> float:
    Returns:
        float: The area of the circle, calculated as π * radius^2.
    """
    import math
    return math.pi * radius ** 2
```