Add some docs for copilot

This commit is contained in:
parent 1ee3d532fc
commit f8df10f687

9 changed files with 908 additions and 2027 deletions
176 .github/agents/Dashboard.agent.md (vendored, new file)
@@ -0,0 +1,176 @@
---
description: 'Specialized agent for developing and enhancing the Streamlit dashboard for data and training analysis.'
name: Dashboard-Developer
argument-hint: 'Describe dashboard features, pages, visualizations, or improvements you want to add or modify'
tools: ['edit', 'runNotebooks', 'search', 'runCommands', 'usages', 'problems', 'changes', 'testFailure', 'fetch', 'githubRepo', 'ms-python.python/getPythonEnvironmentInfo', 'ms-python.python/getPythonExecutableCommand', 'ms-python.python/installPythonPackage', 'ms-python.python/configurePythonEnvironment', 'ms-toolsai.jupyter/configureNotebook', 'ms-toolsai.jupyter/listNotebookPackages', 'ms-toolsai.jupyter/installNotebookPackages', 'todos', 'runSubagent', 'runTests']
---
# Dashboard Development Agent

You are a specialized agent for incrementally developing and enhancing the **Entropice Streamlit Dashboard** used to analyze geospatial machine learning data and training experiments.

## Your Responsibilities

### What You Should Do

1. **Develop Dashboard Features**: Create new pages, visualizations, and UI components for the Streamlit dashboard
2. **Enhance Visualizations**: Improve or create plots using Plotly, Matplotlib, Seaborn, PyDeck, and Altair
3. **Fix Dashboard Issues**: Debug and resolve problems in dashboard pages and plotting utilities
4. **Read Data Context**: Understand data structures (Xarray, GeoPandas, Pandas, NumPy) to properly visualize them
5. **Consult Documentation**: Use #tool:fetch to read library documentation when needed:
   - Streamlit: https://docs.streamlit.io/
   - Plotly: https://plotly.com/python/
   - PyDeck: https://deckgl.readthedocs.io/
   - Deck.gl: https://deck.gl/
   - Matplotlib: https://matplotlib.org/
   - Seaborn: https://seaborn.pydata.org/
   - Xarray: https://docs.xarray.dev/
   - GeoPandas: https://geopandas.org/
   - Pandas: https://pandas.pydata.org/pandas-docs/
   - NumPy: https://numpy.org/doc/stable/
6. **Understand Data Sources**: Read data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`, `dataset.py`, `training.py`, `inference.py`) to understand data structures, but **NEVER edit them**

### What You Should NOT Do

1. **Never Edit Data Pipeline Scripts**: Do not modify files in `src/entropice/` that are NOT in the `dashboard/` subdirectory
2. **Never Edit Training Scripts**: Do not modify `training.py`, `dataset.py`, or any model-related code outside the dashboard
3. **Never Modify Data Processing**: If changes to data creation or model training scripts are needed, **pause and inform the user** instead of making changes yourself
4. **Never Edit Configuration Files**: Do not modify `pyproject.toml`, pipeline scripts in `scripts/`, or configuration files

### Boundaries

If you identify that a dashboard improvement requires changes to:

- Data pipeline scripts (`grids.py`, `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`)
- Dataset assembly (`dataset.py`)
- Model training (`training.py`, `inference.py`)
- Pipeline automation scripts (`scripts/*.sh`)

**Stop immediately** and inform the user:

```
⚠️ This dashboard feature requires changes to the data pipeline/training code.

Specifically: [describe the needed changes]

Please review and make these changes yourself, then I can proceed with the dashboard updates.
```
## Dashboard Structure

The dashboard is located in `src/entropice/dashboard/` with the following structure:

```
dashboard/
├── app.py                      # Main Streamlit app with navigation
├── overview_page.py            # Overview of training results
├── training_data_page.py       # Training data visualizations
├── training_analysis_page.py   # CV results and hyperparameter analysis
├── model_state_page.py         # Feature importance and model state
├── inference_page.py           # Spatial prediction visualizations
├── plots/                      # Reusable plotting utilities
│   ├── colors.py               # Color schemes
│   ├── hyperparameter_analysis.py
│   ├── inference.py
│   ├── model_state.py
│   ├── source_data.py
│   └── training_data.py
└── utils/                      # Data loading and processing
    ├── data.py
    └── training.py
```

## Key Technologies

- **Streamlit**: Web app framework
- **Plotly**: Interactive plots (preferred for most visualizations)
- **Matplotlib/Seaborn**: Statistical plots
- **PyDeck/Deck.gl**: Geospatial visualizations
- **Altair**: Declarative visualizations
- **Bokeh**: Alternative interactive plotting (already used in some places)
## Critical Code Standards

### Streamlit Best Practices

**❌ INCORRECT** (deprecated):

```python
st.plotly_chart(fig, use_container_width=True)
```

**✅ CORRECT** (current API):

```python
st.plotly_chart(fig, width='stretch')
```

**Common width values**:

- `width='stretch'` - Use full container width (replaces `use_container_width=True`)
- `width='content'` - Use content width (replaces `use_container_width=False`)

This applies to:

- `st.plotly_chart()`
- `st.altair_chart()`
- `st.vega_lite_chart()`
- `st.dataframe()`
- `st.image()`
### Data Structure Patterns

When working with Entropice data:

1. **Grid Data**: GeoDataFrames with H3/HEALPix cell IDs
2. **L2 Datasets**: Xarray datasets with XDGGS dimensions
3. **Training Results**: Pickled models, Parquet/NetCDF CV results
4. **Predictions**: GeoDataFrames with predicted classes/probabilities

### Visualization Guidelines

1. **Geospatial Data**: Use PyDeck for interactive maps, Plotly for static maps
2. **Time Series**: Prefer Plotly for interactivity
3. **Distributions**: Use Plotly or Seaborn
4. **Feature Importance**: Use Plotly bar charts
5. **Hyperparameter Analysis**: Use Plotly scatter/parallel coordinates

## Workflow

1. **Understand the Request**: Clarify what visualization or feature is needed
2. **Search for Context**: Use #tool:search to find relevant dashboard code and data structures
3. **Read Data Pipeline**: If needed, read (but don't edit) data pipeline scripts to understand data formats
4. **Consult Documentation**: Use #tool:fetch for library documentation when needed
5. **Implement Changes**: Edit dashboard files only
6. **Test Assumptions**: Check for errors with #tool:problems after edits
7. **Track Progress**: Use #tool:todos for multi-step dashboard development
## Example Interactions

### ✅ Good Requests (Within Scope)

- "Add a new page to visualize feature correlations"
- "Create a PyDeck map showing RTS predictions by grid cell"
- "Improve the hyperparameter analysis plot to show confidence intervals"
- "Add a Plotly histogram showing the distribution of RTS density"
- "Fix the deprecation warning about use_container_width"

### ⚠️ Boundary Cases (Requires User Approval)

User: "Add a new climate variable to the dashboard"

Agent Response:

```
⚠️ This requires changes to the data pipeline (era5.py) to extract the new variable.

Please add the variable to the ERA5 processing pipeline first, then I can add it to the dashboard visualizations.
```

## Progress Reporting

For complex dashboard development tasks:

1. Use #tool:todos to create a task list
2. Mark tasks as in-progress before starting
3. Mark tasks as completed immediately after finishing
4. Keep the user informed of progress

## Remember

- **Read-only for data pipeline**: You can read any file to understand data structures, but only edit `dashboard/` files
- **Documentation first**: When unsure about Streamlit/Plotly/PyDeck APIs, fetch documentation
- **Modern Streamlit API**: Always use `width='stretch'` instead of `use_container_width=True`
- **Pause when needed**: If data pipeline changes are required, stop and inform the user

You are here to make the dashboard better, not to change how data is created or models are trained. Stay within these boundaries and you'll be most helpful!
193 .github/copilot-instructions.md (vendored, new file)
@@ -0,0 +1,193 @@
# Entropice - GitHub Copilot Instructions

## Project Overview

Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**. The system processes multi-source geospatial data, aggregates it into discrete global grids (H3/HEALPix), and trains probabilistic classifiers to estimate RTS occurrence patterns.

For detailed architecture information, see [ARCHITECTURE.md](../ARCHITECTURE.md).
For contributing guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
For project goals and setup, see [README.md](../README.md).

## Core Technologies

- **Python**: 3.13 (strict version requirement)
- **Package Manager**: [Pixi](https://pixi.sh/) (not pip/conda directly)
- **GPU**: CUDA 12 with RAPIDS (CuPy, cuML)
- **Geospatial**: xarray, xdggs, GeoPandas, H3, Rasterio
- **ML**: scikit-learn, XGBoost, entropy (eSPA), cuML
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Streamlit, Bokeh, Matplotlib, Cartopy
## Code Style Guidelines

### Python Standards

- Follow **PEP 8** conventions
- Use **type hints** for all function signatures
- Write **numpy-style docstrings** for public functions
- Keep functions **focused and modular**
- Prefer descriptive variable names over abbreviations

### Geospatial Best Practices

- Use **EPSG:3413** (Arctic Stereographic) for computations
- Use **EPSG:4326** (WGS84) for visualization and library compatibility
- Store gridded data using **xarray with XDGGS** indexing
- Store tabular data as **Parquet**, array data as **Zarr**
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Handle the **antimeridian** correctly for polar regions
### Data Pipeline Conventions

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → `02alphaearth.sh` → `03era5.sh` → `04arcticdem.sh` → `05train.sh`
- Each pipeline stage should produce **reproducible intermediate outputs**
- Use `src/entropice/paths.py` for consistent path management
- The `FAST_DATA_DIR` environment variable controls the data directory location (default: `./data`)
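The path-management convention above can be sketched as follows. The helper names are illustrative; the actual functions in `paths.py` may differ:

```python
import os
from pathlib import Path


def data_dir() -> Path:
    """Resolve the data root from FAST_DATA_DIR, defaulting to ./data."""
    return Path(os.environ.get("FAST_DATA_DIR", "./data"))


def grids_dir() -> Path:
    """Directory holding the H3/HEALPix tessellations."""
    return data_dir() / "grids"
```

Centralizing paths this way keeps every pipeline stage pointed at the same storage hierarchy regardless of where the data directory lives.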
### Storage Hierarchy

All data follows this structure:

```
DATA_DIR/
├── grids/               # H3/HEALPix tessellations (GeoParquet)
├── darts/               # RTS labels (GeoParquet)
├── era5/                # Climate data (Zarr)
├── arcticdem/           # Terrain data (Icechunk Zarr)
├── alphaearth/          # Satellite embeddings (Zarr)
├── datasets/            # L2 XDGGS datasets (Zarr)
├── training-results/    # Models, CV results, predictions
└── watermask/           # Ocean mask (GeoParquet)
```
## Module Organization

### Core Modules (`src/entropice/`)

- **`grids.py`**: H3/HEALPix spatial grid generation with watermask
- **`darts.py`**: RTS label extraction from the DARTS v2 dataset
- **`era5.py`**: Climate data processing from ERA5 (Arctic-aligned years: Oct 1 - Sep 30)
- **`arcticdem.py`**: Terrain analysis from 32m Arctic elevation data
- **`alphaearth.py`**: Satellite image embeddings via Google Earth Engine
- **`aggregators.py`**: Raster-to-vector spatial aggregation engine
- **`dataset.py`**: Multi-source data integration and feature engineering
- **`training.py`**: Model training with eSPA, XGBoost, Random Forest, KNN
- **`inference.py`**: Batch prediction pipeline for trained models
- **`paths.py`**: Centralized path management

### Dashboard (`src/entropice/dashboard/`)

- Streamlit-based interactive visualization
- Modular pages: overview, training data, analysis, model state, inference
- Bokeh-based geospatial plotting utilities
- Run with: `pixi run dashboard`

### Scripts (`scripts/`)

- Numbered pipeline scripts (`00grids.sh` through `05train.sh`)
- Run the entire pipeline for multiple grid configurations
- Each script uses the CLIs from the core modules

### Notebooks (`notebooks/`)

- Exploratory analysis and validation
- **NOT committed to git**
- Keep production code in `src/entropice/`
## Development Workflow

### Setup

```bash
pixi install  # NOT pip install or conda install
```

### Running Tests

```bash
pixi run pytest
```

### Common Tasks

- **Generate grids**: Use the `grids.py` CLI
- **Process labels**: Use the `darts.py` CLI
- **Train models**: Use the `training.py` CLI with a TOML config
- **Run inference**: Use the `inference.py` CLI
- **View results**: `pixi run dashboard`
## Key Design Patterns

### 1. XDGGS Indexing

All geospatial data uses discrete global grid systems (H3 or HEALPix) via the `xdggs` library for consistent spatial indexing across sources.

### 2. Lazy Evaluation

Use Xarray/Dask for out-of-core computation with Zarr/Icechunk chunked storage to manage large datasets.

### 3. GPU Acceleration

Prefer GPU-accelerated operations:

- CuPy for array operations
- cuML for Random Forest and KNN
- XGBoost GPU training
- PyTorch tensors when applicable

### 4. Configuration as Code

- Use dataclasses for typed configuration
- TOML files for training hyperparameters
- Cyclopts for CLI argument parsing
## Data Sources

- **DARTS v2**: RTS labels (year, area, count, density)
- **ERA5**: Climate data (40-year history, Arctic-aligned years)
- **ArcticDEM**: 32m resolution terrain (slope, aspect, indices)
- **AlphaEarth**: 256-dimensional satellite embeddings
- **Watermask**: Ocean exclusion layer
## Model Support

Primary: **eSPA** (entropy-optimal Scalable Probabilistic Approximations)
Alternatives: XGBoost, Random Forest, K-Nearest Neighbors

Training features:

- Randomized hyperparameter search
- K-Fold cross-validation
- Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
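The randomized-search step above amounts to drawing independent samples from a parameter space. A minimal stdlib sketch, where the function name and example space are illustrative rather than the project's actual search space:

```python
import random


def sample_params(space: dict, n_iter: int, seed: int = 0) -> list[dict]:
    """Draw n_iter random hyperparameter combinations from a discrete search space."""
    rng = random.Random(seed)  # seeded so CV runs are reproducible
    return [
        {name: rng.choice(choices) for name, choices in space.items()}
        for _ in range(n_iter)
    ]
```

Each sampled combination would then be scored with K-Fold cross-validation under the multi-metric evaluation listed above; in practice, continuous parameters are drawn from SciPy distributions rather than discrete choices.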
## Extension Points

To extend Entropice:

- **New data source**: Follow the patterns in `era5.py` or `arcticdem.py`
- **Custom aggregations**: Add to the `_Aggregations` dataclass in `aggregators.py`
- **Alternative labels**: Implement an extractor following the `darts.py` pattern
- **New models**: Add scikit-learn compatible estimators to `training.py`
- **Dashboard pages**: Add Streamlit pages to the `dashboard/` module
## Important Notes

- Grid resolutions: **H3** (3-6), **HEALPix** (6-10)
- Arctic years run **October 1 to September 30** (not calendar years)
- Handle **antimeridian crossing** in polar regions
- Use **batch processing** for GPU memory management
- Notebooks are for exploration only - **keep production code in `src/`**
- Always use **absolute paths** or paths from `paths.py`
## Common Issues

- **Memory**: Use batch processing and Dask chunking for large datasets
- **GPU OOM**: Reduce the batch size in inference or training
- **Antimeridian**: Use the proper handling in `aggregators.py` for polar grids
- **Temporal alignment**: ERA5 uses Arctic-aligned years (Oct-Sep)
- **CRS**: Compute in EPSG:3413, visualize in EPSG:4326
## References

For more details, consult:

- [ARCHITECTURE.md](../ARCHITECTURE.md) - System architecture and design patterns
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Development workflow and standards
- [README.md](../README.md) - Project goals and setup instructions
56 .github/python.instructions.md (vendored, new file)
@@ -0,0 +1,56 @@
---
description: 'Python coding conventions and guidelines'
applyTo: '**/*.py,**/*.ipynb'
---
# Python Coding Conventions

## Python Instructions

- Write clear and concise comments for each function.
- Ensure functions have descriptive names and include type hints.
- Provide docstrings following PEP 257 conventions.
- Use the `typing` module for type annotations (e.g., `List[str]`, `Dict[str, int]`).
- Break down complex functions into smaller, more manageable functions.
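A short example combining these points (a descriptive name, type hints from `typing`, and a PEP 257 docstring):

```python
from typing import Dict, List


def count_words(lines: List[str]) -> Dict[str, int]:
    """Count how often each whitespace-separated word occurs in the given lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```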
## General Instructions

- Always prioritize readability and clarity.
- For algorithm-related code, include explanations of the approach used.
- Write code with good maintainability practices, including comments on why certain design decisions were made.
- Handle edge cases and write clear exception handling.
- For libraries or external dependencies, mention their usage and purpose in comments.
- Use consistent naming conventions and follow language-specific best practices.
- Write concise, efficient, and idiomatic code that is also easily understandable.
## Code Style and Formatting

- Follow the **PEP 8** style guide for Python.
- Maintain proper indentation (use 4 spaces per indentation level).
- Ensure lines do not exceed 79 characters.
- Place function and class docstrings immediately after the `def` or `class` signature.
- Use blank lines to separate functions, classes, and code blocks where appropriate.
## Edge Cases and Testing

- Always include test cases for the critical paths of the application.
- Account for common edge cases like empty inputs, invalid data types, and large datasets.
- Include comments for edge cases and the expected behavior in those cases.
- Write unit tests for functions and document them with docstrings explaining the test cases.
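For instance, an empty-input edge case and its unit test might look like this (a minimal illustration):

```python
def safe_mean(values: list) -> float:
    """Return the arithmetic mean of values, or 0.0 for an empty list (edge case)."""
    return sum(values) / len(values) if values else 0.0


def test_safe_mean() -> None:
    """safe_mean handles the empty-input edge case and the ordinary case."""
    assert safe_mean([]) == 0.0        # empty input must not raise ZeroDivisionError
    assert safe_mean([2.0, 4.0]) == 3.0
```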
## Example of Proper Documentation

```python
import math


def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given the radius.

    Parameters:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle, calculated as π * radius^2.
    """
    return math.pi * radius ** 2
```
217 ARCHITECTURE.md (new file)
@@ -0,0 +1,217 @@
# Entropice Architecture

## Overview

Entropice is a geospatial machine learning system for predicting Retrogressive Thaw Slump (RTS) density across the Arctic using entropy-optimal Scalable Probabilistic Approximations (eSPA).
The system processes multi-source geospatial data, aggregates it into discrete global grids, and trains probabilistic classifiers to estimate RTS occurrence patterns.
It enables rigorous experimentation with both the data and the models by modularizing and abstracting most of the data-flow pipeline.
Alternative models, datasets, and target labels can therefore be swapped in for comparison, or used to explore methodologies beyond RTS prediction and eSPA modelling.
## Core Architecture

### Data Flow Pipeline

```txt
Data Sources → Grid Aggregation → Feature Engineering → Dataset Assembly → Model Training → Inference → Visualization
```

The pipeline follows a sequential processing approach where each stage produces intermediate datasets that feed into subsequent stages:

1. **Grid Generation** (H3/HEALPix) → Parquet files
2. **Label Extraction** (DARTS) → Grid-enriched labels
3. **Feature Extraction** (ERA5, ArcticDEM, AlphaEarth) → L2 Datasets (XDGGS Xarray)
4. **Dataset Ensemble** → Training-ready tabular data
5. **Model Training** → Pickled classifiers + CV results
6. **Inference** → GeoDataFrames with predictions
7. **Dashboard** → Interactive visualizations
### System Components

#### 1. Spatial Grid System (`grids.py`)

- **Purpose**: Creates global tessellations (discrete global grid systems) for spatial aggregation
- **Grid Types & Levels**: H3 hexagonal grids and HEALPix grids
  - H3: 3, 4, 5, 6
  - HEALPix: 6, 7, 8, 9, 10
- **Functionality**:
  - Generates cell IDs and geometries at configurable resolutions
  - Applies a watermask to exclude ocean areas
  - Provides spatial indexing for efficient data aggregation
- **Output**: GeoDataFrames with cell IDs, geometries, and land areas

#### 2. Label Management (`darts.py`)

- **Purpose**: Extracts RTS labels from the DARTS v2 dataset
- **Processing**:
  - Spatial overlay of DARTS polygons with grid cells
  - Temporal aggregation across multiple observation years
  - Computes RTS count, area, density, and coverage per cell
  - Supports both standard DARTS and ML training labels
- **Output**: Grid-enriched Parquet files with RTS metrics
#### 3. Feature Extractors

**ERA5 Climate Data (`era5.py`)**

- Downloads hourly climate variables from the Copernicus Climate Data Store
- Computes daily aggregates: temperature extrema, precipitation, snow metrics
- Derives thaw/freeze metrics: degree days, thawing period length
- Temporal aggregations: yearly, seasonal, shoulder seasons
- Uses Arctic-aligned years (October 1 - September 30)
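The Arctic-aligned year convention can be pinned down with a small helper. This is a sketch assuming each Arctic year is labeled by the calendar year of its September 30 end; the real pipeline may label years differently:

```python
from datetime import date


def arctic_year(d: date) -> int:
    """Return the Arctic-aligned year containing date d, where each year
    runs from October 1 through September 30 and is labeled by the
    calendar year in which it ends."""
    return d.year + 1 if d.month >= 10 else d.year
```

Under this convention, October 2020 through September 2021 all map to the 2021 Arctic year.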
**ArcticDEM Terrain (`arcticdem.py`)**

- Processes 32m resolution Arctic elevation data
- Computes terrain derivatives: slope, aspect, curvature
- Calculates terrain indices: TPI, TRI, ruggedness, VRM
- Applies watermask clipping and GPU-accelerated convolutions
- Aggregates terrain statistics per grid cell

**AlphaEarth Embeddings (`alphaearth.py`)**

- Extracts 256-dimensional satellite image embeddings via Google Earth Engine
- Uses foundation models to capture visual patterns
- Partitions large grids using KMeans clustering
- Temporal sampling across multiple years
#### 4. Spatial Aggregation Framework (`aggregators.py`)

- **Core Capability**: Raster-to-vector aggregation engine
- **Methods**:
  - Exact geometry-based aggregation using polygon overlay
  - Grid-aligned batch processing for memory efficiency
  - Statistical aggregations: mean, sum, std, min, max, median, quantiles
- **Optimization**:
  - Antimeridian handling for polar regions
  - Parallel processing with worker pools
  - GPU acceleration via CuPy/cuML where applicable
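Conceptually, the raster-to-vector step groups raster samples by the grid cell they fall in and reduces each group to summary statistics. A simplified pure-Python sketch of that reduction (the real engine operates on arrays and exact polygon overlays, not Python lists):

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_cell(cell_ids: list, values: list) -> dict:
    """Group raster samples by grid-cell id and compute per-cell statistics."""
    groups: dict = defaultdict(list)
    for cid, value in zip(cell_ids, values):
        groups[cid].append(value)
    return {
        cid: {"mean": mean(vs), "sum": sum(vs), "min": min(vs), "max": max(vs)}
        for cid, vs in groups.items()
    }
```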
#### 5. Dataset Assembly (`dataset.py`)

- **DatasetEnsemble Class**: Orchestrates multi-source data integration
- **L2 Datasets**: Standardized XDGGS Xarray datasets per data source
- **Features**:
  - Dynamic dataset configuration via dataclasses
  - Multi-task support: binary classification, count, density estimation
  - Target binning: quantile-based categorical bins
  - Train/test splitting with spatial awareness
  - GPU-accelerated data loading (PyTorch/CuPy)
- **Output**: Tabular feature matrices ready for the scikit-learn API
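The quantile-based target binning mentioned above maps a continuous target (e.g. RTS density) to categorical classes by its empirical quantiles. A simplified sketch; the bin count and edge handling are illustrative, not the project's actual scheme:

```python
import bisect


def quantile_bins(values: list, n_bins: int = 4) -> list:
    """Assign each value a bin index 0..n_bins-1 based on empirical quantiles."""
    ordered = sorted(values)
    # interior quantile edges; values equal to an edge fall into the upper bin
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]
    return [bisect.bisect_right(edges, v) for v in values]
```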
#### 6. Model Training (`training.py`)

- **Supported Models**:
  - **eSPA**: Entropy-optimal probabilistic classifier (primary)
  - **XGBoost**: Gradient boosting (GPU-accelerated)
  - **Random Forest**: cuML implementation
  - **K-Nearest Neighbors**: cuML implementation
- **Training Strategy**:
  - Randomized hyperparameter search (SciPy distributions)
  - K-Fold cross-validation
  - Multi-metric evaluation (accuracy, F1, Jaccard, precision, recall)
- **Configuration**: TOML-based configuration with a Cyclopts CLI
- **Output**: Pickled models, CV results, feature importance

#### 7. Inference (`inference.py`)

- Batch prediction pipeline for trained classifiers
- GPU memory management with configurable batch sizes
- Outputs GeoDataFrames with predicted classes and probabilities
- Supports all trained model types via a unified interface
#### 8. Interactive Dashboard (`dashboard/`)

- **Technology**: Streamlit-based web application
- **Pages**:
  - **Overview**: Result directory summary and metrics
  - **Training Data**: Feature distribution visualizations
  - **Training Analysis**: CV results, hyperparameter analysis
  - **Model State**: Feature importance, decision boundaries
  - **Inference**: Spatial prediction maps
- **Plotting Modules**: Bokeh-based interactive geospatial plots
## Key Design Patterns

### 1. XDGGS Integration

All geospatial data is indexed using discrete global grid systems (H3 or HEALPix) via the `xdggs` library, enabling:

- Consistent spatial indexing across data sources
- Efficient spatial joins and aggregations
- Multi-resolution analysis capabilities

### 2. Lazy Evaluation & Chunking

- Xarray/Dask for out-of-core computation
- Zarr/Icechunk for chunked storage
- Batch processing to manage GPU memory

### 3. GPU Acceleration

- CuPy for array operations
- cuML for machine learning (RF, KNN)
- XGBoost GPU training
- PyTorch tensor operations
### 4. Modular CLI Design

Each module exposes a Cyclopts-based CLI for standalone execution. Higher-level shell scripts use these CLIs to run stages of the data flow for all grid-level combinations:

```bash
scripts/00grids.sh       # Grid generation
scripts/01darts.sh       # Label extraction
scripts/02alphaearth.sh  # Satellite embeddings
scripts/03era5.sh        # Climate features
scripts/04arcticdem.sh   # Terrain features
scripts/05train.sh       # Model training
```
### 5. Configuration as Code

- Dataclasses for typed configuration
- TOML files for training hyperparameters
- Environment-based path management (`paths.py`)
## Data Storage Hierarchy

```sh
DATA_DIR/
├── grids/               # Hexagonal/HEALPix tessellations (GeoParquet)
├── darts/               # RTS labels (GeoParquet)
├── era5/                # Climate data (Zarr)
├── arcticdem/           # Terrain data (Icechunk Zarr)
├── alphaearth/          # Satellite embeddings (Zarr)
├── datasets/            # L2 XDGGS datasets (Zarr)
├── training-results/    # Models, CV results, predictions (Parquet, NetCDF, Pickle)
└── watermask/           # Ocean mask (GeoParquet)
```
## Technology Stack

- **Core**: Python 3.13, NumPy, Pandas, Xarray, GeoPandas
- **Spatial**: H3, xdggs, xvec, Shapely, Rasterio
- **ML**: scikit-learn, XGBoost, cuML, entropy (eSPA)
- **GPU**: CuPy, PyTorch, CUDA
- **Storage**: Zarr, Icechunk, Parquet, NetCDF
- **Visualization**: Matplotlib, Seaborn, Bokeh, Streamlit, Cartopy, PyDeck, Altair, Plotly
- **CLI**: Cyclopts, Rich
- **External APIs**: Google Earth Engine, Copernicus Climate Data Store
## Performance Considerations

- **Memory Management**: Batch processing with configurable chunk sizes
- **Parallelization**: Multi-process aggregation, Dask distributed computing
- **GPU Utilization**: CuPy/cuML for array operations and ML training
- **Storage Optimization**: Blosc compression, Zarr chunking strategies
- **Spatial Indexing**: Grid-based partitioning for large-scale operations
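The first bullet (batch processing with configurable chunk sizes) can be illustrated with a small stand-in; `batched_mean` is a toy aggregation, not project code, but shows the memory-bounded loop shape:

```python
# Sketch of memory-bounded batch processing: iterate over fixed-size chunks
# and fold them into a running aggregate instead of materializing everything.
import numpy as np


def batched_mean(values: np.ndarray, chunk_size: int = 100_000) -> float:
    """Compute a mean one chunk at a time (toy stand-in for real aggregation)."""
    total, count = 0.0, 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += float(chunk.sum())
        count += chunk.size
    return total / count
```

The same pattern generalizes to per-cell aggregations, with `chunk_size` exposed as a configuration knob.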
## Extension Points

The architecture supports extension through:

- **New Data Sources**: Implement feature extractor following ERA5/ArcticDEM patterns
- **Custom Aggregations**: Add methods to `_Aggregations` dataclass
- **Alternative Targets**: Implement label extractor following DARTS pattern
- **Alternative Models**: Extend training CLI with new scikit-learn compatible estimators
- **Dashboard Pages**: Add Streamlit pages to `dashboard/` module
- **Grid Systems**: Support additional DGGS via xdggs integration
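As a sketch of the "scikit-learn compatible estimator" extension point, the skeleton below shows the minimal `fit`/`predict`/`get_params`/`set_params` contract such an estimator needs; the majority-class baseline is purely illustrative and not part of the project:

```python
# Hypothetical skeleton of an estimator the training CLI could accept.
# It duck-types the scikit-learn interface without importing sklearn.
import numpy as np


class MajorityClassifier:
    """Predicts the most frequent training class (illustrative baseline)."""

    def __init__(self, tie_break: str = "lowest"):
        self.tie_break = tie_break

    def get_params(self, deep: bool = True) -> dict:
        return {"tie_break": self.tie_break}

    def set_params(self, **params) -> "MajorityClassifier":
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]  # ties resolve to lowest class
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)
```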
116
CONTRIBUTING.md
Normal file
@@ -0,0 +1,116 @@
# Contributing to Entropice

Thank you for your interest in contributing to Entropice! This document provides guidelines for contributing to the project.

## Getting Started

### Prerequisites

- Python 3.13
- CUDA 12 compatible GPU (for full functionality)
- [Pixi package manager](https://pixi.sh/)

### Setup

```bash
pixi install
```

This will set up the complete environment including RAPIDS, PyTorch, and all geospatial dependencies.
## Development Workflow

### Code Organization

- **`src/entropice/`**: Core modules (grids, data sources, training, inference)
- **`src/entropice/dashboard/`**: Streamlit visualization dashboard
- **`scripts/`**: Data processing pipeline scripts (numbered 00-05)
- **`notebooks/`**: Exploratory analysis and validation notebooks
- **`tests/`**: Unit tests

### Key Modules

- `grids.py`: H3/HEALPix spatial grid systems
- `darts.py`, `era5.py`, `arcticdem.py`, `alphaearth.py`: Data source processors
- `dataset.py`: Dataset assembly and feature engineering
- `training.py`: Model training with eSPA/SPARTAn
- `inference.py`: Prediction generation
## Coding Standards

### Python Style

- Follow PEP 8 conventions
- Use type hints for function signatures
- Prefer numpy-style docstrings for public functions
- Keep functions focused and modular
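A small example of the conventions above (type hints plus a numpy-style docstring); the helper itself is made up for illustration, though its formula matches the `rts_density` definition used elsewhere in the project docs:

```python
# Illustrative helper demonstrating type hints and a numpy-style docstring.
def rts_density(rts_area: float, land_area: float) -> float:
    """Compute RTS density for a grid cell.

    Parameters
    ----------
    rts_area : float
        Total RTS area within the cell, in km².
    land_area : float
        Total land area of the cell, in km².

    Returns
    -------
    float
        Fraction of the land area covered by RTS, in [0, 1].

    Raises
    ------
    ValueError
        If ``land_area`` is not positive.
    """
    if land_area <= 0:
        raise ValueError("land_area must be positive")
    return rts_area / land_area
```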
### Geospatial Best Practices

- Use **xarray** with XDGGS for gridded data storage
- Store intermediate results as **Parquet** (tabular) or **Zarr** (arrays)
- Leverage **Dask** for lazy evaluation of large datasets
- Use **GeoPandas** for vector operations
- Use the EPSG:3413 (Arctic Stereographic) coordinate reference system (CRS) for any computation on the data, and EPSG:4326 (WGS84) for data visualization and for compatibility with some libraries
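A sketch of the CRS convention with GeoPandas: reproject to EPSG:3413 for metric computations and back to EPSG:4326 only for display. The geometry here is toy data, not project data:

```python
# Compute in the metric Arctic Stereographic CRS, visualize in lon/lat.
import geopandas as gpd
from shapely.geometry import Point

# A single toy point given in lon/lat (EPSG:4326)
gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[Point(-50.0, 70.0)], crs="EPSG:4326")

metric = gdf.to_crs("EPSG:3413")             # metres: safe for areas/distances
buffered = metric.buffer(10_000)             # e.g. a 10 km buffer
for_plotting = buffered.to_crs("EPSG:4326")  # back to lon/lat for map libraries
```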
### Data Pipeline

- Follow the numbered script sequence: `00grids.sh` → `01darts.sh` → ... → `05train.sh`
- Each stage should produce reproducible intermediate outputs
- Document data dependencies in module docstrings
- Use `paths.py` for consistent path management
## Testing

Run the test suite:

```bash
pixi run pytest
```

When adding features, include tests that verify:

- Correct handling of geospatial coordinates and projections
- Proper aggregation to grid cells
- Data integrity through pipeline stages
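A hedged example of such a test in pytest style; `aggregate_to_cells` is a made-up stand-in for the real grid aggregation, included so the test is self-contained:

```python
# Toy aggregation plus a pytest-style check of aggregation and data integrity.
import numpy as np


def aggregate_to_cells(cell_ids: np.ndarray, values: np.ndarray) -> dict:
    """Sum values per cell id (stand-in for the real grid aggregation)."""
    out: dict = {}
    for cid, val in zip(cell_ids, values):
        out[cid] = out.get(cid, 0.0) + float(val)
    return out


def test_aggregation_preserves_totals():
    cells = np.array(["a", "a", "b"])
    vals = np.array([1.0, 2.0, 4.0])
    agg = aggregate_to_cells(cells, vals)
    assert agg == {"a": 3.0, "b": 4.0}
    # Data integrity: nothing is lost or double-counted across the stage
    assert sum(agg.values()) == vals.sum()
```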
## Submitting Changes

### Pull Request Process

1. **Branch**: Create a feature branch from `main`
2. **Commit**: Write clear, descriptive commit messages
3. **Test**: Verify your changes don't break existing functionality
4. **Document**: Update relevant docstrings and documentation
5. **PR**: Submit a pull request with a clear description of changes

### Commit Messages

- Use present tense: "Add feature" not "Added feature"
- Reference issues when applicable: "Fix #123: Correct grid aggregation"
- Keep the first line under 72 characters
## Working with Data

### Local Development

- Set the `FAST_DATA_DIR` environment variable to the data directory (default: `./data`)

### Notebooks

- Notebooks in `notebooks/` are for exploration and validation; they are not committed to git
- Keep production code in `src/entropice/`
## Dashboard Development

Run the dashboard locally:

```bash
pixi run dashboard
```

Dashboard code is in `src/entropice/dashboard/` with modular pages and plotting utilities.

## Questions?

For questions about the architecture, see `ARCHITECTURE.md`. For scientific background, see `README.md`.
172
README.md

````diff
@@ -1,45 +1,145 @@
-# eSPA for RTS
+# Entropice
 
-Goal of this project is to utilize the entropy-optimal Scalable Probabilistic Approximations algorithm (eSPA) to create a model which can estimate the density of Retrogressive-Thaw-Slumps (RTS) across the globe with different levels of detail.
-Hoping, that a successful training could gain new knowledge about RTX-proxies.
+**Geospatial Machine Learning for Arctic Permafrost Degradation.**
+
+Entropice is a geospatial machine learning system for predicting **Retrogressive Thaw Slump (RTS)** density across the Arctic using **entropy-optimal Scalable Probabilistic Approximations (eSPA)**.
+
+The system integrates multi-source geospatial data (climate, terrain, satellite imagery) into discrete global grids and trains probabilistic classifiers to estimate RTS occurrence patterns at multiple spatial resolutions.
 
-## Setup
+## Scientific Background
 
-```sh
-uv sync
+### Retrogressive Thaw Slumps
+
+Retrogressive Thaw Slumps are Arctic landslides caused by permafrost degradation. As ice-rich permafrost thaws, ground collapses create distinctive bowl-shaped features that retreat upslope over time. RTS are:
+
+- **Climate indicators**: Sensitive to warming temperatures and changing precipitation
+- **Ecological disruptors**: Release sediment, nutrients, and greenhouse gases into Arctic waterways
+- **Infrastructure hazards**: Threaten communities and industrial facilities in permafrost regions
+- **Feedback mechanisms**: Accelerate local warming through albedo changes and carbon release
+
+Understanding RTS distribution patterns is critical for predicting permafrost stability under climate change.
+
+### The Challenge
+
+Current remote sensing approaches map a specific landscape feature and then extract spatio-temporal statistical information from that dataset.
+
+Traditional RTS mapping relies on manual digitization from satellite imagery (e.g., the DARTS v2 training dataset), which is:
+
+- Labor-intensive and limited in spatial/temporal coverage
+- Challenging due to cloud cover and seasonal visibility
+- Insufficient for pan-Arctic prediction at decision-relevant scales
+
+Modern mapping approaches utilize machine learning to create segmented labels from satellite imagery (e.g. the DARTS dataset), which comes with its own problems:
+
+- Huge data transfer needed between satellite imagery providers and the HPC systems where the models run
+- Large energy consumption in both data transfer and inference
+- Uncertainty about the quality of the results
+- Potential compute waste when running inference on regions where the searched landscape feature clearly does not exist
+
+### Our Approach
+
+Instead of global mapping followed by calculation of spatio-temporal statistics, Entropice tries to learn spatio-temporal patterns from a small subset based on a large variety of data features, to obtain an educated guess about the spatio-temporal statistics of a landscape feature.
+
+Entropice addresses this by:
+
+1. **Spatial Discretization across scales**: Representing the Arctic using discrete global grid systems (H3 hexagonal grids, HEALPix) at low to mid resolutions (levels)
+2. **Multi-Source Integration**: Aggregating climate (ERA5), terrain (ArcticDEM), and satellite embeddings (AlphaEarth) into feature-rich datasets to obtain environmental proxies across spatio-temporal scales
+3. **Probabilistic Modeling**: Training eSPA classifiers to predict RTS density classes based on environmental proxies
+
+This hopefully leads to the following advances in permafrost research:
+
+- Better understanding of RTS occurrence
+- A potential proxy for ice-rich permafrost
+- Reduced compute waste in image segmentation pipelines
+- Better modelling by providing better starting conditions
+
+### Entropy-Optimal Scalable Probabilistic Approximations (eSPA)
+
+eSPA is a probabilistic classification framework that:
+
+- Provides calibrated probability estimates (not just point predictions)
+- Handles imbalanced datasets common in geospatial phenomena
+- Captures uncertainty in predictions across poorly-sampled regions
+- Enables interpretable feature importance analysis
+
+This approach aims to discover which environmental variables best predict RTS occurrence, potentially revealing new proxies for permafrost vulnerability.
+
+## Key Features
+
+- **Modular Data Pipeline**: Sequential processing stages from raw data to trained models
+- **Multiple Grid Systems**: H3 (resolutions 3-6) and HEALPix (resolutions 6-10)
+- **GPU-Accelerated**: RAPIDS (CuPy, cuML) and PyTorch for large-scale computation
+- **Interactive Dashboard**: Streamlit-based visualization of training data, results, and predictions
+- **Reproducible Workflows**: Configuration-as-code with TOML files and CLI tools
+- **Extensible Architecture**: Support for alternative models (XGBoost, Random Forest, KNN) and data sources
+
+## Quick Start
+
+### Installation
+
+Requires Python 3.13 and a CUDA 12 compatible GPU.
+
+```bash
+pixi install
 ```
 
-## Project Plan
+This sets up the complete environment including RAPIDS, PyTorch, and geospatial libraries.
 
-1. Create global hexagon grids with h3
-2. Enrich the grids with data from various sources and with labels from DARTS v2
-3. Use eSPA for simple classification: hex has [many slumps / some slumps / few slumps / no slumps]
-4. use SPARTAn for regression: one for slumps density (area) and one for total number of slumps
+### Running the Pipeline
 
-### Data Sources and Engineering
+Execute the numbered scripts to process data and train models:
 
-- Labels
-  - `"year"`: Year of observation
-  - `"area"`: Total land-area of the hexagon
-  - `"rts_density"`: Area of RTS divided by total land-area
-  - `"rts_count"`: Number of single RTS instances
-- ERA5 (starting 40 years from `"year"`)
-  - `"temp_yearXXXX_qY"`: Y-th quantile temperature of year XXXX. Used to enter the temperature distribution into the model.
-  - `"thawing_days_yearXXXX"`: Number of thawing-days of year XXXX.
-  - `"precip_yearXXXX_qY"`: Y-th quantile precipitation of year XXXX. Similar to temperature.
-  - `"temp_5year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th quantile temperature between year XXXX and XXXX. Always 5 years difference.
-  - `"temp_10year_diff_XXXXtoXXXX_qY"`: Difference of the Y-th quantile temperature between year XXXX and XXXX. Always 10 years difference.
-  - `"temp_diff_qY"`: Difference of the Y-th quantile temperature between year XXXX and XXXX. Always 10 years difference.
-- ArcticDEM
-  - `"dissection_index"`: Dissection Index, (max - min) / max
-  - `"max_elevation"`: Maximum elevation
-  - `"elevationX_density"`: Area where the elevation is larger than X divided by the total land-area
-- TCVIS
-  - ???
-- Wildfire???
-- Permafrost???
-- GroundIceContent???
-- Biome
+```bash
+scripts/00grids.sh      # Generate spatial grids
+scripts/01darts.sh      # Extract RTS labels from DARTS v2
+scripts/02alphaearth.sh # Extract satellite embeddings
+scripts/03era5.sh       # Process climate data
+scripts/04arcticdem.sh  # Compute terrain features
+scripts/05train.sh      # Train models
+```
 
-**About temporals** Every label has its own year - all temporal dependent data features, e.g. `"temp_5year_diff_XXXXtoXXXX_qY"` are calculated respective to that year.
-The number of years added from a dataset is always the same, e.g. for ERA5 for an observation in 2024 the ERA5 data would start in 1984 and for an observation from 2023 in 1983.
+### Visualizing Results
+
+Launch the interactive dashboard:
+
+```bash
+pixi run dashboard
+```
+
+Explore training data distributions, cross-validation results, feature importance, and spatial predictions.
+
+## Data Sources
+
+- **DARTS v2**: RTS labels (polygons with year, area, count)
+- **ERA5**: Climate reanalysis (40-year history, Arctic-aligned years)
+- **ArcticDEM**: 32m resolution terrain elevation
+- **AlphaEarth**: 64-dimensional satellite image embeddings
+
+## Project Structure
+
+- `src/entropice/`: Core modules (grids, data processors, training, inference)
+- `src/entropice/dashboard/`: Streamlit visualization application
+- `scripts/`: Data processing pipeline automation
+- `notebooks/`: Exploratory analysis (not version-controlled)
+
+## Documentation
+
+- **[ARCHITECTURE.md](ARCHITECTURE.md)**: System design, components, and data flow
+- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Development guidelines and standards
+
+## Research Goals
+
+1. **Predictive Modeling**: Estimate RTS density at unobserved locations
+2. **Proxy Discovery**: Identify environmental variables most predictive of RTS occurrence
+3. **Multi-Scale Analysis**: Compare model performance across spatial resolutions
+4. **Uncertainty Quantification**: Provide calibrated probabilities for decision-making
+
+## License
+
+TODO
+
+## Citation
+
+If you use Entropice in your research, please cite:
+
+```txt
+TODO
+```
````
```diff
@@ -284,7 +284,7 @@ def render_sample_count_overview(cache: DatasetAnalysisCache):
         fig.update_traces(text=pivot_df.values, texttemplate="%{text:,}", textfont_size=10)
 
         fig.update_layout(height=400)
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
 
     with tab2:
         st.markdown("### Sample Counts Bar Chart")
@@ -311,7 +311,7 @@ def render_sample_count_overview(cache: DatasetAnalysisCache):
         # Update facet labels to be cleaner
         fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
         fig.update_xaxes(tickangle=-45)
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
 
     with tab3:
         st.markdown("### Detailed Sample Counts")
@@ -325,7 +325,7 @@ def render_sample_count_overview(cache: DatasetAnalysisCache):
         for col in ["Samples (Coverage)", "Samples (Labels)", "Samples (Both)"]:
             display_df[col] = display_df[col].apply(lambda x: f"{x:,}")
 
-        st.dataframe(display_df, hide_index=True, use_container_width=True)
+        st.dataframe(display_df, hide_index=True, width="stretch")
 
 
 def render_feature_count_comparison(cache: DatasetAnalysisCache):
@@ -363,7 +363,7 @@ def render_feature_count_comparison(cache: DatasetAnalysisCache):
         )
 
         fig.update_layout(height=500, xaxis_tickangle=-45)
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
 
         # Add secondary metrics
         col1, col2 = st.columns(2)
@@ -384,7 +384,7 @@ def render_feature_count_comparison(cache: DatasetAnalysisCache):
             )
             fig_cells.update_traces(texttemplate="%{text:,}", textposition="outside")
             fig_cells.update_layout(xaxis_tickangle=-45, showlegend=False)
-            st.plotly_chart(fig_cells, use_container_width=True)
+            st.plotly_chart(fig_cells, width="stretch")
 
         with col2:
             fig_samples = px.bar(
@@ -399,7 +399,7 @@ def render_feature_count_comparison(cache: DatasetAnalysisCache):
             )
             fig_samples.update_traces(texttemplate="%{text:,}", textposition="outside")
             fig_samples.update_layout(xaxis_tickangle=-45, showlegend=False)
-            st.plotly_chart(fig_samples, use_container_width=True)
+            st.plotly_chart(fig_samples, width="stretch")
 
     with comp_tab2:
         st.markdown("#### Feature Breakdown by Data Source")
@@ -435,7 +435,7 @@ def render_feature_count_comparison(cache: DatasetAnalysisCache):
             )
             fig.update_traces(textposition="inside", textinfo="percent")
             fig.update_layout(showlegend=True, height=350)
-            st.plotly_chart(fig, use_container_width=True)
+            st.plotly_chart(fig, width="stretch")
 
     with comp_tab3:
         st.markdown("#### Detailed Feature Count Comparison")
@@ -452,7 +452,7 @@ def render_feature_count_comparison(cache: DatasetAnalysisCache):
         # Format boolean as Yes/No
         display_df["AlphaEarth"] = display_df["AlphaEarth"].apply(lambda x: "✓" if x else "✗")
 
-        st.dataframe(display_df, hide_index=True, use_container_width=True)
+        st.dataframe(display_df, hide_index=True, width="stretch")
 
 
 @st.fragment
@@ -581,10 +581,10 @@ def render_feature_count_explorer(cache: DatasetAnalysisCache):
             color_discrete_sequence=source_colors,
         )
         fig.update_traces(textposition="inside", textinfo="percent+label")
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
 
         # Show detailed table
-        st.dataframe(breakdown_df, hide_index=True, use_container_width=True)
+        st.dataframe(breakdown_df, hide_index=True, width="stretch")
 
     # Detailed member information
     with st.expander("📦 Detailed Source Information", expanded=False):
```

```diff
@@ -108,7 +108,7 @@ def render_class_distribution_histogram(predictions_gdf: gpd.GeoDataFrame, task:
         xaxis={"tickangle": -45 if len(categories) > 3 else 0},
     )
 
-    st.plotly_chart(fig, use_container_width=True)
+    st.plotly_chart(fig, width="stretch")
 
     # Show percentages in a table
     with st.expander("📋 Detailed Class Distribution", expanded=False):
@@ -119,7 +119,7 @@ def render_class_distribution_histogram(predictions_gdf: gpd.GeoDataFrame, task:
                 "Percentage": (class_counts.to_numpy() / len(predictions_gdf) * 100).round(2),
             }
         )
-        st.dataframe(distribution_df, hide_index=True, use_container_width=True)
+        st.dataframe(distribution_df, hide_index=True, width="stretch")
 
 
 def render_spatial_distribution_stats(predictions_gdf: gpd.GeoDataFrame):
@@ -394,7 +394,7 @@ def render_class_comparison(predictions_gdf: gpd.GeoDataFrame, task: str):
             showlegend=True,
         )
 
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
 
     with col2:
         st.markdown("**Cumulative Distribution")
@@ -426,4 +426,4 @@ def render_class_comparison(predictions_gdf: gpd.GeoDataFrame, task: str):
             yaxis={"range": [0, 105]},
         )
 
-        st.plotly_chart(fig, use_container_width=True)
+        st.plotly_chart(fig, width="stretch")
```
File diff suppressed because it is too large