# Module Explanation

After fitting IsoGraph on a dataset, the `isograph.explain` module identifies the transcript-level and gene-local features that best explain each discovered module. The key biological object is a within-gene isoform switch: a module driver should be a directional transcript-usage contrast within a gene, not just a transcript with high marginal correlation.

## Quick Start

```bash
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --module-ids M000 M001 \
  --plot \
  --output-dir artifacts/explain/run1
```

## Python API

```python
from isograph.explain import explain_module, ExplainConfig

results = explain_module(
    artifact_dir="artifacts/fits/my_dataset",
    feature_table=feature_df,   # samples × features DataFrame
    feature_meta=meta_df,       # feature_id, gene_id, feature_type columns
    module_ids=["M000", "M001"],
    config=ExplainConfig(plot=True),
    output_dir="artifacts/explain/run1",
)

# results["M000"].gene_driver_table: top gene drivers sorted by |r|
# results["M000"].transcript_polarity_table: per-transcript correlations
# results["M000"].high_vs_low_table: high- vs. low-module contrast
```

`explain_module` returns a `dict[str, ExplainResult]`.

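
As a hedged illustration of the two DataFrame inputs to `explain_module`, the sketch below builds a toy `feature_df` and `meta_df` with pandas. The sample IDs, feature IDs, and the `feature_type` value `"transcript"` are invented for illustration; only the three metadata column names (`feature_id`, `gene_id`, `feature_type`) come from this documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 samples, 2 genes with 2 transcripts each.
rng = np.random.default_rng(0)
sample_ids = [f"S{i:02d}" for i in range(6)]
feature_ids = ["G1:T1", "G1:T2", "G2:T1", "G2:T2"]

# Samples x features table, indexed by sample ID.
feature_df = pd.DataFrame(
    rng.random((len(sample_ids), len(feature_ids))),
    index=pd.Index(sample_ids, name="sample_id"),
    columns=feature_ids,
)

# One metadata row per feature, carrying the three documented columns.
# The "transcript" value is an assumption, not a documented vocabulary.
meta_df = pd.DataFrame(
    {
        "feature_id": feature_ids,
        "gene_id": ["G1", "G1", "G2", "G2"],
        "feature_type": ["transcript"] * len(feature_ids),
    }
)

# Sanity check: every feature column should have a metadata row.
missing = set(feature_df.columns) - set(meta_df["feature_id"])
```

A check like `missing` is cheap to run before the call and makes mismatches between the feature table and its metadata visible early.
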
## Inputs

| Input | Required | Description |
|---|---|---|
| `artifact_dir` | yes | Must contain `modules.parquet` and `feature_scores.parquet` |
| `feature_table` | yes | Samples × features DataFrame (index or `sample_id` column) |
| `feature_meta` | yes | Columns: `feature_id`, `gene_id`, `feature_type` (required); `gene_name`, `transcript_id`, `exon_id`, `event_id`, `source_coordinate` (optional) |
| `module_ids` | no | Subset of module IDs to explain (default: all modules) |
| `module_score_table` | no | Precomputed eigengene scores (overrides computed eigengenes) |
| `annotation_table` | no | Structural labels from `isograph annotate-structure` |

## Outputs

For each module, `explain-module` writes a subdirectory `{output_dir}/{module_id}/`:

### `gene_driver_table.parquet`

One row per gene. Columns: `gene_id`, `r` (Pearson correlation with module eigengene), `pvalue`, `qvalue` (BH-corrected), `n_samples`, `missing_fraction`. Sorted by `|r|`.

### `transcript_polarity_table.parquet`

One row per feature. Columns: `feature_id`, `gene_id`, `r`, `pvalue`, `qvalue`, `n_samples`, `missing_fraction`, `switch_strength`.

`switch_strength = max(r) − min(r)` across transcripts within a gene. High values identify genes with directional isoform switching.

### `high_vs_low_table.parquet`

One row per feature. Contrasts mean feature usage in the top vs. bottom percentile of samples ranked by module eigengene. Columns: `feature_id`, `gene_id`, `mean_high`, `mean_low`, `delta`, `se`, `tstat`, `pvalue`, `n_high`, `n_low`, `missing_fraction`.

### `module_explanation_manifest.json`

Shared manifest at `{output_dir}/module_explanation_manifest.json` recording config, module IDs, file paths, and optional feature availability.

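
The `switch_strength` definition given for `transcript_polarity_table.parquet` can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names `pearson_r` and `switch_strength_table`, the toy usage values, and the assumption that `usage` and `eigengene` share the same sample order are all hypothetical.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation of two 1-D arrays, skipping positions where either is NaN."""
    mask = ~(np.isnan(x) | np.isnan(y))
    x, y = x[mask], y[mask]
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")


def switch_strength_table(usage, eigengene, gene_of):
    """Per-feature r against the eigengene, plus max(r) - min(r) within each gene."""
    rows = []
    for fid in usage.columns:
        r = pearson_r(usage[fid].to_numpy(float), eigengene.to_numpy(float))
        rows.append({"feature_id": fid, "gene_id": gene_of[fid], "r": r})
    table = pd.DataFrame(rows)
    grouped = table.groupby("gene_id")["r"]
    table["switch_strength"] = grouped.transform("max") - grouped.transform("min")
    return table


# Toy data: G1 transcripts move in opposite directions, G2 transcripts move together.
eigengene = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
usage = pd.DataFrame(
    {
        "G1:T1": [0.1, 0.2, 0.3, 0.4, 0.5],
        "G1:T2": [0.9, 0.8, 0.7, 0.6, 0.5],
        "G2:T1": [0.2, 0.3, 0.4, 0.5, 0.6],
        "G2:T2": [0.1, 0.3, 0.5, 0.7, 0.9],
    }
)
gene_of = {"G1:T1": "G1", "G1:T2": "G1", "G2:T1": "G2", "G2:T2": "G2"}

table = switch_strength_table(usage, eigengene, gene_of)
strength = table.drop_duplicates("gene_id").set_index("gene_id")["switch_strength"]
```

In this toy case `G1` reaches a switch strength close to 2 (one transcript at r ≈ 1, the other at r ≈ −1), while `G2` stays close to 0 even though both of its transcripts correlate strongly with the eigengene. That is the distinction between a directional switch and high marginal correlation.
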
## Publication-Ready Plots

Pass `--plot` (or `ExplainConfig(plot=True)`) to write per-module figures:

- **Top driver genes barplot**: |r| with 95% CI, colored by polarity direction
- **Transcript usage gradient**: x = module eigengene score, y = transcript usage, smoothed trend per transcript for the top N driver genes
- **Positive/negative driver heatmap**: heatmap of normalized usage for top positive and negative driver genes

Use `--output-format pdf` or `--output-format png pdf` to control format.

## Structural Annotation

`isograph annotate-structure` assigns GTF-based structural labels to transcript switch pairs (exon changes, CDS/UTR shifts, biotype switches). Pass the output to `--annotation-table` to merge those labels into the driver tables.

```bash
# Step 1: annotate switch pairs
isograph annotate-structure \
  --gtf gencode.v47.annotation.gtf.gz \
  --switch-pairs switch_pairs.tsv \
  --gtf-cache gencode_v47_cache.parquet \
  --output transcript_structure_annotations.tsv

# Step 2: explain with structural context
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --annotation-table transcript_structure_annotations.tsv \
  --output-dir artifacts/explain/run1
```

The GTF cache (written on the first run) avoids re-parsing the full GTF on subsequent runs. Parsing GENCODE v47 from scratch takes ~20 minutes on NFS; the cache loads in seconds.

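
To make the idea of merging structural labels into a driver table concrete, here is a minimal pandas sketch. It assumes the annotation table can be keyed by `gene_id` and carries a label column called `structural_label`; both the join key and that column name are hypothetical, since the annotation file's schema is not documented here, and the label values are invented.

```python
import pandas as pd

# Toy driver table with the documented gene_id and r columns.
driver_table = pd.DataFrame(
    {"gene_id": ["G1", "G2", "G3"], "r": [0.8, -0.6, 0.1]}
)

# Hypothetical annotation table: one structural label per gene.
annotations = pd.DataFrame(
    {"gene_id": ["G1", "G2"], "structural_label": ["exon_change", "utr_shift"]}
)

# Left join keeps every driver row; genes without an annotation get NaN.
# validate="many_to_one" raises if a gene has more than one annotation row.
merged = driver_table.merge(
    annotations, on="gene_id", how="left", validate="many_to_one"
)

n_unannotated = int(merged["structural_label"].isna().sum())
```

A left join is the conservative choice here because it never drops a driver gene just because the GTF-based annotation had nothing to say about it.
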
## VAE Decoder Attribution

When the fit artifact directory contains a VAE checkpoint (`vae_checkpoint.pt`), pass `--vae-attribution` to compute a finite-difference Jacobian attribution:

```bash
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --vae-attribution \
  --vae-fdr-threshold 0.05 \
  --vae-percentile-threshold 90.0 \
  --output-dir artifacts/explain/run1
```

This perturbs the module-associated latent dimension and decodes back into gene space via the VAE decoder. Genes passing three simultaneous criteria are classified as high-confidence drivers:

1. Association FDR ≤ `--vae-fdr-threshold` (from `gene_driver_table.qvalue`)
2. `|decoded_delta|` in the upper tail set by `--vae-percentile-threshold` among module genes (with `90.0`, roughly the top 10%)
3. `sign(decoded_delta)` agrees with `sign(r)` (direction-corrected for anti-correlated latent dims)

Output: `{module_id}/vae_drivers.parquet`.

## Captum Integrated Gradients

Requires optional install: `pip install "isograph[torch-explain]"` (the quotes keep shells such as zsh from expanding the brackets).

Captum Integrated Gradients (IG) attributes module eigengene prediction to individual transcript features via the VAE encoder. Unlike decoder Jacobian attribution, IG captures the full encoder nonlinearity and is complementary to the association-based approaches.

```bash
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --integrated-gradients \
  --ig-n-steps 100 \
  --ig-baseline zero \
  --output-dir artifacts/explain/run1
```

Output: `{module_id}/ig_attributions.parquet` with columns `feature_id` and `ig_score`, sorted by `|ig_score|`.

IG baseline options:

- `zero` (default): attributes relative to no isoform switching
- `mean`: attributes relative to the per-sample gene mean usage
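
For readers unfamiliar with how Integrated Gradients uses a baseline and a step count, the following NumPy sketch shows the generic technique on a toy model. It is not Captum's API and not how `isograph` calls the VAE encoder: the `tanh` model, its weights, and the function names are stand-ins chosen so the gradient can be written analytically.

```python
import numpy as np


def model(x, w):
    """Toy differentiable scalar model standing in for an encoder output."""
    return np.tanh(x @ w)


def model_grad(x, w):
    """Analytic gradient of tanh(x @ w) with respect to x."""
    return (1.0 - np.tanh(x @ w) ** 2) * w


def integrated_gradients(x, baseline, w, n_steps=100):
    """Approximate IG with a midpoint Riemann sum along the baseline-to-x path."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.stack([model_grad(p, w) for p in path])
    # Average path gradient, scaled by each feature's distance from the baseline.
    return (x - baseline) * grads.mean(axis=0)


w = np.array([0.5, -1.0, 0.25])   # hypothetical model weights
x = np.array([0.6, 0.1, 0.3])     # hypothetical feature vector for one sample
zero_baseline = np.zeros_like(x)  # analogous to a zero baseline

ig_zero = integrated_gradients(x, zero_baseline, w, n_steps=100)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(ig_zero.sum() - (model(x, w) - model(zero_baseline, w)))
```

The completeness gap shrinks as `n_steps` grows, which is the usual reason to raise the step count when attributions look unstable. Any other baseline vector of the same shape can be passed in place of `zero_baseline`, and the attributions are then read relative to that reference point.
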