Module Explanation
After fitting IsoGraph on a dataset, the isograph.explain module identifies the
transcript-level and gene-local features that best explain each discovered module. The
key biological object is a within-gene isoform switch: a module driver should be a
directional transcript-usage contrast within a gene, not just a transcript with high
marginal correlation.
Quick Start
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --module-ids M000 M001 \
  --plot \
  --output-dir artifacts/explain/run1
Python API
from isograph.explain import explain_module, ExplainConfig

results = explain_module(
    artifact_dir="artifacts/fits/my_dataset",
    feature_table=feature_df,  # samples × features DataFrame
    feature_meta=meta_df,      # feature_id, gene_id, feature_type columns
    module_ids=["M000", "M001"],
    config=ExplainConfig(plot=True),
    output_dir="artifacts/explain/run1",
)
# results["M000"].gene_driver_table — top gene drivers sorted by |r|
# results["M000"].transcript_polarity_table — per-transcript correlations
# results["M000"].high_vs_low_table — high- vs. low-module contrast
explain_module returns a dict[str, ExplainResult].
Inputs
| Input | Required | Description |
|---|---|---|
| artifact_dir | yes | Must contain |
| feature_table | yes | Samples × features DataFrame (index or |
| feature_meta | yes | Columns: feature_id, gene_id, feature_type |
| module_ids | no | Subset of module IDs to explain (default: all modules) |
| | no | Precomputed eigengene scores (overrides computed eigengenes) |
| --annotation-table | no | Structural labels from isograph annotate-structure |
Outputs
For each module, explain-module writes a subdirectory {output_dir}/{module_id}/:
gene_driver_table.parquet
One row per gene. Columns: gene_id, r (Pearson correlation with module eigengene),
pvalue, qvalue (BH-corrected), n_samples, missing_fraction. Sorted by |r|.
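The r and qvalue columns can be illustrated with a short sketch. It assumes a plain Pearson correlation on complete pairs and the standard Benjamini-Hochberg step-up adjustment; the helper names pearson_r and bh_qvalues are hypothetical and are not part of isograph, whose internal implementation may differ.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays, dropping pairs that contain NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

# Hypothetical p-values for four genes
qvalues = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

Sorting the resulting rows by the absolute value of r would reproduce the documented ordering of the table.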
transcript_polarity_table.parquet
One row per feature. Columns: feature_id, gene_id, r, pvalue, qvalue,
n_samples, missing_fraction, switch_strength. switch_strength = max(r) − min(r)
across transcripts within a gene — high values identify genes with directional isoform
switching.
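As a minimal sketch of the switch_strength definition above, the snippet below computes max(r) − min(r) per gene with pandas and broadcasts it back onto each transcript row. The feature IDs, gene IDs, and correlation values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-transcript correlations with one module eigengene
polarity = pd.DataFrame({
    "feature_id": ["tx1", "tx2", "tx3", "tx4", "tx5"],
    "gene_id": ["geneA", "geneA", "geneB", "geneB", "geneB"],
    "r": [0.62, -0.55, 0.30, 0.25, 0.35],
})

# switch_strength = max(r) - min(r) across transcripts of the same gene
by_gene = polarity.groupby("gene_id")["r"]
polarity["switch_strength"] = by_gene.transform("max") - by_gene.transform("min")
```

Here geneA has one positively and one negatively correlated transcript, so its switch_strength is large; geneB has three transcripts that move in the same direction, so its switch_strength stays small even though every r is positive.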
high_vs_low_table.parquet
One row per feature. Contrasts mean feature usage in the top vs. bottom percentile of
samples ranked by module eigengene. Columns: feature_id, gene_id, mean_high,
mean_low, delta, se, tstat, pvalue, n_high, n_low, missing_fraction.
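The contrast columns can be sketched for a single feature as follows. Two choices here are assumptions rather than documented behavior: the quantile cutoff (0.25) that defines the high and low groups, and the Welch-style standard error with unequal variances. The function name high_vs_low is hypothetical.

```python
import numpy as np

def high_vs_low(usage, eigengene, quantile=0.25):
    """Contrast one feature's usage between high- and low-eigengene samples.

    quantile=0.25 is an assumed cutoff, not a documented isograph default.
    """
    usage = np.asarray(usage, dtype=float)
    eigengene = np.asarray(eigengene, dtype=float)
    lo_cut, hi_cut = np.quantile(eigengene, [quantile, 1.0 - quantile])
    high = usage[eigengene >= hi_cut]
    low = usage[eigengene <= lo_cut]
    # drop missing usage values within each group
    high, low = high[~np.isnan(high)], low[~np.isnan(low)]
    delta = high.mean() - low.mean()
    # Welch-style standard error (assumption: variances are not pooled)
    se = np.sqrt(high.var(ddof=1) / high.size + low.var(ddof=1) / low.size)
    return {
        "mean_high": float(high.mean()),
        "mean_low": float(low.mean()),
        "delta": float(delta),
        "se": float(se),
        "tstat": float(delta / se),
        "n_high": int(high.size),
        "n_low": int(low.size),
    }

# Hypothetical data: eight samples ranked by eigengene score
row = high_vs_low(
    usage=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
    eigengene=np.arange(8.0),
)
```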
module_explanation_manifest.json
Shared manifest at {output_dir}/module_explanation_manifest.json recording config,
module IDs, file paths, and optional feature availability.
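For downstream scripts, the layout described above can be captured in a small helper. This is a sketch that relies only on the file names listed in this section; module_output_paths and manifest_path are hypothetical helpers, not isograph functions.

```python
from pathlib import Path

# Per-module tables documented in this section
MODULE_TABLES = (
    "gene_driver_table.parquet",
    "transcript_polarity_table.parquet",
    "high_vs_low_table.parquet",
)

def module_output_paths(output_dir, module_id):
    """Map table name -> expected path under {output_dir}/{module_id}/."""
    module_dir = Path(output_dir) / module_id
    return {Path(name).stem: module_dir / name for name in MODULE_TABLES}

def manifest_path(output_dir):
    """Path of the shared manifest written next to the module directories."""
    return Path(output_dir) / "module_explanation_manifest.json"

paths = module_output_paths("artifacts/explain/run1", "M000")
```

Each returned path can then be passed to pandas.read_parquet once the run has finished.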
Publication-Ready Plots
Pass --plot (or ExplainConfig(plot=True)) to write per-module figures:
Top driver genes barplot — |r| with 95% CI, colored by polarity direction
Transcript usage gradient — x = module eigengene score, y = transcript usage, smoothed trend per transcript for the top N driver genes
Positive/negative driver heatmap — heatmap of normalized usage for top positive and negative driver genes
Use --output-format to choose the figure format: pass a single value (--output-format pdf) or several (--output-format png pdf).
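The documentation does not state how the 95% CI in the driver barplot is derived. One common choice for a Pearson correlation is the Fisher z-transform, sketched below as an assumption; pearson_ci is a hypothetical helper and may not match what isograph plots.

```python
import math

def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for Pearson r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)               # map r onto an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical driver gene: r = 0.5 across 103 samples
lower, upper = pearson_ci(0.5, 103)
```

For a barplot of |r|, the interval would be computed on r and then reflected according to the gene's polarity.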
Structural Annotation
isograph annotate-structure assigns GTF-based structural labels to transcript switch
pairs (exon changes, CDS/UTR shifts, biotype switches). Pass the output to
--annotation-table to merge those labels into the driver tables.
# Step 1: annotate switch pairs
isograph annotate-structure \
  --gtf gencode.v47.annotation.gtf.gz \
  --switch-pairs switch_pairs.tsv \
  --gtf-cache gencode_v47_cache.parquet \
  --output transcript_structure_annotations.tsv

# Step 2: explain with structural context
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --annotation-table transcript_structure_annotations.tsv \
  --output-dir artifacts/explain/run1
The GTF cache (written on the first run) avoids re-parsing the full GTF on subsequent runs. Parsing GENCODE v47 from scratch takes ~20 minutes on NFS; the cache loads in seconds.
VAE Decoder Attribution
When the fit artifact directory contains a VAE checkpoint (vae_checkpoint.pt), pass
--vae-attribution to compute a finite-difference Jacobian attribution:
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --vae-attribution \
  --vae-fdr-threshold 0.05 \
  --vae-percentile-threshold 90.0 \
  --output-dir artifacts/explain/run1
This perturbs the module-associated latent dimension and decodes back into gene space via the VAE decoder. Genes passing three simultaneous criteria are classified as high-confidence drivers:
Association FDR ≤ --vae-fdr-threshold (from gene_driver_table.qvalue)
|decoded_delta| in the top --vae-percentile-threshold percentile among module genes
sign(decoded_delta) agrees with sign(r) (direction-corrected for anti-correlated latent dims)
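The three criteria can be combined into a single boolean mask, as in the sketch below. It assumes that "top percentile" means |decoded_delta| at or above the given percentile of the module's genes and that decoded_delta has already been direction-corrected; high_confidence_mask is a hypothetical name and the input values are made up.

```python
import numpy as np

def high_confidence_mask(qvalue, decoded_delta, r,
                         fdr_threshold=0.05, percentile_threshold=90.0):
    """Boolean mask of genes that pass all three criteria at once."""
    qvalue = np.asarray(qvalue, dtype=float)
    decoded_delta = np.asarray(decoded_delta, dtype=float)
    r = np.asarray(r, dtype=float)

    abs_delta = np.abs(decoded_delta)
    # assumption: the cutoff is the given percentile of |decoded_delta|
    cutoff = np.percentile(abs_delta, percentile_threshold)

    passes_fdr = qvalue <= fdr_threshold
    passes_magnitude = abs_delta >= cutoff
    passes_sign = np.sign(decoded_delta) == np.sign(r)
    return passes_fdr & passes_magnitude & passes_sign

# Hypothetical module with ten genes
delta = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.5, -0.9]
r = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.6, 0.7]
q = [0.01] * 10
mask = high_confidence_mask(q, delta, r, percentile_threshold=80.0)
```

In this toy input the last gene has the largest |decoded_delta| but its sign disagrees with r, so only the ninth gene is retained.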
Output: {module_id}/vae_drivers.parquet.
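The perturb-and-decode step described in this section can be sketched with a central finite difference along one latent dimension. The toy decoder, the step size eps, and the choice of a central rather than forward difference are all assumptions for illustration; the actual isograph routine operates on the trained VAE checkpoint and may differ.

```python
import numpy as np

def decoded_delta(decoder, z, latent_dim, eps=1e-2):
    """Central finite difference of the decoder output along one latent dimension."""
    z = np.asarray(z, dtype=float)
    step = np.zeros_like(z)
    step[latent_dim] = eps
    return (decoder(z + step) - decoder(z - step)) / (2.0 * eps)

# Toy stand-in for a trained decoder: 2 latent dims -> 3 genes (hypothetical weights)
W = np.array([[1.0, 0.0],
              [-2.0, 0.5],
              [0.0, 3.0]])

def toy_decoder(z):
    return np.tanh(W @ z)

# Sensitivity of each gene to latent dimension 0, evaluated at the origin
delta = decoded_delta(toy_decoder, z=np.zeros(2), latent_dim=0)
```

At the origin the exact derivative is the first column of W, so the estimate is close to 1, -2, and 0 for the three genes.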
Captum Integrated Gradients
Requires optional install: pip install isograph[torch-explain].
Captum Integrated Gradients (IG) attributes the module eigengene prediction to individual transcript features via the VAE encoder. Unlike decoder Jacobian attribution, IG captures the full encoder nonlinearity and is complementary to the association-based approaches.
isograph explain-module \
  --artifact-dir artifacts/fits/my_dataset \
  --feature-table features.parquet \
  --feature-meta feature_metadata.parquet \
  --integrated-gradients \
  --ig-n-steps 100 \
  --ig-baseline zero \
  --output-dir artifacts/explain/run1
Output: {module_id}/ig_attributions.parquet, with columns feature_id and ig_score, sorted by |ig_score|.
IG baseline options:
zero (default): attributes relative to no isoform switching
mean: attributes relative to the per-sample gene mean usage
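To make the role of the step count and the baseline concrete, here is a hand-rolled sketch of integrated gradients on a toy differentiable function. It does not call Captum and is not the isograph implementation: it uses a midpoint Riemann sum along the straight path from baseline to input, and the toy model, its weights, and the example baselines are all assumptions.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=100):
    """Midpoint Riemann approximation of integrated gradients."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # interpolation coefficients at the midpoints of n_steps equal intervals
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # average gradient along the path, scaled by the input-baseline difference
    return (x - baseline) * grads.mean(axis=0)

# Toy model standing in for "encoder -> eigengene": f(x) = sum(w * x**2)
w = np.array([1.0, -0.5, 2.0])

def f(x):
    return float(np.sum(w * x ** 2))

def grad_f(x):
    return 2.0 * w * x

x = np.array([1.0, 2.0, 0.5])
ig_zero = integrated_gradients(grad_f, x, baseline=np.zeros_like(x))
mean_baseline = np.full_like(x, 0.5)  # hypothetical mean-usage baseline
ig_mean = integrated_gradients(grad_f, x, baseline=mean_baseline)
```

A useful sanity check is the completeness property: the attributions should sum to f(x) minus f(baseline). Changing the baseline changes what each score is measured against, which is why the zero and mean options can rank features differently.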