Annotation Sources

Data sources available for variant annotation.

Core

  • GENCODE: Gene annotations from GENCODE — transcripts, exons, CDS coordinates

Annotation Sources (opt-in)

| Source | Match Level | Assembly | Data Size | Description |
|---|---|---|---|---|
| OncoKB | Gene symbol | Any | ~50 KB | Cancer gene classification (ONCOGENE/TSG) from OncoKB |
| AlphaMissense | Genomic (chr:pos:ref:alt) | GRCh38 | ~643 MB | Missense pathogenicity scores from AlphaMissense (Cheng et al., Science 2023). CC BY 4.0 |
| ClinVar | Genomic (chr:pos:ref:alt) | GRCh38 | ~182 MB | Clinical significance from ClinVar (4.1M variants) |
| Cancer Hotspots | Protein position (transcript + AA pos) | Any | ~200 KB | Recurrent mutation hotspots from cancerhotspots.org |
| SIGNAL | Genomic (chr:pos:ref:alt) | GRCh37 only | ~32 MB | Germline mutation frequencies from SIGNAL |
| SIFT | Protein (peptide MD5) | Any | ~4.1 GB (shared) | SIFT missense prediction scores via Ensembl variation database |
| PolyPhen-2 | Protein (peptide MD5) | Any | ~4.1 GB (shared) | PolyPhen-2 HDIV missense prediction scores via Ensembl variation database |
| dbSNP | Genomic (chr:pos:ref:alt) | GRCh38 | ~17 GB | RS identifiers from dbSNP |

Match Levels

Match level determines how each source links to variants:

  • Genomic: matches on exact chr:pos:ref:alt — assembly-specific (GRCh37 vs GRCh38)
  • Protein position: matches on transcript + amino acid position — transcript-version sensitive. Hotspot positions are only annotated when the annotation’s transcript matches the hotspot’s transcript.
  • Protein (peptide MD5): matches on MD5 hash of the protein sequence + amino acid position + alternate residue. Assembly-independent since it operates on protein sequences, not genomic coordinates.
  • Gene symbol: matches on gene name only — assembly-independent

Storage

Genomic index

Genomic annotation sources (AlphaMissense, ClinVar, SIGNAL, gnomAD, dbSNP) are merged into a single SQLite database (genomic_annotations.sqlite) with a WITHOUT ROWID clustered primary key on (chrom, pos, ref, alt). Because the file is accessed via mmap, point lookups take roughly 1-5 µs and add almost nothing to the Go heap, and each variant needs one database lookup instead of many.

All variants are stored with normalized coordinates:

  • Chromosome: without “chr” prefix (e.g. “12”, not “chr12”)
  • Position and alleles: canonical MAF-style (no anchor base for indels). VCF-style indels are normalized during build and lookup, so both formats match correctly.
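
The normalization rules above can be sketched in Go as follows. This is a minimal illustration, not the project's implementation: the `Variant` type and `Normalize` function are hypothetical names, the "-" placeholder for an empty allele follows the common MAF convention, and the position handling (deletions shift right by one, insertions keep the preceding base's position) is an assumption based on that convention.

```go
package main

import (
	"fmt"
	"strings"
)

// Variant is a hypothetical lookup key mirroring the
// (chrom, pos, ref, alt) primary key of the genomic index.
type Variant struct {
	Chrom string
	Pos   int64
	Ref   string
	Alt   string
}

// Normalize strips a "chr" prefix and converts a VCF-style indel
// (alleles share a leading anchor base) to MAF style.
// Assumption: an empty allele is written as "-".
func Normalize(v Variant) Variant {
	v.Chrom = strings.TrimPrefix(v.Chrom, "chr")

	// SNVs, MNVs and already anchor-free indels are left untouched.
	if len(v.Ref) == len(v.Alt) || v.Ref == "" || v.Alt == "" || v.Ref[0] != v.Alt[0] {
		return v
	}

	// Drop the shared anchor base.
	v.Ref, v.Alt = v.Ref[1:], v.Alt[1:]
	if v.Ref == "" {
		// Insertion: position stays on the base before the inserted sequence.
		v.Ref = "-"
	} else {
		// Deletion or complex indel: the first affected base is one to the right.
		v.Pos++
	}
	if v.Alt == "" {
		v.Alt = "-"
	}
	return v
}

func main() {
	del := Normalize(Variant{"chr12", 100, "AT", "A"})
	ins := Normalize(Variant{"chr12", 100, "A", "AT"})
	fmt.Printf("%+v\n%+v\n", del, ins)
}
```

Applying the same function at build time and at lookup time is what would let VCF-style and MAF-style input resolve to the same row.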

SIFT/PolyPhen-2 predictions

SIFT and PolyPhen-2 predictions are stored in a separate SQLite database (ensembl_sift_polyphen.sqlite), built from Ensembl’s variation database MySQL dumps. The data contains pre-computed prediction matrices for every possible amino acid substitution in every Ensembl protein.

Data source: Ensembl runs SIFT 6.2.1 and PolyPhen-2 2.2.3 on their proteome and publishes the results as protein_function_predictions.txt.gz in the MySQL dump for each release. This is the same data that Ensembl VEP uses — no registration or external tools required.

How it works:

  1. vibe-vep download fetches two files from the Ensembl FTP (~4.1 GB total):
    • translation_md5.txt.gz — maps translation IDs to peptide sequence MD5 hashes
    • protein_function_predictions.txt.gz — gzip-compressed binary prediction matrices
  2. vibe-vep prepare (or first annotation run) builds a local SQLite keyed by (md5, analysis)
  3. During annotation, for each missense variant, the CDS sequence is translated to protein, MD5-hashed, and used to look up the prediction matrix. The matrix is indexed by amino acid position and alternate residue to retrieve the SIFT/PolyPhen score and qualitative prediction.
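
The hashing part of step 3 can be sketched as below. This is an illustration under stated assumptions: `PeptideMD5` and `PredictionKey` are hypothetical names, the analysis label "sift" is a placeholder, and upper-casing the sequence and dropping a trailing stop symbol (`*`) before hashing are assumptions about how the peptide is prepared. Translating the CDS to protein is omitted.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// PredictionKey mirrors the (md5, analysis) key described above.
// The field names and analysis labels are hypothetical.
type PredictionKey struct {
	MD5      string
	Analysis string
}

// PeptideMD5 returns the lowercase hex MD5 of a protein sequence.
// Assumption: the sequence is upper-cased and a trailing stop
// symbol is removed before hashing.
func PeptideMD5(peptide string) string {
	p := strings.TrimSuffix(strings.ToUpper(peptide), "*")
	sum := md5.Sum([]byte(p))
	return hex.EncodeToString(sum[:])
}

func main() {
	key := PredictionKey{MD5: PeptideMD5("MKTAYIAKQR*"), Analysis: "sift"}
	fmt.Printf("%+v\n", key)
}
```

Because the key depends only on the peptide sequence, two transcripts on different assemblies that encode the same protein resolve to the same prediction matrix.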

Binary matrix format: Each prediction is a 16-bit little-endian value. The top 2 bits encode the qualitative prediction and the bottom 10 bits encode the score as an integer from 0 to 1000, which is divided by 1000 to give the 0-1 score. A value of 0xFFFF marks the reference amino acid, for which there is no prediction. Each matrix has a 3-byte header followed by 20 entries per amino acid position, one per possible substitution, in alphabetical order of the one-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.
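
A decoder for this layout might look like the following sketch. It assumes the matrix has already been gunzipped and that positions are 1-based. `Lookup` is a hypothetical name, and the function returns the raw 2-bit prediction code because the mapping from code to label is not specified here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

const (
	aminoAcids   = "ACDEFGHIKLMNPQRSTVWY" // substitution order within a position
	headerLen    = 3                      // bytes before the first entry
	entrySize    = 2                      // one 16-bit value per substitution
	noPrediction = 0xFFFF                 // reference amino acid
)

// Lookup decodes the entry for a 1-based peptide position and an
// alternate residue. found is false when the entry is 0xFFFF.
func Lookup(matrix []byte, pos int, alt byte) (code uint8, score float64, found bool, err error) {
	idx := strings.IndexByte(aminoAcids, alt)
	if idx < 0 {
		return 0, 0, false, fmt.Errorf("unknown residue %q", alt)
	}
	if pos < 1 {
		return 0, 0, false, fmt.Errorf("position %d is not 1-based", pos)
	}
	off := headerLen + ((pos-1)*len(aminoAcids)+idx)*entrySize
	if off+entrySize > len(matrix) {
		return 0, 0, false, fmt.Errorf("position %d is beyond the matrix", pos)
	}
	v := binary.LittleEndian.Uint16(matrix[off : off+entrySize])
	if v == noPrediction {
		return 0, 0, false, nil
	}
	// Top 2 bits: qualitative prediction. Bottom 10 bits: score * 1000.
	return uint8(v >> 14), float64(v&0x03FF) / 1000, true, nil
}

func main() {
	// Toy matrix: placeholder header, two positions, all "no prediction".
	m := make([]byte, headerLen+2*len(aminoAcids)*entrySize)
	for i := headerLen; i < len(m); i++ {
		m[i] = 0xFF
	}
	// Position 2, substitution to D: prediction code 1, raw score 500.
	off := headerLen + (1*len(aminoAcids)+2)*entrySize
	binary.LittleEndian.PutUint16(m[off:], 1<<14|500)

	code, score, found, err := Lookup(m, 2, 'D')
	fmt.Println(code, score, found, err)
}
```

The bounds check matters in practice: a variant position past the end of the stored peptide should yield no annotation rather than a read past the end of the slice.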

Version matching: vibe-vep downloads predictions from the Ensembl release that matches the GENCODE gene models in use (GENCODE v45 = Ensembl 111, v19 = Ensembl 75). This mapping is maintained in gencodeEnsemblMap in download.go.
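
As a rough picture of such a mapping, the sketch below uses a plain Go map holding only the two pairs stated above. The names `gencodeToEnsembl` and `EnsemblRelease`, the key format, and the value type are all assumptions; the actual gencodeEnsemblMap in download.go may be shaped differently.

```go
package main

import "fmt"

// gencodeToEnsembl is a hypothetical stand-in for gencodeEnsemblMap.
// It contains only the pairs named in the text above.
var gencodeToEnsembl = map[string]int{
	"v45": 111,
	"v19": 75,
}

// EnsemblRelease returns the Ensembl release paired with a GENCODE
// version, or an error when the version has no known mapping.
func EnsemblRelease(gencode string) (int, error) {
	rel, ok := gencodeToEnsembl[gencode]
	if !ok {
		return 0, fmt.Errorf("no Ensembl release known for GENCODE %s", gencode)
	}
	return rel, nil
}

func main() {
	rel, err := EnsemblRelease("v45")
	fmt.Println(rel, err)
}
```

Returning an error for an unmapped version, rather than falling back to a default release, avoids silently pairing predictions with the wrong gene models.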

Coverage: 92.8% of GENCODE v45 protein-coding transcripts have matching predictions in Ensembl 111 (99.0% of canonical transcripts). The remaining ~7% are proteins where Ensembl did not compute SIFT/PolyPhen predictions (e.g. too short, non-standard). Variants in unmatched proteins simply receive no SIFT/PolyPhen annotation — the same behavior as Ensembl VEP itself.

Output columns:

| Column | Range / values | Description |
|---|---|---|
| sift.score | 0-1 | SIFT score (lower = more damaging) |
| sift.prediction | tolerated, deleterious, tolerated_low_confidence, deleterious_low_confidence | Qualitative SIFT prediction |
| polyphen.score | 0-1 | PolyPhen-2 HDIV score (higher = more damaging) |
| polyphen.prediction | probably_damaging, possibly_damaging, benign, unknown | Qualitative PolyPhen-2 prediction |

Variant cache

The DuckDB variant cache (variant_cache.duckdb) is separate from both SQLite databases described above. It is used for --save-results, --from-cache, and export parquet in post-analysis workflows.

Enabling Sources

# AlphaMissense (GRCh38): download + prepare + enable
vibe-vep config set annotations.alphamissense true
vibe-vep download  # fetches ~643 MB
vibe-vep prepare   # builds SQLite index

# ClinVar (GRCh38): download + enable
vibe-vep config set annotations.clinvar true
vibe-vep download  # fetches ~182 MB clinvar.vcf.gz

# Hotspots: point to TSV file
vibe-vep config set annotations.hotspots /path/to/hotspots_v2_and_3d.txt

# SIGNAL (GRCh37 only): enable
vibe-vep config set annotations.signal true

# SIFT + PolyPhen-2 (via Ensembl): download + enable
vibe-vep config set annotations.sift true
vibe-vep config set annotations.polyphen true
vibe-vep download  # fetches ~4.1 GB from Ensembl FTP

# dbSNP RS IDs: download + enable
vibe-vep config set annotations.dbsnp true
vibe-vep download  # fetches ~17 GB dbsnp.vcf.gz

Use vibe-vep version to see which sources are loaded and vibe-vep version --maf-columns for the full column mapping.