Annotation Sources

Data sources available for variant annotation.

Core

GENCODE: Gene annotations from GENCODE — transcripts, exons, CDS coordinates

Annotation Sources (opt-in)

Source	Match Level	Assembly	Data Size	Description
OncoKB	Gene symbol	Any	~50 KB	Cancer gene classification (ONCOGENE/TSG) from OncoKB
AlphaMissense	Genomic (chr:pos:ref:alt)	GRCh38	~643 MB	Missense pathogenicity scores from AlphaMissense (Cheng et al., Science 2023). CC BY 4.0
ClinVar	Genomic (chr:pos:ref:alt)	GRCh38	~182 MB	Clinical significance from ClinVar (4.1M variants)
Cancer Hotspots	Protein position (transcript + AA pos)	Any	~200 KB	Recurrent mutation hotspots from cancerhotspots.org
SIGNAL	Genomic (chr:pos:ref:alt)	GRCh37 only	~32 MB	Germline mutation frequencies from SIGNAL
SIFT	Protein (peptide MD5)	Any	~4.1 GB (shared)	SIFT missense prediction scores via Ensembl variation database
PolyPhen-2	Protein (peptide MD5)	Any	~4.1 GB (shared)	PolyPhen-2 HDIV missense prediction scores via Ensembl variation database
dbSNP	Genomic (chr:pos:ref:alt)	GRCh38	~17 GB	RS identifiers from dbSNP

Match Levels

Match level determines how each source links to variants:

Genomic: matches on exact chr:pos:ref:alt — assembly-specific (GRCh37 vs GRCh38)
Protein position: matches on transcript + amino acid position — transcript-version sensitive. Hotspot positions are only annotated when the annotation’s transcript matches the hotspot’s transcript.
Protein (peptide MD5): matches on MD5 hash of the protein sequence + amino acid position + alternate residue. Assembly-independent since it operates on protein sequences, not genomic coordinates.
Gene symbol: matches on gene name only — assembly-independent

Storage

Genomic index

Genomic annotation sources (AlphaMissense, ClinVar, SIGNAL, gnomAD, dbSNP) are merged into a single SQLite database (genomic_annotations.sqlite) with a WITHOUT ROWID clustered primary key on (chrom, pos, ref, alt). This gives ~1-5us point lookups with near-zero Go heap via mmap — one DB lookup per variant instead of many.

All variants are stored with normalized coordinates:

Chromosome: without “chr” prefix (e.g. “12”, not “chr12”)
Position and alleles: canonical MAF-style (no anchor base for indels). VCF-style indels are normalized during build and lookup, so both formats match correctly.

SIFT/PolyPhen-2 predictions

SIFT and PolyPhen-2 predictions are stored in a separate SQLite database (ensembl_sift_polyphen.sqlite), built from Ensembl’s variation database MySQL dumps. The data contains pre-computed prediction matrices for every possible amino acid substitution in every Ensembl protein.

Data source: Ensembl runs SIFT 6.2.1 and PolyPhen-2 2.2.3 on their proteome and publishes the results as protein_function_predictions.txt.gz in the MySQL dump for each release. This is the same data that Ensembl VEP uses — no registration or external tools required.

How it works:

vibe-vep download fetches two files from the Ensembl FTP (~4.1 GB total):
- translation_md5.txt.gz — maps translation IDs to peptide sequence MD5 hashes
- protein_function_predictions.txt.gz — gzip-compressed binary prediction matrices
vibe-vep prepare (or first annotation run) builds a local SQLite keyed by (md5, analysis)
During annotation, for each missense variant, the CDS sequence is translated to protein, MD5-hashed, and used to look up the prediction matrix. The matrix is indexed by amino acid position and alternate residue to retrieve the SIFT/PolyPhen score and qualitative prediction.

Binary matrix format: Each prediction is a 16-bit little-endian value: top 2 bits encode the qualitative prediction, bottom 10 bits encode the score (0-1000, divided by 1000). A value of 0xFFFF indicates the reference amino acid (no prediction). Matrices have a 3-byte header followed by 20 entries per amino acid position (one per possible substitution in alphabetical order: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).

Version matching: vibe-vep downloads predictions from the Ensembl release that matches the GENCODE gene models in use (GENCODE v45 = Ensembl 111, v19 = Ensembl 75). This mapping is maintained in gencodeEnsemblMap in download.go.

Coverage: 92.8% of GENCODE v45 protein-coding transcripts have matching predictions in Ensembl 111 (99.0% of canonical transcripts). The remaining ~7% are proteins where Ensembl did not compute SIFT/PolyPhen predictions (e.g. too short, non-standard). Variants in unmatched proteins simply receive no SIFT/PolyPhen annotation — the same behavior as Ensembl VEP itself.

Output columns:

Column	Range	Description
`sift.score`	0-1	SIFT score (lower = more damaging)
`sift.prediction`	tolerated, deleterious, tolerated_low_confidence, deleterious_low_confidence	Qualitative SIFT prediction
`polyphen.score`	0-1	PolyPhen-2 HDIV score (higher = more damaging)
`polyphen.prediction`	probably_damaging, possibly_damaging, benign, unknown	Qualitative PolyPhen-2 prediction

Variant cache

The DuckDB variant cache (variant_cache.duckdb) is separate from both — used for --save-results / --from-cache / export parquet post-analysis.

Enabling Sources

# AlphaMissense (GRCh38): download + prepare + enable
vibe-vep config set annotations.alphamissense true
vibe-vep download  # fetches ~643 MB
vibe-vep prepare   # builds SQLite index

# ClinVar (GRCh38): download + enable
vibe-vep config set annotations.clinvar true
vibe-vep download  # fetches ~182 MB clinvar.vcf.gz

# Hotspots: point to TSV file
vibe-vep config set annotations.hotspots /path/to/hotspots_v2_and_3d.txt

# SIGNAL (GRCh37 only): enable
vibe-vep config set annotations.signal true

# SIFT + PolyPhen-2 (via Ensembl): download + enable
vibe-vep config set annotations.sift true
vibe-vep config set annotations.polyphen true
vibe-vep download  # fetches ~4.1 GB from Ensembl FTP

# dbSNP RS IDs: download + enable
vibe-vep config set annotations.dbsnp true
vibe-vep download  # fetches ~17 GB dbsnp.vcf.gz

Use vibe-vep version to see which sources are loaded and vibe-vep version --maf-columns for the full column mapping.