Development

Setting up a development environment and contributing to vibe-vep.

Prerequisites

  • Go 1.24+ with CGO_ENABLED=1 (required for DuckDB and SQLite)
  • Git
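
Because the build depends on cgo, it can help to confirm the toolchain setting before building. The following is a minimal sketch, not part of the repository: it wraps the standard `go env CGO_ENABLED` command, and the helper name cgoEnabled is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// cgoEnabled interprets the output of `go env CGO_ENABLED`,
// which prints "1" when cgo is enabled and "0" otherwise.
// (Hypothetical helper, used only for this illustration.)
func cgoEnabled(goEnvOutput string) bool {
	return strings.TrimSpace(goEnvOutput) == "1"
}

func main() {
	// Ask the installed Go toolchain for its effective cgo setting.
	out, err := exec.Command("go", "env", "CGO_ENABLED").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not run `go env`:", err)
		os.Exit(1)
	}
	if !cgoEnabled(string(out)) {
		fmt.Fprintln(os.Stderr, "cgo is disabled; set CGO_ENABLED=1 before building")
		os.Exit(1)
	}
	fmt.Println("cgo is enabled")
}
```

Note that cgo also needs a working C compiler on the PATH, which this check does not verify.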

Building

git clone https://github.com/inodb/vibe-vep.git
cd vibe-vep
CGO_ENABLED=1 go build -o vibe-vep ./cmd/vibe-vep

Running Tests

# Run all tests (fast mode, ~1s)
go test ./... -short -count=1

# Run corner case tests (66 tests, 7ms)
go test ./internal/annotate/ -run 'TestDatahub_|TestEdge_|TestCorner_' -v

# Run benchmarks
go test ./internal/annotate/ -bench . -benchmem

# Run fast validation benchmark (GDC, ~2 min)
go test ./internal/output/ -run TestValidationBenchmarkGDC -v -count=1 -timeout 10m

# Run full validation benchmark (all assemblies, ~21 min)
go test ./internal/output/ -run 'TestValidationBenchmark(GDC|All|AllGRCh37)' -v -count=1 -timeout 120m

# Run annotation sources benchmark
go test ./internal/output/ -run TestAnnotationSourcesBenchmark -v -count=1 -timeout 30m

Test Data

Test data is available in the following tiers:

Tier         Size    Studies                         Command
datahub_gdc  ~5 GB   32 GDC studies (GRCh38)         make download-datahub-gdc
datahub_all  ~50 GB  431+ studies (GRCh38 + GRCh37)  make download-datahub-all

For backward compatibility, make download-testdata still downloads the original 7-study TCGA subset.

Project Structure

cmd/vibe-vep/       CLI entry point (annotate, download commands)
internal/
  annotate/         Consequence prediction (PredictConsequence, Annotator)
    corner_case_test.go       13 edge case tests
    vep_edge_cases_test.go    17 Ensembl VEP test patterns
    datahub_mismatch_test.go  36 datahub mismatch tests
  cache/            Transcript cache (GENCODE GTF/FASTA loader)
  duckdb/           DuckDB cache for transcripts and variant results
  genomicindex/     Unified SQLite index for annotation source lookups (AlphaMissense, ClinVar, SIGNAL, gnomAD, SIFT/PolyPhen-2, dbSNP)
  maf/              MAF file parser
  output/           Output formatting and validation comparison
  vcf/              VCF file parser
testdata/
  cache/            Test transcript data (JSON)
  datahub_gdc/      32 GDC studies for GRCh38 validation (~5 GB)
  datahub_all/      All datahub studies (GRCh38 + GRCh37, ~50 GB)

Roadmap

  • Feature parity for MAF annotation — Match the annotation capabilities of the genome-nexus-annotation-pipeline
    • Consequence prediction (99.8% concordance)
    • HGVSp/HGVSc notation
    • Full MAF output format
    • Cancer gene annotations (OncoKB)
    • VCF to MAF conversion
    • --pick / --most-severe annotation filtering
  • Additional annotation sources
    • AlphaMissense pathogenicity scores
    • ClinVar clinical significance
    • Cancer Hotspots
    • SIGNAL germline frequencies
    • SIFT predictions
    • PolyPhen-2 predictions
    • gnomAD allele frequencies
  • Re-annotate datahub GDC studies
  • Replace genome-nexus-annotation-pipeline for datahub

License

MIT License