How I rebuilt Variant Effect Predictor to be 100x faster (fastVEP!)

Watch on YouTube

If you work with genomic variants, you know VEP. Ensembl's Variant Effect Predictor is the standard tool — the thing your pipeline calls to figure out whether a given mutation breaks a protein, hits a splice site, or sits harmlessly in some intron. It's been around forever and it works. It's also written in Perl, ships with a Perl 5.22+ requirement, ten-plus CPAN modules, a DBI dependency, and a small graveyard of installation issues anyone who's set up VEP from scratch will recognize.

The annotation itself is fine. The speed is not. Annotating 50,000 variants with VEP takes about 206 seconds. Point it at a full human WGS (~4 million variants) and it doesn't finish on the newest MacBook Pro. People work around this by splitting their VCFs, running parallel processes, and stitching the outputs back together. That works, but it's a huge time tax. A lab running thousands of samples pays that tax every day.

So I rebuilt it in Rust.

The numbers

fastVEP runs the same 50,000-variant file in 1.59 seconds. That's a 130x speedup. The full WGS that VEP can't finish? fastVEP does it in 86 seconds.

Peak memory drops from ~500 MB to 2.8 MB. The installed binary is 3.3 MB instead of ~200 MB of Perl plus dependencies. There are no CPAN modules to chase. You cargo install, you run a binary, that's it.

That's the headline. The interesting part is what actually made it fast. It wasn't one thing. It was the dumb stuff Perl couldn't do well, layered on top of a few good ideas.

What Rust gets you for free

A lot of the speedup is just what you get when you stop paying for an interpreter and a garbage-collected dynamic language. Tight loops over variant records compile to real machine code. Strings don't allocate when they don't need to. Parallelism is rayon and works; you don't fork ten Perl processes and reconstitute their output.

Thanks to agentic coding, this was manageable as one person's effort over a full month. It still requires knowing exactly how the algorithm works so you can instruct the coding agents, and verifying everything extensively against tests and outputs. The algorithm itself is mostly simple: the Sequence Ontology has 49 consequence terms; you map a variant's coordinates against a transcript and figure out which ones apply. The bottleneck in the Perl version is the Perl, not the algorithm.
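
To make "map a variant's coordinates against a transcript" concrete, here is a minimal sketch. It is not fastVEP's code: the `Transcript` struct and `classify` function are hypothetical names, and it assumes a forward-strand transcript, 1-based inclusive coordinates, single-position variants, and only a handful of Sequence Ontology terms.

```rust
/// Hypothetical, simplified transcript model (forward strand only,
/// 1-based inclusive coordinates). Not fastVEP's actual data structure.
struct Transcript {
    start: u64,
    end: u64,
    /// Exons sorted by start, non-overlapping, spanning start..=end.
    exons: Vec<(u64, u64)>,
    cds_start: u64,
    cds_end: u64,
}

/// Return one Sequence Ontology term for a single genomic position.
fn classify(tx: &Transcript, pos: u64) -> &'static str {
    // Outside the transcript entirely. A real predictor would also
    // distinguish upstream and downstream variants here.
    if pos < tx.start || pos > tx.end {
        return "intergenic_variant";
    }
    // Inside an exon: UTR or coding, depending on the CDS bounds.
    for &(s, e) in &tx.exons {
        if pos >= s && pos <= e {
            return if pos < tx.cds_start {
                "5_prime_UTR_variant"
            } else if pos > tx.cds_end {
                "3_prime_UTR_variant"
            } else {
                "coding_sequence_variant"
            };
        }
    }
    // Between two exons: the first two intronic bases are the splice
    // donor site, the last two are the splice acceptor site.
    for pair in tx.exons.windows(2) {
        let (_, prev_end) = pair[0];
        let (next_start, _) = pair[1];
        if pos > prev_end && pos < next_start {
            if pos <= prev_end + 2 {
                return "splice_donor_variant";
            }
            if pos + 2 >= next_start {
                return "splice_acceptor_variant";
            }
            return "intron_variant";
        }
    }
    // Only reached if the exon list does not span the transcript.
    "intron_variant"
}
```

With exons at 100..=200 and 300..=400 and a CDS of 150..=350, position 201 comes back as `splice_donor_variant` and position 250 as `intron_variant`. The real work is in everything this sketch leaves out: reverse strand, indels that span boundaries, codon-level effects.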

If you stop there, you get maybe 10–20x. The rest came from somewhere else.

The next real win: rebuilding the annotation lookup

VEP's slowest path is annotation lookup: pulling in ClinVar, gnomAD, dbSNP, COSMIC, all the supplementary databases that turn raw consequence into something a clinician can act on. The default workflow round-trips through SQLite or remote APIs. For a million variants, that's a million lookups, and every one of them costs more than the consequence prediction itself.

The fix is to put the annotations in a format designed for the access pattern. fastVEP has its own binary format called fastSA, and the v2 design is shamelessly inspired by echtvar, so credit where it's due to Brent Pedersen's work. The key improvements, as I understand them:

Chunked ZIP layout with Var32 encoding for variant keys.
Parallel u32 value arrays per annotation field.
Delta encoding on sorted positions.
An LRU chunk cache, because variant lookups in a real VCF are clustered.
A Bloom filter in front of the index for negative lookups.
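
Two of the ideas in this list, delta encoding on sorted positions and parallel value arrays, fit in a short sketch. This is illustrative only: `Chunk`, `encode_deltas`, and `lookup` are hypothetical names, and the real fastSA layout (ZIP chunks, Var32 keys) is not reproduced here.

```rust
/// Store the first position as-is and every later one as the gap to
/// its predecessor. Assumes the input is sorted ascending.
fn encode_deltas(sorted: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    sorted
        .iter()
        .map(|&p| {
            let d = p - prev;
            prev = p;
            d
        })
        .collect()
}

/// Inverse of `encode_deltas`: a running sum restores the positions.
fn decode_deltas(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

/// Hypothetical chunk: delta-encoded positions plus one parallel
/// value array, where values[i] belongs to the i-th position.
struct Chunk {
    deltas: Vec<u32>,
    values: Vec<u32>,
}

impl Chunk {
    fn new(sorted_positions: &[u32], values: Vec<u32>) -> Self {
        assert_eq!(sorted_positions.len(), values.len());
        Chunk { deltas: encode_deltas(sorted_positions), values }
    }

    /// Walk the deltas, rebuilding positions on the fly, and stop as
    /// soon as the running position passes the query.
    fn lookup(&self, pos: u32) -> Option<u32> {
        let mut acc = 0u32;
        for (i, &d) in self.deltas.iter().enumerate() {
            acc += d;
            if acc == pos {
                return Some(self.values[i]);
            }
            if acc > pos {
                break;
            }
        }
        None
    }
}
```

Positions 100, 105, 230 encode to 100, 5, 125. Small gaps are the point: they compress far better than absolute coordinates, and each annotation field stays a flat array that can be read without touching the others.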

Putting ClinVar, gnomAD, and dbSNP into this format and querying them as a single in-process call is most of what closes the gap on the heaviest workloads. You're not asking a database anymore. You're doing memory-mapped byte arithmetic.
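
Most variants in a typical VCF are not in ClinVar at all, which is why the negative-lookup path matters. A Bloom filter answers "definitely absent" without touching the index. The sketch below is a generic textbook version using only the standard library, not fastVEP's implementation; the `Bloom` type, its sizing parameters, and the use of plain `u64` keys are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Generic Bloom filter over u64 keys (hypothetical; how a variant is
/// packed into a u64 key is left out).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl Bloom {
    /// `num_bits` and `num_hashes` must both be greater than zero.
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Derive the i-th bit index by hashing the key together with i.
    fn index(&self, key: u64, i: u64) -> u64 {
        let mut h = DefaultHasher::new();
        (key, i).hash(&mut h);
        h.finish() % self.num_bits
    }

    fn insert(&mut self, key: u64) {
        for i in 0..self.num_hashes {
            let b = self.index(key, i);
            self.bits[(b / 64) as usize] |= 1u64 << (b % 64);
        }
    }

    /// false means the key was never inserted; true means "maybe",
    /// so the caller still has to check the real index.
    fn may_contain(&self, key: u64) -> bool {
        (0..self.num_hashes).all(|i| {
            let b = self.index(key, i);
            self.bits[(b / 64) as usize] & (1u64 << (b % 64)) != 0
        })
    }
}
```

The filter never produces a false negative, so a `false` lets the lookup return immediately. False positives only cost one wasted index probe.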

What surprised me

A few things I didn't expect going in.

The FASTA handling matters more than I thought. You need the reference sequence for HGVS notation, and a naïve read of the GRCh38 primary assembly is enough to wreck your memory budget on its own. Memory-mapping the indexed FASTA and pulling spans on demand was the difference between "fastVEP runs on a laptop" and "fastVEP needs a server." Apparent simplicity hides this kind of thing; samtools faidx is doing a lot of work for you.
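
The reason an indexed FASTA can be read in spans is that the .fai index turns a base coordinate into a byte offset with plain arithmetic. The sketch below shows that arithmetic against an in-memory byte slice standing in for a memory-mapped file. The `FaiEntry` struct and `fetch` function are hypothetical names, not fastVEP's API; the four fields mirror the LENGTH, OFFSET, LINEBASES, and LINEWIDTH columns of a .fai file.

```rust
/// One sequence's entry from a .fai index (name column omitted).
struct FaiEntry {
    length: u64,     // total bases in the sequence
    offset: u64,     // byte offset of the first base in the file
    line_bases: u64, // bases per line
    line_width: u64, // bytes per line, including the newline
}

impl FaiEntry {
    /// Byte offset in the file of the base at 0-based position `pos0`.
    fn byte_offset(&self, pos0: u64) -> u64 {
        self.offset + (pos0 / self.line_bases) * self.line_width + pos0 % self.line_bases
    }
}

/// Fetch the half-open, 0-based span [start0, end0) from raw FASTA
/// bytes, dropping the line breaks that fall inside the span.
fn fetch(data: &[u8], e: &FaiEntry, start0: u64, end0: u64) -> Option<Vec<u8>> {
    if start0 > end0 || end0 > e.length {
        return None;
    }
    if start0 == end0 {
        return Some(Vec::new());
    }
    let from = e.byte_offset(start0) as usize;
    let to = e.byte_offset(end0 - 1) as usize + 1;
    let span = data.get(from..to)?;
    Some(
        span.iter()
            .copied()
            .filter(|b| *b != b'\n' && *b != b'\r')
            .collect(),
    )
}
```

Only the bytes between `from` and `to` are ever touched, so with a memory-mapped file the operating system pages in just the region around the variant rather than the whole assembly.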

Structural variants are genuinely separate code. SNVs and short indels share a clean abstraction. <DEL>, <DUP>, <INV>, <BND> and the rest don't slot into it cleanly. I tried for a while to unify them, eventually gave up, and wrote a separate SV consequence predictor.

HGVS was the worst part. Generating correct HGVSc and HGVSp notation with 3' normalization across all the edge cases (overlapping CDS, mitochondrial circular coordinates, start-loss variants in non-Met-starting transcripts) required more test cases than the consequence engine itself. There's a reason VEP has been worked on for over a decade. The annoying details are plentiful and real.
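
For readers who haven't met 3' normalization: when a deletion sits in a repeat, HGVS wants it described at the most 3' position that yields the same result. The simplest case can be sketched as below. This is a minimal illustration, not fastVEP's code: `shift_deletion_3prime` is a hypothetical name, it handles deletions only, and it assumes the sequence is already in the orientation being described (for a reverse-strand transcript the shift runs the other way in genomic coordinates).

```rust
/// Shift a deletion of seq[start..end] (0-based, half-open) as far
/// toward the 3' end as possible without changing the resulting
/// sequence. Panics if the span is empty or out of bounds.
fn shift_deletion_3prime(seq: &[u8], mut start: usize, mut end: usize) -> (usize, usize) {
    assert!(start < end && end <= seq.len());
    // Sliding the window one base right is equivalent as long as the
    // base leaving on the left equals the base entering on the right.
    while end < seq.len() && seq[start] == seq[end] {
        start += 1;
        end += 1;
    }
    (start, end)
}
```

On `GATTTTC`, deleting the first T (span 2..3) shifts to the last T (span 5..6). The edge cases listed above are what happens when that shift runs into an exon boundary, the end of the CDS, or the origin of a circular genome.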

Correctness

A faster but wrongly annotated VCF isn't useful. fastVEP is validated against VEP's output on shared test sets and matches on the consequences that matter. The repo has 233 tests across the workspace, not because that number is magic, but because every annoying HGVS edge case eventually became one. If you find a case where fastVEP disagrees with VEP and you think VEP is right, open an issue. Let me know here!

Try it

It's on GitHub at Huang-lab/fastVEP, Apache 2.0. There's a hosted web version at fastVEP.org if you want to paste in some VCF and see what it does. If you have Rust installed, it's a single cargo install away.

It works on yeast, fly, arabidopsis, mouse, human, anything with a GFF3. The web server can switch between organisms if you point it at a directory of them. The preprint is on bioRxiv. If it saves your group some compute time, that's the point and I'm glad :)