InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model Designed for 1 Mb Context Length at Single-Nucleotide Resolution

Genomic prediction and design now require models that link local motifs to megabase-scale regulatory context and that work across many organisms. Nucleotide Transformer v3 (NTv3) is InstaDeep's new multi-species genomics model for this setting. It combines representation learning, functional track and gene annotation prediction, and controllable sequence generation in a single backbone operating over a 1 Mb context at single-nucleotide resolution.
Previous Nucleotide Transformer models already showed that self-supervised training on thousands of genomes yields strong predictors of molecular phenotype. The original series spanned 50M to 2.5B parameters, trained on 3,200 human genomes plus an additional 850 genomes from various species. NTv3 keeps this sequence-only pre-training recipe but extends it to much longer contexts and adds supervised functional prediction and a generation mode.

Architecture for 1 Mb genomic windows
NTv3 uses a U-Net-style architecture targeted at very long genomic windows. A convolutional down-sampling tower compresses the input sequence, a transformer stack models long-range dependencies in that compressed space, and a deconvolutional up-sampling tower restores base-level resolution for prediction and generation. Inputs are character-level tokens covering A, T, C, G, N plus a small set of special tokens. The sequence length must be a multiple of 128 tokens, and the reference implementation pads inputs to enforce this constraint. All public checkpoints use single-nucleotide tokenization with a vocabulary size of 11 tokens.
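The 128-token divisibility constraint is consistent with a U-Net that applies seven stride-2 down-sampling stages (2^7 = 128). A minimal sketch of the tokenize-and-pad step is shown below; the token IDs and the `<pad>` name are hypothetical, since the article does not enumerate the actual special tokens in the 11-token vocabulary:

```python
# Sketch of character-level tokenization with right-padding to a multiple
# of 128. Token IDs are hypothetical: NTv3's real vocabulary (size 11)
# covers A, T, C, G, N plus special tokens not enumerated in the article.
VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3, "N": 4, "<pad>": 5}
BLOCK = 128  # 7 down-sampling stages of stride 2 compress by 2**7 = 128


def tokenize_and_pad(seq: str, block: int = BLOCK) -> list[int]:
    """Map a DNA string to token IDs and right-pad to a multiple of `block`."""
    ids = [VOCAB.get(base.upper(), VOCAB["N"]) for base in seq]
    remainder = len(ids) % block
    if remainder:
        ids.extend([VOCAB["<pad>"]] * (block - remainder))
    return ids


tokens = tokenize_and_pad("ACGTN" * 30)  # 150 bases -> padded to length 256
```

A 150-base input lands between 128 and 256, so it is padded up to 256; an input that is already a multiple of 128 passes through unchanged.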
The smallest public model, NTv3 8M (pre-trained), has about 7.69M parameters with a 256 hidden dimension, 1,024 FFN dimension, 2 transformer layers, 8 attention heads, and 7 down-sampling stages. The larger NTv3 650M uses a 1,536 hidden dimension, 6,144 FFN dimension, 12 transformer layers, 24 attention heads, and the same 7 down-sampling stages, and adds dedicated heads for some prediction tasks.
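The two public configurations can be captured in a small config record, with a rough weight count for the transformer stack alone (attention projections plus FFN matrices, ignoring the convolutional towers, embeddings, norms, and biases — which account for the remainder of each total). The field and method names are illustrative, not the repository's actual config schema:

```python
from dataclasses import dataclass


@dataclass
class NTv3Config:
    # Field names are illustrative, not InstaDeep's actual config schema.
    hidden_dim: int
    ffn_dim: int
    num_layers: int
    num_heads: int
    downsample_stages: int

    def approx_transformer_params(self) -> int:
        """Rough per-layer weight count: 4 attention projections (d*d each)
        plus a 2-matrix FFN (d*ffn each). Conv towers, embeddings, norms,
        and biases are excluded."""
        per_layer = 4 * self.hidden_dim**2 + 2 * self.hidden_dim * self.ffn_dim
        return self.num_layers * per_layer


ntv3_8m = NTv3Config(256, 1024, 2, 8, 7)      # ~1.57M in the transformer stack
ntv3_650m = NTv3Config(1536, 6144, 12, 24, 7)  # ~340M in the transformer stack
```

For the 8M model this accounting yields roughly 1.57M transformer-stack parameters out of the 7.69M total, suggesting the convolutional towers and embeddings carry much of the small model's capacity.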
Training data
NTv3 was pre-trained on 9 trillion base pairs from the OpenGenome2 resource using a masked language modeling objective at base resolution. After this stage, the model is post-trained on a joint objective that combines continued self-supervision with supervised learning on approximately 16,000 functional tracks and annotation labels from 24 animal and plant species.
Performance and the NTv3 Benchmark
After post-training, NTv3 achieves state-of-the-art accuracy for functional track prediction and gene annotation across species. It outperforms strong specialized sequence-to-function models and previous genomics foundation models on existing public benchmarks and on the new NTv3 Benchmark, a controlled fine-tuning suite with standardized 32 kb input windows and base-resolution outputs.
The NTv3 Benchmark currently contains 106 long-range, single-nucleotide, cross-assay, multi-species tasks. Because NTv3 sees thousands of tracks across 24 species during post-training, the model learns a shared regulatory grammar that transfers across organisms and assays and supports long-range genome-to-function modeling.
From prediction to controllable sequence generation
Beyond prediction, NTv3 can be efficiently fine-tuned into a controllable generative model via its masked language modeling objective. In this mode the model receives conditioning signals, such as desired activity levels and promoter selectivity, and fills in masked spans of the DNA sequence consistently with those conditions.
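Conceptually, this conditional in-filling resembles masked-language-model decoding with extra conditioning inputs. The sketch below is a hypothetical interface, not the real checkpoint API: a caller-supplied `score_fn` stands in for the conditioned model's per-base likelihood, and masked positions are filled greedily:

```python
MASK = "?"  # stand-in for a mask token; the real special token differs
BASES = "ACGT"


def fill_masked_span(seq: str, score_fn) -> str:
    """Greedy in-fill: repeatedly pick the (position, base) pair that
    `score_fn` rates highest among remaining masked positions. In NTv3,
    the scoring would come from the model conditioned on signals such as
    target activity level and promoter selectivity; here it is supplied
    by the caller."""
    seq = list(seq)
    while MASK in seq:
        i, b = max(
            ((i, b) for i, c in enumerate(seq) if c == MASK for b in BASES),
            key=lambda ib: score_fn("".join(seq), *ib),
        )
        seq[i] = b
    return "".join(seq)


# Toy scorer that simply prefers G/C, just to make the sketch runnable.
def toy_score(seq: str, pos: int, base: str) -> float:
    return 1.0 if base in "GC" else 0.0


designed = fill_masked_span("AC???TG", toy_score)  # masked span filled with C/G
```

A real conditioned model would replace `toy_score` with likelihoods that reflect the requested enhancer behavior, and sampling rather than greedy argmax would yield diverse candidate designs.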
In experiments described in the release materials, the team designed 1,000 enhancer sequences with specified activity levels and promoter selectivity, and validated them in vitro with STARR-seq assays in collaboration with the Stark lab. The generated enhancers recovered the target ordering of activity levels and achieved more than a 2x improvement in promoter specificity over baselines.
Key Takeaways
- NTv3 is a long-range, multi-species genomics model: it integrates representation learning, functional track prediction, gene annotation, and controllable sequence generation in a single U-Net-style architecture that supports 1 Mb single-nucleotide context across 24 animal and plant species.
- The model is trained on 9 trillion base pairs with both self-supervised and supervised objectives: NTv3 is pre-trained on 9 trillion base pairs from OpenGenome2 with masked language modeling, then post-trained on more than 16,000 functional tracks and annotation labels from 24 species using a joint objective that combines continued self-supervision with supervised learning.
- NTv3 leads the NTv3 Benchmark: after post-training, NTv3 reaches state-of-the-art accuracy for functional track prediction and gene annotation across species, surpassing previous sequence-to-function models and genomics foundation models on public benchmarks and on the NTv3 Benchmark, which contains 106 standardized long-range tasks with 32 kb input windows and base-resolution outputs.
- The same backbone supports controllable enhancer design validated with STARR-seq: NTv3 can be fine-tuned as a controllable generative model via masked language modeling to design enhancer sequences with specified activity levels and promoter selectivity, and these designs were validated in STARR-seq assays that confirmed the target activity ordering and improved promoter specificity.
Check out the Repo, the Model on Hugging Face, and the technical details.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of Marktechpost, an Artificial Intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news in a way that is both technically sound and easily understandable to a wide audience. The platform draws more than 2 million monthly views, reflecting its popularity among readers.



