Evola: An 80B-Parameter Multimodal Protein-Language Model for Decoding Protein Functions through Natural Language Dialogue
Proteins, molecular machines refined over millions of years of evolution, perform the functions that sustain life; those functions are encoded in their sequences and realized through their 3D structures. Delineating their functional mechanisms remains a central challenge in biology despite advances in experimental and computational tools. Although AlphaFold and similar models have revolutionized structure prediction, the gap between structural knowledge and functional understanding persists, compounded by the rapid growth of unannotated protein sequences. Traditional annotation tools rely on evolutionary homology, which limits their scope. Emerging protein language models promise to let deep learning read the "language" of proteins, but they are hindered by training data that is limited, heterogeneous, and lacking in context.
Researchers from Westlake University and Nankai University have developed Evola, a multimodal protein language model designed to interpret protein molecular mechanisms through natural language dialogue. Evola combines a protein language model (PLM) as an encoder, an LLM as a decoder, and an intermediate alignment module, enabling accurate protein function prediction. Trained on an unprecedented dataset of 546 million protein question-answer pairs spanning 150 billion tokens, Evola uses Retrieval-Augmented Generation (RAG) to ground responses in external knowledge and Direct Preference Optimization (DPO) to align responses with preferred answers. Evaluated with the novel Instructional Response Space (IRS) framework, Evola provides expert-level insights, advancing proteomics research.
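The retrieval-augmented setup described above can be sketched as follows: a query protein's embedding is compared against a database of annotated proteins, and the most similar annotations are prepended to the question as context. This is a minimal illustration, not Evola's actual implementation; the embeddings, database entries, and helper names are all hypothetical.

```python
import numpy as np

def cosine_sim(query, matrix):
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Hypothetical annotation database (Swiss-Prot-style function texts)
# paired with precomputed protein embeddings (random here for illustration).
rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(5, 16))
db_annotations = [
    "ATP-binding kinase involved in signal transduction",
    "Membrane transporter mediating glucose uptake",
    "DNA-binding transcription factor",
    "Heat-shock chaperone assisting protein folding",
    "Serine protease cleaving peptide bonds",
]

def retrieve(query_emb, k=2):
    """Return the top-k annotations most similar to the query protein."""
    scores = cosine_sim(query_emb, db_embeddings)
    top = np.argsort(scores)[::-1][:k]
    return [db_annotations[i] for i in top]

# A query embedding close to database entry 3 retrieves its annotation,
# which is then folded into the prompt handed to the language model.
query = db_embeddings[3] + 0.01 * rng.normal(size=16)
context = retrieve(query)
prompt = ("Retrieved context:\n- " + "\n- ".join(context) +
          "\nQuestion: What is the function of this protein?")
```

In the real system the retrieved annotations condition the LLM's generation rather than being concatenated into a plain-text prompt, but the similarity-search step follows the same pattern.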
Evola is a multimodal model designed to answer questions about protein function. It combines protein-specific representations with an LLM to produce accurate, context-aware answers. Architecturally, Evola consists of a frozen protein encoder, a trainable sequence compressor and aligner, and a pre-trained LLM decoder. It applies DPO with GPT-scored preference pairs and RAG over Swiss-Prot and ProTrek-derived datasets to improve response accuracy. Applications include protein function annotation, enzyme classification, gene ontology prediction, subcellular localization, and disease association. Evola is available in two versions: a 10B-parameter model and an 80B-parameter model.
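The encoder–compressor–decoder arrangement can be sketched as a cross-attention resampler: a fixed set of learned latent queries attends over the variable-length stack of per-residue embeddings from the frozen protein encoder, producing a fixed number of tokens projected into the LLM's embedding space. The dimensions and random weights below are purely illustrative, not the published configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_PROT, D_LLM, N_LATENT = 32, 48, 8  # illustrative sizes

# Trainable parts (random init here): latent queries plus projections.
latents = rng.normal(size=(N_LATENT, D_PROT))
W_q = rng.normal(size=(D_PROT, D_PROT)) / np.sqrt(D_PROT)
W_k = rng.normal(size=(D_PROT, D_PROT)) / np.sqrt(D_PROT)
W_v = rng.normal(size=(D_PROT, D_PROT)) / np.sqrt(D_PROT)
W_align = rng.normal(size=(D_PROT, D_LLM)) / np.sqrt(D_PROT)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_and_align(residue_emb):
    """Cross-attend N_LATENT learned queries over L residue embeddings
    from the frozen encoder, then project into the LLM token space."""
    q = latents @ W_q                          # (N_LATENT, D_PROT)
    k = residue_emb @ W_k                      # (L, D_PROT)
    v = residue_emb @ W_v                      # (L, D_PROT)
    attn = softmax(q @ k.T / np.sqrt(D_PROT))  # (N_LATENT, L)
    compressed = attn @ v                      # fixed-size summary
    return compressed @ W_align                # (N_LATENT, D_LLM)

# Proteins of any length map to the same number of LLM input tokens.
short_out = compress_and_align(rng.normal(size=(50, D_PROT)))
long_out = compress_and_align(rng.normal(size=(400, D_PROT)))
```

The point of this design is that the LLM decoder always receives a constant-size protein representation, regardless of sequence length, so protein tokens can be interleaved with text tokens.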
The study introduces Evola, an advanced 80-billion-parameter protein language model designed to decode protein functions through natural language dialogue. Evola comprises a protein language model as the encoder, a large language model as the decoder, and an intermediate module for compression and alignment. It uses RAG to incorporate external knowledge and DPO to refine response quality based on preference signals. Experiments under the IRS framework demonstrate Evola's ability to generate precise, contextually relevant answers about protein function, advancing proteomics and functional genomics research.
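Direct Preference Optimization trains the policy model so that preferred responses become more likely than rejected ones relative to a frozen reference model. A minimal numeric sketch of the standard DPO loss follows; the log-probabilities are made-up scalars, not real model outputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * margin), where the
    margin compares policy-vs-reference log-ratios of the two responses."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before training, policy equals reference: zero margin, loss = ln(2).
before = dpo_loss(-12.0, -11.5, -12.0, -11.5)
# After training, the preferred answer is favored and the loss falls.
after = dpo_loss(-9.0, -14.0, -12.0, -11.5)
```

Because only log-probability differences enter the margin, DPO needs no explicit reward model: the preference signal (here, GPT-scored pairs per the article) is consumed directly.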
The results show that Evola outperforms existing models in protein function prediction and natural-language question answering. Tested across diverse datasets, Evola achieved state-of-the-art performance in generating accurate, context-aware answers to protein-related queries. Evaluation under the IRS framework showed high accuracy, interpretability, and consistency in its responses. Qualitative analysis highlighted Evola's ability to handle nuanced questions and produce protein annotations comparable to expert-curated information. In addition, ablation studies confirmed the effectiveness of its training techniques, including retrieval-augmented generation and direct preference optimization, in improving response quality and alignment with biological context. Together, these findings establish Evola as a robust tool for proteomics.
In conclusion, Evola is an 80-billion-parameter generative protein language model designed to decode the molecular language of proteins. Through natural language dialogue, it connects protein sequences, structures, and biological functions. Evola's innovation lies in training on an AI-generated dataset of 546 million question-answer pairs covering 150 billion tokens, unprecedented in scale. DPO and RAG improve response quality and integrate external knowledge. Evaluated with IRS, Evola delivers expert-level insights, advancing proteomics and functional genomics while providing a powerful tool to unravel the molecular complexity of proteins and their biological roles.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He brings a fresh perspective to the intersection of AI and practical solutions.