Generative AI

Stanford Researchers Introduce BIOMEDICA: An AI Framework for Developing Biomedical Vision-Language Models with Large-Scale Datasets

The development of vision-language models (VLMs) in the biomedical domain is hindered by the lack of extensive, annotated, and publicly accessible datasets spanning diverse fields. Datasets built from biomedical literature, such as PubMed, tend to concentrate on domains like radiology and pathology while neglecting complementary areas such as molecular biology and pharmacogenomics that are essential for comprehensive clinical understanding. Privacy concerns, the cost and complexity of expert-level annotation, and methodological limitations further hinder the creation of comprehensive datasets. Previous efforts, such as ROCO, MEDICAT, and PMC-15M, rely on domain-specific filtering and supervised models to extract millions of image-caption pairs. However, these pipelines often fail to capture the full breadth of biomedical knowledge required to train generalist biomedical VLMs.

Beyond dataset limitations, training and evaluating biomedical VLMs presents its own challenges. Contrastive learning approaches such as PMC-CLIP and BiomedCLIP have shown promise by pretraining on literature-derived datasets to align images with text, but their performance is constrained by smaller datasets and limited computational resources compared with general-domain VLMs. Moreover, current evaluation practice, focused mainly on radiology and pathology tasks, lacks standardization and breadth. Reliance on narrow benchmarks and small test sets undermines the reliability of these evaluations, highlighting the need for scalable datasets and robust evaluation frameworks that can address the diverse requirements of biomedical vision-language applications.

Researchers from Stanford University have introduced BIOMEDICA, an open-source framework designed to extract, annotate, and organize the entire PubMed Central Open Access subset into a user-friendly dataset. The resulting archive includes over 24 million image-text pairs from 6 million articles, enriched with metadata and expert-guided annotations. The team also released BMCA-CLIP, a collection of CLIP-style models pretrained on BIOMEDICA via streaming, eliminating the need to store 27 TB of data locally. These models achieve state-of-the-art performance across 40 tasks, including radiology, dermatology, and molecular biology, with an average improvement of 6.56% in zero-shot classification and reduced computational requirements.
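CLIP-style models like BMCA-CLIP are typically trained with a symmetric contrastive (InfoNCE) objective that pulls matched image-text pairs together and pushes mismatched pairs apart. The sketch below is a minimal NumPy illustration of that objective, not the authors' implementation; the function name and temperature value are illustrative.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used by CLIP-style models.

    img_emb, txt_emb: (N, D) arrays; row i of each modality is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature  # (N, N) similarity matrix

    def xent_diag(l):
        # cross-entropy where the correct "class" for row i is column i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the image->text and text->image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Well-aligned embeddings yield a loss near zero, while shuffled (mismatched) pairs drive it up, which is what makes the objective usable for zero-shot classification later.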

The BIOMEDICA data pipeline comprises dataset extraction, concept labeling, and serialization. Articles and media files are downloaded from the NCBI server, and metadata, abstracts, and figure captions are extracted from the nXML files and the Entrez API. Images are clustered using DINOv2 embeddings and labeled with an expert-refined hierarchical taxonomy; labels are assigned by majority vote and propagated across the clusters. The dataset, which contains more than 24 million image-caption pairs and extensive metadata, is serialized in the WebDataset format for efficient distribution. With 12 global and 170 local image concepts, the taxonomy covers categories such as clinical imaging, microscopy, and data visualization, emphasizing scalability and accessibility.
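The majority-vote step described above can be sketched as follows: experts label a small subset of images, and each cluster then inherits the most common expert label among its annotated members. This is a toy illustration under that assumption; the function name and the "unlabeled" fallback are hypothetical, not BIOMEDICA's actual code.

```python
import numpy as np
from collections import Counter

def propagate_labels(cluster_ids, expert_labels):
    """Propagate sparse expert labels to all items via per-cluster majority vote.

    cluster_ids: length-N array assigning each item to a cluster.
    expert_labels: dict {item_index: label} for a small annotated subset.
    Returns a length-N array of labels, one per item.
    """
    cluster_ids = np.asarray(cluster_ids)
    labels = np.empty(len(cluster_ids), dtype=object)
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        # collect expert votes available inside this cluster
        votes = [expert_labels[i] for i in members if i in expert_labels]
        winner = Counter(votes).most_common(1)[0][0] if votes else "unlabeled"
        labels[members] = winner  # every member inherits the majority label
    return labels
```

The design lets a handful of expert annotations cover millions of images, since each cluster only needs a few votes to be labeled.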

A continual pretraining evaluation on the BIOMEDICA dataset used 39 established biomedical classification tasks plus a newly curated retrieval dataset built from Flickr, for a total of 40 datasets. The classification benchmarks span pathology, radiology, biology, surgery, dermatology, and ophthalmology tasks, with metrics such as average classification accuracy and recall at 1, 10, and 100. Concept filtering, which excludes over-represented topics, performed better than concept balancing or pretraining on the full dataset. Models trained on BIOMEDICA achieved state-of-the-art results, surpassing prior methods on classification, retrieval, and microscopy tasks while using less data and compute.
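The retrieval metric used above, recall at k, measures how often the correct caption appears among a query image's top-k retrieved candidates. A minimal sketch, assuming a square similarity matrix in which candidate i is the true match for query i:

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 10, 100)):
    """similarity[i, j]: score between query i and candidate j;
    the correct match for query i is candidate i (paired data).
    Returns {k: fraction of queries whose match ranks in the top k}.
    """
    n = similarity.shape[0]
    # rank candidates for each query, highest score first
    ranking = np.argsort(-similarity, axis=1)
    # position of the correct candidate in each query's ranking (0 = best)
    correct_rank = np.array([np.where(ranking[i] == i)[0][0] for i in range(n)])
    return {k: float(np.mean(correct_rank < k)) for k in ks}
```

With a perfect retriever (identity-like similarity) recall@1 is 1.0; noisier scores lower recall@1 while recall at larger k recovers, which is why benchmarks report several cutoffs.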

In conclusion, BIOMEDICA is a comprehensive framework that transforms the PubMed Central Open Access (PMC-OA) subset into the largest deep-learning-ready biomedical dataset, containing 24 million image-caption pairs enriched with 27 metadata fields. Designed to address the shortage of diverse, annotated biomedical datasets, BIOMEDICA provides an open-source pipeline to extract and annotate multimodal data from more than 6 million articles. By continually pretraining CLIP-style models on BIOMEDICA, the framework achieves state-of-the-art zero-shot classification and image-text retrieval across 40 biomedical tasks while requiring 10 times less compute and 2.5 times less data. All resources, including models, datasets, and code, are publicly available.


Check out the Paper and Project Page. All credit for this study goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in practical problem-solving, he brings a fresh perspective to the intersection of AI and real-life solutions.

