Generative AI

Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 in Segmentation and Depth Anything V3 in Metric Depth Estimation

For years, the computer vision community has worked on two separate tracks: generative models (which create images) and discriminative models (which understand them). The assumption was straightforward – models that are good at making pictures are not necessarily good at reading them. A new paper from Google, titled “Image Generators are Generalist Vision Learners” (arXiv:2604.20329), published on April 22, 2026, dispels that notion.

A team of Google DeepMind researchers has introduced Vision Banana, a single unified model that matches or outperforms state-of-the-art specialist systems across a wide range of visual perception tasks – including semantic segmentation, instance segmentation, metric depth estimation, and surface normal estimation – while retaining the photorealistic image generation ability of its base model.

The LLM Analogy That Changes Everything

If you've worked with large language models, you already know the two-stage playbook: first, pre-train a base model on massive text data with a generative objective, then apply instruction tuning to align it with downstream tasks. The pre-training phase is where the model develops a rich internal representation of language that can be steered toward almost anything.

The Google team's claim is that image-generation pre-training plays the same foundational role for perception. Their base model, Nano Banana Pro (NBP), is Google's state-of-the-art image generation and editing model. By performing a lightweight instruction-tuning pass – mixing a small amount of computer vision task data at a very low ratio with NBP's original training data – they created Vision Banana. The key insight: generating photorealistic images implicitly requires a model to understand geometry, semantics, depth, and object relationships. Vision Banana simply learns to surface that latent knowledge in measurable, decodable formats.

Importantly, no training data from any of the evaluated benchmarks is included in the instruction-tuning mix – ensuring that the results reflect genuine generalist ability rather than in-domain memorization.

How It Works: Seeing as Image Generation

Instead of adding specialized decoder heads or regression modules for each task, Vision Banana parameterizes every vision task's output as an RGB image. The model is instructed to produce visualizations that follow precise, invertible color schemes – meaning the generated images can be decoded back into quantitative predictions for benchmarking.

The research team identified three key advantages of this strategy. First, it supports a wide range of tasks in one unified model – after instruction tuning, only the prompt changes, not the weights. Second, it requires relatively little training data, since the instruction tuning only teaches the model how to format computer vision outputs as RGB. Third, it lets the model retain its photorealistic generation ability, since the outputs are still just RGB images.
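The practical upshot is a single inference path for every task. The sketch below is a hypothetical illustration of that prompt-switching idea: the `model.generate` call and the exact prompt strings are assumptions for illustration, not the released interface.

```python
# Hypothetical illustration: one model, one call signature, different prompts per task.
# `model.generate` and the prompt wording are assumptions, not a published API.

TASK_PROMPTS = {
    "semantic_seg": ("Generate the segmentation visualization of this image, "
                     "using the color map: {'cat': 'red', 'background': 'yellow'}."),
    "metric_depth": "Generate the metric depth visualization of this image.",
    "normals":      "Generate the surface normal visualization of this image.",
}

def run_task(model, image, task: str):
    # The weights never change; only the instruction does.
    return model.generate(image=image, prompt=TASK_PROMPTS[task])  # returns an RGB image
```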

For semantic segmentation, the model is given instructions like: “Generate the segmentation visualization of this image, using the color map: {'cat': 'red', 'background': 'yellow'}.” Each pixel is colored according to its predicted category, and because the color assignments are specified in the prompt, no fixed label vocabulary is required.
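Because the color map is declared in the prompt, decoding the generated image back into a label map reduces to a nearest-color lookup. A minimal sketch, assuming concrete RGB values for the named colors (the values and helper name are illustrative):

```python
import numpy as np

# Assumed RGB values for the colors named in the prompt (illustrative choices).
COLOR_MAP = {"cat": (255, 0, 0), "background": (255, 255, 0)}

def decode_semantic(rgb_image: np.ndarray) -> np.ndarray:
    """Assign each pixel to the class whose prompt color is nearest in RGB space."""
    classes = list(COLOR_MAP)
    palette = np.array([COLOR_MAP[c] for c in classes], dtype=np.float32)   # (C, 3)
    pixels = rgb_image.reshape(-1, 3).astype(np.float32)                    # (H*W, 3)
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=-1)
    labels = dists.argmin(axis=1).reshape(rgb_image.shape[:2])
    return labels  # integer class indices, indexable back into `classes`
```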

For instance segmentation, since the number of instances is not known in advance, Vision Banana uses a per-class inference strategy – it runs a separate pass for each class and dynamically assigns a distinct color to each instance. Masks are then recovered by grouping pixels of similar color under a distance threshold.
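Recovering instance masks from one per-class pass then amounts to clustering pixel colors. A rough sketch, assuming a simple greedy color-grouping scheme (the threshold value and background handling are illustrative, not the paper's exact procedure):

```python
import numpy as np

def masks_from_instance_colors(rgb_image: np.ndarray, threshold: float = 20.0):
    """Group pixels into instance masks by clustering similar colors (illustrative)."""
    h, w, _ = rgb_image.shape
    pixels = rgb_image.reshape(-1, 3).astype(np.float32)
    # Greedily collect cluster centers from the distinct colors in the generated image.
    centers = []
    for color in np.unique(pixels, axis=0):
        if all(np.linalg.norm(color - c) > threshold for c in centers):
            centers.append(color)
    # One boolean mask per cluster center (background filtering omitted for brevity).
    return [(np.linalg.norm(pixels - c, axis=1) <= threshold).reshape(h, w)
            for c in centers]
```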

Metric depth estimation uses a bijective mapping between unbounded metric depth values in [0, ∞) and bounded RGB values in [0, 1]³. A power transform (shape parameter λ = −3, scale parameter c = 10/3) first bends the raw metric depth values into a bounded range, which are then encoded as false colors traversing the RGB cube along a 3D Hilbert curve. The mapping is fully invertible, so a generated depth visualization can be decoded back into actual metric depth values. Importantly, no camera parameters – neither intrinsics nor extrinsics – are required during training or evaluation. The model infers absolute metric scale from visual cues and the world knowledge absorbed during pre-training. The depth training data is also entirely synthetic, generated by simulation engines – zero real-world depth data.
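The exact transform is not spelled out here beyond its parameters, but one simple invertible form consistent with the description maps depth d ∈ [0, ∞) to t ∈ [0, 1) and back. The functional form below is an assumption for illustration only, and the Hilbert-curve color lookup is noted as a placeholder:

```python
# Hedged sketch: an invertible power transform between metric depth and [0, 1).
# The form t = 1 - (1 + d/c)**lam is an assumption consistent with the stated
# parameters (lam = -3, c = 10/3); the paper's exact formula may differ.
LAM, C = -3.0, 10.0 / 3.0

def depth_to_unit(d):
    """Map metric depth d in [0, inf) to a bounded value t in [0, 1)."""
    return 1.0 - (1.0 + d / C) ** LAM

def unit_to_depth(t):
    """Invert the transform: recover metric depth from t in [0, 1)."""
    return C * ((1.0 - t) ** (1.0 / LAM) - 1.0)

# t is then looked up along a 3D Hilbert curve through the RGB cube, so nearby
# depths map to nearby colors; that lookup table is omitted from this sketch.
```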

For surface normal estimation, the mapping is the most straightforward of all: the unit normal vector components (x, y, z), each in [−1.0, 1.0], map naturally onto the RGB channels. Left-facing normals encode as red/pink, upward-facing normals as green, and normals pointing toward the camera as blue/purple.
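This mapping is simple enough to write in a line each way. A minimal sketch, assuming the straightforward affine encoding described above (helper names are illustrative):

```python
import numpy as np

def normals_to_rgb(normals: np.ndarray) -> np.ndarray:
    """Map unit-normal components in [-1, 1] onto RGB values in [0, 255]."""
    return np.clip((normals + 1.0) / 2.0 * 255.0, 0, 255).astype(np.uint8)

def rgb_to_normals(rgb: np.ndarray) -> np.ndarray:
    """Invert the mapping and re-normalize to unit length to absorb quantization error."""
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```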

The Numbers: Beating the Professionals at Their Own Game

Vision Banana's results across benchmarks – all in zero-shot transfer settings, where the model has never seen any training data from the evaluated dataset – are striking:

  • Semantic segmentation on Cityscapes val: mIoU of 0.699, compared to SAM 3's 0.652 – a gain of 4.7 points.
  • Referring expression segmentation on RefCOCOg UMD val: cIoU of 0.738, edging out SAM 3 Agent's 0.734.
  • Reasoning segmentation on ReasonSeg val: gIoU of 0.793, beating SAM 3 Agent's 0.770 – and substantially outperforming even non-zero-shot methods trained on in-domain data, including X-SAM.
  • Instance segmentation on SA-Co/Gold: pmF1 of 0.540, in line with DINO-X (0.552) and ahead of Gemini 2.5 (0.461), APE-D (0.369), and OWLv2 (0.420) under zero-shot transfer.
  • Metric depth estimation: an average δ1 accuracy of 0.882 across all six benchmarks; on the four datasets where Depth Anything V3 was evaluated (NYU, ETH3D, DIODE-Indoor, KITTI), Vision Banana scores 0.929 versus Depth Anything V3's 0.918 – while using zero real-world training data and no camera parameters.
  • Surface normal estimation: a mean angular error of 18.928° across four datasets, compared to Lotus-2's 19.642°. On indoor datasets specifically, Vision Banana achieves the lowest mean angular error (15.549°) and the lowest median angular error (9.300°) among all compared methods.

On generation benchmarks, Vision Banana holds its own against its base model: it achieves a 53.5% win rate against Nano Banana Pro on GenAI-Bench (text-to-image) and a 47.8% win rate on ImgEdit (image editing), where Nano Banana Pro scored 52.2%. Overall, the results confirm that the lightweight instruction tuning does not degrade the model's generative ability.

Key Takeaways

  • Image generation pre-training yields a generalist vision learner: Just as LLM pre-training unlocks broad language understanding, Google's research shows that image generation pre-training naturally builds strong internal visual representations that transfer to perception tasks such as segmentation, depth estimation, and surface normal estimation.
  • Vision Banana surpasses specialist models without architectural specialization: Built through lightweight instruction tuning of Nano Banana Pro, Vision Banana beats SAM 3 on three segmentation benchmarks, Depth Anything V3 on metric depth accuracy (δ1: 0.929 vs 0.918), and Lotus-2 on surface normal estimation (mean angular error 18.928° vs 19.642°) – all in zero-shot transfer settings.
  • All vision tasks are recast as image generation: By parameterizing vision task outputs as RGB images with decodable color schemes, Vision Banana uses a single set of weights and switches between semantic segmentation, instance segmentation, depth estimation, and surface normal estimation via the prompt alone – no task-specific modules required.
  • Metric depth estimation works without camera parameters or real-world data: Using a power transform to map depth values into the RGB color space, Vision Banana infers absolute metric scale from visual context alone – it needs no camera intrinsics or extrinsics and is trained entirely on synthetic data from simulation engines.
  • Image generation can serve as a universal interface for vision: Just as text generation unifies language tasks, image generation may serve as a universal output interface for computer vision – pointing to a paradigm shift in which generative pre-training powers foundation vision models for both generation and understanding.



Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
