Zhipu AI releases GLM-4.6V: a 128K-context vision language model series with native multimodal tool use

Zhipu AI has open-sourced the GLM-4.6V series, a pair of vision language models that treat images, video and tools as first-class inputs for agents, rather than content to be flattened into text.
Model lineup and context window
The series has two models. GLM-4.6V is a 106B-parameter model aimed at cloud deployment and high-capability use cases. GLM-4.6V-Flash is a 9B-parameter variant tuned for local deployment and low-latency usage.
GLM-4.6V expands the training context window to 128K tokens. In practice that supports roughly 150 pages of dense documents, 200 slide pages or one hour of video in a single pass, because pages and frames are encoded through the vision encoder.
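As a rough sanity check on those figures, here is a back-of-envelope calculation using only the numbers quoted above; the per-item token costs are implied averages, not values published by Zhipu AI.

```python
# Rough per-item token budgets implied by the 128K context window and the
# capacity figures quoted above (150 dense pages, 200 slides, 1 hour of video).
# These are implied averages for illustration, not official model-card numbers.
CONTEXT_TOKENS = 128_000

budgets = {
    "dense document page": CONTEXT_TOKENS / 150,   # ~850 tokens per page
    "slide page": CONTEXT_TOKENS / 200,            # ~640 tokens per slide
    "second of video": CONTEXT_TOKENS / 3600,      # ~35 tokens per second
}

for item, tokens in budgets.items():
    print(f"~{tokens:,.0f} tokens per {item}")
```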
Native multimodal tool use
The biggest technical change is native multimodal tool calling. Traditional tool use in LLM systems routes everything through text: images or pages are first turned into descriptions, the model calls tools with text arguments and reads back text responses. This loses information and adds latency.
GLM-4.6V instead makes multimodal content a first-class citizen of tool calls. Images, screenshots and document pages can be passed directly as tool parameters. Tools can return grids of search results, charts, rendered web pages or product images. The model consumes those visuals and reasons over them together with text in the same chain of thought. This closes the loop from seeing to understanding to execution, and Zhipu AI explicitly positions it as a bridge between visual perception and action for multimodal agents.
To support this, Zhipu AI extends the Model Context Protocol (MCP) with URL-based multimodal handling. Tools can send and receive URLs that point to images or to specific video frames, which avoids file size limits and allows precise selection within a multi-image context.
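As a concrete illustration of what URL-based handling can look like on the tool side, here is a hypothetical JSON-schema definition for a crop tool whose input and output are both image URLs rather than inlined bytes; the tool name and fields are invented for this sketch and are not part of Zhipu AI's official MCP extension.

```python
# Hypothetical tool schema illustrating URL-based multimodal handling:
# the model passes an image *reference* (a URL) plus a region, and the tool
# answers with another URL, so no raw pixel data travels through the context.
# Names and fields are illustrative, not Zhipu AI's official MCP extension.
crop_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Crop a region from an image identified by URL and "
                       "return a URL pointing to the cropped result.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_url": {"type": "string", "description": "Source image URL."},
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "minItems": 4,
                    "maxItems": 4,
                    "description": "Region as [x_min, y_min, x_max, y_max] in pixels.",
                },
            },
            "required": ["image_url", "bbox"],
        },
    },
}
```

Because the tool receives a reference rather than the pixels themselves, the model can single out one image in a many-image context simply by naming its URL, which is the selection behavior described above.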
Rich content creation, visual web search, frontend coding and long documents
Zhipu AI's research team describes four flagship use cases:
First, rich image-text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports or slide decks and produces structured output with text and images interleaved. It understands text, charts, figures, tables and formulas in the same document. During generation it can call tools to render visualizations or retrieve external images, then run a visual review step that filters out low-quality images and composes the final article with figures placed inline.
Second, visual web search. The model can parse user intent, plan which search tools to call, and combine text search with image search. It re-ranks images and text, selects relevant evidence and produces structured answers, for example visual comparisons of products or locations.
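A minimal sketch of that loop is shown below, assuming GLM-4.6V is served behind an OpenAI-compatible endpoint (for example via vLLM). The endpoint, model id, tool name and the ability of a tool message to carry image parts are all assumptions, so treat this as the shape of the interaction rather than a documented API.

```python
# Sketch of a visual web search loop: the model asks for a search, the tool
# returns image URLs, and the results are fed back as image content so the
# model can inspect and re-rank them visually. Endpoint, model id and tool
# handling are assumptions; check the actual serving stack's documentation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "image_search",   # hypothetical search tool
        "description": "Search the web and return a list of result image URLs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Compare the two phones visually and tell me which has the larger camera bump."}]

first = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# Execute the search ourselves; placeholder URLs stand in for real results.
result_urls = ["https://example.com/phone_a.jpg", "https://example.com/phone_b.jpg"]

messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": [{"type": "image_url", "image_url": {"url": u}} for u in result_urls],
})

answer = client.chat.completions.create(model="glm-4.6v", messages=messages, tools=tools)
print(answer.choices[0].message.content)
```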
Third, frontend coding and visual interaction. GLM-4.6V is designed for design-to-code workflows. From a UI screenshot it can reconstruct pixel-accurate HTML, CSS and JavaScript. Developers can then mark a region on the screenshot and issue natural language commands, for example move this button to the left or change this card's background. The model maps these commands back onto the code and returns an updated snippet.
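The visual interaction step can be sketched as a single chat turn that carries the screenshot, the marked region and the natural language command together. The message layout below assumes the same OpenAI-compatible serving setup as the previous sketch and an invented coordinate convention; it is not Zhipu AI's documented recipe.

```python
# Sketch of a region-targeted UI edit: screenshot + marked region + command
# in, updated code snippet out. The endpoint, model id and bounding-box
# convention are assumptions made for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

current_snippet = '<button class="cta" style="float: right;">Buy now</button>'

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/checkout_page.png"}},
        {"type": "text",
         "text": "The marked region is the bounding box [820, 40, 980, 90] in pixels. "
                 "Move this button to the left side of the header and return only the "
                 f"updated snippet.\n\nCurrent code:\n{current_snippet}"},
    ],
}]

reply = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(reply.choices[0].message.content)   # expected: the revised HTML/CSS snippet
```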
Fourth, long-context multimodal document comprehension. GLM-4.6V can read multi-document input up to the 128K-token limit by treating pages as images. The research team reports a case where the model processes financial reports from four public companies, extracts key metrics and builds a complete comparison table, while retaining the ability to answer questions about specific figures and dates.
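Reproducing that scenario mostly comes down to packing every page of every report into one request as images. The sketch below renders PDFs to page images with pypdfium2 and sends them in a single turn; the filenames, endpoint and model id are placeholders, and the data-URL image format is a choice made for this example.

```python
# Sketch: pack multi-company report pages into one long-context request.
# PDF rendering via pypdfium2 and base64 data URLs are example choices; the
# serving endpoint, model id and filenames are assumptions.
import base64, io
import pypdfium2 as pdfium
from openai import OpenAI

def pages_as_data_urls(pdf_path, scale=2.0):
    """Render each page of a PDF to a base64 PNG data URL."""
    doc = pdfium.PdfDocument(pdf_path)
    urls = []
    for i in range(len(doc)):
        image = doc[i].render(scale=scale).to_pil()
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        urls.append("data:image/png;base64," + base64.b64encode(buf.getvalue()).decode())
    return urls

reports = ["acme_2024_annual.pdf", "globex_2024_annual.pdf"]  # placeholder filenames
content = [{"type": "image_url", "image_url": {"url": url}}
           for path in reports for url in pages_as_data_urls(path)]
content.append({
    "type": "text",
    "text": "Extract revenue, operating margin and net income for each company "
            "and return a single comparison table, noting the reporting period used.",
})

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="glm-4.6v",
    messages=[{"role": "user", "content": content}],
)
print(reply.choices[0].message.content)
```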
Architecture, data and reinforcement learning
The GLM-4.6V models belong to the GLM-V family and build on the GLM-4.5V and GLM-4.1V-Thinking technical reports. The research team highlights three key technical ingredients.
First, long-context training. GLM-4.6V expands the training context window to 128K tokens and pre-trains on large interleaved image-text corpora. It uses compression alignment ideas from Glyph so that visual tokens can carry more information than the corresponding text tokens.
Second, world knowledge enhancement. The Zhipu AI team adds billion-scale multimodal perception and world knowledge data during pre-training. This includes encyclopedic concepts and everyday visual scenes. The stated goal is to improve grounded visual recognition and cross-modal question answering in general use, not just on benchmarks.
Third, agentic data synthesis and extended MCP. The research team generates large volumes of synthetic trajectories in which the model calls tools, processes visual outputs and iterates over plans. They extend MCP with URL-based multimodal handling and centralized handling of tool outputs. The generation stack follows a draft, image selection, final polish sequence. The model can autonomously call cropping or search tools between these stages to place images at the correct positions in the output.
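A compressed view of that draft, image selection, polish sequence might look like the loop below, where each stage is one prompt and the model is free to call the same crop or search tools in between. The stage wording and the helper function are invented to show the control flow, not Zhipu AI's actual synthesis stack.

```python
# Sketch of the draft -> image selection -> polish generation stack described
# above. run_stage is a caller-supplied function standing in for one model
# call that may also execute any tool calls (crop, search) the model emits.
# Stage wording and control flow are illustrative, not Zhipu AI's pipeline.
STAGES = [
    "Draft the article from the source material, leaving <image> placeholders.",
    "For each <image> placeholder, call search or crop tools, inspect the "
    "candidates and keep only images that match the surrounding text.",
    "Polish the layout, drop low-quality images and emit the final interleaved "
    "article with the chosen image URLs inline.",
]

def run_pipeline(source_material, run_stage):
    state = source_material
    for prompt in STAGES:
        # Each stage sees the previous stage's output plus its own instruction.
        state = run_stage(prompt, state)
    return state

# Trivial stub so the sketch runs without a model server.
print(run_pipeline("raw notes ...", lambda prompt, state: state + "\n[stage applied]"))
```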
Tool-call reliability is part of the reinforcement learning objective. GLM-4.6V uses RL to strengthen planning, tool-call correctness and output formatting across complex toolchains.

Key takeaways
- GLM-4.6V is a 106B-parameter multimodal foundation model with a 128K-token context window, and GLM-4.6V-Flash is a 9B variant designed for local deployment and low latency.
- Both models support native multimodal tool calling, so tools can consume and return images, video frames and document pages directly, linking visual perception to agentic action.
- GLM-4.6V is trained for long-context multimodal understanding and interleaved generation, so it can read large mixed-document inputs and produce structured image-text output with inline, tool-selected visuals in one pass.
- The series reaches state-of-the-art performance on major multimodal benchmarks at comparable parameter scales and is released as open weights under the MIT license on Hugging Face and ModelScope.
Check out the model card on Hugging Face and the technical details for more information.



