DeepSeek has just released a 3B OCR model: a 3B VLM designed for high-quality OCR and structured document conversion

DeepSeek-AI released DeepSeek-OCR, an end-to-end OCR and document parsing system built as a 3B vision-language model (VLM) that compresses long documents into a small set of vision tokens, then decodes those tokens with a language model. The idea is simple: images carry compact optical representations of the text, which shortens the decoder sequence. The research team reports about 97% decoding precision when text tokens are within 10 times the number of vision tokens on the Fox benchmark, and useful behavior even at 20 times compression. It also reports competitive results on OmniDocBench with far fewer tokens than common baselines.

Architecture, what's new?
DeepSeek-OCR-3B consists of two components, a vision encoder named DeepEncoder and a Mixture of Experts decoder named DeepSeek3B-MoE-A570M. The encoder is designed for high-resolution inputs at low activation cost with few output tokens. It uses a window attention stage based on SAM for local perception, a 2-layer convolutional compressor that downsamples tokens 16×, and a global attention stage based on CLIP for visual knowledge aggregation. This design keeps activation memory controlled at high resolution and keeps the vision token count low. The decoder is a 3B-parameter MoE model (DeepSeek3B-MoE-A570M) with about 570M active parameters per token.
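To make the token arithmetic concrete, here is a minimal sketch of the 16× compression step. The 16-pixel patch size is an assumption for illustration, chosen because it reproduces the published per-mode token counts, not a detail confirmed by the released code.

```python
# Back-of-envelope token math for DeepEncoder (illustrative only).
# Assumption: a 16x16 patch embedding feeds the SAM-based local stage,
# then the 2-layer convolutional compressor reduces tokens 16x before
# the CLIP-based global stage and the decoder.

def vision_tokens(height: int, width: int, patch: int = 16, compression: int = 16) -> int:
    patch_tokens = (height // patch) * (width // patch)  # tokens entering the local stage
    return patch_tokens // compression                   # tokens after the 16x compressor

# A 1024x1024 page yields 4096 patch tokens locally, but only 256
# tokens ever reach global attention and the decoder.
print(vision_tokens(1024, 1024))  # 256
print(vision_tokens(640, 640))    # 100
print(vision_tokens(512, 512))    # 64
```

This is why the encoder stays cheap at high resolution: the expensive global attention only ever sees the post-compression tokens.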


Multi-resolution modes, engineered for token budgets
DeepEncoder supports native modes and dynamic modes. Native modes are Tiny with 64 tokens at 512 by 512 pixels, Small with 100 tokens at 640 by 640, Base with 256 tokens at 1024 by 1024, and Large with 400 tokens at 1280 by 1280. Dynamic Gundam mode tiles a page into n local views plus one global view, yielding n×100+256 tokens, or n×256+400 tokens for Gundam-Master, with n ranging from 2 to 9. These modes let AI developers and researchers align token budgets with page complexity.
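A small helper like the following (hypothetical, for budget planning only, with the formulas taken from the figures above) makes those budgets explicit:

```python
# Hypothetical helper mapping DeepSeek-OCR modes to vision-token budgets,
# based on the mode figures reported by the research team.

def token_budget(mode: str, n_tiles: int = 0) -> int:
    native = {"tiny": 64, "small": 100, "base": 256, "large": 400}
    if mode in native:
        return native[mode]
    if mode == "gundam":          # n local tiles at 100 tokens + one global view at 256
        assert 2 <= n_tiles <= 9, "the paper reports n between 2 and 9"
        return n_tiles * 100 + 256
    if mode == "gundam-master":   # n local tiles at 256 tokens + one global view at 400
        assert 2 <= n_tiles <= 9
        return n_tiles * 256 + 400
    raise ValueError(f"unknown mode: {mode}")

print(token_budget("small"))              # 100
print(token_budget("gundam", n_tiles=4))  # 656
```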




Compression results, what the numbers mean
The Fox benchmark study measures precision as exact text match after decoding. With 100 vision tokens, pages with 600 to 700 text tokens reach 98.5% precision at 6.7× compression, and pages with 900 to 1000 text tokens reach 96.8% precision at 9.7× compression. With 64 vision tokens, precision decreases as compression increases, for example 59.1% at about 19.7× on pages with 1200 to 1300 text tokens. These values appear directly in Table 2.
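The compression ratio here is simply text tokens divided by vision tokens, so a quick check reproduces the ratios in the table (the per-range page sizes below are representative midpoints, not exact values from the paper):

```python
# Compression ratio = ground-truth text tokens / vision tokens.
def compression(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

print(round(compression(670, 100), 1))  # ~6.7x, where precision is 98.5%
print(round(compression(970, 100), 1))  # ~9.7x, where precision is 96.8%
print(round(compression(1260, 64), 1))  # ~19.7x, where precision falls to 59.1%
```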


On OmniDocBench, the research team reports that DeepSeek-OCR surpasses GOT-OCR 2.0 when using 100 vision tokens per page, and that under 800 vision tokens it outperforms MinerU 2.0, which uses more than 6,000 tokens per page on average. The benchmark section reports overall performance across these token budget ranges.


Training details that matter
The research team describes a two-phase training pipeline. It first trains DeepEncoder with next-token prediction on OCR 1.0 and OCR 2.0 data and 100M LAION samples, then trains the full system with pipeline parallelism. For hardware, the run used 20 nodes, each with 8 A100 40G GPUs, and used AdamW. The team reports a training speed of 90B tokens per day on text-only data and 70B tokens per day on multimodal data. In production, it reports the ability to generate more than 200k pages per day on a single A100 40G node.
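As a rough sanity check on those throughput numbers, here is a back-of-envelope calculation (my arithmetic, not from the paper):

```python
# Per-GPU throughput implied by the reported cluster of
# 20 nodes x 8 A100-40G GPUs (160 GPUs total).
gpus = 20 * 8
text_tokens_per_day = 90e9        # reported text-only training speed
multimodal_tokens_per_day = 70e9  # reported multimodal training speed

print(f"{text_tokens_per_day / gpus:.2e} text tokens/GPU/day")              # ~5.6e8
print(f"{multimodal_tokens_per_day / gpus:.2e} multimodal tokens/GPU/day")  # ~4.4e8
```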
How to test it in a working pipeline
If your target documents are typical reports or books, start with Small mode at 100 tokens and move up only if the edit distance is unacceptable. If your pages are dense with small fonts or high token counts, use Gundam mode, since it composes global and local views with an explicit token budget. If your workload includes charts, tables, or chemical structures, review the "Deep parsing" section, which shows conversions to HTML tables, SMILES, and structured geometry, outputs that are easy to verify. A minimal inference sketch follows below.
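The Hugging Face model card exposes a custom infer method through trust_remote_code. The sketch below follows that pattern; the argument names and prompt format are taken from the card as published, but verify them against the current version before relying on them.

```python
# Sketch of running DeepSeek-OCR via Transformers, following the pattern on
# the Hugging Face model card; check argument names against the current card.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # requires flash-attn 2.7.3
)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown."

# base_size, image_size, and crop_mode select the mode: 640/640 without
# cropping is Small (100 tokens); base_size=1024 with 640 tiles and
# crop_mode=True is the dynamic Gundam mode.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="page.png",   # path to your document image
    output_path="out/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```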


Key takeaways
- DeepSeek-OCR targets token efficiency through optical context compression, with near-lossless decoding at about 10× compression and around 60 percent precision at about 20× compression.
- The HF release exposes explicit token budgets: Tiny uses 64 tokens at 512 by 512, Small uses 100 tokens at 640 by 640, Base uses 256 tokens at 1024 by 1024, and Large uses 400 tokens at 1280 by 1280, plus dynamic Gundam modes.
- The system factorizes pages into vision tokens with DeepEncoder and decodes them with a DeepSeek-3B MoE decoder with about 570M active parameters, as described by the research team in the technical report.
- The Hugging Face model card documents a tested setup for immediate use: Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Flash Attention 2.7.3.
DeepSeek-OCR is a practical step for document AI. It treats pages as compact optical carriers that cut the decoder sequence length without losing much information, reporting about 97 percent decoding precision at roughly 10× compression on the Fox benchmark, which is the headline claim to verify in real workloads. The release pairs the DeepEncoder front end with a 3B MoE decoder, ships ready for Transformers with a tested setup of PyTorch 2.6.0, CUDA 11.8, and flash-attn 2.7.3, which lowers setup cost for developers, and the repository shows a single 6.67 GB safetensors shard that fits on a common GPU. In short, DeepSeek-OCR operationalizes optical context compression with explicit token budget modes and a concrete, testable accuracy claim for your pipeline.
Check out the technical paper, the model on Hugging Face, and the GitHub repo.



