Is Language Visually Recognizable? Experimenting with Chinese Characters

A post about a broken printer was widely discussed on Douban, a Chinese social network. The owner noticed that when the printer ran low on ink, every character came out with only its top half printed. Yet the text was still perfectly legible.
Look at these three versions of the phrase "artificial intelligence":
You can probably read all three quickly: full characters, 80% retained, 50% retained. That's not a trick; it is probably a property of the Chinese writing system.
A clarification: 80% and 50% refer to the proportion of the image that is kept, not the proportion of each individual character. Because each character occupies a different number of pixels in its image, we simply cut every image along a horizontal line at the same fixed height.
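The fixed-height crop can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `keep_top_fraction` is hypothetical, the retained part is assumed to be the top (as in the printer story), and removed rows are blanked rather than dropped so the image keeps its shape.

```python
import numpy as np

def keep_top_fraction(image: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the top `fraction` of the rows and blank out the rest.

    Assumption: the cut is made at the same row for every character image,
    regardless of how many ink pixels the character itself has.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep_rows = int(round(image.shape[0] * fraction))
    cropped = np.zeros_like(image)
    cropped[:keep_rows] = image[:keep_rows]
    return cropped

# Stand-in for a rendered 80x80 grayscale character (all ink, for simplicity).
glyph = np.ones((80, 80), dtype=np.float32)
half = keep_top_fraction(glyph, 0.5)    # rows 0..39 kept
most = keep_top_fraction(glyph, 0.8)    # rows 0..63 kept
```

Blanking instead of slicing is a design choice made here so that every condition feeds the model an input of the same size.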
This got me thinking: is language, or at least Chinese, fundamentally recognizable by sight? I spent a few days turning this over in my mind, and finally decided to find out the only way I know how: train some language models and see what really happens.
Experiment: Pixels In, Tokens Out
Every language model has to deal with tokens first. The basic idea is: computers do not understand text, so we give each word or character an ID, that is, a number. For example, the character 你 becomes 100, 好 becomes 3, and so on. From there, the LLM learns everything from scratch.
The catch is that when you reduce characters like 山 (mountain) and 水 (water) to integers, you lose their shape. And Chinese characters have rich shapes: stroke configurations, radical components, and spatial structures that carry real information. Another example: 打 (hit), 拍 (pat), and 拉 (pull) all share the radical 扌 (hand). Reduce them to IDs 423, 1089, and 2341, and that relationship is gone.
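To make the loss concrete, here is a minimal sketch of character-level tokenization. The helper `build_vocab` and the tiny corpus are hypothetical, and the resulting IDs are illustrative only; they are not the IDs quoted above.

```python
def build_vocab(text: str) -> dict[str, int]:
    """Assign each distinct character an ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab

corpus = "你好打拍拉山水"          # toy corpus, for illustration only
vocab = build_vocab(corpus)
ids = [vocab[ch] for ch in "打拍拉"]

# The IDs depend only on where each character first appeared in the corpus.
# Nothing in the integers records that all three characters contain 扌.
print(ids)
```

Any other ordering of the corpus would produce different IDs for the same characters, which is exactly why the integers carry no shape information.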
So instead of token IDs, I rendered each character as a grayscale image and fed it to a language model. The model's task was to predict the next character.
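The "pixels in" side can be pictured roughly as follows. This is a sketch, not the encoder used in the experiments: the single linear projection, the sizes `RES` and `DIM`, and the random stand-in glyphs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

RES = 8    # hypothetical glyph resolution (RES x RES pixels)
DIM = 16   # hypothetical embedding width expected by the language model

# Stand-in visual front end: one linear map from flattened pixels to an embedding.
W = rng.normal(0.0, 0.02, size=(RES * RES, DIM))

def embed_glyphs(glyphs: np.ndarray) -> np.ndarray:
    """Map (seq_len, RES, RES) grayscale glyphs to (seq_len, DIM) embeddings."""
    flat = glyphs.reshape(glyphs.shape[0], -1)   # (seq_len, RES*RES)
    return flat @ W                              # (seq_len, DIM)

# Five random "glyphs" in place of rendered characters.
glyphs = rng.random((5, RES, RES))
embeddings = embed_glyphs(glyphs)
```

The embeddings take the place of the usual token-embedding lookup; the output side still predicts the next character as a token ID.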
You Don't Need to See Clearly
If you've ever taken off your glasses to read, you know that blurry text is still legible. The same thing is happening here.
Look at these 8×8 pixel renderings of "artificial intelligence" (hold your screen at arm's length):

Each character is 64 pixels. And the model trained on input at this resolution performs about as well as the one trained on 80×80 images.
Indeed, we tested image resolutions from 4×4 to 80×80 and found that going from 8×8 to 80×80, which is 100 times more pixels, gains essentially nothing.
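For a feel of what the resolution sweep involves, here is one way to shrink an 80×80 glyph to 8×8. Block averaging is an assumption; the experiments may have rendered each resolution directly or used a different resampling method. The function name `downsample` is hypothetical.

```python
import numpy as np

def downsample(image: np.ndarray, out_size: int) -> np.ndarray:
    """Shrink a square image by averaging non-overlapping blocks."""
    h, w = image.shape
    if h % out_size or w % out_size:
        raise ValueError("image size must be a multiple of out_size")
    fh, fw = h // out_size, w // out_size
    # Split into (out_size, fh, out_size, fw) blocks and average each block.
    return image.reshape(out_size, fh, out_size, fw).mean(axis=(1, 3))

# Toy 80x80 "glyph": a single vertical stroke in columns 10..19.
glyph = np.zeros((80, 80))
glyph[:, 10:20] = 1.0

small = downsample(glyph, 8)
pixel_ratio = glyph.size // small.size   # 6400 // 64 = 100
```

The stroke survives as one fully inked column in the 8×8 version, which is the kind of coarse structure the low-resolution model still gets to see.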
The cropping results are even more striking. When 50% of each character image is removed, accuracy drops by less than 2%. The model does not require a clear image. It turns out that it only needs enough structure to tell which radical family a character belongs to.
(Methodological note: in the examples above, I've put the full and truncated versions side by side so you can compare. In the actual experiments, each training condition is completely independent; the model trained on truncated characters never sees the complete versions.)
The Hot-Start Effect
So, is a visual model better than the text-based one?
Not in the end. Both converge to the same final accuracy. But the journey looks very different, especially at the beginning.
After only 0.4% of the training steps, the visual model is already twice as accurate as the text-based one.

This is what we call the hot-start effect. A visual model comes into training already knowing something useful: that 打, 拍, and 拉 look alike, and probably behave alike. A text-based model starts with random embeddings and has to figure that out from scratch.
If you look at the embedding space at the start – before any training – you can see this directly:

You can see that characters sharing the same radical already cluster together at the very start of training. Cosine similarity of radical-sharing pairs: ~0.27 for visual embeddings, ~0.002 for random token embeddings.
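The comparison behind these numbers is plain cosine similarity. The toy below shows the mechanism only: the glyphs are synthetic 8×8 images with a shared left-hand strip standing in for a radical, and the similarity is computed on raw pixels rather than on an encoder's embeddings, so the values will not match the figures above.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_glyph(extra_row: int) -> np.ndarray:
    """Synthetic 8x8 glyph: a shared left strip plus one distinct stroke."""
    g = np.zeros((8, 8))
    g[:, :3] = 1.0            # shared component (stand-in for a radical)
    g[extra_row, 3:] = 1.0    # component that differs between glyphs
    return g

glyph_a, glyph_b = toy_glyph(0), toy_glyph(7)
sim_visual = cosine(glyph_a, glyph_b)     # shared pixels dominate: 24/29

# Randomly initialized token embeddings share nothing by construction.
rng = np.random.default_rng(0)
tok_a, tok_b = rng.normal(size=(2, 512))
sim_random = cosine(tok_a, tok_b)         # close to zero in high dimensions
```

The shared strip alone makes the two glyph vectors point in a similar direction before any learning happens, whereas two random embedding vectors are nearly orthogonal.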
Why the Race Ends in a Tie
Here's the main thing: the visual front end encodes visual similarity, not linguistic co-occurrence. Next-character prediction, however, ultimately depends on the latter.
Yes, 打, 拍, and 拉 all share 扌 and look alike. But in real text, they appear in very different contexts: 打击犯罪 (combat crime), 拍照 (take photos), 拉动经济 (stimulate the economy), and so on. Once a text-based model has seen enough data to learn these patterns, the visual prior no longer matters.
In other words, visual input jump-starts learning. However, it does not raise the knowledge ceiling.
This always reminds me of Ted Chiang's "Story of Your Life" (the basis of the film Arrival). In the story, written and spoken language are two independent systems. But ultimately they serve the same purpose: communication. Two paths, same destination.
Where This Matters Most
Even though both approaches converge, there are real situations where the head start matters:
Low-resource settings. If you don't have a lot of training data, a visual head start translates into a real practical advantage. In our tests, with only 10K samples, visual models already outperform a fully trained text baseline on a Chinese downstream benchmark (C-Eval).
Damaged historical documents. This is another fun one. Visual input could help with reading ancient Chinese manuscripts, damaged books, and handwritten documents where strokes are missing or faded.
What About Compute?
The good news: almost no extra cost. The lightweight visual encoder I used has fewer parameters than the text baseline (12.6M vs. 19.0M). Peak memory: +1.3%. So we argue that the visual front end is almost free.
Short answer
Is Chinese fundamentally recognizable by sight? The answer seems to be: at first, yes. In the end, it doesn't matter.
Visual structure gives models a hot start. It is similar to what a human learner does on seeing 扌 and immediately knowing that the character belongs to the territory of hand-related actions. But the deeper patterns of language must be learned from data. Both representations end up learning them equally well.
The paper is on arXiv:



