Why Does My Coding Assistant Start Responding in Korean When I'm Typing Chinese?

Mostly, I work with my coding assistant in Chinese. However, my writing is a mix: the prose is Chinese, while technical tokens stay in their original English form.
Yesterday, I asked my coding assistant in Chinese: “run.py有早停吗?我的恒源云上跑,电影没有时间”, meaning, “Does run.py have early stopping? I was running it on a shared GPU service and didn't see early stopping kick in.” As always, I naturally typed the technical token run.py in its original English form. The model checked the code and responded, but not in Chinese: the reply came back in Korean.
All the technical tokens stayed in English (run.py, config.py, train_unified), while the surrounding descriptive text switched to Korean. This was not an isolated case. It happened again and again: whenever I mixed Chinese with English engineering terms, Korean would show up in the reply.

This made me ask: is this a language-detection issue, or is something deeper happening in the embedding space?
The Hypothesis
Embedding spaces are not organized primarily by language. Trained jointly with language models, they tend to cluster by functional register: academic writing, conversational writing, and, in the case of coding assistants, engineering/coding. Chinese, despite having more native speakers than any other language, is weakly represented in the engineering register of typical training corpora.
If that is the case, a sentence can stop behaving as “Chinese” in the embedding space as soon as engineering tokens such as update / branch / commit / PR / diff appear. Instead, it can drift into the engineering region of the space.
Let us run two experiments to look for concrete evidence for this hypothesis.
Controlled Language Drift
We construct a controlled sequence of sentences in which English tokens gradually take over from Chinese:
Stage 0: Please check
Stage 1: 请帮我 review the 這个电影
Stage 2: 请帮我 update the 這个 branch
Stage 3: Please review the pull request for this branch
Stage 4: Please review the diff code for this pull request
We then compute cosine similarity between sentence embeddings. We define the Korean and English “clusters” as the mean embedding of a set of engineering-related sentences in each language, and we use Δ (EN − KO) for the difference between the two similarity scores: Δ = similarity(English cluster) − similarity(Korean cluster).
| Stage | Korean sim | English sim | Δ (EN − KO) |
|---|---|---|---|
| 0 | 0.4783 | 0.5141 | 0.0358 |
| 1 | 0.5235 | 0.5728 | 0.0492 |
| 2 | 0.5474 | 0.6140 | 0.0665 |
| 3 | 0.5616 | 0.7314 | 0.1698 |
| 4 | 0.5427 | 0.7398 | 0.1972 |
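The cluster-and-Δ computation above can be sketched as follows. The `embed` function here is a deterministic placeholder (a real run would substitute a multilingual sentence-embedding model), so the printed numbers are illustrative, not the ones in the table:

```python
import numpy as np

def embed(sentences):
    # Placeholder embedder: deterministic pseudo-embeddings keyed on the text.
    # Swap in a real multilingual sentence-embedding model for the experiment.
    vecs = []
    for s in sentences:
        rng = np.random.default_rng(sum(map(ord, s)))
        vecs.append(rng.normal(size=384))
    return np.array(vecs)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Reference "clusters": mean embedding of engineering sentences per language.
ko_centroid = embed(["이 브랜치를 리뷰해 주세요", "이 diff를 확인해 주세요"]).mean(axis=0)
en_centroid = embed(["Please review this branch", "Please check this diff"]).mean(axis=0)

stages = [
    "请帮我 update the 這个 branch",                   # mixed, like Stage 2
    "Please review the diff code for this pull request",  # pure English, like Stage 4
]
for s in stages:
    v = embed([s])[0]
    sim_ko, sim_en = cosine(v, ko_centroid), cosine(v, en_centroid)
    print(f"{s!r}: KO={sim_ko:.4f} EN={sim_en:.4f} Δ={sim_en - sim_ko:.4f}")
```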
Two things stand out. First, both similarities rise as English tokens take over, but English similarity grows much faster, so the gap Δ keeps widening. Second, the growth is not linear: Δ jumps sharply between Stage 2 and Stage 3 (0.0665 → 0.1698), suggesting something closer to a phase transition than a gradual drift.
When we project the embeddings into two dimensions using PCA, we see a smooth trajectory in the early stages, followed by a sharp jump between Stage 2 and Stage 3, and then stabilization. This pattern suggests that the embeddings do not move continuously through the space; instead, they appear to jump between attractor basins.
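The PCA projection can be reproduced with a plain SVD. The five-row matrix below is a random stand-in for the actual stage embeddings, so only the mechanics carry over:

```python
import numpy as np

def pca_2d(X):
    # Project the rows of X onto their first two principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Toy stand-in for the five Stage 0-4 sentence embeddings.
stage_embeddings = np.random.default_rng(0).normal(size=(5, 384))
coords = pca_2d(stage_embeddings)

# Distance between consecutive stages in the 2-D projection; a spike
# between rows 2 and 3 would be the "jump" described in the text.
jumps = np.linalg.norm(np.diff(coords, axis=0), axis=1)
for i, d in enumerate(jumps):
    print(f"Stage {i} -> {i + 1}: jump = {d:.3f}")
```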

Real-World Model Behavior
Consider again the sentence mentioned earlier. I asked:
A. “run.py有早停吗?我们恒源云上跑,电影没有时间”, meaning, “Does run.py have early stopping? I was running it on a shared GPU service and didn't see early stopping kick in.”
B. “원이다 에이스스트이스. 가이: run.py doesn't actually have 조기 자자가. config.py에 USE_EARLY_STOPPING = True” (in Korean).
Translated back into Chinese, we have:
C. “我们么了是生情。 conclusion:run.py actually didn't exist早停。config.py里有USE_EARLY_STOPPING = True.”
We compute the cosine similarity of A, B, and C against three reference sets: a Chinese set, defined as the mean embedding of ordinary Chinese natural-language sentences, and corresponding English and Korean sets.
| Text | Korean sim | English sim | Chinese sim |
|---|---|---|---|
| A (Chinese question) | 0.2003 | 0.2688 | 0.3134 |
| B (Korean answer) | 0.2745 | 0.2983 | 0.1641 |
| C (back-translated Chinese) | 0.1634 | 0.3106 | 0.2798 |
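The comparison in the table amounts to a nearest-centroid check in embedding space. A minimal sketch, again with a placeholder embedder (the reference sentences here are illustrative, not the article's actual reference sets):

```python
import numpy as np

def embed(text):
    # Placeholder: deterministic pseudo-embedding keyed on the text;
    # substitute a real multilingual sentence-embedding model.
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.normal(size=384)

def nearest_set(vec, centroids):
    # Return the name of the reference set whose centroid is most
    # cosine-similar to vec.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda name: cos(vec, centroids[name]))

centroids = {
    "chinese": embed("这是一段普通的中文句子。"),
    "english": embed("This is an ordinary English sentence."),
    "korean": embed("이것은 평범한 한국어 문장입니다."),
}
# With real embeddings, the back-translated answer C lands nearest
# the English set rather than the Chinese one.
print(nearest_set(embed("some back-translated answer"), centroids))
```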
As the table shows, translating the Korean answer back into Chinese does not move its embedding into the Chinese region. Instead, it remains closest to the English reference set (0.3106, versus 0.2798 for Chinese).
This suggests that translation can restore the surface form of a language, but not necessarily its location in embedding space.
The Conclusion
Both experiments point to the same conclusion: the embedding space is not partitioned along language boundaries. Instead, it appears to be shaped by task registers, in which engineering English dominates.
Once a sentence enters this region, its surface language may change, but its embedding stays anchored in the engineering region, which can lead to odd behavior such as answering in Korean even though no Korean was involved at all.



