Bytes Speak All Languages: Cross-Script Name Retrieval via Contrastive Learning

When a screening system checks a name against a watchlist, it faces a silent failure mode that nobody talks about. Type “Владимир Путин” into a system indexed on “Vladimir Putin” and most name-matching approaches return nothing. The two strings share zero characters, so edit distance is meaningless, phonetic codes fail (they assume Latin script), and BM25 gives up entirely.
This is not an obscure edge case. Immigration databases, hospital record systems, and financial compliance pipelines deal with this daily. And yet, the dominant approaches to this problem are either classical (edit distance, Soundex variants) or heavyweight (fine-tuning a multilingual LLM on a few hundred manually labeled pairs). In this post, I’ll walk you through how we trained a compact transformer encoder from scratch on raw UTF-8 bytes, with no tokenizer, no pretrained backbone, and no script detection, to solve cross-script phonetic name retrieval. We achieved 0.775 MRR and 0.897 R@10 across 8 non-Latin scripts, reducing the performance gap between Latin and non-Latin queries by 10x over the best classical baseline.
The full code is on GitHub. This post covers the ideas and the engineering.
Why is this hard?
The problem sits at the intersection of three things that don’t cooperate:
Scripts are disjoint symbol sets. “Schwarzenegger” and “שוורצנגר” (Hebrew) have no shared characters. Edit distance, the go-to for fuzzy matching, produces a maximum-distance score every time a script boundary is crossed. Phonetic hashing (Double Metaphone, Soundex) encodes approximate English pronunciation, so it is useless for non-Latin queries by design.
Romanization is not a function. The Chinese name written as “张” maps to Zhang, Chang, and Cheung depending on dialect, romanization standard, and historical convention. The Korean “박” maps to Park, Pak, and Bak. Any approach that tries to normalize to a canonical Latin form (like ICU transliterate) will get the right answer for one convention and fail for the others.
Names carry no semantic context. Dense retrieval methods like DPR and BGE-M3 are powerful for sentence-level tasks because surrounding words provide semantic grounding. For a 2-word person name there is no context to compensate for surface mismatch. Chari et al. (2025) showed that even strong multilingual retrievers degrade severely when queries are transliterated rather than written in their native script.
The insight behind our approach: every Unicode character decomposes deterministically into 1 to 4 bytes from a fixed 256-symbol alphabet. “Владимир” and “Vladimir” are different byte sequences, but a model trained contrastively on enough phonetic pairs can learn to map them to nearby vectors. The vocabulary is universal by construction.
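To make this concrete, here is what the encoder actually sees: raw integer byte sequences, no tokenizer involved (plain Python, independent of the model):

# Every string decomposes into bytes from the same 256-symbol alphabet.
for name in ["Vladimir", "Владимир", "张"]:
    b = name.encode("utf-8")
    print(f"{name} -> {len(b)} bytes: {list(b)}")

# Vladimir -> 8 bytes: [86, 108, 97, 100, 105, 109, 105, 114]
# Владимир -> 16 bytes: [208, 146, 208, 187, 208, 176, 208, 180, 208, 184, 208, 188, 208, 184, 209, 128]
# 张 -> 3 bytes: [229, 188, 160]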
Building Training Data at Scale
You can’t train this model without data, and there is no dataset of 4 million cross-script phonetic name pairs lying around. We built one with a 4-stage LLM pipeline.
Stage 1: Stratified sampling from Wikidata
We started with 2 million person-name entities from Wikidata, which provides canonical English names plus partial cross-script labels (some entities have Russian or Arabic names in their Wikidata record, most don’t). Naively sampling from this produces a dataset dominated by English-only names. We stratified by script-coverage bucket (0, 1-2, 3-4, 5+ non-English labels) and sampled proportionally within each bucket, yielding 119,040 entities with balanced coverage.
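A minimal sketch of the stratification, assuming each entity arrives as an (entity_id, non_english_label_count) pair; the per-bucket quotas are illustrative, not the ones we used:

import random
from collections import defaultdict

BUCKETS = [(0, 0), (1, 2), (3, 4), (5, float("inf"))]  # non-English label counts

def bucket_of(n_labels):
    return next(i for i, (lo, hi) in enumerate(BUCKETS) if lo <= n_labels <= hi)

def stratified_sample(entities, quotas, seed=13):
    rng = random.Random(seed)            # deterministic across reruns
    by_bucket = defaultdict(list)
    for eid, n in entities:
        by_bucket[bucket_of(n)].append(eid)
    sample = []
    for b, ids in by_bucket.items():
        rng.shuffle(ids)
        sample.extend(ids[: quotas[b]])  # cap each bucket at its quota
    return sample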
Stage 2: Phonetic Latin variants (Llama-3.1-8B)
For each English anchor name, we asked Llama-3.1-8B-Instruct to generate 4 phonetic spelling variants — the kinds of mishearings and misspellings real people produce. The prompt was strict:
Generate 4 DISTINCT phonetic spelling variants of this name
as it sounds when spoken: "Catherine"
Rules:
- Each variant must be spelled differently from all others and from the original
- Simulate how different people might mishear or misspell the name phonetically
- Do NOT use nicknames, abbreviations, or shortened forms
- Do NOT change language (stay in Latin script)
Return a JSON array of exactly 4 strings, no explanation:
["variant1", "variant2", ...]
Result for “Catherine”: ["Kathryn", "Katerin", "Kathrin", "Katharine"]
Stage 3: Cross-script transliteration (Qwen3-30B)
For each English name and each of its Latin variants, we generated transliterations into 8 scripts: Arabic, Russian, Chinese, Japanese, Hebrew, Hindi, Greek, Korean. We used Qwen3-Coder-30B-A3B-Instruct-FP8:
{
"Catherine": {"ar": "كاثرين", "ru": "Катрин", "he": "קתרין", ...},
"Kathryn": {"ar": "كاثرين", "ru": "Катрин", ...},
"Katharine": {"ar": "...", "ru": "...", ...}
}
Every stage is independently resumable: it reads existing output, builds a set of already-processed entity IDs, and skips them. A crash loses at most one in-flight batch.
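The idiom is the same in every stage and worth spelling out. A sketch, assuming JSONL files keyed by entity_id; process_batch stands in for the actual LLM call:

import json

def run_stage(input_path, output_path, batch_size=32):
    # Build the set of already-processed entity IDs from any existing output.
    done = set()
    try:
        with open(output_path) as f:
            done = {json.loads(line)["entity_id"] for line in f}
    except FileNotFoundError:
        pass

    with open(input_path) as fin, open(output_path, "a") as fout:
        batch = []
        for line in fin:
            record = json.loads(line)
            if record["entity_id"] in done:
                continue  # skip work a previous run already finished
            batch.append(record)
            if len(batch) == batch_size:
                for result in process_batch(batch):  # hypothetical LLM call
                    fout.write(json.dumps(result, ensure_ascii=False) + "\n")
                fout.flush()  # a crash loses at most this in-flight batch
                batch = []
        if batch:
            for result in process_batch(batch):
                fout.write(json.dumps(result, ensure_ascii=False) + "\n")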
Stage 4: Merge and tag
The final stage merges Wikidata ground-truth labels with LLM output, deduplicates, and tags each positive pair by type:
- phonetic: Latin spelling variant of the English anchor (“Catherine” → “Kathryn”)
- script: direct transliteration into a non-Latin script (“Catherine” → “كاثرين”)
- combined: a phonetic Latin variant that was then transliterated (“Katharine” → “كاثرين”)
Positives are stored per entity; negatives are not stored at all, they are mined dynamically during training. Splits are assigned at the entity level (80/10/10, deterministic MD5 hash of entity ID) so all variants of an identity go to one partition.
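The hash-based assignment is worth showing because it needs no stored split file; the same entity ID always lands in the same partition on every rerun. A sketch (the repo’s exact bucketing may differ):

import hashlib

def split_of(entity_id: str) -> str:
    h = int(hashlib.md5(entity_id.encode("utf-8")).hexdigest(), 16)
    bucket = h % 100          # deterministic, uniform over entities
    if bucket < 80:
        return "train"        # 80%
    if bucket < 90:
        return "val"          # 10%
    return "test"             # 10%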
Final dataset: 119,040 entities, 4.67 million positive pairs.
The Model
The encoder is genuinely small: 6 transformer layers, 8 attention heads, hidden dim 256, FFN dim 1024, dropout 0.1, max length 256 bytes. Total parameters: ~4M.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel

class ByteLevelEncoder(PreTrainedModel):
    def __init__(self, config: ByteEncoderConfig):
        super().__init__(config)
        self.embedding = nn.Embedding(
            config.vocab_size,  # 256 — raw UTF-8 bytes
            config.hidden_dim,
            padding_idx=config.pad_token_id,
        )
        self.pos_embedding = nn.Embedding(config.max_len, config.hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=config.hidden_dim,
            nhead=config.n_heads,
            dim_feedforward=config.ffn_dim,
            dropout=config.dropout,
            batch_first=True,
            norm_first=True,  # pre-norm: more stable when training from scratch
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=config.n_layers,
            enable_nested_tensor=False,
        )

    def forward(self, input_ids, attention_mask):
        B, L = input_ids.shape
        positions = torch.arange(L, device=input_ids.device).unsqueeze(0)
        x = self.embedding(input_ids) + self.pos_embedding(positions)
        padding_mask = ~attention_mask.bool()  # TransformerEncoder uses True = ignore
        x = self.transformer(x, src_key_padding_mask=padding_mask)
        # mean pool over real tokens only
        mask_f = attention_mask.unsqueeze(-1).float()
        pooled = (x * mask_f).sum(dim=1) / mask_f.sum(dim=1).clamp(min=1)
        return F.normalize(pooled, p=2, dim=-1)  # unit vectors
Why pre-norm (norm_first=True)? When training a transformer from scratch (no pretrained initialization), pre-norm stabilizes gradient flow in early training. Post-norm tends to diverge unless you are careful with learning rate warmup and initialization. For a fine-tuning scenario, you probably don’t need to think about this, but here it mattered.
The output is a unit vector in 256 dimensions. Cosine similarity = inner product on unit vectors, so retrieval is just a dot product.
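With unit vectors, scoring a query against the entire corpus is one matrix-vector product. A toy numpy illustration, with random data standing in for real embeddings:

import numpy as np

corpus = np.random.randn(11_974, 256).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # unit vectors

query = (corpus[0] + 0.05 * np.random.randn(256)).astype("float32")
query /= np.linalg.norm(query)

scores = corpus @ query               # cosine similarity, one dot product per name
top10 = np.argsort(-scores)[:10]      # retrieval = top-k by inner product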
Training: InfoNCE and Hard Negative Mining
The InfoNCE loss
The loss is standard: an (anchor, positive) pair should have a high inner product; the anchor’s inner product with every other positive in the batch (the in-batch negatives) should be low.
import torch
import torch.nn.functional as F

def infonce_loss(anchor, positive, temperature=0.07):
    # anchor, positive: (B, D), L2-normalized
    logits = (anchor @ positive.T) / temperature  # (B, B)
    labels = torch.arange(len(anchor), device=anchor.device)  # diagonal = correct
    return F.cross_entropy(logits, labels)
With batch size 256 and temperature 0.07, this is 255 negatives per anchor per step. The temperature controls how peaked the distribution is: too high and the loss ignores hard negatives, too low and training becomes unstable.
Why in-batch negatives aren’t enough
In-batch negatives are cheap but shallow: they’re random names from the dataset, which tend to be easy to separate. A model that has been training for a few hundred steps can distinguish “Catherine” from “Zhao Wei” effortlessly. What it struggles with is “Katarina” vs “Katherine” — names that are phonetically close but refer to different people. Those are the cases where the gradient signal is actually informative.
This is the motivation for ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation): periodically rebuild a FAISS index from the current model’s embeddings, then, for each anchor, find the current nearest non-matching neighbors and use those as negatives. They are hard precisely because the model currently thinks they are similar.
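A sketch of the mining step, assuming embs is a float32 matrix of L2-normalized embeddings and entity_ids maps each row to its identity (the real sampler lives in the repo):

import faiss
import numpy as np

def mine_hard_negatives(embs: np.ndarray, entity_ids: list, k: int = 16):
    # Nearest neighbors under the *current* model are hard negatives by construction.
    index = faiss.IndexFlatIP(embs.shape[1])  # inner product = cosine on unit vectors
    index.add(embs)
    _, nbrs = index.search(embs, k + 8)       # extra slots to survive filtering
    hard = []
    for i, row in enumerate(nbrs):
        # Drop the anchor itself and anything that belongs to the same entity.
        hard.append([j for j in row if entity_ids[j] != entity_ids[i]][:k])
    return hard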

The hard negative schedule
from torch.utils.data import Sampler

class ANCEBatchSampler(Sampler):
    def _current_mix_ratio(self) -> float:
        if self._step < self.warmup or self.index is None:
            return 0.0
        steps_past_warmup = self._step - self.warmup
        # ramp from 0 → target_mix_ratio over mix_ramp_steps
        return min(
            self.target_mix_ratio,
            self.target_mix_ratio * steps_past_warmup / max(1, self.mix_ramp_steps),
        )
During the first 200 steps: random batches only. The model has no meaningful structure yet; a FAISS index over random embeddings would produce useless hard negatives.
After step 200: the FAISS index is rebuilt periodically from fresh embeddings (every refresh_every steps). Each batch is constructed by taking a seed anchor, finding its nearest neighbors in the current index, filling n_hard = batch_size * mix_ratio slots with those neighbors, and padding the rest with random samples. The mix ratio ramps linearly from 0 to 0.7 over 500 steps after warmup, so the transition is gradual.
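Sketched as a method on the same sampler (attribute names like self.embs are illustrative; the repo has the real implementation):

import random

def build_batch(self, seed_idx: int) -> list[int]:
    n_hard = int(self.batch_size * self._current_mix_ratio())
    batch = [seed_idx]
    if n_hard > 0 and self.index is not None:
        # nearest neighbors of the seed anchor under the current model
        _, nbrs = self.index.search(self.embs[seed_idx : seed_idx + 1], n_hard + 1)
        batch += [j for j in nbrs[0].tolist() if j != seed_idx][:n_hard]
    while len(batch) < self.batch_size:  # pad the remaining slots with random samples
        j = random.randrange(len(self.dataset))
        if j not in batch:
            batch.append(j)
    return batch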
The training loop:
global_step = 0
for batch in train_loader:
    anchor = model(batch["anchor"].to(device), batch["anchor_mask"].to(device))
    positive = model(batch["positive"].to(device), batch["positive_mask"].to(device))
    loss = loss_fn(anchor, positive)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

    global_step += 1
    if global_step % refresh_every == 0:
        # re-embed the training set and hand the fresh index to the sampler
        embs, ids = encode_all(model, train_ds, train_batch_size, device)
        train_sampler.update_index(embs, ids)
Evaluation
The retrieval setup is a standard dense IR evaluation. The corpus is all 11,974 test-split anchor names, each encoded to a unit vector and stored in a FAISS FlatIP index. Each positive variant in the test set is issued as a query; retrieval succeeds if the correct anchor appears in the top-k results.
We report MRR, R@1, R@5, R@10, and NDCG@10, broken down three ways: overall, by query type, and by script.
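The whole evaluation reduces to a FAISS search plus metric bookkeeping. A sketch, assuming the anchor and query embeddings are already computed (MRR here is truncated at the retrieval depth k):

import faiss
import numpy as np

def evaluate(anchor_embs, query_embs, query_to_anchor, k=10):
    # anchor_embs: (N, 256) unit vectors for the test-split anchors
    index = faiss.IndexFlatIP(anchor_embs.shape[1])
    index.add(anchor_embs)
    _, ranked = index.search(query_embs, k)

    rr, hits = [], 0
    for q, row in enumerate(ranked):
        pos = np.where(row == query_to_anchor[q])[0]
        rr.append(1.0 / (pos[0] + 1) if len(pos) else 0.0)
        hits += int(len(pos) > 0)
    return {"MRR": float(np.mean(rr)), f"R@{k}": hits / len(ranked)}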
Overall results:

The classical baselines (Levenshtein, Double Metaphone, BM25) cluster at MRR ~0.09. This looks terrible, but it’s an artifact of what’s being measured: 70% of the evaluation queries are cross-script (script or combined type), on which these methods score near zero because they share no characters with Latin-indexed names. On Latin-only queries, Levenshtein achieves 0.894 MRR — a perfectly respectable number for a classical baseline.
Why overall MRR misleads
The combined type is both the hardest and the most common: the query is a phonetic variant of the anchor that was then transliterated into a non-Latin script (“Katharine” → “كاثرين”, English anchor “Catherine”). Breaking down by query type reveals where each method actually fails.


The model needs to handle phonetic variation and script change simultaneously. Transliterate, which applies a fixed canonical romanization, drops to 0.485 here because a fixed mapping cannot account for phonetic variants in the query.
The byte encoder maintains strong performance across all three types (0.937 / 0.827 / 0.738). The contrastive training signal, which sees all three pair types, successfully aligns phonetically equivalent byte sequences regardless of script.
The script gap

The script gap is the R@10 difference between Latin and non-Latin queries. Classical baselines have gaps of 0.88 to 0.94: they retrieve well within Latin script but fail entirely across script boundaries. The byte encoder reduces this to 0.096.
Importantly, the model also improves Latin R@10 from 0.944 to 0.983. The contrastive objective generalizes within-script as well as across scripts.
The remaining gap (0.096) is almost entirely explained by two scripts:

Scripts with consistent romanization conventions (Arabic, Russian, Hebrew, Hindi, Greek) reach above 0.95. Chinese (0.666) and Korean (0.728) are the outliers. Both have severe romanization ambiguity: “张” maps to Zhang, Chang, and Cheung; “박” maps to Park, Pak, and Bak. The LLM-generated training data contains all of these as positives for the same entity, which produces conflicting gradient signal. The model cannot fully resolve which embedding region a name belongs to when its romanization is genuinely ambiguous.
Notice also that BM25 performs slightly better on Chinese and Korean than other baselines. This is not because BM25 understands phonetics. When the query is already in the target script (Chinese querying a Chinese-indexed corpus), identical CJK characters may appear in both query and document, producing incidental character n-gram overlap. This effect disappears for true cross-script retrieval (Latin query, CJK corpus) and should not be mistaken for phonetic matching.
FAISS index ablation

HNSW matches exact search recall (0.896 vs 0.897 R@10) at 5.7x lower latency. For deployment, HNSW is the choice: the small recall penalty is negligible and the latency improvement compounds at scale. IVF-PQ cuts index size by 96% at a 6.4% R@10 penalty — worth considering if you’re indexing millions of entities and memory is constrained.
At 11,974 entities the difference between 0.03 ms and 0.17 ms is academic. At 50 million entities in a real deployment, HNSW’s recall advantage over IVF-Flat becomes more pronounced as the number of index partitions grows.
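For reference, the three variants are each a few lines in FAISS; the parameters below are plausible defaults, not necessarily the ablation’s exact settings, and embs stands in for the encoded corpus:

import faiss

d = 256  # embedding dimension; vectors are L2-normalized,
         # so L2 and inner-product rankings coincide

flat = faiss.IndexFlatIP(d)        # exact search, the recall ceiling
flat.add(embs)

hnsw = faiss.IndexHNSWFlat(d, 32)  # graph index, 32 links per node
hnsw.hnsw.efSearch = 64            # higher = better recall, slower queries
hnsw.add(embs)

quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 256, 32, 8)  # 256 lists, 32 sub-vectors, 8 bits each
ivfpq.train(embs)                  # PQ codebooks need a training pass
ivfpq.nprobe = 16                  # lists probed per query
ivfpq.add(embs)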
What doesn’t work (and why)
The model fails to fully close the gap on Chinese and Korean, and the reason is worth dwelling on. The pipeline generates non-Latin variants exclusively by transliterating from Latin: “Catherine” → Latin variant → Arabic/Chinese/etc. It never generates native-script spelling variation. Alternative Arabic orthographies, Korean spacing conventions, or variant Chinese character forms that refer to the same name do not appear in training data. The model learns to map Latin byte sequences to non-Latin byte sequences, but it hasn’t seen non-Latin spelling variation within a single script.
This is a known limitation. The fix would be a fifth pipeline stage: given a generated Chinese or Arabic name, ask the LLM to produce native-script phonetic variants of it. We didn’t do this, so the model is likely underperforming on queries that represent real-world native-script variation.
A second limitation: 99.5% of positive pairs are LLM-generated. The evaluation uses the same LLM-generated pairs. If the LLM systematically mistransliterates a class of names, both training and evaluation signal would be wrong in the same direction, and we would not catch it. The 0.5% Wikidata ground truth provides a sanity check but not a complete one.
Key takeaways
Byte-level tokenization is an underused tool for multilingual tasks. It eliminates out-of-vocabulary tokens by construction, requires no language-specific tokenizer, and gives you a universal 256-symbol vocabulary that covers every Unicode character. For tasks where surface form matters more than semantics — like name matching — it is a natural fit.
LLMs are a viable data engine for low-resource retrieval tasks. We generated 4.67 million positive pairs across 8 scripts using two open-weight models. The pipeline is 4 stages, each independently resumable. This approach is generalizable to other low-resource entity matching problems where ground-truth labels are scarce but a capable LLM can synthesize realistic variation.
ANCE hard negative mining matters. The transition from random negatives to ANN-mined hard negatives noticeably sharpens the embedding space. Without it, the model would learn to separate easy cases (different names in the same script) but struggle on the hard ones (phonetically similar names across scripts).
Report results by query type and script, not just overall MRR. An overall MRR of 0.775 masks huge variation: 0.937 on phonetic queries, 0.738 on combined. A system that looks mediocre on headline metrics may be near-perfect for one use case and broken for another.
The code, dataset pipeline, trained checkpoint, and evaluation scripts are at github.com/vedant-jumle/cross-language-phonetic-text-alignment.
Note about Wikidata: Wikidata is released under CC0 1.0 Universal (public domain) — no restrictions on use, including commercial.