Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages with 2.8-Second Latency


Simultaneous translation is one of the hardest problems in applied AI. The model must translate speech before the speaker finishes the sentence, and every additional second of delay breaks the illusion of real-time communication. Alibaba's Qwen team has been improving on this with each release. Their latest model, Qwen3.5-LiveTranslate-Flash, brings latency down to 2.8 seconds and expands input coverage to 60 languages.

Meaningful Upgrades Over Prior Releases

Qwen3-LiveTranslate-Flash handled 18 input languages with about three seconds of latency. Qwen3.5-LiveTranslate-Flash takes that down to 2.8 seconds, extends input coverage to 60 languages, and adds speech output in 29 languages. That is more than a 3× expansion of language coverage on the input side (60 versus 18). For developers building multilingual products, this reduces the need to switch models for each language across global business scenarios.

The latency improvement comes from processing what the team calls 'learning units.' Rather than waiting for a full sentence to arrive before generating output, the model determines when enough meaning has accumulated in a segment to commit to a translation, and it streams output continuously while the speaker is still talking. This is the same basic idea as semantic unit prediction, but with a tighter implementation that trims roughly 200 milliseconds (from about 3.0 to 2.8 seconds).
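The commit-early idea can be illustrated with a toy segmenter. This is only a sketch under stated assumptions: the real model presumably decides boundaries with a learned predictor, while the hypothetical `is_unit_complete` rule below simply looks for clause-ending punctuation or a word-count cap. None of these names come from the Qwen API.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical boundary rule for illustration; NOT the model's learned predictor.
CLAUSE_ENDINGS = (",", ";", ".", "?", "!")

def is_unit_complete(words: List[str], max_words: int = 6) -> bool:
    """Return True when the buffered words look like a translatable unit."""
    if not words:
        return False
    return words[-1].endswith(CLAUSE_ENDINGS) or len(words) >= max_words

def stream_units(
    incoming: Iterable[str],
    complete: Callable[[List[str]], bool] = is_unit_complete,
) -> Iterator[str]:
    """Yield each segment as soon as it is judged complete,
    instead of waiting for the end of the sentence."""
    buffer: List[str] = []
    for word in incoming:
        buffer.append(word)
        if complete(buffer):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever remains when the speaker stops
        yield " ".join(buffer)

if __name__ == "__main__":
    speech = "After the merger closes, we will move the team to Singapore next year.".split()
    for unit in stream_units(speech):
        print("commit:", unit)
```

Each yielded segment could be handed to a translator immediately, which is why output can begin while the sentence is still in progress. The trade-off is that an early commit cannot be revised once later words arrive.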

Vision is Now a First Class Input

Most translation systems treat audio as the only input signal. That works well in clean studio conditions. It breaks down in a crowded conference room, a noisy commercial space, or anywhere with loud voices and bad acoustics.

Qwen3.5-LiveTranslate-Flash takes a different approach. It analyzes visual information alongside the audio: on-screen text, physical displays, lip movements, and gestures. If a word is phonetically ambiguous or the audio stream degrades, the visual context fills the gap and sharpens the interpretation decision. This is not a small feature. In real-world use, audio quality is rarely guaranteed. Having a vision channel means the model handles the messy reality of live interpretation more gracefully than audio-only systems.

Voice Cloning Happens in Real Time

This is the most prominent feature in the Qwen3.5 release. Conventional translation systems replace the speaker's voice with a standard synthetic voice. Qwen3.5-LiveTranslate-Flash instead matches the vocal characteristics of the actual speaker during the translation itself. One spoken sentence is enough for the model to perform this acoustic adaptation.

To the listener, the translated output sounds like the same person speaking the target language rather than a replacement voice. For live conference interpreting, multilingual live streaming, or international customer calls, this matters. The experience feels significantly more human than what current systems deliver.

Optimize Domain-Specific Keywords

One persistent failure mode of translation models in professional settings is proper nouns and specialized vocabulary. A model interpreting medical information may garble the name of a drug. A legal interpretation session can stumble over technical legal terms.

Qwen3.5-LiveTranslate-Flash addresses this with dynamic keyword configuration at runtime. Developers can supply a list of brand names, medical terms, legal terms, or technical vocabulary, and the model renders those terms faithfully. Many general-purpose translation APIs do not offer this, so it fills a real gap for domain-specific business deployments.

Benchmark Performance

On FLEURS and CoVoST2, two established benchmarks for multilingual speech translation, Qwen3.5-LiveTranslate-Flash outperforms other major vendors. FLEURS evaluates translation quality across a wide variety of language pairs under real acoustic conditions. CoVoST2 includes 21 translation directions from speech, making it a useful proxy for multilingual pipeline performance.

Marktechpost Visual Explainer

What it does

Qwen3.5-LiveTranslate-Flash at a glance

Qwen3.5-LiveTranslate-Flash is an API-only, closed-source real-time translation model from Alibaba's Qwen team. It takes audio and video frames as simultaneous input and returns translated text and speech. The model is served over a WebSocket-based protocol on Alibaba Cloud Model Studio.

Latency: 2.8s (to the audio output token)
Input languages: 60 (speech + visual input)
Speech output: 29 (voiced languages)
Protocol: WebSocket (continuous connection)

✓ Vision-enhanced comprehension: lip movements, gestures, and on-screen text all feed into the translation decision alongside the audio
◆ Real-time voice cloning: carries the voice profile of the original speaker into the translated output from a single spoken sentence
◆ Semantic unit prediction: commits to translating a segment before the full sentence ends, allowing continuous streaming without waiting for a complete utterance
◆ Dynamic keyword configuration: supply a domain-specific glossary at runtime to handle technical, medical, or legal terms

Before you start

What is required

You need an Alibaba Cloud account with Model Studio access and a valid DashScope API key. The model is available under the model ID qwen3-livetranslate-flash-realtime.

Create an Alibaba Cloud account

Sign up at alibabacloud.com and activate Alibaba Cloud Model Studio in your account dashboard.

Get your DashScope API key

Navigate to Model Studio → API Keys. Generate a key and store it in the DASHSCOPE_API_KEY environment variable. Never hardcode it in source files.

Install Python dependencies

Install the websocket-client package for WebSocket communication. To capture audio, also install pyaudio.

Check your audio settings

The model accepts 16kHz, 16-bit PCM mono audio input. Make sure your microphone or audio source can output this format before connecting.

BASH

# Install dependencies
pip install websocket-client pyaudio

# Set your API key as an environment variable
export DASHSCOPE_API_KEY="your_key_here"
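If your source is a recorded file rather than a live microphone, the 16kHz, 16-bit mono requirement can be checked before connecting. The sketch below uses only Python's standard `wave` module; the helper names `check_wav_format` and `bytes_per_second` are illustrative and not part of any Alibaba SDK.

```python
import os
import tempfile
import wave

# Format stated in this guide: 16kHz, 16-bit PCM, mono
REQUIRED_RATE = 16000
REQUIRED_SAMPLE_WIDTH = 2  # bytes per sample (16-bit)
REQUIRED_CHANNELS = 1

def check_wav_format(path: str) -> list:
    """Return a list of human-readable problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != REQUIRED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {REQUIRED_RATE} Hz")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
    return problems

def bytes_per_second() -> int:
    """Raw PCM data rate implied by the required format."""
    return REQUIRED_RATE * REQUIRED_SAMPLE_WIDTH * REQUIRED_CHANNELS

if __name__ == "__main__":
    # Write one second of silence in the required format, then verify it
    demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(REQUIRED_CHANNELS)
        out.setsampwidth(REQUIRED_SAMPLE_WIDTH)
        out.setframerate(REQUIRED_RATE)
        out.writeframes(b"\x00" * bytes_per_second())
    print("problems:", check_wav_format(demo_path))
```

A file that fails the check would need to be resampled or downmixed with an audio tool of your choice before streaming.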

Step 3 – Connection

Establish a WebSocket connection

The model uses the WebSocket protocol for persistent, bidirectional communication. You authenticate with the Bearer token in the connection header using your DashScope API key.

PYTHON

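import json
import os

import websocket  # provided by the websocket-client package

# Fail fast with a KeyError if the key is missing, rather than
# hitting a TypeError later when the Authorization header is built
API_KEY = os.environ["DASHSCOPE_API_KEY"]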
API_URL = (
    "wss://dashscope-intl.aliyuncs.com"
    "/api-ws/v1/realtime"
    "?model=qwen3-livetranslate-flash-realtime"
)

def on_open(ws):
    print("Connected to Qwen3.5-LiveTranslate-Flash")

def on_message(ws, message):
    data = json.loads(message)
    print("Translation event:", data)

def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)
ws.run_forever()

ⓘ

The connection stays open for the whole session. You don't reconnect for each utterance. Send audio chunks and image frames continuously over the same socket.
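The URL and header from the connection step can also be assembled and validated before any socket is opened, which makes configuration mistakes easier to spot. This is a minimal sketch: `build_connection` is a hypothetical helper, and the endpoint and model ID are simply the ones used in the example above.

```python
import os
from urllib.parse import urlencode

# Endpoint and model ID as used in this guide's connection example
BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
MODEL_ID = "qwen3-livetranslate-flash-realtime"

def build_connection(api_key, model=MODEL_ID):
    """Return (url, headers) suitable for passing to websocket.WebSocketApp."""
    if not api_key or not api_key.strip():
        raise ValueError("DASHSCOPE_API_KEY is not set")
    url = f"{BASE_URL}?{urlencode({'model': model})}"
    headers = [f"Authorization: Bearer {api_key.strip()}"]
    return url, headers

if __name__ == "__main__":
    try:
        url, headers = build_connection(os.getenv("DASHSCOPE_API_KEY", ""))
        print("Would connect to:", url)
    except ValueError as exc:
        print("Configuration error:", exc)
```

The returned values map directly onto the first argument and the `header` argument of `websocket.WebSocketApp`.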

Step 4 – Streaming audio

Configure and stream audio input

After connecting, send the session configuration event to set the source and target languages. Then stream the PCM audio chunks continuously. The model uses session.input_audio_transcription.language to determine the input language.

PYTHON

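import base64
import json
import threading

import pyaudio

# Audio input config: 16kHz, 16-bit PCM mono
INPUT_RATE     = 16000
INPUT_CHUNK    = 1600  # 1600 frames at 16kHz = 100ms per chunk
INPUT_FORMAT   = pyaudio.paInt16
INPUT_CHANNELS = 1

def stream_microphone(ws):
    # Capture microphone audio and send it as base64-encoded PCM chunks
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=INPUT_RATE, channels=INPUT_CHANNELS,
        format=INPUT_FORMAT, input=True,
        frames_per_buffer=INPUT_CHUNK
    )
    try:
        while True:
            chunk = stream.read(INPUT_CHUNK)
            audio_b64 = base64.b64encode(chunk).decode()
            ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": audio_b64
            }))
    finally:
        # Release the audio device when the loop ends (e.g. socket closed)
        stream.stop_stream()
        stream.close()
        pa.terminate()

def on_open(ws):
    # 1. Send session config first
    session_cfg = {
        "type": "session.update",
        "session": {
            "input_audio_transcription": {
                "language": "zh"  # source: Chinese
            },
            "translation": {
                "target_language": "en"  # target: English
            }
        }
    }
    ws.send(json.dumps(session_cfg))

    # 2. Stream microphone audio on a background thread. A blocking loop
    #    inside on_open would stop the client from receiving server events.
    #    Production code should start this thread only after the server
    #    confirms the session configuration.
    threading.Thread(
        target=stream_microphone, args=(ws,), daemon=True
    ).start()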

⚠

Do not send audio before the session.update event is acknowledged. Wait for the server's session confirmation event before streaming audio chunks.
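One way to enforce this ordering is a small gate that the message handler opens and the audio sender waits on. The sketch below is an illustration, not the documented protocol: the event type `session.updated` is an assumption, so check the Model Studio documentation for the name the server actually sends. `SessionGate` is a hypothetical helper.

```python
import json
import threading

# ASSUMPTION: the server confirms session.update with an event of this type.
# Verify the real event name in the Model Studio documentation.
SESSION_CONFIRMED_TYPE = "session.updated"

class SessionGate:
    """Blocks audio senders until the session configuration is confirmed."""

    def __init__(self):
        self._ready = threading.Event()

    def handle_message(self, message: str) -> None:
        """Call from on_message with each raw server message."""
        event = json.loads(message)
        if event.get("type") == SESSION_CONFIRMED_TYPE:
            self._ready.set()

    def wait_until_ready(self, timeout: float = 10.0) -> bool:
        """Return True once confirmed, False if the timeout expires first."""
        return self._ready.wait(timeout)

if __name__ == "__main__":
    gate = SessionGate()
    # Simulate the server confirming the session shortly after connect
    confirmation = json.dumps({"type": SESSION_CONFIRMED_TYPE})
    threading.Timer(0.5, gate.handle_message, args=[confirmation]).start()
    print("ready:", gate.wait_until_ready(timeout=2.0))
```

In practice, `on_message` would call `gate.handle_message(message)`, and the audio thread would call `gate.wait_until_ready()` before entering its read loop, giving up if it returns False.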

Step 5 – Vision input

Send video frames for improved understanding

Qwen3.5-LiveTranslate-Flash reads lip movements, gestures, and on-screen text from video frames sent alongside the audio. Send base64-encoded JPEG frames at regular intervals during the session. Even a low frame rate greatly improves accuracy in noisy audio situations.

PYTHON

import base64
import json
import threading
import time

import cv2

def stream_video_frames(ws):
    cap = cv2.VideoCapture(0)  # 0 = default camera
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Encode frame as JPEG → base64
            ok, buf = cv2.imencode(".jpg", frame)
            if not ok:
                continue  # skip frames that fail to encode
            img_b64 = base64.b64encode(buf.tobytes()).decode()
            ws.send(json.dumps({
                "type": "input_image_buffer.append",
                "image": img_b64
            }))
            time.sleep(0.5)  # ~2fps is sufficient
    finally:
        cap.release()  # free the camera when streaming stops

# Run video streaming in a separate thread
threading.Thread(
    target=stream_video_frames,
    args=(ws,), daemon=True
).start()

ⓘ

Vision input is optional but recommended for live human speech situations. For pre-recorded audio files without a camera feed, you can omit the image frames entirely and rely on the audio alone.
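A possible refinement to the fixed sleep in the frame loop is a throttle that lets the capture loop keep reading the camera while only sending a frame when one is due. On some camera backends this avoids sending stale buffered frames. The `FrameThrottle` class below is a hypothetical helper in pure Python, shown as a sketch rather than a required part of the integration.

```python
import time
from typing import Optional

class FrameThrottle:
    """Decides whether enough time has passed to send another frame."""

    def __init__(self, max_fps: float = 2.0):
        if max_fps <= 0:
            raise ValueError("max_fps must be positive")
        self.min_interval = 1.0 / max_fps
        self._last_sent: Optional[float] = None

    def should_send(self, now: Optional[float] = None) -> bool:
        """Return True (and record the send time) if a frame is due at `now`."""
        if now is None:
            now = time.monotonic()
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False

if __name__ == "__main__":
    throttle = FrameThrottle(max_fps=2.0)
    # Simulated capture timestamps in seconds; only frames that are due get sent
    for t in (0.0, 0.2, 0.5, 0.9, 1.0):
        print(t, "send" if throttle.should_send(now=t) else "skip")
```

Inside the capture loop, the encode-and-send step would run only when `throttle.should_send()` returns True, and the `time.sleep(0.5)` call would no longer be needed.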

Step 6 – Domain accuracy

Dynamic keyword configuration

For technical, medical, legal, or product-specific content, you can inject a list of keywords at the beginning of the session. The model uses this list to improve how reliably it renders terms that general training data may handle inconsistently.

PYTHON

# Add to your session.update payload
session_cfg = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "language": "zh"
        },
        "translation": {
            "target_language": "en"
        },
        # Inject domain keywords here
        "keywords": [
            {"source": "达芬奇机器人",  "target": "da Vinci Surgical System"},
            {"source": "腹腔镜",      "target": "laparoscope"},
            {"source": "实体瘤",      "target": "solid tumor"}
        ]
    }
}
ws.send(json.dumps(session_cfg))

✓ Works with brand names, drug names, legal regulations, and technical model numbers
✓ Keywords are sent per session and do not persist across connections
◆ Keep the list focused: include only terms where a mistranslation could cause real mistakes
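Glossaries are often maintained as simple term-to-term mappings, so a small converter into the list-of-pairs shape shown in the payload above can be handy. This is a sketch: `build_keywords` is a hypothetical helper, and the `max_terms` cap is an arbitrary guard for keeping the list focused, not a documented API limit.

```python
def build_keywords(glossary: dict, max_terms: int = 50) -> list:
    """Convert {source_term: target_term} into [{"source": ..., "target": ...}],
    skipping blank entries. max_terms is an illustrative cap, not an API limit."""
    keywords = []
    for source, target in glossary.items():
        source, target = source.strip(), target.strip()
        if source and target:
            keywords.append({"source": source, "target": target})
    if len(keywords) > max_terms:
        raise ValueError(f"{len(keywords)} terms exceeds the cap of {max_terms}")
    return keywords

if __name__ == "__main__":
    glossary = {
        "达芬奇机器人": "da Vinci Surgical System",
        "腹腔镜": "laparoscope",
        "实体瘤": "solid tumor",
    }
    for entry in build_keywords(glossary):
        print(entry)
```

The result can be assigned to the "keywords" field of the session.update payload before it is sent.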

Reference

Supported languages

Qwen3.5-LiveTranslate-Flash understands 60 input languages and can produce speech output in 29 languages. The highlighted language chips below are confirmed speech output languages. All chips represent supported inputs.

Chinese, English, French, German, Spanish, Japanese, Korean, Russian, Portuguese, Italian, Arabic, Hindi, Turkish, Indonesian, Thai, Vietnamese, Greek, Mandarin, Cantonese, Wu dialect, Sichuanese, Tianjin dialect, Beijing dialect, + 37 more

ⓘ

Highlighted chips indicate confirmed speech (audio) output support. Plain chips are input-only or are not guaranteed to produce audio. Confirm your target language pair in the Alibaba Cloud Model Studio documentation before building audio output pipelines.

⚠

The model supports text output in all 60 input languages. Speech output is available in only 29 languages. If your pipeline requires audio delivery and your target language is not on the confirmed list, add a fallback TTS step.
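The fallback decision can be made explicit in code before a session starts. The sketch below is hedged heavily: this guide does not enumerate the 29 confirmed speech output languages, so `CONFIRMED_SPEECH_OUTPUT` is a placeholder set you would fill in from the official documentation, and `choose_output_mode` is a hypothetical helper.

```python
# PLACEHOLDER subset for illustration only. This guide does not list the
# 29 confirmed speech output languages; fill this in from the official docs.
CONFIRMED_SPEECH_OUTPUT = frozenset({"en", "zh"})

def choose_output_mode(target_language: str, speech_languages=CONFIRMED_SPEECH_OUTPUT) -> str:
    """Return "speech" if the target language is on the confirmed list,
    otherwise "text" so the caller can route the text to a fallback TTS step."""
    if target_language.strip().lower() in speech_languages:
        return "speech"
    return "text"

if __name__ == "__main__":
    for lang in ("en", "xx"):
        print(lang, "->", choose_output_mode(lang))
```

When the function returns "text", the pipeline would request text output from the model and pass it to whichever separate TTS engine you have chosen.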

Key Takeaways

Qwen3.5-LiveTranslate-Flash delivers multimodal real-time translation across 60 input languages and 29 speech output languages at 2.8 seconds of latency.
The model uses vision-enhanced comprehension (reading lip movements, gestures, and on-screen text) to maintain accuracy in noisy or distorted audio environments.
Real-time voice cloning reproduces the voice profile of the original speaker in the translated output from just one spoken sentence.
Semantic unit prediction, built on processing “learning units”, enables continuous output without waiting for full sentences, reducing latency to 2.8 seconds.
Dynamic keyword configuration lets developers inject domain-specific terms at runtime, improving translation fidelity for technical, medical, and legal vocabulary.
