How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise

Optimizing only for automatic speech recognition (ASR) and word error rate (WER) is not enough for modern, interactive voice agents. Robust evaluation should measure end-to-end task success, barge-in behavior and latency, and hallucination-under-noise, alongside ASR accuracy, safety, and instruction following. VoiceBench offers a multi-facet speech-interaction benchmark covering general knowledge, instruction following, safety, and robustness to speaker, environment, and content variations, but it does not cover barge-in or real-device task completion. SLUE (and Phase-2) targets spoken language understanding (SLU); MASSIVE and Spoken-SQuAD probe multilingual understanding and spoken question answering; DSTC tracks add spoken, task-oriented robustness. Combine these with explicit barge-in/endpointing tests, user-centric task-success measurement, and controlled noise-stress protocols.

Why isn't WER enough?

WER measures transcription fidelity, not interaction quality. Two agents with the same WER can diverge widely in dialog success, because latency, turn-taking, misunderstanding recovery, safety, and robustness to acoustic and content perturbations dominate the user's experience. Prior work on production systems shows the need to evaluate user satisfaction and task success directly: for example, Cortana's automatic online evaluation predicted user satisfaction from in-situ interaction signals, not only ASR accuracy.

What should you measure (and how)?

1) End-to-end task success

Metric. Task Success Rate (TSR) with strict success criteria per task (goal completed, constraints met), plus Task Completion Time (TCT) and Turns-to-Success.
Why. Real assistants are judged by outcomes. Competitions such as the Alexa Prize TaskBot explicitly measured users' ability to finish multi-step tasks (e.g., cooking, DIY), using ratings and completion.

Protocol.

  • Define tasks with verifiable end states (e.g., “assemble a shopping list with N items and constraints”).
  • Use blinded human raters and automatic logs to compute TSR/TCT/Turns.
  • For multilingual/SLU coverage, draw task intents and slots from MASSIVE.
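The three numbers in this protocol can be computed from session logs in a few lines. The following is a minimal sketch, assuming each logged session already carries a success verdict from the task's end-state check; the `Session` fields and the `task_metrics` helper are hypothetical names, and reporting the median TCT over successful sessions only is one possible convention, not a standard.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Session:
    task_id: str
    success: bool      # verdict from the task's verifiable end-state check (assumed given)
    duration_s: float  # wall-clock time from first user turn to session end
    turns: int         # number of user turns in the session


def task_metrics(sessions):
    """TSR over all sessions; median TCT and mean turns over successful sessions only."""
    if not sessions:
        raise ValueError("need at least one session")
    wins = [s for s in sessions if s.success]
    return {
        "tsr": len(wins) / len(sessions),
        "tct_median_s": median([s.duration_s for s in wins]) if wins else None,
        "turns_to_success_mean": mean([s.turns for s in wins]) if wins else None,
    }


# Hypothetical sessions, for illustration only.
sessions = [
    Session("shopping_list", True, 42.0, 5),
    Session("shopping_list", False, 90.0, 11),
    Session("recipe", True, 60.0, 7),
    Session("recipe", True, 48.0, 6),
]
report = task_metrics(sessions)
print(report)
```

Restricting TCT and turn counts to successful sessions keeps abandoned or timed-out sessions from dominating the timing figures; failures are already reflected in TSR.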

2) Barge-in and turn-taking

Metrics.

  • Barge-in detection latency (ms): time from user speech onset to TTS suppression.
  • True/false barge-in rates: correct interruptions vs. spurious stops.
  • Endpointing latency (ms): time to ASR finalization after the user stops speaking.

Why. Smooth interruption handling and fast endpointing determine perceived responsiveness. Published research formalizes barge-in verification and continuous barge-in processing, and endpointing latency remains an active area in streaming ASR.

Protocol.

  • Script prompts in which the user interrupts TTS at controlled offsets and SNRs.
  • Measure suppression and recognition timings with high-precision logs (frame timestamps).
  • Include noisy and echoic far-field conditions. Classic and recent studies provide recovery and signaling strategies that reduce false barge-ins.
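Given timestamped logs from such scripted trials, the barge-in and endpointing metrics reduce to simple differences and counts. The sketch below assumes four per-trial timestamps on a shared clock; the `Trial` record and `barge_in_metrics` helper are hypothetical, and the rate denominators (interrupting trials for the true rate, all trials for the false rate) are one reasonable choice that should be stated in any report.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Trial:
    user_onset_ms: Optional[float]  # None: scripted noise-only trial, the user never speaks
    tts_stop_ms: Optional[float]    # None: playback was never suppressed
    user_end_ms: Optional[float]    # end of user speech
    asr_final_ms: Optional[float]   # time the ASR emitted its final result


def barge_in_metrics(trials):
    latencies, endpointing = [], []
    true_bi = false_bi = 0
    interrupting = sum(t.user_onset_ms is not None for t in trials)
    for t in trials:
        spoke = t.user_onset_ms is not None
        stopped = t.tts_stop_ms is not None
        if spoke and stopped and t.tts_stop_ms >= t.user_onset_ms:
            true_bi += 1
            latencies.append(t.tts_stop_ms - t.user_onset_ms)
        elif stopped:
            false_bi += 1  # stopped with no user speech, or before the user spoke
        if t.user_end_ms is not None and t.asr_final_ms is not None:
            endpointing.append(t.asr_final_ms - t.user_end_ms)
    return {
        "true_barge_in_rate": true_bi / interrupting if interrupting else None,
        "false_barge_in_rate": false_bi / len(trials) if trials else None,
        "barge_in_latency_median_ms": median(latencies) if latencies else None,
        "endpointing_latency_median_ms": median(endpointing) if endpointing else None,
    }


# Hypothetical trials, for illustration only.
trials = [
    Trial(1000.0, 1180.0, 2500.0, 2850.0),  # correct barge-in
    Trial(1500.0, 1740.0, 3000.0, 3300.0),  # correct barge-in
    Trial(None, 900.0, None, None),         # spurious stop on noise
    Trial(1200.0, None, 2600.0, 3100.0),    # missed barge-in
]
metrics = barge_in_metrics(trials)
print(metrics)
```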

3) Hallucination-under-noise (HUN)

Metric. HUN rate: the fraction of outputs that are fluent but semantically unrelated to the audio, under controlled noise or non-speech sound.
Why. ASR and audio-LLM stacks can emit “convincing nonsense,” especially on non-speech segments or under noise. Recent work defines and measures ASR hallucinations, and targeted studies show hallucinations induced by non-speech sounds.

Protocol.

  • Build audio sets with additive environmental noise (varied SNRs), non-speech distractors, and content disfluencies.
  • Score semantic relatedness (human judgment with adjudication) and compute HUN.
  • Track whether downstream agent actions propagate hallucinations into incorrect task steps.
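Two pieces of this protocol are mechanical: scaling noise to hit a target SNR, and turning adjudicated judgments into a HUN rate. The sketch below is illustrative only. It works on plain lists of samples of equal length, and the judgment fields (`fluent`, `related`, `wrong_action`) are hypothetical labels assumed to come from human raters.

```python
import math


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain for `noise` so that speech power / scaled noise power equals snr_db."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))


def mix_at_snr(speech, noise, snr_db):
    """Additive mix; assumes both signals have the same length and sample rate."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


def hun_metrics(judgments):
    """HUN rate = fluent-but-unrelated outputs / all outputs, plus how often they caused a wrong action."""
    hallucinated = [j for j in judgments if j["fluent"] and not j["related"]]
    propagated = sum(j["wrong_action"] for j in hallucinated)
    return {
        "hun_rate": len(hallucinated) / len(judgments),
        "propagation_rate": propagated / len(hallucinated) if hallucinated else 0.0,
    }


# Toy signals and hypothetical judgments, for illustration only.
gain = noise_gain_for_snr([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0], snr_db=0)
mixed = mix_at_snr([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0], snr_db=0)
judgments = [
    {"fluent": True, "related": True, "wrong_action": False},
    {"fluent": True, "related": False, "wrong_action": True},
    {"fluent": True, "related": False, "wrong_action": False},
    {"fluent": False, "related": False, "wrong_action": False},
    {"fluent": True, "related": True, "wrong_action": False},
]
hun = hun_metrics(judgments)
print(gain, hun)
```

Non-fluent garbage is deliberately excluded from the numerator here, since the metric as defined targets fluent but unrelated output; a stricter variant could count it separately.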

4) Instruction following, safety, and robustness

Metric families.

  • Instruction-following accuracy (format and constraint adherence).
  • Safety refusal rate on adversarial spoken prompts.
  • Robustness deltas across speaker (age, accent, pitch), environment (noise, reverb, far-field), and content noise (grammar errors, disfluencies).

Why. VoiceBench explicitly targets these axes with spoken instructions (real and synthetic) spanning general knowledge, instruction following, and safety; it perturbs speaker, environment, and content to probe robustness.

Protocol.

  • Use VoiceBench for breadth across speech-interaction capabilities. Report aggregate and per-axis scores.
  • For SLU specifics (NER, dialog acts, QA, summarization), use SLUE and Phase-2.
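A robustness delta is simply the score under a perturbed condition minus the score on clean input, reported per condition. The sketch below assumes one score per condition on a common metric; the condition names and numbers are hypothetical and do not come from VoiceBench.

```python
def robustness_deltas(scores, baseline="clean"):
    """Score change relative to the clean baseline; negative values mean degradation."""
    base = scores[baseline]
    return {cond: round(val - base, 4) for cond, val in scores.items() if cond != baseline}


# Hypothetical per-condition accuracies, for illustration only.
scores = {"clean": 0.82, "far_field": 0.71, "accent_shift": 0.76, "disfluent": 0.79}
deltas = robustness_deltas(scores)
worst = min(deltas, key=deltas.get)  # condition with the largest drop
print(deltas, worst)
```

Reporting the per-condition deltas next to the aggregate score makes it visible when a good average hides one badly degraded axis.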

5) Perceptual speech quality (TTS and enhancement)

Metric. Subjective Mean Opinion Score (MOS) via ITU-T P.808 (crowdsourced ACR/DCR/CCR).
Why. Interaction quality depends on both recognition and playback quality. P.808 provides a validated crowdsourcing protocol with open-source tooling.
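Once ACR ratings have been collected, the headline number is a mean with a confidence interval. The sketch below is a simplified illustration, assuming ratings on the usual 1 to 5 ACR scale and a normal-approximation interval; it does not reproduce the rater screening or other processing that a P.808 toolkit performs, and `mos_with_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of an approximate 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ACR ratings are expected on a 1-5 scale")
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mean(ratings), half_width


# Hypothetical ratings for one TTS clip, for illustration only.
mos, ci = mos_with_ci([4, 5, 3, 4, 4, 5, 4, 3])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```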

Benchmark landscape: what each covers

VoiceBench (2024)

Scope: Multi-facet voice assistant evaluation with spoken inputs covering general knowledge, instruction following, and safety, plus robustness across speaker/environment/content variations; uses both real and synthetic speech.
Limitations: Does not benchmark barge-in/endpointing latency or real-world task completion on devices; focuses on response correctness and safety under variation.

SLUE / SLUE Phase-2

Scope: Spoken language understanding tasks: NER, sentiment, dialog acts, named-entity localization, QA, and summarization; designed to study end-to-end vs. pipeline sensitivity to ASR errors.
Use: Well suited to probing SLU robustness and pipeline fragility in spoken settings.

MASSIVE

Scope: More than 1M virtual-assistant utterances across 51-52 languages with intents and slots; a strong fit for multilingual task-oriented evaluation.
Use: Build multilingual task suites and measure TSR and slot F1 under speech conditions (paired with TTS or read speech).
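Slot F1 can be approximated by comparing predicted and gold slot-value pairs per utterance. The sketch below computes a micro-averaged F1 over exact (slot, value) matches; this is a simplifying assumption, since span-based scoring as used in official evaluations may differ, and the slot names are hypothetical.

```python
def slot_f1(gold, pred):
    """Micro-averaged F1 over exact (slot, value) pairs, summed across utterances."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g.items()), set(p.items())
        tp += len(g_set & p_set)
        fp += len(p_set - g_set)
        fn += len(g_set - p_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted slots for two utterances, for illustration only.
gold = [{"time": "7am", "date": "tomorrow"}, {"artist": "queen"}]
pred = [{"time": "7am", "date": "today"}, {"artist": "queen"}]
f1 = slot_f1(gold, pred)
print(round(f1, 3))
```

Running the same scorer on text input and on ASR output of the spoken version gives a direct measure of how much the speech front end costs in slot accuracy.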

Spoken-SQuAD / HeySQuAD

Scope: Spoken question answering to test ASR-aware comprehension and multi-accent robustness.
Use: Stress-test comprehension under speech errors; not a full agent task suite.

DSTC (Dialog System Technology Challenge) tracks

Scope: Robust dialog modeling with spoken, task-oriented data; human ratings alongside automatic metrics; recent tracks emphasize multilinguality, safety, and evaluation.
Use: Complementary for dialog quality, dialog state tracking (DST), and knowledge-grounded responses under speech conditions.

Real-world task assistance (Alexa Prize TaskBot)

Scope: Multi-step task assistance with user ratings and success criteria (cooking/DIY).
Use: Gold-standard inspiration for defining TSR and interaction KPIs; the public reports describe the evaluation focus and outcomes.

Filling the gaps: what you still need to add

  1. Barge-in & endpointing KPIs
    Add explicit measurement harnesses. The literature offers barge-in verification and continuous-processing strategies, and streaming ASR endpointing latency remains an active research topic. Track barge-in detection latency, suppression correctness, endpointing delay, and false barge-ins.
  2. Hallucination-under-noise (HUN) protocols
    Adopt emerging ASR-hallucination definitions and controlled noise/non-speech tests; report the HUN rate and its impact on downstream actions.
  3. On-device interaction latency
    Correlate user-perceived latency with streaming ASR design (e.g., transducer variants); measure time-to-first-token, time-to-final, and local processing overhead.
  4. Cross-axis robustness matrices
    Combine VoiceBench's speaker/environment/content axes with your task suite (TSR) to expose where failures cluster.
  5. Perceptual quality of playback
    Use ITU-T P.808 (with the open P.808 toolkit) to quantify user-perceived TTS quality in your end-to-end loop, not just ASR.

A concrete, reproducible evaluation plan

  1. Assemble the suite
  • Speech-interaction core: VoiceBench for knowledge, instruction following, safety, and robustness axes.
  • SLU depth: SLUE/Phase-2 tasks (NER, dialog acts, QA, summarization) for SLU performance under speech.
  • Multilingual coverage: MASSIVE for intent/slot evaluation and multilingual stress.
  • Comprehension under ASR noise: Spoken-SQuAD/HeySQuAD for spoken QA and multi-accent readouts.
  2. Add the missing capabilities
  • Barge-in/endpointing harness: scripted interruptions at controlled offsets and SNRs; log suppression time and false barge-ins; measure endpointing delay with streaming ASR.
  • Hallucination-under-noise: non-speech inserts and noise overlays; score semantic relatedness to compute HUN.
  • Task success block: scenario tasks with objective success checks; compute TSR, TCT, and Turns; follow TaskBot-style definitions.
  • Perceptual quality: P.808 crowdsourced ACR with the Microsoft toolkit.
  3. Report structure
  • Primary table: TSR/TCT/Turns; barge-in latency and error rates; endpointing latency; HUN; VoiceBench aggregate and per-axis; SLU metrics; P.808 MOS.
  • Stress plots: TSR and HUN vs. SNR and reverberation; barge-in latency vs. interrupt timing.
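The data behind a stress plot is just the per-trial outcomes grouped by condition. The sketch below aggregates TSR and HUN by SNR level, assuming each trial record carries its SNR and two boolean outcomes; the record layout and the `curve_by_snr` helper are hypothetical, and the same grouping works for reverberation or interrupt timing.

```python
from collections import defaultdict


def curve_by_snr(trials):
    """One row per SNR level, cleanest first, with TSR and HUN rate for that level."""
    buckets = defaultdict(list)
    for t in trials:
        buckets[t["snr_db"]].append(t)
    rows = []
    for snr in sorted(buckets, reverse=True):
        group = buckets[snr]
        rows.append({
            "snr_db": snr,
            "n": len(group),
            "tsr": sum(t["success"] for t in group) / len(group),
            "hun": sum(t["hallucinated"] for t in group) / len(group),
        })
    return rows


# Hypothetical trial outcomes, for illustration only.
trials = [
    {"snr_db": 20, "success": True, "hallucinated": False},
    {"snr_db": 20, "success": True, "hallucinated": False},
    {"snr_db": 5, "success": True, "hallucinated": False},
    {"snr_db": 5, "success": False, "hallucinated": True},
    {"snr_db": 0, "success": False, "hallucinated": True},
    {"snr_db": 0, "success": False, "hallucinated": True},
]
rows = curve_by_snr(trials)
for row in rows:
    print(row)
```

Keeping the per-level sample count `n` in each row makes it easy to see when a point on the curve rests on too few trials to be trusted.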

References

  • VoiceBench: multi-facet speech-interaction benchmark for LLM-based voice assistants (knowledge, instruction following, safety, robustness). (ar5iv)
  • SLUE / SLUE Phase-2: spoken NER, dialog acts, QA, summarization; sensitivity to ASR errors in pipelines. (arXiv)
  • MASSIVE: 1M+ multilingual intent/slot utterances for assistants. (Amazon Science)
  • Spoken-SQuAD / HeySQuAD: spoken question answering datasets. (GitHub)
  • User-centric evaluation in production assistants (Cortana): predicting satisfaction beyond ASR. (UMass Amherst)
  • Barge-in verification/processing and endpointing latency: AWS/academic barge-in papers, Microsoft continuous barge-in, recent endpoint detection work for streaming ASR. (arXiv)
  • ASR hallucination definitions and non-speech-induced hallucinations (Whisper). (arXiv)


Michal Sutter holds a Master of Science in Data Science from the University of Padova. With a solid foundation in mathematics, machine learning, and data engineering, Michal excels at turning complex data into actionable insights.
