What your audience actually waits for: Automating ear-voice span
A semantic latency metric for live speech translation that measures when meaning arrives, not when audio starts. We automate the 50-year-old gold standard of interpretation research using an LLM as a bilingual phrase aligner.
A live translation system starts speaking 2 seconds after the source speaker does. Fast, right? Maybe, but that number describes a single moment. By minute four of a talk, the system may have fallen several seconds behind, or caught back up, or swung between the two. The audience experiences the lag continuously, across every phrase. A single start-time number doesn’t capture any of it.
The metric that captures this lag is Ear-Voice Span (EVS), a 50-year-old gold standard from simultaneous interpretation research (Oléron and Nanpon, 1965). This article explains EVS and how to automate it using an LLM as a bilingual phrase aligner.
What EVS measures
EVS measures what listeners experience: the time between when a source-language phrase is spoken and when its corresponding target-language translation begins to be spoken.
A translation session produces a distribution of per-phrase latencies. We report the median (the typical listener experience) along with the P90 (the slowest 10% of phrases). The gap between them is informative on its own: a small gap means latency is smooth and predictable; a large gap means listeners regularly experience moments much worse than typical.
Reporting both follows Iranzo-Sánchez et al. (2025), who showed that latency distributions for simultaneous speech translation (SimulST) systems are heavily right-skewed: a single central summary like the mean or median can hide unacceptably long delays in the right tail.
Why existing latency metrics fall short
Most latency metrics for simultaneous translation were designed for text-in, text-out systems and operate on discrete read/write operations rather than real-time audio. Here is how they compare:
| Metric | What it measures | Unit | Key limitation for speech evaluation |
|---|---|---|---|
| RTF (Real-Time Factor) | wall_clock / audio_duration | Ratio | Measures system speed, not semantic delay. An RTF of 1.0 means the system keeps up with the audio, but says nothing about when meaning arrives. |
| AP (Average Proportion, Cho and Esipova, 2016) | Mean fraction of source consumed per target token | [0, 1] | Abstract: values depend on sequence lengths even for the same policy. Poor sensitivity. |
| AL (Average Lagging, Ma et al., 2019) | How much the MT system lags behind an ideal simultaneous translator | Words | Operates on tokens, not seconds. Non-differentiable. Rewards “free writes” after all source is consumed. Designed for training simultaneous text MT, not evaluating speech. |
| DAL (Differentiable AL, Cherry and Foster, 2019) | Differentiable variant of AL | Words | Smooth, differentiable rewrite of AL that can be used as a training loss. Same conceptual limit: counts tokens, not seconds. |
| LAAL (Length-Adaptive AL, Papi et al., 2022) | Variant addressing AL’s over-generation reward | Words | Better binary comparisons, but still text-level. |
| ATD (Average Token Delay, Kano et al., 2023) | Duration-aware delay per partial translation | Seconds | Closest automated metric to EVS, validated by showing highest correlation with EVS among baselines. But operates on streaming chunks, not semantic phrase pairs. |
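To make the "tokens, not seconds" limitation concrete, here is a minimal sketch of Average Lagging as defined by Ma et al. (2019); the wait-3 read schedule in the example is an assumption used only for illustration:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019). g[t] is the number of source
    tokens read before emitting target token t+1. The result is a lag
    measured in tokens, with no notion of wall-clock seconds."""
    gamma = tgt_len / src_len
    # tau = first target step at which the full source has been read
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len),
               tgt_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# A wait-3 policy on a 10-token source and 10-token target lags ~3 tokens,
# whether those tokens took one second or ten to speak.
g = [min(3 + t, 10) for t in range(10)]
print(average_lagging(g, 10, 10))   # 3.0
```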
For teams building live speech translation, what matters is when each piece of meaning reaches the listener after it was spoken. EVS measures exactly that, in three concrete ways:
- Seconds, not tokens. The metric is directly interpretable as listener experience.
- Phrases, not words. It captures when a meaning arrives, not when a single word does.
- Semantic alignment, not monotonic matching. It handles language pairs with heavy reordering (German→English, English→Arabic) that break token-based metrics.
A recent meta-evaluation of SimulST latency metrics (Polak et al., 2025) found that existing metrics produce inconsistent rankings in short-form settings and recommended long-form evaluation, exactly the setting EVS is built for.
From manual to machine
Guo and Han (2024) validated that automated EVS measurement can match human annotation, achieving median EVS error under 0.1 seconds on a 20-hour English-to-Portuguese corpus. However, their pipeline relied on traditional cross-lingual word aligners, which are language-pair-specific, brittle on noisy ASR output, and unable to handle free reordering (e.g., verb-final word order in German subordinate clauses).
LLM as bilingual aligner
We replace the traditional NLP alignment step with an LLM call. We found that large language models work well as bilingual aligners: they understand semantic equivalence across languages, handle word reordering naturally, and degrade gracefully on noisy input.
The pipeline

Step 1: Force-align both audio tracks. We use Qwen3-ForcedAligner-0.6B for supported languages (DE, EN, ES, FR, IT, JA, KO, PT, RU, YUE, ZH) with WhisperX as a fallback for other languages. This produces per-word start/end timestamps for both the source and target audio, which we use to derive the latency between phrase pairs.
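As a rough illustration of the fallback path, here is a minimal sketch using WhisperX's transcription and alignment API; the model choice, file path, and the transcribe-then-align flow are assumptions, not the exact reference pipeline:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("target_track.wav")   # hypothetical file path

# Transcribe, then force-align the transcript to get per-word timestamps.
asr_model = whisperx.load_model("large-v2", device)
result = asr_model.transcribe(audio, batch_size=16)
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Flatten to (word, start_s, end_s) tuples for the later steps.
words = [(w["word"], w["start"], w["end"])
         for seg in aligned["segments"]
         for w in seg.get("words", [])
         if "start" in w]               # some words may lack timestamps
```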
Step 2: Build the LLM prompt. We number each word in the source and target:
Source (Spanish): "Soy científica y les voy a contar la historia de un descubrimiento"
Source entries: [0] Soy [1] científica [2] y [3] les [4] voy [5] a
[6] contar [7] la [8] historia [9] de [10] un [11] descubrimiento
Target (English): "I am a scientist and I am going to tell you the story of a discovery"
Target entries: [0] I [1] am [2] a [3] scientist [4] and [5] I [6] am
[7] going [8] to [9] tell [10] you [11] the [12] story
[13] of [14] a [15] discovery
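The numbering itself is mechanical. A minimal sketch, assuming the (word, start, end) tuples from Step 1 (the exact prompt wording around these entries lives in the reference implementation, not here):

```python
def number_words(words):
    """Render word entries as "[i] word"; the indices are what the
    LLM's phrase pairs refer back to."""
    return " ".join(f"[{i}] {w}" for i, (w, _start, _end) in enumerate(words))

src = [("Soy", 0.16, 0.40), ("científica", 0.42, 1.10)]   # truncated example
print("Source entries:", number_words(src))   # Source entries: [0] Soy [1] científica
```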
Step 3: LLM returns phrase pairs. The system prompt instructs the model to output a JSON array of phrase pairs with word indices:
[
{"source_phrase": "Soy científica",
"target_phrase": "I am a scientist",
"source_word_indices": [0, 1],
"target_word_indices": [0, 1, 2, 3]},
{"source_phrase": "y les voy a contar",
"target_phrase": "and I am going to tell you",
"source_word_indices": [2, 3, 4, 5, 6],
"target_word_indices": [4, 5, 6, 7, 8, 9, 10]}
]
The system prompt enforces several rules:
- Source indices must be monotonic (no overlapping ranges).
- Target indices may reorder. Languages like Japanese move verbs to the end; Arabic reorders extensively.
- Omissions are allowed: if the translator dropped content, those indices stay unpaired.
- No index may appear in more than one pair.
- Prefer tight pairs (few words) over long spans.
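These rules are easy to check mechanically. Here is a minimal sketch of parsing the LLM output and enforcing them, using the field names from the JSON example above (the checks are our illustration, not necessarily how the reference implementation does it):

```python
import json

def parse_phrase_pairs(llm_output: str) -> list[dict]:
    """Parse the LLM's JSON array and enforce the pairing rules."""
    pairs = json.loads(llm_output)
    seen_src, seen_tgt, last_src = set(), set(), -1
    for p in pairs:
        src, tgt = p["source_word_indices"], p["target_word_indices"]
        # Source indices must be monotonic and non-overlapping across pairs.
        if min(src) <= last_src:
            raise ValueError(f"source indices overlap or go backwards: {src}")
        last_src = max(src)
        # No index may appear in more than one pair (source or target side).
        if seen_src & set(src) or seen_tgt & set(tgt):
            raise ValueError("index reused across pairs")
        seen_src |= set(src)
        seen_tgt |= set(tgt)
    # Indices never referenced by any pair are treated as omissions.
    return pairs
```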
The full system prompt, with an English→Japanese reordering example, lives in the reference implementation.
Step 4: Compute EVS. For each phrase pair i, let src_start_i and tgt_start_i be the start times of the source and target phrases (from Step 1). The per-pair latency is

EVS_i = tgt_start_i - src_start_i
Report the median and P90 of these per-pair values across the session.
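A minimal sketch, assuming the (word, start, end) tuples from Step 1 and the validated pairs from Step 3:

```python
from statistics import median, quantiles

def evs_seconds(pairs, src_words, tgt_words):
    """Per-pair EVS: target phrase start minus source phrase start."""
    evs = []
    for p in pairs:
        src_start = min(src_words[i][1] for i in p["source_word_indices"])
        tgt_start = min(tgt_words[i][1] for i in p["target_word_indices"])
        evs.append(tgt_start - src_start)
    return evs

def summarize(evs):
    """Session-level summary: median and P90 of the per-pair latencies."""
    return {"median_s": median(evs), "p90_s": quantiles(evs, n=10)[-1]}
```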
Why LLM alignment is better
Compared to traditional word-alignment models:
- Handles reordering natively. An LLM understands that “up” in “I picked up the dog” aligns with “auf” at the end of “Ich hob den Hund auf”, even though the German particle sits at a different sentence position. Token-level aligners struggle with this.
- Covers many pairs with one prompt. The same prompt works for EN→DE, AR→EN, with no per-language alignment model needed.
- Robust to ASR noise. ASR transcripts contain disfluencies, repetitions, and errors. An LLM can still identify semantic correspondence through noise that would confuse token-level matchers.
- Produces human-readable pairs. The phrase pairings are interpretable: you can inspect exactly which source phrase maps to which target phrase and verify the alignment makes sense.
Walkthrough: example output
A live Spanish-to-English translation session by VoiceFrom Pro, scored with Gemini 3.1 Pro. The LLM returned 135 phrase pairs for this session. The first five:
| # | Source phrase | Source start | Target phrase | Target start | EVS |
|---|---|---|---|---|---|
| 1 | "Soy científica" | 0.16s | "I am a scientist" | 6.48s | 6.32s |
| 2 | "y les voy a contar" | 1.52s | "and I am going to tell you" | 9.76s | 8.24s |
| 3 | "la historia de un descubrimiento" | 2.64s | "the story of a discovery" | 10.88s | 8.24s |
| 4 | "Este descubrimiento comenzó" | 5.76s | "This discovery began" | 12.80s | 7.04s |
| 5 | "con muchas preguntas" | 7.36s | "with many questions" | 13.76s | 6.40s |
Across all 135 pairs, the median speech-EVS is 7.8 seconds: that is the typical delay between a Spanish phrase being spoken and its English translation beginning to play. The P90 is 9.7 seconds, so the slowest 10% of phrases lag roughly two seconds beyond the typical case.
The opening phrase comes in at about 6 seconds: the system needs to accumulate some context before producing a meaningful translation. Most of the session then stabilizes in a 7–9 second band, and the slowest 10% extend past 9.7 seconds.
Two signals: speech-EVS vs. caption-EVS
VoiceFrom also displays on-screen captions, which gives us a parallel metric: caption-EVS, the delay until the translated text appears. Captions arrive faster than speech because TTS adds buffering on top of the translation pipeline.
For this session, the median caption-EVS is 3.8 seconds and the P90 is 4.5 seconds. The caption channel is much more consistent than speech: its P90 sits only 0.7 seconds above the median, compared to 1.9 seconds for speech-EVS.
Decomposing EVS this way pinpoints where the latency lives. The translation pipeline alone is fast and consistent: caption-EVS sits at 3.8s with only 0.7s between median and P90. Voice synthesis roughly doubles the delay and accounts for almost all of the tail variance, spreading speech-EVS by 1.9s between median and P90.
Reference implementation
The EVS pipeline described above (forced alignment, LLM phrase pairing, per-pair latency computation) is available at VoiceFrom/live-s2st-eval.
References
C. Cherry and G. Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation. arXiv:1906.00048. https://arxiv.org/abs/1906.00048
K. Cho and M. Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv:1606.02012. https://arxiv.org/abs/1606.02012
M. Guo and L. Han. 2024. From manual to machine: Evaluating automated ear-voice span measurement in simultaneous interpreting. Interpreting 26, 1. https://benjamins.com/catalog/intp.00100.guo
J. Iranzo-Sánchez et al. 2025. Going beyond your expectations in latency metrics for simultaneous speech translation. In Findings of ACL, 18205–18228. https://aclanthology.org/2025.findings-acl.937/
T. Kano et al. 2023. Average token delay: A duration-aware latency metric for simultaneous translation. arXiv:2311.14353. https://arxiv.org/abs/2311.14353
M. Ma et al. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proc. ACL.
P. Oléron and H. Nanpon. 1965. Recherches sur la traduction simultanée. Journal de Psychologie Normale et Pathologique 62, 1, 73–94.
S. Papi et al. 2022. Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation. In Proc. AutoSimTrans. https://aclanthology.org/2022.autosimtrans-1.2/
P. Polak et al. 2025. Better late than never: Meta-evaluation of latency metrics for simultaneous speech-to-text translation. arXiv:2509.17349. https://arxiv.org/abs/2509.17349