Good-Pal-Bench

MIRRORING

A Multilingual Benchmark for Emotional Accommodation in Full-Duplex Speech Models

Despite strong progress on turn-taking and voice naturalness, speech-to-speech models are never evaluated on whether they adapt their emotional tone to match a conversational partner. MIRRORING measures this: given an emotionally expressive human turn, does the model respond with the same emotion a human interlocutor actually produced?

Last updated: May 13, 2026

Overall Rankings

1,005 samples across 12 emotions and 5 languages. Shuffled baselines computed by randomly permuting reference assignments (1,000 iterations). Click headers to re-sort.

# Model Match Rate Shuffled Lift (pp) Cosine Sim Target Score Ref Score N

Match Rate = fraction where model's top emotion matches reference. Cosine Sim = alignment of full 12-emotion profile. Target Score = model's intensity on the expected emotion (human reference ~0.95).

Per-Emotion Breakdown

Match rate (%) by emotion, averaged across models. Positive/high-arousal emotions are mirrored well above the 8.3% uniform baseline; negative-valence emotions remain near zero — models produce a "helpful assistant" register that rarely sounds angry, sad, or impatient.

Per-Language Heatmap

Match rate (%) by language. English yields the highest rates, consistent with English-dominant training data. Arabic is the most challenging — Grok scores only 2.9%. Gemini 3.1 closes this gap (14.2% on Arabic), suggesting reasoning capacity disproportionately benefits non-English emotional accommodation.

Methodology

Source data. 567 two-speaker voice call recordings from Yapdo, Liva AI's ~50,000-hour multilingual conversational speech corpus. All speakers are friends/acquaintances having naturally occurring conversations — not acted or synthetic. Recordings span English, Hindi, Arabic, Italian, and Tagalog with separate speaker channels.

Emotion annotation. Per-segment scores generated by Empathic-Insight-Voice-Small, a 47-dimension vocal emotion classifier. 12 emotions selected for the benchmark: Amusement, Anger, Elation, Impatience, Surprise, Emotional Numbness, Contemplation, Disappointment, Confusion, Pride, Affection, Sadness.

Benchmark construction. Adjacent turn pairs identified where both speakers share the same dominant emotion within a 5-second gap. Speaker A's turn extended backward to ~10 seconds of context. From 11,124 candidate pairs, a balanced subset of 1,005 samples created by capping overrepresented emotions at 150.

Evaluation. Each model receives Speaker A's audio and generates a spoken response, scored by the same classifier. A match = model's top emotion equals Speaker B's reference. Three metrics: top emotion match rate, emotion vector cosine similarity, and target emotion intensity score.

Shuffled baseline. Computed by randomly permuting reference assignments (1,000 iterations) to control for distributional bias. The lift over shuffled baseline isolates genuine emotional accommodation from frequency bias.

Validation. Across 37,173 turn pairs in 200 recordings, 18.6% shared the same dominant emotion vs. 15.8% permutation baseline (p < 0.001), confirming genuine conversational emotional co-occurrence.