Full duplex vs half duplex - the spectrum of AI voice models [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
It seems that there are two ways to build voice AI:
Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.
Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.
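To make the distinction concrete, here is a minimal toy sketch, not how any particular product works. It assumes audio arrives as aligned frames, where each frame is either a word or `None` for silence. The names `half_duplex_heard` and `full_duplex_heard` are made up for illustration.

```python
from typing import List, Optional

Frame = Optional[str]  # one chunk of speech, or None for silence (toy assumption)

def half_duplex_heard(user: List[Frame], agent: List[Frame]) -> List[Frame]:
    """What the agent hears if it cannot listen while it is speaking."""
    # Any user frame that overlaps agent speech is dropped.
    return [None if a is not None else u for u, a in zip(user, agent)]

def full_duplex_heard(user: List[Frame], agent: List[Frame]) -> List[Frame]:
    """What the agent hears if the two directions are independent channels."""
    # Agent speech has no effect on what is received from the user.
    return list(user)

user_frames = [None, "wait", "stop", None]
agent_frames = ["The", "weather", None, None]

heard_half = half_duplex_heard(user_frames, agent_frames)
heard_full = full_duplex_heard(user_frames, agent_frames)
print(heard_half)  # [None, None, 'stop', None]
print(heard_full)  # [None, 'wait', 'stop', None]
```

In the half-duplex case the user's "wait" is lost because it lands while the agent is still talking, which is exactly the overlap problem.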
In fact, there are three crucial things half-duplex voice models can't really do:
- Overlap - talking and listening at the same time without falling apart
- Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
- Barge-in - getting interrupted mid-sentence and recovering gracefully
The lack of these three capabilities is a big reason why voice agents still feel “robotic”.
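As a rough illustration of the three behaviors listed above, here is a toy sketch that labels user speech overlapping the agent's speech as either a backchannel or a barge-in, using duration alone. It assumes per-frame voice-activity flags for each side. The function name `label_overlaps` and the `max_backchannel_frames` threshold are hypothetical, and a real system would likely need more than a length cutoff (for example, the words spoken or prosody).

```python
from typing import List, Tuple

def label_overlaps(
    agent: List[bool],
    user: List[bool],
    max_backchannel_frames: int = 3,  # hypothetical cutoff, not a tuned value
) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each run of user speech
    that begins while the agent is talking."""
    events = []
    n = min(len(agent), len(user))
    i = 0
    while i < n:
        starts_run = user[i] and (i == 0 or not user[i - 1])
        if starts_run and agent[i]:
            j = i
            while j < n and user[j]:
                j += 1  # extend to the end of this run of user speech
            # Short overlapping runs are treated as "mhm"-style backchannels,
            # longer ones as an attempt to take the floor.
            label = "backchannel" if j - i <= max_backchannel_frames else "barge-in"
            events.append((i, j, label))
            i = j
        else:
            i += 1  # user speech that starts in agent silence is a normal turn
    return events

agent_active = [True] * 10 + [False] * 2
user_active = [False, False, True, True, False, False,
               True, True, True, True, True, True]

events = label_overlaps(agent_active, user_active)
print(events)  # [(2, 4, 'backchannel'), (6, 12, 'barge-in')]
```

A strictly half-duplex system never sees these overlapping runs at all, so it cannot make this distinction in the first place.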
But what exactly does the spectrum from half-duplex to full-duplex look like? Is a Moshi-style architecture the only way to approach natural, full-duplex voice conversations? And in what ways could half-duplex systems imitate full-duplex behavior?
Would love to hear others' thoughts on this.