Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko,
Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
NVIDIA, USA
kevinhu@nvidia.com

Abstract

Spoken dialogue is the most intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as handling user barge-in. We propose a novel duplex S2S architecture that features continuous user input and codec agent output with channel fusion, directly modeling simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that requires no speech pretraining. Separate architectures for agent and user modeling also facilitate codec fine-tuning toward better agent voices, with a low-bitrate codec (0.6 kbps) that halves the bitrate of previous S2S works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in capabilities. Because it requires no speech pretraining, the model needs far less speech data and significantly simplifies the process of building a duplex S2S model from any LLM backbone. Finally, it is the first openly available duplex S2S model with both training and inference code for full reproducibility.
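The channel-fusion idea described above can be sketched as follows: per-frame user features from a pretrained encoder are combined with embeddings of the agent's previous codec tokens before entering the LLM backbone. This is a minimal, hypothetical illustration; the class name, dimensions, and concatenation-plus-projection fusion are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DuplexChannelFusion(nn.Module):
    """Illustrative duplex channel fusion (not the paper's exact design).

    Two time-synchronized streams are fused per frame:
      - continuous user features from a pretrained speech encoder
      - embeddings of the agent's previous-step codec tokens
    """

    def __init__(self, user_dim=512, codec_vocab=1024, agent_dim=512, model_dim=768):
        super().__init__()
        self.agent_embed = nn.Embedding(codec_vocab, agent_dim)
        # Fuse the two channels by concatenation followed by a projection.
        self.fuse = nn.Linear(user_dim + agent_dim, model_dim)

    def forward(self, user_feats, agent_tokens):
        # user_feats:   (batch, frames, user_dim)  continuous encoder output
        # agent_tokens: (batch, frames)            previous agent codec tokens
        agent_feats = self.agent_embed(agent_tokens)
        fused = torch.cat([user_feats, agent_feats], dim=-1)
        return self.fuse(fused)  # (batch, frames, model_dim)

fusion = DuplexChannelFusion()
user = torch.randn(2, 50, 512)              # 50 synchronized frames
agent = torch.randint(0, 1024, (2, 50))
out = fusion(user, agent)                   # shape: (2, 50, 768)
```

Because both streams share one frame rate, the model can emit agent tokens at every step while continuously observing the user, which is what allows barge-in to be modeled directly rather than with explicit turn markers.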


Listening Examples

Below are some speech-to-speech conversation examples from our model.

Robustness with frequent interruption

Example-1: Travel place suggestion. The user interrupts the agent three times in 15 seconds, leaving the agent limited time to respond. Note that the user did not continue with “world trade center” but with “Aspen,” because the user input is pre-recorded and fixed.


Example-2: Dinner place suggestion. The user interrupts the agent twice in 15 seconds.


Example-3: Multi-turn chat on an unseen topic. The user selects a random topic and interrupts the agent many times.


Example-4: Impatient user. The user interrupts the agent within 2 seconds.


Reasoning Problem

Example-5: Q&A with summarization (unseen). Our SFT data does NOT include any explicit QA-followed-by-summary format.


Example-6: Role-play with interruption.