arxiv:2506.02083

LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention

Published on Jun 2

Abstract

A novel disentanglement learning strategy using prefix-tuned cross-attention improves speaker recognition accuracy in multi-lingual settings by separating language information from speaker embeddings.

AI-generated summary

Speaker recognition models face challenges in multi-lingual settings because linguistic information becomes entangled within speaker embeddings. The overlap between vocal traits, such as accent and vocal anatomy, and a language's phonetic structure makes it difficult to separate linguistic from speaker information. Disentangling these components can significantly improve speaker recognition accuracy. To this end, we propose a novel disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention. The approach is particularly effective when speakers switch between languages. Experimental results show that the model generalizes across monolingual and multi-lingual settings, including unseen languages. Notably, the proposed model lowers the equal error rate (EER) across multiple datasets, highlighting its ability to separate language information from speaker embeddings and improve recognition under diverse linguistic conditions.
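
The abstract does not describe the implementation, so the following PyTorch sketch illustrates one plausible reading of prefix-tuned cross-attention for this task: learnable prefix tokens are prepended to the keys and values of a cross-attention layer where a speaker stream attends over a language stream. All module names, dimensions, and tensor shapes below are assumptions made for illustration, not LASPA's actual architecture.

# Minimal sketch of prefix-tuned cross-attention for speaker/language
# disentanglement, based only on the abstract above. Dimensions, names,
# and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class PrefixTunedCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, prefix_len=8):
        super().__init__()
        # Learnable prefix tokens, prepended to the keys/values so the
        # attention can be steered toward language-invariant speaker cues.
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speaker_emb, language_emb):
        # speaker_emb: (batch, seq_s, dim); language_emb: (batch, seq_l, dim)
        prefix = self.prefix.unsqueeze(0).expand(speaker_emb.size(0), -1, -1)
        kv = torch.cat([prefix, language_emb], dim=1)
        # The speaker stream queries the prefix + language stream;
        # a separate training objective would push language information
        # out of the returned representation.
        out, _ = self.attn(speaker_emb, kv, kv)
        return out

# Toy usage with random embeddings.
spk = torch.randn(2, 10, 256)
lang = torch.randn(2, 10, 256)
layer = PrefixTunedCrossAttention()
disentangled = layer(spk, lang)  # (2, 10, 256)

In this reading, the prefix parameters are what gets "tuned": they are cheap to train jointly with the two streams and condition the cross-attention without modifying the backbone encoders.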
