# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models
This repository contains LLaSO-Base-3.8B-Instruct, a 3.8B-parameter reference model from the LLaSO framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech-language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).
LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.
- Paper: [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://arxiv.org/abs/2508.15418)
- Code & Project Page: https://github.com/EIT-NLP/LLaSO
## What is LLaSO?
LLaSO is the first fully open, end-to-end stack for large-scale speech-language modeling, unifying data, evaluation, and modeling in one framework.
The framework provides three essential resources, together with this reference model:
- LLaSO-Align (12.0M): An ASR-based alignment corpus for grounding speech in textual semantic space.
- LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs): A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
- LLaSO-Eval (15,044 instances): A reproducible benchmark for standardized evaluation, particularly for instruction-following and cross-modality generalization.
- LLaSO-Base (3.8B): This model, a two-stage trained reference model adapted from LLaVA-style architectures for robust compositional understanding.
On LLaSO-Eval, LLaSO-Base achieves a normalized overall score of 0.72 across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
## Key Features
- Fully Open, End-to-End Stack: Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
- 25.5M Samples, 20 Tasks, 3 Modality Configurations: Supports all major text-audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
- Stratified Evaluation (15,044 instances): Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
- Robust Reference Model (3.8B): Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research.
- Empirical Insights: Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.
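The three modality configurations describe which part of an instruction–input pair is speech and which is text. As a purely illustrative sketch (the field names and file names below are assumptions, not the actual LLaSO-Instruct schema), the configurations can be pictured like this:

```python
# Hypothetical illustration of the three modality configurations.
# Field names are assumptions, not the actual LLaSO-Instruct schema.

samples = [
    {   # text instruction + audio input
        "config": "text_instruction_audio_input",
        "instruction": "Transcribe the following recording.",
        "audio": "clip_001.wav",
    },
    {   # audio instruction + text input
        "config": "audio_instruction_text_input",
        "instruction_audio": "spoken_question_001.wav",
        "text": "The quick brown fox jumps over the lazy dog.",
    },
    {   # pure audio: instruction and input are both speech
        "config": "pure_audio",
        "instruction_audio": "spoken_question_002.wav",
        "audio": "clip_002.wav",
    },
]

def modality_config(sample: dict) -> str:
    """Classify a sample into one of the three modality configurations."""
    has_audio_instruction = "instruction_audio" in sample
    has_audio_input = "audio" in sample
    if has_audio_instruction and has_audio_input:
        return "pure_audio"
    if has_audio_instruction:
        return "audio_instruction_text_input"
    return "text_instruction_audio_input"

for s in samples:
    assert modality_config(s) == s["config"]
```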
## Architecture & Two-Stage Training
LLaSO-Base adapts a LLaVA-style architecture and is trained in two stages: ASR-based alignment on LLaSO-Align to ground speech in textual semantic space, followed by multi-task instruction tuning on LLaSO-Instruct.
## Usage
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the LLaSO GitHub repository.
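As a minimal loading sketch only: the Hugging Face repo id, the `<audio>` placeholder token, and the `trust_remote_code` requirement below are all assumptions, not confirmed by this card; consult the GitHub repository for the actual interface.

```python
# Hypothetical usage sketch: the repo id, placeholder token, and loading
# details are assumptions; see the LLaSO GitHub repo for the real API.

MODEL_ID = "EIT-NLP/LLaSO-Base-3.8B-Instruct"  # assumed Hugging Face repo id

def build_prompt(instruction: str, audio_placeholder: str = "<audio>") -> str:
    """Pair a text instruction with an audio placeholder token
    (the token itself is an assumption, not stated by the model card)."""
    return f"{audio_placeholder}\n{instruction}"

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    # LLaVA-style adapted models typically ship custom modeling code,
    # hence trust_remote_code=True; dtype choice is illustrative.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    prompt = build_prompt("Transcribe the following recording.")
    # Preprocess the prompt plus a waveform with the processor, then call
    # model.generate(...); the exact call signature may differ.
```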
## How to Cite
If you use LLaSO in your research or applications, please cite our paper:
```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418},
}
```