
LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models

This repository contains LLaSO-Base-3.8B-Instruct, a 3.8B-parameter reference model from the LLaSO framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).

LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.


πŸ” What is LLaSO?

LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.

The framework provides three essential resources plus a reference model:

  • LLaSO-Align (12.0M): An ASR-based alignment corpus for grounding speech in textual semantic space.
  • LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs): A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
  • LLaSO-Eval (15,044): A reproducible benchmark for standardized evaluation, particularly for instruction-following and cross-modality generalization.
  • LLaSO-Base (3.8B): This model, a two-stage trained reference model adapted from LLaVA-style architectures for robust compositional understanding.

*Figure: Overall performance of LLaSO-Base on LLaSO-Eval.*

LLaSO-Base achieves a strong normalized overall score on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.

✨ Key Features

  • Fully Open, End-to-End Stack: Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
  • 25.5M Samples, 20 Tasks, 3 Modality Configurations: Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
  • Stratified Evaluation (15,044): Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
  • Robust Reference Model (3.8B): Two-stage training (ASR alignment β†’ instruction tuning), easily reproducible and extensible for further research.
  • Empirical Insights: Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.

*Figure 6: Architecture and two-stage training (ASR alignment → instruction tuning).*

πŸš€ Usage

For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the LLaSO GitHub repository.
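As a starting point, the checkpoint can likely be pulled from the Hub with the standard `transformers` auto classes. This is a minimal sketch, not the project's official loading path: the model uses a custom `llava_llama` architecture, so `trust_remote_code=True` (or the project's own code from the GitHub repository) is assumed to be required, and audio preprocessing is not covered here.

```python
# Hypothetical loading sketch for LLaSO-Base-3.8B-Instruct.
# Assumptions: the Hub repo ships custom modeling code usable via
# trust_remote_code; consult the LLaSO GitHub repository for the
# supported inference pipeline (especially audio input handling).

MODEL_ID = "YirongSun/LLaSO-Base-3.8B-Instruct"  # repo id from this card

def load_llaso(model_id: str = MODEL_ID):
    """Load tokenizer and BF16 weights (downloads ~4.26B params)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,      # weights are stored in BF16
        trust_remote_code=True,          # custom llava_llama architecture
    )
    return tokenizer, model

# Usage (downloads the full checkpoint):
# tokenizer, model = load_llaso()
```

Note that text-only generation through this path will not exercise the speech encoder; for the full audio-text-to-text pipeline, follow the GitHub repository's instructions.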

πŸ“‘ How to Cite

If you use LLaSO in your research or applications, please cite our paper:

@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model}, 
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418}, 
}
**Model size:** 4.26B params · **Tensor type:** BF16 (Safetensors)