Vietnamese_Embedding is an embedding model fine-tuned from the BGE-M3 model (https://huggingface.co/BAAI/bge-m3) to enhance retrieval capabilities for Vietnamese.
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub and cap the input length at 2048 tokens.
model = SentenceTransformer("AITeamVN/Vietnamese_Embedding")
model.max_seq_length = 2048

# Queries: "What is artificial intelligence?" and "The benefits of sleep".
sentences_1 = ["Trí tuệ nhân tạo là gì", "Lợi ích của giấc ngủ"]
# Documents: one passage answering each query, in the same order.
sentences_2 = ["Trí tuệ nhân tạo là công nghệ giúp máy móc suy nghĩ và học hỏi như con người. Nó hoạt động bằng cách thu thập dữ liệu, nhận diện mẫu và đưa ra quyết định.",
               "Giấc ngủ giúp cơ thể và não bộ nghỉ ngơi, hồi phục năng lượng và cải thiện trí nhớ. Ngủ đủ giấc giúp tinh thần tỉnh táo và làm việc hiệu quả hơn."]

# encode() returns one embedding per input sentence.
query_embedding = model.encode(sentences_1)
doc_embeddings = model.encode(sentences_2)

# Score every query against every document: rows are queries, columns are documents.
similarity = query_embedding @ doc_embeddings.T
print(similarity)
'''
array([[0.66212064, 0.33066642],
       [0.25866613, 0.5865289 ]], dtype=float32)
'''
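The matrix above has one row per query and one column per document, so the best match for each query is the column with the highest score in its row. A minimal sketch of that ranking step, using only NumPy and the scores printed above (the names `ranking` and `best` are illustrative and not part of the model's API):

```python
import numpy as np

# Similarity scores copied from the example output above:
# rows are queries, columns are documents.
similarity = np.array(
    [[0.66212064, 0.33066642],
     [0.25866613, 0.5865289]],
    dtype=np.float32,
)

# For each query, sort document indices from highest to lowest score.
ranking = np.argsort(-similarity, axis=1)

# Index of the top-ranked document for each query.
best = ranking[:, 0]

for query_idx, doc_idx in enumerate(best):
    score = similarity[query_idx, doc_idx]
    print(f"query {query_idx} -> document {doc_idx} (score {score:.4f})")
```

With these scores, each query is matched to the passage at the same position, which is the expected pairing for the two example sentences.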
Model | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 | MRR@10 |
---|---|---|---|---|---|
Vietnamese_Reranker | 0.7944 | 0.9324 | 0.9537 | 0.9740 | 0.8672 |
Vietnamese_Embedding_v2 | 0.7262 | 0.8927 | 0.9268 | 0.9578 | 0.8149 |
Vietnamese_Embedding (public) | 0.7274 | 0.8992 | 0.9305 | 0.9568 | 0.8181 |
Vietnamese-bi-encoder (BKAI) | 0.7109 | 0.8680 | 0.9014 | 0.9299 | 0.7951 |
BGE-M3 | 0.5682 | 0.7728 | 0.8382 | 0.8921 | 0.6822 |
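
The columns in the table are common retrieval metrics. The source does not spell out their definitions, so the sketch below assumes that Accuracy@k is the fraction of queries with a relevant document in the top k results and that MRR@10 is the mean reciprocal rank of the first relevant document within the top 10. The helpers `accuracy_at_k` and `mrr_at_k` and the toy data are hypothetical; this is not the authors' evaluation script.

```python
from typing import Sequence, Set


def accuracy_at_k(ranked: Sequence[Sequence[int]], relevant: Sequence[Set[int]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for docs, rel in zip(ranked, relevant)
        if any(doc in rel for doc in docs[:k])
    )
    return hits / len(ranked)


def mrr_at_k(ranked: Sequence[Sequence[int]], relevant: Sequence[Set[int]], k: int = 10) -> float:
    """Mean of 1/rank of the first relevant document; a query scores 0 if none is in the top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data: ranked document ids per query, and the set of relevant ids per query.
ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
relevant = [{3}, {1}, {9}]

acc1 = accuracy_at_k(ranked, relevant, k=1)
acc3 = accuracy_at_k(ranked, relevant, k=3)
mrr10 = mrr_at_k(ranked, relevant, k=10)
print(acc1, acc3, mrr10)
```

In the toy data, the first query finds its relevant document at rank 1, the second at rank 3, and the third not at all, so Accuracy@1 is 1/3, Accuracy@3 is 2/3, and MRR@10 is (1 + 1/3 + 0) / 3.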
Vietnamese_Reranker and Vietnamese_Embedding_v2 were trained on 1,100,000 triplets.
Vietnamese_Embedding_v2 scores slightly lower on the legal domain than the public Vietnamese_Embedding (for example, MRR@10 of 0.8149 versus 0.8181), but because the training data in this phase is much larger, it performs very well on other domains.
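
A training triplet typically pairs a query with one relevant (positive) passage and one irrelevant (negative) passage. The source does not describe the data format or the loss used in training, so the following is only a generic illustration of a margin-based triplet objective on cosine similarities. The `Triplet` class, the `triplet_margin_loss` helper, the margin value, and the toy vectors are all hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Triplet:
    query: str
    positive: str  # passage that answers the query
    negative: str  # passage that does not answer the query


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def triplet_margin_loss(q: np.ndarray, pos: np.ndarray, neg: np.ndarray, margin: float = 0.2) -> float:
    """Zero once the positive outscores the negative by at least `margin`."""
    return max(0.0, margin - cosine(q, pos) + cosine(q, neg))


# Placeholder text and hand-made 2-D vectors standing in for real embeddings.
example = Triplet(query="what is AI", positive="AI is ...", negative="sleep helps ...")
q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])

well_separated = triplet_margin_loss(q, pos, neg)  # positive already wins
swapped = triplet_margin_loss(q, neg, pos)         # negative wins, so the loss is positive
print(well_separated, swapped)
```

The loss is zero when the query is already closer to the positive passage than to the negative one by the margin, and grows as the negative passage scores higher.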
Both models are available via these links: Vietnamese_Embedding_v2, Vietnamese_Reranker
You can reproduce the evaluation results by running `python evaluation_model.py` (the data is downloaded from Kaggle).
Email: nguyennhotrung3004@gmail.com
Developer
Members: Nguyễn Nho Trung, Nguyễn Nhật Quang, Nguyen Van Huy
@misc{Vietnamese_Embedding,
title={Vietnamese_Embedding: Embedding model in Vietnamese language.},
author={Nguyen Nho Trung and Nguyen Nhat Quang and Nguyen Van Huy},
year={2025},
publisher={Huggingface},
}