Semantic-Ar-Qwen-Embed-0.6B

This is a sentence-transformers model fine-tuned from Qwen/Qwen3-Embedding-0.6B on STS tasks for better Arabic semantic understanding. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Qwen/Qwen3-Embedding-0.6B
  • Maximum Sequence Length: 32768 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: ar
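
Because the similarity function listed above is cosine similarity, the score returned by model.similarity can also be obtained by normalizing the embeddings and taking a dot product. A minimal sketch (the two Arabic sentences below are illustrative, not from the model card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Two illustrative Arabic sentences
emb = model.encode(["القاهرة مدينة كبيرة.", "القاهرة عاصمة مزدحمة."], normalize_embeddings=True)
print(emb.shape)        # (2, 1024) -- matches the output dimensionality listed above
print(emb[0] @ emb[1])  # cosine similarity, since the embeddings are unit-normalized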

📊 Performance Evaluation

This model has been evaluated on Arabic semantic-similarity benchmarks using the MTEB framework. Below are Spearman correlation scores on two tasks, STS17 and STS22.v2, along with their average (a reproduction sketch with the mteb library follows the first table).

| Model | STS17 (Spearman) | STS22.v2 (Spearman) | Average |
|---|---|---|---|
| Qwen3 Embeddings 0.6B | 0.7505 | 0.6520 | 0.7013 |
| Qwen3 Embeddings 4B | 0.7912 | 0.6669 | 0.7291 |
| Qwen3 Embeddings 8B | 0.8220 | 0.6680 | 0.7450 |
| Semantic-Ar-Qwen-Embed-V0.1 | 0.8300 | 0.6130 | 0.7215 |

  • STS17: Sentence similarity from classical Arabic benchmarks
  • STS22.v2: Diverse, multi-domain Arabic similarity pairs
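
These scores can be reproduced with the mteb library. The following is a minimal sketch, assuming a recent mteb release that exposes get_tasks/MTEB and accepts a SentenceTransformer model directly; the task names and output folder are illustrative:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Select the Arabic subsets of the two STS tasks reported above
tasks = mteb.get_tasks(tasks=["STS17", "STS22.v2"], languages=["ara"])
evaluation = mteb.MTEB(tasks=tasks)

# Spearman correlations are written as JSON to the output folder
results = evaluation.run(model, output_folder="results/semantic-ar-qwen-embed-0.6b")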

Performance compared with other models (Spearman ×100):

| Model | Dim | # Params | STS17 | STS22-v2 | Average |
|---|---|---|---|---|---|
| Arabic-Triplet-Matryoshka-V2 | 768 | 135M | 85 | 64 | 75 |
| Arabert-all-nli-triplet-Matryoshka | 768 | 135M | 83 | 64 | 74 |
| GATE-AraBert-V1 | 767 | 135M | 83 | 63 | 73 |
| AraGemma-Embedding-300m | 768 | 303M | 84 | 62 | 73 |
| Semantic-Ar-Qwen-Embed-0.6B | 1024 | 596M | 83 | 61 | 72 |
| Marbert-all-nli-triplet-Matryoshka | 768 | 163M | 82 | 61 | 72 |
| Arabic-labse-Matryoshka | 768 | 471M | 82 | 61 | 72 |
| AraEuroBert-Small | 768 | 210M | 80 | 61 | 71 |
| E5-all-nli-triplet-Matryoshka | 384 | 278M | 80 | 60 | 70 |
| text-embedding-3-large | 3072 | - | 81 | 59 | 70 |
| Arabic-all-nli-triplet-Matryoshka | 768 | 135M | 82 | 54 | 68 |
| AraEuroBert-Mid | 1151 | 610M | 83 | 53 | 68 |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 135M | 79 | 55 | 67 |
| AraEuroBert-Large | 2304 | 2.1B | 79 | 55 | 67 |
| text-embedding-ada-002 | 1536 | - | 71 | 62 | 66 |
| text-embedding-3-small | 1536 | - | 72 | 57 | 65 |

📌 Insights

  • Semantic-Ar-Qwen-Embed-V0.1 leads on STS17, indicating task specialization.
  • Qwen3 8B achieves the highest average and top STS22.v2 score, making it the best all-rounder.
  • Model size clearly correlates with performance across Qwen variants.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
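
The two modules map directly onto a plain transformers workflow: module (0) produces per-token hidden states and module (1) mean-pools them over non-padding tokens to give one 1024-dimensional vector per sentence. A minimal sketch of that pipeline, assuming the checkpoint loads with AutoModel/AutoTokenizer (the Arabic sentences are illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["الجانب الأيسر من محرك قطار فضي.", "صورة مقربة لمحرك قطار أسود."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 1024)

# Mean pooling over non-padding tokens, as configured in the Pooling module above
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # expected: torch.Size([2, 1024])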

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Load model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Sentences for embedding (English + Arabic)
sentences = [
    'Left side of a silver train engine.',
    'A close-up of a black train engine.',
    "One idea that's been going around at least since the 80s is that you can distinguish between Holds and Moves.",
    
    "الجانب الأيسر من محرك قطار فضي.",
    "صورة مقربة لمحرك قطار أسود.",
    "إحدى الأفكار المتداولة منذ الثمانينات هي إمكانية التمييز بين الثبات والحركة.",
]

# Generate embeddings
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape)
# Output: (6, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity shape:", similarities.shape)
# Output: (6, 6)

# Optionally print similarity scores as a labelled table
import pandas as pd

# model.similarity returns a torch tensor; convert it to NumPy for pandas
df = pd.DataFrame(similarities.cpu().numpy().round(3), index=sentences, columns=sentences)
print("\nSimilarity matrix:\n")
print(df)
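
Beyond pairwise similarity, the same embeddings support semantic search. A short sketch using sentence_transformers.util.semantic_search; the corpus and query below are hypothetical, for illustration only:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Hypothetical corpus and query
corpus = [
    "القاهرة هي عاصمة مصر.",
    "الرياض هي عاصمة المملكة العربية السعودية.",
    "القطط تحب النوم لساعات طويلة.",
]
query = "ما هي عاصمة مصر؟"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns the top_k most similar corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))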

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}