---
license: apache-2.0
tags:
- text-generation
- llama.cpp
- gguf
- quantized
- q3_k_s
model_type: llama
inference: false
base_model:
- sarvamai/sarvam-m
---
# sarvam-m-24b - Q3_K_S GGUF
This repository contains the **Q3_K_S** quantized version of sarvam-m-24b in GGUF format.
## Model Details
- **Quantization**: Q3_K_S
- **File Size**: ~9.7GB
- **Description**: Very small k-quant variant; noticeable quality loss compared with higher-bit quantizations, in exchange for a much smaller memory footprint
- **Format**: GGUF (compatible with llama.cpp)
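After downloading, the quantization type and architecture can be confirmed from the file's embedded metadata. A minimal sketch, assuming the `gguf` Python package (`pip install gguf`) and the filename used elsewhere in this card:

```python
from gguf import GGUFReader

# Read the GGUF header and print the embedded metadata keys
reader = GGUFReader("sarvam-m-24b-Q3_K_S.gguf")
for name in reader.fields:
    print(name)  # e.g. general.architecture, general.file_type
print(f"tensor count: {len(reader.tensors)}")
```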
## Usage
### With llama.cpp
```bash
# Download the model file into the current directory
huggingface-cli download tifin-india/sarvam-m-24b-q3_k_s-gguf sarvam-m-24b-Q3_K_S.gguf --local-dir .

# Run inference (the binary is named llama-cli in recent llama.cpp builds; older builds use ./main)
./llama-cli -m sarvam-m-24b-Q3_K_S.gguf -p "Your prompt here"
```
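For an HTTP endpoint instead of a one-shot prompt, recent llama.cpp builds also ship an OpenAI-compatible server. A short sketch (the context size and port here are assumptions, not requirements):

```bash
# Serve the model over an OpenAI-compatible HTTP API on port 8080
./llama-server -m sarvam-m-24b-Q3_K_S.gguf -c 2048 --port 8080
```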
### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./sarvam-m-24b-Q3_K_S.gguf",
    n_ctx=2048,        # Context length
    n_gpu_layers=35,   # Layers to offload; adjust for your GPU (-1 offloads all)
    verbose=False,
)

# Generate text
response = llm("Your prompt here", max_tokens=100)
print(response["choices"][0]["text"])
```
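For multi-turn use, llama-cpp-python also exposes a chat-style API. A short sketch continuing the setup above (the message content is a placeholder):

```python
# Chat-style generation using the model's built-in chat template
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])
```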
### With Transformers (GGUF dequantization)
Recent versions of Transformers can load GGUF checkpoints directly by dequantizing them on load (this requires the `gguf` package, and the dequantized 24B model needs substantial RAM):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "tifin-india/sarvam-m-24b-q3_k_s-gguf"
gguf_file = "sarvam-m-24b-Q3_K_S.gguf"

# The gguf_file argument tells Transformers to dequantize the GGUF weights on load
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```
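Continuing the snippet above, generation then works as with any Transformers causal LM (the prompt is a placeholder):

```python
# Generate a completion from the dequantized model
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```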
## Performance Characteristics
| Aspect | Rating |
|--------|--------|
| **Speed** | ⭐⭐⭐⭐ |
| **Quality** | ⭐⭐ |
| **Memory** | ⭐⭐⭐⭐ |
## Original Model
This is a quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m). For the full-precision weights and complete model details, please refer to the original repository.
## Quantization Details
This model was quantized using llama.cpp's quantization tools. Q3_K_S is one of the smallest k-quant formats: it prioritizes file size and memory use over output quality, which makes it best suited to memory-constrained setups where larger quants do not fit.
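For reference, producing a file like this one with llama.cpp's tooling looks roughly as follows. A sketch, assuming a full-precision GGUF export of the base model as input (the input filename is hypothetical):

```bash
# Quantize a full-precision GGUF to Q3_K_S (binary named llama-quantize in recent builds)
./llama-quantize sarvam-m-24b-F16.gguf sarvam-m-24b-Q3_K_S.gguf Q3_K_S
```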
## License
This model follows the same license as the original model (Apache 2.0).
## Citation
If you use this model, please cite the original model authors and acknowledge the quantization.