# Model Quantization with llama.cpp
This README explains how to use the `quantize_models.sh` script to create quantized versions of your GGUF models.
## Prerequisites
- llama.cpp must be cloned and built in this directory (the script uses the binaries under `llama.cpp/build/bin/`)
- You need a base GGUF model (the default is `osmosis-mcp-4B-BF16.gguf`)
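If llama.cpp is not built yet, the standard CMake build from the upstream README produces the binaries under `llama.cpp/build/bin/` that this script and the commands below rely on:

```bash
# Clone and build llama.cpp in the current directory
git clone https://github.com/ggerganov/llama.cpp
cmake -B llama.cpp/build -S llama.cpp
cmake --build llama.cpp/build --config Release
```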
## How to Use
- Make sure your base model is in the current directory
- Run the script:

```bash
./quantize_models.sh
```
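If the script is not executable yet, mark it as such first:

```bash
chmod +x quantize_models.sh
```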
## Supported Quantization Formats
The script creates the following quantized versions:
| Format | Description | Approximate Size |
|---|---|---|
| Q2_K | 2-bit quantization, extremely small | ~18% of original |
| Q3_K_S | 3-bit quantization, smaller size | ~21% of original |
| Q3_K_M | 3-bit quantization, medium size | ~23% of original |
| Q3_K_L | 3-bit quantization, larger size | ~25% of original |
| IQ4_XS | Improved 4-bit non-linear quantization, extra small | ~27% of original |
| Q4_K_S | 4-bit quantization, smaller size | ~28% of original |
| Q4_K_M | 4-bit quantization, medium size | ~29% of original |
| Q5_K_S | 5-bit quantization, smaller size | ~33% of original |
| Q5_K_M | 5-bit quantization, medium size | ~34% of original |
| Q6_K | 6-bit quantization, balanced quality and size | ~38% of original |
| Q8_0 | 8-bit quantization, highest quality | ~50% of original |
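As a rough worked example: a 4B-parameter model stored as BF16 is about 8 GB (two bytes per parameter), so the Q4_K_S output at ~28% comes out near 2.2 GB. After the script finishes, you can compare the actual sizes:

```bash
# Compare the sizes of the original and quantized outputs
ls -lh *.gguf
```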
## Customizing the Script
If you want to quantize a different base model, edit the `INPUT_MODEL` variable in the script:

```bash
# Input model file
INPUT_MODEL="your-model-file.gguf"
```
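For orientation, the core of such a script is typically just a loop over llama.cpp's `llama-quantize` tool. The sketch below is an assumption about the script's shape, not a copy of it; `OUTPUT_PREFIX` and the exact format list are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Input model file
INPUT_MODEL="osmosis-mcp-4B-BF16.gguf"
# Prefix for output filenames (hypothetical variable; matches the
# osmosis-mcp-4b.<FORMAT>.gguf naming used elsewhere in this README)
OUTPUT_PREFIX="osmosis-mcp-4b"

# Run one llama-quantize pass per target format
for TYPE in Q2_K Q3_K_S Q3_K_M Q3_K_L IQ4_XS Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0; do
  llama.cpp/build/bin/llama-quantize "$INPUT_MODEL" "${OUTPUT_PREFIX}.${TYPE}.gguf" "$TYPE"
done
```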
## Time and Space Requirements
- Quantization can take anywhere from several minutes to an hour or more, depending on your hardware
- Make sure you have enough free disk space for all the output models
- The quantized outputs together take roughly 3x the size of the original model, on top of the original itself
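Before running, it is worth checking the available space; with an ~8 GB base model, that budget works out to roughly 34 GB in total (original plus outputs):

```bash
# Show free space on the filesystem holding the current directory
df -h .
```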
## Using the Quantized Models
Each quantized model can be used with the llama.cpp tools, for example:

```bash
llama.cpp/build/bin/llama-cli -m osmosis-mcp-4b.Q4_K_S.gguf -p "Your prompt here"
```
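The same files also work with llama.cpp's HTTP server, which exposes an OpenAI-compatible API; the port here is just an example:

```bash
# Serve a quantized model over HTTP on port 8080
llama.cpp/build/bin/llama-server -m osmosis-mcp-4b.Q5_K_M.gguf --port 8080
```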
Choose a quantization format based on your needs:
- Smaller quantizations (Q2_K, Q3_K_S) for hardware with limited memory
- Medium quantizations (Q4_K_M, Q5_K_S) for a balance of quality and size
- Larger quantizations (Q6_K, Q8_0) for the highest quality when hardware allows
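If you are unsure which format to pick, llama.cpp's perplexity tool lets you compare quality directly on your own text (lower perplexity is closer to the original model; `wiki.test.raw` is a placeholder for any representative text file):

```bash
# Measure perplexity of a quantized model on a sample text file
llama.cpp/build/bin/llama-perplexity -m osmosis-mcp-4b.Q4_K_S.gguf -f wiki.test.raw
```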