# Model Quantization with llama.cpp
This README explains how to use the `quantize_models.sh` script to create quantized versions of your GGUF models.
## Prerequisites
- llama.cpp must be cloned and built in this directory (the script uses the binaries under `llama.cpp/build/bin/`)
- You need a base GGUF model (the default is `osmosis-mcp-4B-BF16.gguf`)
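If llama.cpp is not built yet, the standard CMake build from the upstream README produces the binaries under `llama.cpp/build/bin/` that this script and the commands below rely on:

```bash
# Clone and build llama.cpp in the current directory
git clone https://github.com/ggerganov/llama.cpp
cmake -B llama.cpp/build -S llama.cpp
cmake --build llama.cpp/build --config Release
```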
## How to Use
- Make sure your base model is in the current directory
- Run the script:

```bash
./quantize_models.sh
```
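If the script is not executable yet, mark it as such first:

```bash
chmod +x quantize_models.sh
```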
## Supported Quantization Formats
The script creates the following quantized versions:
| Format | Description | Approximate Size |
|---|---|---|
| Q2_K | 2-bit quantization, extremely small | ~18% of original |
| Q3_K_S | 3-bit quantization, smaller size | ~21% of original |
| Q3_K_M | 3-bit quantization, medium size | ~23% of original |
| Q3_K_L | 3-bit quantization, larger size | ~25% of original |
| IQ4_XS | Improved 4-bit non-linear quantization, extra small | ~27% of original |
| Q4_K_S | 4-bit quantization, smaller size | ~28% of original |
| Q4_K_M | 4-bit quantization, medium size | ~29% of original |
| Q5_K_S | 5-bit quantization, smaller size | ~33% of original |
| Q5_K_M | 5-bit quantization, medium size | ~34% of original |
| Q6_K | 6-bit quantization, balanced quality and size | ~38% of original |
| Q8_0 | 8-bit quantization, highest quality | ~50% of original |
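As a rough worked example: a 4B-parameter model stored as BF16 is about 8 GB (two bytes per parameter), so the Q4_K_S output at ~28% comes out near 2.2 GB. After the script finishes, you can compare the actual sizes:

```bash
# Compare the sizes of the original and quantized outputs
ls -lh *.gguf
```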
## Customizing the Script
If you want to quantize a different base model, edit the `INPUT_MODEL` variable in the script:

```bash
# Input model file
INPUT_MODEL="your-model-file.gguf"
```
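For orientation, the core of such a script is typically just a loop over llama.cpp's `llama-quantize` tool. The sketch below is an assumption about the script's shape, not a copy of it; `OUTPUT_PREFIX` and the exact format list are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Input model file
INPUT_MODEL="osmosis-mcp-4B-BF16.gguf"
# Prefix for output filenames (hypothetical variable; matches the
# osmosis-mcp-4b.<FORMAT>.gguf naming used elsewhere in this README)
OUTPUT_PREFIX="osmosis-mcp-4b"

# Run one llama-quantize pass per target format
for TYPE in Q2_K Q3_K_S Q3_K_M Q3_K_L IQ4_XS Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0; do
  llama.cpp/build/bin/llama-quantize "$INPUT_MODEL" "${OUTPUT_PREFIX}.${TYPE}.gguf" "$TYPE"
done
```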
## Time and Space Requirements
- Quantization can take anywhere from several minutes to an hour or more, depending on your hardware
- Make sure you have enough free disk space for all the output models
- The quantized outputs together take roughly 3x the size of the original model, on top of the original itself
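Before running, it is worth checking the available space; with an ~8 GB base model, that budget works out to roughly 34 GB in total (original plus outputs):

```bash
# Show free space on the filesystem holding the current directory
df -h .
```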
## Using the Quantized Models
Each quantized model can be used with the llama.cpp tools, for example:

```bash
llama.cpp/build/bin/llama-cli -m osmosis-mcp-4b.Q4_K_S.gguf -p "Your prompt here"
```
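The same files also work with llama.cpp's HTTP server, which exposes an OpenAI-compatible API; the port here is just an example:

```bash
# Serve a quantized model over HTTP on port 8080
llama.cpp/build/bin/llama-server -m osmosis-mcp-4b.Q5_K_M.gguf --port 8080
```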
Choose a quantization format based on your needs:
- Smaller quantizations (Q2_K, Q3_K_S) for hardware with limited memory
- Medium quantizations (Q4_K_M, Q5_K_S) for a balance of quality and size
- Larger quantizations (Q6_K, Q8_0) for the highest quality when hardware allows
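If you are unsure which format to pick, llama.cpp's perplexity tool lets you compare quality directly on your own text (lower perplexity is closer to the original model; `wiki.test.raw` is a placeholder for any representative text file):

```bash
# Measure perplexity of a quantized model on a sample text file
llama.cpp/build/bin/llama-perplexity -m osmosis-mcp-4b.Q4_K_S.gguf -f wiki.test.raw
```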