Model Quantization with llama.cpp

This README explains how to use the quantize_models.sh script to create quantized versions of your GGUF models.

Prerequisites

  • llama.cpp must be cloned and built in this directory (a sample build is shown after this list)
  • You need a base GGUF model (default is osmosis-mcp-4B-BF16.gguf)
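
If llama.cpp is not built yet, a plain CPU build with CMake is usually enough. A minimal sketch; backend-specific options (CUDA, Metal, etc.) are documented in the upstream llama.cpp README:

git clone https://github.com/ggerganov/llama.cpp
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release -j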

How to Use

  1. Make sure your base model is in the current directory
  2. Run the script:
./quantize_models.sh
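
The script is expected to be little more than a loop over llama.cpp's llama-quantize tool. A minimal sketch of that core loop, assuming the default input model and the formats listed below (the shipped quantize_models.sh may differ in detail):

#!/usr/bin/env bash
set -euo pipefail

INPUT_MODEL="osmosis-mcp-4B-BF16.gguf"
QUANTIZE="llama.cpp/build/bin/llama-quantize"

for FMT in Q2_K Q3_K_S Q3_K_M Q3_K_L IQ4_XS Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K Q8_0; do
  OUT="osmosis-mcp-4b.${FMT}.gguf"          # naming follows the usage example below
  echo "Quantizing ${INPUT_MODEL} -> ${OUT} (${FMT})"
  "$QUANTIZE" "$INPUT_MODEL" "$OUT" "$FMT"
done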

Supported Quantization Formats

The script will create the following quantized versions:

Format   Description                                           Approximate Size
Q2_K     2-bit quantization, extremely small                   ~18% of original
Q3_K_S   3-bit quantization, smaller size                      ~21% of original
Q3_K_M   3-bit quantization, medium size                       ~23% of original
Q3_K_L   3-bit quantization, larger size                       ~25% of original
IQ4_XS   Improved 4-bit non-linear quantization, extra small   ~27% of original
Q4_K_S   4-bit quantization, smaller size                      ~28% of original
Q4_K_M   4-bit quantization, medium size                       ~29% of original
Q5_K_S   5-bit quantization, smaller size                      ~33% of original
Q5_K_M   5-bit quantization, medium size                       ~34% of original
Q6_K     6-bit quantization, balanced quality and size         ~38% of original
Q8_0     8-bit quantization, highest quality                   ~50% of original

Customizing the Script

If you want to quantize a different base model, edit the INPUT_MODEL variable in the script:

# Input model file
INPUT_MODEL="your-model-file.gguf"
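
If you only have the original Hugging Face checkpoint rather than a GGUF file, llama.cpp's convert_hf_to_gguf.py can produce the base model first. A minimal sketch; the checkpoint directory name here is a placeholder for your local path:

python llama.cpp/convert_hf_to_gguf.py ./osmosis-mcp-4b \
  --outfile osmosis-mcp-4B-BF16.gguf --outtype bf16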

Time and Space Requirements

  • Quantization can take from several minutes to an hour depending on your hardware
  • Make sure you have enough free disk space before starting (a quick check is sketched below)
  • The output models together come to roughly 3x the size of the original (the percentages in the table sum to about 330%), so budget about 4x once the original is included
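
A quick pre-flight check on GNU/Linux; this assumes GNU stat and df (on macOS, use stat -f %z and adjust the df invocation):

INPUT_MODEL="osmosis-mcp-4B-BF16.gguf"
MODEL_BYTES=$(stat -c %s "$INPUT_MODEL")        # size of the base model in bytes
FREE_BYTES=$(df --output=avail -B1 . | tail -1) # free bytes on this filesystem
if [ "$FREE_BYTES" -lt $((MODEL_BYTES * 3)) ]; then
  echo "Warning: less than 3x the model size is free on this disk"
fi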

Using the Quantized Models

Each quantized model can be used with llama.cpp tools:

llama.cpp/build/bin/llama-cli -m osmosis-mcp-4b.Q4_K_S.gguf -p "Your prompt here"
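
The same files also work with llama.cpp's OpenAI-compatible HTTP server; a minimal sketch (the port is arbitrary):

llama.cpp/build/bin/llama-server -m osmosis-mcp-4b.Q4_K_S.gguf --port 8080
# then send chat requests to http://localhost:8080/v1/chat/completions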

Choose the quantization format based on your needs (a way to measure the quality trade-off yourself is sketched after this list):

  • Lower-bit formats (Q2_K, Q3_K_S) when memory or disk space is tight, at a noticeable quality cost
  • Mid-range formats (Q4_K_M, Q5_K_S) for a balanced quality/size trade-off
  • Higher-bit formats (Q6_K, Q8_0) for the best quality when your hardware allows it
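
To compare formats empirically, llama.cpp includes a perplexity tool; lower perplexity on the same text means less quality loss. A minimal sketch, where wiki.test.raw is a placeholder for any plain-text evaluation file you supply:

llama.cpp/build/bin/llama-perplexity -m osmosis-mcp-4b.Q4_K_S.gguf -f wiki.test.raw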