How did you quantize the model into uint8?
Could you share your method? Thanks
only about 94% of the model's nodes are quantized (approximate figure, i don't have the numbers in front of me right now). i instrumented the model and analyzed it during inference to weed out nodes that were bad candidates for quantization. this increased the model's accuracy but also increased its size a bit.
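in case it's useful, here's a rough sketch of what that exclusion step can look like if you go through onnxruntime's static quantization (not necessarily my exact pipeline). the file paths, input name, calibration batches, and excluded node names are all placeholders you'd swap for whatever your own per-node analysis turns up:

```python
# rough sketch, assuming onnxruntime's static quantization tooling;
# paths, input name, calibration data, and node names are placeholders.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class CalibReader(CalibrationDataReader):
    """Feeds pre-built calibration batches to the quantizer one at a time."""
    def __init__(self, batches, input_name="input_ids"):
        self._iter = iter([{input_name: batch} for batch in batches])

    def get_next(self):
        return next(self._iter, None)

# hypothetical calibration inputs -- in practice, real text covering all the languages
calib_batches = [np.random.randint(0, 32000, size=(1, 128), dtype=np.int64)
                 for _ in range(32)]

quantize_static(
    "model.onnx",            # float32 export
    "model.uint8.onnx",      # quantized output
    CalibReader(calib_batches),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    # nodes your measurements flag as accuracy-sensitive stay in float
    nodes_to_exclude=["/model/norm/Mul", "/lm_head/MatMul"],
)
```

the excluded-node list is whatever your instrumentation points at; the trade-off is what's described above (better accuracy, slightly bigger file).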
for quantizing the output, i ran a large batch of calibration data (all 119 languages, etc.) through the model and logged the range of outputs. from that i calculated the scale and zero point to use and tacked a QuantizeLinear node onto the original model output.
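a hedged sketch of that output step, assuming plain onnx graph surgery; the observed min/max, tensor names, and file names below are placeholders:

```python
# sketch only: compute uint8 scale/zero point from the calibration range and
# append a QuantizeLinear node after the model's original output.
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

model = onnx.load("model.uint8.onnx")
graph = model.graph

# range observed while running calibration data (placeholder values)
out_min, out_max = -18.3, 12.7

# asymmetric uint8 quantization parameters
scale = (out_max - out_min) / 255.0
zero_point = int(np.clip(round(-out_min / scale), 0, 255))

scale_init = numpy_helper.from_array(np.array(scale, dtype=np.float32), "output_scale")
zp_init = numpy_helper.from_array(np.array(zero_point, dtype=np.uint8), "output_zero_point")
graph.initializer.extend([scale_init, zp_init])

# quantize the existing float output into a new uint8 graph output
orig_output = graph.output[0]
float_name = orig_output.name
quant_name = float_name + "_uint8"

graph.node.append(helper.make_node(
    "QuantizeLinear",
    inputs=[float_name, "output_scale", "output_zero_point"],
    outputs=[quant_name],
    name="QuantizeLinear_output",
))

# replace the graph output with the uint8 tensor, keeping the original dims
new_output = helper.make_tensor_value_info(
    quant_name, TensorProto.UINT8,
    [d.dim_param or d.dim_value for d in orig_output.type.tensor_type.shape.dim],
)
graph.output.remove(orig_output)
graph.output.append(new_output)

onnx.save(model, "model.uint8.quant_out.onnx")
```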
I want to quantize my fine-tuned transformers Qwen3 model to reduce GPU memory usage and speed up inference.
I would like to know your conversion method. Do you convert the transformers model to ONNX format first?
yes, i convert to onnx first. i use onnx because it's the best way to run on CPU for me.
there are a few different ways to convert to onnx. i think torch.onnx.export with dynamo is the most accurate way to export, but with this model you will have to implement or bypass the KV cache. torch.onnx.export without dynamo is the regular tracing export, which is usually good enough... pay attention to export warnings for model code you might need to touch up. the easiest is using optimum-cli export onnx because it will give you an easier-to-use model (like this one).
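for the optimum route, something like this (a sketch, assuming your Qwen3 architecture is supported by optimum's onnx exporters; the model id is a placeholder for your fine-tuned checkpoint):

```python
# sketch of the "easiest" route: let optimum handle the export and the KV cache wiring.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "your-username/qwen3-finetune"  # placeholder

# roughly equivalent CLI: optimum-cli export onnx --model your-username/qwen3-finetune qwen3_onnx/
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("qwen3_onnx")

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```

either the CLI or from_pretrained(export=True) leaves you with an onnx decoder you can run with onnxruntime on CPU, and that exported model is what you'd then feed to the quantization step above.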