---
library_name: transformers
tags: []
---

Phi-4-mini model quantized with torchao int4 weight-only quantization, by the PyTorch team.

Quantization Recipe

We used the following code to get the quantized model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"

# int4 weight-only quantization with group size 128
quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-int4wo"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
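
# (Optional) Load the quantized checkpoint back from the Hub later.
# A minimal sketch, assuming the `save_to` repo pushed above;
# torch_dtype="auto" picks up the dtype stored in the checkpoint.
reloaded_model = AutoModelForCausalLM.from_pretrained(save_to, device_map="auto", torch_dtype="auto")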


# Manual Testing
messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]
# Alternatively, a plain prompt without the chat template:
# prompt = "Hey, are you conscious? Can you talk to me?"
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# Local Benchmark
import torch.utils.benchmark as benchmark
import torchao

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(2):
        f(*args, **kwargs)

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

Install the nightly version to get the most recent updates:

pip install git+https://github.com/EleutherAI/lm-evaluation-harness

baseline

lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

int4wo-hqq

lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8

TODO: more complete eval results

| Benchmark | Phi-4 mini-Ins | phi4-mini-int4wo |
|---|---|---|
| **Popular aggregated benchmark** | | |
| **Reasoning** | | |
| HellaSwag | 54.57 | 53.54 |
| **Multilingual** | | |
| **Math** | | |
| **Overall** | TODO | TODO |

Model Performance

Our int4wo checkpoint is only optimized for batch size 1, so we only benchmark batch size 1 performance with vLLM. For batch size N, please see our gemlite checkpoint.

Install the latest vLLM to get the most recent changes:

pip install git+https://github.com/vllm-project/vllm.git
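
Note: serving this torchao-quantized checkpoint also requires torchao in the same environment, since the checkpoint is serialized with torchao tensor subclasses (skip this if it is already installed from the recipe above):

pip install torchao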

Download dataset

Download the ShareGPT dataset: wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks

benchmark_latency

baseline

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

int4wo-hqq

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/phi4-mini-int4wo-hqq --batch-size 1

benchmark_serving

We also benchmarked the throughput in a serving environment.

baseline

Server:

vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

int4wo-hqq

Server:

vllm serve jerryzh168/phi4-mini-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-int4wo-hqq --num-prompts 1

Serving with vllm

We can use the same command as in the serving benchmarks to serve the model with vLLM:

vllm serve jerryzh168/phi4-mini-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
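
Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal client sketch; the prompt is illustrative and `requests` is just one way to call the endpoint:

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "jerryzh168/phi4-mini-int4wo-hqq",
        "messages": [{"role": "user", "content": "How should I explain the Internet?"}],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])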