TinyLlama-1.1B-orca-v1.0 - DeepSparse
This repo contains model files for TinyLlama-1.1B-orca-v1.0 optimized for DeepSparse, a CPU inference runtime for sparse models.
This model was quantized and pruned with SparseGPT, using SparseML.
Inference
Install DeepSparse LLM for fast inference on CPUs:
pip install deepsparse-nightly[llm]
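To confirm the runtime installed correctly, a quick version check works (a minimal sketch, assuming the package exposes __version__ as most Python packages do):

import deepsparse
print(deepsparse.__version__)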
Run in a Python pipeline:
from deepsparse import TextGeneration

prompt = "How to make banana bread?"
# Wrap the prompt in the chat template this model expects (see "Prompt template" below)
formatted_prompt = f"<|system|> You are a helpful AI assistant.</s><|user|>{prompt}</s><|assistant|>"

# Download the sparse-quantized model from the Hugging Face Hub and compile it for CPU
model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-orca-v1.0-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=200, repetition_penalty=1.1, do_sample=True).generations[0].text)
"""
Making banana bread involves the following steps:
Mix the ingredients (water, melted sugar) into a mixing bowl, and set it aside. Mix in the melted sugar, then add in the sliced bananas.
Add in flour to knead them well and form them into a rectangle shape with one side larger than the other.
Add water to fill up half of the batter while keeping one-third unfilled portion as extra water for filling it with additional dry ingredients later on.
Divide into three equal parts and place each slice of bread on a plate lined with plastic wrap, covered with an ovenproof pan.
Make sure each slice is cut by two layers, then fold over itself like this image shows it with its sides open and coverings closed by wrapping paper
around it to keep it together tightly inside ovenproof pan while cooking it until done.
Let stand after that for 10 minutes before
"""
Prompt template
<|system|> You are a helpful AI assistant.</s>
<|user|>{prompt}</s>
<|assistant|>
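A small helper (hypothetical name, shown for illustration) that applies this template to any user prompt:

def format_prompt(prompt: str, system: str = "You are a helpful AI assistant.") -> str:
    """Wrap a user prompt in the chat template expected by this model."""
    return f"<|system|> {system}</s><|user|>{prompt}</s><|assistant|>"

print(format_prompt("How to make banana bread?"))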
Sparsification
For details on how this model was sparsified, see the recipe.yaml
in this repo and follow the instructions below.
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py sreeramajay/TinyLlama-1.1B-orca-v1.0 open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
cp deployment/model.onnx deployment/model-orig.onnx
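Before running the injection below, you can sanity-check the exported graph with the onnx package (a minimal sketch; the path assumes the export and copy steps above):

import onnx

# Load the graph structure only; large weights may live in external data files
model = onnx.load("deployment/model-orig.onnx", load_external_data=False)
print("Opsets:", [(o.domain or "ai.onnx", o.version) for o in model.opset_import])
print("Node count:", len(model.graph.node))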
Run the following kv-cache injection script to speed up inference by caching the attention key and value states:

import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"

# Load the graph only; external weight files are resolved from the model directory
model = onnx.load(input_file, load_external_data=False)
# Rewrite the graph so key/value states are cached between decode steps
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
Follow the instructions on our One Shot With SparseML page for a step-by-step guide to one-shot quantization of large language models.
Slack
For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
Base model: sreeramajay/TinyLlama-1.1B-orca-v1.0