TinyLlama-1.1B-orca-v1.0 - DeepSparse
This repo contains model files for TinyLlama-1.1B-orca-v1.0 optimized for DeepSparse, a CPU inference runtime for sparse models.
This model was quantized and pruned with SparseGPT, using SparseML.
Inference
Install DeepSparse LLM for fast inference on CPUs:
pip install deepsparse-nightly[llm]
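To confirm the runtime installed correctly, a quick version check works (a minimal sketch, assuming the package exposes __version__ as most Python packages do):

import deepsparse
print(deepsparse.__version__)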
Run in a Python pipeline:
from deepsparse import TextGeneration

prompt = "How to make banana bread?"
# Wrap the prompt in the chat template this model expects (see "Prompt template" below)
formatted_prompt = f"<|system|> You are a helpful AI assistant.</s><|user|>{prompt}</s><|assistant|>"

# Download the sparse-quantized model from the Hugging Face Hub and compile it for CPU
model = TextGeneration(model_path="hf:nm-testing/TinyLlama-1.1B-orca-v1.0-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=200, repetition_penalty=1.1, do_sample=True).generations[0].text)
"""
Making banana bread involves the following steps:
Mix the ingredients (water, melted sugar) into a mixing bowl, and set it aside. Mix in the melted sugar, then add in the sliced bananas.
Add in flour to knead them well and form them into a rectangle shape with one side larger than the other.
Add water to fill up half of the batter while keeping one-third unfilled portion as extra water for filling it with additional dry ingredients later on.
Divide into three equal parts and place each slice of bread on a plate lined with plastic wrap, covered with an ovenproof pan.
Make sure each slice is cut by two layers, then fold over itself like this image shows it with its sides open and coverings closed by wrapping paper
around it to keep it together tightly inside ovenproof pan while cooking it until done.
Let stand after that for 10 minutes before
"""
Prompt template
<|system|> You are a helpful AI assistant.</s>
<|user|>{prompt}</s>
<|assistant|>
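A small helper (hypothetical name, shown for illustration) that applies this template to any user prompt:

def format_prompt(prompt: str, system: str = "You are a helpful AI assistant.") -> str:
    """Wrap a user prompt in the chat template expected by this model."""
    return f"<|system|> {system}</s><|user|>{prompt}</s><|assistant|>"

print(format_prompt("How to make banana bread?"))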
Sparsification
For details on how this model was sparsified, see the recipe.yaml
in this repo and follow the instructions below.
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py sreeramajay/TinyLlama-1.1B-orca-v1.0 open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
cp deployment/model.onnx deployment/model-orig.onnx
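Before running the injection below, you can sanity-check the exported graph with the onnx package (a minimal sketch; the path assumes the export and copy steps above):

import onnx

# Load the graph structure only; large weights may live in external data files
model = onnx.load("deployment/model-orig.onnx", load_external_data=False)
print("Opsets:", [(o.domain or "ai.onnx", o.version) for o in model.opset_import])
print("Node count:", len(model.graph.node))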
Run the following kv-cache injection script to speed up inference by caching the attention key and value states:

import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"

# Load the graph only; external weight files are resolved from the model directory
model = onnx.load(input_file, load_external_data=False)
# Rewrite the graph so key/value states are cached between decode steps
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
Follow the instructions on our One Shot With SparseML page for a step-by-step guide to one-shot quantization of large language models.
Slack
For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
Base model: sreeramajay/TinyLlama-1.1B-orca-v1.0