Model Card

Summary

Usage

To use the model with the transformers library on a machine with GPUs, first make sure the library is installed:

pip install transformers==4.50.3
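
The device_map settings used in the examples below require the accelerate package, and the quantized loading described further down additionally needs bitsandbytes. If they are not installed yet (package versions left unpinned here as an assumption):

pip install accelerate bitsandbytes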

Also make sure to provide your Hugging Face token to the pipeline if the model is hosted in a private repo.

  • Either leave token=True in the pipeline and log in to huggingface_hub by running
import huggingface_hub
huggingface_hub.login("<ACCESS_TOKEN>")  # replace with your personal access token
  • Or pass your token directly to the pipeline by replacing token=True with token="<ACCESS_TOKEN>" in the snippet below
from transformers import pipeline

generate_text = pipeline(
    model="SaffalPoosh/Nexus-multihop-1B",
    torch_dtype="auto",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
    token=True,
)

# generate configuration can be modified to your needs
# generate_text.model.generation_config.min_new_tokens = 70
# generate_text.model.generation_config.max_new_tokens = 424
# generate_text.model.generation_config.do_sample = True
# generate_text.model.generation_config.num_beams = 2
# generate_text.model.generation_config.temperature = float(0.2)
# generate_text.model.generation_config.repetition_penalty = float(1.0)

messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "I'm doing great, how about you?"},
    {"role": "user", "content": "Why is drinking water so healthy?"},
]

res = generate_text(
    messages,
    renormalize_logits=True
)
print(res[0]["generated_text"][-1]["content"])

You can print a sample prompt after applying the chat template to see how it is fed to the tokenizer:

print(generate_text.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
))

You can also load the model and tokenizer yourself and handle the preprocessing and generation steps directly:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SaffalPoosh/Nexus-multihop-1B"  # either local folder or Hugging Face model name
# Important: The prompt needs to be in the same format the model was trained with.
# You can find an example prompt in the experiment logs.
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "I'm doing great, how about you?"},
    {"role": "user", "content": "Why is drinking water so healthy?"},
]

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
model.eval()  # device_map above already placed the model on cuda:0

# generate configuration can be modified to your needs
# model.generation_config.min_new_tokens = 70
# model.generation_config.max_new_tokens = 424
# model.generation_config.do_sample = True
# model.generation_config.num_beams = 2
# model.generation_config.temperature = float(0.2)
# model.generation_config.repetition_penalty = float(1.0)

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    renormalize_logits=True
)[0]

tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
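
If you prefer the pipeline interface shown in the first example, the model and tokenizer loaded above can also be wrapped into a text-generation pipeline; a minimal sketch ("text-generation" is the standard transformers task name for causal LM generation):

from transformers import pipeline

# Reuse the already loaded model and tokenizer instead of downloading them again
generate_text = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

res = generate_text(messages)
print(res[0]["generated_text"][-1]["content"])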

Quantization and sharding

You can load the model with quantization by specifying load_in_8bit=True or load_in_4bit=True when calling from_pretrained. Sharding across multiple GPUs is also possible by setting device_map="auto".
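
A minimal sketch of a 4-bit, multi-GPU load using the bitsandbytes integration (BitsAndBytesConfig is the current way to pass these flags; swap load_in_4bit for load_in_8bit as needed):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SaffalPoosh/Nexus-multihop-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # shard layers across all visible GPUs
    trust_remote_code=True,
)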

Model Architecture

The model is based on the Mistral architecture; training was continued from the base model checkpoint.
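
The module tree below can be reproduced by printing the model loaded in the usage example above:

print(model)
print(sum(p.numel() for p in model.parameters()))  # total number of parameters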

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 2560, padding_idx=0)
    (layers): ModuleList(
      (0-23): 24 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (k_proj): Linear(in_features=2560, out_features=640, bias=False)
          (v_proj): Linear(in_features=2560, out_features=640, bias=False)
          (o_proj): Linear(in_features=2560, out_features=2560, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (up_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (down_proj): Linear(in_features=6912, out_features=2560, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((2560,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((2560,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((2560,), eps=1e-05)
    (rotary_emb): MistralRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2560, out_features=32000, bias=False)
)