lyk2586 committed (verified)
Commit 00a347f · 1 Parent(s): ff1a396

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.msc ADDED
Binary file (1.38 kB)
.mv ADDED
@@ -0,0 +1 @@
+ Revision:master,CreatedAt:1753443439
Evaluation Results.png ADDED
README.md CHANGED
@@ -1,6 +1,3 @@
- ---
- license: apache-2.0
- ---
  # JT-Math-8B-Instruct


@@ -12,13 +9,11 @@ license: apache-2.0
  <a href="https://huggingface.co/JT-LM/JT-Math-8B-Instruct" target="_blank">
  <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
  </a>
- <a href="./LICENSE" target="_blank">
- <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-yellow.svg">
- </a>
  </p>



+ 
  We are excited to introduce JT-Math-8B-Instruct, a powerful 8-billion parameter model specialized for mathematical reasoning. It achieves state-of-the-art performance on major math benchmarks among models of its size.
  JT-Math-8B-Instruct is fine-tuned from Jiutian-Math-8B-Base and has been optimized through a comprehensive process involving Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to enhance its mathematical problem-solving abilities and instruction-following capabilities.
  For full transparency and reproducibility, please refer to our technical report, which details our training recipe and pipeline.
@@ -28,15 +23,11 @@ For full transparency and reproducibility, please refer to our technical report



-
-
-
-
  ## Model Details



- 🚀 The **JT-Math-8B-Instruct** is an 8-billion parameter language model built on the **Jiutian LLM architecture** with a **context length of 32,768 tokens**. Its development involved two key stages: initial pre-training of the **JT-Math-8B-Base** model on a diverse corpus of text and mathematical data, followed by a two-stage instruction tuning process. This tuning began with **Supervised Fine-Tuning (SFT)**, where the model was trained on a high-quality, multilingual dataset of mathematical problems and solutions in both English and Chinese to grasp problem-solving patterns. Subsequently, **Reinforcement Learning (RL)** was applied to enhance reasoning accuracy, minimize logical fallacies, and align the model more closely with human preferences for clear and correct mathematical solutions.
+ 🚀 **JT-Math-8B-Instruct** is an 8-billion parameter language model built on the **Jiutian LLM architecture** with a **context length of 32,768 tokens**. Its development involved two key stages: initial pre-training of the **JT-Math-8B-Base** model on a diverse corpus of text and mathematical data, followed by a two-stage instruction tuning process. This tuning began with **Supervised Fine-Tuning (SFT)**, where the model was trained on a high-quality, multilingual dataset of mathematical problems and solutions in both English and Chinese to grasp problem-solving patterns. Subsequently, **Reinforcement Learning (RL)** was applied within an 8K context window to enhance reasoning accuracy, minimize logical fallacies, and align the model more closely with human preferences for clear and correct mathematical solutions.



@@ -44,11 +35,13 @@ For full transparency and reproducibility, please refer to our technical report

  ## Model Downloads

- We release the following model to support a wide range of applications.
+ We release the following model to support a wide range of applications:

- | Model Name | Length | Download | Notes |
- | ------------------- | ------ | ----------------------------------------------------- | ------------------------------------------------------- |
- | JT-Math-8B-Instruct | 32K | [🤗](https://huggingface.co/JT-LM/JT-Math-8B-Instruct/tree/main) | The instruction-tuned model, optimized with SFT and RL. |
+ | Model Name | Context Length | Hugging Face Link | ModelScope Link | Notes |
+ | ------------------- | -------------- | -------------------------------------------------------- | ------------------------------------------------------------ | --------------------------------------------------- |
+ | JT-Math-8B-Instruct | 32K | [Link](https://huggingface.co/JT-LM/JT-Math-8B-Instruct) | [Link](https://www.modelscope.cn/models/JiuTian-AI/JT-Math-8B-Instruct) | Instruction-tuned for general math problem-solving. |
+
+ ------



@@ -59,18 +52,18 @@ We release the following model to support a wide range of applications.
  JT-Math-8B-Instruct demonstrates state-of-the-art performance on key mathematical benchmarks, outperforming other open-source models in the ~8B parameter class.

  Below is a summary of our evaluation results:
- **Figure 1: Performance of JT-Math-8B-Instruct on math reasoning benchmarks.**
+ ![Performance of JT-Math-8B-Instruct on math reasoning benchmarks](<Evaluation Results.png>)



  ## How to Get Started

- This example shows how to use the JT-Math-8B-Instruct model to solve math problems.
+ This example shows how to use the `JT-Math-8B-Instruct` model to solve math problems.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- model_name = "Jiutian/JT-Math-8B-Instruct"
+ model_name = "JT-LM/JT-Math-8B-Instruct"

  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
@@ -118,4 +111,7 @@ If you find our work useful, please consider citing our paper:
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
  }
- ```
+ ```
+
+
+
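The README's quick-start snippet is cut off at the `from_pretrained(` call where the diff hunk above ends. A minimal end-to-end sketch of how such an inference script typically continues is shown below; the dtype, device placement, chat-template usage, and the sample problem are assumptions for illustration, not content taken from the truncated file.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JT-LM/JT-Math-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # config.json in this commit declares bfloat16 weights
    device_map="auto",
    trust_remote_code=True,      # pulls configuration_jiutian.py / modeling_jiutian.py from the repo
)

# Assumes the tokenizer defines a chat template; the <|im_start|>/<|im_end|> entries
# in added_tokens.json suggest a ChatML-style format.
messages = [{"role": "user", "content": "Solve for x: 2x + 3 = 11."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```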
added_tokens.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "</tool_call>": 151658,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "architectures": [
+     "JiutianForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_jiutian.JiutianConfig",
+     "AutoModelForCausalLM": "modeling_jiutian.JiutianForCausalLM"
+   },
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 13312,
+   "max_position_embeddings": 32768,
+   "model_type": "jiutian",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "pad_token_id": 151645,
+   "pretraining_tp": 1,
+   "qkv_bias": true,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 500000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.46.1",
+   "use_cache": true,
+   "vocab_size": 151808
+ }
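The `auto_map` entries above tell `transformers` which classes in this repository implement the configuration and model when `trust_remote_code=True` is passed. A small sketch of inspecting the shipped configuration (the printed fields are examples chosen from the values above):

```python
from transformers import AutoConfig

# trust_remote_code lets AutoConfig resolve model_type "jiutian" via the auto_map entry,
# i.e. configuration_jiutian.JiutianConfig from this repository.
config = AutoConfig.from_pretrained("JT-LM/JT-Math-8B-Instruct", trust_remote_code=True)

print(config.model_type)               # "jiutian"
print(config.hidden_size)              # 4096
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads,      # 32 query heads ...
      config.num_key_value_heads)      # ... sharing 8 KV heads (grouped-query attention)
print(config.max_position_embeddings)  # 32768-token context window
```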
configuration.json ADDED
@@ -0,0 +1 @@
+ {"framework":"Pytorch","task":"text-generation"}
configuration_jiutian.py ADDED
@@ -0,0 +1,62 @@
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+ logger = logging.get_logger(__name__)
+ 
+ CM_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+ 
+ 
+ class JiutianConfig(PretrainedConfig):
+     model_type = "jiutian"
+     keys_to_ignore_at_inference = ["past_key_values"]
+ 
+     def __init__(
+         self,
+         vocab_size=152064,
+         hidden_size=8192,
+         intermediate_size=13312,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=8,
+         hidden_act="silu",
+         max_position_embeddings=8192,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         pad_token_id=151645,
+         bos_token_id=None,
+         eos_token_id=151645,
+         pretraining_tp=1,
+         tie_word_embeddings=False,
+         rope_theta=500000,
+         rope_scaling=None,
+         qkv_bias=True,
+         attention_dropout=0.0,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = None
+         self.qkv_bias = qkv_bias
+         self.attention_dropout = attention_dropout
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+ 
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+ 
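Note that the constructor defaults above (e.g. `hidden_size=8192`, `max_position_embeddings=8192`) differ from the values shipped in `config.json`; when loading from the repository, the serialized file overrides the defaults field by field. A quick sketch contrasting the two, assuming `configuration_jiutian.py` is importable locally:

```python
from configuration_jiutian import JiutianConfig

# Defaults baked into the class definition above.
default_cfg = JiutianConfig()
print(default_cfg.hidden_size, default_cfg.max_position_embeddings)  # 8192 8192

# The values JT-Math-8B-Instruct actually uses come from config.json in this commit.
shipped_cfg = JiutianConfig(
    vocab_size=151808,
    hidden_size=4096,
    max_position_embeddings=32768,
)
print(shipped_cfg.hidden_size, shipped_cfg.max_position_embeddings)  # 4096 32768
```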
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 151645,
+   "pad_token_id": 151645,
+   "transformers_version": "4.46.1"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3bde6934aabd61236bbbfd3160f3302c32246dab720b5042de03f8e26130ec91
+ size 4993602416
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:439d6282d4a646d86deae501e9e9ad58a1ba0c9c93fafbf176a1f43d8b7fe10c
+ size 4966416696
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:449f391cfb2a0d738cd9ed0a43e448022546bd3331dab23b9104e85d5e9c288a
+ size 4437899424
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:14e82a7e166b3a6a4a8b6709f7a8cd18b9965c648849eeda29ac906033c04d0a
+ size 1243611264
model.safetensors.index.json ADDED
@@ -0,0 +1,394 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 15641485312
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00004-of-00004.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
13
+ "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
14
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
15
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
16
+ "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
18
+ "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
19
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
20
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
21
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
22
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
23
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
24
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
25
+ "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
28
+ "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
29
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
30
+ "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
31
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
32
+ "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
33
+ "model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
34
+ "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
35
+ "model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
36
+ "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
37
+ "model.layers.10.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
38
+ "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
39
+ "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
40
+ "model.layers.10.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
41
+ "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
42
+ "model.layers.10.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
43
+ "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
44
+ "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
45
+ "model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
46
+ "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
47
+ "model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
48
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
49
+ "model.layers.11.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
50
+ "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
51
+ "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
52
+ "model.layers.11.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
53
+ "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
54
+ "model.layers.11.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
55
+ "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
56
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
57
+ "model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
58
+ "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
59
+ "model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
60
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
61
+ "model.layers.12.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
62
+ "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
63
+ "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
64
+ "model.layers.12.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
65
+ "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
66
+ "model.layers.12.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
67
+ "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
68
+ "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
69
+ "model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
70
+ "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
71
+ "model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
72
+ "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
73
+ "model.layers.13.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
74
+ "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
75
+ "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
76
+ "model.layers.13.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
77
+ "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
78
+ "model.layers.13.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
79
+ "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
80
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00002-of-00004.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00002-of-00004.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00002-of-00004.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
260
+ "model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
261
+ "model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
262
+ "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
263
+ "model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
264
+ "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
265
+ "model.layers.28.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
266
+ "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
267
+ "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
268
+ "model.layers.28.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
269
+ "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
270
+ "model.layers.28.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
271
+ "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
272
+ "model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
273
+ "model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
274
+ "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
275
+ "model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
276
+ "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
277
+ "model.layers.29.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
278
+ "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
279
+ "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
280
+ "model.layers.29.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
281
+ "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
282
+ "model.layers.29.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
283
+ "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
284
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
285
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
286
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
287
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
288
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
289
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
290
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
291
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
292
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
293
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
294
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
295
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
296
+ "model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
297
+ "model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
298
+ "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
299
+ "model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
300
+ "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
301
+ "model.layers.30.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
302
+ "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
303
+ "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
304
+ "model.layers.30.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
305
+ "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
306
+ "model.layers.30.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
307
+ "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
308
+ "model.layers.31.input_layernorm.weight": "model-00003-of-00004.safetensors",
309
+ "model.layers.31.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
310
+ "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
311
+ "model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
312
+ "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
313
+ "model.layers.31.self_attn.k_proj.bias": "model-00003-of-00004.safetensors",
314
+ "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
315
+ "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
316
+ "model.layers.31.self_attn.q_proj.bias": "model-00003-of-00004.safetensors",
317
+ "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
318
+ "model.layers.31.self_attn.v_proj.bias": "model-00003-of-00004.safetensors",
319
+ "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
320
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
321
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
322
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
323
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
324
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
325
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
326
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
327
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
328
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
329
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
330
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
331
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
332
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
333
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
334
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
335
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
336
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
337
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
338
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
339
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
340
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
341
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
342
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
343
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
344
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
345
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
346
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
347
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
348
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
349
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
350
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
351
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
352
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
353
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
354
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
355
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
356
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
357
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
358
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
359
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
360
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
361
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
362
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
363
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
364
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
365
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
366
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
367
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
368
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
369
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
370
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
371
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
372
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
373
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
374
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
375
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
376
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
377
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
378
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
379
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
380
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
381
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
382
+ "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
383
+ "model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
384
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
385
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
386
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
387
+ "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
388
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
389
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
390
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
391
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
392
+ "model.norm.weight": "model-00003-of-00004.safetensors"
393
+ }
394
+ }
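The index above maps every parameter name to the shard that stores it, with `metadata.total_size` giving the byte total across the four shards. `from_pretrained` consumes this file automatically, but individual tensors can also be read from a single shard. A rough sketch using the `safetensors` library; the chosen weight name is one example entry from the weight_map, and the local file paths are assumed to point at a downloaded copy of the repo:

```python
import json
from safetensors import safe_open

# Load the shard map written above.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

name = "model.layers.0.self_attn.q_proj.weight"   # example entry from the weight_map
shard_file = index["weight_map"][name]            # -> "model-00001-of-00004.safetensors"

# Read just that tensor from its shard, without materializing the full checkpoint.
with safe_open(shard_file, framework="pt", device="cpu") as f:
    q_proj = f.get_tensor(name)

# Expected: torch.Size([4096, 4096]) in bfloat16, per config.json above.
print(q_proj.shape, q_proj.dtype)
```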
modeling_jiutian.py ADDED
@@ -0,0 +1,621 @@
1
+ import warnings
2
+ import copy
3
+ from typing import List, Optional, Tuple, Union, Dict
4
+ from threading import Thread
5
+
6
+ import torch
7
+ import torch.nn.functional as F
8
+ import torch.utils.checkpoint
9
+ from torch import nn
10
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
11
+
12
+ from transformers.activations import ACT2FN
13
+ from transformers import GenerationConfig
14
+ from transformers.cache_utils import Cache, DynamicCache
15
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
18
+ from transformers.utils import (
19
+ add_start_docstrings,
20
+ add_start_docstrings_to_model_forward,
21
+ is_flash_attn_2_available,
22
+ is_flash_attn_greater_or_equal_2_10,
23
+ logging,
24
+ replace_return_docstrings,
25
+ )
26
+ from .configuration_jiutian import JiutianConfig
27
+
28
+ if is_flash_attn_2_available():
29
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
30
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
31
+
32
+
33
+ logger = logging.get_logger(__name__)
34
+
35
+ _CONFIG_FOR_DOC = "JiutianConfig"
36
+
37
+
38
+ class JiutianRMSNorm(nn.Module):
39
+ def __init__(self, hidden_size, eps=1e-5):
40
+ """
41
+ Root Mean Square Layer Normalization
42
+ :param hidden_size: model size
43
+ :param eps: epsilon value, default 1e-5
44
+ """
45
+ super().__init__()
46
+ self.weight = torch.nn.Parameter(torch.ones(hidden_size))
47
+ self.epsilon = eps
48
+ self.d = hidden_size
49
+
50
+ def forward(self, hidden_states):
51
+ input_dtype = hidden_states.dtype
52
+ hidden_states = hidden_states.to(torch.float32)
53
+ norm_states = hidden_states.norm(2, dim=-1, keepdim=True)
54
+ d_states = self.d
55
+ rms_states = norm_states * d_states ** (-1.0 / 2)
56
+ states_normed = hidden_states / (rms_states + self.epsilon)
57
+ return self.weight * states_normed.to(input_dtype)
58
+
59
+
60
+ ALL_LAYERNORM_LAYERS.append(JiutianRMSNorm)
61
+
62
+
63
+ class JiutianRotaryEmbedding(nn.Module):
64
+ def __init__(self, dim, max_position_embeddings=4096, base=10000, device=None):
65
+ super().__init__()
66
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
67
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
68
+ self.seq_len_cached = None
69
+ self.cos_cached = None
70
+ self.sin_cached = None
71
+
72
+ def forward(self, x, seq_len=None):
73
+ # x: [bs, num_attention_heads, seq_len, head_size]
74
+ if self.seq_len_cached is None:
75
+ self.seq_len_cached = 0
76
+ if seq_len > self.seq_len_cached:
77
+ self.seq_len_cached = seq_len
78
+ t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)
79
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
80
+ emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
81
+ self.cos_cached = emb.float().cos()[:, :]
82
+ self.sin_cached = emb.float().sin()[:, :]
83
+ return (
84
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
85
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
86
+ )
87
+
88
+
89
+ def rotate_half(x):
90
+ x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
91
+ return torch.cat((-x2, x1), dim=-1)
92
+
93
+
94
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
95
+ cos, sin = cos[position_ids].unsqueeze(unsqueeze_dim), sin[position_ids].unsqueeze(unsqueeze_dim)
96
+ q_embed, k_embed = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
97
+ return q_embed, k_embed
98
+
99
+
100
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
101
+ """
102
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
103
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
104
+ """
105
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
106
+ if n_rep == 1:
107
+ return hidden_states
108
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
109
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
110
+
111
+ class JiutianMLP(nn.Module):
112
+ def __init__(self, config):
113
+ super().__init__()
114
+ self.config = config
115
+ self.hidden_size = config.hidden_size
116
+ self.intermediate_size = config.intermediate_size
117
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
118
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
119
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
120
+ self.act_fn = ACT2FN[config.hidden_act]
121
+
122
+ def forward(self, x):
123
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
124
+
125
+
126
+ class JiutianFlashAttention2(nn.Module):
127
+ def __init__(self, config: JiutianConfig, layer_idx: Optional[int] = None):
128
+ super().__init__()
129
+ self.config = config
130
+ self.layer_idx = layer_idx
131
+ self.attention_dropout = config.attention_dropout
132
+ self.hidden_size = config.hidden_size
133
+ self.num_heads = config.num_attention_heads
134
+ self.num_key_value_heads = config.num_key_value_heads
135
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
136
+ self.head_dim = self.hidden_size // self.num_heads
137
+ self.max_position_embeddings = config.max_position_embeddings
138
+ self.rope_theta = config.rope_theta
139
+ self.is_causal = True
140
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
141
+
142
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.qkv_bias)
143
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.qkv_bias)
144
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.qkv_bias)
145
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
146
+ self.rotary_emb = JiutianRotaryEmbedding(
147
+ self.head_dim,
148
+ max_position_embeddings=self.max_position_embeddings,
149
+ base=self.rope_theta,
150
+ )
151
+
152
+ def forward(
153
+ self,
154
+ hidden_states: torch.Tensor,
155
+ attention_mask: Optional[torch.LongTensor] = None,
156
+ position_ids: Optional[torch.LongTensor] = None,
157
+ past_key_value: Optional[Cache] = None,
158
+ use_cache: bool = False,
159
+ **kwargs,
160
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
161
+ # JiutianFlashAttention2 attention does not support output_attentions
162
+ if "padding_mask" in kwargs:
163
+ warnings.warn(
164
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
165
+ )
166
+ # overwrite attention_mask with padding_mask
167
+ attention_mask = kwargs.pop("padding_mask")
168
+ bsz, q_len, _ = hidden_states.size()
169
+
170
+ query_states = self.q_proj(hidden_states)
171
+ key_states = self.k_proj(hidden_states)
172
+ value_states = self.v_proj(hidden_states)
173
+
174
+ # Flash attention requires the input (bsz, sq_len, head_dim, hidden_dim )
175
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
176
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
177
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
178
+ kv_seq_len = key_states.shape[-2]
179
+ if past_key_value is not None:
180
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
181
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
182
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
183
+
184
+ if past_key_value is not None:
185
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
186
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
187
+
188
+
189
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
190
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
191
+
192
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
193
+ # to be able to avoid many of these transpose/reshape/view.
194
+ query_states = query_states.transpose(1, 2)
195
+ key_states = key_states.transpose(1, 2)
196
+ value_states = value_states.transpose(1, 2)
197
+
198
+ dropout_rate = self.attention_dropout if self.training else 0.0
199
+ query_length = q_len
200
+ if not self._flash_attn_uses_top_left_mask:
201
+ causal = self.is_causal
202
+ else:
203
+ causal = self.is_causal and query_length != 1
204
+
205
+ # Contains at least one padding token in the sequence
206
+ if attention_mask is not None:
207
+ batch_size = query_states.shape[0]
208
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
209
+ query_states, key_states, value_states, attention_mask, query_length
210
+ )
211
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
212
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
213
+ attn_output_unpad = flash_attn_varlen_func(
214
+ query_states,
215
+ key_states,
216
+ value_states,
217
+ cu_seqlens_q=cu_seqlens_q,
218
+ cu_seqlens_k=cu_seqlens_k,
219
+ max_seqlen_q=max_seqlen_in_batch_q,
220
+ max_seqlen_k=max_seqlen_in_batch_k,
221
+ dropout_p=dropout_rate,
222
+ causal=causal,
223
+ )
224
+
225
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
226
+ else:
227
+ attn_output = flash_attn_func(
228
+ query_states, key_states, value_states, dropout_rate, causal=causal
229
+ )
230
+
231
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
232
+ attn_output = self.o_proj(attn_output)
233
+ attn_weights = None
234
+
235
+ return attn_output, attn_weights, past_key_value
236
+
237
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
238
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
239
+ indices_k = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
240
+ max_seqlen_in_batch_k = seqlens_in_batch.max().item()
241
+ cu_seqlens_k = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
242
+
243
+ batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
244
+
245
+ key_layer = index_first_axis(
246
+ key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
247
+ )
248
+ value_layer = index_first_axis(
249
+ value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
250
+ )
251
+ if query_length == kv_seq_len:
252
+ query_layer = index_first_axis(
253
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
254
+ )
255
+ cu_seqlens_q = cu_seqlens_k
256
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
257
+ indices_q = indices_k
258
+ elif query_length == 1:
259
+ max_seqlen_in_batch_q = 1
260
+ cu_seqlens_q = torch.arange(
261
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
262
+ ) # There is a memcpy here, that is very bad.
263
+ indices_q = cu_seqlens_q[:-1]
264
+ query_layer = query_layer.squeeze(1)
265
+ else:
266
+ # The -q_len: slice assumes left padding.
267
+ attention_mask = attention_mask[:, -query_length:]
268
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
269
+
270
+ return (
271
+ query_layer,
272
+ key_layer,
273
+ value_layer,
274
+ indices_q,
275
+ (cu_seqlens_q, cu_seqlens_k),
276
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
277
+ )
278
+
279
+
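For context on the variable-length path above: `_upad_input` packs only the real (non-padding) tokens and hands `flash_attn_varlen_func` per-sequence offsets. The snippet below is a standalone sketch, not part of the uploaded file, showing how `indices`, `cu_seqlens`, and the max sequence length are derived from a 2D padding mask with plain PyTorch; the actual code delegates the query side to `flash_attn.bert_padding.unpad_input`.

```python
import torch
import torch.nn.functional as F

# 1 = real token, 0 = padding (same convention as `attention_mask` above).
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]], dtype=torch.int32)

seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)           # tensor([3, 2])
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = int(seqlens_in_batch.max())                          # 3
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))

print(indices.tolist())     # flattened positions of real tokens: [0, 1, 2, 4, 5]
print(cu_seqlens.tolist())  # cumulative offsets for flash_attn_varlen_func: [0, 3, 5]
print(max_seqlen_in_batch)
```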
280
+ class JiutianDecoderLayer(nn.Module):
281
+ def __init__(self, config: JiutianConfig, layer_idx: int):
282
+ super().__init__()
283
+ self.hidden_size = config.hidden_size
284
+ self.self_attn = JiutianFlashAttention2(config=config, layer_idx=layer_idx)
285
+ self.mlp = JiutianMLP(config)
286
+ self.input_layernorm = JiutianRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
287
+ self.post_attention_layernorm = JiutianRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
288
+
289
+ def forward(
290
+ self,
291
+ hidden_states: torch.Tensor,
292
+ attention_mask: Optional[torch.Tensor] = None,
293
+ position_ids: Optional[torch.LongTensor] = None,
294
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
295
+ use_cache: Optional[bool] = False,
296
+ **kwargs,
297
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
298
+
299
+ if "padding_mask" in kwargs:
300
+ warnings.warn(
301
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
302
+ )
303
+
304
+ residual = hidden_states
305
+ hidden_states = self.input_layernorm(hidden_states)
306
+
307
+ # Self Attention
308
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
309
+ hidden_states=hidden_states,
310
+ attention_mask=attention_mask,
311
+ position_ids=position_ids,
312
+ past_key_value=past_key_value,
313
+ use_cache=use_cache,
314
+ **kwargs,
315
+ )
316
+ hidden_states = residual + hidden_states
317
+
318
+ # Fully Connected
319
+ residual = hidden_states
320
+ hidden_states = self.post_attention_layernorm(hidden_states)
321
+ hidden_states = self.mlp(hidden_states)
322
+ hidden_states = residual + hidden_states
323
+
324
+ outputs = (hidden_states,)
325
+
326
+ if use_cache:
327
+ outputs += (present_key_value,)
328
+
329
+ return outputs
330
+
331
+
332
+ class JiutianPreTrainedModel(PreTrainedModel):
333
+ config_class = JiutianConfig
334
+ base_model_prefix = "model"
335
+ supports_gradient_checkpointing = True
336
+ _no_split_modules = ["JiutianDecoderLayer"]
337
+ _skip_keys_device_placement = "past_key_values"
338
+ _supports_flash_attn_2 = True
339
+ _supports_cache_class = True
340
+
341
+ def _init_weights(self, module):
342
+ std = self.config.initializer_range
343
+ if isinstance(module, nn.Linear):
344
+ module.weight.data.normal_(mean=0.0, std=std)
345
+ if module.bias is not None:
346
+ module.bias.data.zero_()
347
+ elif isinstance(module, nn.Embedding):
348
+ module.weight.data.normal_(mean=0.0, std=std)
349
+ if module.padding_idx is not None:
350
+ module.weight.data[module.padding_idx].zero_()
351
+
352
+ def _set_gradient_checkpointing(self, module: nn.Module, value: bool = False):
353
+ if isinstance(module, JiutianModel):
354
+ module.gradient_checkpointing = value
355
+
356
+
357
+ class JiutianModel(JiutianPreTrainedModel):
358
+ def __init__(self, config: JiutianConfig):
359
+ super().__init__(config)
360
+ self.padding_idx = config.pad_token_id
361
+ self.vocab_size = config.vocab_size
362
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
363
+ self.layers = nn.ModuleList(
364
+ [JiutianDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
365
+ )
366
+ self.norm = JiutianRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
367
+ self.gradient_checkpointing = False
368
+ # Initialize weights and apply final processing
369
+ self.post_init()
370
+
371
+ def get_input_embeddings(self):
372
+ return self.embed_tokens
373
+
374
+ def set_input_embeddings(self, value):
375
+ self.embed_tokens = value
376
+
377
+ def forward(
378
+ self,
379
+ input_ids: torch.LongTensor = None,
380
+ attention_mask: Optional[torch.Tensor] = None,
381
+ position_ids: Optional[torch.LongTensor] = None,
382
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
383
+ inputs_embeds: Optional[torch.FloatTensor] = None,
384
+ use_cache: Optional[bool] = None,
385
+ output_hidden_states: Optional[bool] = None,
386
+ return_dict: Optional[bool] = None,
387
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
388
+
389
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
390
+
391
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
392
+
393
+ if input_ids is not None:
394
+ batch_size, seq_length = input_ids.shape
395
+ elif inputs_embeds is not None:
396
+ batch_size, seq_length, _ = inputs_embeds.shape
397
+
398
+ if self.gradient_checkpointing and self.training:
399
+ if use_cache:
400
+ use_cache = False
401
+
402
+ past_key_values_length = 0
403
+ if use_cache:
404
+ use_legacy_cache = not isinstance(past_key_values, Cache)
405
+ if use_legacy_cache:
406
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
407
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
408
+
409
+ if position_ids is None:
410
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
411
+ position_ids = torch.arange(
412
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
413
+ )
414
+ position_ids = position_ids.unsqueeze(0)
415
+
416
+ if inputs_embeds is None:
417
+ inputs_embeds = self.embed_tokens(input_ids)
418
+
419
+ # 2d mask is passed through the layers
420
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
421
+
422
+ # embed positions
423
+ hidden_states = inputs_embeds
424
+
425
+ # decoder layers
426
+ all_hidden_states = () if output_hidden_states else None
427
+ all_self_attns = None
428
+ next_decoder_cache = None
429
+
430
+ for decoder_layer in self.layers:
431
+ if output_hidden_states:
432
+ all_hidden_states += (hidden_states,)
433
+
434
+ if self.gradient_checkpointing and self.training:
435
+ def create_custom_forward(module):
436
+ def custom_forward(*inputs):
437
+ return module(*inputs, use_cache=use_cache)
438
+ return custom_forward
439
+ layer_outputs = torch.utils.checkpoint.checkpoint(
440
+ create_custom_forward(decoder_layer),
441
+ hidden_states,
442
+ attention_mask,
443
+ position_ids,
444
+ )
445
+ else:
446
+ layer_outputs = decoder_layer(
447
+ hidden_states,
448
+ attention_mask=attention_mask,
449
+ position_ids=position_ids,
450
+ past_key_value=past_key_values,
451
+ use_cache=use_cache,
452
+ )
453
+
454
+ hidden_states = layer_outputs[0]
455
+
456
+ if use_cache:
457
+ next_decoder_cache = layer_outputs[1]
458
+
459
+ hidden_states = self.norm(hidden_states)
460
+
461
+ # add hidden states from the last decoder layer
462
+ if output_hidden_states:
463
+ all_hidden_states += (hidden_states,)
464
+
465
+ next_cache = None
466
+ if use_cache:
467
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
468
+ if not return_dict:
469
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
470
+ return BaseModelOutputWithPast(
471
+ last_hidden_state=hidden_states,
472
+ past_key_values=next_cache,
473
+ hidden_states=all_hidden_states,
474
+ attentions=all_self_attns,
475
+ )
476
+
477
+
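`JiutianModel.forward` accepts either the legacy tuple-of-tuples cache or a `transformers` `Cache` object and converts between the two with `DynamicCache.from_legacy_cache` / `to_legacy_cache`. A minimal sketch of that round trip, assuming a transformers version that ships `DynamicCache` (roughly 4.36+); the tensor shapes are illustrative only:

```python
import torch
from transformers import DynamicCache

# Legacy format: one (key, value) pair per layer, shaped (bsz, num_kv_heads, seq_len, head_dim).
legacy = tuple(
    (torch.zeros(1, 8, 4, 128), torch.zeros(1, 8, 4, 128)) for _ in range(2)
)

cache = DynamicCache.from_legacy_cache(legacy)   # what forward() does when it sees a tuple
print(cache.get_seq_length())                    # 4

round_tripped = cache.to_legacy_cache()          # what forward() returns when use_cache=True
print(len(round_tripped), round_tripped[0][0].shape)
```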
478
+ class JiutianForCausalLM(JiutianPreTrainedModel):
479
+ def __init__(self, config):
480
+ super().__init__(config)
481
+ self.model = JiutianModel(config)
482
+ self.vocab_size = config.vocab_size
483
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
484
+ # Initialize weights and apply final processing
485
+ self.post_init()
486
+
487
+ def get_input_embeddings(self):
488
+ return self.model.embed_tokens
489
+
490
+ def set_input_embeddings(self, value):
491
+ self.model.embed_tokens = value
492
+
493
+ def get_output_embeddings(self):
494
+ return self.lm_head
495
+
496
+ def set_output_embeddings(self, new_embeddings):
497
+ self.lm_head = new_embeddings
498
+
499
+ def set_decoder(self, decoder):
500
+ self.model = decoder
501
+
502
+ def get_decoder(self):
503
+ return self.model
504
+
505
+ def forward(
506
+ self,
507
+ input_ids: torch.LongTensor = None,
508
+ attention_mask: Optional[torch.Tensor] = None,
509
+ position_ids: Optional[torch.LongTensor] = None,
510
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
511
+ inputs_embeds: Optional[torch.FloatTensor] = None,
512
+ labels: Optional[torch.LongTensor] = None,
513
+ use_cache: Optional[bool] = None,
514
+ output_attentions: Optional[bool] = None,
515
+ output_hidden_states: Optional[bool] = None,
516
+ return_dict: Optional[bool] = None,
517
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
518
+
519
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
520
+
521
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
522
+ outputs = self.model(
523
+ input_ids=input_ids,
524
+ attention_mask=attention_mask,
525
+ position_ids=position_ids,
526
+ past_key_values=past_key_values,
527
+ inputs_embeds=inputs_embeds,
528
+ use_cache=use_cache,
529
+ output_hidden_states=output_hidden_states,
530
+ return_dict=return_dict,
531
+ )
532
+ hidden_states = outputs[0]
533
+ logits = self.lm_head(hidden_states)
534
+ logits = logits.float()
535
+
536
+ loss = None
537
+ if labels is not None:
538
+ shift_logits = logits[..., :-1, :].contiguous()
539
+ shift_labels = labels[..., 1:].contiguous()
540
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
541
+ shift_labels = shift_labels.view(-1)
542
+ shift_labels = shift_labels.to(shift_logits.device)
543
+ loss_fct = CrossEntropyLoss()
544
+ loss = loss_fct(shift_logits, shift_labels)
545
+
546
+ if not return_dict:
547
+ output = (logits,) + outputs[1:]
548
+ return (loss,) + output if loss is not None else output
549
+
550
+ return CausalLMOutputWithPast(
551
+ loss=loss,
552
+ logits=logits,
553
+ past_key_values=outputs.past_key_values,
554
+ hidden_states=outputs.hidden_states,
555
+ attentions=outputs.attentions,
556
+ )
557
+
558
+ def prepare_inputs_for_generation(
559
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
560
+ ):
561
+ if past_key_values is not None:
562
+ if isinstance(past_key_values, Cache):
563
+ cache_length = past_key_values.get_seq_length()
564
+ past_length = past_key_values.seen_tokens
565
+ max_cache_length = past_key_values.get_max_length()
566
+ else:
567
+ cache_length = past_length = past_key_values[0][0].shape[2]
568
+ max_cache_length = None
569
+
570
+ # Keep only the unprocessed tokens:
571
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
572
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing inputs_embeds as
573
+ # input)
574
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
575
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
576
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
577
+ # input_ids based on the past_length.
578
+ elif past_length < input_ids.shape[1]:
579
+ input_ids = input_ids[:, past_length:]
580
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
581
+
582
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
583
+ if (
584
+ max_cache_length is not None
585
+ and attention_mask is not None
586
+ and cache_length + input_ids.shape[1] > max_cache_length
587
+ ):
588
+ attention_mask = attention_mask[:, -max_cache_length:]
589
+
590
+ position_ids = kwargs.get("position_ids", None)
591
+ if attention_mask is not None and position_ids is None:
592
+ # create position_ids on the fly for batch generation
593
+ position_ids = attention_mask.long().cumsum(-1) - 1
594
+ position_ids.masked_fill_(attention_mask == 0, 1)
595
+ if past_key_values:
596
+ position_ids = position_ids[:, -input_ids.shape[1] :]
597
+
598
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
599
+ if inputs_embeds is not None and past_key_values is None:
600
+ model_inputs = {"inputs_embeds": inputs_embeds}
601
+ else:
602
+ model_inputs = {"input_ids": input_ids}
603
+
604
+ model_inputs.update(
605
+ {
606
+ "position_ids": position_ids,
607
+ "past_key_values": past_key_values,
608
+ "use_cache": kwargs.get("use_cache"),
609
+ "attention_mask": attention_mask,
610
+ }
611
+ )
612
+ return model_inputs
613
+
614
+ @staticmethod
615
+ def _reorder_cache(past_key_values, beam_idx):
616
+ reordered_past = ()
617
+ for layer_past in past_key_values:
618
+ reordered_past += (
619
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
620
+ )
621
+ return reordered_past
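Taken together, the classes above are what gets used when the checkpoint is loaded with `trust_remote_code=True`. A minimal usage sketch for this repo (the repo id comes from this model card; the generation settings are placeholders rather than recommended values):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JT-LM/JT-Math-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,   # loads JiutianForCausalLM from this repo
)

messages = [{"role": "user", "content": "What is 15 * 17?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```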
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
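A quick sanity check that these special-token settings are picked up at load time (repo id as above; the expected ids follow from `added_tokens_decoder` in `tokenizer_config.json` below):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("JT-LM/JT-Math-8B-Instruct", trust_remote_code=True)

print(tok.eos_token, tok.eos_token_id)            # expected: <|im_end|> 151645
print(tok.pad_token, tok.pad_token_id)            # expected: <|endoftext|> 151643
print(tok.convert_tokens_to_ids("<|im_start|>"))  # expected: 151644
```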
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
3
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,209 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'Please reason step by step, and put your final answer within \\\\boxed{}.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nPlease reason step by step, and put your final answer within \\\\boxed{}.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
+ "clean_up_tokenization_spaces": false,
200
+ "eos_token": "<|im_end|>",
201
+ "errors": "replace",
202
+ "extra_special_tokens": {},
203
+ "model_max_length": 131072,
204
+ "pad_token": "<|endoftext|>",
205
+ "padding_side": "right",
206
+ "split_special_tokens": false,
207
+ "tokenizer_class": "Qwen2Tokenizer",
208
+ "unk_token": null
209
+ }
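The `chat_template` above injects a default system prompt ("Please reason step by step, and put your final answer within \boxed{}.") when the caller does not supply one, and wraps turns in `<|im_start|>` / `<|im_end|>`. A short sketch for inspecting the rendered prompt without tokenizing (repo id as above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("JT-LM/JT-Math-8B-Instruct", trust_remote_code=True)

messages = [{"role": "user", "content": "Compute 2^10."}]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)
# Expected shape of the output, per the template above:
# <|im_start|>system
# Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
# <|im_start|>user
# Compute 2^10.<|im_end|>
# <|im_start|>assistant
```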
vocab.json ADDED
The diff for this file is too large to render. See raw diff