shruthan-r committed on
Commit
a9a4831
·
0 Parent(s):

initial commit

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,236 @@
+ ---
+ base_model:
+ - ServiceNow-AI/Apriel-5B-Base
+ library_name: transformers
+ language:
+ - en
+ license: mit
+ ---
+
+ # Apriel-5B
+
+ `/ˈɑː.pri.əl/`
+
+ ## Table of Contents
+
+ 1. [Model Summary](#model-summary)
+ 2. [Evaluation](#evaluation)
+ 3. [Intended Use](#intended-use)
+ 4. [Limitations](#limitations)
+ 5. [Security and Responsible Use](#security-and-responsible-use)
+ 6. [License](#license)
+ 7. [Citation](#citation)
+
+ ## Model Summary
+
+ Apriel is a family of models built for versatility, offering high throughput and efficiency across a wide range of tasks.
+
+ ### Apriel-5B-Base
+ Apriel-5B-Base is a decoder-only transformer trained on 4.5T+ tokens of data. It is the first release in the Apriel model family and is designed to support research on foundation models. Apriel-5B-Base achieves strong performance across common benchmarks for models under 5B parameters.
+
+ ### Apriel-5B-Instruct
+ [Apriel-5B-Instruct](https://huggingface.co/ServiceNow-AI/Apriel-5B-Instruct) is built on top of [Apriel-5B-Base](https://huggingface.co/ServiceNow-AI/Apriel-5B-base) using continual pretraining (CPT), supervised finetuning (SFT), and post-training alignment with DPO and RLVR.
+
+ Both the CPT and SFT stages involved training multiple domain-biased variants on overlapping datasets (e.g., instruction, code, math). These variants were then merged to form a more general-purpose model before alignment. The final model is aligned for instruction following, reasoning, and safety-aware dialogue.
+
+ <img src="https://huggingface.co/ServiceNow-AI/Apriel-4.8B-base/resolve/main/eval_vs_latency.png" alt="graph" width="400"/>
+
+ In the figure above, the y-axis shows the average downstream benchmark score and the x-axis shows throughput, measured using [vLLM](https://github.com/vllm-project/vllm) with batch size 8, 256 input tokens, and 32 output tokens.
+
+ ### How to Use
+
+ ```bash
+ pip install transformers
+ ```
+
+ #### Running the Base model
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "ServiceNow-AI/Apriel-5B-Base"
+ device = "cuda" # or "cpu"
+
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)  # custom modeling code, see auto_map in config.json
+
+ inputs = tokenizer.encode("Snow is", return_tensors="pt").to(device)
+ outputs = model.generate(inputs)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ ```python
+ >>> print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
+ Memory footprint: 9664.14 MB
+ ```
+
+ #### Running the Instruct model
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ checkpoint = "ServiceNow-AI/Apriel-5B-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     checkpoint, trust_remote_code=True,  # custom modeling code, see auto_map in config.json
+     torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32
+ ).to(device)
+
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant that provides accurate and concise information."},
+     {"role": "user", "content": "Tell me about artificial intelligence"}
+ ]
+
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(input_text, return_tensors="pt").to(device)
+
+ generation_params = {
+     "max_new_tokens": 512,
+     "temperature": 0.2,
+     "top_p": 0.9,
+     "do_sample": True
+ }
+
+ outputs = model.generate(**inputs, **generation_params)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
+
+ ### Chat Template
+
+ ```
+ <|system|>
+ System message here (optional)
+ <|end|>
+ <|user|>
+ User message here
+ <|end|>
+ <|assistant|>
+ Assistant response here
+ <|end|>
+ ```
+
+ If no system message is provided, the model inserts a blank system prompt to maintain format structure. The model supports structured interaction patterns, including tool calling and reasoning steps for more advanced workflows.
+
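The chat template layout in the README above can be made concrete with a small sketch. The helper below is hypothetical (`render_chat` is not part of the repository or of `transformers`), and the exact newline placement is an assumption read off the example layout; `tokenizer.apply_chat_template` remains the authoritative way to build prompts.

```python
def render_chat(messages, add_generation_prompt=True):
    """Hypothetical helper: lay out messages as <|role|> ... <|end|> blocks."""
    # Assumption: an empty system block is prepended when no system message is given,
    # mirroring the behaviour the README describes.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": ""}] + list(messages)

    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n<|end|>\n")

    # Leave the assistant turn open so the model continues from here.
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "Tell me about artificial intelligence"}])
print(prompt)
```

Comparing the string from this sketch with the output of `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to check whether the whitespace assumptions hold for the real template.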
+ ## Evaluation
+
+ Evaluations were conducted using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy).
+
+ ### Apriel-5B-Base
+
+ | Task Name           | Apriel-5B-Base | OLMo-2-1124-7B | Llama-3.1-8B | Mistral-Nemo-Base-2407 |
+ |---------------------|----------------|----------------|--------------|------------------------|
+ | **Average**         | 58.7           | 58.71          | 61.72        | 66.01                  |
+ | **ARC Challenge**   | 56.7           | 62.7           | 58.2         | 62.9                   |
+ | **ARC Easy**        | 82.4           | 86.0           | 85.7         | 86.7                   |
+ | **MMMLU**           | 44.5           | 35.3           | 47.4         | 54.7                   |
+ | **Global MMLU**     | 57.4           | 52.4           | 61.1         | 68.4                   |
+ | **GSM8k**           | 64.2           | 63.2           | 54.8         | 58.5                   |
+ | **HellaSwag**       | 74.4           | 80.5           | 78.8         | 82.7                   |
+ | **MUSR**            | 39.1           | 39.6           | 38.0         | 39.9                   |
+ | **MBPP**            | 27.6           | 22.4           | 46.0         | 54.6                   |
+ | **MMLU**            | 61.3           | 63.9           | 66.0         | 69.6                   |
+ | **PIQA**            | 78.9           | 81.1           | 81.2         | 82.1                   |
+
+
+
+ ### Apriel-5B-Instruct
+
+ | Task Name         | Apriel-5B-Instruct | OLMo-2-1124-7B-Instruct | Llama-3.1-8B-Instruct | Mistral-Nemo-Instruct-2407 |
+ |-------------------|--------------------|-------------------------|-----------------------|----------------------------|
+ | **Average**       | 49.64              | 43.91                   | 52.60                 | 48.63                      |
+ | **ARC Challenge** | 59.04              | 61.45                   | 64.25                 | 66.38                      |
+ | **GSM8k**         | 80.36              | 79.68                   | 82.63                 | 77.63                      |
+ | **HellaSwag**     | 74.52              | 80.21                   | 78.43                 | 81.71                      |
+ | **BBH**           | 39.82              | 39.95                   | 50.86                 | 50.06                      |
+ | **GPQA**          | 28.36              | 27.85                   | 29.19                 | 29.45                      |
+ | **IF Eval**       | 80.78              | 72.64                   | 79.67                 | 62.85                      |
+ | **MMLU Pro**      | 29.19              | 26.57                   | 37.74                 | 35.09                      |
+ | **MUSR**          | 36.77              | 34.39                   | 38.36                 | 39.02                      |
+ | **MBPP**          | 45.80              | 28.00                   | 59.00                 | 57.60                      |
+ | **TruthfulQA**    | 56.09              | 56.46                   | 55.05                 | 57.69                      |
+ | **Winogrande**    | 62.35              | 65.35                   | 67.01                 | 70.01                      |
+ | **Minerva Math**  | 39.80              | 9.96                    | 36.72                 | 21.46                      |
+ | **MATH500**       | 53.00              | 31.40                   | 45.80                 | 34.40                      |
+ | **AMC23**         | 29.00              | 16.40                   | 21.00                 | 11.50                      |
+ | **MixEval Hard**  | 29.70              | 28.40                   | 43.30                 | 34.60                      |
+
+ ## Intended Use
+
+ The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:
+
+ - Question answering and information retrieval
+ - Content generation and summarization
+ - Code assistance and generation
+ - Logical reasoning and multi-step tasks
+ - Creative writing and ideation
+
+ These models are **not intended** for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.
+
+ ## Limitations
+
+ - **Factual accuracy:** May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
+ - **Bias:** May reflect societal, cultural, or systemic biases present in training data.
+ - **Ethics:** Do not use the model to produce harmful, unlawful, or unethical content.
+ - **Language:** Strongest performance is in English. Output quality may degrade in underrepresented languages.
+ - **Critical use:** Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
+
+ ## Security and Responsible Use
+
+ **Security Responsibilities:**
+ Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
+
+ **Guidelines for Deployers:**
+
+ - Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
+ - Implement validation and filtering processes to prevent harmful or biased outputs.
+ - Continuously perform data privacy checks to guard against unintended data leaks.
+ - Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
+ - Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
+
+ **Guidelines for Users:**
+
+ - Follow established security policies and usage guidelines provided by deployers.
+ - Protect and manage sensitive information when interacting with the model.
+ - Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
+ - Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.
+
+ **Disclaimer:**
+ Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
+
+ ## Pretraining
+
+ ### Model
+
+ - **Architecture:** Transformer decoder with grouped-query attention and YaRN rotary embeddings
+ - **Tokens:** 4.5T
+ - **Precision:** bfloat16
+ - **Knowledge cutoff:** April 2024
+
+ ### Hardware
+
+ - **Compute:** 480 × H100 GPUs
+ - **GPU-hours:** ~91,000 H100-hours
+
+ ### Software
+
+ - **Training stack:** [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
+
+ ## License
+
+ MIT
+
+ ## Citation
+
+ ```bibtex
+ @misc{Apriel-small-language-models,
+   author = {Slam labs team},
+   title = {Apriel - a Family of performant small language models},
+   howpublished = {https://huggingface.co/ServiceNow-AI/Apriel-5B-Instruct},
+   publisher = {SLAM - ServiceNow Language Models Lab},
+   year = {2025}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "_name_or_path": "ServiceNow-AI/Apriel-5B-Instruct",
+   "architectures": [
+     "AprielForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_apriel.AprielConfig",
+     "AutoModelForCausalLM": "modeling_apriel.AprielForCausalLM"
+   },
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 8192,
+   "max_position_embeddings": 16384,
+   "mlp_bias": false,
+   "model_type": "apriel",
+   "num_attention_heads": 24,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": {
+     "attention_factor": null,
+     "beta_fast": 32.0,
+     "beta_slow": 1.0,
+     "factor": 32.0,
+     "original_max_position_embeddings": 4096,
+     "rope_type": "yarn"
+   },
+   "rope_theta": 1000000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.48.3",
+   "use_cache": true,
+   "vocab_size": 131072
+ }
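As a sanity check on the values in `config.json`, the parameter count can be estimated by hand. The sketch below assumes a Llama-style layer layout (bias-free q/k/v/o projections, a gated MLP with three weight matrices, two RMSNorm weights per layer, one final norm, and an untied output head); that layout is an assumption, since `modeling_apriel.py` is not shown here, and `estimate_parameters` is a hypothetical helper.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 4096,
    "intermediate_size": 8192,
    "num_hidden_layers": 28,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "vocab_size": 131072,
    "tie_word_embeddings": False,
}


def estimate_parameters(cfg):
    """Hypothetical estimate assuming a Llama-style decoder without biases."""
    hidden = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]    # 24 * 128 = 3072
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]   # 8 * 128 = 1024

    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden  # q, k, v, o
    mlp = 3 * hidden * cfg["intermediate_size"]                        # gate, up, down
    norms = 2 * hidden                                                 # two RMSNorms per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * hidden
    lm_head = 0 if cfg["tie_word_embeddings"] else embeddings
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = estimate_parameters(config)
print(f"{total:,} parameters")                 # 4,832,071,680 parameters
print(f"{total * 2 / 1e6:.2f} MB in bfloat16")  # 9664.14 MB in bfloat16
```

Under these assumptions the estimate is about 4.83B parameters, and at two bytes per bfloat16 weight it reproduces the 9664.14 MB memory footprint quoted in the README.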
configuration_apriel.py ADDED
@@ -0,0 +1,448 @@
+ # coding=utf-8
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Apriel model configuration"""
+
+ import math
+ from typing import Optional, Tuple
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import is_torch_available, logging
+
+ logger = logging.get_logger(__name__)
+
+ if is_torch_available():
+     import torch
+
+ def _compute_default_rope_parameters(
+     config: Optional[PretrainedConfig] = None,
+     device: Optional["torch.device"] = None,
+     seq_len: Optional[int] = None,
+     **rope_kwargs,
+ ) -> Tuple["torch.Tensor", float]:
+     """
+     Computes the inverse frequencies according to the original RoPE implementation
+     Args:
+         config ([`~transformers.PretrainedConfig`]):
+             The model configuration.
+         device (`torch.device`):
+             The device to use for initialization of the inverse frequencies.
+         seq_len (`int`, *optional*):
+             The current sequence length. Unused for this type of RoPE.
+         rope_kwargs (`Dict`, *optional*):
+             BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
+     Returns:
+         Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+         post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+     """
+     if config is not None and len(rope_kwargs) > 0:
+         raise ValueError(
+             "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
+             f"`_compute_default_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
+         )
+     if len(rope_kwargs) > 0:
+         base = rope_kwargs["base"]
+         dim = rope_kwargs["dim"]
+     elif config is not None:
+         base = config.rope_theta
+         partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+         head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         dim = int(head_dim * partial_rotary_factor)
+
+     attention_factor = 1.0  # Unused in this type of RoPE
+
+     # Compute the inverse frequencies
+     inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
+     return inv_freq, attention_factor
+
+ def _compute_yarn_parameters(
+     config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
+ ) -> Tuple["torch.Tensor", float]:
+     """
+     Computes the inverse frequencies with YaRN scaling (NTK-by-parts interpolation). Please refer to the
+     [original paper](https://arxiv.org/abs/2309.00071)
+     Args:
+         config ([`~transformers.PretrainedConfig`]):
+             The model configuration.
+         device (`torch.device`):
+             The device to use for initialization of the inverse frequencies.
+         seq_len (`int`, *optional*):
+             The current sequence length. Unused for this type of RoPE.
+         rope_kwargs (`Dict`, *optional*):
+             BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
+     Returns:
+         Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+         post-processing scaling factor applied to the computed cos/sin.
+     """
+     # No need to keep BC with yarn, unreleased when this new pattern was created.
+     if len(rope_kwargs) > 0:
+         raise ValueError(
+             f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
+         )
+
+     base = config.rope_theta
+     partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+     head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+     dim = int(head_dim * partial_rotary_factor)
+
+     # Apriel: Use original max_position_embeddings instead of max_position_embeddings
+     max_position_embeddings = config.rope_scaling.get("original_max_position_embeddings", config.max_position_embeddings)
+     factor = config.rope_scaling["factor"]
+
+     # Sets the attention factor as suggested in the paper
+     attention_factor = config.rope_scaling.get("attention_factor")
+     if attention_factor is None:
+         attention_factor = 0.1 * math.log(factor) + 1.0
+
+     # Optional config options
+     # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
+     beta_fast = config.rope_scaling.get("beta_fast") or 32
+     beta_slow = config.rope_scaling.get("beta_slow") or 1
+
+     # Compute the inverse frequencies
+     def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
+         """Inverse dimension formula to find the dimension based on the number of rotations"""
+         return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))
+
+     def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
+         """Find dimension range bounds based on rotations"""
+         low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
+         high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
+         return max(low, 0), min(high, dim - 1)
+
+     def linear_ramp_factor(min, max, dim):
+         if min == max:
+             max += 0.001  # Prevent singularity
+
+         linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
+         ramp_func = torch.clamp(linear_func, 0, 1)
+         return ramp_func
+
+     # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs
+     # to expand the possible context length. In other words, interpolation = apply scaling factor.
+     pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
+     inv_freq_extrapolation = 1.0 / pos_freqs
+     inv_freq_interpolation = 1.0 / (factor * pos_freqs)
+
+     low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)
+
+     # Get n-dimensional rotational scaling corrected for extrapolation
+     inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
+     inv_freq = (
+         inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
+         + inv_freq_extrapolation * inv_freq_extrapolation_factor
+     )
+
+     return inv_freq, attention_factor
+
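To see what `_compute_yarn_parameters` yields for this checkpoint, the same arithmetic can be restated in plain Python with the values from `config.json` (`rope_theta` 1000000.0, `head_dim` 128, `factor` 32.0, `original_max_position_embeddings` 4096, `beta_fast` 32.0, `beta_slow` 1.0). This is a sketch that uses lists instead of tensors and skips the `min == max` singularity guard, which is not triggered for these values; `correction_dim` is an illustrative name.

```python
import math

# Values copied from config.json in this commit.
base, dim = 1_000_000.0, 128
factor, original_max_pos = 32.0, 4096
beta_fast, beta_slow = 32.0, 1.0


def correction_dim(num_rotations):
    """Dimension index at which a frequency completes `num_rotations` over the original context."""
    return dim * math.log(original_max_pos / (num_rotations * 2 * math.pi)) / (2 * math.log(base))


low = max(math.floor(correction_dim(beta_fast)), 0)
high = min(math.ceil(correction_dim(beta_slow)), dim - 1)

inv_freq = []
for i in range(dim // 2):
    extrapolated = 1.0 / base ** (2 * i / dim)  # unscaled RoPE frequency
    interpolated = extrapolated / factor        # frequency after position interpolation
    ramp = min(max((i - low) / (high - low), 0.0), 1.0)
    # ramp == 0 keeps the original frequency, ramp == 1 uses the fully interpolated one.
    inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))

attention_factor = 0.1 * math.log(factor) + 1.0
print(low, high, round(attention_factor, 4))  # 13 31 1.3466
```

With these settings the first 14 frequency pairs (indices 0 to 13) are left untouched, indices 31 and above are divided by the full factor of 32, and the pairs in between are blended linearly.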
+ def _check_received_keys(
+     rope_type: str,
+     received_keys: set,
+     required_keys: set,
+     optional_keys: Optional[set] = None,
+     ignore_keys: Optional[set] = None,
+ ):
+
+     """Compare the received keys in `config.rope_scaling` against the expected and optional keys"""
+     # BC: "rope_type" was originally "type" -- let's check for "rope_type" when "type" is present
+     if "type" in received_keys:
+         received_keys -= {"type"}
+         required_keys.add("rope_type")
+
+     # Some models need to store model-specific keys, and we don't want to throw warning at them
+     if ignore_keys is not None:
+         received_keys -= ignore_keys
+
+     missing_keys = required_keys - received_keys
+     if missing_keys:
+         raise KeyError(f"Missing required keys in `rope_scaling` for 'rope_type'='{rope_type}': {missing_keys}")
+
+     if optional_keys is not None:
+         unused_keys = received_keys - required_keys - optional_keys
+     else:
+         unused_keys = received_keys - required_keys
+     if unused_keys:
+         logger.warning(f"Unrecognized keys in `rope_scaling` for 'rope_type'='{rope_type}': {unused_keys}")
+
+
+ def _validate_default_rope_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
+     rope_scaling = config.rope_scaling
+     rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+     required_keys = {"rope_type"}
+     received_keys = set(rope_scaling.keys())
+     _check_received_keys(rope_type, received_keys, required_keys, ignore_keys=ignore_keys)
+
+ def _validate_yarn_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
+     rope_scaling = config.rope_scaling
+     rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+     required_keys = {"rope_type", "factor", "original_max_position_embeddings"}
+     optional_keys = {"attention_factor", "beta_fast", "beta_slow"}
+     received_keys = set(rope_scaling.keys())
+     _check_received_keys(rope_type, received_keys, required_keys, optional_keys, ignore_keys=ignore_keys)
+
+     factor = rope_scaling["factor"]
+     if factor is None or not isinstance(factor, float) or factor < 1.0:
+         logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+
+     attention_factor = rope_scaling.get("attention_factor")
+     if attention_factor is not None and (not isinstance(attention_factor, float) or attention_factor < 0):
+         logger.warning(
+             f"`rope_scaling`'s attention_factor field must be a float greater than 0, got {attention_factor}"
+         )
+     beta_fast = rope_scaling.get("beta_fast")
+     if beta_fast is not None and not isinstance(beta_fast, float):
+         logger.warning(f"`rope_scaling`'s beta_fast field must be a float, got {beta_fast}")
+     beta_slow = rope_scaling.get("beta_slow")
+     if beta_slow is not None and not isinstance(beta_slow, float):
+         logger.warning(f"`rope_scaling`'s beta_slow field must be a float, got {beta_slow}")
+
+     if (beta_fast or 32) < (beta_slow or 1):
+         logger.warning(
+             f"`rope_scaling`'s beta_fast field must be greater than beta_slow, got beta_fast={beta_fast} "
+             f"(defaults to 32 if None) and beta_slow={beta_slow} (defaults to 1 if None)"
+         )
+ # This maps the "rope_type" string field in rope config to the corresponding function to compute the RoPE parameters
+ # from the model config. You can append new {'rope_type': callable} pairs to this dictionary to enable custom RoPE
+ # parameterizations, as long as the callable has the same signature.
+ ROPE_INIT_FUNCTIONS = {
+     "default": _compute_default_rope_parameters,
+     "yarn": _compute_yarn_parameters,
+ }
+
+ # Like `ROPE_INIT_FUNCTIONS`, this validation function mapping can be dynamically updated for custom RoPE types.
+ ROPE_VALIDATION_FUNCTIONS = {
+     "default": _validate_default_rope_parameters,
+     "yarn": _validate_yarn_parameters,
+ }
+
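The comment above `ROPE_INIT_FUNCTIONS` says new `{'rope_type': callable}` pairs can be appended. The pattern can be sketched with a stand-alone registry; everything below is hypothetical (the `"linear"` type, `compute_linear`, the list-based frequencies, and the `SimpleNamespace` stand-in for a config object) and only illustrates the dispatch-by-`rope_type` idea, not code from this repository.

```python
from types import SimpleNamespace


def compute_default(config, device=None, seq_len=None):
    """Stand-in for the default entry: plain RoPE inverse frequencies as a list."""
    dim = config.head_dim
    inv_freq = [1.0 / config.rope_theta ** (2 * i / dim) for i in range(dim // 2)]
    return inv_freq, 1.0


def compute_linear(config, device=None, seq_len=None):
    """Hypothetical 'linear' type: divide every default frequency by `factor`."""
    inv_freq, attention_factor = compute_default(config, device, seq_len)
    factor = config.rope_scaling["factor"]
    return [freq / factor for freq in inv_freq], attention_factor


# Stand-in registry mirroring the shape of ROPE_INIT_FUNCTIONS.
registry = {"default": compute_default}
registry["linear"] = compute_linear  # new entry with the same call signature

config = SimpleNamespace(
    head_dim=128,
    rope_theta=1_000_000.0,
    rope_scaling={"rope_type": "linear", "factor": 4.0},
)
rope_type = config.rope_scaling.get("rope_type", "default")
inv_freq, attention_factor = registry[rope_type](config)
print(rope_type, inv_freq[0], attention_factor)
```

In the file shown in this commit, a new type would also need an entry in `ROPE_VALIDATION_FUNCTIONS`; otherwise `rope_config_validation` logs a "Missing validation function mapping" warning for it.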
+ def rope_config_validation(config: PretrainedConfig, ignore_keys: Optional[set] = None):
+     """
+     Validate the RoPE config arguments, given a `PretrainedConfig` object
+     """
+     rope_scaling = getattr(config, "rope_scaling", None)  # not a default parameter in `PretrainedConfig`
+     if rope_scaling is None:
+         return
+
+     # BC: "rope_type" was originally "type"
+     rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
+     validation_fn = ROPE_VALIDATION_FUNCTIONS.get(rope_type)
+     if validation_fn is not None:
+         validation_fn(config, ignore_keys=ignore_keys)
+     else:
+         logger.warning(
+             f"Missing validation function mapping in `ROPE_VALIDATION_FUNCTIONS` for 'rope_type'='{rope_type}'"
+         )
+
252
+ class AprielConfig(PretrainedConfig):
253
+ r"""
254
+ This is the configuration class to store the configuration of a [`AprielModel`]. It is used to instantiate an Apriel
255
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
256
+ defaults will yield a similar configuration to that of the Apriel-5B-Base.
257
+
258
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
259
+ documentation from [`PretrainedConfig`] for more information.
260
+
261
+
262
+ Args:
263
+ vocab_size (`int`, *optional*, defaults to 32000):
264
+ Vocabulary size of the Apriel model. Defines the number of different tokens that can be represented by the
265
+ `inputs_ids` passed when calling [`AprielModel`]
266
+ hidden_size (`int`, *optional*, defaults to 4096):
267
+ Dimension of the hidden representations.
268
+ intermediate_size (`int`, *optional*, defaults to 11008):
269
+ Dimension of the MLP representations.
270
+ num_hidden_layers (`int`, *optional*, defaults to 32):
271
+ Number of hidden layers in the Transformer decoder.
272
+ num_attention_heads (`int`, *optional*, defaults to 32):
273
+ Number of attention heads for each attention layer in the Transformer decoder.
274
+ num_key_value_heads (`int`, *optional*):
275
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
276
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
277
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
278
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
279
+ by meanpooling all the original heads within that group. For more details checkout [this
280
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
281
+ `num_attention_heads`.
282
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
283
+ The non-linear activation function (function or string) in the decoder.
284
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
285
+ The maximum sequence length that this model might ever be used with. Apriel-5B-Base supports up to 16384 tokens.
286
+ initializer_range (`float`, *optional*, defaults to 0.02):
287
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
288
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
289
+ The epsilon used by the rms normalization layers.
290
+ use_cache (`bool`, *optional*, defaults to `True`):
291
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
292
+ relevant if `config.is_decoder=True`.
293
+ pad_token_id (`int`, *optional*):
294
+ Padding token id.
295
+ bos_token_id (`int`, *optional*, defaults to 1):
296
+ Beginning of stream token id.
297
+ eos_token_id (`int`, *optional*, defaults to 2):
298
+ End of stream token id.
299
+ pretraining_tp (`int`, *optional*, defaults to 1):
300
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
301
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
302
+ understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
303
+ results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
304
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
305
+ Whether to tie weight embeddings
306
+ rope_theta (`float`, *optional*, defaults to 10000.0):
307
+ The base period of the RoPE embeddings.
308
+ rope_scaling (`Dict`, *optional*):
309
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
310
+ and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
311
+ accordingly.
312
+ Expected contents:
313
+ `rope_type` (`str`):
314
+ The sub-variant of RoPE to use. Can be one of ['default', 'yarn'], with 'default' being the original RoPE implementation.
315
+ `factor` (`float`, *optional*):
316
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
317
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
318
+ original maximum pre-trained length.
319
+ `original_max_position_embeddings` (`int`, *optional*):
320
+ Used with 'yarn', 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
321
+ pretraining.
322
+ `attention_factor` (`float`, *optional*):
323
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
324
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
325
+ `factor` field to infer the suggested value.
326
+ `beta_fast` (`float`, *optional*):
327
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
328
+ ramp function. If unspecified, it defaults to 32.
329
+ `beta_slow` (`float`, *optional*):
330
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
331
+ ramp function. If unspecified, it defaults to 1.
332
+ `short_factor` (`List[float]`, *optional*):
333
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
334
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
335
+ size divided by the number of attention heads divided by 2
336
+ `long_factor` (`List[float]`, *optional*):
337
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (>
338
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
339
+ size divided by the number of attention heads divided by 2
340
+ `low_freq_factor` (`float`, *optional*):
341
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
342
+ `high_freq_factor` (`float`, *optional*):
343
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
344
+ attention_bias (`bool`, *optional*, defaults to `False`):
345
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
346
+ attention_dropout (`float`, *optional*, defaults to 0.0):
347
+ The dropout ratio for the attention probabilities.
348
+ mlp_bias (`bool`, *optional*, defaults to `False`):
349
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
350
+ head_dim (`int`, *optional*):
351
+ The attention head dimension. If None, it will default to hidden_size // num_attention_heads
352
+
353
+ ```python
354
+ >>> from transformers import AprielModel, AprielConfig
355
+
356
+ >>> # Initializing an Apriel-5B-Base style configuration
357
+ >>> configuration = AprielConfig()
358
+
359
+ >>> # Initializing a model from the Apriel-5B-Base style configuration
360
+ >>> model = AprielModel(configuration)
361
+
362
+ >>> # Accessing the model configuration
363
+ >>> configuration = model.config
364
+ ```"""
365
+
366
+ model_type = "apriel"
367
+ keys_to_ignore_at_inference = ["past_key_values"]
368
+ # Default tensor parallel plan for base model `AprielModel`
369
+ base_model_tp_plan = {
370
+ "layers.*.self_attn.q_proj": "colwise",
371
+ "layers.*.self_attn.k_proj": "colwise",
372
+ "layers.*.self_attn.v_proj": "colwise",
373
+ "layers.*.self_attn.o_proj": "rowwise",
374
+ "layers.*.mlp.gate_proj": "colwise",
375
+ "layers.*.mlp.up_proj": "colwise",
376
+ "layers.*.mlp.down_proj": "rowwise",
377
+ }
378
+ base_model_pp_plan = {
379
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
380
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
381
+ "norm": (["hidden_states"], ["hidden_states"]),
382
+ }
383
+
384
+ def __init__(
385
+ self,
386
+ vocab_size=32000,
387
+ hidden_size=4096,
388
+ intermediate_size=11008,
389
+ num_hidden_layers=32,
390
+ num_attention_heads=32,
391
+ num_key_value_heads=None,
392
+ hidden_act="silu",
393
+ max_position_embeddings=2048,
394
+ initializer_range=0.02,
395
+ rms_norm_eps=1e-6,
396
+ use_cache=True,
397
+ pad_token_id=None,
398
+ bos_token_id=1,
399
+ eos_token_id=2,
400
+ pretraining_tp=1,
401
+ tie_word_embeddings=False,
402
+ rope_theta=10000.0,
403
+ rope_scaling=None,
404
+ attention_bias=False,
405
+ attention_dropout=0.0,
406
+ mlp_bias=False,
407
+ head_dim=None,
408
+ **kwargs,
409
+ ):
410
+ self.vocab_size = vocab_size
411
+ self.max_position_embeddings = max_position_embeddings
412
+ self.hidden_size = hidden_size
413
+ self.intermediate_size = intermediate_size
414
+ self.num_hidden_layers = num_hidden_layers
415
+ self.num_attention_heads = num_attention_heads
416
+
417
+ # for backward compatibility
418
+ if num_key_value_heads is None:
419
+ num_key_value_heads = num_attention_heads
420
+
421
+ self.num_key_value_heads = num_key_value_heads
422
+ self.hidden_act = hidden_act
423
+ self.initializer_range = initializer_range
424
+ self.rms_norm_eps = rms_norm_eps
425
+ self.pretraining_tp = pretraining_tp
426
+ self.use_cache = use_cache
427
+ self.rope_theta = rope_theta
428
+ self.rope_scaling = rope_scaling
429
+ self.attention_bias = attention_bias
430
+ self.attention_dropout = attention_dropout
431
+ self.mlp_bias = mlp_bias
432
+ self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
433
+ # Validate the correctness of rotary position embeddings parameters
434
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
435
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
436
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
437
+ rope_config_validation(self)
438
+
439
+ super().__init__(
440
+ pad_token_id=pad_token_id,
441
+ bos_token_id=bos_token_id,
442
+ eos_token_id=eos_token_id,
443
+ tie_word_embeddings=tie_word_embeddings,
444
+ **kwargs,
445
+ )
446
+
447
+
448
+ __all__ = ["AprielConfig"]
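The fallback logic in `AprielConfig.__init__` above can be illustrated with a minimal, dependency-free sketch. The helper name `resolve_defaults` is hypothetical and not part of this repository. It only mirrors three behaviours visible in the constructor: `num_key_value_heads` falling back to `num_attention_heads`, `head_dim` defaulting to `hidden_size // num_attention_heads`, and a legacy `type` key being copied to `rope_type`.

```python
from typing import Optional


def resolve_defaults(
    hidden_size: int = 4096,
    num_attention_heads: int = 32,
    num_key_value_heads: Optional[int] = None,
    head_dim: Optional[int] = None,
    rope_scaling: Optional[dict] = None,
) -> dict:
    """Hypothetical helper mirroring the fallbacks in AprielConfig.__init__."""
    # Backward compatibility: without an explicit KV head count,
    # use one KV head per attention head.
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads

    # head_dim falls back to an even split of the hidden size across heads.
    if head_dim is None:
        head_dim = hidden_size // num_attention_heads

    # Legacy configs used the key "type"; copy it to "rope_type".
    # Assumption: this sketch works on a copy, whereas the constructor
    # above updates the dictionary it was given.
    if rope_scaling is not None:
        rope_scaling = dict(rope_scaling)
        if "type" in rope_scaling:
            rope_scaling["rope_type"] = rope_scaling["type"]

    return {
        "num_key_value_heads": num_key_value_heads,
        "head_dim": head_dim,
        "rope_scaling": rope_scaling,
    }


resolved = resolve_defaults(rope_scaling={"type": "yarn", "factor": 4.0})
print(resolved)
```

With the constructor's default sizes (`hidden_size=4096`, `num_attention_heads=32`), the resolved `head_dim` is 128. The sketch omits the call to `rope_config_validation`, which the real class runs after the key copy.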
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.47.1"
6
+ }
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:50d15d4b66a506f71676f740bfa928b11678b0cee42dd5673f538db413419970
3
+ size 4966300624
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4de73dd452d8707c9a32b45b473389e341e0b765b60b1e0f32598e5912b95a29
3
+ size 4697872352
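The two `.safetensors` entries above are Git LFS pointer files, not the weights themselves: each holds a spec version, a content hash, and the byte size of the real file. As a rough sketch (the function name `parse_lfs_pointer` is made up for illustration), the pointers can be parsed and their sizes summed like this:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated 'key value' lines of a Git LFS pointer."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the two entries above.
POINTER_1 = """version https://git-lfs.github.com/spec/v1
oid sha256:50d15d4b66a506f71676f740bfa928b11678b0cee42dd5673f538db413419970
size 4966300624
"""
POINTER_2 = """version https://git-lfs.github.com/spec/v1
oid sha256:4de73dd452d8707c9a32b45b473389e341e0b765b60b1e0f32598e5912b95a29
size 4697872352
"""

shard_1 = parse_lfs_pointer(POINTER_1)
shard_2 = parse_lfs_pointer(POINTER_2)
total_bytes = shard_1["size"] + shard_2["size"]
print(total_bytes)
```

The two sizes sum to 9664172976 bytes, slightly more than the `total_size` of 9664143360 recorded in `model.safetensors.index.json`. The difference is presumably per-file header overhead, though nothing in this commit confirms that.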
model.safetensors.index.json ADDED
@@ -0,0 +1,262 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 9664143360
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00002-of-00002.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
17
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
19
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
26
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
29
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
31
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
38
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
41
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
43
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
50
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
53
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
55
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
62
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00002.safetensors",
63
+ "model.layers.14.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
64
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
65
+ "model.layers.14.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
66
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
67
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00002.safetensors",
72
+ "model.layers.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
73
+ "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
74
+ "model.layers.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
75
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
76
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
77
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
78
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
79
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
80
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
81
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
82
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
83
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
84
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
85
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
86
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
87
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
88
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
89
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
90
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
91
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
92
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
93
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
94
+ "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
95
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
96
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
97
+ "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
98
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
99
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
100
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
101
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
102
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
103
+ "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
104
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
105
+ "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
106
+ "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
107
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
108
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
109
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
110
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
111
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
112
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
113
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
114
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
115
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
116
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
122
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
125
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
126
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
127
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
128
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
129
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
130
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
131
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
132
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
133
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
134
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
135
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
136
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
137
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
138
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
139
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
140
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
141
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
142
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
143
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
144
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
145
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
146
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
147
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
148
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
149
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
150
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
151
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
152
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
153
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
154
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
155
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
156
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
157
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
158
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
159
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
160
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
161
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
162
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
163
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
164
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
165
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
166
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
167
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
168
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
169
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
170
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
171
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
172
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
173
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
174
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
175
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
176
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
177
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
178
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
179
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
180
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
181
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
182
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
183
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
184
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
185
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
186
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
187
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
188
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
189
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
190
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
191
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
192
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
193
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
194
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
195
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
196
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
197
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
198
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
199
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
202
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
203
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
205
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
206
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
208
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
209
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
210
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
211
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
213
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
214
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
216
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
217
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
218
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
219
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
220
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
221
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
223
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
224
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
225
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
227
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
228
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
230
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
231
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
232
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
233
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
235
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
236
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
238
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
239
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
241
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
242
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
243
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
245
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
246
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
247
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
249
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
250
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
252
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
253
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
254
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
257
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
258
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
259
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
260
+ "model.norm.weight": "model-00002-of-00002.safetensors"
261
+ }
262
+ }
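The `weight_map` above tells a loader which shard file holds each tensor. The sketch below shows one way to inspect such a map, using only the standard library. The helper names are hypothetical, and `WEIGHT_MAP_EXCERPT` is a hand-picked subset of the entries listed above, not the full index.

```python
import re
from collections import Counter, defaultdict

SHARD_1 = "model-00001-of-00002.safetensors"
SHARD_2 = "model-00002-of-00002.safetensors"

# A few entries copied from the weight_map above (excerpt only).
WEIGHT_MAP_EXCERPT = {
    "model.embed_tokens.weight": SHARD_1,
    "model.layers.13.self_attn.v_proj.weight": SHARD_1,
    "model.layers.14.mlp.gate_proj.weight": SHARD_1,
    "model.layers.14.mlp.up_proj.weight": SHARD_2,
    "model.layers.15.mlp.up_proj.weight": SHARD_2,
    "lm_head.weight": SHARD_2,
}


def tensors_per_shard(weight_map: dict) -> Counter:
    """Count how many tensors each shard file holds."""
    return Counter(weight_map.values())


def layers_spanning_shards(weight_map: dict) -> list:
    """Return indices of decoder layers whose tensors sit in more than one shard."""
    shards_by_layer = defaultdict(set)
    for name, shard in weight_map.items():
        match = re.match(r"model\.layers\.(\d+)\.", name)
        if match:
            shards_by_layer[int(match.group(1))].add(shard)
    return sorted(layer for layer, shards in shards_by_layer.items() if len(shards) > 1)


counts = tensors_per_shard(WEIGHT_MAP_EXCERPT)
split_layers = layers_spanning_shards(WEIGHT_MAP_EXCERPT)
print(counts, split_layers)
```

In the full map above, layer 14 is the only layer split across both files: its `gate_proj` and attention projections are in the first shard, while its remaining tensors are in the second.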
modeling_apriel.py ADDED
@@ -0,0 +1,1165 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
+ from typing import Callable, List, Optional, Tuple, Union
+
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
+ from transformers.generation import GenerationMixin
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+ from transformers.modeling_outputs import (
+     BaseModelOutputWithPast,
+     CausalLMOutputWithPast,
+     QuestionAnsweringModelOutput,
+     SequenceClassifierOutputWithPast,
+     TokenClassifierOutput,
+ )
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+ from transformers.processing_utils import Unpack
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
+ from transformers.utils import (
+     LossKwargs,
+     add_code_sample_docstrings,
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     logging,
+     replace_return_docstrings,
+ )
+ from transformers.utils.deprecation import deprecate_kwarg
+ from .configuration_apriel import AprielConfig
+ from .configuration_apriel import ROPE_INIT_FUNCTIONS
+
+
+ logger = logging.get_logger(__name__)
+
+ _CHECKPOINT_FOR_DOC = "ServiceNow-AI/Apriel-5B-Instruct"
+ _CONFIG_FOR_DOC = "AprielConfig"
+
+
+ class AprielRMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         AprielRMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+     def extra_repr(self):
+         return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+
+
+ ALL_LAYERNORM_LAYERS.append(AprielRMSNorm)
+
+
+ class AprielRotaryEmbedding(nn.Module):
+     def __init__(self, config: AprielConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     def _dynamic_frequency_update(self, position_ids, device):
+         """
+         dynamic RoPE layers should recompute `inv_freq` in the following situations:
+         1 - growing beyond the cached sequence length (allow scaling)
+         2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
+         """
+         seq_len = torch.max(position_ids) + 1
+         if seq_len > self.max_seq_len_cached:  # growth
+             inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
+             self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
+             self.max_seq_len_cached = seq_len
+
+         if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
+             # This .to() is needed if the model has been moved to a device after being initialized (because
+             # the buffer is automatically moved, but not the original copy)
+             self.original_inv_freq = self.original_inv_freq.to(device)
+             self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
+             self.max_seq_len_cached = self.original_max_seq_len
+
+     @torch.no_grad()
+     def forward(self, x, position_ids):
+         if "dynamic" in self.rope_type:
+             self._dynamic_frequency_update(position_ids, device=x.device)
+
+         # Core RoPE block
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+         position_ids_expanded = position_ids[:, None, :].float()
+         # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
+         device_type = x.device.type
+         device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+         with torch.autocast(device_type=device_type, enabled=False):
+             freqs = (inv_freq_expanded.float().to(x.device) @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos()
+             sin = emb.sin()
+
+         # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
+         cos = cos * self.attention_scaling
+         sin = sin * self.attention_scaling
+
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors.
+
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+         position_ids (`torch.Tensor`, *optional*):
+             Deprecated and unused.
+         unsqueeze_dim (`int`, *optional*, defaults to 1):
+             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+     Returns:
+         `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos = cos.unsqueeze(unsqueeze_dim)
+     sin = sin.unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
+ class AprielMLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, x):
+         down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+         return down_proj
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+ def eager_attention_forward(
+     module: nn.Module,
+     query: torch.Tensor,
+     key: torch.Tensor,
+     value: torch.Tensor,
+     attention_mask: Optional[torch.Tensor],
+     scaling: float,
+     dropout: float = 0.0,
+     **kwargs,
+ ):
+     key_states = repeat_kv(key, module.num_key_value_groups)
+     value_states = repeat_kv(value, module.num_key_value_groups)
+
+     attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+     if attention_mask is not None:
+         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+         attn_weights = attn_weights + causal_mask
+
+     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+     attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+     attn_output = torch.matmul(attn_weights, value_states)
+     attn_output = attn_output.transpose(1, 2).contiguous()
+
+     return attn_output, attn_weights
+
+
+ class AprielAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: AprielConfig, layer_idx: int):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = config.attention_dropout
+         self.is_causal = True
+
+         self.q_proj = nn.Linear(
+             config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.k_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.v_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.o_proj = nn.Linear(
+             config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+         )
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_value: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
+                 logger.warning_once(
+                     "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
+                     'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+                 )
+             else:
+                 attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         attn_output, attn_weights = attention_interface(
+             self,
+             query_states,
+             key_states,
+             value_states,
+             attention_mask,
+             dropout=0.0 if not self.training else self.attention_dropout,
+             scaling=self.scaling,
+             **kwargs,
+         )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
+
+
+ class AprielDecoderLayer(nn.Module):
+     def __init__(self, config: AprielConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+
+         self.self_attn = AprielAttention(config=config, layer_idx=layer_idx)
+
+         self.mlp = AprielMLP(config)
+         self.input_layernorm = AprielRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.post_attention_layernorm = AprielRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         return outputs
+
+
+ APRIEL_START_DOCSTRING = r"""
+     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+     library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+     etc.)
+
+     This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+     and behavior.
+
+     Parameters:
+         config ([`AprielConfig`]):
+             Model configuration class with all the parameters of the model. Initializing with a config file does not
+             load the weights associated with the model, only the configuration. Check out the
+             [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Apriel Model outputting raw hidden-states without any specific head on top.",
+     APRIEL_START_DOCSTRING,
+ )
+ class AprielPreTrainedModel(PreTrainedModel):
+     config_class = AprielConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["AprielDecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_flex_attn = True
+     _supports_cache_class = True
+     _supports_quantized_cache = True
+     _supports_static_cache = True
+     _supports_attention_backend = True
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+
+
+ APRIEL_INPUTS_DOCSTRING = r"""
+     Args:
+         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+             Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+             it.
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             [What are input IDs?](../glossary#input-ids)
+         attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+             - 1 for tokens that are **not masked**,
+             - 0 for tokens that are **masked**.
+
+             [What are attention masks?](../glossary#attention-mask)
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+             `past_key_values`).
+
+             If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+             and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+             information on the default strategy.
+         position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
+             config.max_position_embeddings - 1]`.
+
+             [What are position IDs?](../glossary#position-ids)
+         past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
+             Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+             blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
+             returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+
+             Two formats are allowed:
+             - a [`~cache_utils.Cache`] instance, see our
+               [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
+             - Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2
+               tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as
+               the legacy cache format.
+
+             The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
+             legacy cache format will be returned.
+
+             If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+             have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+             of shape `(batch_size, sequence_length)`.
+         inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+             Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+             is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+             model's internal embedding lookup matrix.
+         use_cache (`bool`, *optional*):
+             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+             `past_key_values`).
+         output_attentions (`bool`, *optional*):
+             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+             tensors for more detail.
+         output_hidden_states (`bool`, *optional*):
+             Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+             more detail.
+         return_dict (`bool`, *optional*):
+             Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+         cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+             Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
+             this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
+             the complete sequence length.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Apriel Model outputting raw hidden-states without any specific head on top.",
+     APRIEL_START_DOCSTRING,
+ )
+ class AprielModel(AprielPreTrainedModel):
+     """
+     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`AprielDecoderLayer`]
+
+     Args:
+         config: AprielConfig
+     """
+
+     def __init__(self, config: AprielConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [AprielDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = AprielRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.rotary_emb = AprielRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     @add_start_docstrings_to_model_forward(APRIEL_INPUTS_DOCSTRING)
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Union[Tuple, BaseModelOutputWithPast]:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache()
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         causal_mask = self._update_causal_mask(
+             attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+         )
+
+         hidden_states = inputs_embeds
+
+         # create position embeddings to be shared across the decoder layers
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     decoder_layer.__call__,
+                     hidden_states,
+                     causal_mask,
+                     position_ids,
+                     past_key_values,
+                     output_attentions,
+                     use_cache,
+                     cache_position,
+                     position_embeddings,
+                 )
+             else:
+                 layer_outputs = decoder_layer(
+                     hidden_states,
+                     attention_mask=causal_mask,
+                     position_ids=position_ids,
+                     past_key_value=past_key_values,
+                     output_attentions=output_attentions,
+                     use_cache=use_cache,
+                     cache_position=cache_position,
+                     position_embeddings=position_embeddings,
+                     **flash_attn_kwargs,
+                 )
+
+             hidden_states = layer_outputs[0]
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         output = BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values if use_cache else None,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+         return output if return_dict else output.to_tuple()
+
+     def _update_causal_mask(
+         self,
+         attention_mask: torch.Tensor,
+         input_tensor: torch.Tensor,
+         cache_position: torch.Tensor,
+         past_key_values: Cache,
+         output_attentions: bool,
+     ):
+         if self.config._attn_implementation == "flash_attention_2":
+             if attention_mask is not None and (attention_mask == 0.0).any():
+                 return attention_mask
+             return None
+
+         # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+         # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+         # to infer the attention mask.
+         past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+         using_static_cache = isinstance(past_key_values, StaticCache)
+
+         # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+         if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
+             if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                 attention_mask,
+                 inputs_embeds=input_tensor,
+                 past_key_values_length=past_seen_tokens,
+                 is_training=self.training,
+             ):
+                 return None
+
+         dtype, device = input_tensor.dtype, input_tensor.device
+         sequence_length = input_tensor.shape[1]
+         if using_static_cache:
+             target_length = past_key_values.get_max_cache_shape()
+         else:
+             target_length = (
+                 attention_mask.shape[-1]
+                 if isinstance(attention_mask, torch.Tensor)
+                 else past_seen_tokens + sequence_length + 1
+             )
+
+         # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
+         causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
+             attention_mask,
+             sequence_length=sequence_length,
+             target_length=target_length,
+             dtype=dtype,
+             device=device,
+             cache_position=cache_position,
+             batch_size=input_tensor.shape[0],
+         )
+
+         if (
+             self.config._attn_implementation == "sdpa"
+             and attention_mask is not None
+             and attention_mask.device.type in ["cuda", "xpu"]
+             and not output_attentions
+         ):
+             # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+             # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+             # Details: https://github.com/pytorch/pytorch/issues/110213
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+         return causal_mask
+
+     @staticmethod
+     def _prepare_4d_causal_attention_mask_with_cache_position(
+         attention_mask: torch.Tensor,
+         sequence_length: int,
+         target_length: int,
+         dtype: torch.dtype,
+         device: torch.device,
+         cache_position: torch.Tensor,
+         batch_size: int,
+         **kwargs,
+     ):
+         """
+         Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+         `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
+
+         Args:
+             attention_mask (`torch.Tensor`):
+                 A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
+                 `(batch_size, 1, query_length, key_value_length)`.
+             sequence_length (`int`):
+                 The sequence length being processed.
+             target_length (`int`):
+                 The target length: when generating with static cache, the mask should be as long as the static cache,
+                 to account for the 0 padding, the part of the cache that is not filled yet.
+             dtype (`torch.dtype`):
+                 The dtype to use for the 4D attention mask.
+             device (`torch.device`):
+                 The device to place the 4D attention mask on.
+             cache_position (`torch.Tensor`):
+                 Indices depicting the position of the input sequence tokens in the sequence.
+             batch_size (`int`):
+                 Batch size.
+         """
+         if attention_mask is not None and attention_mask.dim() == 4:
+             # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+             causal_mask = attention_mask
+         else:
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = torch.full(
+                 (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
+             )
+             if sequence_length != 1:
+                 causal_mask = torch.triu(causal_mask, diagonal=1)
+             causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
+             causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+             if attention_mask is not None:
+                 causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                 mask_length = attention_mask.shape[-1]
+                 padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
+                     causal_mask.device
+                 )
+                 padding_mask = padding_mask == 0
+                 causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                     padding_mask, min_dtype
+                 )
+
+         return causal_mask
+
+
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+
752
+ class AprielForCausalLM(AprielPreTrainedModel, GenerationMixin):
753
+ _tied_weights_keys = ["lm_head.weight"]
754
+ _tp_plan = {"lm_head": "colwise_rep"}
755
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
756
+
757
+ def __init__(self, config):
758
+ super().__init__(config)
759
+ self.model = AprielModel(config)
760
+ self.vocab_size = config.vocab_size
761
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
762
+
763
+ # Initialize weights and apply final processing
764
+ self.post_init()
765
+
766
+ def get_input_embeddings(self):
767
+ return self.model.embed_tokens
768
+
769
+ def set_input_embeddings(self, value):
770
+ self.model.embed_tokens = value
771
+
772
+ def get_output_embeddings(self):
773
+ return self.lm_head
774
+
775
+ def set_output_embeddings(self, new_embeddings):
776
+ self.lm_head = new_embeddings
777
+
778
+ def set_decoder(self, decoder):
779
+ self.model = decoder
780
+
781
+ def get_decoder(self):
782
+ return self.model
783
+
784
+ @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
785
+ @add_start_docstrings_to_model_forward(APRIEL_INPUTS_DOCSTRING)
786
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
787
+ def forward(
788
+ self,
789
+ input_ids: Optional[torch.LongTensor] = None,
790
+ attention_mask: Optional[torch.Tensor] = None,
791
+ position_ids: Optional[torch.LongTensor] = None,
792
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
793
+ inputs_embeds: Optional[torch.FloatTensor] = None,
794
+ labels: Optional[torch.LongTensor] = None,
795
+ use_cache: Optional[bool] = None,
796
+ output_attentions: Optional[bool] = None,
797
+ output_hidden_states: Optional[bool] = None,
798
+ return_dict: Optional[bool] = None,
799
+ cache_position: Optional[torch.LongTensor] = None,
800
+ logits_to_keep: Union[int, torch.Tensor] = 0,
801
+ **kwargs: Unpack[KwargsForCausalLM],
802
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
803
+ r"""
804
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
805
+ Labels for computing the language modeling loss. Indices should either be in `[0, ...,
806
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
807
+ (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
808
+
809
+ logits_to_keep (`int` or `torch.Tensor`, *optional*):
810
+ If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
811
+ `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
812
+ token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
813
+ If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
814
+ This is useful when using packed tensor format (single dimension for batch and sequence length).
815
+
816
+ Returns:
817
+
818
+ Example:
819
+
820
+ ```python
821
+ >>> from transformers import AutoTokenizer, AprielForCausalLM
822
+
823
+ >>> model = AprielForCausalLM.from_pretrained("ServiceNow-AI/Apriel-5B-Instruct")
824
+ >>> tokenizer = AutoTokenizer.from_pretrained("ServiceNow-AI/Apriel-5B-Instruct")
825
+
826
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
827
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
828
+
829
+ >>> # Generate
830
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
831
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
832
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
833
+ ```"""
834
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
835
+ output_hidden_states = (
836
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
837
+ )
838
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
839
+
840
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
841
+ outputs = self.model(
842
+ input_ids=input_ids,
843
+ attention_mask=attention_mask,
844
+ position_ids=position_ids,
845
+ past_key_values=past_key_values,
846
+ inputs_embeds=inputs_embeds,
847
+ use_cache=use_cache,
848
+ output_attentions=output_attentions,
849
+ output_hidden_states=output_hidden_states,
850
+ return_dict=return_dict,
851
+ cache_position=cache_position,
852
+ **kwargs,
853
+ )
854
+
855
+ hidden_states = outputs[0]
856
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
857
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
858
+ logits = self.lm_head(hidden_states[:, slice_indices, :])
859
+
860
+ loss = None
861
+ if labels is not None:
862
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
863
+
864
+ if not return_dict:
865
+ output = (logits,) + outputs[1:]
866
+ return (loss,) + output if loss is not None else output
867
+
868
+ return CausalLMOutputWithPast(
869
+ loss=loss,
870
+ logits=logits,
871
+ past_key_values=outputs.past_key_values,
872
+ hidden_states=outputs.hidden_states,
873
+ attentions=outputs.attentions,
874
+ )
875
+
876
+
877
+ @add_start_docstrings(
878
+ """
879
+ The Apriel Model transformer with a sequence classification head on top (linear layer).
880
+
881
+ [`AprielForSequenceClassification`] uses the last token in order to do the classification, as other causal models
882
+ (e.g. GPT-2) do.
883
+
884
+ Since it does classification on the last token, it needs to know the position of the last token. If a
885
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
886
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
887
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
888
+ each row of the batch).
889
+ """,
890
+ APRIEL_START_DOCSTRING,
891
+ )
892
+ class AprielForSequenceClassification(AprielPreTrainedModel):
893
+ def __init__(self, config):
894
+ super().__init__(config)
895
+ self.num_labels = config.num_labels
896
+ self.model = AprielModel(config)
897
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
898
+
899
+ # Initialize weights and apply final processing
900
+ self.post_init()
901
+
902
+ def get_input_embeddings(self):
903
+ return self.model.embed_tokens
904
+
905
+ def set_input_embeddings(self, value):
906
+ self.model.embed_tokens = value
907
+
908
+ @add_start_docstrings_to_model_forward(APRIEL_INPUTS_DOCSTRING)
909
+ def forward(
910
+ self,
911
+ input_ids: Optional[torch.LongTensor] = None,
912
+ attention_mask: Optional[torch.Tensor] = None,
913
+ position_ids: Optional[torch.LongTensor] = None,
914
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
915
+ inputs_embeds: Optional[torch.FloatTensor] = None,
916
+ labels: Optional[torch.LongTensor] = None,
917
+ use_cache: Optional[bool] = None,
918
+ output_attentions: Optional[bool] = None,
919
+ output_hidden_states: Optional[bool] = None,
920
+ return_dict: Optional[bool] = None,
921
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
922
+ r"""
923
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
924
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
925
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
926
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
927
+ """
928
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
929
+
930
+ transformer_outputs = self.model(
931
+ input_ids,
932
+ attention_mask=attention_mask,
933
+ position_ids=position_ids,
934
+ past_key_values=past_key_values,
935
+ inputs_embeds=inputs_embeds,
936
+ use_cache=use_cache,
937
+ output_attentions=output_attentions,
938
+ output_hidden_states=output_hidden_states,
939
+ return_dict=return_dict,
940
+ )
941
+ hidden_states = transformer_outputs[0]
942
+ logits = self.score(hidden_states)
943
+
944
+ if input_ids is not None:
945
+ batch_size = input_ids.shape[0]
946
+ else:
947
+ batch_size = inputs_embeds.shape[0]
948
+
949
+ if self.config.pad_token_id is None and batch_size != 1:
950
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
951
+ if self.config.pad_token_id is None:
952
+ last_non_pad_token = -1
953
+ elif input_ids is not None:
954
+ # To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
955
+ non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
956
+ token_indices = torch.arange(input_ids.shape[-1], device=logits.device)
957
+ last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
958
+ else:
959
+ last_non_pad_token = -1
960
+ logger.warning_once(
961
+ f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
962
+ "unexpected if using padding tokens in conjunction with `inputs_embeds`."
963
+ )
964
+
965
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
966
+
967
+ loss = None
968
+ if labels is not None:
969
+ loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
970
+
971
+ if not return_dict:
972
+ output = (pooled_logits,) + transformer_outputs[1:]
973
+ return ((loss,) + output) if loss is not None else output
974
+
975
+ return SequenceClassifierOutputWithPast(
976
+ loss=loss,
977
+ logits=pooled_logits,
978
+ past_key_values=transformer_outputs.past_key_values,
979
+ hidden_states=transformer_outputs.hidden_states,
980
+ attentions=transformer_outputs.attentions,
981
+ )
982
+
983
+
984
+ @add_start_docstrings(
985
+ """
986
+ The Apriel Model transformer with a span classification head on top for extractive question-answering tasks like
987
+ SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
988
+ """,
989
+ APRIEL_START_DOCSTRING,
990
+ )
991
+ class AprielForQuestionAnswering(AprielPreTrainedModel):
992
+ base_model_prefix = "transformer"
993
+
994
+ def __init__(self, config):
995
+ super().__init__(config)
996
+ self.transformer = AprielModel(config)
997
+ self.qa_outputs = nn.Linear(config.hidden_size, 2)
998
+
999
+ # Initialize weights and apply final processing
1000
+ self.post_init()
1001
+
1002
+ def get_input_embeddings(self):
1003
+ return self.transformer.embed_tokens
1004
+
1005
+ def set_input_embeddings(self, value):
1006
+ self.transformer.embed_tokens = value
1007
+
1008
+ @add_start_docstrings_to_model_forward(APRIEL_INPUTS_DOCSTRING)
1009
+ def forward(
1010
+ self,
1011
+ input_ids: Optional[torch.LongTensor] = None,
1012
+ attention_mask: Optional[torch.FloatTensor] = None,
1013
+ position_ids: Optional[torch.LongTensor] = None,
1014
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1015
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1016
+ start_positions: Optional[torch.LongTensor] = None,
1017
+ end_positions: Optional[torch.LongTensor] = None,
1018
+ output_attentions: Optional[bool] = None,
1019
+ output_hidden_states: Optional[bool] = None,
1020
+ return_dict: Optional[bool] = None,
1021
+ **kwargs,
1022
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1023
+ r"""
1024
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1025
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1026
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1027
+ are not taken into account for computing the loss.
1028
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1029
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1030
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1031
+ are not taken into account for computing the loss.
1032
+ """
1033
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1034
+
1035
+ outputs = self.transformer(
1036
+ input_ids,
1037
+ attention_mask=attention_mask,
1038
+ position_ids=position_ids,
1039
+ past_key_values=past_key_values,
1040
+ inputs_embeds=inputs_embeds,
1041
+ output_attentions=output_attentions,
1042
+ output_hidden_states=output_hidden_states,
1043
+ return_dict=return_dict,
1044
+ )
1045
+
1046
+ sequence_output = outputs[0]
1047
+
1048
+ logits = self.qa_outputs(sequence_output)
1049
+ start_logits, end_logits = logits.split(1, dim=-1)
1050
+ start_logits = start_logits.squeeze(-1).contiguous()
1051
+ end_logits = end_logits.squeeze(-1).contiguous()
1052
+
1053
+ loss = None
1054
+ if start_positions is not None and end_positions is not None:
1055
+ loss = self.loss_function(start_logits, end_logits, start_positions, end_positions, **kwargs)
1056
+
1057
+ if not return_dict:
1058
+ output = (start_logits, end_logits) + outputs[2:]
1059
+ return ((loss,) + output) if loss is not None else output
1060
+
1061
+ return QuestionAnsweringModelOutput(
1062
+ loss=loss,
1063
+ start_logits=start_logits,
1064
+ end_logits=end_logits,
1065
+ hidden_states=outputs.hidden_states,
1066
+ attentions=outputs.attentions,
1067
+ )
1068
+
1069
+
1070
+ @add_start_docstrings(
1071
+ """
1072
+ The Apriel Model transformer with a token classification head on top (a linear layer on top of the hidden-states
1073
+ output) e.g. for Named-Entity-Recognition (NER) tasks.
1074
+ """,
1075
+ APRIEL_START_DOCSTRING,
1076
+ )
1077
+ class AprielForTokenClassification(AprielPreTrainedModel):
1078
+ def __init__(self, config):
1079
+ super().__init__(config)
1080
+ self.num_labels = config.num_labels
1081
+ self.model = AprielModel(config)
1082
+ if getattr(config, "classifier_dropout", None) is not None:
1083
+ classifier_dropout = config.classifier_dropout
1084
+ elif getattr(config, "hidden_dropout", None) is not None:
1085
+ classifier_dropout = config.hidden_dropout
1086
+ else:
1087
+ classifier_dropout = 0.1
1088
+ self.dropout = nn.Dropout(classifier_dropout)
1089
+ self.score = nn.Linear(config.hidden_size, config.num_labels)
1090
+
1091
+ # Initialize weights and apply final processing
1092
+ self.post_init()
1093
+
1094
+ def get_input_embeddings(self):
1095
+ return self.model.embed_tokens
1096
+
1097
+ def set_input_embeddings(self, value):
1098
+ self.model.embed_tokens = value
1099
+
1100
+ @add_start_docstrings_to_model_forward(APRIEL_INPUTS_DOCSTRING)
1101
+ @add_code_sample_docstrings(
1102
+ checkpoint=_CHECKPOINT_FOR_DOC,
1103
+ output_type=TokenClassifierOutput,
1104
+ config_class=_CONFIG_FOR_DOC,
1105
+ )
1106
+ def forward(
1107
+ self,
1108
+ input_ids: Optional[torch.LongTensor] = None,
1109
+ attention_mask: Optional[torch.Tensor] = None,
1110
+ position_ids: Optional[torch.LongTensor] = None,
1111
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1112
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1113
+ labels: Optional[torch.LongTensor] = None,
1114
+ use_cache: Optional[bool] = None,
1115
+ output_attentions: Optional[bool] = None,
1116
+ output_hidden_states: Optional[bool] = None,
1117
+ return_dict: Optional[bool] = None,
1118
+ ) -> Union[Tuple, TokenClassifierOutput]:
1119
+ r"""
1120
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1121
+ Labels for computing the token classification loss, one label per input token. Indices should be in
1122
+ `[0, ..., config.num_labels - 1]`. A classification loss (Cross-Entropy) is computed over the labelled
1123
+ token positions.
1124
+ """
1125
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1126
+
1127
+ outputs = self.model(
1128
+ input_ids,
1129
+ attention_mask=attention_mask,
1130
+ position_ids=position_ids,
1131
+ past_key_values=past_key_values,
1132
+ inputs_embeds=inputs_embeds,
1133
+ use_cache=use_cache,
1134
+ output_attentions=output_attentions,
1135
+ output_hidden_states=output_hidden_states,
1136
+ return_dict=return_dict,
1137
+ )
1138
+ sequence_output = outputs[0]
1139
+ sequence_output = self.dropout(sequence_output)
1140
+ logits = self.score(sequence_output)
1141
+
1142
+ loss = None
1143
+ if labels is not None:
1144
+ loss = self.loss_function(logits, labels, self.config)
1145
+
1146
+ if not return_dict:
1147
+ output = (logits,) + outputs[2:]
1148
+ return ((loss,) + output) if loss is not None else output
1149
+
1150
+ return TokenClassifierOutput(
1151
+ loss=loss,
1152
+ logits=logits,
1153
+ hidden_states=outputs.hidden_states,
1154
+ attentions=outputs.attentions,
1155
+ )
1156
+
1157
+
1158
+ __all__ = [
1159
+ "AprielForCausalLM",
1160
+ "AprielModel",
1161
+ "AprielPreTrainedModel",
1162
+ "AprielForSequenceClassification",
1163
+ "AprielForQuestionAnswering",
1164
+ "AprielForTokenClassification",
1165
+ ]
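The `logits_to_keep` handling in `AprielForCausalLM.forward` above relies on `slice(-0, None)` being identical to `slice(0, None)`, which is why `0` means "keep every position". The following is a minimal plain-Python sketch of that selection step, not the model code itself: the helper `select_positions` and the string placeholders standing in for the sequence dimension of `hidden_states` are hypothetical.

```python
from typing import List, Sequence, Union


def select_positions(hidden: Sequence[str], logits_to_keep: Union[int, List[int]] = 0) -> List[str]:
    """Pick the sequence positions that would be passed on to the LM head.

    `hidden` is a stand-in for the sequence dimension of `hidden_states`.
    """
    if isinstance(logits_to_keep, int):
        # -0 == 0 in Python, so slice(-0, None) keeps the whole sequence.
        return list(hidden[slice(-logits_to_keep, None)])
    # Otherwise the argument is treated as explicit indices to keep.
    return [hidden[i] for i in logits_to_keep]


positions = ["h0", "h1", "h2", "h3"]
keep_all = select_positions(positions, 0)        # every position
keep_last = select_positions(positions, 1)       # only the final position
keep_some = select_positions(positions, [0, 2])  # explicit indices
```

During generation only the final position is needed, so a value of `1` avoids projecting the full sequence through the vocabulary-sized head.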
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
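Because `pad_token` and `eos_token` share the content `</s>` in this map, the "rightmost non-pad token" rule used by `AprielForSequenceClassification` would presumably treat a trailing `</s>` as padding. The sketch below illustrates that for a single row in plain Python. It assumes `config.pad_token_id` resolves to the same id as `</s>` (the config is not shown here), and the token ids `1`, `2`, `15`, `27` and the helper name are made up for illustration.

```python
from typing import List

PAD_ID = 2  # hypothetical id shared by "</s>" as both eos_token and pad_token


def last_non_pad_index(input_ids: List[int], pad_id: int = PAD_ID) -> int:
    """Return the rightmost position whose token differs from `pad_id`.

    Mirrors `(token_indices * non_pad_mask).argmax(-1)` for one row.
    """
    scores = [i * int(tok != pad_id) for i, tok in enumerate(input_ids)]
    # index(max(...)) returns the first maximum, like argmax.
    return scores.index(max(scores))


right_padded = [1, 15, 27, 2, 2]  # <s> a b </s> <pad>, with eos and pad sharing id 2
left_padded = [2, 2, 1, 15, 27]   # <pad> <pad> <s> a b
idx_right = last_non_pad_index(right_padded)
idx_left = last_non_pad_index(left_padded)
```

In the right-padded row the pooled position is the last content token rather than the `</s>` that follows it, which may matter if a classifier is expected to read the end-of-sequence position.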
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0240ce510f08e6c2041724e9043e33be9d251d1e4a4d94eb68cd47b954b61d2
3
+ size 17078292
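The three lines above are a Git LFS pointer, not the tokenizer itself: each line is a key followed by a space and a value. A minimal sketch of reading such a pointer follows. The helper `parse_lfs_pointer` is hypothetical and does no validation beyond splitting lines.

```python
from typing import Dict

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:b0240ce510f08e6c2041724e9043e33be9d251d1e4a4d94eb68cd47b954b61d2
size 17078292
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
hash_algo, _, digest = pointer["oid"].partition(":")
size_mib = int(pointer["size"]) / (1024 * 1024)  # size is given in bytes
```

The `size` field shows that the real `tokenizer.json` is roughly 16 MiB, which is why the `.gitattributes` rule routes it through LFS.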
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff