---
license: apache-2.0
---

## Model Details
<img alt="Bamba Logo" src="https://cdn-uploads.huggingface.co/production/uploads/64b6c638ac6d20bae0b93219/GOzs8o4G1apceun92ZC4d.png" width="242px" style="margin-left:'auto' margin-right:'auto' display:'block'">

# Model Card for Bamba 9B
We introduce Bamba-9B, a decoder-only language model based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture, designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

| Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings |
|-------|--------|----------|-------------|-----------------|-----|----------|----------------|-----------------|
| Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | False |

The current release includes the following models:

| **Stage** | **Bamba 9B** | **Quantized** | **Note** |
|-----------|--------------|---------------|----------|
| **Base Model** | [ibm-fms/Bamba-9B](https://huggingface.co/ibm-fms/Bamba-9B) | [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8) | Stage 2 pretraining |
| **Base Model** | [ibm-fms/Bamba-9B-2T](https://huggingface.co/ibm-fms/Bamba-9B-2T) | [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8) | Stage 1 pretraining |
| **Base Model** | [ibm-fms/Bamba-9B-1.8T](https://huggingface.co/ibm-fms/Bamba-9B-1.8T) | [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8) | Intermediate checkpoints during Stage 1, more to come |
| **SFT** | coming soon | coming soon | to be released in the next drop |
| **DPO** | coming soon | coming soon | to be released in the next drop |

## Installation

Besides [PyTorch](https://pytorch.org/), you will need a few [extra dependencies](https://github.com/state-spaces/mamba?tab=readme-ov-file#installation) for Mamba models.

We found some of these dependencies to be picky about PyTorch versions when installing via pip, so if you hit dependency issues in your environment, the most reliable approach is to build all of the Mamba dependencies from source:
```bash
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d && pip install . && cd ..
git clone https://github.com/state-spaces/mamba.git
cd mamba && pip install . && cd ..
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install . && cd ..
```
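
Once the build completes, a quick import check is a simple way to confirm the compiled kernels are usable. This is a minimal sketch, assuming the standard import names of the upstream packages (`causal_conv1d`, `mamba_ssm`, `flash_attn`):
```python
# Minimal sanity check after building the Mamba dependencies from source.
# The module names below are the upstream packages' import names (an assumption,
# not something specified in this README).
import torch
import causal_conv1d
import mamba_ssm
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("causal_conv1d:", causal_conv1d.__version__)
print("mamba_ssm:", mamba_ssm.__version__)
print("flash_attn:", flash_attn.__version__)
```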

## Inference
You can utilize our newly contributed HF integration to run inference on our Bamba models:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")

message = ["Mamba is a snake with following properties "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```
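
For faster generation on a GPU, you can also load the model in half precision. The snippet below is a minimal sketch; the `torch_dtype` and `device_map` settings are illustrative choices (requiring `accelerate`), not requirements of the model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load weights in bfloat16 and let accelerate place them on available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "ibm-fms/Bamba-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")

message = ["Mamba is a snake with following properties "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False).to(model.device)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```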


## Training

We trained our Bamba model with FSDP using our training repo [here](https://github.com/foundation-model-stack/fms-fsdp/tree/mamba-new).
Note that this training effort was started before FSDP2 and also long before we contributed
`Mamba2-Hybrid` to HF, so we were doing FSDP1 training with the [official Mamba implementation](https://github.com/state-spaces/mamba).
Users trying to reproduce the training now have many more options with our newly
contributed [HF-version of Mamba2-Hybrid]() (TODO: add link once live).
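
For readers who just want to see the shape of such a run, here is a heavily simplified sketch of FSDP1-style wrapping under `torchrun`. It is illustrative only; the actual configuration (sharding policy, activation checkpointing, data loaders) lives in the fms-fsdp repo linked above.
```python
# Illustrative FSDP1 wrapping sketch (launch with torchrun); not the actual training script.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B", torch_dtype=torch.bfloat16).cuda()
model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# ... standard causal-LM training loop: forward, loss, backward, optimizer.step() ...
```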

## Benchmark scores

### Base pretrained models

| **Category** | **Benchmark** | **Setting** | **Metric** | **Bamba 9B (2.2T)** |
|--------------|---------------|-------------|------------|---------------------|
| General | MMLU | 5-shot | Accuracy | 60.77 |
| | ARC-C | 25-shot | Accuracy normalized | 63.23 |
| | GSM8K | 5-shot | exact match | 36.77 |
| | Hellaswag | 10-shot | Accuracy normalized | 81.8 |
| | OpenbookQA | 5-shot | Accuracy normalized | 47.6 |
| | Piqa | 5-shot | Accuracy normalized | 82.26 |
| | TruthfulQA | 0-shot | Accuracy | 49.21 |
| | Winogrande | 5-shot | Accuracy | 76.87 |
| HF LLM-V2 | MMLU-PRO | 5-shot | Accuracy | 17.53 |
| | BBH | 3-shot | Accuracy normalized | 17.4 |
| | GPQA | 0-shot | Accuracy normalized | 4.14 |
| | IFEval | 0-shot | inst_level_strict_acc + prompt_level_strict_acc | 15.16 |
| | MATH Lvl 5 | 4-shot | Exact match | 1.66 |
| | MuSR | 0-shot | Accuracy normalized | 9.59 |
| Safety Tasks | PopQA | 5-shot, generation | Accuracy | 20.5 |
| | Toxigen | 5-shot, logits | Accuracy | 57.4 |
| | BBQ | 5-shot, generation | Accuracy | 44.2 |
| | Crows-pairs_english | 5-shot, generation | pct_stereotype (lower is better) | 70.78 |


## Fine-tuning

This [example](https://github.com/foundation-model-stack/bamba/blob/main/tuning/Fine-tuning.md) shows how to fine-tune the Bamba model for a specific task using [SFT Trainer](https://huggingface.co/docs/trl/en/sft_trainer#supervised-fine-tuning-trainer).
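
For orientation, the snippet below is a minimal sketch of what supervised fine-tuning with TRL's `SFTTrainer` can look like; the dataset and hyperparameters are illustrative placeholders, and the linked example remains the reference recipe.
```python
# Minimal SFT sketch with TRL; dataset and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative instruction-tuning dataset; substitute your own task data
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="ibm-fms/Bamba-9B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="bamba-9b-sft", per_device_train_batch_size=1),
)
trainer.train()
```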


## Quantization
We can create an FP8 quantized model using [`fms-model-optimizer`](https://github.com/foundation-model-stack/fms-model-optimizer/), which makes storage and inference even more efficient.
```bash
python -m fms_mo.run_quant \
    --model_name_or_path <"path_to_original_model"> \
    --quant_method fp8 \
    --torch_dtype bfloat16 \
    --output_dir <"path_to_save_new_model">
```
Model size comparison before and after FP8:

| | original | quantized |
|:----:|----:|----:|
| memory (total) | 39.12 GB | 10.83 GB |
| memory (breakdown) | `torch.float32` 39.12 GB | `torch.bfloat16` 2.10 GB<br>`torch.float8_e4m3fn` 8.73 GB |

More details about `fms-model-optimizer` can be found [here](https://github.com/foundation-model-stack/fms-model-optimizer/tree/main/examples/FP8_QUANT#quickstart).

## Evaluation


## Llama.cpp
There is preliminary work to enable running Bamba architecture models using [llama.cpp](https://github.com/ggerganov/llama.cpp). This is a work in progress, so it should only be used as a guide for the adventurous!

### Known Limitations

* Currently, inference is only supported on CPUs
* Models quantized with `llama-quantize` exhibit poor performance

### Setup
To enable Bamba support, you'll need to build from source using [Gabe's fork](https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitecture).

```sh
git clone --branch BambaArchitecture git@github.com:gabe-l-hart/llama.cpp.git
cd llama.cpp
mkdir build
cd build
# NOTE: To build with debug symbols and extra logging, use CMAKE_BUILD_TYPE=Debug
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
```

### Conversion to GGUF
You can use a pre-converted GGUF file from Huggingface (e.g. [bamba-9b.gguf](https://huggingface.co/ibm-fms/Bamba-9B/blob/main/bamba-9b.gguf)). If one doesn't exist, you can use the [convert_hf_to_gguf.py](https://github.com/gabe-l-hart/llama.cpp/blob/BambaArchitecture/convert_hf_to_gguf.py) script from Gabe's fork to perform the conversion manually.

```sh
# Install the python dependencies
cd /path/to/llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt

# Perform the conversion
./convert_hf_to_gguf.py /path/to/bamba-model --outfile /path/to/bamba-model/bamba-model.gguf
```

### Run with llama-cli

```sh
# Run the model with no layers on the GPU (CPU-only)
cd /path/to/llama.cpp
./build/bin/llama-cli -ngl 0 -m /path/to/bamba-model/bamba-model.gguf -p "Tell me a story about a developer and their dog"
```

### Quantization with llama-quantize
You can (optionally) quantize the GGUF model using `llama.cpp`'s built-in quantization tool `llama-quantize`.

```sh
# Run the quantization (see llama-quantize --help for all quant types)
cd /path/to/llama.cpp
./build/bin/llama-quantize /path/to/bamba-model/bamba-model.gguf Q4_K_M
```

## Contributors

* **Data collection and curation**: We acknowledge and thank the AllenAI team for making the high-quality open-source dataset Dolma available, as well as the Hugging Face data team for making FineWeb-edu and Cosmopedia available. These are tremendous contributions that enabled us to create this model.
* **Data preprocessing**: We thank IBM's internal data preprocessing team, specifically Tuan Hoang Trong, Syed Zawad, Jay Gala, and Ryan Gordon, for helping tokenize the data at scale. The code for tokenization is available [here](https://github.com/IBM/data-prep-kit).
* **Model architecture**: The model architecture design was jointly done by Princeton, CMU, IBM, and UIUC and involved the following folks: Tri Dao (Princeton), Albert Gu (CMU), Linsong Chu (IBM), Davis Wertheimer (IBM), Minjia Zhang (UIUC), Mudhakar Srivatsa (IBM), and Raghu Ganti (IBM).
* **Model training**: Model training was performed primarily by the IBM team using the Mamba2 kernels and layer implementation from Tri Dao and Albert Gu. The following folks from IBM were primarily involved: Linsong Chu, Divya Kumari, Davis Wertheimer, Raghu Ganti, and Dakshi Agrawal.
* **Model tuning**: Tuning of the model was enabled and verified in [TRL](https://github.com/huggingface/trl) by the IBM team, involving Sukriti Sharma and Anh Uong.
* **Model inference**: Model inference in `transformers`, `vLLM`, and `llama.cpp` builds on the kernels written by Princeton and CMU. The IBM team is working with the community to enable it in various ecosystems; the team includes Fabian Lim, Antoni viros i Martin, Adnan Hoque, Jamie Yang, Nelson Nimura Gomez, Joshua Rosenkranz, Nick Hill, and Gabe Goodhart.
* **Quantization**: Quantization is led by the IBM team - Naigang Wang and Charlie Liu.
* **Evaluations**: Evaluations are led by a team within IBM, with long-context evaluations performed by UIUC, involving the following folks: Yotam Perlitz, Ofir Arviv, Michal Shmueli-Scheuer (IBM), Haoechen Shen, and Minjia Zhang (UIUC).

Finally, we would like to thank our leadership for their support in this effort - Priya Nagpurkar, David Cox, Sriram Raghavan, Aya Soffer, and Mukesh Khare.

We would also like to thank the community, in particular Pablo Montalvo-Leroux and Vaibhav Srivastav from Hugging Face, who provided valuable feedback on this blog and on the PRs into transformers. Further, we would like to thank Tyler Michael Smith from Neural Magic, who is shepherding the integration with vLLM.

A huge shoutout to the Meta PyTorch, AllenAI, and Hugging Face teams for their contributions to the open initiative: FSDP allowed us to train this model smoothly, and the data from Dolma and FineWeb/Cosmopedia made this model possible!