royleibov committed · verified
Commit 6640b34 · 1 Parent(s): fc0c277

Update README to emphasize it's a clone and correctly use ZipNN

Files changed (1)
  1. README.md +43 -20
README.md CHANGED
@@ -7,6 +7,24 @@ tags:
  - moe
  base_model: ai21labs/Jamba-v0.1
  ---
+ # Disclaimer and Requirements
+
+ This model is a clone of [ai21labs/Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1) compressed using ZipNN. Compressed losslessly to 67% of its original size, ZipNN saves ~35GB of storage and potentially ~1PB of data transfer **monthly**.
+
+ ## Requirement
+
+ In order to use the model, ZipNN is necessary:
+ ```bash
+ pip install zipnn
+ ```
+
+ Then simply add the following at the beginning of your file:
+ ```python
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()
+ ```
+ And continue as usual. The patch will take care of decompressing the model correctly and safely.

  # Model Card for Jamba

@@ -16,8 +34,6 @@ Jamba is the first production-scale Mamba implementation, which opens up interes

  This model card is for the base version of Jamba. It’s a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.

- This fork is compressed using **ZipNN**. To use the model, decompress the model tensors as discribed below and load the **local** weights.
-
  For full details of this model please read the [white paper](https://arxiv.org/abs/2403.19887) and the [release blog post](https://www.ai21.com/blog/announcing-jamba).

  ## Model Details
@@ -43,25 +59,17 @@ You also have to have the model on a CUDA device.

  You can run the model without the optimized Mamba kernels, but it is **not** recommended as it will result in significantly higher latencies. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model.

- You need to [clone this repository](https://huggingface.co/royleibov/Jamba-v0.1-ZipNN-Compressed?clone=true) to decompress the model.
-
- Then:
- ```bash
- cd Jamba-v0.1-ZipNN-Compressed
- ```
-

  ### Run the model
- First decompress the model weights:
- ```bash
- python3 zipnn_decompress_path.py --path .
- ```

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()

- model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL") # "." if in directory
- tokenizer = AutoTokenizer.from_pretrained("PATH_TO_MODEL") # "." if in directory
+ model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
+ tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")

  input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]

@@ -81,15 +89,23 @@ Please note that if you're using `transformers<4.40.0`, `trust_remote_code=True`
  ```python
  from transformers import AutoModelForCausalLM
  import torch
- model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory,
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()
+
+ model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
  torch_dtype=torch.bfloat16) # you can also use torch_dtype=torch.float16
  ```

  When using half precision, you can enable the [FlashAttention2](https://github.com/Dao-AILab/flash-attention) implementation of the Attention blocks. In order to use it, you also need the model on a CUDA device. Since in this precision the model is too big to fit on a single 80GB GPU, you'll also need to parallelize it using [accelerate](https://huggingface.co/docs/accelerate/index):
  ```python
  from transformers import AutoModelForCausalLM
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()
+
  import torch
- model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
+ model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
  torch_dtype=torch.bfloat16,
  attn_implementation="flash_attention_2",
  device_map="auto")
@@ -102,9 +118,13 @@ model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in direct

  ```python
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()
+
  quantization_config = BitsAndBytesConfig(load_in_8bit=True,
  llm_int8_skip_modules=["mamba"])
- model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
+ model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
  torch_dtype=torch.bfloat16,
  attn_implementation="flash_attention_2",
  quantization_config=quantization_config)
@@ -120,9 +140,12 @@ from datasets import load_dataset
  from trl import SFTTrainer, SFTConfig
  from peft import LoraConfig
  from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
+ from zipnn import zipnn_hf_patch
+
+ zipnn_hf_patch()

- tokenizer = AutoTokenizer.from_pretrained("PATH_TO_MODEL") # "." if in directory
- model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
+ tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
+ model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
  device_map='auto', torch_dtype=torch.bfloat16)

  lora_config = LoraConfig(
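
For reference, the workflow described by the added "Disclaimer and Requirements" section condenses to the sketch below. The repository id and the `zipnn_hf_patch` call come straight from the diff; the `generate`/`batch_decode` lines are a plausible continuation of the truncated example (the hunk stops at `input_ids`), with an arbitrary `max_new_tokens`.

```python
# Minimal end-to-end sketch of the workflow added in this commit.
# Assumes `pip install zipnn`; generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from zipnn import zipnn_hf_patch

zipnn_hf_patch()  # transparently decompresses the ZipNN-compressed weights on load

model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")

input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors="pt").to(model.device)["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
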
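The changed README also keeps the upstream note about `use_mamba_kernels=False`. A minimal sketch of what that looks like with this repository, assuming the same ZipNN patch (not recommended for real use, since it skips the optimized kernels):

```python
# Sketch: loading without the optimized Mamba kernels (slower; not recommended).
from transformers import AutoModelForCausalLM
from zipnn import zipnn_hf_patch

zipnn_hf_patch()

model = AutoModelForCausalLM.from_pretrained(
    "royleibov/Jamba-v0.1-ZipNN-Compressed",
    use_mamba_kernels=False,  # forwarded to the model config; uses the plain PyTorch path
)
```
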
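For the half-precision, multi-GPU path, it can be worth confirming what accelerate actually did. This is an optional check, not part of the diff; `model.dtype` and `model.hf_device_map` are standard transformers attributes:

```python
# Optional check (not from the diff): confirm dtype and device placement
# after loading with device_map="auto".
import torch
from transformers import AutoModelForCausalLM
from zipnn import zipnn_hf_patch

zipnn_hf_patch()

model = AutoModelForCausalLM.from_pretrained(
    "royleibov/Jamba-v0.1-ZipNN-Compressed",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

print(model.dtype)          # expected: torch.bfloat16
print(model.hf_device_map)  # module -> device mapping chosen by accelerate
```
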
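For the 8-bit path, one quick sanity check that quantization took effect is the model's reported memory footprint. The `get_memory_footprint()` call is a standard transformers helper, not something shown in the diff:

```python
# Sketch: 8-bit load (skipping the Mamba blocks, as in the README) plus a memory check.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from zipnn import zipnn_hf_patch

zipnn_hf_patch()

quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained(
    "royleibov/Jamba-v0.1-ZipNN-Compressed",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

print(f"Approximate weight footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```
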
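The fine-tuning hunk ends mid-statement at `lora_config = LoraConfig(`. The sketch below is one plausible way the example continues; the LoRA hyperparameters, target modules, dataset, and training settings are illustrative placeholders rather than values taken from the commit, and trl/peft argument names can differ between versions:

```python
# Sketch: plausible completion of the truncated PEFT fine-tuning example.
# Hyperparameters, target_modules, and the dataset are placeholders, not from the commit.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
from zipnn import zipnn_hf_patch

zipnn_hf_patch()

tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
                                             device_map="auto", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
    bias="none",
)

dataset = load_dataset("Abirate/english_quotes", split="train")  # small example dataset

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    dataset_text_field="quote",  # column holding the raw text
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    peft_config=lora_config,
    train_dataset=dataset,
    # depending on your trl version, also pass tokenizer=tokenizer or processing_class=tokenizer
)
trainer.train()
```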