royleibov/Jamba-v0.1-ZipNN-Compressed

@@ -7,6 +7,24 @@ tags:
 - moe
 base_model: ai21labs/Jamba-v0.1
 ---
 # Model Card for Jamba
@@ -16,8 +34,6 @@ Jamba is the first production-scale Mamba implementation, which opens up interes
 This model card is for the base version of Jamba. It’s a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.
-This fork is compressed using **ZipNN**. To use the model, decompress the model tensors as described below and load the **local** weights.
 For full details of this model please read the [white paper](https://arxiv.org/abs/2403.19887) and the [release blog post](https://www.ai21.com/blog/announcing-jamba).
 ## Model Details
@@ -43,25 +59,17 @@ You also have to have the model on a CUDA device.
 You can run the model without the optimized Mamba kernels, but it is **not** recommended as it will result in significantly higher latency. To do so, specify `use_mamba_kernels=False` when loading the model.
-You need to [clone this repository](https://huggingface.co/royleibov/Jamba-v0.1-ZipNN-Compressed?clone=true) to decompress the model.
-Then:
-```bash
-cd Jamba-v0.1-ZipNN-Compressed
-```
 ### Run the model
-First decompress the model weights:
-```bash
-python3 zipnn_decompress_path.py --path .
-```
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL") # "." if in directory
-tokenizer = AutoTokenizer.from_pretrained("PATH_TO_MODEL") # "." if in directory
 input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]
@@ -81,15 +89,23 @@ Please note that if you're using `transformers<4.40.0`, `trust_remote_code=True`
 ```python
 from transformers import AutoModelForCausalLM
 import torch
-model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory,
                                              torch_dtype=torch.bfloat16)    # you can also use torch_dtype=torch.float16
 ```
 When using half precision, you can enable the [FlashAttention2](https://github.com/Dao-AILab/flash-attention) implementation of the Attention blocks. In order to use it, you also need the model on a CUDA device. Since in this precision the model is too big to fit on a single 80GB GPU, you'll also need to parallelize it using [accelerate](https://huggingface.co/docs/accelerate/index):
 ```python
 from transformers import AutoModelForCausalLM
 import torch
-model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
                                              torch_dtype=torch.bfloat16,
                                              attn_implementation="flash_attention_2",
                                              device_map="auto")
@@ -102,9 +118,13 @@ model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in direct
 ```python
 from transformers import AutoModelForCausalLM, BitsAndBytesConfig
 import torch
 quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                          llm_int8_skip_modules=["mamba"])
-model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
                                              torch_dtype=torch.bfloat16,
                                              attn_implementation="flash_attention_2",
                                              quantization_config=quantization_config)
@@ -120,9 +140,12 @@ from datasets import load_dataset
 from trl import SFTTrainer, SFTConfig
 from peft import LoraConfig
 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
-tokenizer = AutoTokenizer.from_pretrained("PATH_TO_MODEL") # "." if in directory
-model = AutoModelForCausalLM.from_pretrained("PATH_TO_MODEL", # "." if in directory
             device_map='auto', torch_dtype=torch.bfloat16)
 lora_config = LoraConfig(

 - moe
 base_model: ai21labs/Jamba-v0.1
 ---
+# Disclaimer and Requirements
+This model is a clone of [ai21labs/Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1) compressed using ZipNN. The model is losslessly compressed to 67% of its original size, saving ~35GB in storage and potentially ~1PB in data transfer **monthly**.
+## Requirement
+In order to use the model, ZipNN is necessary:
+```bash
+pip install zipnn
+```
+Then add the following at the beginning of your script:
+```python
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
+```
+And continue as usual. The patch will take care of decompressing the model correctly and safely.
 # Model Card for Jamba
 This model card is for the base version of Jamba. It’s a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.
 For full details of this model please read the [white paper](https://arxiv.org/abs/2403.19887) and the [release blog post](https://www.ai21.com/blog/announcing-jamba).
 ## Model Details
 You can run the model without the optimized Mamba kernels, but it is **not** recommended as it will result in significantly higher latency. To do so, specify `use_mamba_kernels=False` when loading the model.
 ### Run the model
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
+model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
+tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
 input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]
 ```python
 from transformers import AutoModelForCausalLM
 import torch
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
+model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
                                              torch_dtype=torch.bfloat16)    # you can also use torch_dtype=torch.float16
 ```
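As a rough sanity check on the half-precision footprint, the sketch below multiplies the 52B total parameter count quoted in this card by 2 bytes per parameter and compares the result with a single 80GB GPU. It also applies the 67% ZipNN ratio from the disclaimer to estimate the download size. The constant and function names are illustrative, and the calculation assumes every weight is stored as a 2-byte value, ignoring activations, caches, and framework overhead.

```python
# Back-of-the-envelope weight footprint for Jamba v0.1.
# Assumption: every parameter is stored as a 2-byte value (bfloat16/float16);
# activations, caches, and framework overhead are ignored.

TOTAL_PARAMS = 52e9      # total parameters across all experts (from this card)
BYTES_PER_PARAM = 2      # bfloat16 / float16
GPU_MEMORY_GB = 80       # single-GPU capacity quoted in this card
ZIPNN_RATIO = 0.67       # compressed size relative to the original (from this card)


def weights_gb(num_params, bytes_per_param):
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


half_precision_gb = weights_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
fits_on_one_gpu = half_precision_gb <= GPU_MEMORY_GB

# Compression only affects storage and transfer; it is lossless, so the
# decompressed weights occupy the same memory as the original ones.
compressed_gb = half_precision_gb * ZIPNN_RATIO
saved_gb = half_precision_gb - compressed_gb

print(f"half-precision weights: {half_precision_gb:.0f} GB")
print(f"fits on one {GPU_MEMORY_GB} GB GPU: {fits_on_one_gpu}")
print(f"compressed size: {compressed_gb:.1f} GB (saves {saved_gb:.1f} GB)")
```

Under these assumptions the weights alone come to about 104 GB, which exceeds 80 GB, and the estimated saving of roughly 34 GB is in line with the ~35GB figure in the disclaimer.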
 When using half precision, you can enable the [FlashAttention2](https://github.com/Dao-AILab/flash-attention) implementation of the Attention blocks. In order to use it, you also need the model on a CUDA device. Since in this precision the model is too big to fit on a single 80GB GPU, you'll also need to parallelize it using [accelerate](https://huggingface.co/docs/accelerate/index):
 ```python
 from transformers import AutoModelForCausalLM
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
 import torch
+model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
                                              torch_dtype=torch.bfloat16,
                                              attn_implementation="flash_attention_2",
                                              device_map="auto")
 ```python
 from transformers import AutoModelForCausalLM, BitsAndBytesConfig
 import torch
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
 quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                          llm_int8_skip_modules=["mamba"])
+model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
                                              torch_dtype=torch.bfloat16,
                                              attn_implementation="flash_attention_2",
                                              quantization_config=quantization_config)
 from trl import SFTTrainer, SFTConfig
 from peft import LoraConfig
 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
+from zipnn import zipnn_hf_patch
+zipnn_hf_patch()
+tokenizer = AutoTokenizer.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed")
+model = AutoModelForCausalLM.from_pretrained("royleibov/Jamba-v0.1-ZipNN-Compressed",
             device_map='auto', torch_dtype=torch.bfloat16)
 lora_config = LoraConfig(