---
base_model:
- meta-llama/Llama-3.1-405B-Instruct
library_name: transformers
license: llama3.1
---

# This model has been xMADified!

This repository contains [`meta-llama/Meta-Llama-3.1-405B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) quantized from 16-bit floats to 4-bit integers using xMAD.ai proprietary technology.

# How to Run Model

Loading the checkpoint of this xMADified model requires less than 200 GiB of VRAM. It can therefore run on **1 node** of **8 x V100-32GB** GPUs or **3 x A100-80GB** GPUs.

**Fine-tuning**: Our 405B model can be fine-tuned on the same reduced (< 200 GiB) hardware in just 3 clicks. Watch our product demo [here](https://www.youtube.com/watch?v=S0wX32kT90s&list=TLGGL9fvmJ-d4xsxODEwMjAyNA).

**Package prerequisites**: Run the following commands to install the required packages.

```bash
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
```

**Sample Inference Code**

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-405B-Instruct-xMADai-4bit"

# Chat-style prompt: a system instruction followed by the user question.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Apply the model's chat template and move the resulting tensors to the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit checkpoint, letting device_map place layers across the available GPUs.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Sample up to 256 new tokens and print the decoded text.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

# Model Quality

We report the zero-shot accuracy of this xMADified model on popular benchmarks below. The results were obtained using [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness).

| Model | ARC Challenge | ARC Easy | LAMBADA OpenAI | LAMBADA Standard | MMLU Humanities | MMLU STEM | WinoGrande |
|---|---|---|---|---|---|---|---|
| xMADified Llama-3.1-405B-Instruct | 64.76 | 88.26 | 77.08 | 73.32 | 82.74 | 82.18 | 81.22 |

Other xMADified models and their GPU memory requirements are listed below.

For additional xMADified models, access to fine-tuning, and general questions, please contact us at support@xmad.ai and join our waiting list.
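
The "< 200 GiB" figure quoted under "How to Run Model" can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not xMAD.ai's own accounting. It assumes a nominal 405 billion parameters stored at exactly 4 bits each, and it treats the advertised GPU capacities (32 and 80) as GiB. Real quantized checkpoints also store scales and zero-points and may keep some layers at higher precision, and inference needs extra memory for activations and the KV cache, so actual usage will be higher than this lower bound. The helper name `weight_gib` is hypothetical.

```python
# Rough lower bound on weight memory for a 405B-parameter model.
# Assumption: every parameter takes exactly `bits_per_param` bits;
# quantization metadata, activations, and KV cache are ignored.

N_PARAMS = 405e9   # nominal parameter count taken from the model name
GIB = 1024 ** 3    # bytes per GiB


def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Return the raw weight size in GiB for the given bit width."""
    return n_params * bits_per_param / 8 / GIB


fp16_gib = weight_gib(N_PARAMS, 16)  # original 16-bit weights
int4_gib = weight_gib(N_PARAMS, 4)   # 4-bit quantized weights

# Aggregate VRAM of the two setups named in the model card
# (advertised capacities treated as GiB, which is an assumption).
setups = {
    "8 x V100-32GB": 8 * 32,
    "3 x A100-80GB": 3 * 80,
}

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
for name, total_gib in setups.items():
    headroom = total_gib - int4_gib
    print(f"{name}: {total_gib} GiB total, {headroom:.1f} GiB left after weights")
```

Under these assumptions the 4-bit weights come to roughly 189 GiB, about a quarter of the 16-bit size, which is consistent with the stated requirement and leaves some headroom on both hardware configurations.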