Satwik11 committed
Commit ca0d26a · verified · 1 Parent(s): 8f0dff4

Create README.md

Files changed (1): README.md added (+90, -0)
---
license: llama3.3
tags:
- llama
- text-generation
- causal-lm
- instruct
- quantization
- gptq
- 4-bit
- autoregressive
base_model:
- meta-llama/Llama-3.3-70B-Instruct
library_name: transformers
---

# Llama 3.3 70B Instruct (AutoRound GPTQ 4-bit)

This repository provides a 4-bit quantized version of the **Llama 3.3 70B Instruct** model, produced with the [AutoRound](https://github.com/intel/auto-round) method and exported in GPTQ format. Quantization yields a significantly smaller model footprint with negligible degradation in performance, as measured by zero-shot MMLU evaluation.

## Model Description

- **Base Model:** [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- **Quantization:** 4-bit GPTQ with AutoRound
- **Group Size:** 128
- **Symmetry:** Enabled (`sym=True`)

This quantized model aims to preserve the capabilities and accuracy of the original Llama 3.3 70B Instruct model while drastically reducing the model size and computational overhead. By converting weights into a 4-bit representation with carefully selected quantization parameters, the model maintains near-original performance levels on challenging benchmarks.
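
For reference, a run with these settings can look roughly like the sketch below. It assumes the `auto-round` package is installed; the calibration defaults, output directory, and export call are illustrative assumptions, not the exact script used to produce this repository.

```python
# Hypothetical AutoRound 4-bit GPTQ export sketch (not the exact script used
# for this repo). Assumes `pip install auto-round` and enough memory to hold
# the FP16 base model during calibration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# bits=4, group_size=128 and sym=True mirror the settings listed above.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in GPTQ format so the checkpoint loads through the transformers GPTQ path.
autoround.save_quantized("Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit", format="auto_gptq")
```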

## Performance and Results

### MMLU Zero-Shot Performance

- **Original Model (FP16):** ~81.82%
- **4-bit Quantized Model:** ~81.93%

As shown above, the 4-bit quantized model achieved an MMLU zero-shot accuracy of **81.93%**, which is effectively on par with the original FP16 model's **81.82%**. Thus, the quantization process did not cause performance degradation based on this evaluation metric.
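
These numbers can in principle be checked with a zero-shot MMLU run. Below is a minimal sketch using the `lm-evaluation-harness` Python API; the harness version, batch size, and dtype used for the figures above are not stated in this card, so treat those choices as assumptions.

```python
# Illustrative zero-shot MMLU evaluation with lm-evaluation-harness
# (pip install lm-eval). Batch size and dtype here are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit,dtype=float16",
    tasks=["mmlu"],
    num_fewshot=0,
    batch_size=4,
)
print(results["results"]["mmlu"])  # aggregated MMLU metrics
```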

### Model Size Reduction

- **Original FP16 Size:** ~141.06 GB
- **4-bit Quantized Size:** ~39.77 GB

The quantized model is approximately **3.5x smaller** than the original. This reduction significantly lowers storage requirements and can enable faster inference on more modest hardware.
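
As a quick sanity check on these figures, a back-of-the-envelope calculation is sketched below. The ~70.55B parameter count is an assumption based on the Llama 3.x 70B architecture, and real checkpoints also include metadata and some unquantized tensors.

```python
# Rough size arithmetic; the 70.55e9 parameter count is an assumption.
n_params = 70.55e9

fp16_gb = n_params * 2 / 1e9                      # 2 bytes per weight -> ~141.1 GB
quant_gb = 39.77                                  # reported 4-bit checkpoint size
bits_per_weight = quant_gb * 1e9 * 8 / n_params   # ~4.5 bits, incl. scales/zeros

print(f"FP16 ~{fp16_gb:.1f} GB, 4-bit ~{quant_gb} GB, "
      f"{fp16_gb / quant_gb:.2f}x smaller, ~{bits_per_weight:.2f} bits/weight")
```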

## Intended Use

**Primary Use Cases:**
- Instruction following and content generation.
- Conversational AI interfaces, virtual assistants, and chatbots.
- Research and experimentation on large language models with reduced resource requirements.

**Out-of-Scope Use Cases:**
- High-stakes decision-making without human review.
- Scenarios requiring guaranteed factual correctness (e.g., medical or legal advice).
- Generation of malicious or harmful content.

## Limitations and Biases

Like the original Llama models, this quantized variant may exhibit:
- **Hallucinations:** The model can produce factually incorrect or nonsensical outputs.
- **Biases:** The model may reflect cultural, social, or other biases present in its training data.

Users should ensure proper oversight and consider the model's responses critically. It is not suitable for authoritative or mission-critical applications without additional safeguards.

## How to Use

You can load the model using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Satwik11/Llama-3.3-70B-Instruct-AutoRound-GPTQ-4bit"

# Loading this GPTQ checkpoint through transformers requires a GPTQ backend
# (e.g. optimum with auto-gptq or gptqmodel) to be installed.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # or torch.bfloat16 if supported
    device_map="auto",
)

prompt = "Explain the concept of gravity to a 10-year-old."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens bounds the generated continuation rather than the total length.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
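
Because this is an instruct-tuned chat model, prompts generally work better when formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch reusing the `model` and `tokenizer` objects from above (the message contents are just examples):

```python
# Chat-template usage sketch; the system and user messages are illustrative.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the concept of gravity to a 10-year-old."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt portion.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```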