# Model Card for Kikinoking/MNLP_M3_quantized_model
A 4-bit double-quantized (NF4 + nested quant) version of the MNLP_M3_mcqa_model, compressed with bitsandbytes. This model answers multiple-choice questions (MCQA) with minimal GPU memory usage.
## Model Details
- Model ID: `Kikinoking/MNLP_M3_quantized_model`
- Quantization: 4-bit NF4 + nested quantization (`bnb_4bit_use_double_quant=True`)
- Base model: `aidasvenc/MNLP_M3_mcqa_model`
- Library: Transformers + bitsandbytes
- Task: Multiple-choice question answering (MCQA)
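The quantization settings listed above are serialized into the checkpoint's `config.json` when the model is pushed to the Hub. A minimal sketch for inspecting them without downloading the weights (depending on the `transformers` version, `quantization_config` may be a plain dict or a config object):

```python
from transformers import AutoConfig

# Read the serialized bitsandbytes settings from the hosted config.json.
cfg = AutoConfig.from_pretrained("Kikinoking/MNLP_M3_quantized_model")
print(cfg.quantization_config)  # expect load_in_4bit=True, quant_type "nf4", double quant enabled
```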
## Usage
Load and run inference in just a few lines:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Kikinoking/MNLP_M3_quantized_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
).eval()

prompt = "What is the capital of France?\nA) Lyon B) Marseille C) Paris D) Toulouse\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=1)

print("Answer:", tokenizer.decode(output[0], skip_special_tokens=True))
```
## How It Was Built
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

base_id = "aidasvenc/MNLP_M3_mcqa_model"

qcfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=qcfg,
    device_map="auto",
    torch_dtype="auto",
)

# Push to Hugging Face Hub
model.push_to_hub("Kikinoking/MNLP_M3_quantized_model", private=True)
tokenizer.push_to_hub("Kikinoking/MNLP_M3_quantized_model")

print("VRAM used (MiB):", torch.cuda.memory_reserved() / 1024**2)
```