Mistral-7B v0.1 — 8-bit (bitsandbytes)
This repository provides an 8-bit (bitsandbytes) quantized version of mistralai/Mistral-7B-v0.1
for efficient inference on consumer GPUs. The architecture and tokenizer are unchanged; only the weight format/loader is adapted for memory savings.
Note: This is not an instruction-tuned or fine-tuned model—just an 8-bit loading of the base model for inference.
Model Details
- Developed by: Mistral AI (base model); quantized and shared by the repo maintainer
- Model type: Decoder-only causal language model (7B parameters)
- Languages: primarily English; may generalize to other languages to a limited extent
- License: Apache-2.0 (inherits from base model)
- Fine-tuned from: None; this is the base model loaded in 8-bit (bnb.int8)
Model Sources
- Base model: https://huggingface.co/mistralai/Mistral-7B-v0.1
- This repo: https://huggingface.co/KavinduHansaka/mistral-7b-v0.1-8bit
What’s Included
- Config: config.json, generation_config.json
- Weights (sharded safetensors): model-00001-of-00002.safetensors, model-00002-of-00002.safetensors, model.safetensors.index.json
- Tokenizer files: tokenizer.model, tokenizer.json, tokenizer_config.json, special_tokens_map.json
Uses
Direct Use
- General text generation, drafting, exploration, and research.
- Running on limited-VRAM GPUs (≈8–12 GB) where FP16 is not feasible.
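Back-of-the-envelope (assuming the base model's ~7.24B parameters): FP16 weights alone take about 7.24B × 2 bytes ≈ 14.5 GB, while int8 halves that to ≈ 7.2 GB, before activations and KV cache.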
Downstream Use
- Use as a base for experimentation (prompting, lightweight adapters; see the sketch below) or as a drop-in 8-bit runtime in apps.
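For the adapter route, here is a minimal LoRA sketch over the 8-bit base. It assumes pip install peft and the model object from How to Get Started below; the rank, alpha, and target modules are illustrative, not tuned values from this repo:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads for k-bit training
lora_cfg = LoraConfig(
    r=8,  # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Mistral attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable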
Out-of-Scope Use
- Safety-critical decisions, disallowed content generation, or applications requiring guaranteed factual correctness.
- Tasks requiring instruction tuning or alignment (this is not an instruct model).
Bias, Risks, and Limitations
- Bias & Safety: Inherits biases and safety limitations of the base model. It may produce inaccurate, offensive, or harmful content.
- Quality vs FP16: 8-bit loading can introduce minor quality regression versus FP16 on some prompts and longer contexts.
- Long Context Behavior: Extremely long generations may degrade faster in quantized modes.
Recommendations
- Add guardrails (filtering, human-in-the-loop) for user-facing deployments.
- Evaluate on your domain tasks before production use (a quick perplexity probe is sketched under Suggested dependencies below).
- Consider an instruction-tuned variant if you need chat-style behavior out of the box.
How to Get Started
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

REPO_ID = "KavinduHansaka/mistral-7b-v0.1-8bit"

tok = AutoTokenizer.from_pretrained(REPO_ID, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # Mistral ships no pad token; reuse EOS
tok.padding_side = "left"  # left padding is needed for batched generation

model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # bitsandbytes int8
)

gen = pipeline("text-generation", model=model, tokenizer=tok)  # model is already dispatched; no device_map here
print(gen("Explain attention mechanisms simply.", max_new_tokens=200)[0]["generated_text"])
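With left padding set as above, the same objects handle batched generation; a short sketch (prompts and sampling settings are illustrative):
prompts = ["Define perplexity in one sentence.", "What does a tokenizer do?"]
batch = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=80, do_sample=True, temperature=0.7)
print(tok.batch_decode(out, skip_special_tokens=True))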
Suggested dependencies
pip install -U "transformers>=4.41" "accelerate>=0.28" "bitsandbytes>=0.43.3" sentencepiece
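To act on the evaluation recommendation above, a quick perplexity probe reusing tok and model from the snippet; sample_text is a placeholder for your own held-out text:
import torch
sample_text = "Replace this with held-out text from your domain."
enc = tok(sample_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # shifted cross-entropy over the sequence
print(f"perplexity ≈ {torch.exp(loss).item():.2f}")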
Training Details
- Training data / procedure: N/A (no additional training; this is the original Mistral-7B v0.1 loaded with 8-bit weights).
- Precision: Runtime bnb.int8 quantization for inference (not GPTQ/AWQ/QLoRA).
- Speeds & Sizes: Depends on GPU; typical 8–12 GB VRAM allows single-prompt generations with moderate lengths.
Technical Specifications
Architecture
- Decoder-only Transformer, ~7B parameters (see base model for specifics).
Compute
- Hardware: Single NVIDIA GPU recommended (≥8 GB VRAM).
- Software: Python, transformers, accelerate, bitsandbytes, sentencepiece, and a CUDA toolkit/driver matching your GPU.
Citation
If you use this model, please cite Mistral AI and the libraries you rely on:
@software{mistral_7b_v0_1,
  title  = {Mistral 7B v0.1},
  author = {Mistral AI},
  year   = {2023},
  url    = {https://huggingface.co/mistralai/Mistral-7B-v0.1}
}