Alright, let's make this DeepSeek LoRA fine-tuning script run in a conda virtual environment with a CUDA GPU.
I’ll walk you through:
Conda environment setup (with CUDA-enabled PyTorch)
Installing all dependencies
Running the fine-tuning script
1️⃣ Create and activate the conda environment
Open your terminal (or Anaconda Prompt on Windows):
conda create -n deepseek_lora python=3.10 -y
conda activate deepseek_lora
2️⃣ Install PyTorch with CUDA support
Check which CUDA version your GPU driver supports:
nvidia-smi
If it reports CUDA 12.1 or newer, run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
For CUDA 11.8:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
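Before moving on, it's worth a quick check that PyTorch can actually see the GPU (a minimal sketch; the device name printed will be whatever card you have):

import torch

# Should print True and the name of your GPU if the CUDA build installed correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))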
3️⃣ Install dependencies for LoRA fine-tuning
pip install transformers accelerate datasets peft bitsandbytes sentencepiece einops
✅ Why each is needed
transformers → Model + tokenizer
accelerate → Efficient multi-GPU & mixed-precision
datasets → Easy JSONL loading
peft → LoRA training adapters
bitsandbytes → 8-bit GPU loading (saves VRAM)
sentencepiece → Tokenization for models like Qwen/DeepSeek
einops → Model reshaping utilities
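Optional sanity check that everything imports cleanly (a minimal sketch; the versions printed on your machine will differ):

# Run inside the activated deepseek_lora environment
import transformers, datasets, accelerate, peft, bitsandbytes

print("transformers", transformers.__version__)
print("peft", peft.__version__)
print("bitsandbytes", bitsandbytes.__version__)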
4️⃣ Save the fine-tuning script
Save this as finetune_deepseek_lora.py in your project folder. (This is already CUDA-ready; it will auto-detect GPU via device_map="auto".)
import argparse

import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

# Command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--train_file", type=str, required=True)
parser.add_argument("--output_dir", type=str, default="lora_deepseek")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument("--micro_batch_size", type=int, default=1)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--max_length", type=int, default=1024)
args = parser.parse_args()

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model in 8-bit; device_map="auto" places it on the GPU automatically
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)
model = prepare_model_for_kbit_training(model)

# Apply LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

# Load dataset: one JSON object per line with "prompt" and "response" fields
dataset = load_dataset("json", data_files={"train": args.train_file})

def preprocess(example):
    prompt = example["prompt"].strip()
    response = example["response"].strip()
    # Append EOS so the model learns where a response ends
    text = prompt + "\n\n" + response + tokenizer.eos_token
    tokenized = tokenizer(text, truncation=True, max_length=args.max_length, padding="max_length")
    # Use the input IDs as labels, but mask padding positions out of the loss
    tokenized["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized

dataset = dataset["train"].map(preprocess, remove_columns=dataset["train"].column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir=args.output_dir,
    per_device_train_batch_size=args.micro_batch_size,
    gradient_accumulation_steps=max(1, args.batch_size // args.micro_batch_size),
    num_train_epochs=args.epochs,
    learning_rate=args.learning_rate,
    fp16=True,
    optim="paged_adamw_32bit",
    logging_steps=50,
    save_strategy="epoch",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()

# Save only the LoRA adapter weights and the tokenizer
model.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
5️⃣ Prepare your dataset
Create data/train.jsonl:
{"prompt":"What is Python?","response":"Python is a high-level programming language."}
{"prompt":"Explain recursion.","response":"Recursion is a function calling itself until a base condition is met."}
6️⃣ Run fine-tuning in CUDA environment
python finetune_deepseek_lora.py --train_file data/train.jsonl --output_dir lora_deepseek --epochs 3 --batch_size 4 --micro_batch_size 1
This will:
Use CUDA automatically
Load the DeepSeek model in 8-bit to save VRAM
Train only the LoRA adapters
Save the LoRA adapter and tokenizer in lora_deepseek/
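Once training finishes, you can load the adapter back on top of the base model for a quick smoke test (a minimal sketch, assuming the adapter and tokenizer were saved to lora_deepseek/ as above; add a BitsAndBytesConfig here too if full fp16 doesn't fit in your VRAM):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained("lora_deepseek", use_fast=False)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Attach the trained LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base_model, "lora_deepseek")
model.eval()

inputs = tokenizer("What is Python?\n\n", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))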
If you want, I can also add a CUDA memory-efficient accelerate config so you can run this even on an 8GB VRAM GPU without crashing. Do you want me to do that next?