---
base_model: unsloth/llama-3.2-1b-instruct-bnb-4bit
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - gguf
license: apache-2.0
language:
  - en
---

# Uploaded model
- Developed by: forestav
- License: apache-2.0
- Finetuned from model: unsloth/llama-3.2-1b-instruct-bnb-4bit
## Model description
This model is a refinement of an earlier LoRA adapter trained on unsloth/Llama-3.2-3B-Instruct with the FineTome-100k dataset. This version is finetuned from the smaller unsloth/llama-3.2-1b-instruct-bnb-4bit base (1B vs. 3B parameters) to achieve faster training and easier adaptation to specific tasks, such as medical applications.
Key adjustments:
- Reduced Parameter Count: The model was downsized to 1B parameters to improve training efficiency and ease customization.
- Adjusted Learning Rate: A smaller learning rate was used to prevent overfitting and mitigate catastrophic forgetting. This ensures the model retains its general pretraining knowledge while learning new tasks effectively.
The finetuning dataset, ruslanmv/ai-medical-chatbot, contains only 257k rows, which necessitated careful hyperparameter tuning to avoid over-specialization.
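The snippet below is a minimal sketch of how such a LoRA adapter is typically attached to the 4-bit 1B base with Unsloth; the LoRA rank, alpha, dropout, target modules, and sequence length are illustrative assumptions, not settings documented in this card.

```python
# Minimal sketch: attach a LoRA adapter to the 4-bit 1B base with Unsloth.
# LoRA rank/alpha, target modules, and max_seq_length are assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-instruct-bnb-4bit",
    max_seq_length=2048,   # assumed context window for training
    load_in_4bit=True,     # keep base weights in 4-bit (bitsandbytes)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # assumed LoRA rank
    lora_alpha=16,         # assumed LoRA scaling factor
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```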
## Hyperparameters and explanations
- Learning rate: 2e-5
  A smaller learning rate reduces the risk of overfitting and catastrophic forgetting, particularly when working with models containing fewer parameters.
- Warm-up steps: 5
  Warm-up allows the optimizer to gather gradient statistics before training at the full learning rate, improving stability.
- Per-device train batch size: 2
  Each GPU processes 2 training samples per step. This setup is suitable for resource-constrained environments.
- Gradient accumulation steps: 4
  Gradients are accumulated over 4 steps to simulate a larger batch size (effective batch size: 2 × 4 = 8) without exceeding memory limits.
- Optimizer: AdamW with 8-bit quantization
  - AdamW: adds weight decay to prevent overfitting.
  - 8-bit quantization: reduces memory usage by compressing optimizer states, facilitating faster training.
- Weight decay: 0.01
  A standard weight decay value that is effective across various training scenarios.
- Learning rate scheduler type: linear
  Gradually decreases the learning rate from its initial value to zero over the course of training.
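Collected in one place, the values above map onto TRL's `SFTConfig` (a `TrainingArguments` subclass) roughly as in the sketch below; `output_dir` and any fields not listed in this card are placeholders.

```python
# Sketch of the training arguments implied by the list above.
# Fields not mentioned in the card (output_dir, max_steps, seed, ...) are placeholders.
from trl import SFTConfig

training_args = SFTConfig(
    learning_rate=2e-5,             # small LR to limit overfitting / forgetting
    warmup_steps=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
    optim="adamw_8bit",             # AdamW with 8-bit optimizer states
    weight_decay=0.01,
    lr_scheduler_type="linear",     # decay linearly to zero over training
    output_dir="outputs",           # placeholder
)
```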
## Quantization details
The model is saved in 16-bit GGUF format, which:
- Keeps the finetuned weights at full 16-bit precision, so no accuracy is lost to further quantization.
- Trades speed and memory for improved precision compared to lower-bit GGUF quantizations.
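As a sketch, a 16-bit GGUF export of this kind is typically produced with Unsloth's GGUF helper; the output directory name below is a placeholder.

```python
# Export the finetuned model to 16-bit (f16) GGUF via Unsloth's helper.
# "model_gguf" is a placeholder output directory.
model.save_pretrained_gguf(
    "model_gguf",
    tokenizer,
    quantization_method="f16",  # 16-bit GGUF: full precision, larger files than int quants
)
```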
## Training optimization
Training was accelerated by 2x using Unsloth in combination with Hugging Face's TRL library.
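A rough sketch of how the pieces above fit together with TRL's `SFTTrainer` follows; the dataset field names ("Patient", "Doctor") and the prompt template are assumptions about ruslanmv/ai-medical-chatbot's schema, since the card does not document the formatting used.

```python
# Sketch: wire the Unsloth model and SFTConfig above into TRL's SFTTrainer.
# The "Patient"/"Doctor" fields and prompt template are assumptions.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("ruslanmv/ai-medical-chatbot", split="train")

def to_text(example):
    # Hypothetical prompt format; recent TRL versions pick up a "text" column by default.
    # Depending on your TRL version, you may need dataset_text_field="text"
    # (on SFTTrainer in older releases, on SFTConfig in newer ones).
    return {"text": f"Patient: {example['Patient']}\nDoctor: {example['Doctor']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,               # LoRA model from the loading sketch above
    tokenizer=tokenizer,       # newer TRL versions use processing_class= instead
    train_dataset=dataset,
    args=training_args,        # SFTConfig from the hyperparameter sketch
)
trainer.train()
```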
