---
base_model: unsloth/llama-3.2-1b-instruct-bnb-4bit
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - gguf
license: apache-2.0
language:
  - en
---

# Uploaded model

## Model description

This model is a refinement of a LoRA adapter originally trained on the unsloth/Llama-3.2-3B-Instruct model with the FineTome-100k dataset. This version uses a smaller base model (1B vs. 3B parameters) to achieve faster training and better adaptability to specific tasks, such as medical applications.

Key adjustments:

  1. Reduced parameter count: a 1B-parameter base model was used in place of the 3B variant to improve training efficiency and make customization easier.
  2. Adjusted learning rate: a smaller learning rate was used to prevent overfitting and mitigate catastrophic forgetting, so the model retains its general pretraining knowledge while learning the new task effectively.

The finetuning dataset, ruslanmv/ai-medical-chatbot, contains only 257k rows, which necessitated careful hyperparameter tuning to avoid over-specialization.
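
The adapter setup can be sketched roughly as follows, assuming Unsloth's `FastLanguageModel` API; the LoRA rank, alpha, and sequence length below are illustrative assumptions rather than settings recorded in this card:

```python
# Sketch only: load the 1B base model and attach a LoRA adapter with Unsloth.
# LoRA rank/alpha and max_seq_length are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-instruct-bnb-4bit",  # base model from the card metadata
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights keep training memory low
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank (assumed)
    lora_alpha=16,       # LoRA scaling factor (assumed)
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)
```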


## Hyperparameters and explanations

- Learning rate: 2e-5
  A smaller learning rate reduces the risk of overfitting and catastrophic forgetting, which matters especially for a model with fewer parameters.

- Warm-up steps: 5
  A short warm-up lets the optimizer gather gradient statistics before training at the full learning rate, improving stability.

- Per-device train batch size: 2
  Each GPU processes 2 training samples per step, which suits resource-constrained environments.

- Gradient accumulation steps: 4
  Gradients are accumulated over 4 steps to simulate a larger batch size (effective batch size: 2 × 4 = 8) without exceeding memory limits.

- Optimizer: AdamW with 8-bit quantization
  - AdamW: Adam with decoupled weight decay, which helps prevent overfitting.
  - 8-bit quantization: stores optimizer states in 8 bits, reducing memory usage and speeding up training.

- Weight decay: 0.01
  A standard weight decay value that works well across a wide range of training scenarios.

- Learning rate scheduler type: linear
  Gradually decreases the learning rate from its initial value to zero over the course of training. The configuration sketch below shows how these settings fit together.
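
A minimal sketch of how the settings above map onto a TRL `SFTTrainer` run (the dataset preprocessing step, `max_seq_length`, `num_train_epochs`, and `output_dir` are assumptions; newer TRL versions move some of these arguments into `SFTConfig`):

```python
# Sketch only: wiring the hyperparameters listed above into a TRL SFTTrainer run.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# The raw Patient/Doctor rows must first be formatted into a single "text" field,
# e.g. with the tokenizer's chat template (formatting step omitted here).
dataset = load_dataset("ruslanmv/ai-medical-chatbot", split="train")

trainer = SFTTrainer(
    model=model,                    # the LoRA-wrapped model from the earlier sketch
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,   # 2 samples per GPU per step
        gradient_accumulation_steps=4,   # effective batch size: 2 * 4 = 8
        warmup_steps=5,
        learning_rate=2e-5,
        weight_decay=0.01,
        lr_scheduler_type="linear",
        optim="adamw_8bit",              # AdamW with 8-bit optimizer states
        num_train_epochs=1,              # assumption; not stated in the card
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```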


## Quantization details

The model is saved in 16-bit GGUF format, which:

- Preserves the full precision of the fine-tuned weights, so no accuracy is lost to post-training quantization.
- Trades inference speed and memory footprint for that higher precision compared with smaller quantized variants (e.g. 4-bit); see the export sketch below.
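
One way to produce and consume such a file is sketched here, assuming Unsloth's `save_pretrained_gguf` helper and llama-cpp-python for loading; the directory and file names are illustrative, not the exact artifact names in this repository:

```python
# Sketch only: export the fine-tuned model to 16-bit GGUF, then load it locally.
# Directory and file names are illustrative assumptions.
from llama_cpp import Llama

# "model" and "tokenizer" come from the training sketch above.
model.save_pretrained_gguf("medical_model", tokenizer, quantization_method="f16")

llm = Llama(model_path="medical_model/unsloth.F16.gguf", n_ctx=2048)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are common symptoms of iron-deficiency anemia?"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```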

## Training optimization

Training was accelerated by 2x using Unsloth in combination with Hugging Face's TRL library.