Train a Llama model from scratch
In this tutorial, we'll walk through the process of training a language model using the Llama model architecture and the Transformers library.
1. Installing the Required Libraries
We'll start by installing the necessary libraries using pip:
!pip install -q datasets accelerate evaluate trl transformers jinja2
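Some of these libraries (trl in particular) change their APIs between releases, so it can help to note which versions you ended up with. Here's a minimal sketch using only the standard library; the package list just mirrors the install command, and the helper name installed_versions is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version

# Same packages as the pip command in this step.
PACKAGES = ["datasets", "accelerate", "evaluate", "trl", "transformers", "jinja2"]

def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

If something later in the tutorial complains about an unexpected keyword argument, this list is the first thing to check.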
2. Logging into Hugging Face Hub
Next, we'll log into the Hugging Face Hub so we can load gated or private datasets and push the trained model at the end:
from huggingface_hub import notebook_login
notebook_login()
3. Loading the Necessary Libraries and Models
We'll load the dataset, train a tokenizer on it, and then define the Llama model.
This part is pretty complicated, so stay with me.
from datasets import load_dataset
dataset = load_dataset("your_dataset_name", split="train") # load the dataset
Here, we'll get the corpus to pass to the tokenizer:
def get_training_corpus():
    # Yield the text column in batches of 1,000 rows
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
training_corpus = get_training_corpus()
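To see what the generator yields, here's a small sketch that runs the same slicing logic over a plain Python list. The list is a stand-in for the dataset's text column (an assumption for illustration; a real dataset slice returns a dict of columns, which is why the tutorial indexes ["text"]). It also shows that a generator is used up after one pass:

```python
def batched_texts(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]

# Stand-in for dataset["text"]: 2,500 short documents.
texts = [f"document {i}" for i in range(2500)]

corpus = batched_texts(texts)
batch_sizes = [len(batch) for batch in corpus]
print(batch_sizes)  # [1000, 1000, 500]

# The generator is now exhausted; iterating again yields nothing.
leftover = list(corpus)
print(len(leftover))  # 0
```

So if you want to retrain the tokenizer with different settings, call get_training_corpus() again to get a fresh generator.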
The base tokenizer is up to you. I'm using a blank one, but a lot of people opt for a different starting point, such as gpt2.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    training_corpus,
    vocab_size=3200,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "<|user|>", "<|bot|>", "<|end|>"] # you can pick the last two or three, as you'll see next
)
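If you're wondering what vocab_size and min_frequency actually control, here's a toy sketch of the BPE merge loop. This is not the library's implementation (that one is written in Rust and works on bytes); it's a simplified illustration on a made-up word-frequency table, and the function names are hypothetical:

```python
from collections import Counter

def most_frequent_pair(words, min_frequency=2):
    """Return the most common adjacent symbol pair, or None if it is too rare."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    if not pair_counts:
        return None
    pair, count = pair_counts.most_common(1)[0]
    return pair if count >= min_frequency else None

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> number of occurrences.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # a real tokenizer keeps merging until vocab_size is reached
    pair = most_frequent_pair(words)
    if pair is None:  # nothing left that occurs at least min_frequency times
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
```

Each merge adds one token to the vocabulary, so vocab_size caps how many merges happen, and min_frequency stops the loop early when the remaining pairs are too rare to be worth a token.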
Next, we'll define the tokenizer special tokens and chat template.
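from transformers import PreTrainedTokenizerFast
# Wrap the trained tokenizer so it exposes the Transformers API (dict-style special tokens, chat templates, and so on)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer._tokenizer)
special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"] # same here
}
tokenizer.add_special_tokens(special_tokens)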
tokenizer.user_token_id = tokenizer.convert_tokens_to_ids("<|user|>") # here
tokenizer.assistant_token_id = tokenizer.convert_tokens_to_ids("<|bot|>") # too
chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '<|end|>\n' }}{% elif message['role'] == 'assistant' %}{{ '<|bot|>\n' + message['content'] + '<|end|>\n' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}{{ eos_token }}" # this is where you define the chat template, so you can go crazy here. Something a lot of people do, for whatever reason, is add seemingly random newline characters
tokenizer.chat_template = chat_template
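The template is hard to read as a one-liner, so here's a plain-Python sketch of what it should produce. It mirrors the Jinja logic by hand (the function name render_chat is made up, and it raises ValueError where the template calls raise_exception), so treat it as a reading aid rather than a replacement for apply_chat_template:

```python
BOS, EOS = "<s>", "</s>"

def render_chat(messages):
    """Plain-Python rendition of the chat template's output."""
    parts = [BOS]
    for index, message in enumerate(messages):
        role = message["role"]
        # Even positions must be user turns; odd positions must not be.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        if role == "user":
            parts.append("<|user|>\n" + message["content"] + "<|end|>\n")
        elif role == "assistant":
            parts.append("<|bot|>\n" + message["content"] + "<|end|>\n")
        else:
            raise ValueError("Only user and assistant roles are supported!")
    parts.append(EOS)
    return "".join(parts)

rendered = render_chat([
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Due to rayleigh scattering."},
])
print(rendered)
```

Each turn opens with its role tag on its own line and closes with <|end|>, and the whole conversation is wrapped in the bos and eos tokens.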
print(tokenizer.apply_chat_template([{"role": "user", "content": "Why is the sky blue?"}, {"role": "assistant", "content": "Due to rayleigh scattering."}], tokenize=False)) # test to see if the chat template worked
Now, finally, we'll define the model.
from transformers import LlamaConfig, LlamaForCausalLM
config = LlamaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rms_norm_eps=1e-6,
    initializer_range=0.02,
    use_cache=True,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    tie_word_embeddings=False,
)
model = LlamaForCausalLM(config)
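To get a feel for how big this model is, here's a back-of-the-envelope parameter count. It's a sketch under a few assumptions: the tokenizer ends up with exactly 3,200 tokens (a small corpus may give fewer), the projections have no bias terms, and the number of key/value heads equals the number of attention heads, which is the default when it isn't set. The function name is made up:

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size, num_layers, tied_embeddings=False):
    """Estimate trainable parameters for a Llama-style decoder without biases."""
    embed = vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size      # q, k, v and o projections
    mlp = 3 * hidden_size * intermediate_size      # gate, up and down projections
    norms = 2 * hidden_size                        # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    final_norm = hidden_size
    lm_head = 0 if tied_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head

hidden_size, num_heads = 512, 8
# The hidden size has to split evenly across the attention heads.
assert hidden_size % num_heads == 0
head_dim = hidden_size // num_heads

total = llama_param_count(vocab_size=3200, hidden_size=512, intermediate_size=1024, num_layers=8)
print(f"head_dim={head_dim}, parameters={total:,}")
```

That works out to roughly 24 million parameters. Once the model exists, you can compare against the real number with sum(p.numel() for p in model.parameters()).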
4. Formatting the Dataset
We'll define a function to format the prompts in the dataset and map the dataset:
def format_prompts(examples):
    """
    Define the format for your dataset
    This function should return a dictionary with a 'text' key containing the formatted prompts.
    """
    pass
dataset = dataset.map(format_prompts, batched=True)
print(dataset['text'][2]) # Check to see if the fields were formatted correctly
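What goes inside format_prompts depends entirely on your dataset, but here's one possible shape as a sketch. It assumes hypothetical "prompt" and "response" columns, and it takes the rendering function as an argument so the example runs without a tokenizer; in the tutorial you'd pass something like lambda m: tokenizer.apply_chat_template(m, tokenize=False) instead of the stand-in:

```python
def to_messages(prompt, response):
    """Build one two-turn conversation in the chat-message format."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]

def format_prompts(examples, render):
    """Return a 'text' column with one rendered conversation per row.

    `examples` is a batch: a dict mapping column names to lists.
    `render` turns a list of messages into a single string.
    """
    texts = [
        render(to_messages(prompt, response))
        for prompt, response in zip(examples["prompt"], examples["response"])
    ]
    return {"text": texts}

def fake_render(messages):
    """Stand-in for the chat template, only here to keep the sketch runnable."""
    return " | ".join(f"{m['role']}: {m['content']}" for m in messages)

batch = {
    "prompt": ["Why is the sky blue?", "What is 2 + 2?"],
    "response": ["Due to Rayleigh scattering.", "4"],
}
formatted = format_prompts(batch, fake_render)
print(formatted["text"][0])
```

With a real dataset, the extra argument can be supplied through map, for example dataset.map(format_prompts, batched=True, fn_kwargs={"render": ...}).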
5. Setting Up the Training Arguments
Define the training args:
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="your_output_dir",
    num_train_epochs=4, # replace this, depending on your dataset
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    optim="sgd" # sgd, my beloved
)
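To pick a sensible num_train_epochs, it helps to know how many optimizer steps these settings add up to. Here's a rough sketch of the arithmetic; the dataset size of 10,000 rows is a made-up example, the function name is hypothetical, and the exact count the trainer reports may differ slightly:

```python
import math

def estimate_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                             num_devices=1, gradient_accumulation_steps=1):
    """Approximate the total number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return steps_per_epoch * num_epochs

# Hypothetical dataset of 10,000 rows with the batch size and epochs used in this step.
total_steps = estimate_optimizer_steps(num_examples=10_000, per_device_batch_size=16, num_epochs=4)
print(total_steps)  # 2500
```

If that number looks tiny for your dataset, more epochs (or more data) is usually the first thing to try.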
6. Creating the Trainer
We'll create an instance of the SFTTrainer from the trl library:
from trl import SFTTrainer
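# Note: these keyword arguments follow the TRL API this tutorial was written against. Newer TRL releases move dataset_text_field and the sequence-length setting into SFTConfig and rename tokenizer to processing_class, so check your installed version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=512
)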
7. Training the Model
Finally, we'll start the training process:
trainer.train()
8. Pushing the Trained Model to Hugging Face Hub
After the training is complete, you can push the trained model to the Hugging Face Hub using the following command:
trainer.push_to_hub()
This will upload the model to your Hugging Face Hub account, making it available for future use or sharing.
That's it!