Train a Llama model from scratch

Community Article Published July 29, 2024

This script is deprecated! Many updates to transformers have happened since its release!

In this tutorial, we'll walk through the process of training a language model using the Llama model architecture and the Transformers library.

1. Installing the Required Libraries

We'll start by installing the necessary libraries using pip:

!pip install -q datasets accelerate evaluate trl transformers jinja2

2. Logging into Hugging Face Hub

Next, we'll log into the Hugging Face Hub to access the required models and datasets:

from huggingface_hub import notebook_login

notebook_login()

3. Loading the Necessary Libraries and Models

We'll load the dataset, train a tokenizer on it, and then define a Llama model from a fresh configuration. This part is pretty complicated, so stay with me.

from datasets import load_dataset

dataset = load_dataset("your_dataset_name", split="train") # load the dataset

Here, we'll build the corpus iterator to pass to the tokenizer:

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

training_corpus = get_training_corpus()
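
The generator above hands the tokenizer trainer the corpus in batches of 1,000 texts instead of one huge list. Here is a minimal sketch of the same pattern, assuming a plain Python list as a stand-in for the dataset (the names `texts` and `batch_iterator` are illustrative and not part of the tutorial). It also shows that a generator can only be consumed once.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive slices of `texts`, each at most `batch_size` long."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]


# Hypothetical stand-in corpus: 2,500 short strings.
texts = [f"example sentence {n}" for n in range(2500)]

corpus = batch_iterator(texts)
batch_sizes = [len(batch) for batch in corpus]
print(batch_sizes)

# A generator is exhausted after one pass, so a second pass sees nothing.
leftover = list(corpus)
print(len(leftover))
```

If you need to iterate over the corpus a second time (for example, to retrain the tokenizer with a different vocabulary size), call the function again to get a fresh generator.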

The base tokenizer is up to you. I'm using a blank one, but a lot of people start from an existing one, such as gpt2.

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    training_corpus,
    vocab_size=3200,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "<|user|>", "<|bot|>", "<|end|>"] # you can pick the last two or three, as you'll see next
)
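
To get a feel for what `vocab_size` and `min_frequency` control, here is a toy byte-pair-encoding trainer in plain Python. It is only a sketch under simplifying assumptions: it works on characters rather than bytes, splits on whitespace, and is not how the `tokenizers` library implements training. The function names are made up for this illustration.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(corpus, vocab_size, min_frequency):
    """Merge the most frequent pair until the vocab is full or pairs get too rare."""
    words = dict(Counter(tuple(w) for line in corpus for w in line.split()))
    vocab = sorted({ch for word in words for ch in word})
    while len(vocab) < vocab_size:
        pairs = count_pairs(words)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are rarer than min_frequency
        words = merge_pair(words, pair)
        vocab.append(pair[0] + pair[1])
    return vocab


toy_corpus = ["low low low lower lowest"]
vocab = train_toy_bpe(toy_corpus, vocab_size=20, min_frequency=2)
print(vocab)
```

Training stops early here because no remaining pair occurs at least twice, so the vocabulary ends up smaller than the requested `vocab_size`. The same can happen with a real tokenizer on a small dataset.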

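Next, we'll wrap the trained tokenizer in PreTrainedTokenizerFast, then define its special tokens and chat template.

from transformers import PreTrainedTokenizerFast

# ByteLevelBPETokenizer comes from the tokenizers library; wrap its underlying
# Tokenizer object so we get the Transformers API (dict-style add_special_tokens,
# pad_token_id, chat templates, ...)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer._tokenizer)

special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"] # same here
}
tokenizer.add_special_tokens(special_tokens)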

tokenizer.user_token_id = tokenizer.convert_tokens_to_ids("<|user|>") # here
tokenizer.assistant_token_id = tokenizer.convert_tokens_to_ids("<|bot|>") # too

chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '<|user|>\n' + message['content'] + '<|end|>\n' }}{% elif message['role'] == 'assistant' %}{{ '<|bot|>\n' + message['content'] + '<|end|>\n' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}{{ eos_token }}" # this is where you define the chat template, so you can go crazy here. Something a lot of people do, for whatever reason, is add seemingly random newline characters

tokenizer.chat_template = chat_template
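
The Jinja template is dense, so it may help to see the same logic in plain Python. The function below is a hand-written mirror of the template for illustration only (`render_chat` is a made-up name, and Transformers renders the real template through Jinja, not through code like this). It assumes the `<s>` and `</s>` tokens defined earlier.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python mirror of the tutorial's Jinja chat template."""
    parts = [bos_token]
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        # Even positions must be user turns; odd positions must not be.
        if (role == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        if role == "user":
            parts.append("<|user|>\n" + content + "<|end|>\n")
        elif role == "assistant":
            parts.append("<|bot|>\n" + content + "<|end|>\n")
        else:
            raise ValueError("Only user and assistant roles are supported!")
    parts.append(eos_token)
    return "".join(parts)


rendered = render_chat(
    [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Due to Rayleigh scattering."},
    ]
)
print(rendered)
```

Each turn is wrapped in a role marker and `<|end|>`, and the whole conversation sits between the beginning-of-sequence and end-of-sequence tokens.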

Now, finally, we'll define the model.

from transformers import LlamaConfig, LlamaForCausalLM

print(tokenizer.apply_chat_template([{"role": "user", "content": "Why is the sky blue?"}, {"role": "assistant", "content": "Due to Rayleigh scattering."}], tokenize=False)) # test to see if the chat template worked

config = LlamaConfig(
    vocab_size=len(tokenizer), # len() also counts any tokens added after training
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rms_norm_eps=1e-6,
    initializer_range=0.02,
    use_cache=True,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    tie_word_embeddings=False,
)

model = LlamaForCausalLM(config)
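
It is worth knowing roughly how large this configuration is before training. The sketch below counts parameters by hand, assuming the standard Llama layout: bias-free q/k/v/o and gate/up/down projections, two RMSNorm weights per layer plus a final one, and as many key/value heads as attention heads. It also assumes the tokenizer actually reached the requested 3,200-token vocabulary. `llama_param_count` is a hypothetical helper, not a Transformers function.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_hidden_layers, tie_word_embeddings=False):
    """Estimate the parameter count of a Llama-style decoder."""
    embed = vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size   # q, k, v and o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up and down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    final_norm = hidden_size
    # With untied embeddings, the output head is a second vocab x hidden matrix.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + num_hidden_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=3200,
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
)
print(f"{total:,} parameters")
```

With these settings the estimate comes to about 24 million parameters. You can compare it against `sum(p.numel() for p in model.parameters())` on the real model.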

4. Formatting the Dataset

We'll define a function to format the prompts in the dataset and map the dataset:

def format_prompts(examples):
    """
    Define the format for your dataset
    This function should return a dictionary with a 'text' key containing the formatted prompts.
    """
    pass

dataset = dataset.map(format_prompts, batched=True)

print(dataset['text'][2]) # Check to see if the fields were formatted correctly
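
As one possible way to fill in `format_prompts`, here is a sketch that assumes the dataset has `prompt` and `response` columns. Those column names are hypothetical, so adapt them to your data. It writes the chat format by hand so the example runs on its own; in the tutorial's pipeline you would more likely build a message list and call `tokenizer.apply_chat_template(..., tokenize=False)`.

```python
def format_prompts(examples):
    """Turn a batch of prompt/response pairs into chat-formatted text."""
    texts = []
    # With batched=True, `examples` maps each column name to a list of values.
    for prompt, response in zip(examples["prompt"], examples["response"]):
        texts.append(
            f"<s><|user|>\n{prompt}<|end|>\n<|bot|>\n{response}<|end|>\n</s>"
        )
    return {"text": texts}


# Hypothetical batch in the shape that dataset.map(..., batched=True) passes in.
batch = {
    "prompt": ["Why is the sky blue?", "What is 2 + 2?"],
    "response": ["Due to Rayleigh scattering.", "4"],
}
formatted = format_prompts(batch)
print(formatted["text"][1])
```

The function has to return a dictionary of lists with the same length as the input batch, which is what lets `dataset.map` add the `text` column.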

5. Setting Up the Training Arguments

Define the training args:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="your_output_dir",
    num_train_epochs=4, # replace this, depending on your dataset
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    optim="sgd" # sgd, my beloved
)
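
These arguments decide how many optimizer updates the run will perform. Here is a rough back-of-the-envelope calculation, assuming a hypothetical dataset of 10,000 examples, a single device, and no gradient accumulation. The helper name is made up, and the exact count the Trainer reports can differ with other settings or library versions.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, gradient_accumulation_steps=1):
    """Approximate the number of optimizer updates in a training run."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs


steps = total_optimizer_steps(
    num_examples=10_000,       # hypothetical dataset size
    per_device_batch_size=16,  # matches per_device_train_batch_size above
    num_epochs=4,              # matches num_train_epochs above
)
print(steps)
```

Knowing the step count up front makes it easier to judge whether the number of epochs is reasonable for your dataset.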

6. Creating the Trainer

We'll create an instance of the SFTTrainer from the trl library:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=512
)

7. Training the Model

Finally, we'll start the training process:

trainer.train()

8. Pushing the Trained Model to Hugging Face Hub

After the training is complete, you can push the trained model to the Hugging Face Hub using the following command:

trainer.push_to_hub()

This will upload the model to your Hugging Face Hub account, making it available for future use or sharing.

That's it!

Community

yaanhaan, Feb 22 (edited Feb 22)

Dear Roggendorff,
thank you for sharing :),

I got an error, TypeError: Input must be a List[Union[str, AddedToken]], when I called add_special_tokens.

I just noticed that the tokenizer is an instance of the class ByteLevelBPETokenizer, whose add_special_tokens method expects an argument of type List[Union[str, AddedToken]].
Is this an update of the API? Or should we directly use transformers.LlamaTokenizer instead, which inherits from PreTrainedTokenizer?

Specifically, something like this:

from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)
special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"] # same here
}
new_tokenizer.add_special_tokens(special_tokens)
new_tokenizer.save_pretrained("./tokenizer")

With this, the type of new_tokenizer is transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast.
Is that correct?

thanks
Han

nroggendorff, article author, Feb 24 (edited Feb 24)

Dear Han,
Apologies for the inconvenience.

Indeed, bpetokenizers were deprecated quite some time ago.

Perhaps a look at the documentation, or my GitHub may solve your problem?

you’re welcome
Noa
