Question and fine-tuning script

#1
by almugabo - opened

Hi NickyNicky,
Thanks for sharing this!
I have a question:
If you fine-tune the togethercomputer/LLaMA-2-7B-32K base model on a dataset with short contexts (short input lengths), would you expect it to perform well when given longer inputs?

(I am assuming you trained it on a dataset in which the input length was less than 3,000 tokens.)
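For what it's worth, this is the kind of quick check I use to see how long the inputs in a dataset actually are — just a rough sketch; the dataset name and text column here are placeholders:

```python
# Rough sketch: measure tokenized input lengths in a dataset.
# "your_dataset" and the "text" column are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
ds = load_dataset("your_dataset", split="train")

lengths = [len(tokenizer(ex["text"]).input_ids) for ex in ds]
print(f"max: {max(lengths)}, mean: {sum(lengths) / len(lengths):.0f}")
print(f"share over 3000 tokens: {sum(l > 3000 for l in lengths) / len(lengths):.2%}")
```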

Thanks in advance for your reply.

P.S.: also, would it be possible to share the script you used? (I guess QLoRA.)
I tried it on one RTX 4090 (24 GB) but got out-of-memory errors even with batch_size 1.
I then shortened the sequence length to 6k. It is still training, but I can already see a very strange loss curve :-(
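For context, this is roughly what my attempt looks like — a minimal QLoRA sketch using bitsandbytes 4-bit quantization with peft and trl's SFTTrainer (2023-era API); the dataset name, LoRA targets, and hyperparameters are placeholders, not values from your script:

```python
# Rough sketch of a 4-bit QLoRA run on a single 24 GB GPU.
# Dataset, LoRA targets, and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "togethercomputer/LLaMA-2-7B-32K"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="llama2-7b-32k-qlora",   # placeholder output path
    per_device_train_batch_size=1,      # even batch size 1 OOMs at full 32k on 24 GB
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,        # trades compute for memory
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=load_dataset("your_dataset", split="train"),  # placeholder dataset
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=6144,  # shortened from 32k to ~6k to fit in 24 GB
    tokenizer=tokenizer,
)
trainer.train()
```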

Update:
It did not work.
When I tried it on inputs longer than the maximum length I trained on, it gave nonsensical replies.
I guess one would need a bigger GPU to use the full context length (or at least something like 16k).
How much RAM did you use?

I saw in the thread below that the gentleman needed two A6000s (2 x 48 GB) to fine-tune xgen_7B_8k with QLoRA:
https://www.reddit.com/r/LocalLLaMA/comments/1546kiv/xgen_7b_8k_context_finetuned_on_guanaco/

credits to:

Values used for training:
per_device_train_batch_size=14
trust_remote_code=False

After training and merging the weights, you can enable flash attention.
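For reference, a minimal sketch of what merging the LoRA weights into the base model and then reloading with flash attention can look like with peft (adapter and output paths are placeholders; the flash-attention flag assumes a recent transformers version with the flash-attn package installed):

```python
# Sketch: merge a trained LoRA adapter into the base model, save the result,
# then reload the merged model with flash attention. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "llama2-7b-32k-qlora").merge_and_unload()
merged.save_pretrained("llama2-7b-32k-merged")

model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b-32k-merged",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn and a recent transformers
    device_map="auto",
)
```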

Thank you!
Great resources.
I will try your model to see how it behaves when given a long input (I see that philschmid's script uses max_seq_length = 2048).
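Roughly how I plan to test it — a sketch; the model path and the long document are placeholders:

```python
# Sketch: feed the model an input well past the fine-tuning length and
# check whether the reply is still coherent. Paths and content are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llama2-7b-32k-merged"  # placeholder: merged fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

long_document = open("long_document.txt").read()  # placeholder: ~8k-token document
prompt = f"{long_document}\n\nSummarize the document above in three sentences."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(f"input length: {inputs.input_ids.shape[1]} tokens")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```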

Yes, more tokens -> more training time.
