Please Provide Chat Format and Temp Settings

#2 opened by MysteriousPlane

This is a fantastic model, thanks for doing this!

Unfortunately, after literally tinkering with it for hours, I'm unable to make it stop generating. It might work for the first couple of messages, but then it just keeps on generating until it hits the 2048-token limit.

Please, if you can, let me know the proper chat format and temp settings. I'm using Llama 3 Instruct as both the Instruct Template and Context Template in SillyTavern, but it's not working. The model I've downloaded is the Q4_K_M quant, run through KoboldCPP with Q8 KV cache. Temp is 0.7, Top K 40. I'm splitting 25 layers to my VRAM (16 GB) and the rest to my system RAM (64 GB).
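In case it helps, here's roughly what that setup looks like in code. Just a sketch assuming the llama-cpp-python bindings (as a stand-in for the KoboldCPP backend); the model path and context size are placeholders I made up.

```python
# Minimal sketch, assuming llama-cpp-python as a stand-in for the KoboldCPP
# backend; the model path and context size below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # placeholder: the Q4_K_M quant mentioned above
    n_gpu_layers=25,                 # 25 layers offloaded to the 16 GB GPU
    n_ctx=2048,                      # matches the 2048-token limit being hit
)

out = llm(
    "Once upon a time",
    max_tokens=256,
    temperature=0.7,
    top_k=40,
)
print(out["choices"][0]["text"])
```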

Tarek's Graveyard org

Hi thanks!

Unfortunately, this was done before I realized I could add the chat template to my config file. It is set up as a text completion model, which might explain the issues you are getting. Getting it to work for chat would mean playing with the configs, which you wouldn't be able to manipulate on the GGUF.
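For reference, this is the kind of turn layout a Llama 3 Instruct chat template would normally supply. Since the GGUF ships without an embedded template, the frontend has to build these markers itself. A rough sketch only, and the helper name is mine:

```python
# Sketch of the Llama 3 Instruct turn layout that a chat frontend (SillyTavern
# here) would normally build from the model's chat template.
def llama3_instruct_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```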

A workaround may be to add 'assistant' to your stop sequence, but that is all I can think of. Unfortunately, I only use text completion, not chat, so I am not 100% sure.
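To illustrate what that stop sequence actually does, here's a minimal sketch (plain Python, not any frontend's actual code): the reply gets truncated as soon as the banned string shows up.

```python
# Minimal sketch of a stop string: cut the generated text off at the first
# occurrence of any banned marker. Frontends apply the same idea via their
# stop-sequence settings.
def apply_stop_strings(text: str, stops=("assistant",)) -> str:
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

print(apply_stop_strings("Sure, here you go!assistant\nAnd another reply..."))
# -> "Sure, here you go!"
```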

As for the settings, I use (with KoboldCPP and SillyTavern; not sure if chat uses different samplers):

[Screenshot of sampler settings: HELP.png]

Thanks a lot for your detailed reply! I will use your suggestions, the screenshot helps a ton! I am using it as a text completion model haha. I do have Llam@ception 1.5, so I will download the version you're using to see if it's better. I will report back soon. :D

Here are my settings. At temp 1 it talks kind of flowery and medieval-like, and at temp 0.7 it gets really defensive while calling me defensive, and yaps a lot, hahah. Please let me know if I'm doing this right. Llam@ception 1.5 seems to be working; it's not generating nonstop until the token limit anymore.

[Two screenshots of SillyTavern sampler settings]

Tarek's Graveyard org

Awesome! Personally, the only samplers I ever mess around with are Temp (between 0.7 and 1.1) and Min P (between 0.01 and 0.05), depending on the model of course. The rest are usually safe to leave as is. As for the meme samplers like smoothing and XTC, I only use DRY, as you have it there.

As for what they do: lowering temp makes the model's replies more deterministic, while raising it makes the model more creative when it's choosing its next token. Min P, on the other hand, gets more creative the lower you go.
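If it helps to picture it, here's a rough sketch of what those two knobs do to the next-token distribution (plain numpy, not any particular backend's implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=0.9, min_p=0.03):
    # Temperature: divide logits before softmax. Lower values sharpen the
    # distribution (more deterministic), higher values flatten it (more creative).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Min P: drop every token whose probability is below min_p times the top
    # token's probability. Lower min_p keeps more candidates (more varied output);
    # higher min_p prunes harder.
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()

    return np.random.choice(len(probs), p=probs)
```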

Bonus: If you like this model you should try some of my mainline models, I'll link some below:
Tarek07/Legion-V2.1-LLaMa-70B
Tarek07/Dungeonmaster-V2.2-Expanded-LLaMa-70B
Tarek07/Dungeonmaster-V2.4-Expanded-LLaMa-70B

Thanks for reaching out and sharing! Have fun!

Thanks for getting back again! Makes sense, I'll play with Min P more now! I'll check your other models out as well; thanks for listing them. I actually came to your model because it's currently ranked on the UGI leaderboard (https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard) as the model with the highest "Willingness to answer" (9.5) among those with over 50% of total score. Anyways, feel free to close this thread if you want, as this is practically resolved: the model is able to stop when it wants to. Thanks again for making this model!

Tarek07 changed discussion status to closed
