Improve language tag (#1)
Improve language tag (2b61c26317bf1af7936b746d95cba077900f65ec)
Co-authored-by: Loïck BOURDOIS <[email protected]>
README.md
CHANGED
@@ -1,38 +1,50 @@
---
license: apache-2.0
datasets:
- NewEden/Roleplay-Logs-Sharegpt-Ngram-cleaned
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: text-generation
tags:
- unsloth
- dialogue
---

ChatML as always. Full-precision this time. Quants will come later.
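
As a quick reference, here is a minimal sketch of chatting with the full-precision weights through the tokenizer's ChatML chat template using `transformers`. The repo ID is a placeholder (substitute this repository's actual ID), and the sampling settings are illustrative, not recommendations.

```python
# Minimal sketch, assuming the tokenizer ships the standard ChatML chat template.
# "your-org/this-model" is a placeholder, not this repository's real ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/this-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # full-precision weights, no quantization
    device_map="auto",
)

# apply_chat_template wraps each turn in ChatML's <|im_start|>role ... <|im_end|> markers.
messages = [{"role": "user", "content": "hey, what have you been up to lately?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=128, do_sample=True, temperature=0.8
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```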

An experiment in making causal models more conversational. Many of them can already chat, but they suffer from problems like occasional dullness, incoherence, and verbatim repetition. This model DOES NOT follow assistant-style instructions and IS NOT INTENDED TO.

## Findings

The default ChatML template causes the model to sometimes identify as an AI assistant, which we consider undesirable. This is probably due to the assistant/user/system markers. Future iterations will likely use our own format.
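
For context, the standard ChatML wrapping looks like the following; the explicit `assistant` role name is what we suspect nudges the model toward an assistant identity:

```
<|im_start|>system
{optional system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
{reply}<|im_end|>
```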

We fine-tuned the Qwen2.5-7B causal model as a 32-rank LoRA on the [Roleplay-Logs-Sharegpt-Ngram-Cleaned](https://huggingface.co/datasets/NewEden/Roleplay-Logs-Sharegpt-Ngram-cleaned) dataset by NewEden, specifically the first 500 rows, for 3 epochs. Despite the name, it includes non-RP conversation as well.
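
The setup roughly corresponds to the sketch below. The rank, row count, and epoch count come from the description above; everything else (sequence length, alpha, target modules, and the use of Unsloth's high-level API) is an assumption, not the exact training script.

```python
# Rough sketch of the setup described above, not the exact training script.
# r=32, the 500-row slice, and 3 epochs come from the card; other values are guesses.
from unsloth import FastLanguageModel
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-7B",
    max_seq_length=4096,      # assumption
    load_in_4bit=False,       # assumption
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # 32-rank LoRA
    lora_alpha=32,            # assumption
    lora_dropout=0.0,         # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption
)

# First 500 rows of the ShareGPT-format dataset; trained for 3 epochs.
dataset = load_dataset(
    "NewEden/Roleplay-Logs-Sharegpt-Ngram-cleaned", split="train[:500]"
)
# The actual trainer loop and remaining hyperparameters (learning rate, batch size,
# etc.) are not published here, so they are omitted from this sketch.
```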

This dataset probably isn't the best; it has a few problems:
- It's not the most coherent thing ever.
- It contains some examples of "brainrot" and nonsensical phrase repetition, which we don't see as bad in itself, but which seems to confuse the model a bit.
- It's still partially synthetic, based on [character.ai](https://character.ai) logs, so it's bound to contain some clichéd phrases from their model, which is not ideal. The goal is NOT to replicate the character.ai model, but to build a unique conversational model that is fun to interact with.

However:
- It also contains a lot of interesting conversational patterns which corporate instruct models would never spit out.
- After training, the model is usable and very fun to interact with. It still feels a bit undercooked, so we plan to address that.

We plan to keep this dataset in future iterations, though in moderation. The next iteration will also include dialogue scraped from Reddit, the [Discord-Data](https://www.kaggle.com/datasets/jef1056/discord-data) dataset, and probably other sources we find interesting.

We do not plan to include instructions or synthetic data from models like GPT-4 or Claude, as those have been fine-tuned for agreeability and a professional tone. Moreover, when prompted to write more casually, such models tend to stick too rigidly to the guidelines provided (when many are given), or to write in a stilted, cheesy, unnatural way (when the instructions are vague).

However, we do plan to experiment with instruction following in the future 😊