Update README.md
README.md
CHANGED
@@ -163,6 +163,8 @@ Operation completed successfully (ignore any 'segmentation fault' that follows!!
 
 **NOTE**: Due to the non-standard tokenizer, this needs the `--trust-remote-code` option.
 
+**NOTE**: I had to manually delete `"pad_token_id": 163839` from `config.json` to get it to match the tokeniser when used in `llama.cpp` as a draft model.
+
 ## 2. The following datasets were used to create a fine-tuning dataset of ~2.3B tokens:
 
 - [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
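
The added note describes removing the `"pad_token_id"` entry from `config.json` by hand. For anyone who prefers to script that step, here is a minimal sketch. It is not the author's method: the function name `drop_pad_token_id` and the path in the usage line are hypothetical, and it assumes `config.json` is a plain JSON object with the key at the top level.

```python
import json
from pathlib import Path


def drop_pad_token_id(config_path):
    """Remove the top-level "pad_token_id" key from a config.json, if present.

    Returns the removed value, or None if the key was not in the file.
    (Hypothetical helper; the README note describes a manual edit.)
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    if "pad_token_id" not in config:
        # Nothing to do; leave the file untouched.
        return None

    removed = config.pop("pad_token_id")
    # Rewrite the file with the key removed, keeping it human-readable.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return removed
```

Usage would look like `drop_pad_token_id("path/to/model/config.json")`, where the path is a placeholder. Since the file is rewritten in place, keeping a copy of the original `config.json` first is a sensible precaution.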