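The extended Mistral vocabulary contains only the characters that intersect with the Ministry of Education's list of 4,808 commonly used characters.
25 dummy tokens were appended at the end; padding the vocabulary size to a multiple of 64 can improve training efficiency.
The dummy tokens can also serve as reserved slots for future special tokens.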
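The padding arithmetic can be sketched as follows. This is a minimal illustration, not the script used to build this tokenizer: the helper names, the `<dummy_i>` naming scheme, and the base size of 35,000 are all hypothetical and chosen only to show the calculation.

```python
def dummy_tokens_needed(vocab_size: int, multiple: int = 64) -> int:
    """Return how many filler tokens bring vocab_size up to the next multiple."""
    # (-n) % m is the distance from n to the next multiple of m (0 if already aligned).
    return (-vocab_size) % multiple


def dummy_token_names(count: int) -> list[str]:
    """Build placeholder token strings (hypothetical naming scheme)."""
    return [f"<dummy_{i}>" for i in range(count)]


base_size = 35_000  # illustrative size, not this tokenizer's actual count
extra = dummy_tokens_needed(base_size)
padded_size = base_size + extra

print(extra, padded_size)            # 8 35008
print(dummy_token_names(extra)[:2])  # ['<dummy_0>', '<dummy_1>']
```

Placeholder entries of this kind can later be renamed to real special tokens without changing the size of the embedding matrix.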
- Removed the dummy tokens
- Added <|func_start|> and <|func_end|>
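from transformers import AutoTokenizer

# Load the extended tokenizer; <unk> is reused as the padding token.
tokenizer = AutoTokenizer.from_pretrained(
    'ocisd4/mistral_tokenizer_ext',
    pad_token='<unk>',
    add_bos_token=True,
    add_eos_token=False
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 35686

# Each Chinese character is split into its own token.
print(tokenizer.tokenize('今天天氣真好!'))
# ['▁', '今', '天', '天', '氣', '真', '好', '!']

# encode() prepends the BOS id (1) because add_bos_token=True.
print(tokenizer.encode('今天天氣真好!'))
# [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

# Decoding round-trips the text, with the BOS token shown as <s>.
print(tokenizer.decode(tokenizer.encode('今天天氣真好!')))
# <s> 今天天氣真好!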