
The extended Mistral vocabulary contains only the intersection with the Ministry of Education's list of 4,808 commonly used characters.
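The filtering step above can be sketched as a simple set intersection. The character sets below are tiny stand-ins for the real MOE list and the candidate extension vocabulary, chosen only to illustrate the idea:

```python
# Sketch of the vocabulary-filtering idea: keep only candidate
# extension characters that also appear in the common-character list.
# Both sets are tiny stand-ins, not the actual data.
moe_common = set("今天氣真好")          # stand-in for the MOE 4,808-character list
candidates = set("今天氣真好貓狗")      # stand-in for all candidate extension characters

# The extension vocabulary keeps only the intersection.
extension_vocab = sorted(candidates & moe_common)
print(extension_vocab)
```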

Twenty-five dummy tokens were appended at the end to pad the vocabulary to a multiple of 64, which improves training efficiency; they also reserve space for future special tokens.

  • Removed the dummy tokens
  • Added <|func_start|> and <|func_end|>
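The pad-to-a-multiple-of-64 step is plain arithmetic; a minimal sketch follows (the function name is illustrative, not part of this repo):

```python
def dummy_tokens_needed(vocab_size: int, multiple: int = 64) -> int:
    """Number of dummy tokens to append so vocab_size becomes a multiple of `multiple`."""
    remainder = vocab_size % multiple
    return 0 if remainder == 0 else multiple - remainder

print(dummy_tokens_needed(1000))  # needs 24 dummy tokens to reach 1024
print(dummy_tokens_needed(1024))  # already aligned: 0
```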
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'ocisd4/mistral_tokenizer_ext',
    pad_token='<unk>',
    add_bos_token=True,
    add_eos_token=False,
)

print('vocab size:', tokenizer.vocab_size)
# vocab size: 35686

print(tokenizer.tokenize('今天天氣真好!'))
# ['▁', '今', '天', '天', '氣', '真', '好', '!']

print(tokenizer.encode('今天天氣真好!'))
# [1, 28705, 30316, 29354, 29354, 32004, 29974, 29530, 29267]

print(tokenizer.decode(tokenizer.encode('今天天氣真好!')))
# <s> 今天天氣真好!
```