ArlowGPT Tokenizer
This repository contains a custom-trained BPE tokenizer for ArlowGPT, created by Yuchen Xie.
Tokenizer Details
- Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 131,072 tokens
- Special Tokens:
  - Start of Text: <|startoftext|>
  - End of Text: <|endoftext|>
  - Padding: <|pad|>
  - Unknown: <|unk|>
  - Mask: <|mask|>
  - Message Start: <|im_start|>
  - Message End: <|im_end|>
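The message start and end tokens suggest a ChatML-style conversation layout, although this repository does not document a chat template. The sketch below shows how such tokens are commonly used to wrap messages. The helper name format_chat, the role names, and the newline placement are all assumptions made for illustration.

```python
# Special token strings as listed above.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chat(messages):
    """Render (role, content) pairs in an assumed ChatML-style layout.

    The layout (role right after the start token, content after a newline)
    is an assumption; the repository does not specify a chat template.
    """
    parts = []
    for role, content in messages:
        parts.append(f"{IM_START}{role}\n{content}{IM_END}\n")
    return "".join(parts)


prompt = format_chat([("user", "Hello!")])
print(prompt)
```

If the model that uses this tokenizer was trained with a different layout, the string built here should be adjusted to match it.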
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
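Once loaded, the tokenizer can be checked with a simple encode/decode round trip. The following is a minimal sketch: the helper names round_trip and demo and the sample sentence are hypothetical, and demo needs the transformers package plus network access the first time it runs.

```python
def round_trip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids)


def demo():
    # Imported lazily so round_trip stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
    ids, text = round_trip(tokenizer, "Hello, ArlowGPT!")
    print("vocabulary size:", len(tokenizer))
    print("token IDs:", ids)
    print("decoded:", text)
```

Calling demo() prints the vocabulary size reported by the loaded tokenizer, which can be compared against the figure listed under Tokenizer Details.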
Training Details
This tokenizer was trained on 10B randomly shuffled GPT-2 tokens using a custom script written by Yuchen Xie. It is compatible with the Hugging Face Transformers AutoTokenizer class.
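For readers unfamiliar with BPE, the core training loop can be sketched in a few lines. This is not the author's training script, which is not included here; it is a character-level toy version under simplifying assumptions (whitespace pre-splitting, no byte-level handling, no special tokens), and the function names are illustrative.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges


merges = train_bpe("low low low lower lowest", 2)
print(merges)
```

Each merge adds one entry to the vocabulary, so a production tokenizer repeats this step until the target vocabulary size (here 131,072) is reached.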
Base model: yuchenxie/ArlowGPT-Tokenizer