ArlowGPT Tokenizer

This repository contains a custom-trained BPE tokenizer for ArlowGPT, created by Yuchen Xie.

Tokenizer Details

  • Type: BPE (Byte-Pair Encoding)
  • Vocabulary Size: 131,072 tokens
  • Special Tokens:
    • Start of Text: <|startoftext|>
    • End of Text: <|endoftext|>
    • Padding: <|pad|>
    • Unknown: <|unk|>
    • Mask: <|mask|>
    • Message Start: <|im_start|>
    • Message End: <|im_end|>
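
The message start and end tokens suggest a ChatML-style conversation layout, but this repository does not document the exact template. The sketch below assumes the common `<|im_start|>role` / content / `<|im_end|>` convention; `format_chat` is a hypothetical helper, not part of the tokenizer.

```python
# Special tokens as listed above.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chat(messages):
    """Render messages in a ChatML-style layout.

    The layout is an assumed convention, not confirmed by this repository.
    """
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


prompt = format_chat([
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

If the model paired with this tokenizer was trained with a different template, that template should be used instead.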

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
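
Once loaded, the tokenizer can be used to encode text into token IDs and decode it back. The following is a minimal sketch; `round_trip` and `load_tokenizer` are hypothetical helper names, and loading requires the `transformers` package plus network access to the Hub.

```python
def round_trip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids)


def load_tokenizer():
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
```

For example, `ids, text = round_trip(load_tokenizer(), "Hello, world!")` returns the token IDs together with the decoded string. The decoded string may not match the input character for character, depending on how the tokenizer normalizes text.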

Training Details

This tokenizer was trained on 10B randomly shuffled GPT-2 tokens using a custom script written by Yuchen Xie. It is compatible with the AutoTokenizer class from Hugging Face Transformers.
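
The custom training script is not included here. As a rough illustration of what BPE training does in general, the toy sketch below learns merges on a tiny character-level corpus. It is not the author's script: production tokenizers typically work on bytes, apply pre-tokenization, and reserve special tokens, all of which this example omits. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Each merge adds one entry to the vocabulary, so a 131,072-token vocabulary corresponds to running this kind of loop until the base symbols, special tokens, and learned merges together reach that size.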
